---
license: cc-by-nc-sa-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- patent
- embeddings
- mteb
language:
- en
pipeline_tag: sentence-similarity
---

# patembed-base_small

This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.

**Note:** This model uses task-specific instruction prompts during inference for optimal performance; a prompt usage example is shown in the Usage section below.

## Model Details

- **Model Type**: Sentence Transformer
- **Base Architecture**: Distilled from patembed-large using layers {0,3,6,9,12,15,18,21} (see the sketch after this list)
- **Parameters**: 143M
- **Number of Layers**: 8
- **Hidden Size**: 1024
- **Embedding Dimension**: 512
- **Max Sequence Length**: 512 tokens
- **Language**: English
- **License**: CC BY-NC-SA 4.0

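For illustration, the layer-subset initialization can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' distillation code: it assumes a BERT-style encoder exposing an `encoder.layer` module list, and `datalyes/patembed-large` is assumed to be the teacher checkpoint id.

```python
import torch
from transformers import AutoModel

# Teacher checkpoint id is an assumption for illustration.
model = AutoModel.from_pretrained('datalyes/patembed-large')

# Keep every third encoder layer to form the 8-layer student skeleton
# (BERT-style `encoder.layer` attribute assumed).
keep = [0, 3, 6, 9, 12, 15, 18, 21]
model.encoder.layer = torch.nn.ModuleList([model.encoder.layer[i] for i in keep])
model.config.num_hidden_layers = len(keep)
```

The actual multi-task distillation training is described in the paper.
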
## Model Description

This is the memory-constrained deployment variant of the patembed family: it keeps the 1024 hidden size and projects the pooled output down to 512-dimensional embeddings.
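
The numbers above imply a module stack roughly like the following sentence-transformers sketch. This is an assumed reconstruction, not the shipped configuration (the pooling mode and the projection's activation are guesses; loading the checkpoint as shown under Usage restores the real modules):

```python
from sentence_transformers import SentenceTransformer, models

word = models.Transformer('datalyes/patembed-base_small', max_seq_length=512)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode='mean')  # 1024-dim pooled output
projection = models.Dense(in_features=1024, out_features=512)  # 1024 hidden -> 512-dim embedding
model = SentenceTransformer(modules=[word, pooling, projection])

print(model.get_sentence_embedding_dimension())  # 512
```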

This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('datalyes/patembed-base_small')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute cosine similarity between the two embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
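
Because the model was trained with task-specific instruction prompts (see the note above), a prompt can be selected at encode time via the `prompt_name` argument. The key `query` below is hypothetical; inspect `model.prompts` for the names this checkpoint actually defines:

```python
# List the named prompts shipped with the model, if any.
print(model.prompts)

# 'query' is a hypothetical prompt name used for illustration.
query_embedding = model.encode(
    "Method for reducing power consumption in mobile devices",
    prompt_name="query",
)
```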

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-base_small')
model = AutoModel.from_pretrained('datalyes/patembed-base_small')

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)
embeddings = mean_pooling(model_output, encoded['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
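
Note that this raw Transformers path reproduces only the encoder and mean pooling, so it yields 1024-dimensional vectors; the projection to the 512-dimensional embedding space (and any instruction prompt) is applied by the sentence-transformers pipeline. Prefer the `SentenceTransformer` loader above when you need embeddings in the advertised 512-dim space.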

### Patent Retrieval Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-base_small')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode query and candidates
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute cosine similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Rank candidates by similarity
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
```
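
For larger candidate sets, the manual ranking loop above can be replaced by the built-in `util.semantic_search` helper, continuing from the variables defined in the previous block:

```python
# Returns, per query, a list of {'corpus_id', 'score'} dicts sorted by cosine similarity.
hits = util.semantic_search(query_emb, candidate_embs, top_k=3)[0]
for hit in hits:
    print(f"Score: {hit['score']:.4f} - {candidates[hit['corpus_id']][:100]}...")
```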

## Intended Use

This model is designed for patent-specific tasks, including:
- Patent search and retrieval
- Prior art search
- Patent classification and clustering (see the sketch below)
- Technology landscape analysis
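
As a sketch of the clustering use case, the embeddings plug directly into standard algorithms; the scikit-learn example below is illustrative only and not from the paper:

```python
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-base_small')
embeddings = model.encode([
    "A power management system for portable electronic devices...",
    "A battery electrode material with improved energy density...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
])

# Group the patents into two technology clusters by embedding similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # cluster id per input text
```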

For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.

## Citation

If you use this model, please cite our paper:

```bibtex
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
  title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
  author={Iliass Ayaou and Denis Cavallucci},
  year={2025},
  eprint={2510.22264},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.22264}
}
```

**Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)

## License

This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.

**Key Terms:**
- ✅ You may use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute your contributions under the same license

For full license details, see: https://creativecommons.org/licenses/by-nc-sa/4.0/

## Contact

- **Authors**: Iliass Ayaou, Denis Cavallucci
- **Institution**: ICUBE Laboratory, INSA Strasbourg
- **GitHub**: [iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **HuggingFace**: [datalyes](https://huggingface.co/datalyes)