Different Input Lengths

by ShivanshMathur007 - opened Jan 12, 2024

Can anyone describe the difference between max_position_embeddings=512 (config.json), max_seq_length=128 (sentence_bert_config.json), and model_max_length=512 (tokenizer_config.json)?
Also, how can I set these values when using LangChain?

tomaarsen

Sentence Transformers org Jan 17, 2024

Hello!

The 512 values are defined by the MiniLM base model, whereas 128 is the maximum sequence length that was used when finetuning the model to be an embedding model. As a result, 128 is the recommended maximum sequence length: beyond it you'll get much worse embeddings. 512 is the hard limit: the base model only has position embeddings for 512 tokens, so beyond that the model will simply crash.
I'm not sure how to set these values in LangChain.
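On the Sentence Transformers side, the three limits can be inspected and the effective one changed roughly like this. This is a minimal sketch, not an official recipe: `clamp_max_seq_length` and `load_with_limit` are hypothetical helper names, the model name is whatever you pass in, and the code assumes the first module of the model is the usual Transformer wrapper around a BERT-style model.

```python
def clamp_max_seq_length(requested, max_position_embeddings=512, model_max_length=512):
    """Return the largest usable max_seq_length that does not exceed the hard limits.

    The defaults of 512 mirror the values quoted in this thread; other models differ.
    """
    return min(requested, max_position_embeddings, model_max_length)


def load_with_limit(model_name, requested_max_seq_length):
    """Hypothetical helper: load a model and set max_seq_length within the hard limits."""
    # Imported lazily so clamp_max_seq_length works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    # Hard limits from config.json and tokenizer_config.json.
    # Assumes model[0] is the Transformer module of a BERT-style model.
    position_limit = model[0].auto_model.config.max_position_embeddings
    tokenizer_limit = model.tokenizer.model_max_length

    # Soft limit from sentence_bert_config.json; longer inputs are truncated to it.
    print("default max_seq_length:", model.max_seq_length)

    model.max_seq_length = clamp_max_seq_length(
        requested_max_seq_length, position_limit, tokenizer_limit
    )
    return model
```

Raising `max_seq_length` above 128 this way avoids truncation, but as noted above the embeddings may get worse, since the model was finetuned on sequences of at most 128 tokens.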

Tom Aarsen
