Expand Output after deploying it on SageMaker
I get back a response from the model, but it is not complete. I managed to deploy it on ml.g5.12xlarge following the instructions.
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
import json
import boto3

model = "OpenAssistant/falcon-40b-sft-mix-1226"
tokenizer = AutoTokenizer.from_pretrained(model)

# name of the deployed SageMaker endpoint
ENDPOINT_NAME = "huggingface-pytorch-tgi-inference-2023-06-14-22-44-39-458"
runtime = boto3.client('runtime.sagemaker')

prompt = "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"
input_data = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "temperature": 0.1,
        "include_prompt_in_result": False,
        "top_k": 10,
        "num_return_sequences": 10,
        "max_length": 10,
        # "eos_token_id": tokenizer.eos_token_id,
        "return_full_text": False,
    },
}

# send the JSON payload to the endpoint and decode the JSON response
response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='application/json',
    Body=json.dumps(input_data).encode('utf-8'),
)
response_json = json.loads(response['Body'].read().decode("utf-8"))
response_json
By "not complete", do you mean that it cuts off early? If so, the likely cause is the "max_length": 10 parameter you pass. That limits the generation to 10 tokens, which is very little. If you want a reasonably detailed answer, set it to at least 300. Keep in mind that this is a maximum, not an enforced length, so the answer can still be shorter.
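A minimal sketch of that change, reusing the endpoint name and prompt format from the question. The helper names `build_payload` and `invoke` are made up for illustration. One assumption to check: the endpoint name suggests a TGI container, and TGI names this limit `max_new_tokens`, so the sketch uses that key. If your container expects `max_length` as in the original call, swap the key.

```python
import json

# Endpoint name copied from the question; replace with your own.
ENDPOINT_NAME = "huggingface-pytorch-tgi-inference-2023-06-14-22-44-39-458"


def build_payload(prompt, max_new_tokens=300):
    """Build the request body with a larger generation budget (hypothetical helper)."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "temperature": 0.1,
            "top_k": 10,
            # Upper bound only: generation can still stop earlier at end-of-text.
            # Assumption: TGI-style key; use "max_length" if your container expects it.
            "max_new_tokens": max_new_tokens,
            "return_full_text": False,
        },
    }


def invoke(prompt, max_new_tokens=300):
    """Send the payload to the SageMaker endpoint and return the decoded JSON."""
    import boto3  # imported here so build_payload can be used without AWS set up

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(build_payload(prompt, max_new_tokens)).encode("utf-8"),
    )
    return json.loads(response["Body"].read().decode("utf-8"))
```

Calling `invoke("<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>")` should then allow up to 300 generated tokens instead of 10.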