Noice!
First of all, thanks for this release! Good mid-sized releases have been rare this year. I can't say I tested the base ByteDance model, as its instruction format was a real pain to get going on my own backend, but this one is definitely interesting. Thank you for using a less arcane format.
Due to RAM limitations I could only run it in IQ4_XS (16K context), but even at this low quant level, it's surprisingly good. It's decently uncensored; it might need some prompt nudging, but overall, refusals are rare even for obviously "wrong" questions. It did well on my personal test bench (web queries based on the user prompt, summarization, Q&A, structured output, haystack, decision tree, menu navigation, and finding the correct info in a confusing 16K prompt). I have yet to test the function calling stuff, but so far so good.
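For anyone wondering why IQ4_XS was the ceiling here, a rough back-of-the-envelope sketch follows. It assumes about 36 billion parameters (taken from the model name) and roughly 4.25 bits per weight for IQ4_XS (the nominal figure llama.cpp lists; real GGUF files mix tensor types and come out somewhat larger). It only counts the weights; the KV cache for the 16K context is extra and depends on architecture details not covered in this post.

```python
def quantized_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed values, not measurements: ~36e9 parameters, nominal bits per weight.
N_PARAMS = 36e9
for name, bpw in [("IQ4_XS", 4.25), ("FP16", 16.0)]:
    print(f"{name}: ~{quantized_weight_gib(N_PARAMS, bpw):.1f} GiB of weights")
```

The gap between the two lines is the whole reason a 4-bit quant fits on a machine where the full-precision weights would not.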
The CoT is occasionally a bit weird. It works just fine for academic/work/Q&A/..., but I've noticed that in more creative areas, it'd occasionally respond "as the persona" in the thinking tag (like it's speaking to me) and then reformulate the same thing in the final response. Not a big deal, didn't impede the model, but afaik it's the first time I've ever noticed this behavior in any model, so it's worth sharing.
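If anyone wants to check their own outputs for this, here is a minimal sketch of how I'd flag it. It assumes the reasoning is wrapped in `<think>...</think>` tags (adjust the pattern if your template differs), and the helper names `split_reasoning` and `overlap_ratio` are just made up for illustration. The similarity score is a crude character-level ratio from the standard library, not anything rigorous.

```python
import re
from difflib import SequenceMatcher

# Assumed tag format; change this if your chat template uses different markers.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def overlap_ratio(reasoning: str, answer: str) -> float:
    """Rough 0..1 similarity; high values suggest the answer restates the reasoning."""
    return SequenceMatcher(None, reasoning.lower(), answer.lower()).ratio()


# Made-up sample mimicking the "persona speaks inside the thinking tag" case.
sample = (
    "<think>Ah, traveler, the road north is closed.</think>"
    "Ah, traveler, the road north is closed, I fear."
)
reasoning, answer = split_reasoning(sample)
print(f"overlap: {overlap_ratio(reasoning, answer):.2f}")
```

A high ratio on creative prompts and a low one on Q&A prompts would match what I described above, though a proper check would need more than one sample.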
Cheers.