# Qwen2.5-7B-Instruct - MLX mxfp4 Quantized
- Repository: johnlockejrr/Qwen2.5-7B-Instruct-mxfp4
- Base model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- Quantization: MLX mxfp4 (4-bit)
- Quantized by: johnlockejrr
- Framework: MLX + mlx-lm
- Quantization tool: https://github.com/EricFillion/quantize
## Model Summary
This repository contains an MLX-quantized version of Qwen2.5-7B-Instruct, optimized for Apple Silicon (M1/M2/M3/M4) devices.
The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 14-15 GB (FP16) to approximately 5-6 GB while maintaining strong instruction-following performance.
This quantized model is ideal for:
- local assistants
- offline workflows
- VS Code integration
- fast inference on Apple GPUs
- running large models on 8 GB, 16 GB, or 24 GB Apple Silicon machines
## Quantization Details
| Setting | Value |
|---|---|
| Quantization mode | mxfp4 |
| Bits per weight | 4 |
| Group size | 64 |
| Activation dtype | bfloat16 |
| Framework | MLX |
| Quantization tool | EricFillion/quantize |
Command used:
```shell
python3 quantize.py \
  --model_name Qwen/Qwen2.5-7B-Instruct \
  --save_model_path models/qwen2.5-7b-instruct-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64
```
Resulting model size: approximately 5-6 GB
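The reported size is consistent with 4-bit group quantization. As a back-of-the-envelope check (assumptions, not from the model card: an 8-bit shared scale per 64-element group, roughly 7.6B total parameters, and every weight quantized):

```python
# Rough size arithmetic for 4-bit group quantization (mxfp4-style).
# Assumed, not stated in the card: 8-bit shared scale per group,
# ~7.6e9 parameters, all weights quantized.

def effective_bits(bits=4, scale_bits=8, group_size=64):
    """Bits per weight including the per-group shared scale."""
    return bits + scale_bits / group_size

def payload_gb(n_params, bits):
    """Weight payload in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

n = 7.6e9
print(effective_bits())                  # 4.125 bits/weight
print(payload_gb(n, 16))                 # 15.2 GB in FP16/BF16
print(payload_gb(n, effective_bits()))   # ~3.9 GB quantized payload
```

The quantized payload alone comes out below the 5-6 GB figure; tensors typically kept in bfloat16 (embeddings, norms) plus file metadata account for the difference on disk.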
## Running the Model (MLX)
### CLI (mlx-lm)
```shell
mlx_lm.generate \
  --model johnlockejrr/qwen2.5-7b-instruct-mxfp4 \
  --prompt "Write a poem about the Fibonacci numbers." \
  --max-tokens 512
```
### Python API
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-7b-instruct-mxfp4")

prompt = "Explain recursion in simple terms."
output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)
```
### Chat Mode
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-7b-instruct-mxfp4")

messages = [
    {"role": "user", "content": "What is a binary search tree?"}
]
# Apply the model's chat template, then generate as usual.
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
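Under the hood, Qwen2.5's chat template wraps each message in ChatML markers before generation. A minimal illustrative sketch of the prompt string it produces (the authoritative template ships in `chat_template.jinja` and should always be applied via the tokenizer):

```python
# Illustrative ChatML-style prompt construction, as used by Qwen2.5.
# This is a sketch for understanding only; use the tokenizer's
# chat template for real inference.

def chatml_prompt(messages):
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

messages = [{"role": "user", "content": "What is a binary search tree?"}]
print(chatml_prompt(messages))
```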
## Performance (Mac Mini M4, 16 GB)
| Metric | Value |
|---|---|
| Generation speed | approximately 20-30 tokens/sec |
| Peak memory usage | approximately 5.3 GB |
| GPU | Apple M4 GPU |
| Framework | MLX |
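The throughput figure translates directly into wall-clock generation time (tokens divided by tokens/sec). A quick check at the measured rates:

```python
def gen_seconds(n_tokens, tokens_per_sec):
    """Wall-clock time to emit n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_sec

# At the measured 20-30 tok/s, a 512-token reply takes roughly:
print(gen_seconds(512, 20))  # 25.6 s (slow end)
print(gen_seconds(512, 30))  # ~17.1 s (fast end)
```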
## Repository Contents

```
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
```
## License
This model inherits the license of the original model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#license
Please review the license before using this model in commercial applications.
## Limitations and Bias
- The model may generate incorrect or insecure code.
- It may hallucinate APIs or functions.
- It may produce biased or harmful statements if prompted.
- It should not be used for production-critical code without human review.
## Acknowledgements
- Qwen Team for the original Qwen2.5-7B-Instruct model
- Apple MLX Team for the MLX framework
- Eric Fillion for the MLX quantization tool
- Hugging Face for hosting the model