Qwen2.5-7B-Instruct - MLX mxfp4 Quantized


Model Summary

This repository contains an MLX-quantized version of Qwen2.5-7B-Instruct, optimized for Apple Silicon (M1/M2/M3/M4) devices.
The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 14-15 GB (FP16) to approximately 5-6 GB while maintaining strong instruction-following performance.
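The rough arithmetic behind these memory figures can be sketched as a back-of-envelope estimate (not a measurement; the ~7.6B parameter count and the 8-bit per-group scale are illustrative assumptions, and the estimate covers weights only, not runtime buffers or any layers left unquantized):

```python
# Back-of-envelope weight-memory estimate for a ~7.6B-parameter model.
# Parameter count and scale-overhead figures are illustrative assumptions.

params = 7.6e9  # Qwen2.5-7B-Instruct has roughly 7.6B parameters

fp16_gb = params * 2 / 1e9  # 2 bytes per weight in FP16

# 4 bits per weight plus one assumed 8-bit shared scale per group of 64
bits_per_weight = 4 + 8 / 64
q4_gb = params * bits_per_weight / 8 / 1e9

print(f"FP16 weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{q4_gb:.1f} GB")
```

The 4-bit weight estimate comes out below the quoted 5-6 GB because the loaded model also includes runtime state and components that are not quantized to 4 bits.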

This quantized model is ideal for:

  • local assistants
  • offline workflows
  • VS Code integration
  • fast inference on Apple GPUs
  • running large models on 8 GB, 16 GB, or 24 GB Apple Silicon machines

Quantization Details

Setting             Value
------------------  --------------------
Quantization mode   mxfp4
Bits per weight     4
Group size          64
Activation dtype    bfloat16
Framework           MLX
Quantization tool   EricFillion/quantize

Command used:

python3 quantize.py \
  --model_name Qwen/Qwen2.5-7B-Instruct \
  --save_model_path models/qwen2.5-7b-instruct-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64

Resulting model size: approximately 5-6 GB


Running the Model (MLX)

CLI (mlx-lm)

mlx_lm.generate \
  --model johnlockejrr/qwen2.5-7b-instruct-mxfp4 \
  --prompt "Write a poem about the Fibonacci numbers." \
  --max-tokens 512

Python API

from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-7b-instruct-mxfp4")

prompt = "Explain recursion in simple terms."

output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)
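For longer responses, mlx-lm can also stream tokens as they are produced rather than returning the full string at once. A minimal sketch, assuming a recent mlx-lm where `stream_generate` yields response chunks with a `.text` field (this requires Apple Silicon and a local copy of the model):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("johnlockejrr/qwen2.5-7b-instruct-mxfp4")

prompt = "Explain recursion in simple terms."

# Print each chunk as it is generated instead of waiting for the full output
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=200):
    print(chunk.text, end="", flush=True)
print()
```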

Chat Mode

from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-7b-instruct-mxfp4")

messages = [
    {"role": "user", "content": "What is a binary search tree?"}
]

# Apply the model's chat template to turn the message list into a prompt
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=200)
print(response)

Performance (Mac Mini M4, 16 GB)

Metric             Value
-----------------  ------------------------------
Generation speed   approximately 20-30 tokens/sec
Peak memory usage  approximately 5.3 GB
GPU                Apple M4 GPU
Framework          MLX

Repository Contents

model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md

License

This model inherits the license of the original model:

Qwen2.5-7B-Instruct License: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#license

Please review the license before using this model in commercial applications.


Limitations and Bias

  • The model may generate incorrect or insecure code.
  • It may hallucinate APIs or functions.
  • It may produce biased or harmful statements if prompted.
  • It should not be used for production-critical code without human review.

Acknowledgements

  • Qwen Team for the original Qwen2.5-7B-Instruct model
  • Apple MLX Team for the MLX framework
  • Eric Fillion for the MLX quantization tool
  • Hugging Face for hosting the model