# Qwen2.5-7B-Instruct - MLX mxfp4 Quantized
- Repository: johnlockejrr/Qwen2.5-7B-Instruct-mxfp4
- Base model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- Quantization: MLX mxfp4 (4-bit)
- Quantized by: johnlockejrr
- Framework: MLX + mlx-lm
- Quantization tool: https://github.com/EricFillion/quantize
## Model Summary
This repository contains an MLX-quantized version of Qwen2.5-7B-Instruct, optimized for Apple Silicon (M1/M2/M3/M4) devices.
The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 14-15 GB (FP16) to approximately 5-6 GB while maintaining strong instruction-following performance.
This quantized model is ideal for:
- local assistants
- offline workflows
- VS Code integration
- fast inference on Apple GPUs
- running large models on 8 GB, 16 GB, or 24 GB Apple Silicon machines
## Quantization Details
| Setting | Value |
|---|---|
| Quantization mode | mxfp4 |
| Bits per weight | 4 |
| Group size | 64 |
| Activation dtype | bfloat16 |
| Framework | MLX |
| Quantization tool | EricFillion/quantize |
Command used:
```shell
python3 quantize.py \
  --model_name Qwen/Qwen2.5-7B-Instruct \
  --save_model_path models/qwen2.5-7b-instruct-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64
```
Resulting model size: approximately 5-6 GB
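The reported size is consistent with 4-bit group quantization. As a back-of-the-envelope check (assumptions, not from the model card: an 8-bit shared scale per 64-element group, roughly 7.6B total parameters, and every weight quantized):

```python
# Rough size arithmetic for 4-bit group quantization (mxfp4-style).
# Assumed, not stated in the card: 8-bit shared scale per group,
# ~7.6e9 parameters, all weights quantized.

def effective_bits(bits=4, scale_bits=8, group_size=64):
    """Bits per weight including the per-group shared scale."""
    return bits + scale_bits / group_size

def payload_gb(n_params, bits):
    """Weight payload in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

n = 7.6e9
print(effective_bits())                  # 4.125 bits/weight
print(payload_gb(n, 16))                 # 15.2 GB in FP16/BF16
print(payload_gb(n, effective_bits()))   # ~3.9 GB quantized payload
```

The quantized payload alone comes out below the 5-6 GB figure; tensors typically kept in bfloat16 (embeddings, norms) plus file metadata account for the difference on disk.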
## Running the Model (MLX)
### CLI (mlx-lm)
```shell
mlx_lm.generate \
  --model johnlockejrr/qwen2.5-7b-instruct-mxfp4 \
  --prompt "Write a poem about the Fibonacci numbers." \
  --max-tokens 512
```
### Python API
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-7b-instruct-mxfp4")

prompt = "Explain recursion in simple terms."
output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)
```
### Chat Mode
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-7b-instruct-mxfp4")

messages = [
    {"role": "user", "content": "What is a binary search tree?"}
]
# Apply the model's chat template, then generate as usual.
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
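Under the hood, Qwen2.5's chat template wraps each message in ChatML markers before generation. A minimal illustrative sketch of the prompt string it produces (the authoritative template ships in `chat_template.jinja` and should always be applied via the tokenizer):

```python
# Illustrative ChatML-style prompt construction, as used by Qwen2.5.
# This is a sketch for understanding only; use the tokenizer's
# chat template for real inference.

def chatml_prompt(messages):
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

messages = [{"role": "user", "content": "What is a binary search tree?"}]
print(chatml_prompt(messages))
```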
## Performance (Mac Mini M4, 16 GB)
| Metric | Value |
|---|---|
| Generation speed | approximately 20-30 tokens/sec |
| Peak memory usage | approximately 5.3 GB |
| GPU | Apple M4 GPU |
| Framework | MLX |
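The throughput figure translates directly into wall-clock generation time (tokens divided by tokens/sec). A quick check at the measured rates:

```python
def gen_seconds(n_tokens, tokens_per_sec):
    """Wall-clock time to emit n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_sec

# At the measured 20-30 tok/s, a 512-token reply takes roughly:
print(gen_seconds(512, 20))  # 25.6 s (slow end)
print(gen_seconds(512, 30))  # ~17.1 s (fast end)
```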
## Repository Contents

```
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
```
## License
This model inherits the license of the original model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#license
Please review the license before using this model in commercial applications.
## Limitations and Bias
- The model may generate incorrect or insecure code.
- It may hallucinate APIs or functions.
- It may produce biased or harmful statements if prompted.
- It should not be used for production-critical code without human review.
## Acknowledgements
- Qwen Team for the original Qwen2.5-7B-Instruct model
- Apple MLX Team for the MLX framework
- Eric Fillion for the MLX quantization tool
- Hugging Face for hosting the model