Instructions to use SandLogicTechnologies/gemma-4-E4B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use SandLogicTechnologies/gemma-4-E4B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

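llm = Llama.from_pretrained(
	repo_id="SandLogicTechnologies/gemma-4-E4B-GGUF",
	# Load a model weights file, not the mmproj (multimodal projector) file.
	# The glob assumes the repo ships the IQ4_NL quantization described below.
	filename="*IQ4_NL.gguf",
)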

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
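
`create_chat_completion` returns an OpenAI-style response dictionary rather than a plain string. A minimal sketch of pulling the reply text out of it, assuming the usual `choices[0].message.content` layout. The helper name `extract_reply` and the hand-written sample dictionary are illustrative and not part of llama-cpp-python:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Content may be missing or None, so fall back to an empty string.
    return choices[0]["message"].get("content") or ""


# Hand-written stand-in for what llm.create_chat_completion(...) returns.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample_response))
```

With a loaded model, the same helper would be applied to the return value of the `llm.create_chat_completion(...)` call above.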

Local Apps

llama.cpp

How to use SandLogicTechnologies/gemma-4-E4B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Use Docker

docker model run hf.co/SandLogicTechnologies/gemma-4-E4B-GGUF:F16

vLLM

How to use SandLogicTechnologies/gemma-4-E4B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SandLogicTechnologies/gemma-4-E4B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SandLogicTechnologies/gemma-4-E4B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
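
The same request can be sent from Python with only the standard library. This is a sketch under the assumption that the server from the step above is listening on `http://localhost:8000/v1`; the function names `build_chat_request` and `chat` are illustrative, not part of vLLM:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("http://localhost:8000/v1",
#            "SandLogicTechnologies/gemma-4-E4B-GGUF",
#            "What is the capital of France?"))
```

Pointing `base_url` at `http://localhost:8080/v1` should work the same way against a `llama-server` instance, since both expose an OpenAI-compatible API.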

Use Docker

docker model run hf.co/SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Ollama
How to use SandLogicTechnologies/gemma-4-E4B-GGUF with Ollama:
```
ollama run hf.co/SandLogicTechnologies/gemma-4-E4B-GGUF:F16
```

Unsloth Studio

How to use SandLogicTechnologies/gemma-4-E4B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/gemma-4-E4B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/gemma-4-E4B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SandLogicTechnologies/gemma-4-E4B-GGUF to start chatting

Pi

How to use SandLogicTechnologies/gemma-4-E4B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "SandLogicTechnologies/gemma-4-E4B-GGUF:F16"
        }
      ]
    }
  }
}
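
If you prefer to generate that file rather than edit it by hand, the following sketch builds the same structure in Python and prints it as JSON. The function name `llama_cpp_provider` is hypothetical, and the default `base_url` assumes `llama-server` is running on its default port 8080:

```python
import json


def llama_cpp_provider(model_id: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """Build the provider block shown above for a single model id."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


config = llama_cpp_provider("SandLogicTechnologies/gemma-4-E4B-GGUF:F16")
print(json.dumps(config, indent=2))
```

The printed JSON can be saved as `~/.pi/agent/models.json`. If that file already defines other providers, merge the entry by hand instead of overwriting the file.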

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use SandLogicTechnologies/gemma-4-E4B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use SandLogicTechnologies/gemma-4-E4B-GGUF with Docker Model Runner:
```
docker model run hf.co/SandLogicTechnologies/gemma-4-E4B-GGUF:F16
```

Lemonade

How to use SandLogicTechnologies/gemma-4-E4B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull SandLogicTechnologies/gemma-4-E4B-GGUF:F16

Run and chat with the model

lemonade run user.gemma-4-E4B-GGUF-F16

List all available models

lemonade list

gemma-4-E4B

gemma-4-E4B is a multimodal model from the Gemma family, built to handle demanding reasoning and generation tasks across both visual and textual inputs. Relative to smaller models, its larger parameter count offers improved reasoning depth, stronger consistency across long contexts, and better performance on complex problem-solving workloads.

The model supports multimodal interactions, allowing it to process text, images, and structured content such as long-form documents. It is designed for advanced conversational systems, agentic pipelines, and applications that require higher accuracy and structured outputs.
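
As an illustration of a multimodal request, the sketch below builds a chat message that carries both text and an inline image in the OpenAI-style `image_url` content format. It assumes the model is served behind an OpenAI-compatible endpoint with its multimodal projector loaded; whether a given build of this repository accepts image input is not confirmed here. The helper name `image_message` and the placeholder bytes are illustrative:

```python
import base64
import json


def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build one user message carrying text plus an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real image read with open(path, "rb").read().
message = image_message("Describe this image.", b"\x89PNG placeholder")
payload = {"model": "SandLogicTechnologies/gemma-4-E4B-GGUF:F16", "messages": [message]}
print(json.dumps(payload)[:120])
```

The resulting `payload` is the JSON body that would be posted to the server's `/v1/chat/completions` endpoint.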

gemma-4-E4B is particularly suited for tasks that involve multi-step reasoning, technical workflows, and multilingual processing, while still maintaining practical efficiency for optimized deployments.

Model Overview

Model Name: gemma-4-E4B
Architecture: Decoder-only Transformer with multimodal extensions
Parameter Count: 4B effective parameters (the GGUF metadata for this repository reports 8B total parameters)
Context Window: 128K tokens
Modalities: Text, Image (multimodal input support)
Primary Languages: English (with multilingual generalization)
Developer: Google
License: Apache 2.0

Quantization Details

This repository provides GGUF quantized versions of the gemma-4-E4B model, optimized for efficient local inference with llama.cpp. The available I-quant (IQ) formats are described below.

Quantization Formats (I-Quants)

IQ3_M (3-bit Medium)

IQ3_M prioritizes extreme compression, enabling the model to run within very tight memory constraints while still preserving essential behavior.
It represents weights at approximately 3 bits per parameter and uses importance-aware scaling to retain the most impactful components.
This format is useful for experimental setups, edge environments, or scenarios where fitting the model is the primary constraint.
Due to the aggressive reduction in precision, performance on complex reasoning, long-context tasks, and structured outputs can degrade, and reconstruction overhead may influence runtime efficiency.
File size: 4.39 GB, approximately 68.7% smaller than the 16-bit version (14.02 GB).

IQ4_XS

IQ4_XS provides a compact yet capable configuration by combining 4-bit quantization with importance-driven weighting.
It maintains a practical balance between size and performance, making it suitable for a wide range of real-world applications.
This format handles conversational tasks, coding prompts, and moderate reasoning workloads effectively without requiring large memory budgets.
While internally more complex than traditional quantization, it delivers consistent generation performance once inference begins.
File size: 4.74 GB, approximately 66.2% smaller than the 16-bit version (14.02 GB).

IQ4_NL

IQ4_NL uses a non-linear quantization scheme: its 4-bit codes map to non-uniformly spaced values rather than an evenly spaced grid, which lets it follow the shape of the weight distribution more closely.
This results in improved fidelity for critical layers, particularly in scenarios involving structured reasoning and long-form generation.
It is well-suited for higher-precision workloads such as technical explanations, analytical outputs, and agentic task execution.
The increased representational quality comes with slightly higher computational cost and marginally larger model size compared to simpler formats.
File size: 4.85 GB, approximately 65.4% smaller than the 16-bit version (14.02 GB).
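
The percentages quoted for the three formats follow directly from the file sizes. The sketch below reproduces that arithmetic and also derives a rough file-level average of bits per weight, under the assumption that the 14.02 GB baseline stores every weight in 16 bits. The function names are illustrative:

```python
FP16_SIZE_GB = 14.02
QUANT_SIZES_GB = {"IQ3_M": 4.39, "IQ4_XS": 4.74, "IQ4_NL": 4.85}


def size_reduction_percent(quant_gb: float, baseline_gb: float = FP16_SIZE_GB) -> float:
    """Percentage by which the quantized file is smaller than the baseline."""
    return (1.0 - quant_gb / baseline_gb) * 100.0


def average_bits_per_weight(quant_gb: float, baseline_gb: float = FP16_SIZE_GB) -> float:
    """File-level average, assuming the baseline uses 16 bits per weight."""
    return 16.0 * quant_gb / baseline_gb


for name, size_gb in QUANT_SIZES_GB.items():
    print(
        f"{name}: {size_reduction_percent(size_gb):.1f}% smaller, "
        f"about {average_bits_per_weight(size_gb):.2f} bits per weight on average"
    )
```

The file-level average comes out higher than the nominal 3 or 4 bits of each format. One plausible reason, not confirmed by this repository, is that some tensors are stored at higher precision than the bulk of the weights.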

Training Overview

Pretraining

The model is trained on a large-scale dataset combining diverse textual corpora and multimodal data sources, enabling it to understand both language and visual information in a unified manner.

Training objectives include:

Cross-modal representation learning
Large-scale language modeling
Visual-text alignment
Contextual reasoning across long sequences

Alignment and Optimization

Post-training steps refine the model for real-world usability and instruction-following:

Instruction tuning for conversational tasks
Reinforcement learning and alignment techniques
Optimization for reasoning-heavy and agentic workflows
Enhanced multimodal grounding and response consistency

Core Capabilities

Deep multi-step reasoning: Handles complex chains of thought and maintains logical consistency across extended problem-solving tasks.
Agentic workflow support: Enables structured task execution, tool-like reasoning patterns, and multi-stage interactions.
High-fidelity multimodal understanding: Combines visual and textual signals to produce context-aware and precise outputs.
Advanced technical generation: Produces detailed, structured, and accurate responses for coding, analysis, and technical domains.
Long-context stability: Maintains coherence and relevance across extended documents and multi-turn conversations.
Multilingual robustness: Performs reliably across diverse languages and mixed-language inputs.

Example Usage

llama.cpp

./llama-cli \
  -m SandlogicTechnologies/gemma-4-E4B_IQ4_NL.gguf \
  -p "Explain the concept of attention in transformer models."
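
To drive the same command from a script, the argument list can be assembled in Python. This is a sketch: the helper name `llama_cli_command` is hypothetical, the model path is the local path used above, and `-n` caps the number of generated tokens:

```python
import shlex


def llama_cli_command(model_path: str, prompt: str, max_tokens: int = 256,
                      binary: str = "./llama-cli") -> list[str]:
    """Return the argument list for a one-shot llama-cli run."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(max_tokens)]


command = llama_cli_command(
    "SandlogicTechnologies/gemma-4-E4B_IQ4_NL.gguf",
    "Explain the concept of attention in transformer models.",
)
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or other special characters.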

Recommended Use Cases

Advanced reasoning systems and analytical pipelines
Agentic AI workflows and structured task automation
Technical research assistants and domain-specific copilots
Multimodal document and data interpretation
Code generation, debugging, and system design
Long-context knowledge extraction and summarization
Educational platforms for complex problem solving
High-quality conversational agents requiring consistency and depth

Acknowledgments

These quantized models are based on the original work by the Google development team.

Special thanks to:

The Google team for developing and releasing the gemma-4-E4B model.
Georgi Gerganov and the llama.cpp open-source community for enabling efficient quantization and inference via the GGUF format.

Contact

For inquiries or support, contact us at support@sandlogic.com or visit our website.

Downloads last month: 458

GGUF

Model size: 8B params
Architecture: gemma4
Hardware compatibility: 3-bit, 4-bit

Model tree for SandLogicTechnologies/gemma-4-E4B-GGUF

Base model: google/gemma-4-E4B
Quantized: 27 models, including this one