gemma-4-E4B
gemma-4-E4B is a multimodal model from the Gemma family, built to handle demanding reasoning and generation tasks across both visual and textual inputs. With a larger parameter count than the smaller variants in the family, it offers improved reasoning depth, stronger consistency across long contexts, and better performance on complex problem-solving workloads.
The model supports multimodal interactions, allowing it to process text, images, and structured content such as long-form documents. It is designed for advanced conversational systems, agentic pipelines, and applications that require higher accuracy and structured outputs.
gemma-4-E4B is particularly suited for tasks that involve multi-step reasoning, technical workflows, and multilingual processing, while still maintaining practical efficiency for optimized deployments.
Model Overview
- Model Name: gemma-4-E4B
- Architecture: Decoder-only Transformer with multimodal extensions
- Parameter Count: 4B parameters
- Context Window: 128K tokens
- Modalities: Text, Image (multimodal input support)
- Primary Languages: English (with multilingual generalization)
- Developer: Google
- License: Apache 2.0
Quantization Details
This repository provides several GGUF quantized versions of the gemma-4-E4B model, optimized for efficient local inference using llama.cpp. Below are the details of the available I-quant (IQ) formats.
Quantization Formats (I-Quants)
IQ3_M (3-bit Medium)
- IQ3_M prioritizes extreme compression, enabling the model to run within very tight memory constraints while still preserving essential behavior.
- It represents weights at approximately 3 bits per parameter and uses importance-aware scaling to retain the most impactful components.
- This format is useful for experimental setups, edge environments, or scenarios where fitting the model is the primary constraint.
- Due to the aggressive reduction in precision, performance on complex reasoning, long-context tasks, and structured outputs can degrade, and reconstruction overhead may influence runtime efficiency.
- File size: approx. 4.39 GB, about 68.7% smaller than the 16-bit version (14.02 GB)
IQ4_XS
- IQ4_XS provides a compact yet capable configuration by combining 4-bit quantization with importance-driven weighting.
- It maintains a practical balance between size and performance, making it suitable for a wide range of real-world applications.
- This format handles conversational tasks, coding prompts, and moderate reasoning workloads effectively without requiring large memory budgets.
- While internally more complex than traditional quantization, it delivers consistent generation performance once inference begins.
- File size: approx. 4.74 GB, about 66.2% smaller than the 16-bit version (14.02 GB)
IQ4_NL
- IQ4_NL uses a non-linear (non-uniformly spaced) set of 4-bit quantization levels, allowing it to better match the actual distribution of weight values.
- This results in improved fidelity for critical layers, particularly in scenarios involving structured reasoning and long-form generation.
- It is well-suited for higher-precision workloads such as technical explanations, analytical outputs, and agentic task execution.
- The increased representational quality comes with slightly higher computational cost and marginally larger model size compared to simpler formats.
- File size: approx. 4.85 GB, about 65.4% smaller than the 16-bit version (14.02 GB)
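The reduction percentages quoted for each format follow directly from the listed file sizes. The sketch below reproduces that arithmetic using only the sizes given in this section. The function names and the budget-based selector are illustrative assumptions, not part of llama.cpp or any other library.

```python
from typing import Optional

# File sizes in GB, as quoted in this model card.
FP16_SIZE_GB = 14.02
QUANT_SIZES_GB = {
    "IQ3_M": 4.39,
    "IQ4_XS": 4.74,
    "IQ4_NL": 4.85,
}


def size_reduction_percent(quant_gb: float, baseline_gb: float = FP16_SIZE_GB) -> float:
    """Return how much smaller the quantized file is than the baseline, in percent."""
    return (1.0 - quant_gb / baseline_gb) * 100.0


def largest_quant_within(budget_gb: float) -> Optional[str]:
    """Pick the largest listed format whose file size fits in budget_gb.

    Illustrative only: real memory use also includes the KV cache and
    runtime buffers, so leave headroom beyond the raw file size.
    """
    fitting = {name: gb for name, gb in QUANT_SIZES_GB.items() if gb <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None


if __name__ == "__main__":
    for name, gb in QUANT_SIZES_GB.items():
        pct = size_reduction_percent(gb)
        print(f"{name}: {gb:.2f} GB, {pct:.1f}% smaller than 16-bit")
```

File size is only a lower bound on the memory a run needs, so the selector is best treated as a first filter rather than a guarantee that a format will fit.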
Training Overview
Pretraining
The model is trained on a large-scale dataset combining diverse textual corpora and multimodal data sources, enabling it to understand both language and visual information in a unified manner.
Training objectives include:
- Cross-modal representation learning
- Large-scale language modeling
- Visual-text alignment
- Contextual reasoning across long sequences
Alignment and Optimization
Post-training steps refine the model for real-world usability and instruction-following:
- Instruction tuning for conversational tasks
- Reinforcement learning and alignment techniques
- Optimization for reasoning-heavy and agentic workflows
- Enhanced multimodal grounding and response consistency
Core Capabilities
- Deep multi-step reasoning: Handles complex chains of thought and maintains logical consistency across extended problem-solving tasks.
- Agentic workflow support: Enables structured task execution, tool-like reasoning patterns, and multi-stage interactions.
- High-fidelity multimodal understanding: Combines visual and textual signals to produce context-aware and precise outputs.
- Advanced technical generation: Produces detailed, structured, and accurate responses for coding, analysis, and technical domains.
- Long-context stability: Maintains coherence and relevance across extended documents and multi-turn conversations.
- Multilingual robustness: Performs reliably across diverse languages and mixed-language inputs.
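To make the multimodal capability more concrete, here is a minimal sketch of how a combined text-and-image request might be assembled. It assumes the model is served behind an OpenAI-compatible chat completions endpoint that accepts `image_url` content parts. Whether a given runtime accepts images for these GGUF files depends on its multimodal support, which this card does not specify. The function name and the `model` value are placeholders.

```python
import base64
import json
from typing import Any, Dict


def build_image_chat_payload(
    image_bytes: bytes,
    question: str,
    model: str = "gemma-4-E4B",  # placeholder; use whatever name your server expects
    mime_type: str = "image/png",
) -> Dict[str, Any]:
    """Build an OpenAI-style chat request body with one text part and one image part."""
    # The image is embedded inline as a base64 data URI.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    },
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Dummy bytes stand in for a real image file read from disk.
    payload = build_image_chat_payload(b"abc", "What does this chart show?")
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to the server's chat completions route.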
Example Usage
llama.cpp
./llama-cli \
-m SandlogicTechnologies/gemma-4-E4B_IQ4_NL.gguf \
-p "Explain the concept of attention in transformer models."
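The same invocation can be scripted. The sketch below only assembles the argument list for `llama-cli`, using the `-m` and `-p` flags from the command above plus `-n` (tokens to generate) and `-c` (context size). The helper name, the default values, and the binary location are assumptions to adjust for your setup.

```python
import shlex
from typing import List


def build_llama_cli_command(
    model_path: str,
    prompt: str,
    n_predict: int = 256,    # assumed default: max tokens to generate
    ctx_size: int = 4096,    # assumed default: context window to allocate
    binary: str = "./llama-cli",
) -> List[str]:
    """Return the llama-cli argument list for a single prompt."""
    return [
        binary,
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_predict),
        "-c", str(ctx_size),
    ]


if __name__ == "__main__":
    cmd = build_llama_cli_command(
        "SandlogicTechnologies/gemma-4-E4B_IQ4_NL.gguf",
        "Explain the concept of attention in transformer models.",
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
    # To actually run it where llama-cli and the model file exist:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or special characters.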
Recommended Use Cases
- Advanced reasoning systems and analytical pipelines
- Agentic AI workflows and structured task automation
- Technical research assistants and domain-specific copilots
- Multimodal document and data interpretation
- Code generation, debugging, and system design
- Long-context knowledge extraction and summarization
- Educational platforms for complex problem solving
- High-quality conversational agents requiring consistency and depth
Acknowledgments
These quantized models are based on the original work by the Google development team.
Special thanks to:
- The Google team for developing and releasing the gemma-4-E4B model.
- Georgi Gerganov and the llama.cpp open-source community for enabling efficient quantization and inference via the GGUF format.
Contact
For any inquiries or support, please contact us at support@sandlogic.com or visit our Website.
Model tree for SandLogicTechnologies/gemma-4-E4B-GGUF
Base model
google/gemma-4-E4B