Built with Axolotl

See axolotl config

axolotl version: 0.16.2.dev0

base_model: swiss-ai/Apertus-8B-Instruct-2509
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true

load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: bfloat16
bnb_4bit_use_double_quant: true

datasets:
  - path: Jackrong/DeepSeek-V4-Distill-8000x
    type:
      field_instruction: input
      field_output: output

      format: |
        ### Instruction:
        {instruction}

        ### Response:

      no_input_format: |
        ### Instruction:
        {instruction}

        ### Response:
  
#dataset_prepared_path: /data/prepared/apertus8b_deepseekv4

output_dir: /outputs/apertus-8b-deepseekv4-qlora

val_set_size: 0.05

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

adapter: qlora

lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

micro_batch_size: 2
gradient_accumulation_steps: 8

num_epochs: 3

optimizer: paged_adamw_8bit

learning_rate: 0.00015
lr_scheduler: cosine
warmup_ratio: 0.03
weight_decay: 0.0

bf16: true
tf32: true

gradient_checkpointing: true

logging_steps: 10
eval_steps: 100
save_steps: 100
save_total_limit: 3

#############################################
# HUGGING FACE HUB
#############################################

hub_model_id: laurent-maille/apertus-8b-deepseekv4-fr
hub_strategy: every_save


hf_use_auth_token: true

hub_private_repo: true

save_safetensors: true

#############################################
# OPTIONAL
#############################################

#wandb_project: apertus-8b-ft
#wandb_name: apertus-8b-deepseekv4

special_tokens:
  pad_token: "<|endoftext|>"
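The `format` template in the config above defines the prompt layout the adapter was trained on, so inference prompts should probably follow the same layout. The sketch below rebuilds that string in Python. It assumes the YAML block scalar keeps a single trailing newline after `### Response:` and that `{instruction}` is filled from the dataset's `input` field; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, not part of Axolotl.

```python
# Prompt layout copied from the `format` entry of the config.
# Assumption: the YAML `|` block scalar clips trailing blank lines to one newline.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(instruction: str) -> str:
    """Render a single-turn prompt in the layout used during fine-tuning."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


if __name__ == "__main__":
    # The model's completion is expected to start right after "### Response:".
    print(build_prompt("Summarize the training setup in one sentence."))
```

Keeping the inference prompt identical to the training template avoids a train/inference mismatch; whether the base model's chat template should be used instead is not stated in this card.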

apertus-8b-deepseekv4-fr

This model is a fine-tuned version of swiss-ai/Apertus-8B-Instruct-2509 on the Jackrong/DeepSeek-V4-Distill-8000x dataset. It achieves the following results on the evaluation set:

  • Loss: 1.1878
  • Perplexity (Ppl): 3.2797
  • Max active memory (GiB): 26.08
  • Max allocated memory (GiB): 26.08
  • Device reserved memory (GiB): 43.59
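The reported perplexity is consistent with the exponential of the evaluation loss. A minimal check, assuming the loss is the mean token-level cross-entropy in nats (the usual convention, but not stated in this card):

```python
import math

EVAL_LOSS = 1.1878      # reported evaluation loss
REPORTED_PPL = 3.2797   # reported perplexity


def perplexity(loss: float) -> float:
    """Perplexity of a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


computed_ppl = perplexity(EVAL_LOSS)

# The loss is rounded to four decimals, so only approximate agreement is expected.
print(f"computed={computed_ppl:.4f} reported={REPORTED_PPL}")
```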

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.00015
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 16
  • optimizer: OptimizerNames.PAGED_ADAMW_8BIT with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 10
  • training_steps: 360
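The batch-size and scheduler entries above fit together arithmetically. The sketch below reproduces the total batch size and the warmup step count, and evaluates a cosine schedule with linear warmup at a few steps. It assumes a single device (implied by 2 × 8 = 16), that warmup steps come from truncating `warmup_ratio × training_steps`, and that the schedule has the standard linear-warmup, cosine-decay-to-zero shape; the exact implementation used in this run is not confirmed by the card.

```python
import math

MICRO_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 8
NUM_DEVICES = 1          # assumption: inferred from total_train_batch_size = 16
BASE_LR = 1.5e-4
WARMUP_RATIO = 0.03
TRAINING_STEPS = 360

# Sequences contributing to one optimizer update.
total_batch_size = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES

# Assumption: truncation, which matches the reported 10 warmup steps.
warmup_steps = int(WARMUP_RATIO * TRAINING_STEPS)


def cosine_lr(step: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, TRAINING_STEPS - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 10, 100, 200, 300, 360):
        print(step, f"{cosine_lr(step):.2e}")
```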

Training results

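| Training Loss | Epoch  | Step | Validation Loss | Ppl     | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
|---------------|--------|------|-----------------|---------|--------------|-----------------|----------------|
| No log        | 0      | 0    | 4.0401          | 56.8296 | 26.06        | 26.06           | 27.67          |
| 1.2896        | 0.8282 | 100  | 1.2931          | 3.6441  | 26.08        | 26.08           | 43.39          |
| 1.2472        | 1.6625 | 200  | 1.2097          | 3.3524  | 26.08        | 26.08           | 43.84          |
| 1.2004        | 2.4886 | 300  | 1.1893          | 3.2847  | 26.08        | 26.08           | 43.84          |
| 1.1840        | 2.9855 | 360  | 1.1878          | 3.2797  | 26.08        | 26.08           | 43.59          |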
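The epoch and step columns give a rough picture of the dataset size after packing. The sketch below estimates optimizer steps per epoch from each logged row and, from that, the approximate number of packed sequences per epoch. It assumes every optimizer step consumes 16 sequences of at most 8192 tokens (from `total_train_batch_size` and `sequence_len`); the derived counts are estimates, not values reported by the training run.

```python
# (epoch, step) pairs from the logged evaluation rows, excluding step 0.
LOGGED = [(0.8282, 100), (1.6625, 200), (2.4886, 300), (2.9855, 360)]

TOTAL_BATCH_SIZE = 16   # sequences per optimizer step
SEQUENCE_LEN = 8192     # maximum tokens per packed sequence

# Each row gives an independent estimate of steps per epoch.
steps_per_epoch = [step / epoch for epoch, step in LOGGED]
mean_steps_per_epoch = sum(steps_per_epoch) / len(steps_per_epoch)

# Estimated packed training sequences seen in one epoch.
approx_sequences_per_epoch = mean_steps_per_epoch * TOTAL_BATCH_SIZE

# Upper bound on tokens per optimizer step (all sequences completely full).
max_tokens_per_step = TOTAL_BATCH_SIZE * SEQUENCE_LEN

if __name__ == "__main__":
    print(f"steps/epoch ~ {mean_steps_per_epoch:.1f}")
    print(f"packed sequences/epoch ~ {approx_sequences_per_epoch:.0f}")
    print(f"max tokens/step = {max_tokens_per_step}")
```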

Framework versions

  • PEFT 0.19.1
  • Transformers 5.5.4
  • Pytorch 2.10.0+cu128
  • Datasets 4.8.5
  • Tokenizers 0.22.2