Fine-Tuning Gemma 3 1B: My Actual Workflow + Costs
I fine-tune a small open-weight model every week or two for narrow tasks. Here's the actual workflow: data prep, the training script, evaluation, and the costs at each stage.
When fine-tuning beats prompting
Fine-tuning wins when you have:
- A high-volume, narrow task where prompt-with-examples gets expensive (each call carries the few-shot examples in input tokens; see the break-even sketch after this list).
- A specific output format you need consistently.
- Domain language a base model under-represents.
It loses when the task is low-volume or when prompting already passes your eval. "I'd prefer to fine-tune" is not a reason.
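To make the first bullet concrete, here's the break-even arithmetic. Every number below is a placeholder assumption rather than a quote; plug in your provider's real rates.

# Back-of-envelope: one-time fine-tune cost vs. per-call few-shot overhead.
# All numbers are hypothetical placeholders.
FEW_SHOT_TOKENS = 1_500      # few-shot examples carried in every prompt
PRICE_PER_M_INPUT = 0.10     # $/1M input tokens (assumed)
FINE_TUNE_COST = 5.00        # one-time training spend (assumed)

per_call_overhead = FEW_SHOT_TOKENS / 1e6 * PRICE_PER_M_INPUT
print(f"break-even at {FINE_TUNE_COST / per_call_overhead:,.0f} calls")  # ~33,333 here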
Data prep
Format: JSONL, one {prompt, completion} object per line. Cleaner than chat-template formatting for a small specialist model. Targets:
- 1,000–5,000 high-quality pairs. More past 5,000 helps surprisingly little for a 1B model.
- Strict deduplication (exact and near-dup). Contamination is the #1 silent killer of fine-tunes.
- Eval split held out before any training run. 10% is fine. A sketch of the dedup + split pass follows this list.
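Here's a minimal version of that prep pass. The file names are assumptions (raw.jsonl in, train.jsonl and eval.jsonl out), and the normalized-hash near-dup check is a crude stand-in for real MinHash/SimHash detection:

import hashlib, json, random, re

def norm(s: str) -> str:
    # Lowercase and strip punctuation/whitespace so trivial variants collide.
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

seen, rows = set(), []
with open("raw.jsonl") as f:  # one {"prompt": ..., "completion": ...} per line
    for line in f:
        r = json.loads(line)
        text = r["prompt"] + r["completion"]
        keys = (
            hashlib.sha256(text.encode()).hexdigest(),        # exact dup
            hashlib.sha256(norm(text).encode()).hexdigest(),  # crude near-dup
        )
        if any(k in seen for k in keys):
            continue
        seen.update(keys)
        rows.append(r)

random.Random(0).shuffle(rows)  # fixed seed: the split must be reproducible
cut = max(1, len(rows) // 10)   # 10% held out for eval
for path, chunk in (("eval.jsonl", rows[:cut]), ("train.jsonl", rows[cut:])):
    with open(path, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in chunk)

Writing eval.jsonl first keeps the split in place before any training artifact exists.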
Training script (Unsloth)
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 4-bit base model keeps VRAM within reach of a single consumer GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Rank-16 LoRA on every attention and MLP projection.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

ds = load_dataset("json", data_files="train.jsonl")["train"]
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=ds,
    args=SFTConfig(
        output_dir="./out", num_train_epochs=3,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,  # effective batch 16
        learning_rate=2e-4, warmup_ratio=0.03,
        bf16=True, logging_steps=10, save_steps=100,
    ),
)
trainer.train()
trainer.push_to_hub("your-handle/gemma-3-1b-mytask-v1")  # uploads the LoRA adapter
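Before pushing I run one generation as a smoke test. This continues from the script above; Unsloth's for_inference call switches the model into inference mode:

# Quick sanity check on a held-out prompt before uploading the adapter.
FastLanguageModel.for_inference(model)
inputs = tokenizer("one prompt from eval.jsonl", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))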
Cost shape
| Stage | Where | Indicative cost |
|---|---|---|
| Data prep | Local | $0 + your time |
| Training (LoRA, 3 epochs, 2k pairs) | Single consumer or rental GPU, ~30–60 min | Single-digit dollars on a rental tier |
| Eval | Local or vLLM box | $0 |
| Push to HF | HuggingFace | $0 |
| Serving (vLLM with adapter) | Same rental tier as base | $0.10–$0.20/hr while running |
Numbers above are illustrative ranges, not vendor quotes — every rental marketplace has different spot/on-demand math.
Eval that catches regressions
Same held-out set every time, scored automatically where possible. Two metrics (a scoring sketch follows the list):
- Exact-match accuracy on structured outputs (formats, classifications).
- Pairwise LLM-judge score against the previous version on freeform outputs. Treat it as a regression detector, not absolute truth.
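The exact-match half is a few lines. A sketch assuming the eval.jsonl from the prep step; generate() is a placeholder for whatever inference call you use, not a real API:

import json

def exact_match(pred: str, gold: str) -> bool:
    # Trim whitespace only; anything looser hides format regressions.
    return pred.strip() == gold.strip()

rows = [json.loads(line) for line in open("eval.jsonl")]
preds = [generate(r["prompt"]) for r in rows]  # generate() is your inference call
acc = sum(exact_match(p, r["completion"]) for p, r in zip(preds, rows)) / len(rows)
print(f"exact match: {acc:.1%}")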
What goes wrong
- Catastrophic format loss. Model forgets to emit JSON. Fix: include format examples in >30% of training pairs.
- Overfitting on common patterns. Loss looks great, eval set tanks. Fix: smaller LoRA rank, fewer epochs.
- Contaminated training data. Eval pairs leaked into training. Fix: hash check before every run (guard sketch below).
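That hash check is cheap enough to gate every run on. A sketch over the prompt field, using the train.jsonl/eval.jsonl layout from the prep step:

import hashlib, json

def prompt_hashes(path):
    # Hash prompts only: a leaked prompt is contamination even if the
    # completion was edited later.
    return {hashlib.sha256(json.loads(line)["prompt"].encode()).hexdigest()
            for line in open(path)}

leaked = prompt_hashes("train.jsonl") & prompt_hashes("eval.jsonl")
assert not leaked, f"{len(leaked)} eval prompts found in training data"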
Deployment
Push the LoRA adapter to HuggingFace and point vLLM at base + adapter. Inference cost is the same as the base model; the adapter adds no meaningful extra GPU footprint.
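A minimal offline-inference sketch of that setup. Whether LoRARequest's lora_path accepts a Hub repo id directly or needs a local download first depends on your vLLM version, so treat that part as an assumption:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model stays resident; the adapter is applied per request.
llm = LLM(model="unsloth/gemma-3-1b-it", enable_lora=True)
out = llm.generate(
    ["one prompt from eval.jsonl"],
    SamplingParams(temperature=0.0, max_tokens=128),
    lora_request=LoRARequest("mytask", 1, "your-handle/gemma-3-1b-mytask-v1"),
)
print(out[0].outputs[0].text)

For serving, the same adapter works through vLLM's OpenAI-compatible server with --enable-lora and --lora-modules; requests then select the adapter by name.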