Pretraining is where your model learns language. Trained on massive text corpora to predict the next token, it learns grammar, facts, reasoning patterns, and world knowledge. This is the most compute-intensive phase.
// Training Hyperparameters
📈Learning Rate Schedule
Optimization
Cosine annealing + warmup. Peak LR: 3e-4 (1B model). Warmup: 1-2% of total steps. Too high LR → loss spike.
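For reference, a minimal sketch of that schedule: linear warmup to a peak LR followed by cosine decay. The `peak_lr`, `warmup_frac`, and `min_lr` defaults here are illustrative, not prescriptive.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Warmup: ramp linearly from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```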
📦Batch Size
Efficiency
Large effective batch size via gradient accumulation. Target: 0.5M–4M tokens/batch. Small batch → noisy gradients.
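A common way to reach that effective batch size is gradient accumulation. The sketch below uses a toy model and a stand-in loss so it runs as-is; `accum_steps` is an illustrative value, and a real run would use the LM cross-entropy loss from your training framework.

```python
import torch
from torch import nn

# Toy setup so the loop is runnable; in practice these come from your framework.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
train_loader = [torch.randn(4, 16) for _ in range(128)]   # stand-in micro-batches

accum_steps = 32  # illustrative; choose so effective batch lands in 0.5M-4M tokens

optimizer.zero_grad()
for i, x in enumerate(train_loader):
    loss = model(x).pow(2).mean()         # stand-in loss; a real run uses LM cross-entropy
    (loss / accum_steps).backward()       # scale so accumulated gradients average correctly
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```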
📉Loss Tracking
Monitoring
Track training loss every step and validation loss at regular step intervals; pretraining rarely runs for multiple epochs, so per-epoch checks are too coarse. If val loss starts rising, overfitting has begun.
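A minimal evaluation helper, assuming batches are dicts of tensors and a HuggingFace-style forward pass that returns an object with a `.loss` attribute:

```python
import math
import torch

@torch.no_grad()
def eval_val_loss(model, val_loader, device="cpu"):
    """Average validation loss and perplexity over val_loader."""
    model.eval()
    total, n = 0.0, 0
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        total += model(**batch).loss.item()
        n += 1
    model.train()
    avg = total / max(1, n)
    return avg, math.exp(avg)   # (val loss, val perplexity)
```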
🏆Best Checkpoint
Model Selection
Select checkpoint by val loss. Use the one with lowest val perplexity — not necessarily the final checkpoint.
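A sketch of that selection loop; `train_some_steps` and `save_checkpoint` are hypothetical helpers standing in for your actual training and checkpointing code, and `eval_val_loss` is the helper sketched above.

```python
# Keep whichever checkpoint has the lowest validation perplexity so far.
best_ppl = float("inf")
best_step = None

for step in range(eval_interval, total_steps + 1, eval_interval):
    train_some_steps(model, optimizer, eval_interval)        # hypothetical helper
    val_loss, val_ppl = eval_val_loss(model, val_loader)
    if val_ppl < best_ppl:
        best_ppl, best_step = val_ppl, step
        save_checkpoint(model, optimizer, step, tag="best")   # hypothetical helper

print(f"best checkpoint: step {best_step}, val perplexity {best_ppl:.2f}")
```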
⚠
Chinchilla Scaling Law: Compute-optimal training for an N-parameter model uses roughly 20×N tokens, so a 1B model should see about 20B tokens. Fewer tokens → the model is undertrained relative to its capacity.
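As a quick worked example of the 20-tokens-per-parameter rule of thumb:

```python
def chinchilla_tokens(n_params: int) -> int:
    """Compute-optimal token budget under the ~20 tokens/parameter rule of thumb."""
    return 20 * n_params

print(chinchilla_tokens(1_000_000_000))   # 1B params  -> 20,000,000,000 tokens
print(chinchilla_tokens(7_000_000_000))   # 7B params  -> 140,000,000,000 tokens
```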
☠
Loss Spike: Sudden loss increases are common. Causes: bad data batch, LR too high, NaN in gradients. Log gradient norm, roll back to previous checkpoint on spike.
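One possible shape for that monitoring logic, assuming a `load_checkpoint` helper and a spike threshold of your choosing (both illustrative, not a fixed recipe):

```python
import math
import torch

spike_factor = 3.0        # illustrative threshold for "sudden" loss increase
last_good_loss = None

for step, batch in enumerate(train_loader):
    loss = model(**batch).loss
    loss.backward()
    # Clip and log the gradient norm every step so spikes are visible early.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    print(f"step={step} loss={loss.item():.3f} grad_norm={grad_norm.item():.3f}")

    spiked = not math.isfinite(loss.item()) or (
        last_good_loss is not None and loss.item() > spike_factor * last_good_loss
    )
    if spiked:
        # Spike or NaN: discard this update and restore the last good checkpoint.
        optimizer.zero_grad()
        load_checkpoint(model, optimizer, tag="last_good")   # hypothetical helper
        continue

    optimizer.step()
    optimizer.zero_grad()
    last_good_loss = loss.item()
```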