AI Engineering Lab — Local LLM Bootcamp

BUILD A LOCAL LLM FROM ZERO

A complete engineering blueprint — from raw data to a deployed, quantized model running on your own hardware. Every component. Every decision. No shortcuts.

12 Phases
50+ Tasks
~5M GPU Hours (7B)
Phase 01 · Foundation
Setup & Prerequisites
Beginner ⏱ 1 week
Before writing a single line of training code, your development environment must be ready. Python version checks, GPU access, git configuration, and CI/CD infrastructure are all set up in this phase. These seemingly small steps form the foundation for all subsequent phases.
🐍Python 3.10+
Runtime
Verify Python version: python --version. Python 3.10+ is required for PyTorch, transformers, and datasets libraries.
🖥️CUDA & GPU
Hardware
torch.cuda.is_available() must return True. NVIDIA driver and CUDA toolkit must be installed. Verify with nvcc --version.
🐙Git & GitLab CI
Version Control
Create repo, add .gitignore, set up basic CI pipeline with .gitlab-ci.yml. Every commit should trigger automated lint/test runs.
📦requirements.txt
Dependencies
Pin all packages: torch==2.x, transformers, datasets, accelerate, wandb. Save with pip freeze > requirements.txt.
Tip: Create an isolated environment with Conda or venv. Never touch the system Python — package conflicts in later phases cause serious problems.
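The checks above can be bundled into one short script and run in CI; a minimal sanity-check sketch (the asserted package set mirrors the requirements.txt suggestion and is an assumption):

Python — Environment Sanity Check
import sys
import torch
import transformers
import datasets

# Python 3.10+ is required by the stack used in later phases
assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version}"

# The GPU must be visible to PyTorch before any training work starts
assert torch.cuda.is_available(), "CUDA not available: check driver and toolkit install"
print("GPU:", torch.cuda.get_device_name(0))
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__, "| datasets:", datasets.__version__)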
Phase 02 · Data
Data Collection & Preparation
Intermediate ⏱ 2–4 weeks
"Garbage in, garbage out" has never been more true. A model's knowledge ceiling is its training data. In this phase, data sources are researched, raw data is downloaded, licenses are checked, and initial quality analysis is performed.
🌐Common Crawl / FineWeb
Web Corpus
Petabytes of web text. FineWeb is a pre-cleaned subset — saves months of cleaning work.
📚The Pile / RedPajama-v2
Curated Dataset
Books, arXiv papers, GitHub code, Wikipedia. Essential for diversity. Open, pre-processed releases are available.
🔬Domain-Specific Data
Custom Sources
Medical, legal, technical data. Critical for domain-focused models. Always verify the license of every source.
📊Statistics Report
Quality Analysis
Token count, language distribution, source distribution, average document length. Without this report, you're flying blind.
License Warning: Document the license of every data source. Commercial use requires CC-BY, Apache 2.0, or similar open licenses. Distributing a model trained on unlicensed data carries legal risk.
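As a starting point for the statistics report, a minimal sketch that streams a sample and records document counts and lengths (the dataset name, sample size, and field names are assumptions; point it at your own sources):

Python — Quick Corpus Statistics (sampled)
from itertools import islice
from collections import Counter
from datasets import load_dataset

# Streaming avoids downloading the full corpus; dataset/field names are examples
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

n_docs, n_words = 0, 0
length_buckets = Counter()
for doc in islice(ds, 100_000):            # sample size is arbitrary
    words = doc["text"].split()            # crude whitespace tokens, not BPE tokens
    n_docs += 1
    n_words += len(words)
    length_buckets[min(len(words) // 500, 10)] += 1

print(f"docs: {n_docs:,}   avg length: {n_words / n_docs:.0f} words")
print("length histogram (x500 words):", dict(sorted(length_buckets.items())))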
Phase 03 · Data
Data Processing & Cleaning
Intermediate ⏱ 2–3 weeks
Raw data cannot be used directly. Duplicate content, low-quality documents, and personally identifiable information (PII) must be removed; train/val/test splits must be created at correct ratios. The quality of this phase directly determines model quality.
01 Language ID (fastText)
02 Quality Filter (perplexity)
03 Exact Dedup (suffix array)
04 Fuzzy Dedup (MinHash LSH)
05 PII Removal (regex / NER)
06 Split (90/5/5)
Split | Ratio | Purpose
Train | 90% | Model training
Validation | 5% | Monitoring during training
Test | 5% | Final evaluation
Why is deduplication critical? Repeated data causes the model to memorize rather than generalize. MinHash fuzzy dedup typically removes 15-30% of near-duplicate documents and measurably improves downstream perplexity; a minimal sketch follows below.
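A minimal fuzzy-dedup sketch using the datasketch library (one MinHash implementation among several; the 0.8 threshold and 5-word shingles are assumptions to tune on your corpus):

Python — MinHash LSH Fuzzy Dedup
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    # 5-word shingles capture local phrasing; shingle size is tunable
    for i in range(max(1, len(words) - 4)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # ~80% Jaccard counts as a near-duplicate
kept_ids = []
for doc_id, text in enumerate(documents):        # `documents` is your cleaned corpus
    m = doc_minhash(text)
    if lsh.query(m):                             # a similar doc was already kept
        continue
    lsh.insert(str(doc_id), m)
    kept_ids.append(doc_id)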
Phase 04 · Data
Tokenizer Training
Intermediate ⏱ 1 week
The model doesn't see raw text — it sees integer token IDs. The tokenizer converts text into the numerical representation the model processes. Training a custom tokenizer for your language or domain directly affects model quality and efficiency.
🔢SentencePiece (BPE)
Algorithm
Byte Pair Encoding. Train on a representative corpus. Target vocab size: 32K–64K. Training on your own data is mandatory.
HuggingFace Tokenizers
Implementation
Rust-backed, 10-100× faster than pure-Python tokenizers. Supports BPE, WordPiece, Unigram. Standard choice for production.
🏷️Special Tokens
Configuration
Define <pad>, <eos>, <bos>, <unk> tokens. For chat models, add special tokens like <|user|>, <|assistant|>.
📏Quality Test
Validation
Token/word ratio should be < 1.5. Test on your target language. Too many tokens → insufficient vocab coverage.
Vocab size selection: 32K is sufficient for most projects. 64K+ provides better multilingual representation but increases the embedding table. Start with 32K for your first build.
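A minimal SentencePiece training and sanity-check sketch; the file names, vocab size, and chat tokens are assumptions, and the final check applies the token/word rule from the quality-test card:

Python — Train and Check a BPE Tokenizer
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",                   # representative sample of the cleaned corpus
    model_prefix="tokenizer",
    model_type="bpe",
    vocab_size=32_000,
    character_coverage=0.9995,
    user_defined_symbols=["<|user|>", "<|assistant|>"],   # chat special tokens, optional
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
sample = open("validation_sample.txt", encoding="utf-8").read()
ids = sp.encode(sample)
ratio = len(ids) / max(1, len(sample.split()))
print(f"token/word ratio: {ratio:.2f}")          # target < 1.5 on your language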
Phase 05 · Architecture
Model Architecture Design
Advanced ⏱ 1–2 weeks
Every modern LLM is a transformer — but the details are critically important. How many layers, which attention mechanism, which positional encoding, which normalization strategy. The decisions made in this phase determine the model's capacity and efficiency.
📐Decoder-Only Transformer
Core Architecture
GPT-style autoregressive next-token prediction. Best choice for text generation, chat, and code. N identical blocks stacked.
🔩RMSNorm + SwiGLU
Modern Defaults
RMSNorm instead of LayerNorm (faster), SwiGLU activation instead of ReLU. Standard choices from the LLaMA-3 architecture.
🌀RoPE Positional Encoding
Position Information
Rotary Position Embedding. Directly injects relative position information into attention. Superior performance for long context.
📏GQA / MHA
Attention Variant
Grouped Query Attention: reduces KV cache size. For 7B+ models, prefer GQA over MHA — significant memory and speed advantages.
Size | Parameters | VRAM (BF16) | VRAM (4-bit) | Recommended GPU
Small | 125M | 0.5 GB | 0.2 GB | Any GPU
Medium | 1B | 2 GB | 0.5 GB | RTX 3060+
Large | 7B | 14 GB | 4 GB | RTX 3090 / 4090
XL | 13B | 26 GB | 8 GB | 2× A6000
First build recommendation: 125M–1B parameters, decoder-only, using LLaMA-3 architecture as reference. Small enough for fast iteration, large enough for meaningful language generation.
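The two modern defaults named above are small enough to write out; a minimal PyTorch sketch following the LLaMA-style definitions (the tensor sizes at the bottom are illustrative):

Python — RMSNorm and SwiGLU
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scale by the inverse RMS of the features; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated MLP: down( silu(gate(x)) * up(x) )."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 2048)                     # (batch, seq, d_model), illustrative
y = SwiGLU(2048, 5632)(RMSNorm(2048)(x))
print(y.shape)                                   # torch.Size([2, 16, 2048])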
Phase 06 · Infrastructure
Training Infrastructure Setup
Advanced ⏱ 1 week
LLM training is as much an engineering problem as a research problem. Saturating GPUs, eliminating data loading bottlenecks, and establishing stable training loops are the goals of this phase.
🔀Mixed Precision (BF16)
Memory Optimization
BF16 saves 2× memory compared to FP32 with minimal precision loss. Enable with torch.autocast.
🗜️Gradient Checkpointing
Memory Savings
Recompute activations instead of storing them. 30-40% slower in exchange for roughly 4-8× savings in activation memory. Mandatory for large models.
💾Checkpoint System
Resilience
Save checkpoint every N steps. Training should be resumable if interrupted. Keep the best val-loss checkpoint separately.
📊WandB / TensorBoard
Monitoring
Loss curve, gradient norm, learning rate, GPU utilization. Training without logging is flying blind — always monitor.
Smoke Test: After infrastructure setup, train for 100 steps. Is loss decreasing? Any NaN? Is GPU utilization above 90%? Don't proceed to full training without passing these checks.
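A minimal sketch of that smoke test, showing BF16 autocast, gradient clipping, and periodic checkpointing; the model and dataloader are placeholders wired up in your own code, and gradient_checkpointing_enable() assumes an HF-style model:

Python — 100-Step Smoke Test
import torch

model = model.cuda()                              # model / dataloader defined elsewhere
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
model.gradient_checkpointing_enable()             # HF-style models; otherwise torch.utils.checkpoint

for step, batch in enumerate(dataloader):
    if step >= 100:
        break
    input_ids = batch["input_ids"].cuda()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(input_ids, labels=input_ids).loss
    loss.backward()                               # BF16 needs no GradScaler
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0))
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    if step % 10 == 0:
        print(f"step {step}  loss {loss.item():.3f}  grad_norm {grad_norm:.2f}")
    if step % 50 == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, f"ckpt_{step}.pt")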
Phase 07 · Training
Pretraining
Advanced ⏱ 2–8 weeks
Pretraining is where your model learns language. Trained on massive text corpora to predict the next token, it learns grammar, facts, reasoning patterns, and world knowledge. This is the most compute-intensive phase.
📈Learning Rate Schedule
Optimization
Cosine annealing + warmup. Peak LR: 3e-4 (1B model). Warmup: 1-2% of total steps. Too high LR → loss spike.
📦Batch Size
Efficiency
Large effective batch size via gradient accumulation. Target: 0.5M–4M tokens/batch. Small batch → noisy gradients.
📉Loss Tracking
Monitoring
Track train loss continuously and evaluate validation loss every few thousand steps; pretraining rarely runs for more than one epoch. If val loss starts increasing, overfitting has begun.
🏆Best Checkpoint
Model Selection
Select checkpoint by val loss. Use the one with lowest val perplexity — not necessarily the final checkpoint.
Chinchilla Scaling Law: Optimal training for an N-parameter model requires 20×N tokens. 1B model needs minimum 20B tokens. Fewer tokens → model is undertrained relative to its capacity.
Loss Spike: Sudden loss increases are common. Causes: bad data batch, LR too high, NaN in gradients. Log gradient norm, roll back to previous checkpoint on spike.
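A minimal sketch of the warmup-plus-cosine schedule described above; the peak and minimum learning rates mirror the 1B-scale numbers mentioned and should be tuned per model:

Python — Cosine LR Schedule with Warmup
import math

def get_lr(step: int, max_steps: int, warmup_steps: int,
           peak_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Apply before each optimizer.step():
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step, max_steps, warmup_steps)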
Phase 08 · Fine-Tuning
Supervised Fine-Tuning (SFT)
Intermediate ⏱ 1–3 weeks
A pretrained model is a text completion engine — it doesn't follow instructions. Supervised Fine-Tuning (SFT) teaches the model to follow them. A smaller, high-quality dataset outperforms a larger, noisy one.
📝Instruction Dataset
SFT Data
Alpaca, ShareGPT, Dolly-15K. Format: {system, instruction, response}. Quality over quantity — 10K clean beats 1M noisy.
🎯LoRA / QLoRA
Parameter-Efficient FT
Inject low-rank adapters into attention matrices. Train 0.1–1% of parameters. QLoRA with 4-bit → fine-tune 70B on a single GPU.
💬Chat Template
Prompt Format
Define consistent {System}{User}{Assistant} template with special tokens. Mixed templates break model behavior.
🔧Axolotl / TRL
Training Framework
Axolotl: config-driven, supports full/LoRA/QLoRA. TRL: HuggingFace's battle-tested SFT/DPO trainer.
LoRA rank selection: r=8 is low, r=64 is high-capacity adaptation. Start with r=16 or r=32 for first runs. alpha=2×r is a good starting point.
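A minimal QLoRA-style setup sketch with the peft and bitsandbytes integrations in transformers; the checkpoint path, target modules, and hyperparameters are assumptions, with r and alpha following the rule of thumb above:

Python — LoRA Adapter Setup
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(                         # 4-bit base weights (QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("./my-pretrained-model",
                                             quantization_config=bnb,
                                             device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,       # alpha = 2 x r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()                # expect well under 1% trainable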
Phase 09 · Alignment
Alignment & DPO
Intermediate ⏱ 1–2 weeks
An SFT model follows instructions but isn't aligned with human preferences. DPO (Direct Preference Optimization) steers the model toward "preferred" responses. Simpler and more stable than PPO-based RLHF.
📊Preference Dataset
DPO Data
(prompt, chosen, rejected) triplets. HH-RLHF, UltraFeedback. Or use a stronger model as a judge to generate synthetic pairs.
🤝DPO Training
Alignment
Reward margin must be positive. The β hyperparameter (0.1–0.5) controls the KL penalty against the reference model; too high a β keeps the policy too close to the reference and weakens alignment.
Alignment Evaluation
Validation
Compare with MT-Bench, AlpacaEval. Test responses to harmful prompts. Compare against prior SFT checkpoint.
DPO vs RLHF: DPO eliminates the reward model — it optimizes directly from preference data. Less GPU, less code, more stable training. Ideal for first-time builders.
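The DPO objective itself fits in a few lines; a minimal sketch of the loss and the reward margin mentioned above, assuming you already have summed per-sequence log-probabilities from the policy and a frozen reference model (TRL's DPOTrainer wraps this plus batching and reference handling):

Python — DPO Loss and Reward Margin
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss over per-sequence log-probs; all inputs are 1-D tensors of equal length."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    margin = (chosen_rewards - rejected_rewards).mean()   # should trend positive during training
    return loss, margin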
Phase 10 · Evaluation
Evaluation & Safety
Beginner ⏱ 1–2 weeks
How good is the model? This question must be answered before deployment. Standard benchmarks, toxicity tests, and a model card are the outputs of this phase.
🧠MMLU
General Knowledge
57 subjects, 14K questions. 5-shot evaluation. Standard benchmark for general world knowledge.
🔢GSM8K
Mathematical Reasoning
8.5K elementary math problems. Test with chain-of-thought prompting.
🌊HellaSwag / ARC
Language Understanding
Commonsense reasoning (HellaSwag) and science Q&A (ARC). Use for fast evaluation at every checkpoint.
🛡️Toxicity Testing
Safety
Test against harmful content generation. Use RealToxicityPrompts or a custom prompt set. Document in model card.
Eval Strategy: Run a fast suite (ARC, HellaSwag, perplexity) at every checkpoint. Run the full suite (MMLU, GSM8K) only at major milestones. Never tune hyperparameters against benchmark scores — that's data contamination.
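Perplexity is the cheapest member of the fast suite; a minimal sketch (the checkpoint path and the held_out_docs iterable are placeholders for your own validation split):

Python — Held-Out Perplexity
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./checkpoint-best",
                                             torch_dtype=torch.bfloat16).cuda().eval()
tok = AutoTokenizer.from_pretrained("./checkpoint-best")

total_nll, total_tokens = 0.0, 0
for text in held_out_docs:                        # your validation documents
    ids = tok(text, return_tensors="pt",
              truncation=True, max_length=2048).input_ids.cuda()
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # mean next-token negative log-likelihood
    n = ids.shape[1] - 1                          # tokens actually predicted
    total_nll += loss.item() * n
    total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))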
Phase 11 · Optimization
Quantization
Intermediate ⏱ 1 week
A full-precision 7B model weighs ~28GB. Most people don't have that much VRAM. Quantization compresses the model to 4–8 bits, reducing size 4–8× with minimal quality loss. This is what makes local deployment practical on consumer hardware.
📦GGUF + llama.cpp
CPU/GPU Inference
Convert to GGUF format. Supports Q2 through Q8 quants. Runs on CPU with optional GPU offload. The go-to for local deployment.
GPTQ / AWQ
GPU Quantization
Post-training quantization for GPU inference. AWQ is more accurate and faster. Both produce INT4 models with near-FP16 quality.
📊Quality Control
Validation
Compare perplexity before and after quantization. Q4_K_M typically shows 2-3% perplexity increase — acceptable.
Format | 7B Size | VRAM Needed | Quality Loss | Use Case
FP16 | 14 GB | 16 GB | None | GPU fine-tuning
Q8_0 GGUF | 7.7 GB | 10 GB | Negligible | Best quality local
Q5_K_M | 5.0 GB | 6 GB | Minimal | Balanced
Q4_K_M | 4.1 GB | 5 GB | Low | Consumer GPU sweet spot
Q2_K | 2.8 GB | 4 GB | High | Very constrained hardware only
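One way to run the before/after comparison without leaving Python is to load the same checkpoint in 4-bit via bitsandbytes and rerun the Phase 10 perplexity loop on both models; note this approximates low-bit quality rather than measuring the GGUF file itself:

Python — BF16 vs 4-bit Perplexity Comparison
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

ckpt = "./checkpoint-best"                        # same checkpoint evaluated in Phase 10

model_bf16 = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto")

model_4bit = AutoModelForCausalLM.from_pretrained(
    ckpt,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto")

# Run the Phase 10 perplexity loop on both models; a gap of a few percent
# is in line with the Q4_K_M expectation in the table above.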
Phase 12 · Deployment
Deployment & API
Beginner ⏱ 1 week
A model sitting as weights on disk helps no one. Wrap it in an API, build a minimal interface, add monitoring for latency and throughput, and document your system prompt and chat template clearly.
🖥️Ollama
Local Serving
Wraps llama.cpp with a clean REST API and model management. Ideal for developer testing. OpenAI-compatible endpoint.
🔥vLLM
Production Serving
PagedAttention for efficient KV cache management. 10–20× throughput improvement over HuggingFace generate(). Production-grade.
🐳Docker Container
Packaging
Package the model and serving stack with Dockerfile + docker-compose. Makes it runnable on any machine.
📊Monitoring
Observability
Requests/sec, latency (p50/p95/p99), TTFT (time-to-first-token), tokens/sec, GPU utilization. Track with Prometheus + Grafana.
Bash — Full Local Stack Startup
# 1. Convert to GGUF (from HuggingFace format)
python llama.cpp/convert_hf_to_gguf.py ./my-model-dpo \
  --outfile ./my-model-7b-f16.gguf --outtype f16

# 2. Quantize to Q4_K_M
./llama.cpp/quantize my-model-7b-f16.gguf \
  my-model-7b-q4_k_m.gguf Q4_K_M

# 3. Serve via Ollama
ollama create my-model -f Modelfile
ollama serve   # starts on :11434

# 4. Test with curl (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"my-model","messages":[{"role":"user","content":"Hello!"}]}'
System Prompt Engineering: A well-crafted system prompt is worth 10% quality improvement for free. Define the model's role, capabilities, limitations, and response format explicitly. Test edge cases: refusal behavior, long context, multi-turn consistency.
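To exercise the system prompt and chat template end to end against the OpenAI-compatible endpoint started above, a minimal client sketch with the openai Python package (the model name, prompt text, and placeholder API key are assumptions; Ollama ignores the key):

Python — Chat Client Against the Local Endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

SYSTEM_PROMPT = (
    "You are a concise technical assistant. "
    "If you are unsure, say so instead of guessing."
)

resp = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Explain what a KV cache is in two sentences."},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)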