Pretraining is where your model learns language. Trained on massive text corpora to predict the next token, it learns grammar, facts, reasoning patterns, and world knowledge. This is the most compute-intensive phase.
// Training Hyperparameters
📈Learning Rate Schedule
Optimization
Cosine annealing + warmup. Peak LR: 3e-4 (1B model). Warmup: 1-2% of total steps. Too high LR → loss spike.
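For reference, a minimal sketch of that schedule: linear warmup to a peak LR followed by cosine decay. The `peak_lr`, `warmup_frac`, and `min_lr` defaults here are illustrative, not prescriptive.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Warmup: ramp linearly from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```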
📦Batch Size
Efficiency
Large effective batch size via gradient accumulation. Target: 0.5M–4M tokens/batch. Small batch → noisy gradients.
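A common way to reach that effective batch size is gradient accumulation. The sketch below uses a toy model and a stand-in loss so it runs as-is; `accum_steps` is an illustrative value, and a real run would use the LM cross-entropy loss from your training framework.

```python
import torch
from torch import nn

# Toy setup so the loop is runnable; in practice these come from your framework.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
train_loader = [torch.randn(4, 16) for _ in range(128)]   # stand-in micro-batches

accum_steps = 32  # illustrative; choose so effective batch lands in 0.5M-4M tokens

optimizer.zero_grad()
for i, x in enumerate(train_loader):
    loss = model(x).pow(2).mean()         # stand-in loss; a real run uses LM cross-entropy
    (loss / accum_steps).backward()       # scale so accumulated gradients average correctly
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```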
📉Loss Tracking
Monitoring
Track training loss every step and validation loss at regular step intervals; pretraining rarely runs for multiple epochs, so per-epoch checks are too coarse. If val loss starts rising, overfitting has begun.
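A minimal evaluation helper, assuming batches are dicts of tensors and a HuggingFace-style forward pass that returns an object with a `.loss` attribute:

```python
import math
import torch

@torch.no_grad()
def eval_val_loss(model, val_loader, device="cpu"):
    """Average validation loss and perplexity over val_loader."""
    model.eval()
    total, n = 0.0, 0
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        total += model(**batch).loss.item()
        n += 1
    model.train()
    avg = total / max(1, n)
    return avg, math.exp(avg)   # (val loss, val perplexity)
```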
🏆Best Checkpoint
Model Selection
Select checkpoint by val loss. Use the one with lowest val perplexity — not necessarily the final checkpoint.
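A sketch of that selection loop; `train_some_steps` and `save_checkpoint` are hypothetical helpers standing in for your actual training and checkpointing code, and `eval_val_loss` is the helper sketched above.

```python
# Keep whichever checkpoint has the lowest validation perplexity so far.
best_ppl = float("inf")
best_step = None

for step in range(eval_interval, total_steps + 1, eval_interval):
    train_some_steps(model, optimizer, eval_interval)        # hypothetical helper
    val_loss, val_ppl = eval_val_loss(model, val_loader)
    if val_ppl < best_ppl:
        best_ppl, best_step = val_ppl, step
        save_checkpoint(model, optimizer, step, tag="best")   # hypothetical helper

print(f"best checkpoint: step {best_step}, val perplexity {best_ppl:.2f}")
```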
⚠
Chinchilla Scaling Law: Compute-optimal training for an N-parameter model uses roughly 20×N tokens, so a 1B model should see about 20B tokens. Fewer tokens → the model is undertrained relative to its capacity.
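As a quick worked example of the 20-tokens-per-parameter rule of thumb:

```python
def chinchilla_tokens(n_params: int) -> int:
    """Compute-optimal token budget under the ~20 tokens/parameter rule of thumb."""
    return 20 * n_params

print(chinchilla_tokens(1_000_000_000))   # 1B params  -> 20,000,000,000 tokens
print(chinchilla_tokens(7_000_000_000))   # 7B params  -> 140,000,000,000 tokens
```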
☠
Loss Spike: Sudden loss increases are common. Causes: bad data batch, LR too high, NaN in gradients. Log gradient norm, roll back to previous checkpoint on spike.
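One possible shape for that monitoring logic, assuming a `load_checkpoint` helper and a spike threshold of your choosing (both illustrative, not a fixed recipe):

```python
import math
import torch

spike_factor = 3.0        # illustrative threshold for "sudden" loss increase
last_good_loss = None

for step, batch in enumerate(train_loader):
    loss = model(**batch).loss
    loss.backward()
    # Clip and log the gradient norm every step so spikes are visible early.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    print(f"step={step} loss={loss.item():.3f} grad_norm={grad_norm.item():.3f}")

    spiked = not math.isfinite(loss.item()) or (
        last_good_loss is not None and loss.item() > spike_factor * last_good_loss
    )
    if spiked:
        # Spike or NaN: discard this update and restore the last good checkpoint.
        optimizer.zero_grad()
        load_checkpoint(model, optimizer, tag="last_good")   # hypothetical helper
        continue

    optimizer.step()
    optimizer.zero_grad()
    last_good_loss = loss.item()
```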