Fundamentals of LLM (Expanded, Ultra-Detailed Version A → H)
A — CORE FOUNDATIONS (Mathematics, CS, Python, PyTorch)
Why this section exists:
To train any LLM from scratch, you must understand the math, CS, and tooling that every transformer block is built upon. This section ensures you have the bedrock for everything later.
A1. Mathematics Foundations (for LLM Architecture & Training)
(1) Linear Algebra (Critical for Attention & Transformers)
- Scalars, vectors, matrices, tensors
- Dot products (used in Q·K attention)
- Matrix multiplication (core operation)
- Norms (L2, normalization layers)
- Orthogonality & projections
- Eigenvalues & stability (rare but helpful for optimization insight)
(2) Calculus (Optimization & Backprop)
- Derivatives (basic + vectorized)
- Chain rule
- Jacobians (conceptual)
- Gradients of loss functions
- Activation derivatives
(3) Probability & Statistics (For NLL, Cross-Entropy, Sampling)
- Probability distributions
- Softmax workings
- Cross-entropy & KL divergence
- Expectation & variance
- Sampling from categorical distributions
- Temperature scaling
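The softmax, cross-entropy, temperature, and sampling items above fit together in a few lines. The following is a minimal pure-Python sketch (no PyTorch), with illustrative function names and made-up logits; real training code would use batched tensor operations instead.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target_index])

def sample_categorical(probs, rng):
    """Draw one index from a categorical distribution via the cumulative sum."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

logits = [2.0, 1.0, 0.1]           # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
loss = cross_entropy(logits, target_index=0)

rng = random.Random(0)             # fixed seed so runs are repeatable
samples = [sample_categorical(probs, rng) for _ in range(1000)]
```

Lower temperature concentrates probability on the top token, which is why low-temperature sampling looks closer to greedy decoding.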
(4) Discrete Math / Algorithms
- Asymptotic complexity (why attention is O(n²))
- Graph thinking (attention maps)
- Hashing (tokenization, vocab compression)
- Trie structures (used in tokenizers)
A2. Computer Science Foundations
- Memory layout (GPU & CPU differences)
- GPU vs CPU compute
- CUDA basics
- Parallelism vs vectorization
- FP16/BF16/FP8 numeric behavior
- CPU cache, batching, locality
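The FP16/BF16 item above can be explored without a GPU. This is a rough sketch, assuming Python's `struct` module for half-precision rounding and a truncation-based emulation of BF16 (real hardware usually rounds to nearest, so treat `to_bf16` as an approximation). The helper names are illustrative.

```python
import struct

def to_fp16(x):
    """Round a float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_overflows(x):
    """True if x is too large in magnitude to be packed as FP16."""
    try:
        struct.pack("<e", x)
        return False
    except (OverflowError, struct.error):
        return True

def to_bf16(x):
    """Emulate BF16 by keeping the top 16 bits of the float32 encoding.

    This truncates the mantissa; hardware typically rounds to nearest instead.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# FP16: more mantissa bits (finer steps), but a small exponent range.
fp16_small = to_fp16(1.001)       # survives with ~3 decimal digits
fp16_big = fp16_overflows(70000.0)

# BF16: float32's exponent range, but only 7 explicit mantissa bits.
bf16_small = to_bf16(1.001)       # collapses to 1.0
bf16_big = to_bf16(70000.0)       # representable, but coarsely
```

The trade-off shown here is the usual reason BF16 needs no loss scaling while FP16 training often does: FP16 overflows early, BF16 loses fine detail instead.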
A3. Python Foundations
- Python OOP
- Decorators
- Typing (Type hints)
- Context managers
- Generators (for data streaming)
- Error handling
- Performance tips (NumPy vectorization, avoiding Python loops)
A4. PyTorch Deep Foundations
- Tensors
- Autograd
- nn.Module
- Optimizers
- DataLoader
- Dataset class
- Training loop
- Mixed-precision training (torch.autocast)
- Distributed training basics (torch.distributed)
B — LLM ARCHITECTURE FOUNDATIONS (Tokenizers → Transformers)
Why this section exists:
This is how models think. You cannot build or modify a model without mastering every detail here.
B1. Tokenization
- Why tokenization matters
- Subword tokenization
- Character-level vs WordPiece vs BPE
- SentencePiece (a tokenizer library; supports Unigram LM and BPE)
- Byte-level BPE (GPT-2, Llama 3) and byte fallback (SentencePiece-based LLaMA 1/2)
- Vocab size tradeoffs
- Special tokens (BOS, EOS, PAD, UNK)
- Tokenizer training process
- Token merging strategies
- Tokenization for multilingual models
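To make the tokenizer training process and merge strategy concrete, here is a toy BPE trainer. It is a sketch under simplifying assumptions: whitespace pre-tokenization, no end-of-word marker, no byte-level handling, and a deterministic tie-break chosen only so the example is reproducible. Function names are illustrative.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules, starting from single characters."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, segmented = train_bpe("low low low lower lowest", num_merges=2)
print(merges)     # learned merge rules, in order
print(segmented)  # each word as a tuple of current symbols
```

After two merges the frequent string "low" has become a single symbol, while the rarer suffixes stay split into characters; this is the vocab size tradeoff in miniature.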
B2. Input/Output Embeddings
- Token embeddings
- Learned positional embeddings
- RoPE (rotary embedding mathematics)
- ALiBi
- Relative position encodings
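The key property behind RoPE is that the query-key dot product depends only on the relative offset between positions. The sketch below checks this numerically. It assumes the interleaved (even, odd) pairing and a base of 10000; some implementations pair dimensions differently, and the vectors here are arbitrary.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive (even, odd) pair of `vec` by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # arbitrary query vector
k = [1.0, 0.4, -0.7, 0.2]   # arbitrary key vector

# Same relative offset (3) at two very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Both scores agree up to floating-point error, which is what lets RoPE encode relative position without a learned position table.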
B3. Attention Mechanisms
Fundamental Attention Concepts
- Q (query), K (key), V (value)
- Dot-product attention
- Scaled attention (why scores are divided by √d_k, the key dimension)
- Multi-head attention
- Self-attention vs cross-attention
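The Q/K/V and scaling items above can be sketched as single-head scaled dot-product attention in plain Python. This is illustrative only: no batching, no learned projections, and the toy matrices are made up. The optional `causal` flag shows the masking used in decoder-only LMs.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Return (outputs, weights) for single-head scaled dot-product attention."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Similarity of this query with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, w = attention(Q, K, V)
out_causal, w_causal = attention(Q, K, V, causal=True)
```

The division by sqrt(d_k) keeps the variance of the scores roughly independent of the head dimension, so the softmax does not saturate as d_k grows.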
Advanced Variants
- FlashAttention (memory optimized)
- GQA (Grouped Query Attention)
- Multi-Query Attention (MQA)
- MLA (Multi-head Latent Attention; introduced in DeepSeek-V2, also used in DeepSeek-V3)
- Linear attention variants
- Attention-free models (Mamba, Hyena) (intro only)
B4. Transformer Block Internals
- LayerNorm vs RMSNorm
- Residual connections
- SwiGLU feedforward
- Dropout
- Attention + MLP ordering
- Pre-norm vs Post-norm
- Transformer depth vs width tradeoffs
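For the LayerNorm vs RMSNorm and pre-norm items above, a small sketch helps show the difference. It omits the learned gain and bias parameters for brevity, and `pre_norm_residual` takes a stand-in sublayer function rather than a real attention or MLP module.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root-mean-square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_norm_residual(x, sublayer):
    """Pre-norm ordering: normalize first, apply the sublayer, then add the residual."""
    y = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)
# A stand-in sublayer that just halves its input.
block_out = pre_norm_residual(x, lambda h: [0.5 * v for v in h])
```

RMSNorm skips the mean computation, which is one reason many recent decoder-only models prefer it.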
B5. Training Stability
- Initialization strategies
- Learning rate warmup
- Cosine annealing
- Gradient clipping
- Mixed precision (BF16/FP16)
- Logits scaling
- Loss spikes & cures
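Warmup, cosine annealing, and gradient clipping are easy to state precisely in code. The sketch below uses linear warmup followed by cosine decay and clipping by global L2 norm; the step counts and learning rates are made-up values, and the gradient is a flat list rather than real parameter tensors.

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient if its L2 norm exceeds max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads), total
    scale = max_norm / total
    return [g * scale for g in grads], total

schedule = [lr_at(s, max_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(101)]
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Clipping preserves the gradient direction and only shrinks its length, which is why it is a common first response to loss spikes.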
C — PRETRAINING FOUNDATIONS (Data → Objective → Scaling Laws)
Why this section exists:
Pretraining shapes the LM’s general intelligence before reasoning, alignment, or fine-tuning.
C1. Pretraining Objective
- Next-token prediction
- Why it’s enough for complex reasoning
- Masked language modeling (BERT-style): know how it differs from causal next-token prediction
C2. Data Engineering
- Dataset creation pipeline
- Document boundaries
- Packing sequences
- Deduplication methods
- Internet-scale text sources
- Tokenization throughput optimization
- Data mixture strategies (code, math, safety data)
- FineWeb, Pile, SlimPajama
C3. Scaling Laws
- Neural scaling theory
- Compute-optimal scaling (Chinchilla)
- Parameters vs training tokens vs FLOPs
- Optimal batch size
- Dimensionality choices
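The parameters vs tokens vs FLOPs relationship can be worked through with two common approximations: training compute C ≈ 6·N·D for N parameters and D tokens, and the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. Both are approximations rather than exact laws, and the function names below are illustrative.

```python
import math

def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) under a fixed tokens-per-param ratio.

    With D = r * N and C = 6 * N * D, we get N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# A 70B-parameter model trained on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal(budget)
print(f"budget ~ {budget:.2e} FLOPs, N ~ {n_opt:.2e}, D ~ {d_opt:.2e}")
```

The same two functions are handy for sizing a tiny LM: plug in the FLOPs your hardware can deliver and read off a rough parameter and token target.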
C4. Multi-token Prediction (DeepSeek-V3 style)
- Why multi-token prediction can improve sample efficiency (denser training signal per position)
- Architecture modifications needed
- Performance vs stability tradeoffs
C5. Distributed Training
- Data parallel
- Tensor parallel
- Pipeline parallel
- FSDP
- ZeRO stages
- Sharded optimizers
- GPU memory fragmentation effects
D — POST-TRAINING FOUNDATIONS (Reasoning, SFT, RLHF, Distillation)
Why this section exists:
This step teaches models how to behave, not just predict text.
D1. SFT (Supervised Fine-tuning)
- Task-specific fine-tuning
- Instruction tuning
- Formatting data correctly
- Preference between short/long answers
- CoT (Chain of Thought) SFT
D2. Preference Optimization
- DPO (Direct Preference Optimization)
- ORPO (Odds ratio preference optimization)
- RRHF
- Stable preference modeling
- Pros/cons vs PPO
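The DPO objective for one preference pair is short enough to write out. The sketch below takes summed sequence log-probabilities as plain floats (made-up numbers) for the policy and a frozen reference model; `beta` controls how strongly the policy is pushed away from the reference.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]))
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: no preference signal yet.
loss_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raised the chosen answer and lowered the rejected one.
loss_better = dpo_loss(-4.0, -8.0, -5.0, -7.0)
# Policy moved in the wrong direction.
loss_worse = dpo_loss(-6.0, -6.0, -5.0, -7.0)
```

Unlike PPO, no reward model or sampling loop is needed at training time; the preference pairs and a reference model are enough.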
D3. RLHF
- Reward modeling
- PPO
- GRPO (DeepSeek's PPO variant that drops the value model and uses group-relative advantages)
- Advantage shaping
- Sampling strategies
- KL penalty tuning
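The core idea of GRPO's advantage estimate can be sketched as follows: sample several responses for the same prompt, score them, and normalize each reward against its own group instead of a learned value model. The rewards here are made up, and the choice of population standard deviation and the `eps` constant are assumptions; implementations differ on these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt; 1.0 = verifier accepted, 0.0 = rejected.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, the group carries no learning signal.
no_signal = group_relative_advantages([1.0, 1.0, 1.0])
```

This is why prompts that are always solved or never solved contribute nothing under this scheme, which in turn affects which prompts are worth sampling.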
D4. Distillation
- Sequence-level distillation
- Logit distillation
- Hidden-state distillation
- Chain-of-thought distillation
- Verifier-guided distillation
- Distilling from test-time compute (DeepSeek-R1 style)
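Of the techniques listed, logit distillation has the most compact formula: the student matches the teacher's temperature-softened distribution under a KL divergence, conventionally scaled by T². The sketch below handles one token position with made-up logits; the function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)   # soft targets
    q = softmax(student_logits, temperature)   # student prediction
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
loss_match = distillation_loss(teacher, teacher)
loss_close = distillation_loss([3.5, 1.2, 0.4], teacher)
loss_far = distillation_loss([0.5, 1.0, 4.0], teacher)
```

Sequence-level and chain-of-thought distillation skip this loss entirely and train the student with ordinary cross-entropy on teacher-generated text.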
E — DISTILLATION + EVALUATION FOUNDATIONS
Why this section exists:
You want to build small-but-smart models; distillation is vital.
E1. Distillation Goals
- Reduce hallucination
- Compress reasoning
- Transfer knowledge
- Improve inference speed
E2. Distillation Techniques
- Behavior cloning
- Student-teacher pipelines
- Offline vs online distillation
- Reinforced distillation
- CoT to short-form distillation
E3. Evaluation
- Perplexity
- Accuracy benchmarks (GSM8K, MATH)
- Reasoning faithfulness
- Long context evaluation
- Compression ratio vs accuracy tradeoffs
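Perplexity and benchmark accuracy are both simple to compute once you have per-token log-probabilities and final answers. This is a minimal sketch; the log-probabilities and answers are made up, and exact string match is only one possible scoring rule for benchmarks such as GSM8K.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference after trimming whitespace."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# A model that is uniformly unsure among 4 tokens at every step has perplexity 4.
ppl_uniform = perplexity([math.log(0.25)] * 10)
# A model that assigns probability 1 to every correct token has perplexity 1.
ppl_perfect = perplexity([0.0] * 10)

acc = exact_match_accuracy(["42", " 7", "13"], ["42", "7", "31"])
```

Perplexity is only comparable between models that share a tokenizer, since it is normalized per token.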
F — MULTIMODAL FOUNDATIONS (Vision, Audio, Video, Alignment)
Why this section exists:
Next-gen models are almost all multimodal.
F1. Vision Models
- ViT
- CLIP
- SigLIP
- Q-Former (BLIP-2)
- Image tokenization
F2. Audio Models
- Whisper architecture
- Mel spectrograms
- CTC loss
F3. Video Models
- Spatio-temporal attention
- Frame embeddings
F4. Multimodal Fusion
- Early fusion
- Late fusion
- Projector layers
- Visual grounding
- Multimodal CoT
G — FUTURE DIRECTIONS & ACTIVE RESEARCH AREAS
Why this section exists:
You want DeepSeek-level competency — meaning you must know where the field is evolving.
G1. Architecture Evolution
- MLA (DeepSeek)
- SSM and recurrent hybrids (Mamba, RWKV)
- Efficient attention
- MoE 2.0
G2. Pretraining Evolution
- Multi-token prediction
- Synthetic data scaling
- Large-scale distillation
- Self-correction in pretraining
G3. Post-training Evolution
- GRPO as a critic-free alternative to PPO
- Self-play for reasoning
- Verifier models
G4. Reasoning Evolution
- Best-of-N reasoning
- Self-verification
- Planner–worker–critic loops
- Test-time compute scaling
G5. Hallucination Reduction
- Retrieval-augmented pretraining (e.g., RETRO, RPT)
- Multimodal grounding
- Verifier-guided decoding
G6. Agentic AI
- Tool use
- Planning
- Memory modules
- Reflection
- Multi-agent coordination
G7. System-Level Trends
- On-device LLMs
- Edge + Cloud hybrid
- Efficient quantization (AWQ, GPTQ) and quantized model formats (GGUF)
G8. Self-Improving Loops
- Auto-data generation
- Auto-distillation
- Auto-evaluation
- AI-assisted training pipelines
H — DATA & DATASET FOUNDATIONS (REAL + SYNTHETIC)
Why this section exists:
Even a perfectly implemented transformer is useless without good data. The real advantage of labs such as DeepSeek, OpenAI, and Anthropic lies in:
- What data they use
- How they clean, mix, and label it
- How they generate, filter, and distill synthetic data
Phase 1 = foundations: you won’t build a web-scale pipeline yet, but you’ll learn the full structure on a small scale so you can scale it up later.
H1. Types of Data You Need Across the LLM Lifecycle
H1.1 Pretraining Data (Base LM)
Goal: Teach the model general language + code + basic world knowledge.
- Unlabeled, large-scale text: Web pages, books, Wikipedia, code, forums.
- Key properties: Diversity, clean text, balanced mixture.
H1.2 Supervised Fine-Tuning (SFT) Data
Goal: Teach the model how to act as an assistant or specific tool.
- Labeled (input → output) pairs: Instruction → answer, Question → reasoning → answer.
- Typically higher-quality, smaller volume than pretraining data.
H1.3 Preference / RLHF / Reward Modeling Data
Goal: Teach the model which outputs are better.
- Pairs: (prompt, response_A, response_B, preference).
- Used for: Preference modeling, RL (PPO/GRPO), DPO/ORPO.
H1.4 Evaluation Data
Goal: Measure how good the model actually is.
- Held-out sets for Math, Coding, Reasoning.
- Must never be used in training or synthetic generation.
H1.5 Synthetic Data
Cross-cuts all of the above:
- Synthetic pretraining data (Teacher model generates corpora).
- Synthetic SFT data (Teacher answers tasks).
- Synthetic preference data (AI feedback - RLAIF).
H2. Collecting Raw Real Data
H2.1 Sources (for small-scale experiments)
- Public text datasets (Wikipedia, BookCorpus).
- Code datasets (GitHub subset).
- Domain data (Exported docs).
H2.2 Ways to Collect
- Direct downloads (preferred).
- APIs (Wikipedia, Hacker News).
- Web scraping (Respect robots.txt).
H3. Cleaning & Filtering (Quality, Safety, Deduplication)
H3.1 Basic Cleaning
- Remove HTML/markup.
- Normalize whitespace and Unicode.
- Strip boilerplate.
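The first two cleaning steps can be sketched with the standard library alone. This is a rough regex-based approach that assumes reasonably simple HTML; a real pipeline would likely use a proper HTML parser, and boilerplate stripping (navigation, footers, cookie banners) is not covered here. `clean_text` is an illustrative name.

```python
import html
import re
import unicodedata

SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)\b.*?</\1>")
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")

def clean_text(raw):
    """Strip markup, decode entities, normalize Unicode, and collapse whitespace."""
    text = SCRIPT_STYLE_RE.sub(" ", raw)        # drop script/style blocks with their content
    text = TAG_RE.sub(" ", text)                # drop remaining tags, keep their text
    text = html.unescape(text)                  # &amp; -> &, &nbsp; -> non-breaking space
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    return WHITESPACE_RE.sub(" ", text).strip()

raw_page = "<p>Hello&nbsp;&amp;  <b>world</b></p>\n<script>x=1</script>"
cleaned = clean_text(raw_page)
print(cleaned)
```

Tags are removed before entities are decoded so that escaped text such as `&lt;b&gt;` in the page body is not mistaken for markup.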
H3.2 Language Detection
- Filter to languages your tokenizer/model will handle.
H3.3 Deduplication
- Exact dedupe (Hash full documents).
- Near-duplicate detection (MinHash).
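Both deduplication steps can be tried on a handful of documents. The sketch below uses SHA-256 for exact matches and a small MinHash signature over word 3-grams for near-duplicates. The shingle size, the 64 hash functions, and the seeded-SHA-1 hashing trick are assumptions chosen for readability; production pipelines usually add LSH banding so that not every pair of documents is compared.

```python
import hashlib

def exact_dedupe(docs):
    """Keep the first copy of each document, comparing full-text hashes."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in sh
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "transformers use attention to mix information across token positions"

unique_docs = exact_dedupe([doc_a, doc_b, doc_a, doc_c])
sim_ab = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
sim_ac = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
```

Exact hashing misses `doc_b`, which differs from `doc_a` by one word; the MinHash estimate is what flags it as a near-duplicate.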
H3.4 Safety & PII Filtering
- Remove emails, phone numbers, keys.
- Remove explicitly harmful content.
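For small experiments, a regex pass is a reasonable first approximation of PII removal. The patterns below are deliberately simple assumptions: they will miss many real-world formats and can produce false positives, and they do not attempt to detect API keys or harmful content. The placeholder tokens are illustrative.

```python
import re

# Simplified patterns; real PII detection needs broader coverage and review.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

sample = "Mail jane.doe@example.com or call +1 (555) 123-4567."
redacted = redact_pii(sample)
print(redacted)
```

Replacing with placeholders rather than deleting keeps sentences grammatical, which matters when the text is later used for language modeling.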
H4. Building Pretraining Datasets (for Your Tiny LM)
H4.1 Decide the “World” Your Tiny LM Will Live In
- Pick a narrow-ish domain (e.g., Story world, Tech articles).
H4.2 Document Segmentation
- Split into paragraphs or sections.
- Remove very short/long docs.
H4.3 Tokenization & Sequence Packing
- Tokenize all text.
- Concatenate token streams.
- Split into fixed-length sequences.
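The three packing steps above can be sketched as follows. The token IDs are made up, `eos_id` is an assumed separator token marking document boundaries, and the trailing partial sequence is simply dropped; padding it is the other common choice.

```python
def pack_sequences(token_docs, seq_len, eos_id):
    """Concatenate tokenized documents with EOS separators, then cut fixed-length chunks."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // seq_len
    # Drop the final partial chunk so every sequence has exactly seq_len tokens.
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

tokenized_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # hypothetical token IDs
packed = pack_sequences(tokenized_docs, seq_len=4, eos_id=0)
print(packed)
```

Note that the second chunk mixes the end of one document with the start of the next; whether attention should be masked across that boundary is a design decision for the training code.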
H4.4 Dataset Statistics & Sanity Checks
- Compute total tokens, seq length distribution.
- Check for content dominance or unsafe strings.
H5. Building Fine-Tuning (SFT) Datasets
H5.1 Define the Task Schema
- Pick 1–2 concrete task types (Math reasoning, Q&A).
- Define a JSON-like schema.
H5.2 Sourcing SFT Data
- Real: Public Q&A datasets.
- Synthetic: Use a strong teacher model (e.g., GPT-4) to generate problems and solutions.
H5.3 Formatting for Training
- Create a canonical prompt format (Instruction, User, Assistant).
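A JSON-like schema and a canonical prompt format might look like the sketch below. The field names (`instruction`, `input`, `output`) and the `### ...` section markers are hypothetical choices, not a standard; what matters is that every example is rendered the same way at training and inference time.

```python
import json

def format_example(example):
    """Render one SFT record into a single training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### User:\n{example['input']}\n\n"
        f"### Assistant:\n{example['output']}"
    )

def format_prompt(example):
    """Same template without the answer, as used at inference time."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### User:\n{example['input']}\n\n"
        f"### Assistant:\n"
    )

record = {
    "instruction": "Solve the problem and state the final answer.",
    "input": "What is 6 * 7?",
    "output": "6 * 7 = 42. Final answer: 42",
}

line = json.dumps(record)               # one record per line (JSONL storage)
restored = json.loads(line)
training_text = format_example(restored)
prompt_text = format_prompt(restored)
```

Keeping the prompt as an exact prefix of the training text makes it straightforward to mask the loss so that only the assistant portion is trained on.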
H5.4 Quality Filtering for SFT
- Remove trivial or nonsensical examples.
- Enforce clear reasoning and explicit final answers.
H6. Building Preference / RL / Reward Data
H6.1 Pairwise Preference Data
- Two responses (A and B) for each prompt.
- Label: A > B, B > A, or Tie.
H6.2 Rating Data
- Single response + scalar score or rubric-based labels.
H7. Synthetic Data Pipelines (Teacher → Student)
H7.1 Types of Synthetic Data
- Pretraining-style, SFT-style, CoT-style, Preference data.
H7.2 Synthetic Data Loop
- Pick target domain.
- Write meta-prompts for teacher.
- Generate data.
- Filter.
- Store.
H7.3 Teacher-Student Distillation Data
- Teacher generates CoT reasoning + final answer.
- Student trains to reproduce CoT or final answer.
H8. Dataset Versioning, Metadata & Governance
H8.1 Versioning
- Give each dataset a version (e.g., pretrain_corpus_v1).
- Track source, date, preprocessing, tokenizer.
H8.2 Metadata & Data Cards
- Note intended use, domains, biases, safety notes.
H8.3 Legal/Ethical Basics
- Respect licenses, ToS, privacy.
H9. Practical Mini-Pipeline for Your Colab Tiny LLM
H9.1 Pretraining Data
- Pick domain -> Collect 1-10MB -> Clean -> Tokenize -> Pack -> Split -> Store.
H9.2 SFT Data
- Define task -> Use teacher model -> Format -> Filter -> Store.
H9.3 Evaluation Data
- Hold out 20-50 problems -> Save canonical answer -> Use for manual eval.
H9.4 Synthetic Preference Data
- Select prompts (kept separate from the held-out eval set) -> Produce multiple answers -> Manually label -> Store.