Fundamentals of LLM (Expanded, Ultra-Detailed Version A → H)

A — CORE FOUNDATIONS (Mathematics, CS, Python, PyTorch)

Why this section exists:

To train any LLM from scratch, you must understand the math, CS, and tooling that every transformer block is built upon. This section ensures you have the bedrock for everything later.

A1. Mathematics Foundations (for LLM Architecture & Training)

(1) Linear Algebra (Critical for Attention & Transformers)

Scalars, vectors, matrices, tensors
Dot products (used in Q·K attention)
Matrix multiplication (core operation)
Norms (L2, normalization layers)
Orthogonality & projections
Eigenvalues & stability (rare but helpful for optimization insight)

(2) Calculus (Optimization & Backprop)

Derivatives (basic + vectorized)
Chain rule
Jacobians (conceptual)
Gradients of loss functions
Activation derivatives
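
A minimal sketch of the chain rule in action, assuming a hypothetical one-weight model with a sigmoid activation and squared-error loss (none of these choices come from the outline itself). It compares the hand-derived gradient against a finite-difference estimate, which is a common way to sanity-check backprop code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Squared error of a one-weight model: (sigmoid(w * x) - y)^2."""
    return (sigmoid(w * x) - y) ** 2

def grad_w(w, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x)
    dL_da = 2.0 * (a - y)      # derivative of the squared error
    da_dz = a * (1.0 - a)      # derivative of the sigmoid activation
    dz_dw = x                  # derivative of z = w * x
    return dL_da * da_dz * dz_dw

def numeric_grad_w(w, x, y, h=1e-6):
    """Central finite difference, used only as a check."""
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2.0 * h)

analytic = grad_w(0.5, 2.0, 1.0)
numeric = numeric_grad_w(0.5, 2.0, 1.0)
```

The two values should agree to several decimal places; a large gap usually means one factor of the chain was dropped.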

(3) Probability & Statistics (For NLL, Cross-Entropy, Sampling)

Probability distributions
Softmax workings
Cross-entropy & KL divergence
Expectation & variance
Sampling from categorical distributions
Temperature scaling
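
The softmax, cross-entropy, sampling, and temperature items above can be sketched together in plain Python. The three-token vocabulary and its logits are made-up values for illustration only.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    return -math.log(softmax(logits)[target_index])

def sample(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                     # hypothetical scores for 3 tokens
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)     # more mass on the top token
flat = softmax(logits, temperature=2.0)      # closer to uniform
loss = cross_entropy(logits, target_index=0)
token = sample(probs, random.Random(0))
```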

(4) Discrete Math / Algorithms

Asymptotic complexity (why attention is O(n²))
Graph thinking (attention maps)
Hashing (tokenization, vocab compression)
Trie structures (used in tokenizers)

A2. Computer Science Foundations

Memory layout (GPU & CPU differences)
GPU vs CPU compute
CUDA basics
Parallelism vs vectorization
FP16/BF16/FP8 numeric behavior
CPU cache, batching, locality

A3. Python Foundations

Python OOP
Decorators
Typing (Type hints)
Context managers
Generators (for data streaming)
Error handling
Performance tips (NumPy vectorization, avoiding Python loops)

A4. PyTorch Deep Foundations

Tensors
Autograd
nn.Module
Optimizers
DataLoader
Dataset class
Training loop
Mixed-precision training (torch.autocast)
Distributed training basics (torch.distributed)

B — LLM ARCHITECTURE FOUNDATIONS (Tokenizers → Transformers)

Why this section exists:

This is how models think. You cannot build or modify a model without mastering every detail here.

B1. Tokenization

Why tokenization matters
Subword tokenization
Character-level vs WordPiece vs BPE
SentencePiece (unigram and BPE models)
Byte-level BPE (GPT-2, LLaMA 3) and byte fallback (LLaMA 1/2)
Vocab size tradeoffs
Special tokens (BOS, EOS, PAD, UNK)
Tokenizer training process
Token merging strategies
Tokenization for multilingual models
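
To make the tokenizer training process and token merging concrete, here is a rough sketch of a single BPE training step: count adjacent symbol pairs, then merge the most frequent one. The toy word-frequency table is an assumption for illustration; real tokenizers add pre-tokenization, byte handling, and many merge rounds.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() keeps the first pair seen when counts tie
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
pair = most_frequent_pair(words)
words_after = merge_pair(words, pair)
```

Repeating these two calls until the vocabulary reaches a target size is the core of BPE training.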

B2. Input/Output Embeddings

Token embeddings
Learned positional embeddings
RoPE (rotary embedding mathematics)
ALiBi
Relative position encodings
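
A small sketch of the RoPE idea, assuming the adjacent-pair layout (x0, x1), (x2, x3), ... and the commonly used base of 10000; real implementations differ in how they pair dimensions. The point it illustrates is that the query-key dot product depends only on the relative offset between positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]    # hypothetical query vector
k = [0.5, -1.0, 2.0, 0.0]   # hypothetical key vector

# Same offset (query is 2 positions after the key), different absolute positions
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

Both scores come out the same up to floating-point error, which is why RoPE behaves like a relative position encoding.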

B3. Attention Mechanisms

Fundamental Attention Concepts

Q (query), K (key), V (value)
Dot-product attention
Scaled attention (why scale by √d)
Multi-head attention
Self-attention vs cross-attention
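
The Q/K/V, dot-product, and √d scaling items above amount to softmax(QKᵀ / √d) V. Below is a minimal single-head sketch in plain Python with an optional causal mask; the tiny matrices are made-up inputs, and a real model would use batched tensor operations instead of loops.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Scale by sqrt(d) so score variance does not grow with dimension
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            # Position i may only attend to positions j <= i
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 tokens, d = 2 (hypothetical)
out, attn = attention(X, X, X, causal=True)
```

With the causal mask on, the first token can only attend to itself, so its output equals its own value vector.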

Advanced Variants

FlashAttention (memory optimized)
GQA (Grouped Query Attention)
Multi-Query Attention (MQA)
MLA (Multi-head Latent Attention; introduced in DeepSeek-V2, used in DeepSeek-V3)
Linear attention variants
Attention-free models (Mamba, Hyena) (intro only)

B4. Transformer Block Internals

LayerNorm vs RMSNorm
Residual connections
SwiGLU feedforward
Dropout
Attention + MLP ordering
Pre-norm vs Post-norm
Transformer depth vs width tradeoffs
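
For the LayerNorm vs RMSNorm item, a stripped-down comparison may help. This sketch omits the learned gain and bias parameters that real layers carry, and the epsilon value is an arbitrary choice.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root-mean-square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]   # hypothetical activations
ln = layer_norm(x)         # zero mean, roughly unit variance
rn = rms_norm(x)           # roughly unit RMS, mean is not removed
```

RMSNorm skips the mean computation, which makes it slightly cheaper while behaving similarly in practice.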

B5. Training Stability

Initialization strategies
Learning rate warmup
Cosine annealing
Gradient clipping
Mixed precision (BF16/FP16)
Logits scaling
Loss spikes & cures
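
Two of the stability tools above, warmup followed by cosine annealing and gradient clipping by global norm, fit in a few lines. The learning rates, step counts, and function names here are placeholder assumptions, not recommended settings.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]

schedule = [lr_at(s) for s in (0, 50, 99, 100, 500, 1000)]
clipped = clip_by_global_norm([3.0, 4.0])      # norm 5 is scaled down to about 1
untouched = clip_by_global_norm([0.3, 0.4])    # norm 0.5 is left as-is
```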

C — PRETRAINING FOUNDATIONS (Data → Objective → Scaling Laws)

Why this section exists:

Pretraining shapes the LM’s general intelligence before reasoning, alignment, or fine-tuning.

C1. Pretraining Objective

Next-token prediction
Why it can be enough for complex reasoning to emerge
Masking (BERT style) — know differences

C2. Data Engineering

Dataset creation pipeline
Document boundaries
Packing sequences
Deduplication methods
Internet-scale text sources
Tokenization throughput optimization
Data mixture strategies (code, math, safety)
FineWeb, Pile, SlimPajama

C3. Scaling Laws

Neural scaling theory
Compute-optimal scaling (Chinchilla)
Parameters vs training tokens vs FLOPs
Optimal batch size
Dimensionality choices
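
A back-of-the-envelope sketch of the parameters vs tokens vs FLOPs relationship. It assumes two widely quoted approximations: training compute C ≈ 6·N·D for N parameters and D tokens, and a Chinchilla-style ratio of roughly 20 tokens per parameter. Both are rough rules of thumb, not exact laws.

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Solve 6 * N * (r * N) = C for N, then D = r * N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = train_flops(1e9, 20e9)              # a 1B-parameter model on 20B tokens
n_opt, d_opt = compute_optimal(budget)       # recovers roughly (1e9, 2e10)
```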

C4. Multi-token Prediction (DeepSeek-V3 / Gemini style)

Why multi-token prediction accelerates learning
Architecture modifications needed
Performance vs stability tradeoffs

C5. Distributed Training

Data parallel
Tensor parallel
Pipeline parallel
FSDP
ZeRO stages
Sharded optimizers
GPU memory fragmentation effects

D — POST-TRAINING FOUNDATIONS (Reasoning, SFT, RLHF, Distillation)

Why this section exists:

This step teaches models how to behave, not just predict text.

D1. SFT (Supervised Fine-tuning)

Task-specific fine-tuning
Instruction tuning
Formatting data correctly
Preference between short/long answers
CoT (Chain of Thought) SFT

D2. Preference Optimization

DPO (Direct Preference Optimization)
ORPO (Odds ratio preference optimization)
RRHF
Stable preference modeling
Pros/cons vs PPO
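
As a reference point for DPO, here is a sketch of the per-example loss, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The argument names and the beta value are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]) for one pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: margin 0, loss log(2)
baseline = dpo_loss(-11.0, -11.0, -11.0, -11.0)
# Policy prefers the chosen response more than the reference does: lower loss
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Unlike PPO, nothing here requires sampling from the policy or a separate reward model, which is the main practical appeal.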

D3. RLHF

Reward modeling
PPO
GRPO (DeepSeek's critic-free PPO variant with group-relative advantages)
Advantage shaping
Sampling strategies
KL penalty tuning
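
The core of GRPO's advantage estimate can be sketched as normalizing each sampled response's reward against the other responses for the same prompt, in place of a learned value model. The population standard deviation and the epsilon used here are assumptions of this sketch.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each response = (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for 4 sampled answers to one prompt (1 = correct)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# If every answer gets the same reward, there is no learning signal
no_signal = group_relative_advantages([0.5, 0.5])
```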

D4. Distillation

Sequence-level distillation
Logit distillation
Hidden-state distillation
Chain-of-thought distillation
Verifier-guided distillation
Distilling from test-time compute (DeepSeek-R1 style)

E — DISTILLATION + EVALUATION FOUNDATIONS

Why this section exists:

You want to build small-but-smart models; distillation is vital.

E1. Distillation Goals

Reduce hallucination
Compress reasoning
Transfer knowledge
Improve inference speed

E2. Distillation Techniques

Behavior cloning
Student-teacher pipelines
Offline vs online distillation
Reinforced distillation
CoT to short-form distillation

E3. Evaluation

Perplexity
Accuracy benchmarks (GSM8K, MATH)
Reasoning faithfulness
Long context evaluation
Compression ratio vs accuracy tradeoffs
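
Perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have natural-log probabilities for each target token:

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over the evaluated tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every correct token
uniform_over_four = perplexity([math.log(0.25)] * 4)
```

A model that is always uniformly unsure among four options has a perplexity of about 4, which gives the number an intuitive reading as an effective branching factor.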

F — MULTIMODAL FOUNDATIONS (Vision, Audio, Video, Alignment)

Why this section exists:

Next-gen models are almost all multimodal.

F1. Vision Models

ViT
CLIP
SigLIP
Q-Former (BLIP-2)
Image tokenization

F2. Audio Models

Whisper architecture
Mel spectrograms
CTC loss

F3. Video Models

Spatio-temporal attention
Frame embeddings

F4. Multimodal Fusion

Early fusion
Late fusion
Projector layers
Visual grounding
Multimodal CoT

G — FUTURE DIRECTIONS & ACTIVE RESEARCH AREAS

Why this section exists:

You want DeepSeek-level competency — meaning you must know where the field is evolving.

G1. Architecture Evolution

MLA (DeepSeek)
SSM hybrids (Mamba, RWKV)
Efficient attention
MoE 2.0

G2. Pretraining Evolution

Multi-token prediction
Synthetic data scaling
Large-scale distillation
Self-correction in pretraining

G3. Post-training Evolution

GRPO replacing PPO
Self-play for reasoning
Verifier models

G4. Reasoning Evolution

Best-of-N reasoning
Self-verification
Planner–worker–critic loops
Test-time compute scaling

G5. Hallucination Reduction

Retrieval-enhanced pretraining (RPT)
Multimodal grounding
Verifier-guided decoding

G6. Agentic AI

Tool use
Planning
Memory modules
Reflection
Multi-agent coordination

G7. System-Level Trends

On-device LLMs
Edge + Cloud hybrid
Efficient quantization (AWQ, GPTQ) and quantized model formats (GGUF)

G8. Self-Improving Loops

Auto-data generation
Auto-distillation
Auto-evaluation
AI-assisted training pipelines

H — DATA & DATASET FOUNDATIONS (REAL + SYNTHETIC)

Why this section exists:

Even a perfectly implemented transformer is useless without good data. Much of the advantage of labs such as DeepSeek, OpenAI, and Anthropic lies in:

What data they use
How they clean, mix, and label it
How they generate, filter, and distill synthetic data

Phase 1 = foundations: you won’t build a web-scale pipeline yet, but you’ll learn the full structure on a small scale so you can scale it up later.

H1. Types of Data You Need Across the LLM Lifecycle

H1.1 Pretraining Data (Base LM)

Goal: Teach the model general language + code + basic world knowledge.

Unlabeled, large-scale text: Web pages, books, Wikipedia, code, forums.
Key properties: Diversity, clean text, balanced mixture.

H1.2 Supervised Fine-Tuning (SFT) Data

Goal: Teach the model how to act as an assistant or specific tool.

Labeled (input → output) pairs: Instruction → answer, Question → reasoning → answer.
Typically higher-quality, smaller volume than pretraining data.

H1.3 Preference / RLHF / Reward Modeling Data

Goal: Teach the model which outputs are better.

Pairs: (prompt, response_A, response_B, preference).
Used for: Preference modeling, RL (PPO/GRPO), DPO/ORPO.

H1.4 Evaluation Data

Goal: Measure how good the model actually is.

Held-out sets for Math, Coding, Reasoning.
Must never be used in training or synthetic data generation.

H1.5 Synthetic Data

Cross-cuts all of the above:

Synthetic pretraining data (Teacher model generates corpora).
Synthetic SFT data (Teacher answers tasks).
Synthetic preference data (AI feedback - RLAIF).

H2. Collecting Raw Real Data

H2.1 Sources (for small-scale experiments)

Public text datasets (Wikipedia, BookCorpus).
Code datasets (GitHub subset).
Domain data (Exported docs).

H2.2 Ways to Collect

Direct downloads (preferred).
APIs (Wikipedia, Hacker News).
Web scraping (Respect robots.txt).

H3. Cleaning & Filtering (Quality, Safety, Deduplication)

H3.1 Basic Cleaning

Remove HTML/markup.
Normalize whitespace and Unicode.
Strip boilerplate.
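
A rough sketch of the basic cleaning steps using only the standard library. Regex-based tag stripping is a simplification that works for small experiments; a proper HTML parser or a dedicated extraction tool is the safer choice at scale, and boilerplate removal (navigation, footers) is not attempted here.

```python
import html
import re
import unicodedata

def clean_text(raw):
    """Strip markup, decode entities, normalize Unicode and whitespace."""
    # Drop script/style blocks together with their contents
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", raw)
    # Drop any remaining tags
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; and &nbsp;
    text = html.unescape(text)
    # NFKC folds compatibility characters (e.g. non-breaking spaces)
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("<p>Hello&nbsp;&amp; <b>world</b></p>\n\n<script>x=1</script>")
```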

H3.2 Language Detection

Filter to languages your tokenizer/model will handle.

H3.3 Deduplication

Exact dedupe (Hash full documents).
Near-duplicate detection (MinHash).
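
Both deduplication styles can be sketched with the standard library. Exact dedupe hashes whole documents; the MinHash part below is a simplified illustration using word 3-grams and salted BLAKE2 hashes, both of which are choices made for this sketch. Production pipelines typically add locality-sensitive hashing so they do not compare every pair.

```python
import hashlib

def exact_dedupe(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def _hash(shingle, seed):
    h = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")

def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "transformers use attention to mix information across token positions"
sim_near = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
sim_far = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
```

A threshold on the estimated similarity then decides which near-duplicates to drop; the right threshold depends on the corpus.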

H3.4 Safety & PII Filtering

Remove emails, phone numbers, keys.
Remove explicitly harmful content.
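
A deliberately simple sketch of PII redaction for emails and phone numbers. The two patterns are illustrative assumptions that will miss many real-world formats and can produce false positives; API keys and harmful-content filtering need separate handling not shown here.

```python
import re

# Illustrative patterns only; not exhaustive
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

redacted = redact("Mail jane@example.com or call +1 (555) 010-9999.")
```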

H4. Building Pretraining Datasets (for Your Tiny LM)

H4.1 Decide the “World” Your Tiny LM Will Live In

Pick a narrow-ish domain (e.g., Story world, Tech articles).

H4.2 Document Segmentation

Split into paragraphs or sections.
Remove very short/long docs.

H4.3 Tokenization & Sequence Packing

Tokenize all text.
Concatenate token streams.
Split into fixed-length sequences.
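
The three steps above can be sketched as follows, assuming documents are already tokenized into integer IDs and that an end-of-sequence ID marks document boundaries. Dropping the final partial chunk is one common choice; padding it is another.

```python
def pack_sequences(token_docs, seq_len, eos_id):
    """Concatenate tokenized docs with EOS separators, then cut fixed-length chunks."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)        # mark the document boundary
    n_full = len(stream) // seq_len  # the trailing partial chunk is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Hypothetical token IDs for three short documents; 0 is the EOS ID
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(docs, seq_len=4, eos_id=0)
```

Note that the second chunk spans two documents, which is the trade-off packing makes for avoiding padding.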

H4.4 Dataset Statistics & Sanity Checks

Compute total tokens, seq length distribution.
Check for content dominance or unsafe strings.

H5. Building Fine-Tuning (SFT) Datasets

H5.1 Define the Task Schema

Pick 1–2 concrete task types (Math reasoning, Q&A).
Define a JSON-like schema.

H5.2 Sourcing SFT Data

Real: Public Q&A datasets.
Synthetic: Use a strong teacher model (e.g., GPT-4) to generate problems and solutions.

H5.3 Formatting for Training

Create a canonical prompt format (Instruction, User, Assistant).
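
One possible shape for a task schema and canonical prompt format. The field names (question, reasoning, answer), the section markers, and the default system text are all hypothetical; what matters is that every example is rendered through the same function for both training and inference.

```python
import json

# Hypothetical schema for a math-reasoning SFT example
example = {
    "question": "What is 2 + 2?",
    "reasoning": "Adding 2 and 2 gives 4.",
    "answer": "4",
}

def format_example(ex, system="You are a careful math assistant."):
    """Render one example into a fixed Instruction / User / Assistant layout."""
    return (
        f"### Instruction:\n{system}\n\n"
        f"### User:\n{ex['question']}\n\n"
        f"### Assistant:\n{ex['reasoning']}\nFinal answer: {ex['answer']}"
    )

prompt_text = format_example(example)
jsonl_line = json.dumps(example)   # one line per example when stored as JSONL
```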

H5.4 Quality Filtering for SFT

Remove trivial or nonsensical examples.
Enforce clear reasoning and explicit final answers.

H6. Building Preference / RL / Reward Data

H6.1 Pairwise Preference Data

Two responses (A and B) for each prompt.
Label: A > B, B > A, or Tie.

H6.2 Rating Data

Single response + scalar score or rubric-based labels.

H7. Synthetic Data Pipelines (Teacher → Student)

H7.1 Types of Synthetic Data

Pretraining-style, SFT-style, CoT-style, Preference data.

H7.2 Synthetic Data Loop

Pick target domain.
Write meta-prompts for teacher.
Generate data.
Filter.
Store.

H7.3 Teacher-Student Distillation Data

Teacher generates CoT reasoning + final answer.
Student trains to reproduce CoT or final answer.

H8. Dataset Versioning, Metadata & Governance

H8.1 Versioning

Give each dataset a version (e.g., pretrain_corpus_v1).
Track source, date, preprocessing, tokenizer.

H8.2 Metadata & Data Cards

Note intended use, domains, biases, safety notes.

H8.3 Legal/Ethical Basics

Respect licenses, ToS, privacy.

H9. Practical Mini-Pipeline for Your Colab Tiny LLM

H9.1 Pretraining Data

Pick domain -> Collect 1-10MB -> Clean -> Tokenize -> Pack -> Split -> Store.

H9.2 SFT Data

Define task -> Use teacher model -> Format -> Filter -> Store.

H9.3 Evaluation Data

Hold out 20-50 problems -> Save canonical answer -> Use for manual eval.

H9.4 Synthetic Preference Data

Select prompts (kept separate from the held-out eval set) -> Produce multiple answers -> Manually label -> Store.