Fundamentals of LLM

Master the bedrock of LLMs: Mathematics, Computer Science, Python, and PyTorch. Includes an expanded, ultra-detailed curriculum.

Fundamentals of LLM (Expanded, Ultra-Detailed Version A → H)


A — CORE FOUNDATIONS (Mathematics, CS, Python, PyTorch)

Why this section exists:

To train any LLM from scratch, you must understand the math, CS, and tooling that every transformer block is built upon. This section ensures you have the bedrock for everything later.

A1. Mathematics Foundations (for LLM Architecture & Training)

(1) Linear Algebra (Critical for Attention & Transformers)

  • Scalars, vectors, matrices, tensors
  • Dot products (used in Q·K attention)
  • Matrix multiplication (core operation)
  • Norms (L2, normalization layers)
  • Orthogonality & projections
  • Eigenvalues & stability (rare but helpful for optimization insight)

(2) Calculus (Optimization & Backprop)

  • Derivatives (basic + vectorized)
  • Chain rule
  • Jacobians (conceptual)
  • Gradients of loss functions
  • Activation derivatives
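
To make the chain rule and activation derivatives concrete, here is a minimal sketch for a single sigmoid unit with a squared-error loss. The one-weight model and the function names are illustrative assumptions; the finite-difference check is only a sanity test, not how frameworks compute gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Squared error of a one-weight sigmoid unit: L = (sigmoid(w*x) - y)^2."""
    return (sigmoid(w * x) - y) ** 2

def grad_analytic(w, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw."""
    z = w * x
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)   # derivative of the sigmoid activation
    dz_dw = x
    return dL_da * da_dz * dz_dw

def grad_numeric(w, x, y, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

print(grad_analytic(0.5, 2.0, 1.0), grad_numeric(0.5, 2.0, 1.0))
```

Backpropagation applies the same product of local derivatives layer by layer, which is why agreement between the two numbers is a useful debugging habit.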

(3) Probability & Statistics (For NLL, Cross-Entropy, Sampling)

  • Probability distributions
  • Softmax workings
  • Cross-entropy & KL divergence
  • Expectation & variance
  • Sampling from categorical distributions
  • Temperature scaling
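
The items above fit together in a few lines. The following is a small sketch, in plain Python rather than PyTorch, of softmax with temperature, cross-entropy, KL divergence, and categorical sampling; the example logits are made up.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.5)   # lower temperature: more peaked
flat = softmax(logits, temperature=2.0)    # higher temperature: closer to uniform
```

Lower temperatures concentrate probability on the top logit, so sampling becomes more deterministic; higher temperatures spread it out.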

(4) Discrete Math / Algorithms

  • Asymptotic complexity (why attention is O(n²))
  • Graph thinking (attention maps)
  • Hashing (tokenization, vocab compression)
  • Trie structures (used in tokenizers)
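
As one illustration of why tries appear in tokenizers, here is a sketch of greedy longest-match tokenization over a toy vocabulary. The vocabulary, the `unk_id` fallback, and the greedy strategy are assumptions for the example; real tokenizers differ in how they choose among candidate matches.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None   # set when a vocabulary entry ends at this node

def build_trie(vocab):
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def greedy_tokenize(text, root, unk_id=0):
    """At each position, emit the longest vocabulary entry that matches."""
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_end, j = root, None, i, i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        if best_id is None:
            ids.append(unk_id)   # no entry starts here: emit UNK, skip one char
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids

vocab = {"un": 1, "believ": 2, "able": 3, "a": 4, "b": 5}   # toy vocabulary
root = build_trie(vocab)
print(greedy_tokenize("unbelievable", root))
```

Each lookup walks the trie once per character, so matching cost depends on token length rather than vocabulary size.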

A2. Computer Science Foundations

  • Memory layout (GPU & CPU differences)
  • GPU vs CPU compute
  • CUDA basics
  • Parallelism vs vectorization
  • FP16/BF16/FP8 numeric behavior
  • CPU cache, batching, locality
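
Reduced-precision behavior can be explored without a GPU. This sketch uses the standard library's half-precision `struct` format to round values to FP16, and emulates BF16 by truncating an FP32 encoding (an approximation: hardware normally rounds to nearest rather than truncating).

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def overflows_fp16(x):
    """True if x is too large to be represented as a finite FP16 value."""
    try:
        struct.pack("<e", x)
        return False
    except OverflowError:
        return True

def to_bf16_truncated(x):
    """Keep the top 16 bits of the FP32 encoding (truncation, not rounding)."""
    top = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top + b"\x00\x00")[0]

print(to_fp16(2049.0))            # FP16 cannot represent every integer above 2048
print(to_fp16(1e-8))              # tiny values underflow to zero
print(overflows_fp16(70000.0))    # FP16 tops out at 65504
print(to_bf16_truncated(70000.0)) # BF16 keeps the range but loses precision
```

The contrast is the usual motivation for BF16 in training: it shares FP32's exponent range, so overflow and underflow are rarer, at the cost of fewer mantissa bits.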

A3. Python Foundations

  • Python OOP
  • Decorators
  • Typing (Type hints)
  • Context managers
  • Generators (for data streaming)
  • Error handling
  • Performance tips (NumPy vectorization, avoiding Python loops)
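
Generators are the piece of this list most directly used in data pipelines. A minimal sketch of streaming and batching, with hypothetical helper names, might look like this:

```python
def stream_lines(lines):
    """Yield non-empty, stripped lines one at a time (works on open file objects too)."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def batched(items, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:            # emit the final, possibly smaller, batch
        yield batch

raw = ["first doc\n", "\n", "second doc\n", "third doc\n"]
batches = list(batched(stream_lines(raw), batch_size=2))
print(batches)
```

Because both functions are lazy, only one batch is held in memory at a time, which matters once the corpus no longer fits in RAM.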

A4. PyTorch Deep Foundations

  • Tensors
  • Autograd
  • nn.Module
  • Optimizers
  • DataLoader
  • Dataset class
  • Training loop
  • Mixed-precision training (torch.autocast)
  • Distributed training basics (torch.distributed)

B — LLM ARCHITECTURE FOUNDATIONS (Tokenizers → Transformers)

Why this section exists:

This is how models think. You cannot build or modify a model without mastering every detail here.

B1. Tokenization

  • Why tokenization matters
  • Subword tokenization
  • Character-level vs WordPiece vs BPE
  • SentencePiece (library supporting both unigram and BPE models)
  • Byte-level tokenization (GPT-2 byte-level BPE; LLaMA 1/2 use SentencePiece BPE with byte fallback)
  • Vocab size tradeoffs
  • Special tokens (BOS, EOS, PAD, UNK)
  • Tokenizer training process
  • Token merging strategies
  • Tokenization for multilingual models

B2. Input/Output Embeddings

  • Token embeddings
  • Learned positional embeddings
  • RoPE (rotary embedding mathematics)
  • ALiBi
  • Relative position encodings
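
The key property of RoPE is that the query-key dot product depends only on the relative offset between positions. The sketch below checks this numerically; it assumes the interleaved pairing of dimensions and the common base of 10000, whereas some implementations pair dimensions differently.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_b = dot(rope(q, 103), rope(k, 100))   # positions 103 and 100, offset 3
print(score_a, score_b)                     # equal up to floating-point error
```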

B3. Attention Mechanisms

Fundamental Attention Concepts

  • Q (query), K (key), V (value)
  • Dot-product attention
  • Scaled attention (why scores are divided by √d_k)
  • Multi-head attention
  • Self-attention vs cross-attention
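
A single-head, causal version of scaled dot-product attention can be written without any framework. This is a teaching sketch on lists of floats; it omits the learned Q/K/V projections, batching, and multiple heads.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Q, K, V are lists of per-position vectors; returns outputs and weights."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Causal mask: position i attends only to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        out = [
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = causal_attention(Q, K, V)
```

Dividing by √d_k keeps the score variance roughly constant as the head dimension grows, which prevents the softmax from saturating.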

Advanced Variants

  • FlashAttention (memory optimized)
  • GQA (Grouped Query Attention)
  • Multi-Query Attention (MQA)
  • MLA (Multi-head Latent Attention; introduced in DeepSeek-V2, used in DeepSeek-V3)
  • Linear attention variants
  • Attention-free models (Mamba, Hyena) (intro only)
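
One practical reason for GQA and MQA is KV-cache size at inference time. The arithmetic below uses a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 4096-token context, 2-byte values), not any specific published model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Keys and values are both cached, hence the leading factor of 2."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

config = dict(num_layers=32, head_dim=128, seq_len=4096)   # hypothetical model
mha = kv_cache_bytes(num_kv_heads=32, **config)   # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)    # 4 query heads share each KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)    # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1024**2:.0f} MiB per sequence")
```

The cache shrinks in proportion to the number of KV heads, which is the trade these variants make against some loss of modeling capacity.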

B4. Transformer Block Internals

  • LayerNorm vs RMSNorm
  • Residual connections
  • SwiGLU feedforward
  • Dropout
  • Attention + MLP ordering
  • Pre-norm vs Post-norm
  • Transformer depth vs width tradeoffs
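
The difference between LayerNorm and RMSNorm is easiest to see side by side. This sketch leaves out the learned gain (and LayerNorm's bias) to focus on the normalization itself.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)
print(ln, rn)
```

RMSNorm skips the centering step, which saves a reduction per call while still controlling the scale of activations.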

B5. Training Stability

  • Initialization strategies
  • Learning rate warmup
  • Cosine annealing
  • Gradient clipping
  • Mixed precision (BF16/FP16)
  • Logits scaling
  • Loss spikes & cures
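
Two of the stability tools above, warmup with cosine decay and gradient clipping, reduce to short formulas. The sketch below uses linear warmup and global-norm clipping on a flat list of gradient values; the hyperparameters are placeholders.

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(lr_at(0, 1e-3, 10, 100), lr_at(10, 1e-3, 10, 100), clipped)
```

Logging the pre-clip norm is worthwhile in practice, since a sudden jump in it often precedes a loss spike.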

C — PRETRAINING FOUNDATIONS (Data → Objective → Scaling Laws)

Why this section exists:

Pretraining shapes the LM’s general intelligence before reasoning, alignment, or fine-tuning.

C1. Pretraining Objective

  • Next-token prediction
  • Why it’s enough for complex reasoning
  • Masked language modeling (BERT style) and how it differs from next-token prediction

C2. Data Engineering

  • Dataset creation pipeline
  • Document boundaries
  • Packing sequences
  • Deduplication methods
  • Internet-scale text sources
  • Tokenization throughput optimization
  • Data mixture strategies (code, math, safety data)
  • FineWeb, Pile, SlimPajama

C3. Scaling Laws

  • Neural scaling theory
  • Compute-optimal scaling (Chinchilla)
  • Parameters vs training tokens vs FLOPs
  • Optimal batch size
  • Dimensionality choices
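
The relationship between parameters, tokens, and FLOPs is often approximated with two rules of thumb: training compute C ≈ 6·N·D, and a Chinchilla-style ratio of roughly 20 training tokens per parameter. Both are approximations, and the 20:1 ratio in particular is an assumption that varies with data and setup. Under them, a compute budget determines the model size:

```python
import math

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

With a budget of 5.88e23 FLOPs this yields about 70B parameters and 1.4T tokens, consistent with the scale reported for Chinchilla itself.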

C4. Multi-token Prediction (DeepSeek-V3 / Gemini style)

  • Why multi-token prediction accelerates learning
  • Architecture modifications needed
  • Performance vs stability tradeoffs

C5. Distributed Training

  • Data parallel
  • Tensor parallel
  • Pipeline parallel
  • FSDP
  • ZeRO stages
  • Sharded optimizers
  • GPU memory fragmentation effects
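
The ZeRO stages are easiest to compare through per-GPU memory for model state. The sketch assumes mixed-precision Adam, where each parameter costs 2 bytes (FP16 weights) + 2 bytes (FP16 gradients) + 12 bytes (FP32 master weights, momentum, variance). Activations and temporary buffers are ignored.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Model-state bytes per GPU; stage 0 means plain data parallelism."""
    params = 2 * num_params
    grads = 2 * num_params
    optimizer = 12 * num_params
    if stage >= 1:
        optimizer /= num_gpus   # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus       # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus      # stage 3 also shards parameters
    return params + grads + optimizer

for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this goes from 120 GB without sharding to under 2 GB at stage 3, at the price of more communication.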

D — POST-TRAINING FOUNDATIONS (Reasoning, SFT, RLHF, Distillation)

Why this section exists:

This step teaches models how to behave, not just predict text.

D1. SFT (Supervised Fine-tuning)

  • Task-specific fine-tuning
  • Instruction tuning
  • Formatting data correctly
  • Controlling answer length (short vs long responses)
  • CoT (Chain of Thought) SFT

D2. Preference Optimization

  • DPO (Direct Preference Optimization)
  • ORPO (Odds ratio preference optimization)
  • RRHF
  • Stable preference modeling
  • Pros/cons vs PPO
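
For a single preference pair, the DPO objective is a logistic loss on the difference of policy-versus-reference log-probability margins. The sketch takes summed sequence log-probabilities as plain floats; the argument names and the default β = 0.1 are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen margin - rejected margin))) for one pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # Numerically stable softplus(-logit).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

print(dpo_loss(-5.0, -7.0, -5.0, -7.0))   # policy equals reference: log(2)
print(dpo_loss(-4.0, -7.0, -5.0, -7.0))   # chosen response became more likely
```

No reward model or sampling loop is needed, which is the main practical difference from PPO-based RLHF.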

D3. RLHF

  • Reward modeling
  • PPO
  • GRPO (Group Relative Policy Optimization; DeepSeek’s critic-free alternative to PPO)
  • Advantage shaping
  • Sampling strategies
  • KL penalty tuning
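
The core idea of GRPO's advantage estimate is to normalize rewards within a group of responses sampled for the same prompt, instead of learning a value function. A minimal sketch, assuming population standard deviation and a small epsilon for the all-equal case:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

If every response in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason prompt difficulty matters for this method.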

D4. Distillation

  • Sequence-level distillation
  • Logit distillation
  • Hidden-state distillation
  • Chain-of-thought distillation
  • Verifier-guided distillation
  • Distilling from test-time compute (DeepSeek-R1 style)
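
Logit distillation is commonly implemented as a temperature-softened KL divergence between teacher and student distributions, scaled by T². The sketch below follows that common formulation for a single token position; the logits are made up.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))   # identical logits: zero loss
print(distillation_loss([1.0, 1.0, 1.0], teacher))   # uninformed student: positive loss
```

Sequence-level distillation, by contrast, trains the student on text sampled from the teacher and needs no access to teacher logits.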

E — DISTILLATION + EVALUATION FOUNDATIONS

Why this section exists:

You want to build small-but-smart models; distillation is vital.

E1. Distillation Goals

  • Reduce hallucination
  • Compress reasoning
  • Transfer knowledge
  • Improve inference speed

E2. Distillation Techniques

  • Behavior cloning
  • Student-teacher pipelines
  • Offline vs online distillation
  • Reinforced distillation
  • CoT to short-form distillation

E3. Evaluation

  • Perplexity
  • Accuracy benchmarks (GSM8K, MATH)
  • Reasoning faithfulness
  • Long context evaluation
  • Compression ratio vs accuracy tradeoffs
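
Perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, taking natural-log token probabilities as input:

```python
import math

def perplexity(token_logprobs):
    """exp(-mean log p) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options.
print(perplexity([math.log(0.25)] * 8))
```

Perplexity values are only comparable between models that share a tokenizer, since the token count in the denominator changes with the vocabulary.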

F — MULTIMODAL FOUNDATIONS (Vision, Audio, Video, Alignment)

Why this section exists:

Next-gen models are almost all multimodal.

F1. Vision Models

  • ViT
  • CLIP
  • SigLIP
  • Q-Former (BLIP-2)
  • Image tokenization

F2. Audio Models

  • Whisper architecture
  • Mel spectrograms
  • CTC loss

F3. Video Models

  • Spatio-temporal attention
  • Frame embeddings

F4. Multimodal Fusion

  • Early fusion
  • Late fusion
  • Projector layers
  • Visual grounding
  • Multimodal CoT

G — FUTURE DIRECTIONS & ACTIVE RESEARCH AREAS

Why this section exists:

You want DeepSeek-level competency — meaning you must know where the field is evolving.

G1. Architecture Evolution

  • MLA (DeepSeek)
  • SSM and recurrent hybrids (Mamba, RWKV)
  • Efficient attention
  • MoE 2.0

G2. Pretraining Evolution

  • Multi-token prediction
  • Synthetic data scaling
  • Large-scale distillation
  • Self-correction in pretraining

G3. Post-training Evolution

  • GRPO replacing PPO
  • Self-play for reasoning
  • Verifier models

G4. Reasoning Evolution

  • Best-of-N reasoning
  • Self-verification
  • Planner–worker–critic loops
  • Test-time compute scaling
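
Two simple forms of test-time compute are best-of-N selection with a scorer and majority voting over final answers (often called self-consistency). The sketch uses a hypothetical stand-in scorer; in practice the score would come from a verifier or reward model.

```python
from collections import Counter

def best_of_n(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

def majority_vote(final_answers):
    """Return the most frequent final answer among sampled solutions."""
    return Counter(final_answers).most_common(1)[0][0]

def toy_verifier(answer):
    # Hypothetical scorer for illustration only.
    return 1.0 if answer == "42" else 0.0

print(best_of_n(["41", "42", "forty-two"], toy_verifier))
print(majority_vote(["42", "41", "42"]))
```

Both spend more samples per question to buy accuracy; best-of-N depends on scorer quality, while voting needs answers that can be compared exactly.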

G5. Hallucination Reduction

  • Retrieval-enhanced pretraining (RPT)
  • Multimodal grounding
  • Verifier-guided decoding

G6. Agentic AI

  • Tool use
  • Planning
  • Memory modules
  • Reflection
  • Multi-agent coordination

G7. On-Device & Edge Deployment

  • On-device LLMs
  • Edge + Cloud hybrid
  • Efficient quantization (AWQ, GPTQ, GGUF)
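
The methods named above (AWQ, GPTQ) are considerably more sophisticated, but the basic idea of weight quantization can be shown with symmetric absmax INT8 rounding. This is a simplified stand-in, not an implementation of any of those methods.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        scale = 1.0   # all-zero input: any scale works
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

A single outlier inflates the shared scale and wastes resolution on every other value, which is the problem that per-group scales and activation-aware methods try to address.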

G8. Self-Improving Loops

  • Auto-data generation
  • Auto-distillation
  • Auto-evaluation
  • AI-assisted training pipelines

H — DATA & DATASET FOUNDATIONS (REAL + SYNTHETIC)

Why this exists: Even a perfectly implemented transformer is useless without good data. Much of the advantage of labs such as DeepSeek, OpenAI, and Anthropic plausibly lies in:

  • What data they use
  • How they clean, mix, and label it
  • How they generate, filter, and distill synthetic data

Phase 1 = foundations: you won’t build a web-scale pipeline yet, but you’ll learn the full structure on a small scale so you can scale it up later.

H1. Types of Data You Need Across the LLM Lifecycle

H1.1 Pretraining Data (Base LM)

Goal: Teach the model general language + code + basic world knowledge.

  • Unlabeled, large-scale text: Web pages, books, Wikipedia, code, forums.
  • Key properties: Diversity, clean text, balanced mixture.

H1.2 Supervised Fine-Tuning (SFT) Data

Goal: Teach the model how to act as an assistant or specific tool.

  • Labeled (input → output) pairs: Instruction → answer, Question → reasoning → answer.
  • Typically higher-quality, smaller volume than pretraining data.

H1.3 Preference / RLHF / Reward Modeling Data

Goal: Teach the model which outputs are better.

  • Pairs: (prompt, response_A, response_B, preference).
  • Used for: Preference modeling, RL (PPO/GRPO), DPO/ORPO.

H1.4 Evaluation Data

Goal: Measure how good the model actually is.

  • Held-out sets for Math, Coding, Reasoning.
  • Must never be used in training or synthetic generation.
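
One rough way to check that evaluation items have not leaked into training data is to look for shared word n-grams. The sketch below uses 5-grams on whitespace-split, lowercased text; both choices are assumptions, and real contamination checks are usually longer-window and more normalization-aware.

```python
def ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(eval_text, train_texts, n=5):
    """True if any n-gram of the eval item also appears in a training text."""
    eval_grams = ngrams(eval_text, n)
    return any(eval_grams & ngrams(t, n) for t in train_texts)

train = ["Note that the sum of two and three is five"]
print(contaminated("What is the sum of two and three", train))
print(contaminated("How many legs does a spider have in total", train))
```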

H1.5 Synthetic Data

Cross-cuts all of the above:

  • Synthetic pretraining data (Teacher model generates corpora).
  • Synthetic SFT data (Teacher answers tasks).
  • Synthetic preference data (AI feedback - RLAIF).

H2. Collecting Raw Real Data

H2.1 Sources (for small-scale experiments)

  • Public text datasets (Wikipedia, BookCorpus).
  • Code datasets (GitHub subset).
  • Domain data (Exported docs).

H2.2 Ways to Collect

  • Direct downloads (preferred).
  • APIs (Wikipedia, Hacker News).
  • Web scraping (Respect robots.txt).

H3. Cleaning & Filtering (Quality, Safety, Deduplication)

H3.1 Basic Cleaning

  • Remove HTML/markup.
  • Normalize whitespace and Unicode.
  • Strip boilerplate.
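
A first pass at the cleaning steps above can be done with the standard library. This regex-based sketch handles markup removal and Unicode/whitespace normalization only; it is not a robust HTML parser, and boilerplate removal beyond script/style blocks (navigation, footers) is left out.

```python
import html
import re
import unicodedata

def clean_text(raw):
    # Drop script/style blocks entirely, then any remaining tags.
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", raw)
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)                  # &amp; -> &, &nbsp; -> U+00A0
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(clean_text("<p>Hello&nbsp;&amp;  world</p><script>x=1</script>"))
```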

H3.2 Language Detection

  • Filter to languages your tokenizer/model will handle.

H3.3 Deduplication

  • Exact dedupe (Hash full documents).
  • Near-duplicate detection (MinHash).
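
Both deduplication styles can be sketched in a few lines. Exact dedupe hashes the whole document; near-duplicate detection here uses a simple MinHash over word 3-gram shingles. The shingle size, the 64 hash functions, and the use of salted BLAKE2b as the hash family are illustrative choices, and a real pipeline would add LSH bucketing to avoid all-pairs comparison.

```python
import hashlib

def exact_key(text):
    """Identical documents produce identical keys."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "gradient descent updates parameters using the negative loss gradient"
sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```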

H3.4 Safety & PII Filtering

  • Remove emails, phone numbers, keys.
  • Remove explicitly harmful content.
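
A very rough PII pass can be done with regular expressions. The patterns below are simplified assumptions: the phone pattern covers only one US-style format, and the key pattern is a hypothetical "sk-" prefix used purely for illustration. Real pipelines need broader patterns and usually a dedicated detection tool.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")   # e.g. 555-123-4567 only
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{20,}\b")         # hypothetical key format

def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    text = API_KEY.sub("[KEY]", text)
    return text

print(redact("mail bob@example.com or call 555-123-4567"))
```

Replacing matches with placeholders rather than deleting them keeps sentence structure intact, which tends to be better for language modeling.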

H4. Building Pretraining Datasets (for Your Tiny LM)

H4.1 Decide the “World” Your Tiny LM Will Live In

  • Pick a narrow-ish domain (e.g., Story world, Tech articles).

H4.2 Document Segmentation

  • Split into paragraphs or sections.
  • Remove very short/long docs.

H4.3 Tokenization & Sequence Packing

  • Tokenize all text.
  • Concatenate token streams.
  • Split into fixed-length sequences.
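
The three steps above can be sketched as follows, assuming the documents are already tokenized into lists of integer IDs and that an EOS token marks document boundaries. The trailing partial sequence is dropped here; padding it is another option.

```python
def pack_sequences(docs_token_ids, seq_len, eos_id):
    """Concatenate documents with EOS separators, then cut fixed-length chunks."""
    stream = []
    for ids in docs_token_ids:
        stream.extend(ids)
        stream.append(eos_id)   # mark the document boundary
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(docs, seq_len=4, eos_id=0)
print(packed)
```

Packing avoids wasting compute on padding, at the cost of sequences that span document boundaries unless attention is masked accordingly.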

H4.4 Dataset Statistics & Sanity Checks

  • Compute total tokens, seq length distribution.
  • Check for content dominance or unsafe strings.

H5. Building Fine-Tuning (SFT) Datasets

H5.1 Define the Task Schema

  • Pick 1–2 concrete task types (Math reasoning, Q&A).
  • Define a JSON-like schema.

H5.2 Sourcing SFT Data

  • Real: Public Q&A datasets.
  • Synthetic: Use a strong model (e.g., GPT-4) to generate problems and solutions.

H5.3 Formatting for Training

  • Create a canonical prompt format (Instruction, User, Assistant).
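
A JSON-like schema and a canonical prompt format might look like the sketch below. The field names (`instruction`, `input`, `output`) and the `###` section markers are hypothetical choices for illustration; what matters is using exactly one format consistently for training and inference.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")   # hypothetical schema

def validate(record):
    """Every required field must be a string, and the output must be non-empty."""
    return (
        all(isinstance(record.get(k), str) for k in REQUIRED_FIELDS)
        and record["output"].strip() != ""
    )

def format_example(record):
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### User:\n{record['input']}\n\n"
        f"### Assistant:\n{record['output']}"
    )

record = json.loads(
    '{"instruction": "Solve the problem step by step.", '
    '"input": "What is 12 * 7?", '
    '"output": "12 * 7 = 84. Final answer: 84"}'
)
text = format_example(record)
print(text)
```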

H5.4 Quality Filtering for SFT

  • Remove trivial or nonsensical examples.
  • Enforce clear reasoning and explicit final answers.

H6. Building Preference / RL / Reward Data

H6.1 Pairwise Preference Data

  • Two responses (A and B) for each prompt.
  • Label: A > B, B > A, or Tie.
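
A pairwise record and its conversion to the (chosen, rejected) form used by methods such as DPO could be sketched like this. The class and field names are illustrative, and dropping ties is one possible policy rather than a requirement.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    label: str   # "A", "B", or "tie"

    def __post_init__(self):
        if self.label not in ("A", "B", "tie"):
            raise ValueError(f"unknown label: {self.label!r}")

def to_chosen_rejected(pair):
    """Return (prompt, chosen, rejected), or None for ties."""
    if pair.label == "tie":
        return None
    if pair.label == "A":
        return pair.prompt, pair.response_a, pair.response_b
    return pair.prompt, pair.response_b, pair.response_a

pair = PreferencePair("What is 2 + 2?", "5", "4", label="B")
print(to_chosen_rejected(pair))
```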

H6.2 Rating Data

  • Single response + scalar score or rubric-based labels.

H7. Synthetic Data Pipelines (Teacher → Student)

H7.1 Types of Synthetic Data

  • Pretraining-style, SFT-style, CoT-style, Preference data.

H7.2 Synthetic Data Loop

  1. Pick target domain.
  2. Write meta-prompts for teacher.
  3. Generate data.
  4. Filter.
  5. Store.
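
The five steps can be sketched end to end with a stub in place of the teacher. `stub_teacher` is a hypothetical stand-in that deliberately returns a wrong answer some of the time, so that the filter step has something to remove; a real pipeline would call an actual model there and use a task-appropriate check.

```python
import json
import random

def stub_teacher(meta_prompt, rng):
    """Stand-in for a teacher model call; wrong roughly 20% of the time."""
    a, b = rng.randint(1, 20), rng.randint(1, 20)
    answer = a + b if rng.random() > 0.2 else a + b + 1
    return {"question": f"What is {a} + {b}?", "a": a, "b": b, "answer": answer}

def keep(example):
    """Filter: recompute the answer and keep only verified examples."""
    return example["answer"] == example["a"] + example["b"]

def run_pipeline(num_samples, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: target domain (addition) and meta-prompt for the teacher.
    meta_prompt = "Write one addition problem with its answer."
    # Step 3: generate.
    generated = [stub_teacher(meta_prompt, rng) for _ in range(num_samples)]
    # Step 4: filter.
    filtered = [ex for ex in generated if keep(ex)]
    # Step 5: store as JSON Lines (returned as a string here).
    jsonl = "\n".join(
        json.dumps({"question": ex["question"], "answer": ex["answer"]})
        for ex in filtered
    )
    return generated, filtered, jsonl

generated, filtered, jsonl = run_pipeline(50)
print(len(generated), len(filtered))
```

Domains with a programmatic check, such as arithmetic or code with unit tests, make the filter step cheap and reliable, which is why they are popular for synthetic data.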

H7.3 Teacher-Student Distillation Data

  • Teacher generates CoT reasoning + final answer.
  • Student trains to reproduce CoT or final answer.

H8. Dataset Versioning, Metadata & Governance

H8.1 Versioning

  • Give each dataset a version (e.g., pretrain_corpus_v1).
  • Track source, date, preprocessing, tokenizer.
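
A small manifest stored next to each dataset is enough to start with. The field names below are hypothetical; the content hash lets you detect when a file labeled with the same version has silently changed.

```python
import hashlib
import json

def build_manifest(name, version, source, created, tokenizer, preprocessing, documents):
    digest = hashlib.sha256()
    for doc in documents:
        digest.update(doc.encode("utf-8"))
        digest.update(b"\n")
    return {
        "id": f"{name}_v{version}",
        "source": source,
        "created": created,
        "tokenizer": tokenizer,
        "preprocessing": preprocessing,
        "num_documents": len(documents),
        "sha256": digest.hexdigest(),
    }

manifest = build_manifest(
    name="pretrain_corpus",
    version=1,
    source="example public text dump",     # placeholder description
    created="2024-01-01",                  # placeholder date
    tokenizer="toy-bpe-3-merges",          # placeholder tokenizer name
    preprocessing=["clean_text", "exact_dedupe"],
    documents=["first document", "second document"],
)
print(json.dumps(manifest, indent=2))
```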

H8.2 Metadata & Data Cards

  • Note intended use, domains, biases, safety notes.

H8.3 Legal/Ethical Basics

  • Respect licenses, ToS, privacy.

H9. Practical Mini-Pipeline for Your Colab Tiny LLM

H9.1 Pretraining Data

  • Pick domain -> Collect 1-10MB -> Clean -> Tokenize -> Pack -> Split -> Store.

H9.2 SFT Data

  • Define task -> Use teacher model -> Format -> Filter -> Store.

H9.3 Evaluation Data

  • Hold out 20-50 problems -> Save canonical answer -> Use for manual eval.

H9.4 Synthetic Preference Data

  • Select prompts (kept separate from the held-out eval set) -> Produce multiple answers -> Manually label -> Store.