Pre-Trained Foundation Models
1. Concept Introduction

The original 2017 Transformer had two halves: an Encoder (to understand the input text) and a Decoder (to generate the translated output text).

The revolution of the 2020s happened when Google and OpenAI realized they didn't need both halves. By splitting the architecture in two and training billion-parameter models on web-scale text from the open internet, they created Foundation Models. BERT (Google) is a standalone Encoder, trained to fill in masked words. GPT (OpenAI) is a standalone Decoder, trained purely on the "predict the next word" task.

2. Concept Intuition

BERT (The Detective - Encoder Only):

BERT reads an entire sentence simultaneously, left-to-right AND right-to-left. "The child ate the [BLANK] with a fork." Because BERT sees the word "fork" at the end of the sentence, it can confidently infer the blank is something like "spaghetti". It excels at understanding context, searching documents, and classifying text.

GPT (The Author - Decoder Only):

GPT is mathematically forbidden from looking at future words. It generates text sequentially. "The child ate the". GPT has to predict the next word without seeing the end of the sentence. It generates "apple". Then it feeds "The child ate the apple" back into itself and guesses the next word. It is an Autoregressive engine: weaker at search-style understanding tasks, but excellent at fluent writing.
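That feed-the-output-back-in loop can be sketched in a few lines. Here a hard-coded next-word table is a toy stand-in for the model (hypothetical values, for illustration only); the loop structure is the same one GPT uses:

```python
# Toy autoregressive generation: a hard-coded "most likely next word"
# table stands in for the trained model.
next_word = {
    "The": "child", "child": "ate", "ate": "the",
    "the": "apple", "apple": ".",
}

tokens = ["The"]
while tokens[-1] in next_word and len(tokens) < 10:
    # Feed everything generated so far back in; here only the last
    # token matters because our toy "model" is a bigram table.
    tokens.append(next_word[tokens[-1]])

print(" ".join(tokens))  # The child ate the apple .
```

A real model replaces the lookup table with a neural network that outputs a probability distribution, but the loop is identical.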

3. Python Syntax (HuggingFace Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Download the pre-trained weights from the HuggingFace Hub
model_id = "gpt2"  # (or Llama-3, Mistral, etc.)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# 2. Tokenize the input string into integer tensors
inputs = tokenizer("The capital of France is", return_tensors="pt")

# 3. Generate (the autoregressive loop)
outputs = model.generate(inputs.input_ids, max_length=10)

# 4. Decode the integers back into an English string
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
4. Python Code Example (Text Generation)
```python
from transformers import pipeline

# Scenario: use a massive LLM with exactly two lines of code.
# The `pipeline` object hides the underlying matrix math and
# PyTorch/TensorFlow boilerplate automatically.

# Load a text-generation LM (like GPT/Llama).
# 'device=0' places the workload on your primary GPU (omit it to run on CPU).
generator = pipeline("text-generation", model="gpt2", device=0)

# Provide the prompt.
# Temperature controls randomness: lower = more rigid, higher = more creative.
# `do_sample=True` is required for temperature to have any effect.
result = generator("Artificial Intelligence will eventually",
                   max_length=50,
                   do_sample=True,
                   temperature=0.7,
                   num_return_sequences=1)

print(result[0]['generated_text'])
# Example output (sampled, will vary):
# "Artificial Intelligence will eventually become the underlying
#  infrastructure of human society, operating seamlessly in the background..."
```
6. Internal Mechanism (Causal Masking in GPT)

How do we mathematically force GPT to NOT look at future words during its training?

If we provide the training sentence "The cat sat on the mat", GPT computes the $O(N^2)$ Self-Attention score matrix (a 6×6 grid). Before the Softmax is applied, the network injects Negative Infinity (`-inf`) into the upper-right triangle of the matrix — every cell where a token would attend to a later token.

When the Softmax function sees `-inf`, it outputs exactly `0.0` for that position (since $e^{-\infty} = 0$). So when the word "cat" computes its attention output, the value vectors for "sat", "on", "the", and "mat" are multiplied by `0.0` — they are effectively invisible. "cat" can only absorb context from "The" and itself. This strictly enforces the unidirectional arrow of time.
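The masking step is easy to reproduce with NumPy. A minimal sketch, with random numbers standing in for the real query-key dot products:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
n = len(tokens)

rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))  # stand-in attention scores (6x6)

# Inject -inf above the diagonal: position i may not see positions > i.
scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf

weights = softmax(scores)
print(np.round(weights[1], 3))  # row for "cat": zeros after column 1
```

After the Softmax, every row still sums to 1, but row 1 ("cat") carries nonzero weight only on "The" and itself.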

7. Masked Language Modeling (BERT)

Unlike GPT, BERT is an Encoder. It does not use the `-inf` Causal Mask: every token attends to every other token, in both directions.

Google trained BERT on the English Wikipedia (plus the BookCorpus). They took sentences and randomly replaced 15% of the words with a literal [MASK] token. "The man was bitten by a [MASK]". BERT processes the sentence, attends heavily to "man" and "bitten", and outputs a probability distribution over its roughly 30,000-token vocabulary. When it guesses "dog", the Loss is low. When it guesses "cloud", the Loss is high. Through this massive masked-word puzzle game, BERT developed a structural understanding of the English language that fundamentally altered Google Search.
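The "low loss for dog, high loss for cloud" idea is just cross-entropy on the masked position. A toy sketch with a four-word vocabulary and made-up model probabilities:

```python
import numpy as np

vocab = ["dog", "cat", "cloud", "fork"]
# Hypothetical model output for the [MASK] in
# "The man was bitten by a [MASK]"
probs = np.array([0.70, 0.20, 0.05, 0.05])

def mlm_loss(true_word):
    """Cross-entropy: -log(probability assigned to the true word)."""
    return -np.log(probs[vocab.index(true_word)])

print(mlm_loss("dog"))    # small loss: the model assigned high probability
print(mlm_loss("cloud"))  # large loss: the model assigned low probability
```

Training simply nudges the weights so that the true word's probability rises, shrinking this loss across billions of masked positions.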

8. Edge Cases (Fine-Tuning Catastrophe)

If you download Llama-3 (Pre-Trained on the internet), and you Fine-Tune it on 10,000 medical documents, the AI will learn the medical jargon flawlessly.

However, running gradient descent on the new data causes Catastrophic Forgetting: the weight matrices are overwritten by the medical data, and the AI suddenly forgets how to write poetry, translate to Spanish, or even hold a normal conversational greeting. To solve this, researchers use PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA (Low-Rank Adaptation), which freezes all 8B foundation weights and trains only a tiny, disposable low-rank "adapter" (often just tens of megabytes) added on top of selected weight matrices.
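The arithmetic behind LoRA is simple: instead of updating a frozen d×d matrix W, train two skinny matrices B (d×r) and A (r×d) and use W + B·A in the forward pass. A sketch with illustrative (hypothetical) sizes:

```python
import numpy as np

d, r = 1024, 4                      # hidden size, adapter rank (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))     # frozen pre-trained weight
A = rng.standard_normal((r, d))     # trainable, small
B = np.zeros((d, r))                # trainable, zero-init: no change at start

W_effective = W + B @ A             # what the forward pass actually uses

full = d * d                        # params if we fine-tuned W directly
lora = d * r + r * d                # params LoRA actually trains
print(f"trainable fraction: {lora / full:.4%}")  # well under 1%
```

Because B starts at zero, the adapted model is initially identical to the base model; fine-tuning only ever touches the tiny A and B matrices.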

9. Variations & Alternatives (RLHF)

A pure "Predict the next word" GPT model does not answer questions. If you type: "How to fix a car?", a raw GPT model will just continue the document: "...and other questions mechanics ask."

To turn it into an Assistant (like ChatGPT), OpenAI uses RLHF (Reinforcement Learning from Human Feedback). An initial supervised phase teaches it the question/answer format. Then the AI generates 4 different answers to a single prompt, and a human contractor reads all 4 and ranks them 1st through 4th. Those rankings train a reward model, and a Reinforcement Learning algorithm (PPO) then updates the model's weights to favor the tone, alignment, and formatting style of the humans' top preferences.
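The human rankings are typically converted into pairs and used to train the reward model with a Bradley-Terry-style loss: the reward of the preferred answer should beat the rejected one. A numeric sketch with made-up reward scores:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when chosen >> rejected."""
    return -np.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward-model scores for the human's #1 and #4 answers
print(preference_loss(2.0, -1.0))  # correct ranking by a wide margin: low loss
print(preference_loss(-1.0, 2.0))  # inverted ranking: high loss
```

Minimizing this loss over many ranked pairs teaches the reward model to score answers the way the human raters did; PPO then optimizes the language model against that learned reward.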

10. Common Mistakes

Mistake: Not Quantizing models for local inference.

Llama-3 (8 Billion parameters) stored as standard 32-bit floats (`fp32`) requires 32 Gigabytes of GPU VRAM just to load the weights. Almost no consumer hardware has that.

Fix: load the model in 4-bit precision (`load_in_4bit=True` via the BitsAndBytes integration). The 32-bit floats are compressed into coarse, block-wise 4-bit approximations. The model typically loses under 1% of its accuracy, but the memory requirement drops from 32GB down to roughly 5.5GB, small enough to run completely offline on a consumer gaming laptop or MacBook.
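The memory numbers follow directly from parameters × bytes-per-parameter (real loaders add quantization metadata and activation overhead on top, which is why 4-bit lands nearer 5.5GB than the raw figure):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Raw memory for the weights alone, ignoring activations/overhead."""
    return n_params * bits_per_param / 8 / 1e9

params = 8e9                          # Llama-3 8B
print(weight_memory_gb(params, 32))   # 32.0 GB -> fp32
print(weight_memory_gb(params, 16))   # 16.0 GB -> fp16/bf16
print(weight_memory_gb(params, 4))    #  4.0 GB -> 4-bit weights
```

Halving the bit width halves the weight memory, which is why 16-bit loading is the common default and 4-bit is the standard trick for consumer GPUs.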

11. Advanced Explanation (KV-Caching)

Autoregressive Generation (token by token) is horrifically slow because of redundant math. To predict word 100, the AI has to re-calculate the massive Attention matrices for words 1 through 99. To predict word 101, it throws the old math away, and recalculates 1 through 100 entirely from scratch!

All production LLMs use KV-Caching. After calculating the Key (K) and Value (V) vectors for word 99, the model stores them in GPU memory for the rest of the generation. When generating word 101, it only computes the K/V vectors for the new token and attends over the frozen KV-cache of the past, cutting the per-step attention cost from $O(N^2)$ (recomputing everything) down to $O(N)$ (one new query against $N$ cached keys). The trade-off: the cache grows linearly with sequence length, so long generations steadily balloon GPU memory usage.
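A minimal sketch of the cache: per step, compute K/V only for the new token, append, and attend over everything cached. (Toy identity "projections" stand in for the learned Q/K/V projection matrices.)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
rng = np.random.default_rng(0)
k_cache, v_cache = [], []

def attend(x):
    # Toy step: in a real model, K/V/Q come from learned projections of x.
    k_cache.append(x)            # cache this token's Key...
    v_cache.append(x)            # ...and Value, for the rest of generation
    K = np.stack(k_cache)        # (t, d) -- grows by one row per step
    V = np.stack(v_cache)
    weights = softmax(K @ x)     # score the new query against ALL cached keys
    return weights @ V

for _ in range(5):
    out = attend(rng.standard_normal(d))

print(len(k_cache))  # 5 -- cache size grows linearly with sequence length
```

Note that `attend` never recomputes old rows of K or V; the price is the ever-growing `k_cache`/`v_cache` lists, which is exactly the memory trade-off described above.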
