Sequence Models (RNN & LSTM)
Master Recurrence, Backpropagation Through Time, and Long Short-Term Memory Cell logic.
Standard Neural Networks (Dense layers) are amnesiacs. If you feed word #1 ("I") into the network, it computes an output. If you then feed word #2 ("went"), it is processed entirely independently, with zero memory of the word "I". That is fatal for time-series analytics, stock-market forecasting, and Natural Language Translation, where word order is critical.
Recurrent Neural Networks (RNNs) solve this by adding a "Memory Loop" (Hidden State). They pass the output of the previous word calculation physically back into themselves as a secondary input for the next word calculation.
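The memory loop can be sketched in a few lines of NumPy: the same weight matrices are reused at every timestep, and the hidden state `h` carries information forward. This is a toy illustration with made-up sizes, not TensorFlow's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 4))   # input -> hidden weights (hypothetical sizes)
W_h = rng.normal(size=(4, 4))   # hidden -> hidden weights: the "memory loop"
b = np.zeros(4)

h = np.zeros(4)                 # hidden state starts empty
for x_t in rng.normal(size=(5, 3)):        # 5 timesteps, 3 features each
    h = np.tanh(x_t @ W_x + h @ W_h + b)   # same weights reused every step
print(h.shape)  # final hidden state summarizes the whole sequence
```

Because `h` is an input to its own next update, word #5's hidden state depends on every word before it.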
Imagine reading a murder mystery novel.
A Dense Network reads page 200, realizes someone was shot, and tries to guess the killer based solely on the words printed on page 200. It's impossible.
An LSTM (Long Short-Term Memory) reads page 1, takes notes on a clipboard, and carries the clipboard to page 2. It reads page 2 alongside its clipboard notes. If it reads that "The Butler went to France indefinitely", it mathematically erases the Butler from its clipboard (the Forget Gate) because he is no longer relevant. By page 200, its clipboard contains a condensed, weighted vector of the relevant history needed to solve the murder.
import tensorflow as tf
from tensorflow.keras import layers, models
# Scenario: Sentiment Analysis for Movie Reviews
# Each input is a sequence of 50 integer word indices
# X_train shape: (1000, 50) | y_train shape: (1000,) [0=Bad, 1=Good]
model = models.Sequential()
# Embed: Converts (1000, 50) -> (1000, 50, 128)
model.add(layers.Embedding(input_dim=20000, output_dim=128))
# LSTM 1: Stacked RNN requires returning the full sequence to the next layer!
# Processes (1000, 50, 128) -> (1000, 50, 64)
model.add(layers.LSTM(64, return_sequences=True, dropout=0.2))
# LSTM 2: Final processor. Collapses the 50 timesteps into ONE singular thought vector.
# Processes (1000, 50, 64) -> (1000, 32)
model.add(layers.LSTM(32, return_sequences=False))
# Binary Classification Head
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=3)
| Code Line | Explanation |
|---|---|
| `return_sequences=True` | A classic pain point for engineers. If an LSTM is set to `False` (the default), it reads all 50 words internally, throws away the intermediate steps, and outputs only the final "Memory Output State" vector. If set to `True`, it emits an output at every single word step, producing a 3D tensor (here 50 timesteps x 64 units), which is explicitly required if you want to stack a second LSTM directly on top of it. |
| `layers.Dense(1, activation='sigmoid')` | The LSTM is an information-compressing engine. Once it crushes the entire 50-word review into a 32-float summary vector, we hand that vector to a standard Dense layer, which multiplies the 32 floats by weights, adds a bias, and bends the result through a sigmoid curve into a `0` to `1` probability. |
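The shape difference in the table can be checked directly. This is a small sketch using random input; the batch size of 2 is arbitrary.

```python
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(2, 50, 128).astype("float32")   # (batch, timesteps, features)
seq = layers.LSTM(64, return_sequences=True)(x)    # one output per timestep
last = layers.LSTM(64, return_sequences=False)(x)  # only the final state
print(seq.shape)   # 3D: an output for every word
print(last.shape)  # 2D: one summary vector per review
```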
How does calculus trace an error backwards through "Time"?
A basic Recurrent Neural Network (RNN) acts like a `for` loop. If a sentence has 50 words, TensorFlow conceptually unrolls the single RNN block into 50 copies lined up side-by-side. The error from the `Dense` layer travels backward through all 50 unrolled copies, applying the calculus chain rule 50 times (Backpropagation Through Time, BPTT). If each step's chain-rule factor is `0.1`, then `0.1 ^ 50 = 1e-50`, which is effectively zero. The earliest timesteps (representing the first words of the review) never receive a meaningful error update, so the network permanently suffers from short-term memory loss (the Vanishing Gradient Problem).
Because vanilla RNNs mathematically failed due to Vanishing Gradients, engineers developed the Long Short-Term Memory (LSTM) Cell.
Inside an LSTM block, there is an internal "Highway" called the Cell State \((C_t)\). Because the cell state is updated mostly by addition rather than by repeated matrix multiplication, gradients can flow backward down this highway across many timesteps with far less of the fractional shrinkage that kills vanilla RNNs.
The LSTM protects this highway using three neural "Gates" (sigmoid valves that output values between 0 and 1):
1. Forget Gate: decides what fraction [0.0 to 1.0] of the historical highway data to delete.
2. Input Gate: decides what fraction [0.0 to 1.0] of the NEW word's candidate values to insert into the highway.
3. Output Gate: passes the highway data through `Tanh` and reveals a gated subset of it as the output prediction.
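The three gates can be written out as plain NumPy equations. This is a toy single-step cell with made-up sizes, following the standard LSTM formulation rather than any particular library's internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM timestep; W, U, b hold all four gate weights stacked."""
    z = x @ W + h @ U + b                 # all four pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates: valves in [0, 1]
    g = np.tanh(g)                        # candidate values for the new word
    c = f * c + i * g                     # forget old, insert new (the highway)
    h = o * np.tanh(c)                    # output gate reveals a subset
    return h, c

rng = np.random.default_rng(1)
n_in, n_h = 3, 4                          # hypothetical sizes
W = rng.normal(size=(n_in, 4 * n_h))
U = rng.normal(size=(n_h, 4 * n_h))
b = np.zeros(4 * n_h)

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_in)):      # 5 timesteps
    h, c = lstm_step(x, h, c, W, U, b)
```

Note that the cell-state update `c = f * c + i * g` is additive: gradients flowing back through it are scaled by the forget gate, not crushed by a full matrix multiplication at every step.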
If you read the sentence: "The bank of the river...", you know "bank" implies water. If you read: "The bank was robbed...", "bank" implies money. An LSTM reading left-to-right hits the word "bank" at Step 2 and has absolutely no idea what the context is, because the word "robbed" hasn't been read yet!
Solution: Bidirectional(LSTM(64)). TensorFlow creates two completely independent LSTM cores. Core A reads the sentence normally (left to right). Core B reads the sentence entirely backward (right to left). Their output vectors are then concatenated `(64 + 64 = 128)`, so the model possesses both "Past Context" and "Future Context" simultaneously for every single word.
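A quick sketch confirming the concatenated output width (random input; the batch size of 2 is arbitrary):

```python
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(2, 50, 128).astype("float32")
y = layers.Bidirectional(layers.LSTM(64))(x)  # forward core + backward core
print(y.shape)  # last axis is 64 + 64 = 128
```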
LSTMs are heavy on GPU RAM because they compute 4 internal weight matrices per timestep.
The Gated Recurrent Unit (GRU) is a significant architectural simplification. It merges the Cell State "Highway" into the hidden state itself and merges the Forget and Input gates into a single unified Update Gate, needing 3 weight matrices instead of 4 (roughly 25% fewer parameters). It trains noticeably faster on modern hardware and reaches accuracy comparable to LSTMs on most text tasks.
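The parameter savings can be checked with Keras itself. A sketch reusing the 128-feature, 50-timestep input from the example above (exact GRU counts vary slightly with the `reset_after` bias convention):

```python
from tensorflow.keras import layers, models

lstm = models.Sequential([layers.Input(shape=(50, 128)), layers.LSTM(64)])
gru = models.Sequential([layers.Input(shape=(50, 128)), layers.GRU(64)])

# LSTM: 4 weight groups (forget, input, output, candidate)
# GRU:  3 weight groups (update, reset, candidate)
print(lstm.count_params(), gru.count_params())
```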
Mistake: Using the TimeDistributed wrapper incorrectly.
If you want to classify the overall Movie Review as `Good` or `Bad`, you place a single Dense(1) at the very end of the network (after the LSTM has collapsed the sequence).
If you are building an AI to tag Parts-of-Speech (Noun, Verb, Adjective) for EVERY single word in the 50-word sequence, you must use return_sequences=True and wrap the classifier: TimeDistributed(Dense(3, activation='softmax')). This tells TensorFlow to apply the same Dense classifier (shared weights) independently at every one of the 50 intermediate timestep outputs along the chain.
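A minimal sketch of the per-word tagging setup. The 3-class softmax head follows the text; the input here is random placeholder data, not a real tagging corpus.

```python
import numpy as np
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),
    layers.LSTM(64, return_sequences=True),          # keep all 50 timesteps
    layers.TimeDistributed(                          # same classifier per word
        layers.Dense(3, activation="softmax")),      # Noun / Verb / Adjective
])

x = np.random.randint(0, 20000, size=(2, 50))  # 2 fake 50-word sentences
y = model(x)
print(y.shape)  # one 3-class probability vector per word, per sentence
```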
LSTMs ruled sequence modeling from roughly 2012 to 2017. Their fatal flaw? Sequential Bottlenecking. An LSTM simply CANNOT process Word #30 until it has finished the recurrence step for Word #29. Because time is stubbornly linear, the computation cannot be parallelized across a 10,000-core TPU cluster.
In 2017, Google published the Transformer ("Attention Is All You Need"). Transformers delete Recurrence entirely. They read the whole 50-word sentence in one parallel pass (like a CNN) and use a matrix operation called "Self-Attention" to compute the relational weight between every single word and every other word simultaneously. The total work grows quadratically with sequence length, but it all runs in parallel, and that removal of the sequential bottleneck is the key mechanical reason ChatGPT exists today.
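Self-attention can be sketched in NumPy. This is a toy single-head version with made-up dimensions, ignoring masking, multiple heads, and output projections.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every word scored vs every word
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax per row
    return weights @ V                                 # context-mixed outputs

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 16))       # 50 words, 16-dim embeddings (hypothetical)
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one context-aware vector per word, computed in one pass
```

Note there is no loop over timesteps: the entire 50x50 score matrix is one matrix multiplication, which is exactly what a GPU or TPU parallelizes.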