Natural Language Processing (NLP)
Master the Tokenizer, Sequence Padding, and the
Embedding Layer.
Neural Networks are blind calculators. You cannot feed the string `"I love Python"` into a matrix multiplication. The entire discipline of Natural Language Processing (NLP) is dedicated to one goal: squeezing the complex nuance of human language into cold, rigid arrays of floating-point numbers.
This process requires two distinct mathematical bridges: Tokenization (Converting words into integer IDs) and Embeddings (Converting integer IDs into multi-dimensional spatial vectors).
Imagine reading a foreign book as a machine.
Step 1 (Tokenization): You buy a Dictionary. You look up the word "Apple". The dictionary says it's word #451. You cross out "Apple" in your book and write `451`. Your book is now just a sequence of integers: `[12, 451, 9]`. But `451` doesn't explain what a fruit actually is.
Step 2 (Embedding): You place every single word on a massive 3D Cartesian grid. You place `Apple` right next to `Orange`. You place `King` next to `Queen`. When the AI reads the coordinate `[0.5, 0.9, -0.2]` for Apple, it sees that it is mathematically surrounded by fruits: the word's position in space encodes its semantic meaning.
from tensorflow.keras.preprocessing.text import Tokenizer
# Scenario: You have massive textual data
eng = ['hi', 'hello', 'what is your name', 'thank you']
spa = ['hola', 'que tal', 'como estas', 'gracias']
# 1. The Tokenizer Engine
token_eng = Tokenizer()
# 2. Scanning Phase
# The AI reads every letter, splits by spaces, strips punctuation,
# and builds a massive Dictionary mapping string keys to int values.
token_eng.fit_on_texts(eng)
# 3. Translation Phase
# We feed the raw strings back in. It outputs the Integer Matrix.
x = token_eng.texts_to_sequences(eng)
print("Vocabulary Dictionary (Word -> ID):")
print(token_eng.word_index)
# Outputs: {'hi': 1, 'hello': 2, 'what': 3, 'is': 4...}
print("\nInteger Translation Matrix:")
print(x)
# Outputs: [[1], [2], [3, 4, 5, 6], [7, 8]]
| Code Line | Explanation |
|---|---|
| `Tokenizer(num_words=10000)` | English has well over 150,000 words, and most (like "aardvark") appear only rarely. The Tokenizer ranks words by frequency; `num_words=10000` tells `texts_to_sequences` to keep only the 10,000 most frequent words and silently drop the rest, keeping the vocabulary (and the Embedding matrix behind it) small. Note that `word_index` still records every word it ever saw; the cap is applied at translation time. |
| `texts_to_sequences()` | Loops through the string arrays, performs an `O(1)` dictionary lookup for each word, and outputs plain lists of integers that a Neural Network can actually multiply. |
| `[[1], [3, 4, 5, 6]]` | Neural Networks cannot process jagged arrays (Row 1 has 1 number, Row 2 has 4). The GPU requires perfectly rectangular matrices, so the `pad_sequences` function appends `0`s until every row has equal length. |
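The padding fix from the last row can be seen directly. A minimal sketch, reusing the jagged integer matrix the Tokenizer produced above:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# The jagged integer matrix produced by texts_to_sequences above
x = [[1], [2], [3, 4, 5, 6], [7, 8]]

# Append 0s so every row matches the longest row (length 4)
padded = pad_sequences(x, padding='post')
print(padded.tolist())
# [[1, 0, 0, 0], [2, 0, 0, 0], [3, 4, 5, 6], [7, 8, 0, 0]]
```

By default the rows are padded out to the longest sequence in the batch; pass `maxlen=` to cut or pad everything to a fixed length instead.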
We just turned the word "Dog" into the ID `7`, and "Cat" into `8`. By standard math rules, the AI would conclude that `8 > 7` (that Cat is somehow "more" than Dog). This is devastatingly false. Word IDs are equally-weighted categories, not hierarchical magnitudes!
In TensorFlow, `layers.Embedding(input_dim=10000, output_dim=64)` acts as a massive Lookup Table (a trainable weights matrix). When the integer `8` hits the layer, no multiplication happens at all: the layer simply grabs Row 8 of its matrix. Row 8 contains exactly 64 floating-point decimals (e.g. `[0.1, -0.4, 0.99, ...]`). The Embedding has swapped the meaningless ID `8` for a rich 64-dimensional spatial coordinate vector. During training, gradient descent then adjusts those 64 numbers, gradually dragging the "Cat" vector closer to the "Dog" vector in embedding space over many epochs.
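The "grab Row 8" behaviour is easy to verify outside Keras. A NumPy sketch (random toy weights, not a trained embedding) showing that the lookup is equivalent to multiplying a one-hot vector by the matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10000, 64
weights = rng.normal(size=(vocab_size, dim))  # the layer's trainable matrix

token_id = 8
vector = weights[token_id]  # the "lookup": grab Row 8, no multiplication

# Equivalent to one-hot @ matrix, which is why the layer can skip the math
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
assert np.allclose(vector, one_hot @ weights)
print(vector.shape)  # (64,)
```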
This is the most confusing aspect of NLP architecture:
Input: `[Batch_Size, Sentence_Length]` (e.g., A batch of 32 sentences, each cut exactly to 50 words via Padding). The Matrix is `(32, 50)` containing integers.
Through Embedding: every single integer is expanded into a 128-float vector.
Output: `[Batch_Size, Sentence_Length, Embedding_Dim]`. The tensor expands dimensionally to `(32, 50, 128)`. Your 2D spreadsheet just became a 3D Rubik's Cube of data, ready to be fed into sequential layers like 1D Convolutions or LSTMs.
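A quick shape check confirms the 2D-to-3D expansion (a toy batch of random IDs stands in for real padded sentences):

```python
import numpy as np
import tensorflow as tf

# A batch of 32 padded sentences, 50 integer IDs each
batch = np.random.randint(1, 10000, size=(32, 50))

emb = tf.keras.layers.Embedding(input_dim=10000, output_dim=128)
out = emb(batch)
print(out.shape)  # (32, 50, 128): the 2D matrix became a 3D tensor
```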
When you put your NLP Sentiment Analysis AI into Production, a teenager types a tweet containing the slang word "Rizz".
Your Tokenizer was trained in 2018. It searches its dictionary for "Rizz", fails to find it, and by default silently drops the word from the sequence, so your model quietly scores a sentence with a hole in it (a hand-rolled dictionary lookup in your own pipeline would crash with a Python `KeyError` instead).
Solution: You must always initialize `Tokenizer(oov_token='<UNK>')`. During evaluation, if a word escapes the dictionary, the Tokenizer replaces it with the reserved ID for `<UNK>` (Unknown). The Neural Net processes it as "an unknown word" instead of never seeing it at all, preventing both the silent data loss and the server explosion.
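A minimal sketch of the `oov_token` behaviour on a tiny toy corpus (Keras reserves the lowest ID, `1`, for the OOV token):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer(oov_token='<UNK>')
tok.fit_on_texts(['hi hello', 'thank you'])  # 'rizz' is never seen here

seqs = tok.texts_to_sequences(['hi rizz'])
print(tok.word_index['<UNK>'])  # 1 -- the reserved OOV ID
print(seqs)                     # 'rizz' becomes ID 1 instead of vanishing
```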
Training an Embedding Layer from scratch takes serious GPU time: you are trying to teach a machine the meaning of the entire English vocabulary using nothing but your own dataset.
Instead, Data Scientists use Transfer Learning via Word2Vec or GloVe (Global Vectors). Stanford's NLP group trained GloVe embedding matrices on billions of tokens of Wikipedia and newswire text and released the raw `.txt` files of weights for free. You parse the file, dump Stanford's floating-point coordinates directly into your Keras Embedding layer, freeze the weights, and your tiny home AI understands the spatial semantic relationships of the English language before epoch 1 even begins.
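A sketch of that loading step, using a toy two-line corpus in GloVe's text format (a real downloaded file such as `glove.6B.100d.txt` parses the same way; the `load_glove` helper here is ours, not a library function):

```python
import numpy as np

def load_glove(lines, word_index, dim, num_words):
    """Fill an Embedding weight matrix from GloVe-format lines
    ('word v1 v2 ... vD'). Row 0 stays all-zero for the padding ID;
    words missing from the file also stay at zero."""
    matrix = np.zeros((num_words, dim))
    for line in lines:
        parts = line.split()
        word, vec = parts[0], np.asarray(parts[1:], dtype='float32')
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            matrix[idx] = vec
    return matrix

# Toy two-line 'file' with made-up numbers, standing in for the real download
lines = ['dog 0.1 0.2 0.3', 'cat 0.2 0.1 0.3']
matrix = load_glove(lines, word_index={'dog': 1, 'cat': 2}, dim=3, num_words=10)

# Hand it to Keras and freeze it:
#   Embedding(input_dim=10, output_dim=3, weights=[matrix], trainable=False)
```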
Mistake: Post-Padding vs Pre-Padding for Sequence Models (RNNs).
Why is this a problem?: pad_sequences(padding='post') places trailing zeros at the end: `[12, 45, 9, 0, 0, 0]`. An LSTM reads left to right, so its final hidden state — the one handed to the output layer — is computed only after it has churned through three meaningless zeros, diluting whatever it remembered of the actual words. Fix: unless you use masking, sequence models should use padding='pre': `[0, 0, 0, 12, 45, 9]`. The network wades through the zeros early, and the most recent real token (`9`) is processed immediately before the output.
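Both layouts side by side, using the `12, 45, 9` example from above:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

seq = [[12, 45, 9]]
post = pad_sequences(seq, maxlen=6, padding='post')
pre = pad_sequences(seq, maxlen=6, padding='pre')
print(post.tolist())  # [[12, 45, 9, 0, 0, 0]] -- zeros dilute the final state
print(pre.tolist())   # [[0, 0, 0, 12, 45, 9]] -- real tokens arrive last
```

Alternatively, building the Embedding layer with `mask_zero=True` tells downstream Keras layers to skip the padded timesteps entirely, whichever side the zeros are on.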
Word Tokenization (mapping space-separated words) is on its way out. It requires massive 100,000+ word vocabularies to work, and fails on typos ("hellooo") and morphology-rich languages.
Modern Large Language Models (LLMs) like ChatGPT use BPE (Byte-Pair Encoding) sub-word tokenization. Instead of tokenizing "unbelievable" as one unit, it breaks the word into frequent chunks, e.g. `["un", "believ", "able"]`. This mechanism keeps the entire vocabulary to a fixed budget (GPT-2's tokenizer uses roughly 50,000 sub-words), letting the model parse misspelled words, unseen medical jargon, and brand-new slang by snapping the Lego blocks back together.
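The core BPE training loop is small enough to sketch in pure Python. This toy version (an illustration of the merge rule, not a production tokenizer) repeatedly fuses the most frequent adjacent symbol pair:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a toy corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """One BPE step: fuse the chosen pair into a single symbol everywhere."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word pre-split into characters
words = {tuple('lower'): 5, tuple('lowest'): 2, tuple('low'): 7}
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))
print(words)
# 'l'+'o' then 'lo'+'w' merge, discovering the shared stem 'low':
# {('low', 'e', 'r'): 5, ('low', 'e', 's', 't'): 2, ('low',): 7}
```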