Mathematical Representation of Language
1. Concept Introduction

Neural networks are blind calculators: you cannot feed the string `"I love Python"` into a matrix multiplication. A large part of Natural Language Processing (NLP) is dedicated to a single goal: squeezing the nuance of human language into rigid arrays of floating-point numbers.

This process requires two distinct mathematical bridges: tokenization (converting words into integer IDs) and embeddings (converting integer IDs into multi-dimensional spatial vectors).

2. Concept Intuition

Imagine reading a foreign book as a machine.

Step 1 (Tokenization): You buy a Dictionary. You look up the word "Apple". The dictionary says it's word #451. You cross out "Apple" in your book and write `451`. Your book is now just a sequence of integers: `[12, 451, 9]`. But `451` doesn't explain what a fruit actually is.
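The dictionary lookup above can be sketched in a few lines of plain Python (the vocabulary values are the made-up IDs from the analogy, not output of any real tokenizer):

```python
# A toy "dictionary" mapping words to integer IDs (invented values).
vocab = {"i": 12, "apple": 451, "tastes": 9}

sentence = "i apple tastes"

# Tokenization: replace each word with its integer ID.
ids = [vocab[word] for word in sentence.split()]
print(ids)  # [12, 451, 9]
```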

Step 2 (Embedding): You place every word on a massive 3D Cartesian grid. `Apple` sits physically next to `Orange`; `King` sits next to `Queen`. When the model reads the coordinate `[0.5, 0.9, -0.2]` for Apple, it can tell that Apple is mathematically surrounded by other fruits: meaning is encoded in geometry. (Real embeddings use dozens or hundreds of dimensions rather than three, but the intuition is the same.)
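This "closeness" intuition can be made concrete with cosine similarity. The 3-D coordinates below are invented for illustration; real embeddings are learned, not hand-picked:

```python
import numpy as np

# Toy 3-D "coordinates" for three words (invented values).
vectors = {
    "apple":  np.array([0.5, 0.9, -0.2]),
    "orange": np.array([0.6, 0.8, -0.1]),
    "king":   np.array([-0.7, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means "pointing the same way", -1.0 opposite.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "apple" sits much closer to "orange" than to "king" in this space.
print(cosine(vectors["apple"], vectors["orange"]))
print(cosine(vectors["apple"], vectors["king"]))
```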

3. Python Syntax
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

# 1. Boot up the dictionary builder
tokenizer = Tokenizer(num_words=10000, oov_token="<UNK>")

# 2. Learn the vocabulary (training_sentences is your list of raw strings)
tokenizer.fit_on_texts(training_sentences)

# 3. Convert strings to integer sequences
sequences = tokenizer.texts_to_sequences(training_sentences)

# 4. Standardize the sequence lengths (padding)
padded_seqs = pad_sequences(sequences, maxlen=50, padding='post')
```
4. Python Code Example
```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Scenario: you have parallel textual data
eng = ['hi', 'hello', 'what is your name', 'thank you']
spa = ['hola', 'que tal', 'como estas', 'gracias']

# 1. The Tokenizer engine
token_eng = Tokenizer()

# 2. Scanning phase
# The tokenizer lowercases the text, splits on spaces, strips punctuation,
# and builds a dictionary mapping string keys to integer IDs.
token_eng.fit_on_texts(eng)

# 3. Translation phase
# Feed the raw strings back in; it outputs the integer sequences.
x = token_eng.texts_to_sequences(eng)

print("Vocabulary Dictionary (Word -> ID):")
print(token_eng.word_index)
# Outputs: {'hi': 1, 'hello': 2, 'what': 3, 'is': 4, ...}

print("\nInteger Translation Matrix:")
print(x)
# Outputs: [[1], [2], [3, 4, 5, 6], [7, 8]]
```
6. Internal Mechanism (The Embedding Layer)

We just turned the word "Dog" into the ID `7`, and "Cat" into `8`. If we fed these raw IDs straight into the network, standard arithmetic would treat them as magnitudes: `8 > 7`, as if Cat were somehow "more" than Dog. That is false. The IDs are arbitrary category labels, not hierarchical quantities.

In TensorFlow, `layers.Embedding(input_dim=10000, output_dim=64)` acts as a giant lookup table (a weights matrix). When the integer `8` hits the layer, no multiplication is performed: the layer simply grabs row 8 of the matrix. Row 8 contains exactly 64 floating-point numbers (e.g. `[0.1, -0.4, 0.99, ...]`). The embedding has swapped the arbitrary ID `8` for a rich 64-dimensional coordinate vector. During training, gradient descent adjusts these 64 numbers, gradually pulling the "Cat" vector closer to the "Dog" vector as the network learns that the two words appear in similar contexts.
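The lookup-table mechanism can be mimicked with a plain NumPy matrix. The shapes match the `Embedding(input_dim=10000, output_dim=64)` layer above; the random values stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# The Embedding layer's weight matrix: one 64-float row per vocabulary ID.
embedding_matrix = rng.normal(size=(10000, 64))

cat_id = 8

# A lookup, not a multiplication: grab row 8 directly.
cat_vector = embedding_matrix[cat_id]
print(cat_vector.shape)  # (64,)
```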

7. Shape and Dimensionality Expansion

This is one of the most confusing aspects of NLP architecture:

Input: `[Batch_Size, Sentence_Length]` (e.g., a batch of 32 sentences, each padded or cut to exactly 50 tokens). The matrix is `(32, 50)` and contains integers.

Through the Embedding layer: every single integer is expanded into a 128-float vector.

Output: `[Batch_Size, Sentence_Length, Embedding_Dim]`. The tensor expands dimensionally to `(32, 50, 128)`. Your 2D spreadsheet just became a 3D block of data, ready to be fed into sequence layers like 1D convolutions or LSTMs.
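The dimensional expansion can be verified with NumPy fancy indexing, which performs the same per-ID row lookup for an entire batch at once (random data stands in for real token IDs and weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 32 padded sentences, 50 integer IDs each.
token_ids = rng.integers(0, 10000, size=(32, 50))

# An embedding table with a 128-float row per vocabulary ID.
embedding_matrix = rng.normal(size=(10000, 128))

# Indexing the table with the (32, 50) ID matrix replaces every
# integer with its 128-dimensional row.
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (32, 50, 128)
```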

8. Edge Cases (Out of Vocabulary - OOV)

When you put your NLP Sentiment Analysis AI into Production, a teenager types a tweet containing the slang word "Rizz".

Your Tokenizer was trained on 2018 data. If it was built without an `oov_token`, it searches its dictionary for "Rizz", fails to find it, and silently drops the word from the output sequence; the model never even learns that a word was there, and predictions quietly degrade. Solution: always initialize `Tokenizer(oov_token='<UNK>')`. Then any word missing from the dictionary is mapped to the reserved `<UNK>` (unknown) ID instead of being dropped, and the neural net processes it explicitly as "an unknown word".
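The idea behind `oov_token` can be sketched with a plain dictionary lookup that falls back to a reserved ID (the vocabulary here is invented for illustration):

```python
# Reserve ID 1 for unknown words, mirroring Tokenizer(oov_token='<UNK>').
word_index = {"<UNK>": 1, "great": 2, "movie": 3}
UNK_ID = word_index["<UNK>"]

def encode(sentence):
    # Unseen words fall back to the <UNK> ID instead of being dropped.
    return [word_index.get(w, UNK_ID) for w in sentence.lower().split()]

print(encode("great movie"))  # [2, 3]
print(encode("rizz movie"))   # [1, 3]
```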

9. Variations & Alternatives (Pre-Trained Embeddings)

Training an Embedding layer from scratch demands enormous amounts of text and compute: you are trying to teach a machine the meaning of the entire English vocabulary purely from your own dataset.

Instead, data scientists use transfer learning via Word2Vec or GloVe (Global Vectors). Stanford's NLP group trained GloVe embedding matrices on huge corpora (Wikipedia, Common Crawl) and released the weight files as plain text. You parse the file, copy the pre-trained floating-point coordinates into your Keras Embedding layer, freeze the weights, and your model starts with a usable map of English semantics before epoch 1 even begins.
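The loading step can be sketched as follows. GloVe's text format is one word per line followed by its floats; the in-memory string below is a tiny stand-in for a real file such as `glove.6B.50d.txt`, with invented values and a toy `word_index`:

```python
import io
import numpy as np

# Two fake lines in GloVe's text format ("word float float ..."),
# standing in for a real downloaded file (values invented).
fake_glove_file = io.StringIO(
    "king 0.1 -0.4 0.9\n"
    "queen 0.2 -0.3 0.8\n"
)

embedding_dim = 3
word_index = {"king": 1, "queen": 2}  # normally from your fitted Tokenizer

# Rows default to zero; words found in the file get pre-trained coordinates.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for line in fake_glove_file:
    word, *values = line.split()
    if word in word_index:
        embedding_matrix[word_index[word]] = np.asarray(values, dtype="float32")

print(embedding_matrix)
```

In Keras, a matrix built this way is typically copied into the Embedding layer (e.g. via `layer.set_weights([embedding_matrix])` after the layer is built) with `trainable=False` to freeze it.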

10. Common Mistakes

Mistake: Post-Padding vs Pre-Padding for Sequence Models (RNNs).

Why is this disastrous?: pad_sequences(padding='post') places trailing zeros at the end: `[12, 45, 9, 0, 0, 0]`. If you feed this into a recurrent network (LSTM) without masking, the last thing it reads is a run of zeros; the hidden state carrying the real words gets diluted by the padding just before the final output. Fix: for sequence models, either use padding='pre' (`[0, 0, 0, 12, 45, 9]`), so the padding is consumed first and the most recent real token (9) arrives directly before the output, or enable masking (e.g. `Embedding(..., mask_zero=True)`) so the zeros are skipped entirely.
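The difference between the two modes can be sketched with a simplified single-sequence version of `pad_sequences` (real Keras also has a separate `truncating=` argument; here truncation follows the padding side for brevity):

```python
def pad(seq, maxlen, padding="pre", value=0):
    # A minimal sketch of pad_sequences for one sequence:
    # truncate to maxlen, then fill the missing slots with `value`.
    seq = seq[-maxlen:] if padding == "pre" else seq[:maxlen]
    pad_block = [value] * (maxlen - len(seq))
    return pad_block + seq if padding == "pre" else seq + pad_block

print(pad([12, 45, 9], 6, padding="post"))  # [12, 45, 9, 0, 0, 0]
print(pad([12, 45, 9], 6, padding="pre"))   # [0, 0, 0, 12, 45, 9]
```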

11. Advanced Explanation (Sub-Word Tokenization)

Word-level tokenization (mapping space-separated words) has largely been abandoned. It requires massive 100,000+ entry vocabularies to work, and it breaks on morphologically rich languages and on typos ("hellooo").

Modern Large Language Models (LLMs) like ChatGPT use sub-word tokenization such as BPE (Byte-Pair Encoding). Instead of tokenizing "unbelievable" as one unit, it breaks the word into sub-word chunks: `["un", "believ", "able"]`. This keeps the entire vocabulary to a few tens of thousands of sub-words (GPT-2's tokenizer has roughly 50,000), while still letting the model handle misspelled words, unseen medical jargon, and brand-new words by assembling the pieces like Lego blocks.
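The heart of BPE training is a simple loop: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, repeat. A minimal sketch (toy corpus, character-level start):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across every tokenized word.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# A toy corpus: each word starts as a tuple of characters with a count.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}

for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, list(words))
```

After a few merges the frequent stem "low" becomes a single symbol, while rarer suffixes like "est" stay as smaller pieces: exactly the behavior that lets BPE cover huge vocabularies with few tokens.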
