Building Models: Sequential vs Functional API

πŸ—οΈ
Sequential / Model (Functional) / Model subclassing
Choose the right abstraction: from simple stacks to multi-input/output architectures
python
import tensorflow as tf
from tensorflow import keras

# Sequential: linear stack, no branching
model = keras.Sequential([
    keras.layers.Input(shape=(128,)),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1,   activation='sigmoid')  # binary
], name='binary_clf')

model.summary()
print(f'Total params: {model.count_params():,}')
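The `count_params()` figure can be reproduced by hand, which is a useful sanity check when debugging shapes. A minimal sketch of the arithmetic (helper names are mine); note that `BatchNormalization` contributes 4 parameters per channel (gamma, beta, and the two non-trainable moving statistics), while `Dropout` adds none:

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weight matrix + bias vector

def batchnorm_params(n):
    return 4 * n                         # gamma, beta, moving mean, moving variance

# Input(128) -> Dense(256) -> BN -> Dense(128) -> Dense(1)
total = (dense_params(128, 256)
         + batchnorm_params(256)
         + dense_params(256, 128)
         + dense_params(128, 1))
print(total)  # 67073, matching model.count_params()
```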
python
# Functional API: branching, shared weights, multi-input/output
# Dual-input: tabular + text features
num_input  = keras.Input(shape=(30,),   name='numeric')
text_input = keras.Input(shape=(100,),  name='text_ids')

# Numeric branch
x1 = keras.layers.Dense(64, activation='relu')(num_input)
x1 = keras.layers.Dropout(0.2)(x1)

# Text branch
x2 = keras.layers.Embedding(10000, 32)(text_input)
x2 = keras.layers.GlobalAveragePooling1D()(x2)
x2 = keras.layers.Dense(64, activation='relu')(x2)

# Merge branches
merged = keras.layers.Concatenate()([x1, x2])
out    = keras.layers.Dense(1, activation='sigmoid')(merged)

model = keras.Model(
    inputs=[num_input, text_input],
    outputs=out, name='multi_input_clf'
)
# Forward pass: pass dict or list
model.predict({'numeric': X_num, 'text_ids': X_text})
python
# Model subclassing: maximum flexibility
class MLPBlock(keras.Model):
    def __init__(self, units=(256, 128), dropout=0.3):  # tuple: avoid mutable default
        super().__init__()
        self.dense_layers   = [keras.layers.Dense(u, activation='relu') for u in units]
        self.dropout_layers = [keras.layers.Dropout(dropout) for _ in units]
        self.bn_layers      = [keras.layers.BatchNormalization() for _ in units]
        self.output_layer   = keras.layers.Dense(1, activation='sigmoid')

    def call(self, x, training=False):
        for dense, drop, bn in zip(self.dense_layers, self.dropout_layers, self.bn_layers):
            x = dense(x)
            x = bn(x, training=training)
            x = drop(x, training=training)
        return self.output_layer(x)

model = MLPBlock(units=[256, 128, 64])
| API         | Flexibility | Debugging | Best For                                      |
|-------------|-------------|-----------|-----------------------------------------------|
| Sequential  | ⭐ Low       | ✅ Easy    | Simple linear stacks, tutorials, prototyping  |
| Functional  | ⭐⭐⭐ High    | ✅ Easy    | Multi-input/output, residuals, shared weights |
| Subclassing | ⭐⭐⭐⭐ Max    | ⚠️ Harder | Research models, dynamic graphs, custom ops   |

Core Layers: Dense / Conv2D / LSTM / Embedding

Conv2D / MaxPooling2D / GlobalAveragePooling2D
Spatial feature extraction for image classification and computer vision tasks
keras.layers.Conv2D(
    filters=32,               # number of learnable kernels
    kernel_size=(3, 3),       # (height, width) of filter
    strides=(1, 1),
    padding='same',           # 'valid'=no pad; 'same'=output size preserved (stride 1)
    activation='relu',
    kernel_regularizer=keras.regularizers.l2(1e-4)
)
keras.layers.MaxPooling2D(pool_size=(2, 2))
keras.layers.GlobalAveragePooling2D()  # replaces Flatten+Dense for efficiency
python
# CNN for 32x32 RGB image classification (CIFAR-10 style)
inputs = keras.Input(shape=(32, 32, 3))

# Block 1
x = keras.layers.Conv2D(32, (3,3), padding='same', activation='relu')(inputs)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Conv2D(32, (3,3), padding='same', activation='relu')(x)
x = keras.layers.MaxPooling2D((2,2))(x)
x = keras.layers.Dropout(0.25)(x)

# Block 2
x = keras.layers.Conv2D(64, (3,3), padding='same', activation='relu')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Conv2D(64, (3,3), padding='same', activation='relu')(x)
x = keras.layers.GlobalAveragePooling2D()(x)  # 64-dim vector
x = keras.layers.Dropout(0.4)(x)
outputs = keras.layers.Dense(10, activation='softmax')(x)

model = keras.Model(inputs, outputs, name='cnn_cifar')

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
  • Output shape: With same padding: H_out = ⌈H_in/stride⌉. With valid: H_out = ⌈(H_in - kernel + 1)/stride⌉.
  • GlobalAveragePooling2D vs Flatten: GAP reduces spatial dims to a single mean per channel (no params). Flatten preserves all spatial info but creates a large Dense layer.
  • Receptive field: Stack multiple 3×3 convolutions instead of one large kernel. Two 3×3 convolutions have the same receptive field as one 5×5 with fewer weights per filter slice (18 vs 25) and an extra non-linearity.
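The output-shape formulas in the first bullet can be sanity-checked with plain arithmetic; a quick sketch (the function name is mine):

```python
import math

def conv_out(h_in, kernel, stride, padding):
    # 'same': output depends only on stride; 'valid': no padding added
    if padding == 'same':
        return math.ceil(h_in / stride)
    return math.ceil((h_in - kernel + 1) / stride)

print(conv_out(32, 3, 1, 'same'))   # 32: spatial size preserved
print(conv_out(32, 3, 1, 'valid'))  # 30: loses kernel-1 pixels
print(conv_out(32, 3, 2, 'same'))   # 16: halved by stride
```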
LSTM / GRU / Bidirectional
Sequential modelling for text, time-series, and event data: return sequences and states
keras.layers.LSTM(
    units=64,                 # hidden state dimension
    return_sequences=False,   # True for stacked LSTM / seq2seq
    return_state=False,       # True to access h_t and c_t
    dropout=0.2,
    recurrent_dropout=0.0     # dropout on recurrent connections
)
keras.layers.Bidirectional(
    keras.layers.LSTM(64, return_sequences=True),
    merge_mode='concat'       # or 'sum', 'mul', 'ave'
)
python
# Sentiment analysis: Embedding + BiLSTM + Dense
VOCAB_SIZE, EMB_DIM, MAX_LEN = 20000, 64, 200

inputs = keras.Input(shape=(MAX_LEN,))
x = keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = keras.layers.Bidirectional(
        keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Bidirectional(
        keras.layers.LSTM(32))(x)   # last layer: return_sequences=False
x = keras.layers.Dense(32, activation='relu')(x)
x = keras.layers.Dropout(0.4)(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', keras.metrics.AUC(name='auc')])

# Time-series: multi-step prediction
ts_input = keras.Input(shape=(30, 5))  # 30 timesteps, 5 features
ts_out   = keras.layers.LSTM(64, return_sequences=False)(ts_input)
ts_out   = keras.layers.Dense(7)(ts_out)  # predict next 7 steps
ts_model = keras.Model(ts_input, ts_out)
  • LSTM vs GRU: GRU has fewer parameters (2 gates vs 3 in LSTM) and is faster. Performance is often comparable. Use GRU as default; LSTM when long-range dependencies matter.
  • return_sequences=True: Returns the output at every timestep (required for stacked LSTMs or attention). False returns only the final timestep (for classification).
  • mask_zero=True: In Embedding, tells downstream layers to ignore padded 0s. Critical for variable-length sequences.
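The "fewer parameters" claim above is easy to verify from the gate counts; a minimal sketch (helper names are mine, and the GRU formula follows Keras' default reset_after=True, which keeps separate input and recurrent biases):

```python
def lstm_param_count(input_dim, units):
    # 4 weight blocks: input, forget, output gates + cell candidate
    return 4 * (units * (input_dim + units) + units)

def gru_param_count(input_dim, units):
    # 3 weight blocks: update, reset gates + hidden candidate;
    # reset_after=True doubles the bias vector per block
    return 3 * (units * (input_dim + units) + 2 * units)

# LSTM(64) vs GRU(64) on a 64-dim embedding, as in the example above
print(lstm_param_count(64, 64))  # 33024
print(gru_param_count(64, 64))   # 24960
```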

tf.data Pipeline

tf.data.Dataset: from_tensor_slices / map / batch / prefetch / cache
High-performance data loading and augmentation pipeline that eliminates GPU idle time
dataset = (tf.data.Dataset.from_tensor_slices((X, y))
           .shuffle(buffer_size, seed=42)
           .map(preprocess_fn, num_parallel_calls=tf.data.AUTOTUNE)
           .cache()                        # cache after expensive map
           .batch(batch_size, drop_remainder=False)
           .prefetch(tf.data.AUTOTUNE))    # overlap CPU prep + GPU train
python
# Full training pipeline with augmentation
AUTOTUNE = tf.data.AUTOTUNE

def augment_image(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label

def make_dataset(X, y, batch_size=32, training=False):
    ds = tf.data.Dataset.from_tensor_slices((X, y))
    ds = ds.cache()   # cache raw samples BEFORE random augmentation
    if training:
        ds = ds.shuffle(len(X), seed=42)
        ds = ds.map(augment_image, num_parallel_calls=AUTOTUNE)
    return ds.batch(batch_size).prefetch(AUTOTUNE)

train_ds = make_dataset(X_train, y_train, training=True)
val_ds   = make_dataset(X_val,   y_val,   training=False)

# Load from files (TFRecord)
raw_ds = tf.data.TFRecordDataset(filenames)
ds = raw_ds.map(parse_fn, num_parallel_calls=AUTOTUNE)

model.fit(train_ds, validation_data=val_ds, epochs=30)
  • AUTOTUNE: Let TF tune parallelism automatically. Always use num_parallel_calls=tf.data.AUTOTUNE in map() and prefetch(tf.data.AUTOTUNE).
  • cache() placement: Place cache() after deterministic work (decode/resize) but before random augmentation, which must differ each epoch. If the data fits in RAM, cache right after the expensive map.
  • shuffle buffer: Perfect shuffling requires buffer_size equal to the dataset size; smaller buffers bias the ordering. When the dataset is too large to buffer fully, use as large a buffer as memory allows and shuffle at the file level as well.

Compile, Training & Callbacks

model.compile / model.fit / EarlyStopping / ReduceLROnPlateau / ModelCheckpoint
Configure training, prevent overfitting with patience, and save best weights automatically
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.AUC()]
)

# Core callbacks
keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10,
    restore_best_weights=True, min_delta=1e-4
)
keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5,
    min_lr=1e-6, verbose=1
)
keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras', monitor='val_auc',
    mode='max', save_best_only=True
)
python
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_auc', mode='max',
        patience=15, restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=7, min_lr=1e-7
    ),
    keras.callbacks.ModelCheckpoint(
        'checkpoints/best.keras', save_best_only=True,
        monitor='val_auc', mode='max'
    ),
    keras.callbacks.TensorBoard(log_dir='logs/', histogram_freq=1)
]

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=200,             # EarlyStopping will stop early
    callbacks=callbacks,
    class_weight={0: 1.0, 1: 5.0},  # imbalanced: penalize false negatives
    verbose=1
)

# Load best saved model
best_model = keras.models.load_model('checkpoints/best.keras')
y_prob = best_model.predict(X_test, batch_size=256)[:, 0]
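The class_weight dict passed to fit() can be derived from label frequencies instead of hand-picked; a sketch using the common n_samples / (n_classes * count_c) "balanced" heuristic (the function name is mine):

```python
from collections import Counter

def balanced_class_weights(labels):
    # weight_c = n_samples / (n_classes * count_c):
    # rare classes get proportionally larger weights
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# A 90/10 imbalance yields a weight of 5.0 on the rare class
weights = balanced_class_weights([0] * 90 + [1] * 10)
print(weights)
```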

Custom Training Loop

GradientTape: Custom Training Loop
Full control over forward pass, gradient computation, and metric updates, for research use
python
optimizer = keras.optimizers.Adam(1e-3)
loss_fn   = keras.losses.BinaryCrossentropy()

train_loss  = keras.metrics.Mean(name='train_loss')
train_auc   = keras.metrics.AUC(name='train_auc')

@tf.function                            # compile to graph for speed
def train_step(X_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(X_batch, training=True)
        loss   = loss_fn(y_batch, y_pred)
        loss  += sum(model.losses)           # L2 reg losses

    grads = tape.gradient(loss, model.trainable_variables)
    # Gradient clipping (prevents exploding gradients in RNNs)
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    train_loss.update_state(loss)
    train_auc.update_state(y_batch, y_pred)

for epoch in range(NUM_EPOCHS):
    train_loss.reset_state(); train_auc.reset_state()
    for X_batch, y_batch in train_ds:
        train_step(X_batch, y_batch)
    print(f'Epoch {epoch+1}: loss={train_loss.result():.4f}, '
          f'AUC={train_auc.result():.4f}')
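tf.clip_by_global_norm in the step above rescales the entire gradient list by one shared factor, so gradient direction is preserved. Its arithmetic is easy to mirror in plain Python (a sketch over lists of floats; the helper name is mine):

```python
import math

def clip_by_global_norm(grads, clip_norm):
    # Global norm over every element of every gradient "tensor";
    # when it exceeds clip_norm, all gradients shrink by the same scale.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

# A gradient of norm 5 gets rescaled onto the unit ball
clipped, norm = clip_by_global_norm([[3.0, 4.0]], clip_norm=1.0)
print(norm, clipped)  # norm 5.0, direction preserved at unit length
```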

Transfer Learning & Fine-tuning

keras.applications: Transfer Learning Pattern
Freeze pretrained weights, add classification head, unfreeze for fine-tuning
python
# Step 1: Load pretrained base without top
base_model = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,         # drop classifier
    weights='imagenet'
)
base_model.trainable = False    # freeze base weights

# Step 2: Add custom classification head
inputs  = keras.Input(shape=(224, 224, 3))
x = keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base_model(x, training=False)     # BN in eval mode
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(num_classes, activation='softmax')(x)
model = keras.Model(inputs, outputs)

# Step 3: Train head only (fast)
model.compile('adam', 'sparse_categorical_crossentropy', ['accuracy'])
model.fit(train_ds, epochs=10, validation_data=val_ds)

# Step 4: Unfreeze top layers for fine-tuning
base_model.trainable = True
# Only fine-tune top 30 layers
for layer in base_model.layers[:-30]:
    layer.trainable = False

# Use lower LR for fine-tuning
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, epochs=20, validation_data=val_ds)