
Deep Learning with ANNs in Python

Deep Learning is a specialized branch of Machine Learning that utilizes Artificial Neural Networks (ANNs) with multiple layers to learn complex patterns from data. It excels in automatic feature extraction, performance with large datasets, and handling unstructured data, making it suitable for various applications like image and text processing. The document also covers the architecture of ANNs, their learning process, and provides code snippets for implementing basic ANNs and Convolutional Neural Networks (CNNs) using TensorFlow/Keras.

Uploaded by Bhargav Vemula

Excellent!

Deep Learning is where things get truly fascinating and powerful, especially with
today's massive datasets and computational resources. It's a specialized branch of Machine
Learning that focuses on Artificial Neural Networks (ANNs) with multiple layers, allowing them
to learn complex patterns and representations from data.

Let's dive deep into Deep Learning, covering its core concepts, popular architectures, and
practical implementation using Python with TensorFlow/Keras.

Deep Learning: The Power of Multi-Layered Neural Networks

Deep Learning is a subset of Machine Learning where the algorithms are inspired by the
structure and function of the human brain's neural networks. These are called Artificial Neural
Networks (ANNs). The "deep" in deep learning refers to the number of layers in the neural
network. A network with more than one hidden layer is generally considered a "deep" neural
network.

Key Idea: Deep Learning models learn representations of data with multiple levels of
abstraction. For example, in an image, the first layer might learn to detect edges, the second layer
might combine edges to form shapes, and subsequent layers might combine shapes to recognize
objects. This hierarchical learning allows deep neural networks to automatically discover
intricate features from raw data, bypassing the need for manual feature engineering that is often
required in traditional machine learning.

Why Deep Learning?

 Automatic Feature Extraction: Unlike traditional ML, deep learning models can
automatically learn relevant features from raw data, reducing the need for manual feature
engineering.
 Performance with Big Data: Deep learning models excel when given vast amounts of
data, often outperforming traditional ML algorithms as data size increases.
 Handling Unstructured Data: State-of-the-art results in tasks involving images, text,
audio, and video are predominantly achieved with deep learning.
 Scalability: With advancements in hardware (GPUs, TPUs) and distributed computing,
training very large deep learning models is now feasible.

1. Artificial Neural Networks (ANNs) - The Foundation

An Artificial Neural Network is a computational model inspired by the biological neural
networks in the brain. It consists of interconnected nodes (neurons) organized in layers.

Core Components of an ANN:


 Neurons (Nodes): The fundamental unit of an ANN. Each neuron receives input,
performs a calculation (weighted sum and activation), and then passes the output to
subsequent neurons.
 Layers:
o Input Layer: Receives the raw data (features). The number of neurons in this
layer typically corresponds to the number of features in your dataset.
o Hidden Layers: Layers between the input and output layers. These are where the
"deep" learning happens, as the network learns complex patterns and
representations. A network with multiple hidden layers is a deep neural network.
o Output Layer: Produces the final prediction. The number of neurons and the
activation function depend on the type of problem (e.g., 1 neuron for binary
classification/regression, multiple for multi-class classification).
 Weights (w): Numerical values that represent the strength of the connection between two
neurons. Each input to a neuron is multiplied by its corresponding weight. Weights are
the parameters that the network learns during training.
 Biases (b): An additional parameter added to the weighted sum of inputs. Biases allow
the activation function to be shifted, enabling the network to learn more complex
relationships.
 Activation Functions (σ): Non-linear functions applied to the weighted sum of
inputs plus bias. They introduce non-linearity into the network, enabling it to learn
complex, non-linear relationships in the data. Without activation functions, a neural
network would simply be a linear model, regardless of how many layers it has.
o Common Activation Functions:
 Sigmoid: σ(x) = 1 / (1 + e^(-x)). Outputs values between 0 and 1. Used in
output layers for binary classification. Prone to vanishing gradients.
 ReLU (Rectified Linear Unit): σ(x) = max(0, x). Most popular choice
for hidden layers due to its computational efficiency and effectiveness in
mitigating vanishing gradients.
 Leaky ReLU: A variant of ReLU that allows a small, non-zero gradient
when x < 0, preventing "dying ReLUs."
 Softmax: Used in the output layer for multi-class classification. It
converts a vector of numbers into a probability distribution, where the sum
of probabilities is 1.

How an ANN Learns (Forward and Backward Propagation):

1. Forward Propagation:
o Input data is fed into the input layer.
o Each neuron in a hidden layer performs a weighted sum of its inputs, adds a bias,
and applies an activation function.
o The output of one layer becomes the input for the next layer.
o This process continues until the output layer produces a prediction.

Mathematically, for a single neuron: z = ∑(w_i · x_i) + b and a = σ(z), where:

o x_i are the inputs


o w_i are the weights
o b is the bias
o z is the weighted sum
o σ is the activation function
o a is the output (activation) of the neuron
2. Loss Function: After forward propagation, the model's prediction (ŷ) is compared to
the actual target (y) using a loss function. This function quantifies the error or
discrepancy between the prediction and the true value.
o Examples: Mean Squared Error (MSE) for regression, Binary Cross-Entropy for
binary classification, Categorical Cross-Entropy for multi-class classification.
3. Backward Propagation (Backpropagation):
o The calculated loss is propagated backward through the network, from the output
layer to the input layer.
o Using gradient descent (or its variants like Adam, RMSprop), the algorithm
calculates the gradient of the loss function with respect to each weight and bias in
the network. The gradient indicates the direction and magnitude of change needed
to reduce the loss.
o The weights and biases are then updated iteratively to minimize the loss. This
process is essentially how the network "learns."
o The learning rate is a hyperparameter that determines the step size for each
update. A high learning rate can cause overshooting the minimum, while a low
learning rate can make training very slow.
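The whole cycle above can be traced by hand for a single sigmoid neuron with binary cross-entropy loss. The inputs, weights, label, and learning rate below are made up for illustration:

```python
import numpy as np

# One sigmoid neuron trained on a single example -- an illustrative sketch of
# forward propagation, loss, and one gradient-descent update.
x = np.array([0.5, -1.2, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights (normally randomly initialized)
b = 0.05                         # bias
y = 1.0                          # true label
lr = 0.1                         # learning rate

# Forward propagation: z = sum(w_i * x_i) + b, a = sigma(z)
z = np.dot(w, x) + b
a = 1.0 / (1.0 + np.exp(-z))

# Binary cross-entropy loss
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backpropagation: for sigmoid + cross-entropy, dL/dz simplifies to (a - y)
dz = a - y
dw = dz * x      # dL/dw_i = dz * x_i
db = dz

# Gradient-descent update
w -= lr * dw
b -= lr * db

# After one update, the prediction should move toward the true label y = 1
z_new = np.dot(w, x) + b
a_new = 1.0 / (1.0 + np.exp(-z_new))
print(f"before: {a:.4f}, after: {a_new:.4f}")
```

Running this shows the prediction increasing after the update, which is exactly the "learning" that repeated epochs of forward and backward propagation perform at scale.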

Epochs and Batch Size:

 Epoch: One full pass through the entire training dataset. During one epoch, the model
sees every training example once.
 Batch Size: The number of training examples utilized in one iteration. The network
updates its weights and biases after processing each batch.
o Stochastic Gradient Descent (SGD): Batch size = 1 (updates after each
example).
o Mini-Batch Gradient Descent: Batch size > 1 and < total training examples
(most common).
o Batch Gradient Descent: Batch size = total training examples (updates once per
epoch).
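As a quick sanity check of these definitions, the number of weight updates per epoch is just ceil(n_samples / batch_size); illustrative arithmetic for a hypothetical 1,000-example dataset:

```python
import math

# With 1000 training examples, one epoch performs ceil(1000 / batch_size)
# weight updates -- illustrative arithmetic, not framework code.
n_samples = 1000

for batch_size in (1, 32, 1000):  # SGD, mini-batch, full batch
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(batch_size, updates_per_epoch)
# 1    -> 1000 updates per epoch (SGD)
# 32   -> 32 updates per epoch (mini-batch)
# 1000 -> 1 update per epoch (batch gradient descent)
```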

Code Snippet: Basic ANN for Classification (using TensorFlow/Keras)

Let's build a simple ANN to classify data. We'll use a synthetic dataset for demonstration.

Python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Generate Synthetic Data for Binary Classification
# X: Features, y: Target (0 or 1)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_clusters_per_class=1,
                           random_state=42)

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# Convert to DataFrame for better viewing
data_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
data_df['target'] = y
print("\nFirst 5 rows of the synthetic dataset:")
print(data_df.head())

# 2. Data Splitting
# stratify=y ensures that the proportion of target classes is the same in
# train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

print(f"\nTraining set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

# 3. Data Scaling (Crucial for Neural Networks)
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nFeatures scaled.")

# 4. Model Definition (Building the ANN)
# Keras Sequential API is used for building a stack of layers
model = keras.Sequential([
    # Input layer is implicitly defined by the first Dense layer's input_shape.
    # Dense: fully connected layer; each neuron connects to every neuron in
    # the previous layer. input_shape is only needed for the first layer.
    keras.layers.Dense(units=64, activation='relu',
                       input_shape=(X_train_scaled.shape[1],)),
    # Hidden Layer 1: 64 neurons, ReLU activation

    keras.layers.Dense(units=32, activation='relu'),
    # Hidden Layer 2: 32 neurons, ReLU activation

    keras.layers.Dense(units=1, activation='sigmoid')
    # Output Layer: 1 neuron for binary classification, sigmoid activation to
    # output a probability between 0 and 1
])

# 5. Model Compilation
# Configure the learning process
model.compile(optimizer='adam',            # Adam is an efficient optimization algorithm
              loss='binary_crossentropy',  # Loss function for binary classification
              metrics=['accuracy'])        # Metric to monitor during training

model.summary()  # Print a summary of the model architecture

# 6. Model Training
history = model.fit(X_train_scaled, y_train,
                    epochs=50,             # Number of full passes through the training data
                    batch_size=32,         # Number of samples per gradient update
                    validation_split=0.1,  # Use 10% of training data for validation
                    verbose=1)             # Display progress during training

# 7. Model Evaluation
print("\nEvaluating model on the test set:")
loss, accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

# Generate predictions (probabilities for binary classification)
y_pred_proba = model.predict(X_test_scaled)
# Convert probabilities to binary class labels (0 or 1) based on a threshold (e.g., 0.5)
y_pred = (y_pred_proba > 0.5).astype(int)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Plot training history (loss and accuracy)
plt.figure(figsize=(12, 5))

# Plot training & validation accuracy values
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

# Plot training & validation loss values
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()

# 8. Making Predictions on new data (conceptual)
# new_data = np.array([[...]])  # Your new data point(s)
# new_data_scaled = scaler.transform(new_data)
# prediction_proba = model.predict(new_data_scaled)
# prediction_class = (prediction_proba > 0.5).astype(int)
# print(f"\nPrediction for new data: "
#       f"Probability={prediction_proba[0][0]:.4f}, Class={prediction_class[0][0]}")

2. Common Deep Learning Architectures

Beyond the basic ANN, specialized architectures have been developed to handle specific types of
data and tasks.

A. Convolutional Neural Networks (CNNs)

 Best for: Image and video processing.


 Key Idea: CNNs are designed to automatically and adaptively learn spatial hierarchies of
features from input data. They achieve this using specialized layers that exploit the grid-
like topology of images.
 Core Layers:
o Convolutional Layer: This is the heart of a CNN. It applies a set of learnable
filters (kernels) to the input image (or feature map) to produce feature maps.
Each filter detects a specific pattern (e.g., edge, corner, texture) at different
locations in the image. The operation involves sliding the filter over the input and
performing element-wise multiplications and summations.
o Pooling Layer (Subsampling): Reduces the spatial dimensions (width and
height) of the feature maps, thereby reducing the number of parameters and
computations in the network, and helping to control overfitting. Common pooling
operations include:
 Max Pooling: Takes the maximum value from each patch (window) of the
feature map.
 Average Pooling: Takes the average value from each patch.
o Fully Connected (Dense) Layer: After several convolutional and pooling layers,
the flattened feature maps are fed into one or more fully connected layers, similar
to a traditional ANN. These layers perform the final classification or regression
based on the high-level features learned by the convolutional layers.
o Activation Layers: ReLU is commonly used after convolutional layers.
 How they work (Intuition): Early convolutional layers detect low-level features (edges,
corners). Deeper layers combine these to form higher-level features (parts of objects like
eyes, wheels). The final fully connected layers use these rich features for classification.
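As a concrete illustration of the pooling layer described above, here is 2x2 max pooling written by hand in NumPy. The 4x4 feature map is made up; this is a toy sketch of what a MaxPooling2D layer computes, not the Keras implementation:

```python
import numpy as np

# 2x2 max pooling with stride 2 on a 4x4 feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 9, 5],
                 [0, 1, 3, 8]], dtype=float)

def max_pool_2x2(x):
    # Slide a non-overlapping 2x2 window and keep the maximum of each patch,
    # halving both spatial dimensions.
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = x[i:i+2, j:j+2].max()
    return out

print(max_pool_2x2(fmap))
# [[6. 2.]
#  [7. 9.]]
```

The 4x4 input shrinks to 2x2 while the strongest activation in each patch survives, which is why pooling reduces parameters yet preserves the most salient features.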
Code Snippet: Simple CNN for Image Classification (Fashion MNIST)

Python
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# 1. Load and Prepare the Fashion MNIST Dataset
# Fashion MNIST is a dataset of Zalando's fashion article images, consisting
# of a training set of 60,000 examples and a test set of 10,000 examples.
# Each example is a 28x28 grayscale image, associated with a label from 10 classes.
(X_train_full, y_train_full), (X_test, y_test) = \
    keras.datasets.fashion_mnist.load_data()

# Normalize pixel values to be between 0 and 1
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0

# Split the full training set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full,
                                                  test_size=0.1, random_state=42,
                                                  stratify=y_train_full)

# CNNs expect an extra dimension for channels (e.g., 1 for grayscale, 3 for RGB)
X_train = X_train[..., np.newaxis]  # Add channel dimension
X_val = X_val[..., np.newaxis]
X_test = X_test[..., np.newaxis]

print(f"Shape of X_train: {X_train.shape}")  # (54000, 28, 28, 1)
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of X_test: {X_test.shape}")

# Define class names for better readability
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Display some images
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    # .squeeze() removes the channel dimension for display
    plt.imshow(X_train[i].squeeze(), cmap=plt.cm.binary)
    plt.xlabel(class_names[y_train[i]])
plt.suptitle('Sample Fashion MNIST Images')
plt.show()
# 2. Model Definition (Building the CNN)
model_cnn = keras.Sequential([
    # Input layer is implicitly defined by the first Conv2D's input_shape.
    # Conv2D: convolutional layer for 2D inputs (images)
    #   filters: number of filters (feature detectors)
    #   kernel_size: dimensions of the convolution window (e.g., 3x3)
    #   activation: ReLU is common
    #   input_shape: (height, width, channels)
    keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu',
                        input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(pool_size=(2, 2)),  # Max pooling layer

    keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
    keras.layers.MaxPooling2D(pool_size=(2, 2)),

    # Flatten the 2D feature maps into a 1D vector to feed into Dense layers
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),  # Fully connected hidden layer
    keras.layers.Dropout(0.5),  # Dropout for regularization (helps prevent overfitting)
    # Output layer: 10 neurons for 10 classes, softmax for probabilities
    keras.layers.Dense(units=10, activation='softmax')
])

# 3. Model Compilation
model_cnn.compile(optimizer='adam',
                  # sparse_categorical_crossentropy works with integer labels
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

model_cnn.summary()

# 4. Model Training
history_cnn = model_cnn.fit(X_train, y_train,
epochs=10,
batch_size=64,
validation_data=(X_val, y_val),
verbose=1)

# 5. Model Evaluation
print("\nEvaluating CNN model on the test set:")
loss_cnn, accuracy_cnn = model_cnn.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {loss_cnn:.4f}")
print(f"Test Accuracy: {accuracy_cnn:.4f}")

# Generate predictions
y_pred_proba_cnn = model_cnn.predict(X_test)
y_pred_cnn = np.argmax(y_pred_proba_cnn, axis=1)  # Class with the highest probability

print("\nClassification Report for CNN:")
print(classification_report(y_test, y_pred_cnn, target_names=class_names))

print("\nConfusion Matrix for CNN:")
cm_cnn = confusion_matrix(y_test, y_pred_cnn)
sns.heatmap(cm_cnn, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('CNN Confusion Matrix')
plt.show()

# Plot training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history_cnn.history['accuracy'])
plt.plot(history_cnn.history['val_accuracy'])
plt.title('CNN Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.subplot(1, 2, 2)
plt.plot(history_cnn.history['loss'])
plt.plot(history_cnn.history['val_loss'])
plt.title('CNN Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()

B. Recurrent Neural Networks (RNNs)

 Best for: Sequential data (time series, natural language, audio).


 Key Idea: RNNs are designed to process sequences of inputs, maintaining an internal
"memory" that captures information about previous elements in the sequence. Unlike
ANNs, RNNs have loops, allowing information to persist from one step to the next.
 Challenge with Vanilla RNNs: Vanishing/Exploding Gradients when dealing with long
sequences, making it difficult to learn long-range dependencies.
 Variants to address challenges:
o Long Short-Term Memory (LSTM): A special type of RNN that can learn long-
term dependencies. LSTMs have "gates" (input, forget, output) that regulate the
flow of information into and out of the cell state, allowing them to selectively
remember or forget information over long periods.
o Gated Recurrent Unit (GRU): A simplified version of LSTM with fewer gates
(reset and update gates), making it computationally less expensive while often
achieving similar performance.
 Applications:
o Natural Language Processing (NLP): Machine translation, sentiment analysis,
text generation, speech recognition.
o Time Series Analysis: Stock price prediction, weather forecasting.
o Video Analysis: Action recognition.
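The recurrence at the heart of a vanilla RNN, h_t = tanh(W_x · x_t + W_h · h_{t-1} + b), can be sketched in a few lines of NumPy. The shapes and random weights below are made up for illustration; real layers learn W_x, W_h, and b, and LSTMs/GRUs add gating on top of this basic step:

```python
import numpy as np

# A single vanilla-RNN step applied across a short sequence.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # The same weights are reused at every timestep; h_prev is the "memory"
    # carried forward from earlier elements of the sequence.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)                     # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):  # sequence of 5 timesteps
    h = rnn_step(x_t, h)

print(h.shape)  # (4,)
```

Because h feeds back into the next step, information from early timesteps can influence later outputs; repeated multiplication by W_h is also exactly where vanishing/exploding gradients come from.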

Code Snippet: Simple RNN (LSTM) for Sequence Classification


Let's imagine we have a sequence of numbers and want to classify them (e.g., a simple time
series pattern recognition).

Python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Generate Synthetic Sequential Data
# Create sequences that follow a simple pattern to classify
# (e.g., increasing vs. decreasing)
def generate_sequences(num_samples, seq_length):
    X = []
    y = []
    for _ in range(num_samples):
        # Generate a random starting point
        start_val = np.random.uniform(0, 10)
        # Generate sequence type (0 for increasing, 1 for decreasing)
        seq_type = np.random.randint(0, 2)

        sequence = [start_val]
        for i in range(1, seq_length):
            if seq_type == 0:  # Increasing
                next_val = sequence[-1] + np.random.uniform(0.1, 1.0)
            else:              # Decreasing
                next_val = sequence[-1] - np.random.uniform(0.1, 1.0)
            sequence.append(next_val)
        X.append(sequence)
        y.append(seq_type)
    return np.array(X), np.array(y)

SEQ_LENGTH = 10
NUM_SAMPLES = 2000

X, y = generate_sequences(NUM_SAMPLES, SEQ_LENGTH)

print(f"Shape of X: {X.shape}")  # (2000, 10)
print(f"Shape of y: {y.shape}")  # (2000,)

# 2. Data Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

# 3. Data Scaling (important for numerical stability)
scaler = MinMaxScaler(feature_range=(0, 1))  # Scale to 0-1 range
# For sequences, you can scale each time step independently if features are
# consistent across time, or apply one global scaling. Here we scale globally
# by flattening, fitting, and reshaping back:
X_train_scaled = scaler.fit_transform(X_train.reshape(-1, 1)).reshape(X_train.shape)
X_test_scaled = scaler.transform(X_test.reshape(-1, 1)).reshape(X_test.shape)

# RNNs expect input in the format (batch_size, timesteps, features)
# Here, timesteps = SEQ_LENGTH and features = 1 (each element is a single value)
X_train_scaled = X_train_scaled[..., np.newaxis]
X_test_scaled = X_test_scaled[..., np.newaxis]

print(f"\nShape of X_train (after reshaping for RNN): {X_train_scaled.shape}")
# (1600, 10, 1)

# 4. Model Definition (Building the LSTM RNN)
model_lstm = keras.Sequential([
    # LSTM layer:
    #   units: dimensionality of the output space (number of LSTM cells)
    #   input_shape: (timesteps, features)
    # return_sequences=False returns only the output of the last timestep;
    # set it to True to return every timestep's output (useful for stacked LSTMs)
    keras.layers.LSTM(units=50, return_sequences=False,
                      input_shape=(SEQ_LENGTH, 1)),

    keras.layers.Dense(units=32, activation='relu'),   # Hidden dense layer
    keras.layers.Dropout(0.3),                         # Dropout for regularization
    keras.layers.Dense(units=1, activation='sigmoid')  # Output layer for
                                                       # binary classification
])

# 5. Model Compilation
model_lstm.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

model_lstm.summary()

# 6. Model Training
history_lstm = model_lstm.fit(X_train_scaled, y_train,
epochs=30,
batch_size=64,
validation_split=0.1,
verbose=1)

# 7. Model Evaluation
print("\nEvaluating LSTM model on the test set:")
loss_lstm, accuracy_lstm = model_lstm.evaluate(X_test_scaled, y_test,
verbose=0)
print(f"Test Loss: {loss_lstm:.4f}")
print(f"Test Accuracy: {accuracy_lstm:.4f}")

# Generate predictions
y_pred_proba_lstm = model_lstm.predict(X_test_scaled)
y_pred_lstm = (y_pred_proba_lstm > 0.5).astype(int)

print("\nClassification Report for LSTM:")
print(classification_report(y_test, y_pred_lstm))

print("\nConfusion Matrix for LSTM:")
cm_lstm = confusion_matrix(y_test, y_pred_lstm)
sns.heatmap(cm_lstm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('LSTM Confusion Matrix')
plt.show()

# Plot training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history_lstm.history['accuracy'])
plt.plot(history_lstm.history['val_accuracy'])
plt.title('LSTM Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.subplot(1, 2, 2)
plt.plot(history_lstm.history['loss'])
plt.plot(history_lstm.history['val_loss'])
plt.title('LSTM Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()

C. Transformers

 Best for: Advanced sequence-to-sequence tasks, particularly in NLP.


 Key Idea: Transformers revolutionized NLP (and now increasingly computer vision) by
abandoning recurrence and convolutions in favor of attention mechanisms.
 Attention Mechanism: Allows the model to weigh the importance of different parts of
the input sequence when processing each element. This enables the model to focus on
relevant parts of the input regardless of their position, overcoming the long-range
dependency issues of RNNs and LSTMs.
 Architecture: Consists of an encoder (maps an input sequence to a sequence of
continuous representations) and a decoder (generates an output sequence one element at
a time, looking at the encoder's output and previously generated elements). Both encoder
and decoder stacks are composed of multiple identical layers, each containing multi-head
self-attention mechanisms and feed-forward networks.
 Positional Encoding: Since Transformers don't have recurrence, they use positional
encodings to inject information about the relative or absolute position of tokens in the
sequence.
 Applications: Machine translation (Google Translate), text summarization, question
answering, text generation (GPT-3/4, LLMs like Gemini).
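The scaled dot-product attention at the core of the Transformer, Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, can be sketched in NumPy. The tiny random Q, K, V matrices below are made up for illustration; real models compute them from learned projections of the input embeddings:

```python
import numpy as np

def attention(Q, K, V):
    # scores[i, j]: similarity of query i to key j, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors
    return w @ V, w

rng = np.random.default_rng(42)
Q = rng.normal(size=(3, 4))  # 3 query positions, d_k = 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, weights = attention(Q, K, V)
print(out.shape)              # (3, 4)
print(weights.sum(axis=-1))   # each row of attention weights sums to 1
```

Because every position attends to every other position in one matrix multiply, distance in the sequence no longer matters, which is how attention sidesteps the long-range dependency problems of RNNs.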

Note: Implementing a full Transformer from scratch is complex, and high-level libraries like
Hugging Face's transformers make that unnecessary for most applications. A complete
implementation is beyond a concise example, but here's a conceptual idea of using a pre-trained
Transformer.

Conceptual Code Snippet (using Hugging Face Transformers):

# !pip install transformers sentencepiece accelerate # Uncomment to install necessary libraries


from transformers import pipeline
# Use a pre-trained sentiment analysis model
# This uses a DistilBERT model, a smaller, faster version of BERT
sentiment_pipeline = pipeline("sentiment-analysis")
text1 = "I love deep learning! It's so powerful and exciting."
text2 = "This movie was incredibly boring and a waste of time."
text3 = "The weather today is neither good nor bad."
print(f"Sentiment for '{text1}': {sentiment_pipeline(text1)}")
print(f"Sentiment for '{text2}': {sentiment_pipeline(text2)}")
print(f"Sentiment for '{text3}': {sentiment_pipeline(text3)}")
# Example for text generation
# generator = pipeline("text-generation", model="distilgpt2")
# print("\nGenerated text:")
# print(generator("Once upon a time, in a land far, far away,",
#                 max_length=50, num_return_sequences=1))

3. Key Challenges and Advanced Topics in Deep Learning

 Overfitting: A common problem where the model learns the training data too well,
including noise, and performs poorly on unseen data.
o Mitigation: More data, regularization (L1/L2), Dropout, Early Stopping, Data
Augmentation.
 Vanishing/Exploding Gradients: During backpropagation, gradients can become
extremely small (vanishing) or extremely large (exploding), making it difficult for the
network to learn.
o Mitigation: ReLU activation, Batch Normalization, proper weight initialization,
Gradient Clipping (for exploding gradients), using LSTMs/GRUs for RNNs.
 Computational Resources: Training deep models requires significant computational
power (GPUs, TPUs) and memory.
 Data Hunger: Deep learning models typically require large amounts of labeled data to
achieve high performance.
 Interpretability: Deep neural networks are often considered "black boxes," making it
hard to understand why they make certain predictions. Research in Explainable AI (XAI)
aims to address this.
 Hyperparameter Tuning: Finding the optimal combination of hyperparameters
(learning rate, number of layers, neurons per layer, batch size, etc.) can be challenging
and time-consuming. Techniques like Grid Search, Random Search, and Bayesian
Optimization are used.
 Transfer Learning: As demonstrated with Hugging Face, training a huge model from
scratch is rarely necessary. Transfer learning involves using a pre-trained model (trained
on a massive dataset for a generic task) and fine-tuning it for a specific, smaller task. This
saves time, resources, and often leads to better performance, especially with limited data.
 Generative Models (GANs, VAEs): Networks designed to generate new data samples
that resemble the training data (e.g., generating realistic images, text, or audio).
 Reinforcement Learning with Deep Learning (Deep RL): Combining deep learning
with reinforcement learning, allowing agents to learn complex policies directly from
high-dimensional sensor inputs (e.g., playing Atari games from screen pixels, controlling
robots).
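As an illustration of the Early Stopping mitigation listed above, here is its core logic in plain Python. Keras packages this as the EarlyStopping callback; this sketch, with a made-up loss trajectory, just shows the "patience" rule:

```python
# Stop training once validation loss hasn't improved for `patience` epochs.
def early_stop_epoch(val_losses, patience=3):
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: remember it, reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch       # patience exhausted: stop here
    return len(val_losses) - 1     # never triggered: train to the end

# Validation loss improves for three epochs, then plateaus -> stop at epoch 5
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
print(early_stop_epoch(losses, patience=3))  # 5
```

The same monitor-and-wait pattern underlies learning-rate schedules like "reduce on plateau" as well; only the action taken when patience runs out differs.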

Natural Language Processing (NLP) is a fascinating and rapidly evolving field at the intersection
of Artificial Intelligence (AI), computer science, and linguistics. It focuses on enabling
computers to understand, interpret, and generate human language in a way that is valuable and
meaningful. In essence, NLP aims to bridge the communication gap between humans and
machines.

Let's explore NLP in depth:

1. What is Natural Language Processing?

At its core, NLP deals with the complexities of human language. Unlike structured data (like
numbers in a spreadsheet), natural language is highly unstructured, ambiguous, and context-
dependent. NLP systems strive to:

 Understand: Comprehend the meaning, intent, and sentiment behind human language.
This includes both written text and spoken words (when combined with speech
recognition).
 Process: Convert raw language data into a format that computers can analyze.
 Manipulate: Generate new text, summarize existing text, translate between languages,
and more.

Subfields related to NLP:


 Natural Language Understanding (NLU): Focuses on deciphering the meaning of text,
including handling ambiguity, grammar, and context.
 Natural Language Generation (NLG): Deals with generating human-like text or speech
from structured data or internal representations.
 Computational Linguistics (CL): The scientific study of language from a computational
perspective, providing the theoretical underpinnings for NLP.

2. Why is NLP Challenging?

Human language is incredibly complex and nuanced, posing significant challenges for machines:

 Ambiguity: Words and sentences can have multiple meanings depending on context
(e.g., "bank" as a river bank vs. a financial institution). Sarcasm, irony, and metaphors are
even harder to grasp.
 Context Dependency: The meaning of a word or phrase often relies on the surrounding
text and even external world knowledge.
 Synonymy and Polysemy: Different words can have the same meaning (synonymy), and
the same word can have multiple meanings (polysemy).
 Variability: Different ways of expressing the same idea, slang, dialects, misspellings,
and grammatical errors.
 Lack of Labeled Data: For many specific tasks or languages, high-quality, human-
labeled data is scarce and expensive to produce.
 Multilingualism: Each language has its own unique grammar, syntax, morphology, and
cultural nuances.
 Common Sense Reasoning: Machines lack the vast background knowledge and
common sense that humans implicitly use to understand language.
 Ethical Concerns & Bias: NLP models trained on biased data can perpetuate or even
amplify societal biases (e.g., gender, racial bias in word embeddings or language
models).

3. Core Components of an NLP Pipeline (Traditional vs. Deep Learning Era)

Historically, NLP involved a series of sequential steps. While Deep Learning models often
automate many of these, understanding them provides crucial context.

A. Traditional NLP Pipeline / Preprocessing Steps:

Before any meaningful analysis, raw text usually undergoes several preprocessing steps:

1. Text Acquisition: Gathering the raw textual data from various sources (web scraping,
databases, APIs, documents, speech-to-text transcripts).
2. Text Cleaning: Removing noise from the text.
o Punctuation Removal: Removing characters like commas, periods, etc.
o Special Character Removal: Removing emojis, symbols, HTML tags.
o Lowercase Conversion: Converting all text to lowercase to treat "Word" and
"word" as the same.
o Number Removal (Optional): Depending on the task, numbers might be
removed.
o Whitespace Normalization: Reducing multiple spaces to single spaces.
3. Tokenization: Breaking down raw text into smaller units called tokens. Tokens can be:
o Word Tokens: Splitting sentences into individual words.
o Sentence Tokens: Splitting paragraphs into individual sentences.
o Subword Tokens: Breaking words into smaller components (e.g., "unbreakable"
-> "un", "break", "able"), useful for handling rare words and out-of-vocabulary
words in deep learning.
4. Stop Word Removal: Eliminating common words (like "the", "a", "is", "and") that often
carry little semantic meaning and might not be useful for analysis.
5. Stemming: Reducing words to their root form by removing suffixes, often resulting in
non-dictionary words (e.g., "running", "runs", "ran" -> "run"). Simpler and faster, but less
accurate than lemmatization.
6. Lemmatization: Reducing words to their dictionary base form (lemma) using a
vocabulary and morphological analysis (e.g., "better" -> "good", "am", "is", "are" ->
"be"). More sophisticated and accurate than stemming.
7. Part-of-Speech (POS) Tagging: Identifying the grammatical role of each word in a
sentence (e.g., noun, verb, adjective, adverb).
8. Named Entity Recognition (NER): Identifying and classifying named entities in text
into predefined categories such as person names, organizations, locations, dates, etc.
9. Dependency Parsing / Syntactic Parsing: Analyzing the grammatical structure of
sentences to understand the relationships between words (e.g., which words are subjects,
objects, or modifiers).
10. Feature Engineering (Traditional ML): Converting preprocessed text into numerical
features that traditional machine learning algorithms can understand.
o Bag-of-Words (BoW): Represents text as a bag (multiset) of its words,
disregarding grammar and word order but keeping multiplicity. Features are word
counts or frequencies.
o TF-IDF (Term Frequency-Inverse Document Frequency): Weights words
based on how frequently they appear in a document (TF) and how rare they are
across the entire corpus (IDF). It highlights words that are important to a specific
document but not common overall.
o N-grams: Sequences of N words (e.g., "natural language" is a bigram, "natural
language processing" is a trigram). Captures some local word order.

B. Deep Learning Era - Text Representations (Embeddings):

Deep learning models bypass much of the explicit feature engineering by learning dense, low-
dimensional numerical representations of words or phrases, known as embeddings.

1. Word Embeddings:
o Word2Vec (Mikolov et al., Google): Learns word embeddings by predicting
surrounding words (Skip-gram) or predicting a target word from its context
(CBOW). Words with similar meanings have similar vector representations.
o GloVe (Global Vectors for Word Representation): Combines aspects of global
matrix factorization and local context window methods.
o FastText (Facebook AI): Extends Word2Vec by learning embeddings for
character n-grams, enabling it to handle out-of-vocabulary words and
morphologically rich languages better.
2. Contextual Embeddings:
o ELMo (Embeddings from Language Models): First to propose deep
contextualized word representations. It generates different embeddings for the
same word based on its context in a sentence, capturing polysemy.
o BERT (Bidirectional Encoder Representations from Transformers): A
groundbreaking model that learns deep bidirectional representations from
unlabeled text by jointly conditioning on both left and right context in all layers.
This means a word's meaning is derived from words before and after it.
o GPT (Generative Pre-trained Transformer): A series of models (GPT-1, GPT-
2, GPT-3, GPT-4, etc.) that are transformer-based and pre-trained to predict the
next word in a sequence (unidirectional). Excellent for text generation,
summarization, and various other tasks.
o Other Transformer-based Models: RoBERTa, XLNet, T5, ALBERT,
ELECTRA, and many more, each with specific optimizations or training
objectives.
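The intuition behind all of these embedding methods is "words in similar contexts get similar vectors," and it can be demonstrated with a minimal count-based sketch: build a word–word co-occurrence matrix from a toy corpus and factorize it with truncated SVD (loosely the family of methods GloVe builds on). This is illustrative only; real Word2Vec/GloVe/FastText training uses much larger corpora and different objectives, and the toy corpus is an assumption of this example:

```python
import numpy as np

corpus = [
    "i like deep learning",
    "i like nlp",
    "i enjoy flying",
]
tokens = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(tokens)}
V = len(tokens)

# Symmetric co-occurrence counts within a +/-1 word window.
C = np.zeros((V, V))
for sent in corpus:
    words = sent.split()
    for i in range(len(words)):
        for j in range(max(0, i - 1), min(len(words), i + 2)):
            if i != j:
                C[idx[words[i]], idx[words[j]]] += 1

# Truncated SVD gives each word a dense low-dimensional vector.
U, S, _ = np.linalg.svd(C)
k = 2
embeddings = U[:, :k] * S[:k]  # one 2-d vector per word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "like" and "enjoy" occur in similar contexts (both follow "i"),
# so their vectors should end up relatively similar.
print(cosine(embeddings[idx["like"]], embeddings[idx["enjoy"]]))
```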

4. Deep Learning Models for NLP (Deeper Dive from previous section):

While we touched on these, let's emphasize their role specifically in NLP.

 Recurrent Neural Networks (RNNs) and their variants (LSTMs, GRUs):


o Why they were good for NLP: Their ability to process sequences and maintain
an internal "memory" made them natural fits for understanding the sequential
nature of language. LSTMs and GRUs largely solved the vanishing gradient
problem of vanilla RNNs, allowing them to capture longer-range dependencies.
o Applications: Machine translation (early models), sentiment analysis, text
generation, speech recognition, named entity recognition.
o Limitation: Still suffered from issues with very long sequences and inherently
sequential processing, limiting parallelization.
 Convolutional Neural Networks (CNNs) in NLP:
o Why they were used: While primarily known for images, 1D CNNs can be
applied to text by sliding filters over sequences of word embeddings. They are
good at extracting local features (like n-grams) and identifying important patterns
regardless of their position in the text.
o Applications: Text classification, sentiment analysis, spam detection.
o Limitation: Less effective at capturing long-range dependencies and global
context compared to RNNs or Transformers.
 Transformers:
o The Game Changer: As discussed, Transformers, particularly with the advent of
self-attention, removed the sequential processing bottleneck of RNNs. They can
process all words in a sequence simultaneously, allowing for massive
parallelization and the capture of very long-range dependencies.
o Self-Attention: Allows each word in a sequence to "attend" to (weigh the
importance of) every other word in the sequence, dynamically determining
relevant contextual information for each word's representation.
o Multi-Head Attention: Multiple attention mechanisms running in parallel,
allowing the model to learn different types of relationships.
o Positional Encodings: Crucial for Transformers since they don't have inherent
sequential processing. These encodings add information about the position of
words in the sequence.
o Encoder-Decoder Architecture: Used for sequence-to-sequence tasks like
machine translation. The encoder processes the input, and the decoder generates
the output, attending to the encoder's output.
o Pre-training and Fine-tuning: The dominant paradigm for Transformer usage.
Large models are pre-trained on massive amounts of unlabeled text (e.g.,
predicting masked words, next sentence prediction), then fine-tuned on smaller,
task-specific labeled datasets. This leverages the general linguistic knowledge
acquired during pre-training.
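The self-attention computation at the heart of the Transformer reduces to a few matrix operations. Below is a minimal NumPy sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, with random stand-in weights; real Transformers use learned projections, multiple heads, masking, and positional encodings, so treat the shapes and values here as illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # 4 tokens, 8-dimensional representations

X = rng.normal(size=(seq_len, d_model))    # token embeddings (+ positional encodings in practice)
W_q = rng.normal(size=(d_model, d_model))  # stand-ins for learned projection matrices
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row i: how much token i attends to each token
output = weights @ V

print(weights.shape)  # (4, 4): every token attends to every other token
print(output.shape)   # (4, 8): a new context-aware representation per token
```

Because every token's output is a weighted mix over all tokens, the whole sequence can be processed in parallel, which is exactly the bottleneck-removal described above.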

5. Key NLP Tasks and Applications:

NLP powers a vast array of applications that we interact with daily:

 Text Classification: Categorizing text into predefined labels.


o Examples: Spam detection, sentiment analysis (positive/negative/neutral
reviews), topic labeling (news articles), intent recognition (chatbot inquiries).
 Sentiment Analysis (Opinion Mining): Determining the emotional tone or sentiment
expressed in a piece of text.
o Examples: Analyzing customer reviews, social media monitoring for brand
reputation, market research.
 Named Entity Recognition (NER): Identifying and classifying proper nouns (people,
organizations, locations, dates, etc.) in text.
o Examples: Information extraction from documents, populating knowledge
graphs, clinical text analysis.
 Machine Translation: Automatically translating text or speech from one language to
another.
o Examples: Google Translate, real-time translation in communication apps.
 Text Summarization: Condensing a long text into a shorter, coherent summary.
o Examples: Summarizing news articles, legal documents, research papers.
o Extractive Summarization: Selects important sentences directly from the
original text.
o Abstractive Summarization: Generates new sentences that capture the core
meaning, potentially paraphrasing.
 Question Answering (QA): Enabling systems to answer questions posed in natural
language based on a given text or knowledge base.
o Examples: Search engines, chatbots, virtual assistants.
 Chatbots and Virtual Assistants: Systems that can converse with humans, understand
commands, and provide information.
o Examples: Siri, Alexa, Google Assistant, customer service chatbots.
 Language Modeling & Text Generation: Predicting the next word in a sequence, or
generating coherent and contextually relevant text.
o Examples: Predictive text, auto-complete, creative writing tools, generating
marketing copy, large language models (LLMs).
 Speech Recognition (Speech-to-Text): Converting spoken language into written text.
o Examples: Voice assistants, dictation software, transcribing meetings.
 Text-to-Speech (TTS): Converting written text into synthesized speech.
o Examples: Screen readers, navigation systems.
 Information Retrieval: Finding relevant documents or information from large datasets
based on a user's query.
o Examples: Search engines, document search within enterprises.
 Spelling and Grammar Correction: Identifying and correcting errors in written text.
o Examples: Grammarly, spell checkers in word processors.
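To make the extractive-summarization idea above concrete, here is a toy frequency-based sentence scorer: it ranks sentences by the corpus frequency of their non-stop words and keeps the top ones, restored to original order. The stop-word list, the regex sentence splitter, and the sample text are all simplifications for illustration, not a production approach:

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that", "this", "for", "my"}

def extractive_summary(text, n_sentences=1):
    # Split on sentence-ending punctuation (a rough stand-in for a real sentence tokenizer).
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    words = [w for w in re.findall(r'\w+', text.lower()) if w not in STOP]
    freq = Counter(words)

    # Score each sentence by the summed corpus frequency of its content words.
    def score(sent):
        return sum(freq[w] for w in re.findall(r'\w+', sent.lower()) if w not in STOP)

    ranked = sorted(sentences, key=score, reverse=True)
    chosen = set(ranked[:n_sentences])
    # Keep the top sentences in their original order.
    return " ".join(s for s in sentences if s in chosen)

text = ("Transformers use attention for NLP. Attention lets models weigh context. "
        "My cat sleeps a lot.")
print(extractive_summary(text, n_sentences=1))  # → Attention lets models weigh context.
```

Abstractive summarization, by contrast, requires a generative model (e.g., a sequence-to-sequence Transformer) that writes new sentences rather than selecting existing ones.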

6. The Rise of Large Language Models (LLMs)

The past few years have seen an explosion in the capabilities and applications of Large Language
Models (LLMs), primarily driven by the Transformer architecture and access to vast
computational resources and enormous amounts of text data.

 Scale: LLMs are trained on truly massive datasets (trillions of tokens) with billions or
even trillions of parameters.
 Emergent Capabilities: Due to their scale, LLMs exhibit "emergent capabilities" –
abilities they were not explicitly trained for, such as reasoning, code generation, creative
writing, and complex problem-solving.
 Few-shot/Zero-shot Learning: LLMs can perform tasks with very few (few-shot) or
even no (zero-shot) examples, relying on their vast pre-trained knowledge.
 Instruction Following: Many LLMs are further fine-tuned (e.g., with Reinforcement
Learning from Human Feedback - RLHF) to better understand and follow human
instructions, leading to powerful conversational AI.
 Impact: LLMs are transforming virtually every industry, from content creation and
customer service to scientific research and education.

Conclusion

Natural Language Processing has evolved from rule-based systems to statistical models and,
most recently, to sophisticated deep learning architectures, particularly Transformers. This
progression has unlocked unprecedented capabilities in understanding and generating human
language, making NLP a cornerstone of modern AI. As research continues, we can expect even
more nuanced and human-like interactions with machines, further blurring the lines between
artificial and natural intelligence.
Okay, let's bring the NLP concepts to life with Python coding snippets. We'll use popular
libraries like NLTK, spaCy, and Hugging Face Transformers to demonstrate the different stages
and techniques.

Due to the length and complexity of some deep learning models (especially full Transformer
implementations), I'll focus on demonstrating the core steps and concepts rather than building a
state-of-the-art model from scratch for every example.

Natural Language Processing: Concepts with Code Snippets

We'll follow the typical NLP pipeline, starting from basic preprocessing and moving towards
more advanced deep learning techniques.

Setup: Installing Libraries

First, ensure you have the necessary libraries installed.

Bash
pip install nltk spacy pandas numpy scikit-learn matplotlib seaborn transformers datasets evaluate accelerate
python -m spacy download en_core_web_sm  # Download spaCy's small English model

1. Traditional NLP Pipeline / Preprocessing Steps

Let's start with raw text and apply some common preprocessing techniques.

Python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy
import pandas as pd
import re  # Regular expressions for cleaning
from collections import Counter

# Download necessary NLTK data (do this once)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')
try:
    nltk.data.find('corpora/omw-1.4')  # Open Multilingual Wordnet for the lemmatizer
except LookupError:
    nltk.download('omw-1.4')

# Load spaCy English model
nlp_spacy = spacy.load("en_core_web_sm")

# --- Example Text ---
raw_text = """Natural Language Processing (NLP) is a fascinating field.
It enables computers to understand, interpret, and generate human language.
However, NLP is challenging due to ambiguity, context-dependency, and the
sheer variability of human expression.
"Spacy is faster than NLTK for many tasks!" said John Doe (CEO of XYZ Inc.)
on 2023-03-15.
I am running, she runs, he ran. Better, best, good."""

print("--- Original Text ---")
print(raw_text)
print("-" * 30)

# --- 1. Text Cleaning ---
print("\n--- Text Cleaning ---")
cleaned_text = raw_text.lower()  # Convert to lowercase
cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)  # Remove punctuation and special characters (keep alphanumeric and whitespace)
cleaned_text = re.sub(r'\d+', '', cleaned_text)  # Remove numbers
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Normalize whitespace

print(f"Cleaned Text: {cleaned_text}")
print("-" * 30)

# --- 2. Tokenization ---
print("\n--- Tokenization ---")
# NLTK Word Tokenization
nltk_word_tokens = word_tokenize(cleaned_text)
print(f"NLTK Word Tokens: {nltk_word_tokens[:10]}...")

# NLTK Sentence Tokenization (often better to tokenize sentences before lowercasing/cleaning)
nltk_sent_tokens = sent_tokenize(raw_text)
print(f"NLTK Sentence Tokens: {nltk_sent_tokens}")

# spaCy Tokenization (more sophisticated, retains context)
doc_spacy = nlp_spacy(raw_text)
spacy_tokens = [token.text for token in doc_spacy]
print(f"spaCy Tokens: {spacy_tokens[:10]}...")
print("-" * 30)

# --- 3. Stop Word Removal ---
print("\n--- Stop Word Removal ---")
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in nltk_word_tokens if word not in stop_words]
print(f"Tokens after Stop Word Removal: {filtered_tokens[:10]}...")
print("-" * 30)

# --- 4. Stemming ---
print("\n--- Stemming (NLTK PorterStemmer) ---")
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print(f"Stemmed Tokens: {stemmed_tokens[:10]}...")
print("-" * 30)

# --- 5. Lemmatization ---
print("\n--- Lemmatization (NLTK WordNetLemmatizer) ---")
lemmatizer = WordNetLemmatizer()
# The lemmatizer can benefit from a POS tag (e.g., 'v' for verb)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(f"Lemmatized Tokens: {lemmatized_tokens[:10]}...")

print("\n--- Lemmatization (spaCy - more accurate with POS) ---")
spacy_lemmas = [token.lemma_ for token in doc_spacy
                if not token.is_stop and not token.is_punct and not token.is_space]
print(f"spaCy Lemmas: {spacy_lemmas[:10]}...")
print("-" * 30)

# --- 6. Part-of-Speech (POS) Tagging (with spaCy) ---
print("\n--- Part-of-Speech (POS) Tagging (spaCy) ---")
pos_tags = [(token.text, token.pos_) for token in doc_spacy]
print(f"POS Tags: {pos_tags[:15]}...")
print("-" * 30)

# --- 7. Named Entity Recognition (NER) (with spaCy) ---
print("\n--- Named Entity Recognition (NER) (spaCy) ---")
entities = [(ent.text, ent.label_) for ent in doc_spacy.ents]
print(f"Named Entities: {entities}")
print("-" * 30)

# --- 8. Dependency Parsing (with spaCy) ---
print("\n--- Dependency Parsing (spaCy - showing first sentence) ---")
first_sentence = list(doc_spacy.sents)[0]
for token in first_sentence:
    print(f"{token.text:<10} {token.dep_:<10} {token.head.text:<10} {token.pos_:<10}")
print("-" * 30)

2. Feature Engineering (Traditional ML)

Converting text into numerical features.

Python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

documents = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy fox",
    "A quick brown dog is not lazy"
]

print("--- Original Documents ---")
for i, doc in enumerate(documents):
    print(f"Doc {i+1}: {doc}")
print("-" * 30)

# --- A. Bag-of-Words (BoW) ---
print("\n--- Bag-of-Words (CountVectorizer) ---")
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(documents)

# Get feature names (words in vocabulary)
feature_names_bow = vectorizer_bow.get_feature_names_out()
print(f"Vocabulary (features): {feature_names_bow}")
print("BoW Matrix (Document-Term Matrix):")
print(pd.DataFrame(X_bow.toarray(), columns=feature_names_bow))
print("-" * 30)

# --- B. TF-IDF (Term Frequency-Inverse Document Frequency) ---
print("\n--- TF-IDF (TfidfVectorizer) ---")
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(documents)

feature_names_tfidf = vectorizer_tfidf.get_feature_names_out()
print(f"Vocabulary (features): {feature_names_tfidf}")
print("TF-IDF Matrix:")
print(pd.DataFrame(X_tfidf.toarray(), columns=feature_names_tfidf).round(3))  # Round for readability
print("-" * 30)

# --- C. N-grams (using CountVectorizer with ngram_range) ---
print("\n--- N-grams (Bigrams) ---")
vectorizer_ngram = CountVectorizer(ngram_range=(2, 2))  # For bigrams
X_ngram = vectorizer_ngram.fit_transform(documents)

feature_names_ngram = vectorizer_ngram.get_feature_names_out()
print(f"Vocabulary (bigrams): {feature_names_ngram}")
print("Bigram Matrix:")
print(pd.DataFrame(X_ngram.toarray(), columns=feature_names_ngram))
print("-" * 30)

3. Deep Learning Era - Text Representations (Embeddings)

Let's demonstrate word embeddings (conceptual with GloVe/Word2Vec idea) and then the
power of contextual embeddings using Hugging Face Transformers.

Python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# --- A. Word Embeddings (Conceptual using Keras Embedding Layer) ---
# In a real scenario, you'd use pre-trained Word2Vec/GloVe or train your own.
# Here, we'll demonstrate how Keras's Embedding layer works internally.

print("--- Word Embeddings (Keras Embedding Layer) ---")

texts = [
    "i love natural language processing",
    "natural language processing is fun",
    "i enjoy machine learning",
    "deep learning is powerful"
]

# 1. Tokenize and create vocabulary
tokenizer = Tokenizer(num_words=None, oov_token="<unk>")  # No limit on words, add OOV token
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
print(f"Vocabulary (word_index): {word_index}")
vocab_size = len(word_index) + 1  # +1 for the padding index (0)

# 2. Convert texts to sequences of integers
sequences = tokenizer.texts_to_sequences(texts)
print(f"Text sequences: {sequences}")

# 3. Pad sequences to uniform length
max_seq_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')
print(f"Padded sequences: \n{padded_sequences}")

# 4. Define an Embedding Layer (conceptual - not trained yet)
embedding_dim = 50  # Desired dimension for word vectors

# The Embedding layer is typically the first layer in a deep learning model for NLP.
# It learns a dense vector for each word in the vocabulary.
embedding_layer = keras.layers.Embedding(
    input_dim=vocab_size,      # Size of the vocabulary
    output_dim=embedding_dim,  # Dimension of the dense embedding
)

# Calling the layer once builds it, creating its randomly initialized weight matrix.
# In a real model, these weights would be learned during training or loaded from
# pre-trained embeddings. (This is just to show the structure, not a trained embedding.)
_ = embedding_layer(tf.constant(padded_sequences))
random_embedding_matrix = embedding_layer.get_weights()[0]
print(f"\nShape of randomly initialized embedding matrix: {random_embedding_matrix.shape}")
print(f"Embedding for 'love' (word_index['love'] = {word_index['love']}): "
      f"{random_embedding_matrix[word_index['love']][:5]}...")
print("-" * 30)

# --- B. Contextual Embeddings with Hugging Face Transformers ---
# We'll use a pre-trained BERT model to get contextual embeddings.

from transformers import AutoTokenizer, AutoModel
import torch

print("\n--- Contextual Embeddings (BERT via Hugging Face) ---")

model_name = "bert-base-uncased"  # A popular small BERT model
tokenizer_hf = AutoTokenizer.from_pretrained(model_name)
model_hf = AutoModel.from_pretrained(model_name)  # This loads the BERT model's encoder part

texts_hf = [
    "I ate an apple.",
    "Apple is a big company.",
    "The bank is on the river.",
    "I need to go to the bank to withdraw money."
]

# Tokenize and create input IDs and attention masks
inputs = tokenizer_hf(texts_hf, return_tensors="pt", padding=True, truncation=True)
print(f"Input IDs (tokenized text): \n{inputs['input_ids']}")
print(f"Attention Mask (identifies real tokens vs padding): \n{inputs['attention_mask']}")

# Get model outputs
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model_hf(**inputs)

# The 'last_hidden_state' contains the contextual embeddings for each token
# Shape: (batch_size, sequence_length, hidden_size); hidden_size for bert-base is 768
last_hidden_state = outputs.last_hidden_state
print(f"\nShape of contextual embeddings (last_hidden_state): {last_hidden_state.shape}")

# Let's inspect the embedding for the word "bank" in two different contexts
# Token for "bank": tokenizer_hf.vocab["bank"] -> 2924
bank_token_id = tokenizer_hf.vocab["bank"]

# Sentence 3: "The bank is on the river."
# Token IDs: [101, 1996, 2924, 2003, 2006, 1996, 4004, 102]
# Index of "bank" is 2
embedding_bank_river = last_hidden_state[2, 2, :5].numpy()  # First 5 dimensions
print(f"Embedding of 'bank' (river context - first 5 dims): {embedding_bank_river}")

# Sentence 4: "I need to go to the bank to withdraw money."
# Token IDs: [101, 1045, 2342, 2000, 2171, 2000, 1996, 2924, 2000, 15998, 2816, 102]
# Index of "bank" is 7
embedding_bank_money = last_hidden_state[3, 7, :5].numpy()  # First 5 dimensions
print(f"Embedding of 'bank' (money context - first 5 dims): {embedding_bank_money}")

# Notice how these embeddings are different, showcasing contextuality.
# If you calculate cosine similarity, they would be less similar than the same
# word appearing in similar contexts.

# You can also get the embedding for the [CLS] token (first token), which is
# often used as a sentence-level representation for classification tasks.
cls_embedding = last_hidden_state[:, 0, :].numpy()
print(f"\nShape of [CLS] token embeddings (sentence embeddings): {cls_embedding.shape}")
print("-" * 30)
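To quantify the claim in the comments above, cosine similarity is the standard way to compare embeddings. The helper below is deliberately self-contained, using tiny stand-in vectors rather than the real 768-dimensional BERT outputs; in practice you would pass slices of last_hidden_state such as embedding_bank_river and embedding_bank_money:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors (illustrative, NOT real BERT outputs): two "bank" embeddings
# from different contexts point in different directions...
bank_river = [0.9, 0.1, -0.3]
bank_money = [0.2, -0.8, 0.5]
# ...while the same word in a similar context stays close.
bank_river_2 = [0.85, 0.15, -0.25]

print(round(cosine_similarity(bank_river, bank_river_2), 3))  # close to 1
print(round(cosine_similarity(bank_river, bank_money), 3))    # much lower
```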

4. NLP Applications with Deep Learning

Let's demonstrate a simple text classification task using a pre-trained Transformer model from
Hugging Face. This shows the "fine-tuning" paradigm.

Python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset
import torch

print("--- Text Classification with Hugging Face Transformers (Fine-tuning) ---")

# --- 1. Create a Synthetic Dataset for Text Classification ---
data = {
    'text': [
        "This movie was fantastic! I loved every minute.",
        "The food was terrible, completely inedible.",
        "Great acting, but the plot was a bit weak.",
        "Such a wonderful experience, highly recommend.",
        "Worst customer service ever, I'm so disappointed.",
        "It was okay, nothing special, nothing bad.",
        "A masterpiece of cinema, truly groundbreaking.",
        "Absolutely hated it, boring and predictable.",
        "Pretty good, enjoyed most parts, a few slow moments.",
        "Utter garbage, never again.",
        "Mind-blowing special effects and compelling story.",
        "Just a normal day at the office, very average.",
        "Exceptional performance, captivating plot twists.",
        "Left me speechless, in a bad way.",
        "Quite enjoyable, definitely worth a watch."
    ],
    'label': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # 1: Positive/Neutral, 0: Negative
}
df = pd.DataFrame(data)

# For simplicity, we treat this as binary classification: 1 -> Positive, 0 -> Negative.
# df.loc[df['text'].str.contains('okay|nothing special|average'), 'label'] = 0  # Example if you want to explicitly label neutrals as 0

# Create a mapping for labels (optional, but good practice for clarity)
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

df['label_name'] = df['label'].map(id2label)

print("Sample Dataset:")
print(df.head())
print(f"Label distribution:\n{df['label_name'].value_counts()}")
print("-" * 30)

# --- 2. Data Splitting ---
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['label'])

# Convert pandas DataFrames to Hugging Face Dataset objects
# (the Hugging Face `Trainer` expects `datasets.Dataset` inputs)
train_dataset = Dataset.from_pandas(train_df.drop(columns=['label_name']), preserve_index=False)
test_dataset = Dataset.from_pandas(test_df.drop(columns=['label_name']), preserve_index=False)

print(f"Train Dataset Size: {len(train_dataset)}")
print(f"Test Dataset Size: {len(test_dataset)}")

# --- 3. Load Pre-trained Tokenizer and Model ---
model_name = "distilbert-base-uncased"  # A smaller, faster BERT variant for demonstration
tokenizer_classifier = AutoTokenizer.from_pretrained(model_name)
model_classifier = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(id2label), id2label=id2label, label2id=label2id
)

# --- 4. Tokenization Function ---
def tokenize_function(examples):
    return tokenizer_classifier(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Remove the original text column and rename 'label' to 'labels' for Trainer compatibility
tokenized_train_dataset = tokenized_train_dataset.remove_columns(["text"]).rename_column("label", "labels")
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["text"]).rename_column("label", "labels")
tokenized_train_dataset.set_format("torch")  # Set format to PyTorch tensors
tokenized_test_dataset.set_format("torch")

print("\nSample tokenized input for training:")
print(tokenized_train_dataset[0])
print("-" * 30)

# --- 5. Define Training Arguments ---
training_args = TrainingArguments(
    output_dir="./results",            # Directory for logs and checkpoints
    num_train_epochs=5,                # Number of training epochs
    per_device_train_batch_size=8,     # Batch size for training
    per_device_eval_batch_size=8,      # Batch size for evaluation
    warmup_steps=10,                   # Warmup steps for the LR scheduler (kept small for this tiny dataset)
    weight_decay=0.01,                 # Strength of weight decay
    logging_dir="./logs",              # Directory for tensorboard logs
    logging_steps=100,
    evaluation_strategy="epoch",       # Evaluate every epoch
    save_strategy="epoch",             # Save checkpoint every epoch
    load_best_model_at_end=True,       # Load the best model at the end of training
    metric_for_best_model="accuracy",  # Metric used to determine the best model
)

# --- 6. Define Metrics (for evaluation) ---
# Note: `datasets.load_metric` is deprecated; the `evaluate` library replaces it.
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# --- 7. Initialize Trainer and Train ---
trainer = Trainer(
    model=model_classifier,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,  # Use the test set as eval_dataset for this example
    tokenizer=tokenizer_classifier,
    compute_metrics=compute_metrics,
)

print("\n--- Starting Model Fine-tuning ---")
trainer.train()

print("\n--- Training Complete ---")
print("-" * 30)

# --- 8. Model Evaluation on the Test Set ---
print("\n--- Evaluating Fine-tuned Model on Test Set ---")
predictions = trainer.predict(tokenized_test_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=-1)
true_labels = tokenized_test_dataset["labels"]

print("\nClassification Report:")
print(classification_report(true_labels, predicted_labels, target_names=list(id2label.values())))

print("\nConfusion Matrix:")
cm_hf = confusion_matrix(true_labels, predicted_labels)
sns.heatmap(cm_hf, annot=True, fmt='d', cmap='Blues',
            xticklabels=list(id2label.values()), yticklabels=list(id2label.values()))
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Transformer Classification Confusion Matrix')
plt.show()

# --- 9. Making a prediction on a new sentence ---
print("\n--- Making a prediction on a new sentence ---")
new_sentence = "This product is an absolute game-changer!"
inputs_new = tokenizer_classifier(new_sentence, return_tensors="pt",
                                  padding="max_length", truncation=True, max_length=128)

# Move inputs to the same device as the model (e.g., GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_classifier.to(device)
inputs_new = {k: v.to(device) for k, v in inputs_new.items()}

with torch.no_grad():
    logits = model_classifier(**inputs_new).logits

predicted_class_id = logits.argmax().item()
predicted_label = id2label[predicted_class_id]
print(f"New sentence: '{new_sentence}'")
print(f"Predicted sentiment: {predicted_label}")
print("-" * 30)

Explanation of the Fine-tuning Code:

1. Synthetic Data: We create a small, artificial dataset for binary text classification
(positive/negative sentiment). In a real scenario, this would be a much larger dataset of
reviews, tweets, etc.
2. Dataset Preparation: We split the data and convert it into Hugging Face Dataset
objects, which are efficient for handling text data.
3. Tokenizer & Model Loading:
o AutoTokenizer.from_pretrained(model_name): Loads the appropriate
tokenizer for the chosen pre-trained model (e.g., distilbert-base-uncased).
This tokenizer handles subword tokenization, adding special tokens ([CLS],
[SEP]), padding, and truncation.
o AutoModelForSequenceClassification.from_pretrained(model_name,
...): Loads the pre-trained Transformer model specifically adapted for sequence
classification. The num_labels, id2label, label2id arguments configure the
output layer for our specific classification task.
4. Tokenization Function (tokenize_function): This function applies the tokenizer to
our text data, converting raw text into numerical input_ids, attention_mask, and
token_type_ids (though token_type_ids might not be used by DistilBERT for single-
sequence tasks).
5. Training Arguments: TrainingArguments define various hyperparameters for the
training loop, such as batch size, number of epochs, learning rate schedule, and
evaluation strategy.
6. compute_metrics: A function required by the Trainer to calculate performance metrics
(here, accuracy) during evaluation.
7. Trainer: Hugging Face's Trainer API simplifies the fine-tuning process. You provide it
with the model, arguments, datasets, and tokenizer, and it handles the training loop,
evaluation, logging, and saving checkpoints.
8. Prediction: After training, we use trainer.predict() to get predictions on the test set
and then apply standard sklearn metrics for evaluation. Finally, we demonstrate how to
predict on a completely new, unseen sentence.
