CSE 5111: Deep Learning
Lecture 4: Building Our First Deep Neural Network [Class 9,10]
Master of Science in Computer Science & Engineering
Department of Computer Science and Engineering
Comilla University
Instructor: Mahmudul Hasan, PhD
Reference Text: Deep Learning (GBC) - Chapter 6.1, 6.3
Slide 2: Recap: The Complete Puzzle So Far
What We've Learned
We now have all the mathematical pieces:
1. Gradient Descent: The optimization algorithm that minimizes loss.
2. Backpropagation: The efficient algorithm for calculating gradients.
3. PyTorch Autograd: Automates backpropagation for us.
Today's Mission: Assemble these pieces to build our first Deep Neural Network
(DNN) and understand why depth is powerful.
Slide 3: The Limitation of Linear Models
Why Go Deep? The Need for Non-Linearity
Problem: Single-layer networks (like linear regression) can only learn linear relationships.
Real-World Example: The XOR Problem
• Can you separate True/False with a single straight line?
• Input: (0,0) → Output: 0
• Input: (0,1) → Output: 1
• Input: (1,0) → Output: 1
• Input: (1,1) → Output: 0
Answer: No! This is a fundamental limitation of linear models.
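A tiny hand-constructed check (not from the slides, weights picked by hand for illustration) shows why a single non-linearity fixes this: two ReLU units compute XOR exactly, even though no single line can separate it.

```python
import torch

# XOR inputs as rows: (0,0), (0,1), (1,0), (1,1)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
s = X.sum(dim=1)  # x1 + x2 for each row: [0, 1, 1, 2]

# Hand-picked two-unit ReLU network: XOR(x1,x2) = ReLU(x1+x2) - 2*ReLU(x1+x2-1)
y = torch.relu(s) - 2 * torch.relu(s - 1)
print(y)  # tensor([0., 1., 1., 0.]) - the XOR truth table
```

The second ReLU unit "bends" the function downward once both inputs are on, which is exactly the kink a linear model cannot produce.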
Slide 4: The Solution: Adding Layers & Non-Linearity
Building Complexity Step by Step
Think of it like this:
• Layer 1: Creates simple decision boundaries (straight lines)
• Layer 2: Combines these lines to create more complex shapes
• Layer 3: Combines those shapes to create even more complex regions
Analogy: Building with LEGO
• Single layer = Basic bricks
• Multiple layers = Complex structures from simple bricks
• Activation functions = The connectors that hold everything together
Slide 5: Activation Functions: The "Spark" of Neural Networks
What Are Activation Functions?
Activation functions determine whether a neuron should "fire" or not. They introduce non-linearity!
Without activation functions:
• Deep network = Just multiple linear transformations
• Multiple layers = Equivalent to a single layer
With activation functions:
• Deep network = Can learn complex, non-linear relationships
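The collapse is easy to verify numerically; a minimal sketch (the weight matrices below are random, purely for illustration):

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(3, 4)  # "layer 1" weights
W2 = torch.randn(2, 3)  # "layer 2" weights
x = torch.randn(4)

deep = W2 @ (W1 @ x)      # two linear layers, no activation in between
shallow = (W2 @ W1) @ x   # one linear layer whose weight is W2 @ W1
print(torch.allclose(deep, shallow))  # True - the "deep" stack is really one layer
```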
Slide 6: Popular Activation Functions
Meet the Activation Function Family
1. Sigmoid
• σ(x) = 1/(1 + e⁻ˣ)
• Range: (0, 1)
• Problem: Vanishing gradients, not zero-centered
2. Tanh
• tanh(x) = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ)
• Range: (-1, 1)
• Better than sigmoid (zero-centered)
3. ReLU (Rectified Linear Unit)
• ReLU(x) = max(0, x)
• Range: [0, ∞)
• Default choice for most networks
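Evaluating the three functions on a few sample inputs makes the ranges concrete:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

print(torch.sigmoid(x))  # values in (0, 1); sigmoid(0) = 0.5
print(torch.tanh(x))     # values in (-1, 1); zero-centered, tanh(0) = 0
print(torch.relu(x))     # tensor([0., 0., 2.]) - negatives clipped to 0
```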
Slide 7: Why ReLU is the Default Choice
The ReLU Revolution
Advantages:
• Computationally simple: Just max(0, x)
• Avoids vanishing gradient: Gradient is either 0 or 1
• Sparsity: About 50% of neurons can be inactive
Disadvantage:
• Dying ReLU: If inputs are always negative, the neuron never activates
Solution variants:
• Leaky ReLU: max(0.01x, x) - small slope for negative values
• Parametric ReLU (PReLU): Learn the slope
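The dying-ReLU problem shows up directly in the gradients; a small autograd sketch comparing the two:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0], requires_grad=True)

# ReLU: zero gradient for a negative input - no learning signal at all
torch.relu(x).backward()
print(x.grad)  # tensor([0.])

x.grad = None  # reset before the second backward pass
# Leaky ReLU: the small negative slope keeps a gradient flowing
F.leaky_relu(x, negative_slope=0.01).backward()
print(x.grad)  # tensor([0.0100])
```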
Slide 8: Architecture of a Deep Neural Network
Anatomy of a DNN
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ... → Output Layer
(raw data)    (simple features) (more complex features)  (final prediction)
Key Components:
• Input Layer: Size = number of features
• Hidden Layers: Where the magic happens
• Output Layer: Size = number of classes (classification) or 1 (regression)
• Connections: Fully connected = each neuron connects to all neurons in next layer
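Because every neuron connects to every neuron in the next layer, a fully connected layer has in_features × out_features weights plus out_features biases. For the 784 → 128 layer used later in this lecture:

```python
import torch.nn as nn

layer = nn.Linear(784, 128)  # fully connected: every input feeds every output
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 784*128 weights + 128 biases = 100480
```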
Slide 9: Real-World Example: Image Classification
Hands-On Example: Fashion-MNIST
The Dataset:
• 70,000 grayscale images
• 10 categories (T-shirt, trousers, pullover, dress, etc.)
• 28×28 pixels = 784 features per image
Our Goal: Build a DNN that can classify clothing items!
Why this is perfect for learning:
• Simple enough to train quickly
• Complex enough to need a real neural network
Slide 10: Building a DNN in PyTorch: Step 1 - Imports
Setting Up Our Toolkit
python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Key Imports:
• torch.nn: Neural network modules
• torch.optim: Optimization algorithms
• torchvision: Computer vision datasets
Slide 11: Building a DNN in PyTorch: Step 2 - Define the Model
Creating Our Neural Network Class
class FashionDNN(nn.Module):
    def __init__(self):
        super(FashionDNN, self).__init__()
        self.network = nn.Sequential(
            # Input: 784 features (28x28 pixels)
            nn.Linear(784, 128),  # First hidden layer
            nn.ReLU(),            # Activation function
            nn.Linear(128, 64),   # Second hidden layer
            nn.ReLU(),            # Activation function
            nn.Linear(64, 10)     # Output: 10 classes
        )

    def forward(self, x):
        # Flatten each image from 28x28 to 784
        x = x.view(x.size(0), -1)
        return self.network(x)

# Create the model and move it to the GPU (if available)
model = FashionDNN().to(device)
print(model)
Slide 12: Understanding nn.Sequential
What is nn.Sequential?
nn.Sequential is a container that chains layers together:
Input → Linear(784,128) → ReLU() → Linear(128,64) → ReLU() → Linear(64,10) → Output
It's like a pipeline:
• Data flows through each layer in order
• Output of one layer becomes input to the next
• Makes code clean and readable
Alternative: You can also define each layer separately and connect them manually in the forward method.
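As a sketch of that manual style (the class name FashionDNNManual is made up for illustration), here is the same architecture with each layer defined individually and wired up by hand in forward():

```python
import torch
import torch.nn as nn

class FashionDNNManual(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)     # flatten 28x28 -> 784
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)            # raw logits for 10 classes

out = FashionDNNManual()(torch.randn(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 10])
```

The manual style is more verbose but lets you add branches, skip connections, or debugging prints between layers, which nn.Sequential cannot express.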
Slide 13: Building a DNN: Step 3 - Prepare the Data
Getting Our Data Ready
python
# Transform: convert images to tensors and normalize
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # map pixel values from [0,1] to [-1,1]
])
# Download and load training data
trainset = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=64, shuffle=True)
# Download and load test data
testset = torchvision.datasets.FashionMNIST(
    root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(
    testset, batch_size=64, shuffle=False)
Why batch_size=64?
• Training with mini-batches is more efficient
• Provides more stable gradient estimates
• Common sizes: 32, 64, 128, 256
Why Batch Sizes Are Often Powers of 2 in Deep Learning
In machine learning, particularly deep learning, the batch size refers to the number of training
samples processed together in one iteration before updating the model's parameters.
• GPU Parallelism: Aligns with the GPU's warp size (32 threads), so work divides evenly across cores.
• Memory Alignment: Matches hardware memory alignment boundaries, reducing padding and waste.
• Matrix Optimization: cuDNN matrix kernels are typically fastest when dimensions are multiples of 8.
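Mini-batching can be seen on synthetic data without downloading anything (the random tensors below are just a stand-in for Fashion-MNIST):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 1000 fake flattened "images" and random labels as a stand-in dataset
data = TensorDataset(torch.randn(1000, 784), torch.randint(0, 10, (1000,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

print(len(loader))   # 16 batches: 15 full batches of 64 plus one partial batch of 40
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 784])
```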
Slide 14: Building a DNN: Step 4 - Loss Function & Optimizer
Choosing the Right Tools
# Loss function for classification
criterion = nn.CrossEntropyLoss()
# Optimizer - Adam is usually a good default
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Learning rate scheduler (optional but helpful)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
Why CrossEntropyLoss?
• The standard choice for multi-class classification
• Combines log-softmax and negative log likelihood in one step
• Expects raw logits, so the model needs no softmax output layer
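That equivalence is checkable in a couple of lines: cross_entropy on raw logits matches log-softmax followed by NLL (the logits and labels below are random, just for the check):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # raw model outputs for 4 samples, 10 classes
labels = torch.tensor([3, 0, 7, 1])  # true class indices

ce = F.cross_entropy(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, manual))  # True
```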
Why Adam?
a) Adaptive Learning Rates: Adam adjusts learning rates for each parameter using estimates
of first and second moments, leading to faster convergence.
b) Efficiency with Large Datasets: It performs well with large-scale data and parameters,
requiring less memory than other optimizers.
c) Robust to Noisy Gradients: Adam handles noisy or sparse gradients effectively, making it
suitable for complex, non-convex problems.
d) Combines Momentum and RMSProp: By integrating momentum and RMSProp, Adam
balances speed and stability in optimization.
Slide 15: The Complete Training Loop
Putting It All Together: Training
python
def train_model(model, trainloader, criterion, optimizer, epochs=10):
    model.train()  # Set model to training mode
    train_losses = []
    for epoch in range(epochs):
        running_loss = 0.0
        for images, labels in trainloader:
            # Move data to the GPU
            images, labels = images.to(device), labels.to(device)
            # Zero the gradients
            optimizer.zero_grad()
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        epoch_loss = running_loss / len(trainloader)
        train_losses.append(epoch_loss)
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}')
    return train_losses

# Train the model!
loss_history = train_model(model, trainloader, criterion, optimizer)
Slide 16: Understanding Model Training vs Evaluation Modes
model.train() vs model.eval()
Training Mode (model.train()):
• Enables dropout and per-batch statistics for batch normalization
• Used during training
Evaluation Mode (model.eval()):
• Disables dropout and uses the full network
• Uses running statistics for batch norm
• Used during testing/validation
Note: Neither mode turns gradient tracking on or off - wrap inference in torch.no_grad() to skip gradients and save memory.
python
# For testing:
model.eval()
with torch.no_grad():  # No gradients needed for testing
    test_outputs = model(test_images)
# Back to training:
model.train()
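Dropout makes the difference between the two modes directly visible; a quick sketch:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: randomly zeroes about half the inputs,
print(drop(x))  # surviving entries are scaled by 1/(1-p) = 2

drop.eval()     # eval mode: dropout becomes the identity
print(torch.equal(drop(x), x))  # True
```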
Slide 17: Evaluating Our Model
How Good is Our Model?
python
def evaluate_model(model, testloader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():  # No gradients needed = faster!
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    print(f'Test Accuracy: {accuracy:.2f}%')
    return accuracy

# Test our trained model
accuracy = evaluate_model(model, testloader)
What's happening here:
• torch.max(outputs, 1): Get the predicted class per image (highest score)
• Compare predictions with true labels
• Calculate percentage of correct predictions
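On a hand-made batch of logits (made up for illustration), the mechanics look like this:

```python
import torch

# Fake model outputs for a batch of 3 images over 4 classes
outputs = torch.tensor([[0.1, 2.0, 0.3, 0.4],
                        [1.5, 0.2, 0.1, 0.0],
                        [0.0, 0.1, 0.2, 3.0]])

values, predicted = torch.max(outputs, 1)  # max along the class dimension
print(predicted)  # tensor([1, 0, 3]) - the predicted class indices

labels = torch.tensor([1, 2, 3])
correct = (predicted == labels).sum().item()
print(correct)  # 2 correct out of 3
```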
Slide 18: Visualizing Training Progress
Learning Curves: Our Training Report Card
python
plt.plot(loss_history)
plt.title('Training Loss Over Time')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.show()
What to look for:
• Good: Smooth, steady decrease in loss
• Bad: Loss oscillating wildly (learning rate too high)
• Bad: Loss not decreasing (learning rate too low or model too simple)
• Bad: Loss suddenly becomes NaN (exploding gradients)
Typical results: You should see loss drop from ~2.0 to ~0.3 in 10 epochs!
Slide 19: Making Predictions
Using Our Trained Model
# Get a batch of test images
dataiter = iter(testloader)
images, labels = next(dataiter)
images, labels = images.to(device), labels.to(device)
# Make predictions
model.eval()
with torch.no_grad():
    outputs = model(images)
    _, predictions = torch.max(outputs, 1)
# Display results
class_names = ['T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
for i in range(5): # Show first 5 predictions
print(f'Predicted: {class_names[predictions[i]]}, '
f'Actual: {class_names[labels[i]]}')
Slide 20: Lab 2 Preview: Your Turn!
Lab 2: Implement and Train a DNN
Your Tasks:
1. Implement the DNN architecture we just built
2. Experiment with different architectures:
o Try different numbers of hidden layers
o Try different numbers of neurons per layer
o Try different activation functions
3. Tune hyperparameters:
o Learning rate (try 0.1, 0.01, 0.001)
o Batch size (try 32, 64, 128)
4. Achieve at least 85% test accuracy
Due Date: [Insert your due date here]
Slide 21: Key Takeaways
What We Learned Today
1. Why Depth Matters: Deep networks can learn complex, non-linear relationships
2. Activation Functions: ReLU is the default choice for hidden layers
3. DNN Architecture: Input → Hidden Layers → Output
4. PyTorch Workflow:
o Define model as an nn.Module subclass
o Use nn.Sequential for simple architectures
o Choose appropriate loss function and optimizer
o Implement training loop
o Evaluate on test data
You now have everything needed to build and train real neural networks!
Slide 22: What's Next?
Preview of Lecture 5
• Problem: Our DNN treats images as flat vectors - it ignores spatial structure!
• Solution: Convolutional Neural Networks (CNNs)
• Topics:
o Convolution operation
o Pooling layers
o Building CNNs in PyTorch
o Transfer learning
• Reading: GBC, Chapter 9
Slide 23: References & Questions
References & Resources
1. Primary: GBC, Chapter 6.1, 6.3
2. PyTorch Tutorials: "Deep Learning with PyTorch: A 60 Minute Blitz"
3. Dataset: Fashion-MNIST documentation
4. Visualization: TensorBoard for tracking experiments