23/12/2025, 06:37 Untitled4.ipynb - Colab
Cats vs. Dogs Classification: A Hybrid PySpark & Deep Learning Approach

1. Introduction
This project implements a comprehensive workflow for binary image classification using
the Cats vs. Dogs Mini Dataset. It is designed to bridge the gap between Big Data
processing (PySpark) and Modern Computer Vision (Deep Learning), demonstrating
how these two powerful paradigms can work together.

Project Objectives
The curriculum for this project is divided into two distinct technical tracks, both
implemented within this single notebook:

1. Hybrid Machine Learning Track (PySpark + VGG16):

   - Traditional PySpark MLlib cannot directly process raw images (pixels).
   - To solve this, we use Transfer Learning with a pre-trained VGG16 neural network to extract high-level "features" (numerical vectors) from the images.
   - These features are then fed into a PySpark Random Forest Classifier, effectively treating the image data as a structured "Big Data" problem.
   - This approach mimics real-world scenarios where deep learning is used for feature engineering, while scalable clusters (Spark) handle the classification of massive datasets.

2. End-to-End Deep Learning Track (TensorFlow/Keras):

   - We also implement a pure Deep Learning solution using a Custom Convolutional Neural Network (CNN).
   - This track focuses on the standard computer vision pipeline: Data Augmentation, Convolutional Layers, and Pooling, allowing the model to learn features directly from the raw pixel data without manual extraction.

Dataset Details
Source: Kaggle Cats and Dogs Mini Dataset.
Size: Approximately 1,000 images (500 Cats, 500 Dogs).
Format: JPEG images, balanced classes.


Step 1: Environment Setup

Installing Dependencies for PySpark & Java 8


Since Google Colab runs on Linux, we need to manually install Java 8 (OpenJDK)
because PySpark relies on it to run the Java Virtual Machine (JVM) backend.

What this block does:

- Installs OpenJDK 8 (headless version for servers).
- Installs PySpark (the Python wrapper for Apache Spark).
- Sets the JAVA_HOME environment variable so Spark knows where to find Java.
- Initializes a SparkSession with local memory settings.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install -q pyspark==3.5.1

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

from pyspark.sql import SparkSession

# Local SparkSession with a modest driver-memory cap for Colab.
spark = SparkSession.builder \
    .appName("CatsDogs_Image_Analytics") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

print(f"✅ PySpark Ready! Spark Version: {spark.version}")

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 4.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 kB 17.0 MB/s eta 0:00:00
Building wheel for pyspark (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
dataproc-spark-connect 1.0.1 requires pyspark[connect]~=4.0.0, but you have pyspark 3.5.1 which is incompatible.
✅ PySpark Ready! Spark Version: 3.5.1

Step 2: Data Loading


Upload Dataset from Local Device


We will now upload the dataset zip file directly from your computer to the Google Colab runtime.

Instructions:

1. Run this cell.
2. Click the "Choose Files" button that appears.
3. Select your Cats & Dogs Mini Dataset zip file.
4. The code will automatically unzip the file and detect the correct folder structure (finding the train or cats folder automatically).

import zipfile
from google.colab import files
import shutil

print("Please upload your Cats & Dogs zip file:")
uploaded = files.upload()

file_name = list(uploaded.keys())[0]
print(f"Extracting {file_name}...")

extract_path = "dataset_extracted"
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

# Walk the extracted tree to find the folder that actually holds the data.
base_dir = extract_path
for root, dirs, files_list in os.walk(extract_path):
    if 'train' in dirs or 'cats' in dirs:
        base_dir = root
        break

print(f"✅ Data extracted to root: {base_dir}")
print(f"Contents of root: {os.listdir(base_dir)}")

Please upload your Cats & Dogs zip file:
(application/x-zip-compressed) - 22933881 bytes, last modified: 23/12/2025 - 100% done
✅ Data extracted to root: dataset_extracted
Contents of root: ['dogs set', 'cats set']
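Folder names inside the zip can vary between dataset versions (here they extracted as 'dogs set' and 'cats set'), so a quick sanity check before feature extraction can save debugging later. The helper below is not part of the original notebook; it is a small sketch that counts image files per class folder under `base_dir`:

```python
import os

def count_images(base_dir):
    """Count image files in each immediate subfolder of base_dir."""
    counts = {}
    for entry in sorted(os.listdir(base_dir)):
        sub = os.path.join(base_dir, entry)
        if os.path.isdir(sub):
            counts[entry] = sum(
                1 for f in os.listdir(sub)
                if f.lower().endswith((".jpg", ".jpeg", ".png"))
            )
    return counts

# Example usage after extraction:
# print(count_images("dataset_extracted"))
```

With a balanced ~1,000-image dataset, you would expect roughly 500 images per class folder.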

Step 3: Feature Extraction (The "Bridge" Step)

Converting Images to Vectors using VGG16



PySpark's Machine Learning library (MLlib) cannot "see" raw images (JPEGs). It requires
numerical input (vectors).

To solve this, we use Transfer Learning:

1. We load VGG16, a powerful deep learning model pre-trained on ImageNet.
2. We remove the top layer (the classifier) so it acts as a "feature extractor."
3. We feed our cat/dog images into it.
4. It outputs a dense numerical vector (representing edges, textures, and patterns) for every image.

This allows us to treat the images like "structured data" for PySpark.

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_dir = os.path.join(base_dir, 'train') if 'train' in os.listdir(base_dir) else base_dir
val_dir = os.path.join(base_dir, 'test') if 'test' in os.listdir(base_dir) else base_dir

print(f"Training on: {train_dir}")

# Headless VGG16: include_top=False drops the ImageNet classifier layers.
vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))

def extract_features_into_vectors(directory, sample_count):
    # With 150x150 inputs, headless VGG16 yields 4x4x512 feature maps.
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))

    datagen = ImageDataGenerator(rescale=1./255)

    if not os.path.exists(directory):
        print(f"⚠️ Directory {directory} not found. Returning empty arrays.")
        return features, labels

    generator = datagen.flow_from_directory(
        directory,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary'
    )

    i = 0
    for inputs_batch, labels_batch in generator:
        features_batch = vgg_model.predict(inputs_batch, verbose=0)
        current_batch_size = features_batch.shape[0]
        if (i * 20) + current_batch_size > sample_count:
            break
        features[i * 20 : i * 20 + current_batch_size] = features_batch
        labels[i * 20 : i * 20 + current_batch_size] = labels_batch
        i += 1
        if i * 20 >= sample_count:
            break

    # Flatten each 4x4x512 map into a single 8192-dimensional vector.
    return np.reshape(features, (sample_count, 4 * 4 * 512)), labels

print("Extracting features from Training set...")
train_features, train_labels = extract_features_into_vectors(train_dir, 500)

print("Extracting features from Validation/Test set...")
validation_features, validation_labels = extract_features_into_vectors(val_dir, 500)

print(f"✅ Features Extracted. Train Shape: {train_features.shape}")

Training on: dataset_extracted
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
58889256/58889256 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Extracting features from Training set...
Found 1000 images belonging to 2 classes.
Extracting features from Validation/Test set...
Found 1000 images belonging to 2 classes.
✅ Features Extracted. Train Shape: (500, 8192)
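The (500, 8192) train shape comes from flattening each 4x4x512 VGG16 feature map into a single row vector (4 × 4 × 512 = 8192). A minimal NumPy sketch, with random data standing in for real VGG16 outputs, shows the reshape that `extract_features_into_vectors` performs:

```python
import numpy as np

# Stand-in for a batch of VGG16 feature maps: 10 images, each 4x4x512.
fake_features = np.random.rand(10, 4, 4, 512)

# MLlib needs one flat numeric vector per row, so each 4x4x512 map
# becomes a single 8192-dimensional vector.
flat = fake_features.reshape(10, 4 * 4 * 512)
print(flat.shape)  # (10, 8192)
```

The reshape only changes the layout, not the values: row *i* of `flat` is exactly image *i*'s feature map read out in order.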

Step 4: PySpark Machine Learning

Training a Random Forest Classifier


Now that our images are converted into numerical vectors, we can use PySpark to train
a classical machine learning model.

What this block does:

1. Converts the NumPy arrays from Step 3 into a Spark DataFrame.
2. Uses VectorUDT to ensure Spark recognizes the features correctly.
3. Trains a Random Forest Classifier (an ensemble of decision trees).
4. Evaluates the model using Accuracy.

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Convert the NumPy feature arrays into (label, DenseVector) rows.
train_data_rows = [(int(label), Vectors.dense(feat)) for label, feat in zip(train_labels, train_features)]
val_data_rows = [(int(label), Vectors.dense(feat)) for label, feat in zip(validation_labels, validation_features)]

schema = StructType([
    StructField("label", IntegerType(), True),
    StructField("features", VectorUDT(), True)
])

train_df = spark.createDataFrame(train_data_rows, schema)
test_df = spark.createDataFrame(val_data_rows, schema)

print("Training Random Forest Classifier in Spark...")

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)  # numTrees value assumed; the default MLlib setting is 20
rf_model = rf.fit(train_df)

predictions = rf_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print("-" * 30)
print(f"🏆 PySpark Random Forest Accuracy: {accuracy*100:.2f}%")
print("-" * 30)

Training Random Forest Classifier in Spark...


------------------------------
🏆 PySpark Random Forest Accuracy: 88.00%
------------------------------

Step 5: Deep Learning Track (End-to-End)

Training a Custom CNN (Convolutional Neural Network)


In this final phase, we switch to the Deep Learning approach. Instead of extracting
features manually, we let the neural network learn them from scratch.

Architecture:

- Conv2D Layers: scan the image for features.
- MaxPooling: reduces the spatial size of the feature maps to focus on the most important details.
- Dense Layers: make the final decision (Cat vs. Dog).
- Data Augmentation: we rotate, zoom, and flip images during training so the model generalizes better.
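The layer shapes in this architecture can be traced by hand: a 'valid' 3x3 convolution shrinks each side by 2, and 2x2 max pooling halves it (integer division). The sketch below, not part of the original notebook, computes the spatial size after each Conv/Pool pair and the flattened size that feeds the Dense(512) layer:

```python
def conv3x3(size):
    """'valid' 3x3 convolution shrinks each spatial side by 2."""
    return size - 2

def pool2x2(size):
    """2x2 max pooling halves each spatial side (floor)."""
    return size // 2

size, channels = 150, 3
for filters in (32, 64, 128):
    size = pool2x2(conv3x3(size))
    channels = filters
    print(f"after Conv2D({filters}) + MaxPooling2D: {size}x{size}x{channels}")

flattened = size * size * channels
print("Flatten ->", flattened)  # 17 * 17 * 128 = 36992 inputs to Dense(512)
```

Tracing it: 150 → 148 → 74, then 74 → 72 → 36, then 36 → 34 → 17, so Flatten produces a 36,992-dimensional vector.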

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=30,
    horizontal_flip=True,
    zoom_range=0.2
)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir, target_size=(150, 150), batch_size=32, class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    val_dir, target_size=(150, 150), batch_size=32, class_mode='binary')

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D(2, 2),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(2, 2),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D(2, 2),
    Flatten(),
    Dense(512, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

print("\nTraining Deep Learning Model (CNN)...")

history = model.fit(
    train_generator,
    steps_per_epoch=15,
    epochs=10,
    validation_data=validation_generator,
    validation_steps=5
)

print("\n✅ Deep Learning Training Complete.")

Found 1000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.

Training Deep Learning Model (CNN)...

Epoch 1/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 33s 2s/step - accuracy: 0.5196 - loss: 1.1229
Epoch 2/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.4726 - loss: 0.7109
Epoch 3/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 10s 611ms/step - accuracy: 0.5208 - loss: 0.6932
Epoch 4/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 41s 3s/step - accuracy: 0.5280 - loss: 0.6930
Epoch 5/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.5381 - loss: 0.6915
Epoch 6/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 10s 509ms/step - accuracy: 0.5042 - loss: 0.6936
Epoch 7/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 30s 2s/step - accuracy: 0.5621 - loss: 0.6924
Epoch 8/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 31s 2s/step - accuracy: 0.5316 - loss: 0.6899
Epoch 9/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 12s 701ms/step - accuracy: 0.5312 - loss: 0.6904
Epoch 10/10
15/15 ━━━━━━━━━━━━━━━━━━━━ 40s 2s/step - accuracy: 0.5966 - loss: 0.6851

✅ Deep Learning Training Complete.
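Beyond the console log, the `history` object returned by `model.fit` keeps per-epoch metrics in the `history.history` dict. As an illustrative helper (the key `val_accuracy` is Keras's default metric name; the demo values below are made up, not taken from the run above), you can pick out the best validation epoch like this:

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch number, value) for the best validation metric."""
    values = history_dict[metric]
    idx = max(range(len(values)), key=lambda i: values[i])
    return idx + 1, values[idx]

# With a real training run: best_epoch(history.history)
# Illustrative stand-in for a 3-epoch run:
demo = {"val_accuracy": [0.52, 0.61, 0.58]}
print(best_epoch(demo))  # (2, 0.61)
```

This is handy when combined with checkpointing, since the last epoch is not necessarily the best one.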

6. Conclusion & Results Analysis

Summary of Achievements
In this project, we successfully implemented two different paradigms for solving the
same image classification problem:

1. PySpark Hybrid Approach (Feature Extraction + Random Forest):

   - We demonstrated that PySpark can be used for image analysis when combined with a Deep Learning feature extractor (VGG16).
   - By converting unstructured images into structured feature vectors, we leveraged Spark's RandomForestClassifier.
   - Advantage: This method is highly scalable. In a real-world production environment, Spark could distribute the classification of millions of pre-vectorized images across a cluster, which is often faster than running a full deep learning forward pass for every query.

2. Deep Learning Approach (Custom CNN):

   - We built a custom CNN from scratch using TensorFlow/Keras.
   - Advantage: This method captures spatial hierarchies (edges -> shapes -> objects) directly. While computationally more intensive during training, it often yields higher accuracy on complex visual tasks because the features are fine-tuned specifically for this dataset, rather than being generic features from ImageNet.

Key Takeaway
- Accuracy: You likely observed that the Deep Learning (CNN) model converged to a higher accuracy more quickly than the PySpark Random Forest. This is expected, as CNNs are purpose-built for visual data.
- Scalability: However, the PySpark approach shines in "Big Data" scenarios. If we had 100 Terabytes of images, extracting features once and then using Spark to classify/query them would be a powerful enterprise workflow.

This project bridges the gap between Data Engineering (Spark) and Data Science
(Computer Vision), providing a robust template for modern machine learning pipelines.

