UNIT III: MACHINE LEARNING ON BIG DATA
Introduction to MLlib and Scikit-learn, Data Preprocessing for Big Data ML
Pipelines, Supervised Learning: Classification and Regression on Large
Datasets, Unsupervised Learning: Clustering and Dimensionality Reduction,
Model Evaluation and Validation Techniques, Distributed Training and
Optimization Techniques.
INTRODUCTION TO MLLIB
MLlib is a machine learning framework built on top of Apache Spark. It is designed
to make machine learning tasks faster and more efficient. MLlib supports various ML
algorithms and utilities.
Core Features of MLlib
MLlib is a scalable, easy-to-use and comprehensive library.
Algorithms such as classification, regression, clustering and collaborative filtering can be implemented with it.
Its API is similar to scikit-learn's, which streamlines the process of building and tuning machine learning workflows.
Includes feature selection, extraction, scaling and related utilities.
Provides tools to evaluate models with metrics such as accuracy and precision.
It provides two main APIs: the RDD-based API and the DataFrame-based API.
Working Principle
Workflow of MLlib Models
1. Data Ingestion: Load data using Spark DataFrames.
2. Data Preprocessing: Cleaning, handling missing values. Feature selection and
transformation
3. Model Selection: Choose a machine learning algorithm based on data.
4. Training: Fit the model on training data.
5. Prediction: Use the trained model to make predictions on test or new data.
6. Evaluation: Evaluate model performance using various metrics.
7. Pipeline Deployment: Build a pipeline combining all steps.
Major Algorithms in MLlib
1. Classification Models
Used to categorize data into predefined labels.
Used in Email spam detection
Handle binary and multiclass problems
2. Regression Models
Used to predict continuous values.
Used in House price prediction
Minimize error between predicted and actual values
3. Clustering Models
Used to group data points without labels.
Used in Customer segmentation
Unsupervised learning based on similarity
4. Recommendation Algorithms
Used for personalized content delivery.
Used in Movie recommendations
Collaborative filtering based on user-item interactions
5. Dimensionality Reduction
Used to reduce feature space while preserving data variance.
Used in Visualization or preprocessing
Helps in improving performance and reducing noise
6. Feature Transformation
Essential for preparing raw data into a usable format.
Encoding categorical variables and scaling features
Required before model training
MLlib simplifies large-scale machine learning by combining Spark with ML algorithms. It
allows seamless integration of preprocessing, training and evaluation in one environment and
is ideal for production-level systems where both scalability and speed are crucial.
Implementation of MLlib
Here we create a logistic regression model using PySpark, train it on sample data and then make predictions on the same data, displaying the features, labels and predictions.
SparkSession.builder.appName(...).getOrCreate(): Initializes a Spark session, which is the entry point for Spark functionality.
spark.createDataFrame(): Creates a DataFrame from a list of tuples where each tuple contains a label and features represented as a dense vector.
LogisticRegression(): The LogisticRegression model from MLlib is initialized and then trained on the sample data using the .fit() method.
lr.fit(data): Trains the logistic regression model using the input data (label and features).
model.transform(data): Applies the trained model to the data to make predictions.
predictions.select("features", "label", "prediction"): Selects specific columns (features, label and prediction) from the predictions DataFrame to display the results.
Example:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Start session
spark = SparkSession.builder.appName("BasicMLlib").getOrCreate()

# Sample DataFrame
data = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1, 0.1])),
    (1.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))
], ["label", "features"])

# Train logistic regression model
lr = LogisticRegression()
model = lr.fit(data)

# Prediction
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
OUTPUT: a table of the features, label and prediction columns printed by show().
Real-World Use Case
Predictive analytics of diseases based on patient data: Using machine learning models
to analyze historical patient data and predict the likelihood of diseases allowing for early
intervention and better healthcare planning.
Spam detection and sentiment analysis: It helps to classify emails or messages as spam
or non-spam and analyze the sentiment of text to understand whether it is positive,
negative or neutral.
Fraud detection in the finance sector: It can identify unusual patterns in transactions,
helping detect fraudulent activities such as credit card fraud or identity theft in real-time.
Real-time product recommendation in e-commerce: By analyzing user behavior,
preferences and purchase history it can suggest products to users in real-time, enhancing
the shopping experience and increasing sales.
Customer segmentation for marketing: It helps segment customers based on their
behavior, preferences and demographics enabling businesses to tailor their marketing
strategies and improve customer engagement.
Strengths of MLlib
Comprehensive ML support: It provides a wide range of algorithms for classification, regression, clustering, recommendation and more, making it suitable for various machine learning tasks.
Scalable and fault-tolerant: Designed to handle large-scale datasets across
distributed systems. It ensures fault tolerance so processes continue smoothly even in
case of hardware failures.
Simplified API for common ML tasks: MLlib offers an easy-to-use API that
abstracts the complexity of distributed computing and makes it easier for users to
implement common machine learning tasks like training models and making
predictions.
Good speed: It uses Apache Spark for distributed processing, providing high-speed
model training and inference making it suitable for big data applications.
Easy integration: It integrates seamlessly with Spark’s data processing framework
allowing it to work well with Spark’s other components like Spark SQL, Spark
Streaming and ML pipelines.
INTRODUCTION TO SCIKIT-LEARN
Scikit-Learn (also known as sklearn) is an open-source Python library designed to
simplify the implementation of machine learning models. It provides a wide range of tools for
data preprocessing, model selection, and evaluation, making it a preferred choice for
beginners and professionals alike.
Key Features of Scikit-Learn:
Simple and Consistent API: Provides a unified interface for all machine learning
algorithms.
Efficient Implementation: Built on top of optimized scientific libraries like NumPy
and SciPy.
Wide Range of Algorithms: Includes classification, regression, clustering, and
dimensionality reduction techniques.
Built-in Data Preprocessing Tools: Offers methods for handling missing values,
feature scaling, and encoding categorical variables.
Model Evaluation and Selection: Supports cross-validation, hyperparameter tuning,
and performance metrics.
Installing Scikit-Learn
To install Scikit-Learn, use the following command:
pip install scikit-learn
Methods in Scikit-Learn
Scikit-Learn provides various methods that make machine learning model development
easier. Some commonly used methods include:
1. Data Preprocessing Methods
sklearn.preprocessing.StandardScaler(): Standardizes features by removing the mean and scaling to unit variance.
sklearn.preprocessing.MinMaxScaler(): Scales features to a given range (default 0 to 1).
sklearn.preprocessing.LabelEncoder(): Encodes categorical labels as integers.
sklearn.impute.SimpleImputer(): Handles missing values by replacing them with mean, median, or most frequent values.
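As a quick illustration, these utilities can be combined on a tiny made-up array (the values and the mean-imputation strategy here are illustrative, not from the text):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0]])

# Replace the missing value with the column mean (300.0 here)
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Standardize: each column gets mean 0 and unit variance
X_std = StandardScaler().fit_transform(X_filled)

# Min-max scale: each column is mapped into [0, 1]
X_mm = MinMaxScaler().fit_transform(X_filled)

# Encode string labels as integers (alphabetical order: cat=0, dog=1)
labels = LabelEncoder().fit_transform(["cat", "dog", "cat"])
print(X_mm[:, 0])   # [0.  0.5 1. ]
print(labels)       # [0 1 0]
```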
2. Model Selection Methods
sklearn.model_selection.train_test_split(): Splits data into training and test sets.
sklearn.model_selection.GridSearchCV(): Performs exhaustive search over a given
parameter grid to find the best hyperparameters.
sklearn.model_selection.cross_val_score(): Evaluates a model using cross-
validation.
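A short sketch of these three methods together on the Iris dataset (the k-NN model and the parameter grid are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training portion
knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv=5)

# Exhaustive search over the number of neighbours
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
print("CV mean:", scores.mean())
print("Best params:", grid.best_params_)
```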
3. Classification Methods
sklearn.neighbors.KNeighborsClassifier(): Implements the K-Nearest Neighbors classification algorithm.
sklearn.tree.DecisionTreeClassifier(): Builds a decision tree model for classification.
sklearn.svm.SVC(): Implements Support Vector Classification.
sklearn.naive_bayes.GaussianNB(): Implements the Naïve Bayes classifier for
normally distributed data.
4. Regression Methods
sklearn.linear_model.LinearRegression(): Performs simple and multiple linear
regression.
sklearn.linear_model.Lasso(): Implements Lasso regression for feature selection.
sklearn.ensemble.RandomForestRegressor(): Uses an ensemble of decision trees for regression tasks.
5. Clustering Methods
sklearn.cluster.KMeans(): Implements the K-Means clustering algorithm.
sklearn.cluster.AgglomerativeClustering(): Implements hierarchical clustering.
6. Model Evaluation Methods
sklearn.metrics.accuracy_score(): Computes accuracy for classification models.
sklearn.metrics.confusion_matrix(): Generates a confusion matrix for evaluating classification results.
sklearn.metrics.mean_squared_error(): Measures the mean squared error for regression models.
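These metrics can be computed directly on toy labels (the values below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Classification: compare true vs predicted labels
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy_score(y_true, y_pred))       # 0.8
print(confusion_matrix(y_true, y_pred))     # [[2 0]
                                            #  [1 2]]

# Regression: mean squared error between actual and predicted values
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))  # 0.25
```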
Common Use Cases of Scikit-Learn
Scikit-Learn is widely used for various machine learning applications, including:
Classification – Identifying categories or labels for given data (e.g., spam detection,
handwriting recognition).
Regression – Predicting continuous values (e.g., house price prediction, stock market
trends).
Clustering – Grouping similar data points together (e.g., customer segmentation,
anomaly detection).
Dimensionality Reduction – Reducing the number of input variables in data (e.g.,
Principal Component Analysis).
Model Selection and Evaluation – Finding the best-performing machine learning
model using cross-validation.
Example: Performing Regression with Scikit-Learn
Let’s implement a simple linear regression model using the California Housing dataset.
Step 1: Import Required Libraries
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
Step 2: Load the Dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
Step 3: Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
Step 4: Train the Model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Step 5: Make Predictions and Evaluate
y_pred = regressor.predict(X_test)
print("Model Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)
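As an extension of the steps above, the fitted model can also be scored with mean squared error. The sketch below uses scikit-learn's bundled load_diabetes regression dataset as a stand-in so it runs without a download; the same pattern applies to the California Housing split:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Bundled regression dataset, used here so no download is needed
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, regressor.predict(X_test))
print("Test MSE:", mse)
```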
DATA PREPROCESSING FOR BIG DATA ML PIPELINES
Data preprocessing is the first step in any data analysis or machine learning pipeline.
It involves cleaning, transforming and organizing raw data into a structured format to ensure
accuracy, consistency and readiness for modelling. This step improves data quality and
directly impacts the performance of analytical or predictive models.
Step-by-Step Implementation
Let's implement various preprocessing steps.
Step 1: Import Libraries and Load Dataset
We prepare the environment with libraries like pandas, numpy, scikit-learn, matplotlib and seaborn for data manipulation, numerical operations, visualization and scaling. Load the dataset for preprocessing.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('diabetes.csv')  # dataset path; the original download URL is elided in the source
df.head()
Step 2: Inspect Data Structure and Check Missing Values
We understand dataset size, data types and identify any incomplete (missing) data that
needs handling.
df.info(): Prints a concise summary including the count of non-null entries and the data type of each column.
df.isnull().sum(): Returns the number of missing values per column.
df.info()
print(df.isnull().sum())
Step 3: Statistical Summary and Visualizing Outliers
Get numeric summaries like mean, median, min/max and detect unusual points
(outliers). Outliers can skew models if not handled.
df.describe(): Computes count, mean, standard deviation, min/max and quartiles for numerical columns.
Boxplots: Visualize spread and detect outliers using matplotlib’s boxplot().
df.describe()
fig, axs = plt.subplots(len(df.columns), 1, figsize=(7, 18), dpi=95)
for i, col in enumerate(df.columns):
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
plt.tight_layout()
plt.show()
Step 4: Remove Outliers Using the Interquartile Range (IQR) Method
Remove extreme values beyond a reasonable range to improve model robustness.
IQR = Q3 (75th percentile) – Q1 (25th percentile).
Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.
Calculate lower and upper bounds for each column separately.
Filter data points to keep only those within bounds.
q1, q3 = np.percentile(df['Insulin'], [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
clean_df = df[(df['Insulin'] >= lower) & (df['Insulin'] <= upper)]
Step 5: Correlation Analysis
Understand relationships between features and the target variable (Outcome).
Correlation helps gauge feature importance.
df.corr(): Computes pairwise correlation coefficients between columns.
Heatmap via seaborn visualizes correlation matrix clearly.
Sorting correlations with corr['Outcome'].sort_values() highlights features most
correlated with the target.
corr = clean_df.corr()
plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
print(corr['Outcome'].sort_values(ascending=False))
Step 6: Visualize Target Variable Distribution
Check if target classes (Diabetes vs Not Diabetes) are balanced, affecting model
training and evaluation.
plt.pie(): Pie chart to display the proportion of each class in the target variable 'Outcome'.
plt.pie(clean_df['Outcome'].value_counts(), labels=[
    'Diabetes', 'Not Diabetes'], autopct='%.f%%', shadow=True)
plt.title('Outcome Proportionality')
plt.show()
Step 7: Separate Features and Target Variable
Prepare independent variables (features) and dependent variable (target) separately for
modeling.
clean_df.drop(columns=[...]): Drops the target column from the features.
Direct column selection clean_df['Outcome'] selects the target column.
X = clean_df.drop(columns=['Outcome'])
y = clean_df['Outcome']
Step 8: Feature Scaling: Normalization and Standardization
Scale features to a common range or distribution, important for many ML algorithms
sensitive to feature magnitudes.
1. Normalization (Min-Max Scaling): Rescales features between 0 and 1. Good for
algorithms like k-NN and neural networks.
Class: MinMaxScaler from sklearn.
.fit_transform(): Learns min/max from data and applies scaling.
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
print(X_normalized[:5])
2. Standardization: Transforms features to have mean = 0 and standard deviation = 1, useful
for normally distributed features.
Class: StandardScaler from sklearn.
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print(X_standardized[:5])
SUPERVISED LEARNING
Supervised learning is a type of machine learning where a model learns from labelled
data—meaning every input has a corresponding correct output. The model makes predictions
and compares them with the true outputs, adjusting itself to reduce errors and improve
accuracy over time. The goal is to make accurate predictions on new, unseen data. For
example, a model trained on images of handwritten digits can recognise new digits it has
never seen before.
While training the model, the data is usually split in an 80:20 ratio, i.e. 80% as training data and the rest as testing data. The training data contains both the inputs and their outputs, and the model learns from the training data only.
Working of Supervised Machine Learning
The working of supervised machine learning follows these key steps:
1. Collect Labeled Data
Gather a dataset where each input has a known correct output (label).
Example: Images of handwritten digits with their actual numbers as labels.
2. Split the Dataset
Divide the data into training data (about 80%) and testing data (about 20%).
The model will learn from the training data and be evaluated on the testing data.
3. Train the Model
Feed the training data (inputs and their labels) to a suitable supervised learning
algorithm (like Decision Trees, SVM or Linear Regression).
The model tries to find patterns that map inputs to correct outputs.
4. Validate and Test the Model
Evaluate the model using testing data it has never seen before.
The model predicts outputs and these predictions are compared with the actual labels
to calculate accuracy or error.
5. Deploy and Predict on New Data
Once the model performs well, it can be used to predict outputs for completely new,
unseen data.
CLASSIFICATION AND REGRESSION ON LARGE DATASETS
Supervised learning can be applied to two main types of problems:
Classification: Where the output is a categorical variable (e.g., spam vs. non-spam
emails, yes vs. no).
Regression: Where the output is a continuous variable (e.g., predicting house prices,
stock prices).
Classification on Large Datasets – Logistic Regression:
Logistic Regression is a supervised machine learning algorithm used for classification problems. Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class. It is used for binary classification, where the output can be one of two possible categories such as Yes/No, True/False or 0/1. It uses the sigmoid function to convert inputs into a probability value between 0 and 1.
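The sigmoid mapping described above can be sketched in a few lines (a minimal illustration, not a full model):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued input into the (0, 1) probability range."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
print(sigmoid(0.0))    # 0.5
print(sigmoid(-5.0))   # ~0.0067
print(sigmoid(5.0))    # ~0.9933

# A threshold of 0.5 converts the probability into a class label
print(int(sigmoid(2.0) >= 0.5))  # 1
```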
Types of Logistic Regression
Logistic regression can be classified into three main types based on the nature of the
dependent variable:
1. Binomial Logistic Regression: This type is used when the dependent variable has only
two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most
common form of logistic regression and is used for binary classification problems.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=23)
clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")
2. Multinomial Logistic Regression: This is used when the dependent variable has three or
more possible categories that are not ordered. For example, classifying animals into
categories like "cat," "dog" or "sheep." It extends the binary logistic regression to handle
multiple classes.
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
reg = linear_model.LogisticRegression(max_iter=10000, random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(f"Logistic Regression model accuracy: {metrics.accuracy_score(y_test, y_pred) * 100:.2f}%")
3. Ordinal Logistic Regression: This type applies when the dependent variable has three or
more categories with a natural order or ranking. Examples include ratings like "low,"
"medium" and "high." It takes the order of the categories into account when modeling.
How to Evaluate Logistic Regression Model?
Evaluating the logistic regression model helps assess its performance and ensure it
generalizes well to new, unseen data. The following metrics are commonly used:
1. Accuracy: Accuracy provides the proportion of correctly classified instances.
2. Precision: Precision focuses on the accuracy of positive predictions.
3. Recall (Sensitivity or True Positive Rate): Recall measures the proportion of correctly
predicted positive instances among all actual positive instances.
4. F1 Score: F1 score is the harmonic mean of precision and recall.
5. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The ROC
curve plots the true positive rate against the false positive rate at various thresholds. AUC-
ROC measures the area under this curve which provides an aggregate measure of a model's
performance across different classification thresholds.
6. Area Under the Precision-Recall Curve (AUC-PR): Similar to AUC-ROC, AUC-PR measures the area under the precision-recall curve, providing a summary of a model's performance across different precision-recall trade-offs.
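All of these metrics are available in scikit-learn; a small sketch with made-up labels and scores:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]                # hard class predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1]   # predicted probabilities for class 1

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))        # uses scores, not labels
print("AUC-PR   :", average_precision_score(y_true, y_score))
```

Note that the threshold-free metrics (AUC-ROC, AUC-PR) take the predicted scores, while the others take hard class labels.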
Regression on Large Datasets – Linear Regression:
Linear regression is a type of supervised machine-learning algorithm that learns from labelled datasets and fits the data points with the most optimized linear function, which can then be used for prediction on new datasets. It assumes that there is a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.
For example, suppose we want to predict a student's exam score based on how many hours they studied; we observe that as students study more hours, their scores go up. In this example:
Independent variable (input): Hours studied because it's the factor we control or observe.
Dependent variable (output): Exam score because it depends on how many hours were
studied.
We use the independent variable to predict the dependent variable.
Why Linear Regression is Important?
Simplicity and Interpretability: It’s easy to understand and interpret, making it a
starting point for learning about machine learning.
Predictive Ability: Helps predict future outcomes based on past data, making it
useful in various fields like finance, healthcare and marketing.
Basis for Other Models: Many advanced algorithms, like logistic regression or
neural networks, build on the concepts of linear regression.
Efficiency: It’s computationally efficient and works well for problems with a linear
relationship.
Widely Used: It’s one of the most widely used techniques in both statistics and
machine learning for regression tasks.
Analysis: It provides insights into relationships between variables (e.g., how much
one variable influences another).
Best Fit Line in Linear Regression
In linear regression, the best-fit line is the straight line that most accurately represents
the relationship between the independent variable (input) and the dependent variable (output).
It is the line that minimizes the difference between the actual data points and the predicted
values from the model.
1. Goal of the Best-Fit Line
The goal of linear regression is to find a straight line that minimizes the error (the
difference) between the observed data points and the predicted values. This line helps us
predict the dependent variable for new, unseen data.
Here Y is called a dependent or target variable and X is called an independent variable also
known as the predictor of Y. There are many types of functions or modules that can be used
for regression. A linear function is the simplest type of function. Here, X may be a single
feature or multiple features representing the problem.
2. Equation of the Best-Fit Line
For simple linear regression (with one independent variable), the best-fit line is represented
by the equation.
y=mx+b
Where:
y is the predicted value (dependent variable)
x is the input (independent variable)
m is the slope of the line (how much y changes when x changes)
b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and b (intercept) so that
the predicted y values are as close as possible to the actual data points.
3. Minimizing the Error: The Least Squares Method
To find the best-fit line, we use a method called Least Squares. The idea behind this
method is to minimize the sum of squared differences between the actual values (data points)
and the predicted values from the line. These differences are called residuals.
The formula for the residual of the i-th data point is:
ei = yi − ŷi
Where:
yi is the actual observed value
ŷi is the predicted value from the line for that xi.
The least squares method minimizes the sum of the squared residuals:
SSR = Σ (yi − ŷi)², summed over all data points i.
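The least-squares slope and intercept can be computed in closed form: m = cov(x, y) / var(x) and b = ȳ − m·x̄. The sketch below uses made-up study-hours data and checks the result against numpy's polyfit:

```python
import numpy as np

# Hours studied (x) and exam scores (y) -- illustrative numbers only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 68.0])

# Closed-form least squares: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

residuals = y - (m * x + b)
print("slope m:", m)         # 4.1
print("intercept b:", b)     # 47.7
print("sum of squared residuals:", np.sum(residuals ** 2))
```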
4. Interpretation of the Best-Fit Line
Slope (m): The slope of the best-fit line indicates how much the dependent variable
(y) changes with each unit change in the independent variable (x). For example if the
slope is 5, it means that for every 1-unit increase in x, the value of y increases by 5
units.
Intercept (b): The intercept represents the predicted value of y when x = 0. It’s the
point where the line crosses the y-axis.
In linear regression, some assumptions are made to ensure the reliability of the model's results.
Python Implementation of Linear Regression
1. Import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
2. Load the dataset and separate input and Target variables:
url = 'data_for_lr.csv'  # original download URL elided in the source
data = pd.read_csv(url)
data = data.dropna()

# First 500 rows for training, next 199 for testing
train_input = np.array(data.x[0:500]).reshape(500, 1)
train_output = np.array(data.y[0:500]).reshape(500, 1)
test_input = np.array(data.x[500:700]).reshape(199, 1)
test_output = np.array(data.y[500:700]).reshape(199, 1)
3. Build the Linear Regression Model and plot the regression line
class LinearRegression:
    def __init__(self):
        self.parameters = {}

    def forward_propagation(self, train_input):
        m = self.parameters['m']
        c = self.parameters['c']
        predictions = np.multiply(m, train_input) + c
        return predictions

    def cost_function(self, predictions, train_output):
        # Mean squared error between actual and predicted outputs
        cost = np.mean((train_output - predictions) ** 2)
        return cost

    def backward_propagation(self, train_input, train_output, predictions):
        derivatives = {}
        df = (predictions - train_output)
        dm = 2 * np.mean(np.multiply(train_input, df))
        dc = 2 * np.mean(df)
        derivatives['dm'] = dm
        derivatives['dc'] = dc
        return derivatives

    def update_parameters(self, derivatives, learning_rate):
        self.parameters['m'] = self.parameters['m'] - learning_rate * derivatives['dm']
        self.parameters['c'] = self.parameters['c'] - learning_rate * derivatives['dc']

    def train(self, train_input, train_output, learning_rate, iters):
        # Random negative initialization of slope and intercept
        self.parameters['m'] = np.random.uniform(0, 1) * -1
        self.parameters['c'] = np.random.uniform(0, 1) * -1
        self.loss = []

        fig, ax = plt.subplots()
        x_vals = np.linspace(min(train_input), max(train_input), 100)
        line, = ax.plot(x_vals, self.parameters['m'] * x_vals + self.parameters['c'],
                        color='red', label='Regression Line')
        ax.scatter(train_input, train_output, marker='o',
                   color='green', label='Training Data')
        ax.set_ylim(0, max(train_output) + 1)

        def update(frame):
            # One gradient-descent step per animation frame
            predictions = self.forward_propagation(train_input)
            cost = self.cost_function(predictions, train_output)
            derivatives = self.backward_propagation(train_input, train_output, predictions)
            self.update_parameters(derivatives, learning_rate)
            line.set_ydata(self.parameters['m'] * x_vals + self.parameters['c'])
            self.loss.append(cost)
            print("Iteration = {}, Loss = {}".format(frame + 1, cost))
            return line,

        ani = FuncAnimation(fig, update, frames=iters, interval=200, blit=True)
        ani.save('linear_regression_A.gif', writer='ffmpeg')
        plt.xlabel('Input')
        plt.ylabel('Output')
        plt.title('Linear Regression')
        plt.legend()
        plt.show()
        return self.parameters, self.loss
4. Train the model and make the final prediction:
linear_reg = LinearRegression()
parameters, loss = linear_reg.train(train_input, train_output,
0.0001, 20)
UNSUPERVISED LEARNING
Unsupervised learning is a type of machine learning that analyzes and models data without
labelled responses or predefined categories. Unlike supervised learning, where the algorithm
learns from input-output pairs, unsupervised learning algorithms work solely with input data
and aim to discover hidden patterns, structures or relationships within the dataset
independently, without any human intervention or prior knowledge of the data's meaning.
Working of Unsupervised Learning
The working of unsupervised machine learning can be explained in these steps:
1. Collect Unlabeled Data
Gather a dataset without predefined labels or categories.
Example: Images of various animals without any tags.
2. Select an Algorithm
Choose a suitable unsupervised algorithm, such as clustering (e.g., K-Means), association rule learning (e.g., Apriori) or dimensionality reduction (e.g., PCA), based on the goal.
3. Train the Model on Raw Data
Feed the entire unlabeled dataset to the algorithm.
The algorithm looks for similarities, relationships or hidden structures within the data.
4. Group or Transform Data
The algorithm organizes data into groups (clusters), rules or lower-dimensional forms
without human input.
Example: It may group similar animals together or extract key patterns from large datasets.
5. Interpret and Use Results
Analyze the discovered groups, rules or features to gain insights or use them for
further tasks like visualization, anomaly detection or as input for other models.
Unsupervised Learning Algorithms
1. Clustering Algorithms
2. Association Rule Learning
3. Dimensionality Reduction
CLUSTERING AND DIMENSIONALITY REDUCTION
Clustering Algorithms:
Clustering is an unsupervised machine learning technique that groups unlabeled data
into clusters based on similarity. Its goal is to discover patterns or relationships within the
data without any prior knowledge of categories or labels.
Groups data points that share similar features or characteristics.
Helps find natural groupings in raw, unclassified data.
Commonly used for customer segmentation, anomaly detection and data organization.
Works purely from the input data without any output labels.
Enables understanding of data structure for further analysis or decision-making.
Some common clustering algorithms:
K-means Clustering: Groups data into K clusters based on how close the points are
to each other.
Hierarchical Clustering: Creates clusters by building a tree step-by-step, either
merging or splitting groups.
Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats
scattered points as noise.
Mean-Shift Clustering: Discovers clusters by moving points toward the most
crowded areas.
Spectral Clustering: Groups data by analyzing connections between points using
graphs.
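A minimal K-Means sketch on toy 2-D points (the data and the choice of two clusters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# The algorithm discovers the groups with no labels provided
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # each point's cluster assignment
print(kmeans.cluster_centers_)  # discovered group centres
```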
Dimensionality Reduction:
Dimensionality reduction is the process of decreasing the number of features or
variables in a dataset while retaining as much of the original information as possible. This
technique helps simplify complex data making it easier to analyze and visualize. It also
improves the efficiency and performance of machine learning algorithms by reducing noise
and computational cost.
It reduces the dataset’s feature space from many dimensions to fewer, more
meaningful ones.
Helps focus on the most important traits or patterns in the data.
Commonly used to improve model speed and reduce overfitting.
Here are some popular Dimensionality Reduction algorithms:
Principal Component Analysis (PCA): Reduces dimensions by transforming data
into uncorrelated principal components.
Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class
separability for classification tasks.
Non-negative Matrix Factorization (NMF): Breaks data into non-negative parts to
simplify representation.
Locally Linear Embedding (LLE): Reduces dimensions while preserving the
relationships between nearby points.
Isomap: Captures global data structure by preserving distances along a manifold.
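A minimal PCA sketch (the synthetic correlated data is illustrative): three features are constructed so that most of the variance lies along a single direction, and PCA recovers that structure when projecting to two dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

# 3 correlated features; most variance lies along one direction
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t,
               2 * t + 0.01 * rng.normal(size=(100, 1)),
               0.05 * rng.normal(size=(100, 1))])

# Project from 3 dimensions down to 2 while keeping most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # first component dominates
```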
MODEL EVALUATION AND VALIDATION TECHNIQUES
Model evaluation and validation are crucial stages in the machine learning workflow
to ensure a model performs reliably and accurately on unseen data. These techniques prevent
a model from memorizing training data (overfitting) and provide an honest estimate of its
real-world performance.
1. Cross-Validation
Cross-validation ensures that the model is tested on multiple subsets of data making it
less likely to overfit and improving its generalization ability.
(a) Holdout Method
In the Holdout method, the dataset is split into train and test sets (commonly 70:30 or
80:20). Let's implement it, where:
load_iris() loads the Iris dataset (flower measurements with 3 species).
train_test_split() divides data into training and testing sets.
test_size=0.20: 20% for testing, 80% for training.
random_state=42 makes results reproducible
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42
)
print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))
(b) K-Fold Cross-Validation
In K-Fold Cross-Validation the dataset is divided into k folds. Each fold is used once
as a test set while the model is trained on the remaining k-1 folds. Let's implement it, where:
DecisionTreeClassifier(): A decision tree model is created.
KFold(n_splits=5): Data is divided into 5 folds.
cross_val_score(): Runs training/testing across folds.
scores: Accuracy for each fold.
scores.mean(): Average accuracy across all folds.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print("Cross-validation scores:", scores)
print("Average CV Score:", scores.mean())
2. Evaluation Metrics for Classification Tasks
Classification models assign inputs to predefined labels. Their performance can be
measured using accuracy, precision, recall, F1 score, confusion matrix and AUC-ROC.
We’ll demonstrate these metrics using a Decision Tree Classifier on the Iris dataset.
Step 1: Importing Libraries, Loading Dataset, Splitting Dataset
Importing libraries like pandas, numpy, matplotlib and scikit-learn.
Loading Iris dataset with flower measurements.
Splitting into 80% training, 20% testing.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
precision_score, recall_score, f1_score, accuracy_score,
confusion_matrix, ConfusionMatrixDisplay, roc_auc_score
)
import matplotlib.pyplot as plt
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=20
)
Step 2: Training Model
DecisionTreeClassifier(): Creates a decision tree model.
.fit(X_train, y_train): Trains the model on training data.
.predict(X_test): Generates predictions on test data.
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
Step 3: Accuracy
We will calculate the accuracy:
accuracy_score() computes the proportion of correct predictions.
print("Accuracy:", accuracy_score(y_test, y_pred))
Step 4: Precision and Recall
Precision measures how many predicted positives are actually positive.
Focuses on the correctness of positive predictions.
High precision: few false positives.
Recall measures how many actual positives are correctly predicted.
Focuses on capturing all positives.
High recall: few false negatives.
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall:", recall_score(y_test, y_pred, average="weighted"))
Step 5: F1 Score
We will calculate the F1 score, which is the harmonic mean of precision and recall.
Balances both metrics.
Combines precision and recall into one metric.
Useful when we need a balance between false positives and false negatives.
print("F1 score:", f1_score(y_test, y_pred, average="weighted"))
Step 6: Confusion Matrix
We will create a confusion matrix:
confusion_matrix(): Creates matrix of actual vs predicted values.
Each cell shows correct/misclassified predictions.
cm = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(
confusion_matrix=cm, display_labels=[0, 1, 2])
cm_display.plot()
plt.show()
Step 7: AUC-ROC Curve
TPR: True Positive Rate
FPR: False Positive Rate
AUC: Area under curve; higher is better (0.5 random, 1 perfect).
y_true = [1, 0, 0, 1]
y_scores = [1, 0, 0.9, 0.2]
auc = np.round(roc_auc_score(y_true, y_scores), 3)
print("AUC:", auc)
3. Evaluation Metrics for Regression Tasks
Regression predicts continuous values (e.g., temperature). We use error-based metrics
to measure accuracy.
Step 1: Importing Data and Training Model
Dataset contains Temperature (independent variable) and Relative Humidity
(dependent variable).
Data split into training (80%) and testing (20%).
LinearRegression().fit() trains the regression model.
Predictions are stored in Y_pred.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
)
df = pd.read_csv('weather_data.csv')  # hypothetical filename; the source file is not named in the notes
X, Y = df.iloc[:, 2].values.reshape(-1, 1), df.iloc[:, 3].values
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=0.20, random_state=0
)
regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)
Step 2: Mean Absolute Error (MAE)
Mean Absolute Error is the average absolute difference between actual and predicted values.
mae = mean_absolute_error(Y_test, Y_pred)
print("MAE:", mae)
Step 3: Mean Squared Error (MSE)
We will calculate the Mean Squared Error, which is the average squared difference
between predicted and actual values.
Penalizes larger errors more than MAE.
Commonly used for regression model loss functions.
mse = mean_squared_error(Y_test, Y_pred)
print("MSE:", mse)
Step 4: Root Mean Squared Error (RMSE)
We will calculate RMSE, which is the square root of MSE.
Brings error back to the same units as the target variable, making it easier to
interpret than MSE.
Still penalizes larger errors more than MAE.
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))  # works across scikit-learn versions
print("RMSE:", rmse)
DISTRIBUTED TRAINING
Distributed training is the practice of training machine learning models on large
datasets by distributing the computational workload across multiple machines or devices.
This approach is essential in big data analytics to overcome the limitations of a single
machine, such as insufficient memory and excessively long training times. It enables faster
experimentation and the development of larger, more complex models.
Core parallelization strategies
1. Data parallelism
This is the most common and straightforward method, best suited for models that fit
within a single machine's memory but require vast datasets for training.
How it works: The training dataset is divided into smaller chunks, and each computing node
(or worker) receives a copy of the full model. Each worker trains its model copy on its unique
subset of data.
Synchronization: After each training iteration, the workers' model updates (gradients) are
aggregated, typically by averaging. The aggregated parameters are then used to update each
worker's model, ensuring consistency.
Use cases: Training large computer vision models like convolutional neural networks
(CNNs) on datasets like ImageNet, where the model size is manageable but the data volume
is enormous.
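The gradient-averaging loop of data parallelism can be sketched in plain NumPy. The workers below are simulated sequentially in one process; a real system would run each on its own machine and aggregate the gradients over the network:

```python
import numpy as np

np.random.seed(0)

# Full dataset for a linear model y = X @ w_true
X = np.random.randn(400, 3)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

n_workers = 4
shards_X = np.array_split(X, n_workers)  # each worker gets a data shard
shards_y = np.array_split(y, n_workers)

w = np.zeros(3)  # every worker holds an identical model copy
lr = 0.1
for step in range(200):
    # Each worker computes a gradient on its own shard
    grads = [2 * Xi.T @ (Xi @ w - yi) / len(yi)
             for Xi, yi in zip(shards_X, shards_y)]
    # Aggregation step: average the gradients, update the shared parameters
    w -= lr * np.mean(grads, axis=0)

print("Recovered weights:", w)
```

Because every worker applies the same averaged update, all model copies stay identical, which is the defining consistency property of synchronous data parallelism.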
2. Model parallelism
This approach is used when a model is too large to fit into the memory of a single
device.
How it works: The model itself is partitioned across different computing nodes, with each
node holding and training different layers or sections of the model. Data flows sequentially
through the network of machines during the forward and backward passes.
Challenges: The implementation is more complex due to the need for careful partitioning to
balance the computational load and manage communication between nodes.
Use cases: Training massive models like large language models (LLMs), which can have
hundreds of billions of parameters that cannot fit on a single GPU.
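A toy sketch of the idea, with two layer "stages" held by separate Python objects standing in for separate devices (in practice a framework would move the activations between machines over the network):

```python
import numpy as np

# A tiny 2-layer network whose layers live on different "devices"
# (here, plain Python objects standing in for separate machines).
class Stage:
    def __init__(self, in_dim, out_dim, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.1

    def forward(self, x):
        return np.tanh(x @ self.W)

stage1 = Stage(8, 16, seed=1)  # would reside on device 0
stage2 = Stage(16, 4, seed=2)  # would reside on device 1

x = np.ones((5, 8))
# Forward pass flows sequentially: device 0 -> device 1
h = stage1.forward(x)   # activations would be sent over the network
out = stage2.forward(h)
print("Output shape:", out.shape)
```

The sequential dependency between stages is exactly the communication cost the text mentions: device 1 cannot start until device 0's activations arrive.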
3. Hybrid parallelism
A combination of data and model parallelism, this strategy is used for training
exceptionally large and complex models on vast datasets. It applies data parallelism to some
layers and model parallelism to others to achieve the best performance.
Communication architectures
To coordinate the updates between workers in a distributed setting, two main
architectures are used:
1. Parameter server (Centralized)
How it works: A set of nodes, called parameter servers, stores and manages the model's
parameters. Worker nodes pull the latest parameters from the servers, compute gradient
updates on their local data, and push the new gradients back to the servers. The servers then
aggregate the gradients and update the central parameters.
Pros: Conceptually simple to implement.
Cons: The central server can become a bottleneck, especially with a large number of
workers, as it must handle all the communication traffic.
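A minimal single-process sketch of the parameter-server pattern (the class and method names are illustrative, not from any real library):

```python
import numpy as np

class ParameterServer:
    """Holds the canonical parameters; aggregates pushed gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.buffer = []

    def pull(self):
        return self.w.copy()

    def push(self, grad):
        self.buffer.append(grad)

    def apply_updates(self):
        # Average all pushed gradients, then update the central parameters
        self.w -= self.lr * np.mean(self.buffer, axis=0)
        self.buffer = []

# Two "workers" fitting y = 3x with squared loss on their own data
server = ParameterServer(dim=1)
worker_data = [(np.array([1.0]), np.array([3.0])),
               (np.array([2.0]), np.array([6.0]))]

for step in range(100):
    for x, y in worker_data:        # each worker, in turn:
        w = server.pull()           # 1. pull the latest parameters
        grad = 2 * x * (x * w - y)  # 2. compute a local gradient
        server.push(grad)           # 3. push the gradient to the server
    server.apply_updates()          # server aggregates and updates

print("Learned weight:", server.w)
```

The pull/push/aggregate cycle is the whole pattern; the bottleneck arises because every worker's traffic passes through the one server object.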
2. All-Reduce (Decentralized)
How it works: Each worker holds a full replica of the model and communicates directly with
the other workers to collectively compute the average of the gradients. This is often
implemented in a ring-based topology to optimize communication.
Pros: Avoids the bottleneck of a central server, leading to better scalability for systems with
high-bandwidth networks.
Cons: The communication overhead increases with the number of workers.
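The ring-based variant can be simulated in NumPy. The sketch below performs the classic two phases, reduce-scatter followed by all-gather, with "sending" reduced to array copies within one process:

```python
import numpy as np

def ring_allreduce(tensors):
    """Simulated ring all-reduce: every worker ends up with the element-wise
    sum of all workers' tensors. A real system would transfer the chunks
    over the network instead of copying arrays."""
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for s in range(n - 1):
        received = [chunks[(i - 1) % n][(i - 1 - s) % n].copy()
                    for i in range(n)]
        for i in range(n):
            chunks[i][(i - 1 - s) % n] += received[i]

    # Phase 2: all-gather. The completed chunks travel around the ring so
    # that every worker ends with every summed chunk.
    for s in range(n - 1):
        received = [chunks[(i - 1) % n][(i - s) % n].copy()
                    for i in range(n)]
        for i in range(n):
            chunks[i][(i - s) % n] = received[i]

    return [np.concatenate(c) for c in chunks]

grads = [np.full(6, float(i)) for i in range(4)]  # worker i's "gradient"
summed = ring_allreduce(grads)
print(summed[0])  # every worker holds 0+1+2+3 = 6 in every slot
```

Each worker only ever talks to its ring neighbor, so per-worker bandwidth stays constant as workers are added, which is the scalability advantage over a central server.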
Communication and synchronization strategies
When coordinating model updates, algorithms can follow a synchronous or
asynchronous approach.
1. Synchronous communication
How it works: All workers operate in lockstep. In each iteration, workers train on their data
subset and send their updates. A "synchronization barrier" prevents any worker from starting
the next iteration until all workers have completed the current one.
Pros: Guarantees strong model consistency and convergence.
Cons: The entire process is slowed down by the slowest worker, also known as a "straggler".
2. Asynchronous communication
How it works: Workers operate independently. A worker can send and receive model
updates without waiting for all other workers to finish their current iteration.
Pros: Avoids the problem of stragglers and makes more efficient use of computational
resources.
Cons: Can suffer from "stale gradients," where a fast worker uses outdated parameter updates
from a slow worker, potentially degrading convergence.
Tools and frameworks
Many modern frameworks are designed to manage the complexities of distributed training:
TensorFlow: Provides the tf.distribute.Strategy API for distributing training
across different devices and machines, including MultiWorkerMirroredStrategy
for synchronous data parallelism.
PyTorch: Offers DistributedDataParallel (DDP) for efficient data-parallel
training and uses the torch.distributed package to handle communication and
synchronization.
Apache Spark: Used for large-scale data processing, with libraries like BigDL that
enable distributed deep learning on Spark clusters.
Cloud Services: Platforms like Google's Vertex AI offer managed distributed training
services that simplify infrastructure setup, scaling, and resource management.
OPTIMIZATION TECHNIQUES
Optimization is the process of adjusting a machine learning model's parameters to
minimize the error or loss function. The goal is to find the set of parameters that allows the
model to produce the most accurate results.
Optimization techniques fall into several categories, including first-order optimizers,
advanced methods for deep learning, and techniques for hyperparameter tuning.
First-order optimization algorithms
These are iterative algorithms that rely on the first derivative (gradient) of the loss
function to navigate toward the minimum.
Gradient Descent (GD): A foundational algorithm that computes the gradient of the entire
training dataset to find the direction of steepest descent. It then takes a step in the opposite
direction of the gradient, scaled by a learning rate, to find the minimum.
Stochastic Gradient Descent (SGD): An extension of GD that uses a single, randomly
selected training example to compute the gradient and update the parameters in each iteration.
This makes it much faster and more memory-efficient for large datasets, though the
updates can be noisy.
Mini-Batch Gradient Descent: A compromise between GD and SGD. It computes the
gradient on a small, random subset (a "mini-batch") of the training data. This offers a balance
between the computational efficiency of SGD and the stability of GD.
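The three variants differ only in how much data feeds each gradient step. A NumPy sketch of mini-batch gradient descent on synthetic linear-regression data:

```python
import numpy as np

np.random.seed(1)

# Synthetic linear-regression data: y = X @ w_true + noise
X = np.random.randn(500, 2)
w_true = np.array([3.0, -2.0])
y = X @ w_true + 0.01 * np.random.randn(500)

w = np.zeros(2)
lr, batch_size = 0.1, 32

for epoch in range(50):
    idx = np.random.permutation(len(X))        # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # one mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient on the batch
        w -= lr * grad                         # parameter update

print("Estimated weights:", w)
```

Setting batch_size to len(X) recovers full-batch GD, and setting it to 1 recovers SGD, which makes the trade-off between noise and cost explicit.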
Advanced and adaptive optimizers
Built on the principles of gradient descent, these optimizers introduce more
sophisticated mechanisms to accelerate convergence and handle complex loss landscapes
common in deep learning.
Gradient Descent with Momentum: Accelerates SGD in the right direction and dampens
oscillations. It does this by adding a fraction of the previous update vector to the current
update vector, building up momentum in consistent directions.
RMSProp (Root Mean Square Propagation): An adaptive learning rate method that
addresses the issue of rapidly diminishing learning rates in earlier algorithms like AdaGrad. It
uses a moving average of squared gradients to normalize the learning rate for each parameter,
stabilizing the training process.
Adam (Adaptive Moment Estimation): One of the most popular optimizers today, Adam
combines the best features of Momentum and RMSProp. It uses moving averages of both the
gradients and the squared gradients to compute an adaptive learning rate for each parameter,
offering fast and stable convergence.
AdamW: A variant of Adam that separates weight decay regularization from the
optimization step. This often leads to better generalization performance, particularly for
transformer models and other advanced deep learning architectures.
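The Adam update rule can be written out directly in NumPy (a sketch of the standard formulas, not a production optimizer):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradients (m) and squared
    gradients (v), with bias correction for the early steps."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 5)^2 starting from w = 0
w, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 3001):
    grad = 2 * (w - 5.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)

print("Minimum found near:", w)
```

The m term is the Momentum idea and the v term is the RMSProp idea; dividing one by the square root of the other gives each parameter its own effective learning rate.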
Hyperparameter optimization
These techniques are used to find the best configuration of a model's hyperparameters,
which are the settings that are not learned from data.
Grid Search: Systematically evaluates a model's performance for all possible combinations
of a predefined set of hyperparameter values. It is thorough but can be computationally very
expensive.
Random Search: A more efficient alternative to grid search that randomly samples
hyperparameter values from a specified distribution. It is often more effective, especially
when only a few hyperparameters are critical to performance.
Bayesian Optimization: A probabilistic model-based approach that is more sample-efficient
than grid or random search. It uses information from past evaluations to intelligently select
the next set of hyperparameters to test, making it ideal for expensive or complex objective
functions.
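A short scikit-learn example of grid search with 5-fold cross-validation on the Iris dataset (the parameter grid itself is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grid search: try every combination of the listed values with 5-fold CV
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

Swapping GridSearchCV for RandomizedSearchCV (with a distribution per hyperparameter and an n_iter budget) gives the random-search alternative with the same interface.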
Regularization techniques
Regularization methods are optimization techniques that are used to prevent
overfitting, which is when a model performs well on training data but poorly on new, unseen
data.
L1 Regularization (Lasso): Adds a penalty equal to the absolute value of the magnitude of
the coefficients to the loss function. It can force some coefficients to become zero, effectively
removing certain features from the model.
L2 Regularization (Ridge): Adds a penalty equal to the squared magnitude of the
coefficients to the loss function. This encourages smaller, smoother coefficients, which helps
to prevent overfitting.
Dropout: A technique used in neural networks where a random subset of neurons is
temporarily "dropped out" during each training iteration. This prevents the network from
relying too heavily on any single neuron and encourages a more robust, generalized model.
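The contrast between L1 and L2 can be seen with scikit-learn's Lasso and Ridge on synthetic data where only two of ten features matter (the alpha values are chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

np.random.seed(0)

# 10 features, but only the first 2 actually influence the target
X = np.random.randn(100, 10)
y = X[:, 0] * 3.0 + X[:, 1] * (-2.0) + 0.1 * np.random.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

# L1 drives irrelevant coefficients exactly to zero; L2 only shrinks them
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```

The exact zeros are what make Lasso useful for feature selection, while Ridge keeps every feature with a smaller weight.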
Other optimization approaches
Evolutionary Algorithms (e.g., Genetic Algorithms): Inspired by natural selection, these
algorithms evolve a population of candidate solutions over generations using genetic
operators like mutation and crossover. They are useful for complex optimization problems
that are difficult to solve with traditional methods.
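A toy genetic-algorithm sketch (the population size, mutation scale, and objective are all illustrative):

```python
import random

random.seed(0)

# Maximize f(x) = -(x - 7)^2 over [0, 15] with a tiny genetic algorithm
def fitness(x):
    return -(x - 7) ** 2

population = [random.uniform(0, 15) for _ in range(20)]

for generation in range(100):
    # Selection: keep the fittest half of the population
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # Crossover: each child blends two randomly chosen parents
    children = [(random.choice(parents) + random.choice(parents)) / 2
                for _ in range(10)]
    # Mutation: small random perturbation keeps diversity
    children = [c + random.gauss(0, 0.1) for c in children]
    population = parents + children

best = max(population, key=fitness)
print("Best solution:", best)
```

Selection, crossover, and mutation are the three genetic operators the text names; real problems encode candidates as bit strings or parameter vectors rather than a single number.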
Simulated Annealing: A stochastic optimization technique that draws an analogy to the
process of annealing in metallurgy. It explores the search space with a high degree of
randomness initially (high temperature), which is gradually reduced (low temperature) to
converge toward a better solution while avoiding local minima.
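A minimal simulated-annealing sketch (the cooling schedule and the objective function are illustrative):

```python
import math
import random

random.seed(42)

def simulated_annealing(f, x0, temp=10.0, cooling=0.99, steps=5000):
    """Minimize f by random exploration whose willingness to accept
    worse moves shrinks as the 'temperature' cools."""
    x, best = x0, x0
    for _ in range(steps):
        candidate = x + random.uniform(-1, 1)  # random neighbor
        delta = f(candidate) - f(x)
        # Always accept improvements; accept worse moves with probability
        # exp(-delta / temp), which falls as the temperature decreases
        if delta < 0 or random.random() < math.exp(-delta / temp):
            x = candidate
        if f(x) < f(best):
            best = x
        temp *= cooling  # cooling schedule
    return best

# A function with several local minima; the global minimum is at x = 0
f = lambda x: x * x + 10 * math.sin(x) ** 2
best = simulated_annealing(f, x0=8.0)
print("Best x found:", best)
```

Early on the high temperature lets the search jump out of local minima; by the end it behaves like greedy descent, which is the metallurgy analogy in action.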
*****************