Machine Learning with MLlib & Scikit-learn

Unit III covers machine learning on big data, focusing on MLlib and Scikit-learn for data preprocessing, supervised and unsupervised learning, and model evaluation techniques. MLlib, built on Apache Spark, offers scalable machine learning capabilities, while Scikit-learn provides a user-friendly interface for implementing various algorithms. The document also discusses data preprocessing steps and the importance of preparing data for effective machine learning model training.


UNIT III: MACHINE LEARNING ON BIG DATA

Introduction to MLlib and Scikit-learn, Data Preprocessing for Big Data ML


Pipelines, Supervised Learning: Classification and Regression on Large
Datasets, Unsupervised Learning: Clustering and Dimensionality Reduction,
Model Evaluation and Validation Techniques, Distributed Training and
Optimization Techniques.

INTRODUCTION TO MLLIB

MLlib is a machine learning framework built on top of Apache Spark. It is designed


to make machine learning tasks faster and more efficient. MLlib supports various ML
algorithms and utilities.

Core Features of MLlib


 MLlib is a scalable, easy-to-use and comprehensive library.
 Algorithms such as classification, regression, clustering, collaborative filtering, etc., can be implemented with it.
 Its API is similar to scikit-learn's, which streamlines the process of building and tuning machine learning workflows.
 Includes feature selection, extraction, scaling, etc.
 Provides tools to evaluate models with accuracy, precision, etc.
 It provides two main APIs: the RDD-based API and the DataFrame-based API.

Workflow of MLlib Models


1. Data Ingestion: Load data using Spark DataFrames.
2. Data Preprocessing: Cleaning, handling missing values, feature selection and
transformation.
3. Model Selection: Choose a machine learning algorithm based on data.
4. Training: Fit the model on training data.
5. Prediction: Use the trained model to make predictions on test or new data.
6. Evaluation: Evaluate model performance using various metrics.
7. Pipeline Deployment: Build a pipeline combining all steps.
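The seven steps above can be sketched end-to-end. Since a Spark cluster is not always available, the sketch below mirrors the workflow with scikit-learn (covered later in this unit); in MLlib the analogous pieces are Spark DataFrames, Transformer/Estimator stages and pyspark.ml's Pipeline. The dataset and stage names here are illustrative choices, not part of MLlib itself.

```python
# Sketch of the ingest -> preprocess -> train -> predict -> evaluate workflow.
# Runs without Spark; MLlib's Pipeline chains the same kinds of stages.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # 1. data ingestion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([                                      # 7. pipeline combining all steps
    ("scale", StandardScaler()),                       # 2. data preprocessing
    ("model", LogisticRegression(max_iter=1000)),      # 3. model selection
])
pipe.fit(X_tr, y_tr)                                   # 4. training
preds = pipe.predict(X_te)                             # 5. prediction
acc = accuracy_score(y_te, preds)                      # 6. evaluation
print(f"accuracy: {acc:.2f}")
```

Bundling preprocessing and the model into one pipeline object is what makes the final deployment step a single artifact rather than a sequence of loose scripts.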

Major Algorithms in MLlib

1. Classification Models
 Used to categorize data into predefined labels.
 Used in Email spam detection
 Handle binary and multiclass problems
2. Regression Models
 Used to predict continuous values.
 Used in House price prediction
 Minimize error between predicted and actual values
3. Clustering Models
 Used to group data points without labels.
 Used in Customer segmentation
 Unsupervised learning based on similarity
4. Recommendation Algorithms
 Used for personalized content delivery.
 Used in Movie recommendations
 Collaborative filtering based on user-item interactions
5. Dimensionality Reduction
 Used to reduce feature space while preserving data variance.
 Used in Visualization or preprocessing
 Helps in improving performance and reducing noise
6. Feature Transformation
 Essential for preparing raw data into a usable format.
 Encoding categorical variables and scaling features
 Required before model training
MLlib simplifies large-scale machine learning by combining Spark with ML algorithms. It
allows seamless integration of preprocessing, training and evaluation in one environment and
is ideal for production-level systems where both scalability and speed are crucial.

Implementation of MLlib
Here we create a logistic regression model using PySpark, train it on sample data and then make predictions on the same data, displaying the features, labels and predictions.

 SparkSession.builder...getOrCreate(): Initializes a Spark session, which is the entry point for Spark functionality.
 spark.createDataFrame(): Creates a DataFrame from a list of tuples, where each tuple contains a label and features represented as a dense vector.
 LogisticRegression(): The LogisticRegression model from MLlib is initialized and then trained on the sample data using .fit().
 lr.fit(data): Trains the logistic regression model using the input data (label and features).
 model.transform(data): Applies the trained model to the data to make predictions.
 predictions.select(...): Selects specific columns (features, label and prediction) from the predictions DataFrame to display the results.

Example:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Start session
spark = SparkSession.builder.appName("BasicMLlib").getOrCreate()

# Sample DataFrame
data = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1, 0.1])),
    (1.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))
], ["label", "features"])

# Train logistic regression model
lr = LogisticRegression()
model = lr.fit(data)

# Prediction
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()

OUTPUT: a table of the features, label and prediction columns (screenshot not preserved in the source).
Real-World Use Case


 Predictive analytics of diseases based on patient data: Using machine learning models
to analyze historical patient data and predict the likelihood of diseases allowing for early
intervention and better healthcare planning.
 Spam detection and sentiment analysis: It helps to classify emails or messages as spam
or non-spam and analyze the sentiment of text to understand whether it is positive,
negative or neutral.
 Fraud detection in the finance sector: It can identify unusual patterns in transactions,
helping detect fraudulent activities such as credit card fraud or identity theft in real-time.
 Real-time product recommendation in e-commerce: By analyzing user behavior,
preferences and purchase history it can suggest products to users in real-time, enhancing
the shopping experience and increasing sales.
 Customer segmentation for marketing: It helps segment customers based on their
behavior, preferences and demographics enabling businesses to tailor their marketing
strategies and improve customer engagement.

Strengths of MLlib
 Comprehensive ML support: It provides a wide range of algorithms for
classification, regression, clustering, recommendation and more making it suitable for
various machine learning tasks.
 Scalable and fault-tolerant: Designed to handle large-scale datasets across
distributed systems. It ensures fault tolerance so processes continue smoothly even in
case of hardware failures.
 Simplified API for common ML tasks: MLlib offers an easy-to-use API that
abstracts the complexity of distributed computing and makes it easier for users to
implement common machine learning tasks like training models and making
predictions.
 Good speed: It uses Apache Spark for distributed processing, providing high-speed
model training and inference making it suitable for big data applications.
 Easy integration: It integrates seamlessly with Spark’s data processing framework
allowing it to work well with Spark’s other components like Spark SQL, Spark
Streaming and ML pipelines.

INTRODUCTION TO SCIKIT-LEARN

Scikit-Learn (also known as sklearn) is an open-source Python library designed to


simplify the implementation of machine learning models. It provides a wide range of tools for
data preprocessing, model selection, and evaluation, making it a preferred choice for
beginners and professionals alike.

Key Features of Scikit-Learn:


 Simple and Consistent API: Provides a unified interface for all machine learning
algorithms.
 Efficient Implementation: Built on top of optimized scientific libraries like NumPy
and SciPy.
 Wide Range of Algorithms: Includes classification, regression, clustering, and
dimensionality reduction techniques.
 Built-in Data Preprocessing Tools: Offers methods for handling missing values,
feature scaling, and encoding categorical variables.
 Model Evaluation and Selection: Supports cross-validation, hyperparameter tuning,
and performance metrics.

Installing Scikit-Learn
To install Scikit-Learn, use the following command:
pip install scikit-learn

Methods in Scikit-Learn
Scikit-Learn provides various methods that make machine learning model development
easier. Some commonly used methods include:

1. Data Preprocessing Methods


 sklearn.preprocessing.StandardScaler(): Standardizes features by removing the
mean and scaling to unit variance.
 sklearn.preprocessing.MinMaxScaler(): Scales features to a given range (default 0
to 1).
 sklearn.preprocessing.LabelEncoder(): Encodes categorical labels as integers.
 sklearn.impute.SimpleImputer(): Handles missing values by replacing them with
mean, median, or most frequent values.
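A minimal, hand-checkable sketch of these four utilities on a toy array (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 600.0]])

X_imputed = SimpleImputer(strategy='mean').fit_transform(X)   # NaN -> column mean (400.0)
X_std = StandardScaler().fit_transform(X_imputed)             # each column: mean 0, unit variance
X_minmax = MinMaxScaler().fit_transform(X_imputed)            # each column rescaled to [0, 1]
labels = LabelEncoder().fit_transform(['cat', 'dog', 'cat'])  # categorical -> integer codes
print(labels)
```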

2. Model Selection Methods


 sklearn.model_selection.train_test_split(): Splits data into training and test sets.
 sklearn.model_selection.GridSearchCV(): Performs exhaustive search over a given
parameter grid to find the best hyperparameters.
 sklearn.model_selection.cross_val_score(): Evaluates a model using cross-
validation.
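As a small illustration of these three methods together (the dataset, parameter grid and fold count are arbitrary choices for the sketch):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Exhaustively search over n_neighbors using 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)

# Cross-validated accuracy of the best estimator on the training set
scores = cross_val_score(grid.best_estimator_, X_train, y_train, cv=5)
print("mean CV accuracy:", scores.mean())
```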

3. Classification Methods
 sklearn.neighbors.KNeighborsClassifier(): Implements the K-Nearest Neighbors
classification algorithm.
 sklearn.tree.DecisionTreeClassifier(): Builds a decision tree model for
classification.
 sklearn.svm.SVC(): Implements Support Vector Classification.
 sklearn.naive_bayes.GaussianNB(): Implements the Naïve Bayes classifier for
normally distributed data.

4. Regression Methods
 sklearn.linear_model.LinearRegression(): Performs simple and multiple linear
regression.
 sklearn.linear_model.Lasso(): Implements Lasso regression for feature selection.
 sklearn.ensemble.RandomForestRegressor(): Uses an ensemble of decision trees
for regression tasks.

5. Clustering Methods
 sklearn.cluster.KMeans(): Implements the K-Means clustering algorithm.
 sklearn.cluster.AgglomerativeClustering(): Implements hierarchical clustering.

6. Model Evaluation Methods


 sklearn.metrics.accuracy_score(): Computes accuracy for classification models.
 sklearn.metrics.confusion_matrix(): Generates a confusion matrix for evaluating
classification results.
 sklearn.metrics.mean_squared_error(): Measures the mean squared error for
regression models.
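A hand-checkable toy example of these three metrics (the labels and values are made up):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)               # 4 of 5 correct -> 0.8
cm = confusion_matrix(y_true, y_pred)              # rows: actual class, columns: predicted class
mse = mean_squared_error([2.0, 4.0], [1.0, 5.0])   # ((1)^2 + (1)^2) / 2 -> 1.0
print(acc, mse)
print(cm)
```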

Common Use Cases of Scikit-Learn


Scikit-Learn is widely used for various machine learning applications, including:

 Classification – Identifying categories or labels for given data (e.g., spam detection,
handwriting recognition).
 Regression – Predicting continuous values (e.g., house price prediction, stock market
trends).
 Clustering – Grouping similar data points together (e.g., customer segmentation,
anomaly detection).
 Dimensionality Reduction – Reducing the number of input variables in data (e.g.,
Principal Component Analysis).
 Model Selection and Evaluation – Finding the best-performing machine learning
model using cross-validation.

Example: Performing Regression with Scikit-Learn


Let’s implement a simple linear regression model using the California Housing dataset (the original Boston Housing dataset has been removed from scikit-learn).

Step 1: Import Required Libraries


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
Step 2: Load the Dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

Step 3: Split the Data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

Step 4: Train the Model


regressor = LinearRegression()
regressor.fit(X_train, y_train)

Step 5: Make Predictions and Evaluate


from sklearn.metrics import mean_squared_error

y_pred = regressor.predict(X_test)
print("Model Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

DATA PREPROCESSING FOR BIG DATA ML PIPELINES


Data preprocessing is the first step in any data analysis or machine learning pipeline.
It involves cleaning, transforming and organizing raw data into a structured format to ensure
accuracy, consistency and readiness for modelling. This step improves data quality and
directly impacts the performance of analytical or predictive models.

Step-by-Step Implementation
Let's implement various preprocessing steps.

Step 1: Import Libraries and Load Dataset


We prepare the environment with libraries like pandas, numpy, scikit-learn,
matplotlib and seaborn for data manipulation, numerical operations, visualization and scaling,
then load the dataset for preprocessing.

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('[Link]')  # dataset URL not preserved in the source
df.head()

Step 2: Inspect Data Structure and Check Missing Values


We understand dataset size, data types and identify any incomplete (missing) data that
needs handling.

 df.info(): Prints a concise summary, including the count of non-null entries and the data type of
each column.
 df.isnull().sum(): Returns the number of missing values per column.

df.info()
print(df.isnull().sum())

Step 3: Statistical Summary and Visualizing Outliers


Get numeric summaries like mean, median, min/max and detect unusual points
(outliers). Outliers can skew models if not handled.

 df.describe(): Computes count, mean, standard deviation, min/max and quartiles for
numerical columns.
 Boxplots: Visualize spread and detect outliers using matplotlib’s boxplot().

df.describe()

fig, axs = plt.subplots(len(df.columns), 1, figsize=(7, 18), dpi=95)
for i, col in enumerate(df.columns):
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
plt.tight_layout()
plt.show()

Step 4: Remove Outliers Using the Interquartile Range (IQR) Method


Remove extreme values beyond a reasonable range to improve model robustness.

 IQR = Q3 (75th percentile) – Q1 (25th percentile).
 Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are outliers.
 Calculate lower and upper bounds for each column separately.
 Filter data points to keep only those within bounds.

q1, q3 = np.percentile(df['Insulin'], [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
clean_df = df[(df['Insulin'] >= lower) & (df['Insulin'] <= upper)]
Step 5: Correlation Analysis
Understand relationships between features and the target variable (Outcome).
Correlation helps gauge feature importance.

 df.corr(): Computes pairwise correlation coefficients between columns.
 A heatmap via seaborn visualizes the correlation matrix clearly.
 Sorting correlations with corr['Outcome'].sort_values() highlights the features most
correlated with the target.

corr = clean_df.corr()
plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

print(corr['Outcome'].sort_values(ascending=False))

Step 6: Visualize Target Variable Distribution


Check if target classes (Diabetes vs Not Diabetes) are balanced, affecting model
training and evaluation.

 plt.pie(): Pie chart to display the proportion of each class in the target variable 'Outcome'.

plt.pie(clean_df['Outcome'].value_counts(), labels=[
    'Diabetes', 'Not Diabetes'], autopct='%.f%%',
    shadow=True)
plt.title('Outcome Proportionality')
plt.show()

Step 7: Separate Features and Target Variable


Prepare independent variables (features) and dependent variable (target) separately for
modeling.

 clean_df.drop(columns=[...]): Drops the target column from the features.
 Direct column selection clean_df['Outcome'] selects the target column.

X = clean_df.drop(columns=['Outcome'])
y = clean_df['Outcome']

Step 8: Feature Scaling: Normalization and Standardization


Scale features to a common range or distribution, important for many ML algorithms
sensitive to feature magnitudes.

1. Normalization (Min-Max Scaling): Rescales features between 0 and 1. Good for


algorithms like k-NN and neural networks.

 Class: MinMaxScaler from sklearn.
 .fit_transform(): Learns the min/max from the data and applies scaling.

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
print(X_normalized[:5])

2. Standardization: Transforms features to have mean = 0 and standard deviation = 1, useful


for normally distributed features.

 Class: StandardScaler from sklearn.

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print(X_standardized[:5])

SUPERVISED LEARNING
Supervised learning is a type of machine learning where a model learns from labelled
data, meaning every input has a corresponding correct output. The model makes predictions
and compares them with the true outputs, adjusting itself to reduce errors and improve
accuracy over time. The goal is to make accurate predictions on new, unseen data. For
example, a model trained on images of handwritten digits can recognise new digits it has
never seen before.

While training the model, the data is usually split in an 80:20 ratio, i.e. 80% as training data
and the rest as testing data. The training data contains both the inputs and their corresponding
outputs, and the model learns from the training data only.
Working of Supervised Machine Learning
The working of supervised machine learning follows these key steps:

1. Collect Labeled Data


 Gather a dataset where each input has a known correct output (label).
 Example: Images of handwritten digits with their actual numbers as labels.
2. Split the Dataset
 Divide the data into training data (about 80%) and testing data (about 20%).
 The model will learn from the training data and be evaluated on the testing data.
3. Train the Model
 Feed the training data (inputs and their labels) to a suitable supervised learning
algorithm (like Decision Trees, SVM or Linear Regression).
 The model tries to find patterns that map inputs to correct outputs.
4. Validate and Test the Model
 Evaluate the model using testing data it has never seen before.
 The model predicts outputs and these predictions are compared with the actual labels
to calculate accuracy or error.
5. Deploy and Predict on New Data
 Once the model performs well, it can be used to predict outputs for completely new,
unseen data.

CLASSIFICATION AND REGRESSION ON LARGE DATASETS


Supervised learning can be applied to two main types of problems:

 Classification: Where the output is a categorical variable (e.g., spam vs. non-spam
emails, yes vs. no).
 Regression: Where the output is a continuous variable (e.g., predicting house prices,
stock prices).

Classification on Large Datasets – Logistic Regression:


Logistic Regression is a supervised machine learning algorithm used for classification
problems. Unlike linear regression, which predicts continuous values, it predicts the
probability that an input belongs to a specific class. It is used for binary classification, where
the output can be one of two possible categories such as Yes/No, True/False or 0/1. It uses
the sigmoid function to convert inputs into a probability value between 0 and 1.

Types of Logistic Regression


Logistic regression can be classified into three main types based on the nature of the
dependent variable:

1. Binomial Logistic Regression: This type is used when the dependent variable has only
two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most
common form of logistic regression and is used for binary classification problems.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.20, random_state=23)

clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")

2. Multinomial Logistic Regression: This is used when the dependent variable has three or
more possible categories that are not ordered. For example, classifying animals into
categories like "cat," "dog" or "sheep." It extends the binary logistic regression to handle
multiple classes.

from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

digits = datasets.load_digits()

X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.4, random_state=1)

reg = linear_model.LogisticRegression(max_iter=10000,
    random_state=0)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

print(f"Logistic Regression model accuracy: "
      f"{metrics.accuracy_score(y_test, y_pred) * 100:.2f}%")

3. Ordinal Logistic Regression: This type applies when the dependent variable has three or
more categories with a natural order or ranking. Examples include ratings like "low,"
"medium" and "high." It takes the order of the categories into account when modeling.

How to Evaluate Logistic Regression Model?


Evaluating the logistic regression model helps assess its performance and ensure it
generalizes well to new, unseen data. The following metrics are commonly used:

1. Accuracy: Accuracy provides the proportion of correctly classified instances.

2. Precision: Precision focuses on the accuracy of positive predictions.

3. Recall (Sensitivity or True Positive Rate): Recall measures the proportion of correctly
predicted positive instances among all actual positive instances.

4. F1 Score: F1 score is the harmonic mean of precision and recall.

5. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The ROC
curve plots the true positive rate against the false positive rate at various thresholds. AUC-
ROC measures the area under this curve which provides an aggregate measure of a model's
performance across different classification thresholds.

6. Area Under the Precision-Recall Curve (AUC-PR): Similar to AUC-ROC, AUC-PR
measures the area under the precision-recall curve, providing a summary of a model's
performance across different precision-recall trade-offs.
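These metrics can all be computed with scikit-learn. The snippet below is a sketch on the breast-cancer dataset; the split and seed are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))   # needs probabilities, not hard labels
```

Note that AUC-ROC is computed from predicted probabilities (predict_proba), while the other four use hard class predictions.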

Regression on Large Datasets – Linear Regression:


Linear regression is a type of supervised machine-learning algorithm that learns from
labelled datasets and maps the data points to an optimized linear function, which can
be used for prediction on new datasets. It assumes that there is a linear relationship between
the input and output, meaning the output changes at a constant rate as the input changes. This
relationship is represented by a straight line.

For example, suppose we want to predict a student's exam score based on how many hours
they studied, and we observe that as students study more hours, their scores go up. In this
example:

Independent variable (input): Hours studied, because it's the factor we control or observe.
Dependent variable (output): Exam score, because it depends on how many hours were
studied.
We use the independent variable to predict the dependent variable.

Why Linear Regression is Important?

 Simplicity and Interpretability: It’s easy to understand and interpret, making it a


starting point for learning about machine learning.
 Predictive Ability: Helps predict future outcomes based on past data, making it
useful in various fields like finance, healthcare and marketing.
 Basis for Other Models: Many advanced algorithms, like logistic regression or
neural networks, build on the concepts of linear regression.
 Efficiency: It’s computationally efficient and works well for problems with a linear
relationship.
 Widely Used: It’s one of the most widely used techniques in both statistics and
machine learning for regression tasks.
 Analysis: It provides insights into relationships between variables (e.g., how much
one variable influences another).

Best Fit Line in Linear Regression


In linear regression, the best-fit line is the straight line that most accurately represents
the relationship between the independent variable (input) and the dependent variable (output).
It is the line that minimizes the difference between the actual data points and the predicted
values from the model.

1. Goal of the Best-Fit Line


The goal of linear regression is to find a straight line that minimizes the error (the
difference) between the observed data points and the predicted values. This line helps us
predict the dependent variable for new, unseen data.

Here Y is called the dependent or target variable and X is called the independent variable,
also known as the predictor of Y. Many types of functions can be used for regression; a
linear function is the simplest. Here, X may be a single feature or multiple features
representing the problem.

2. Equation of the Best-Fit Line


For simple linear regression (with one independent variable), the best-fit line is represented
by the equation.
y=mx+b

Where:
y is the predicted value (dependent variable)
x is the input (independent variable)
m is the slope of the line (how much y changes when x changes)
b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and b (intercept) so that
the predicted y values are as close as possible to the actual data points.

3. Minimizing the Error: The Least Squares Method


To find the best-fit line, we use a method called Least Squares. The idea behind this
method is to minimize the sum of squared differences between the actual values (data points)
and the predicted values from the line. These differences are called residuals.

The residual for each data point is:

    eᵢ = yᵢ − ŷᵢ

Where:
yᵢ is the actual observed value
ŷᵢ is the predicted value from the line for that xᵢ

The least squares method minimizes the sum of the squared residuals:

    minimize Σᵢ (yᵢ − ŷᵢ)²
4. Interpretation of the Best-Fit Line


 Slope (m): The slope of the best-fit line indicates how much the dependent variable
(y) changes with each unit change in the independent variable (x). For example if the
slope is 5, it means that for every 1-unit increase in x, the value of y increases by 5
units.
 Intercept (b): The intercept represents the predicted value of y when x = 0. It’s the
point where the line crosses the y-axis.
In linear regression, some assumptions are made to ensure the reliability of the model's results.
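For simple linear regression, the least-squares slope and intercept have a closed-form solution: m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b = ȳ − m·x̄. The sketch below applies it to made-up hours-studied data that lies exactly on the line y = 5x + 40:

```python
import numpy as np

# Toy data: hours studied (x) vs exam score (y), constructed on y = 5x + 40
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([45.0, 50.0, 55.0, 60.0, 65.0])

# Closed-form least squares for y = m*x + b
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(m, b)  # slope 5.0, intercept 40.0
```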

Python Implementation of Linear Regression


1. Import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

2. Load the dataset and separate input and target variables:

url = '[Link]'  # URL of data_for_lr.csv, not preserved in the source
data = pd.read_csv(url)

data = data.dropna()

train_input = np.array(data.x[0:500]).reshape(500, 1)
train_output = np.array(data.y[0:500]).reshape(500, 1)

test_input = np.array(data.x[500:700]).reshape(199, 1)
test_output = np.array(data.y[500:700]).reshape(199, 1)

3. Build the Linear Regression model and plot the regression line:

class LinearRegression:
    def __init__(self):
        self.parameters = {}

    def forward_propagation(self, train_input):
        m = self.parameters['m']
        c = self.parameters['c']
        predictions = np.multiply(m, train_input) + c
        return predictions

    def cost_function(self, predictions, train_output):
        cost = np.mean((train_output - predictions) ** 2)
        return cost

    def backward_propagation(self, train_input, train_output,
                             predictions):
        derivatives = {}
        df = (predictions - train_output)
        dm = 2 * np.mean(np.multiply(train_input, df))
        dc = 2 * np.mean(df)
        derivatives['dm'] = dm
        derivatives['dc'] = dc
        return derivatives

    def update_parameters(self, derivatives, learning_rate):
        self.parameters['m'] = self.parameters['m'] - \
            learning_rate * derivatives['dm']
        self.parameters['c'] = self.parameters['c'] - \
            learning_rate * derivatives['dc']

    def train(self, train_input, train_output, learning_rate,
              iters):
        self.parameters['m'] = np.random.uniform(0, 1) * -1
        self.parameters['c'] = np.random.uniform(0, 1) * -1

        self.loss = []

        fig, ax = plt.subplots()
        x_vals = np.linspace(min(train_input), max(train_input),
                             100)
        line, = ax.plot(x_vals, self.parameters['m'] * x_vals +
                        self.parameters['c'], color='red', label='Regression Line')
        ax.scatter(train_input, train_output, marker='o',
                   color='green', label='Training Data')

        ax.set_ylim(0, max(train_output) + 1)

        def update(frame):
            predictions = self.forward_propagation(train_input)
            cost = self.cost_function(predictions, train_output)
            derivatives = self.backward_propagation(train_input,
                                                    train_output, predictions)
            self.update_parameters(derivatives, learning_rate)
            line.set_ydata(self.parameters['m'] * x_vals +
                           self.parameters['c'])
            self.loss.append(cost)
            print("Iteration = {}, Loss = {}".format(frame + 1,
                                                     cost))
            return line,

        ani = FuncAnimation(fig, update, frames=iters,
                            interval=200, blit=True)
        ani.save('linear_regression_A.gif', writer='ffmpeg')

        plt.xlabel('Input')
        plt.ylabel('Output')
        plt.title('Linear Regression')
        plt.legend()
        plt.show()

        return self.parameters, self.loss

4. Train the model and make the final prediction:

linear_reg = LinearRegression()
parameters, loss = linear_reg.train(train_input, train_output,
                                    0.0001, 20)

UNSUPERVISED LEARNING
Unsupervised learning is a type of machine learning that analyzes and models data without
labelled responses or predefined categories. Unlike supervised learning, where the algorithm
learns from input-output pairs, unsupervised learning algorithms work solely with input data
and aim to discover hidden patterns, structures or relationships within the dataset
independently, without any human intervention or prior knowledge of the data's meaning.

Working of Unsupervised Learning


The working of unsupervised machine learning can be explained in these steps:

1. Collect Unlabeled Data


Gather a dataset without predefined labels or categories.
Example: Images of various animals without any tags.
2. Select an Algorithm
Choose a suitable unsupervised algorithm, such as clustering (e.g., K-Means),
association rule learning (e.g., Apriori) or dimensionality reduction (e.g., PCA), based on the goal.

3. Train the Model on Raw Data


Feed the entire unlabeled dataset to the algorithm.
The algorithm looks for similarities, relationships or hidden structures within the data.
4. Group or Transform Data
The algorithm organizes data into groups (clusters), rules or lower-dimensional forms
without human input.
Example: It may group similar animals together or extract key patterns from large datasets.
5. Interpret and Use Results
Analyze the discovered groups, rules or features to gain insights or use them for
further tasks like visualization, anomaly detection or as input for other models.

Unsupervised Learning Algorithms

1. Clustering Algorithms
2. Association Rule Learning
3. Dimensionality Reduction

CLUSTERING AND DIMENSIONALITY REDUCTION


Clustering Algorithms:
Clustering is an unsupervised machine learning technique that groups unlabeled data
into clusters based on similarity. Its goal is to discover patterns or relationships within the
data without any prior knowledge of categories or labels.

 Groups data points that share similar features or characteristics.


 Helps find natural groupings in raw, unclassified data.
 Commonly used for customer segmentation, anomaly detection and data organization.
 Works purely from the input data without any output labels.
 Enables understanding of data structure for further analysis or decision-making.

Some common clustering algorithms:

 K-means Clustering: Groups data into K clusters based on how close the points are
to each other.
 Hierarchical Clustering: Creates clusters by building a tree step-by-step, either
merging or splitting groups.
 Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats
scattered points as noise.
 Mean-Shift Clustering: Discovers clusters by moving points toward the most
crowded areas.
 Spectral Clustering: Groups data by analyzing connections between points using
graphs.
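As a sketch of the practical difference between these methods, DBSCAN below labels far-away scattered points as noise (label -1), something K-Means cannot do. The dataset and the eps/min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# two dense blobs plus two far-away scattered points
dense = np.vstack([rng.normal(0, 0.3, (40, 2)),
                   rng.normal(4, 0.3, (40, 2))])
outliers = np.array([[10.0, 10.0], [-8.0, 6.0]])
points = np.vstack([dense, outliers])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)
print("Clusters found:", len(set(labels) - {-1}))   # the dense areas
print("Noise points:", int(np.sum(labels == -1)))   # the scattered points
```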

Dimensionality Reduction:
Dimensionality reduction is the process of decreasing the number of features or
variables in a dataset while retaining as much of the original information as possible. This
technique helps simplify complex data, making it easier to analyze and visualize. It also
improves the efficiency and performance of machine learning algorithms by reducing noise
and computational cost.

 It reduces the dataset’s feature space from many dimensions to fewer, more
meaningful ones.
 Helps focus on the most important traits or patterns in the data.
 Commonly used to improve model speed and reduce overfitting.

Here are some popular Dimensionality Reduction algorithms:

 Principal Component Analysis (PCA): Reduces dimensions by transforming data
into uncorrelated principal components.
 Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class
separability for classification tasks.
 Non-negative Matrix Factorization (NMF): Breaks data into non-negative parts to
simplify representation.
 Locally Linear Embedding (LLE): Reduces dimensions while preserving the
relationships between nearby points.
 Isomap: Captures global data structure by preserving distances along a manifold.
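A short PCA sketch on the Iris data shows the idea: four correlated measurements are reduced to two principal components while most of the variance is kept:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples, 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)         # (150, 4)
print("Reduced shape:", X_reduced.shape)  # (150, 2)
# fraction of the original variance kept by the two components
print("Variance retained:", pca.explained_variance_ratio_.sum())
```

The reduced data can then be plotted in 2-D or fed to a downstream model.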

MODEL EVALUATION AND VALIDATION TECHNIQUES


Model evaluation and validation are crucial stages in the machine learning workflow
to ensure a model performs reliably and accurately on unseen data. These techniques prevent
a model from memorizing training data (overfitting) and provide an honest estimate of its
real-world performance.

1. Cross-Validation
Cross-validation ensures that the model is tested on multiple subsets of data, making it
less likely to overfit and improving its generalization ability.

(a) Holdout Method


In the Holdout method, the dataset is split into train and test sets (commonly 7:3 or
8:2). Let's implement it, where:

 load_iris() loads the Iris dataset (flower measurements with 3 species).


 train_test_split() divides data into training and testing sets.
 test_size=0.20: 20% for testing, 80% for training.
 random_state=42 makes results reproducible

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))

(b) K-Fold Cross-Validation


In K-Fold Cross-Validation, the dataset is divided into k folds. Each fold is used once
as a test set, and the model is trained on the remaining k-1 folds. Let's implement it, where:

 DecisionTreeClassifier(): A decision tree model is created.


 KFold(n_splits=5): Data is divided into 5 folds.
 cross_val_score(): Runs training/testing across folds.
 scores: Accuracy for each fold.
 scores.mean(): Average accuracy across all folds.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=kfold)

print("Cross-validation scores:", scores)
print("Average CV Score:", scores.mean())

2. Evaluation Metrics for Classification Tasks


Classification models assign inputs to predefined labels. Their performance can be
measured using accuracy, precision, recall, F1 score, confusion matrix and AUC-ROC.

We’ll demonstrate these metrics using a Decision Tree Classifier on the Iris dataset.

Step 1: Importing Libraries, Loading Dataset, Splitting Dataset


 Importing libraries like pandas, numpy, matplotlib and scikit-learn.
 Loading Iris dataset with flower measurements.
 Splitting into 80% training, 20% testing.

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_score, recall_score, f1_score, accuracy_score,
    confusion_matrix, ConfusionMatrixDisplay, roc_auc_score
)
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=20
)

Step 2: Training Model


 DecisionTreeClassifier(): Creates a decision tree model.
 .fit(X_train, y_train): Trains the model on training data.
 .predict(X_test): Generates predictions on test data.

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

Step 3: Accuracy
We will calculate the accuracy:

accuracy_score() computes the proportion of correct predictions.

print("Accuracy:", accuracy_score(y_test, y_pred))

Step 4: Precision and Recall


Precision measures how many predicted positives are actually positive.
 Focuses on the correctness of positive predictions.
 High precision: few false positives.

Recall measures how many actual positives are correctly predicted.

 Focuses on capturing all positives.


 High recall: few false negatives.

print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall:", recall_score(y_test, y_pred, average="weighted"))

Step 5: F1 Score
We will calculate the F1 score, which is the harmonic mean of precision and recall and
balances both metrics.

 Combines precision and recall into one metric.


 Useful when we need a balance between false positives and false negatives.

print("F1 score:", f1_score(y_test, y_pred, average="weighted"))

Step 6: Confusion Matrix


We will create a confusion matrix:
 confusion_matrix(): Creates matrix of actual vs predicted values.
 Each cell shows correct/misclassified predictions.

cm = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(
confusion_matrix=cm, display_labels=[0, 1, 2])
cm_display.plot()
plt.show()

Step 7: AUC-ROC Curve

 TPR: True Positive Rate


 FPR: False Positive Rate
 AUC: Area under curve; higher is better (0.5 random, 1 perfect).

y_true = [1, 0, 0, 1]
y_scores = [1, 0, 0.9, 0.2]   # predicted scores/probabilities
auc = round(roc_auc_score(y_true, y_scores), 3)
print("AUC:", auc)

3. Evaluation Metrics for Regression Tasks


Regression predicts continuous values (e.g., temperature). We use error-based metrics
to measure accuracy.

Step 1: Importing Data and Training Model


 Dataset contains Temperature (independent variable) and Relative Humidity
(dependent variable).
 Data split into training (80%) and testing (20%).
 LinearRegression().fit() trains the regression model.
 Predictions are stored in Y_pred.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
)

# illustrative filename; the original source linked an external CSV
df = pd.read_csv('data.csv')  # columns include Temperature and Relative Humidity

X, Y = df.iloc[:, 2].values.reshape(-1, 1), df.iloc[:, 3].values

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0
)

regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)

Step 2: Mean Absolute Error (MAE)


Mean absolute error is the average absolute difference between actual and predicted values.

mae = mean_absolute_error(Y_test, Y_pred)


print("MAE:", mae)

Step 3: Mean Squared Error (MSE)


We will calculate the mean squared error, which is the average squared difference
between predicted and actual values.

 Penalizes larger errors more than MAE.


 Commonly used for regression model loss functions.
mse = mean_squared_error(Y_test, Y_pred)
print("MSE:", mse)

Step 4: Root Mean Squared Error (RMSE)


We will calculate RMSE, which is the square root of MSE and converts the error back to
the original units.

 Brings error back to the same units as the target variable.


 Easier to interpret than MSE.
 Still penalizes larger errors more than MAE.
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))  # squared=False was removed in newer scikit-learn
print("RMSE:", rmse)

DISTRIBUTED TRAINING
Distributed training is the practice of training machine learning models on large
datasets by distributing the computational workload across multiple machines or devices.
This approach is essential in big data analytics to overcome the limitations of a single
machine, such as insufficient memory and excessively long training times. It enables faster
experimentation and the development of larger, more complex models.

Core parallelization strategies

1. Data parallelism
This is the most common and straightforward method, best suited for models that fit
within a single machine's memory but require vast datasets for training.
How it works: The training dataset is divided into smaller chunks, and each computing node
(or worker) receives a copy of the full model. Each worker trains its model copy on its unique
subset of data.
Synchronization: After each training iteration, the workers' model updates (gradients) are
aggregated, typically by averaging. The aggregated parameters are then used to update each
worker's model, ensuring consistency.
Use cases: Training large computer vision models like convolutional neural networks
(CNNs) on datasets like ImageNet, where the model size is manageable but the data volume
is enormous.
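The mechanics can be imitated in plain NumPy (a toy single-process simulation, not a real multi-machine setup): four "workers" each compute a gradient on their own data shard for the same linear model, and the averaged gradient updates the shared parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

n_workers = 4
# split the dataset into shards, one per worker
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

w = np.zeros(3)                  # one model, replicated on every worker
for step in range(200):
    grads = []
    for Xs, ys in shards:        # each worker: gradient on its own shard
        err = Xs @ w - ys
        grads.append(2 * Xs.T @ err / len(ys))
    # synchronization: aggregate (average) gradients, update the shared model
    w -= 0.1 * np.mean(grads, axis=0)

print("Recovered weights:", np.round(w, 3))
```

In a real system the inner loop runs in parallel on separate machines and only the gradient averaging requires communication.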

2. Model parallelism
This approach is used when a model is too large to fit into the memory of a single
device.
How it works: The model itself is partitioned across different computing nodes, with each
node holding and training different layers or sections of the model. Data flows sequentially
through the network of machines during the forward and backward passes.
Challenges: The implementation is more complex due to the need for careful partitioning to
balance the computational load and manage communication between nodes.
Use cases: Training massive models like large language models (LLMs), which can have
hundreds of billions of parameters that cannot fit on a single GPU.

3. Hybrid parallelism
A combination of data and model parallelism, this strategy is used for training
exceptionally large and complex models on vast datasets. It applies data parallelism to some
layers and model parallelism to others to achieve the best performance.

Communication architectures
To coordinate the updates between workers in a distributed setting, two main
architectures are used:

1. Parameter server (Centralized)


How it works: A set of nodes, called parameter servers, stores and manages the model's
parameters. Worker nodes pull the latest parameters from the servers, compute gradient
updates on their local data, and push the new gradients back to the servers. The servers then
aggregate the gradients and update the central parameters.
Pros: Conceptually simple to implement.
Cons: The central server can become a bottleneck, especially with a large number of
workers, as it must handle all the communication traffic.

2. All-Reduce (Decentralized)
How it works: Each worker holds a full replica of the model and communicates directly with
the other workers to collectively compute the average of the gradients. This is often
implemented in a ring-based topology to optimize communication.
Pros: Avoids the bottleneck of a central server, leading to better scalability for systems with
high-bandwidth networks.
Cons: The communication overhead increases with the number of workers.
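A toy illustration of the two patterns in plain Python (no real networking): both end with every worker holding the same averaged gradient, but the parameter-server version routes all traffic through one node, while the ring version only passes a running sum between neighbours:

```python
import numpy as np

# Each worker has already computed a gradient on its own data shard.
worker_grads = [np.array([1.0, 2.0]),
                np.array([3.0, 4.0]),
                np.array([5.0, 6.0])]

# Parameter server (centralized): workers push gradients to one node,
# which aggregates them; every worker then pulls the averaged result.
server_grad = np.mean(worker_grads, axis=0)
pulled = [server_grad.copy() for _ in worker_grads]

# All-Reduce (decentralized): a running sum travels around a ring of
# workers; after a full pass, the average is shared back the same way.
ring_sum = np.zeros_like(worker_grads[0])
for g in worker_grads:              # each hop adds the local gradient
    ring_sum = ring_sum + g
allreduced = [ring_sum / len(worker_grads) for _ in worker_grads]

print("Averaged gradient on every worker:", allreduced[0])
```

Production implementations (e.g., ring all-reduce) additionally split each gradient into chunks so that bandwidth use is balanced across links.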

Communication and synchronization strategies


When coordinating model updates, algorithms can follow a synchronous or
asynchronous approach.

1. Synchronous communication
How it works: All workers operate in lockstep. In each iteration, workers train on their data
subset and send their updates. A "synchronization barrier" prevents any worker from starting
the next iteration until all workers have completed the current one.
Pros: Guarantees strong model consistency and convergence.
Cons: The entire process is slowed down by the slowest worker, also known as a "straggler".

2. Asynchronous communication
How it works: Workers operate independently. A worker can send and receive model
updates without waiting for all other workers to finish their current iteration.
Pros: Avoids the problem of stragglers and makes more efficient use of computational
resources.
Cons: Can suffer from "stale gradients," where a fast worker uses outdated parameter updates
from a slow worker, potentially degrading convergence.

Tools and frameworks


Many modern frameworks are designed to manage the complexities of distributed training:
 TensorFlow: Provides the tf.distribute API for distributing training
across different devices and machines, including MultiWorkerMirroredStrategy
for synchronous data parallelism.
 PyTorch: Offers DistributedDataParallel (DDP) for efficient data-parallel
training and uses the torch.distributed package to handle communication and
synchronization.
 Apache Spark: Used for large-scale data processing, with libraries like BigDL that
enable distributed deep learning on Spark clusters.
 Cloud Services: Platforms like Google's Vertex AI offer managed distributed training
services that simplify infrastructure setup, scaling, and resource management.
OPTIMIZATION TECHNIQUES

Optimization is the process of adjusting a machine learning model's parameters to
minimize the error or loss function. The goal is to find the set of parameters that allows the
model to produce the most accurate results.
Optimization techniques fall into several categories, including first-order optimizers,
advanced methods for deep learning, and techniques for hyperparameter tuning.

First-order optimization algorithms

These are iterative algorithms that rely on the first derivative (gradient) of the loss
function to navigate toward the minimum.

Gradient Descent (GD): A foundational algorithm that computes the gradient of the entire
training dataset to find the direction of steepest descent. It then takes a step in the opposite
direction of the gradient, scaled by a learning rate, to find the minimum.

Stochastic Gradient Descent (SGD): An extension of GD that uses a single, randomly
selected training example to compute the gradient and update the parameters in each iteration.
This makes it much faster and more memory-efficient for large datasets, though the
updates can be noisy.

Mini-Batch Gradient Descent: A compromise between GD and SGD. It computes the
gradient on a small, random subset (a "mini-batch") of the training data. This offers a balance
between the computational efficiency of SGD and the stability of GD.
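A minimal sketch of mini-batch gradient descent on a synthetic least-squares problem (the data, batch size and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
true_w = np.array([1.5, -0.7])
y = X @ true_w

w = np.zeros(2)
lr, batch_size = 0.1, 32
for step in range(500):
    # each step uses a random mini-batch, not the full dataset (GD)
    # and not a single example (SGD)
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size   # gradient on the batch
    w -= lr * grad

print("Estimated weights:", np.round(w, 2))
```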

Advanced and adaptive optimizers


Built on the principles of gradient descent, these optimizers introduce more
sophisticated mechanisms to accelerate convergence and handle complex loss landscapes
common in deep learning.

Gradient Descent with Momentum: Accelerates SGD in the right direction and dampens
oscillations. It does this by adding a fraction of the previous update vector to the current
update vector, building up momentum in consistent directions.

RMSProp (Root Mean Square Propagation): An adaptive learning rate method that
addresses the issue of rapidly diminishing learning rates in earlier algorithms like AdaGrad. It
uses a moving average of squared gradients to normalize the learning rate for each parameter,
stabilizing the training process.

Adam (Adaptive Moment Estimation): One of the most popular optimizers today, Adam
combines the best features of Momentum and RMSProp. It uses moving averages of both the
gradients and the squared gradients to compute an adaptive learning rate for each parameter,
offering fast and stable convergence.

AdamW: A variant of Adam that separates weight decay regularization from the
optimization step. This often leads to better generalization performance, particularly for
transformer models and other advanced deep learning architectures.
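The Adam update can be written down directly from its definition: moving averages of the gradient and the squared gradient, bias-corrected, then a per-parameter scaled step. The quadratic objective and hyperparameter values below are illustrative:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.02, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameter w given the current gradient."""
    m = b1 * m + (1 - b1) * grad          # momentum-like first moment
    v = b2 * v + (1 - b2) * grad ** 2     # RMSProp-like second moment
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)
print("Minimum found near w =", round(w, 2))
```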
Hyperparameter optimization
These techniques are used to find the best configuration of a model's hyperparameters,
which are the settings that are not learned from data.

Grid Search: Systematically evaluates a model's performance for all possible combinations
of a predefined set of hyperparameter values. It is thorough but can be computationally very
expensive.

Random Search: A more efficient alternative to grid search that randomly samples
hyperparameter values from a specified distribution. It is often more effective, especially
when only a few hyperparameters are critical to performance.

Bayesian Optimization: A probabilistic model-based approach that is more sample-efficient
than grid or random search. It uses information from past evaluations to intelligently select
the next set of hyperparameters to test, making it ideal for expensive or complex objective
functions.
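A minimal grid-search sketch with scikit-learn's GridSearchCV (the parameter grid is illustrative); RandomizedSearchCV offers the same interface for random search:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# every combination of the listed values is scored with 5-fold CV
param_grid = {"max_depth": [2, 3, 4, None],
              "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

Here 8 combinations are tried; with more hyperparameters the grid grows multiplicatively, which is exactly why random or Bayesian search is often preferred.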

Regularization techniques

Regularization methods are optimization techniques that are used to prevent
overfitting, which is when a model performs well on training data but poorly on new, unseen
data.

L1 Regularization (Lasso): Adds a penalty equal to the absolute value of the magnitude of
the coefficients to the loss function. It can force some coefficients to become zero, effectively
removing certain features from the model.

L2 Regularization (Ridge): Adds a penalty equal to the squared magnitude of the
coefficients to the loss function. This encourages smaller, smoother coefficients, which helps
to prevent overfitting.
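The difference shows up directly in the fitted coefficients. In this illustrative example only the first two of five features matter; Lasso (L1) zeroes out the irrelevant ones, while Ridge (L2) only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# only the first two features actually influence the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: can force coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients smoothly

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```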

Dropout: A technique used in neural networks where a random subset of neurons is
temporarily "dropped out" during each training iteration. This prevents the network from
relying too heavily on any single neuron and encourages a more robust, generalized model.

Other optimization approaches


Evolutionary Algorithms (e.g., Genetic Algorithms): Inspired by natural selection, these
algorithms evolve a population of candidate solutions over generations using genetic
operators like mutation and crossover. They are useful for complex optimization problems
that are difficult to solve with traditional methods.

Simulated Annealing: A stochastic optimization technique that draws an analogy to the
process of annealing in metallurgy. It explores the search space with a high degree of
randomness initially (high temperature), which is gradually reduced (low temperature) to
converge toward a better solution while avoiding local minima.
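The idea can be sketched on a one-dimensional function with several local minima; the cooling schedule, step size and restart count below are illustrative choices, not tuned values:

```python
import math
import random

def f(x):
    """Objective with several local minima; global minimum near x = -0.5."""
    return x * x + 10 * math.sin(3 * x)

random.seed(0)
best_x, best_f = 0.0, float("inf")
for restart in range(10):                 # a few random restarts
    x = random.uniform(-6, 6)
    T = 5.0                               # initial "temperature"
    while T > 1e-3:
        candidate = x + random.uniform(-0.5, 0.5)
        delta = f(candidate) - f(x)
        # always accept improvements; accept worse moves with prob exp(-delta/T)
        if delta < 0 or random.random() < math.exp(-delta / T):
            x = candidate
        T *= 0.995                        # gradual cooling
    if f(x) < best_f:
        best_x, best_f = x, f(x)

print("Best x:", round(best_x, 2), "f(x):", round(best_f, 2))
```

Early on, uphill moves are accepted often, letting the search jump out of poor local minima; as T drops, the process settles into a good basin.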

*****************
