Distinguishing Regression Techniques

This document covers various machine learning concepts, including instance-based and model-based learning, regression analysis, and decision tree learning. It explains algorithms like k-Nearest Neighbors, Weighted k-NN, and different types of regression methods, highlighting their advantages and disadvantages. Additionally, it discusses the structure and functioning of decision trees in predictive modeling.


MACHINE LEARNING

MODULE – 3
COURSE CODE BCS602
TOPICS

Similarity-based Learning:
• Nearest-Neighbor Learning
• Weighted K-Nearest-Neighbor Algorithm
• Nearest Centroid Classifier
• Locally Weighted Regression (LWR)

Regression Analysis:
• Introduction to Regression
• Introduction to Linear Regression Model
• Multiple Linear Regression
• Polynomial Regression
• Logistic Regression

Decision Tree Learning:
• Introduction to Decision Tree Learning
• Decision Tree Induction Algorithms
What is Instance-Based Learning?
• Machine learning systems categorized as instance-based learning are
systems that learn the training examples by heart and then generalize to
new instances based on some similarity measure.
• It is called instance-based because it builds the hypotheses
from the training instances.
• It is also known as memory-based learning or lazy-
learning (because they delay processing until a new instance
must be classified).
Advantage: Works well with incremental data (adding new data
over time).
Disadvantage: Requires a lot of memory and is slow at testing
time.
What is Model-Based Learning?
• It builds a general model using training data before making predictions.
• Uses algorithms like decision trees, neural networks, and SVMs.
• Also called eager learning because it processes training data beforehand.
Advantage: Faster predictions.
Disadvantage: Requires more time and computation upfront to build the
model.
Examples of Instance-Based Learning Algorithms
• k-Nearest Neighbors (k-NN) – Finds the closest neighbors to classify new
data.
• Locally Weighted Regression – Gives more importance to closer data
points.
• Learning Vector Quantization (LVQ) – Learns patterns from training data.
• Self-Organizing Map (SOM) – Groups data points using neural networks.
• Radial Basis Function (RBF) Networks – Uses neural networks for
classification.
Nearest-Neighbor Learning
k-Nearest Neighbors (k-NN) is a machine learning algorithm used for
classification and regression problems. It works on the principle that
similar things exist close to each other.
How Does k-NN Work?
• Store the Data: k-NN does not learn a model beforehand. Instead, it simply
stores all the training data.
• Find the k Nearest Neighbors: When a new data point (test instance) is
given, k-NN finds the k closest points from the stored data.
• Make a Prediction:
• If it’s a classification problem, it assigns the most common class
(majority voting).
• If it’s a regression problem, it takes the average of the nearest values.
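The three steps above can be sketched in a few lines of Python; the training points, class names, and k value below are illustrative:

```python
from collections import Counter
import math

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, test_point, k=3):
    """train: list of (features, label) pairs; returns majority label of k nearest."""
    # lazy learning: no model is built, we just sort the stored data by distance
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# illustrative data: two classes C1 and C2
train = [((1, 1), "C1"), ((1, 2), "C1"), ((5, 5), "C2"), ((6, 5), "C2"), ((5, 6), "C2")]
print(knn_classify(train, (5, 5.5), k=3))  # the 3 nearest points are all C2
```

For a regression problem, the last line of `knn_classify` would instead average the target values of the k neighbors.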
Understanding the Image
• The dots (circles) represent data
points.
• There are two different classes of
objects (C₁ and C₂).
• The new data point (T) is the one we
need to classify.
• A circle is drawn around it to find the k =
3 nearest neighbors.
• Since the majority of the 3 nearest
neighbors belong to class C₂, the new
instance T is classified as C₂.
Weighted K-Nearest-Neighbor Algorithm
The Weighted k-Nearest Neighbor (Weighted k-NN) is an improved version of
the k-NN (k-Nearest Neighbor) algorithm.
• Instead of treating all neighbors equally, Weighted k-NN gives more
importance (weight) to closer neighbors and less importance to farther
neighbors.
• The weight is inversely proportional to distance (closer points have higher
weight).
Example:
• Suppose you are asking for restaurant recommendations from 5 people.
• You trust the people living near the restaurant more than those who live far
away.
• So, their opinion (weight) matters more in your final decision.
How Are Weights Assigned?
There are two ways to assign weights to the k-nearest neighbors:
1️⃣ Uniform Weighting (Simple k-NN)
• Every neighbor has equal weight, no matter how far they are.
• Works well when distances are not very different.
2️⃣ Inverse Distance Weighting (Weighted k-NN)
• Closer neighbors get higher weights.
• Farther neighbors get lower weights.
• This makes the prediction more accurate.
Formula for Weight Calculation:

A common choice is to make each neighbor's weight the inverse of its
distance from the test instance:

w_i = 1 / d(x, x_i)

If a neighbor is very close, the weight is large. If it is far away, the weight is
small.
Why is Weighted k-NN Better than k-NN?
More accurate predictions by giving more importance to relevant (closer)
neighbors.
Reduces errors caused by distant, irrelevant points.
Works well for unevenly distributed data, where some points are
clustered close together.
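A hedged sketch of inverse-distance weighting (the points are illustrative; the small constant avoids division by zero when a neighbor coincides with the test point):

```python
from collections import defaultdict
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn(train, test_point, k=3):
    """Classify by summing inverse-distance weights per class instead of raw votes."""
    neighbors = sorted(train, key=lambda p: euclidean(p[0], test_point))[:k]
    scores = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, test_point)
        scores[label] += 1.0 / (d + 1e-9)  # closer neighbors contribute larger weights
    return max(scores, key=scores.get)

train = [((1, 1), "C1"), ((4, 4), "C2"), ((4.2, 4), "C2"), ((2, 2), "C1")]
print(weighted_knn(train, (1.5, 1.5), k=3))  # the two close C1 points outweigh C2
```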
Nearest Centroid Classifier
The Nearest Centroid Classifier is a simple way to classify data points based
on their average (centroid) position. Instead of looking at individual neighbors
like k-NN, it calculates the average position (centroid) of each class and
assigns a new test instance to the class with the closest centroid.

How Does It Work?


1️⃣ Find the centroid (mean) of each class
• The centroid is simply the average of all feature values for each class.
2️⃣ Compute the distance between the test instance and each class centroid.
• The Euclidean distance formula is used to measure closeness.
3️⃣ Predict the class
• Assign the test instance to the class with the smallest distance to its centroid.
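The three steps translate directly into Python; the classes and coordinates below are illustrative:

```python
import math

def centroid(points):
    # mean of each feature across all points of one class
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def nearest_centroid_classify(class_points, test_point):
    """class_points: {label: [feature tuples]}; returns the label of the closest centroid."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    centroids = {label: centroid(pts) for label, pts in class_points.items()}
    return min(centroids, key=lambda label: dist(centroids[label], test_point))

data = {"C1": [(1, 1), (2, 1), (1, 2)], "C2": [(6, 6), (7, 6), (6, 7)]}
print(nearest_centroid_classify(data, (2, 2)))  # closest to the C1 centroid
```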
Locally Weighted Regression (LWR)
Locally Weighted Regression (LWR) is a non-parametric machine learning algorithm
that combines linear regression with nearest neighbors. Unlike standard linear
regression, which fits a straight line to all the data, LWR creates small localized
models that fit different sections of the data, making the overall model more flexible
and non-linear.

How Does It Work?


1️⃣ Find the nearest neighbors
• Instead of using the entire dataset, LWR focuses on data points close to the test
instance (just like k-NN).
2️⃣ Assign weights to neighbors
• The closer a data point is to the test instance, the higher its weight.
• The weight is calculated using a Gaussian function, which gives more importance
to nearby points and less to far-away points.
3️⃣ Fit a regression model to the nearest neighbors
• Instead of fitting one straight line for the entire dataset, LWR fits small
linear models for local neighborhoods.
• These small models combine to form a curved prediction line, instead of a
single straight line.
4️⃣ Make predictions
• The algorithm predicts the target value based on the weighted local model
for the test instance.
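The four steps can be sketched for one input feature; the Gaussian bandwidth `tau` and the quadratic data below are illustrative choices, not part of the source:

```python
import math

def lwr_predict(xs, ys, x_query, tau=1.0):
    """Fit a weighted straight line around x_query and evaluate it there.
    Weights come from a Gaussian kernel, so nearby points count more."""
    w = [math.exp(-((x - x_query) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw   # weighted means
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    num = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    den = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    slope = num / den
    intercept = ybar - slope * xbar
    return intercept + slope * x_query            # the local line, evaluated locally

# illustrative non-linear data: y = x^2; LWR tracks the curve with local lines
xs = [0, 1, 2, 3, 4, 5]
ys = [x * x for x in xs]
print(lwr_predict(xs, ys, 2.5, tau=0.8))
```

Repeating the call across many query points traces out a curved prediction line, even though each individual model is linear.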
Introduction to Regression
Regression is a very important method in machine learning. It is a type of
supervised learning technique.
→ In supervised learning, we have a set of input data (called features) and
output data (called target or label).
→ The goal of regression is to find the relationship between input (independent
variable) and output (dependent variable).
Regression helps us to understand how one or more input values (x) affect the
output value (y).
The mathematical relation is written as:
y = f(x)
Where:
• x = independent variable (input)
• y = dependent variable (output)
The independent variable is also called:
→ Explanatory variable
→ Predictor variable
→ Input feature
The dependent variable is also called:
→ Target variable
→ Label
→ Response variable
Regression is mainly used for:
• Prediction
• Forecasting future results
• Finding the relationship between data
It helps us understand:
• How the input variables affect the output.
• Strength of the relationship (strong or weak).
• Whether the relationship is linear or non-linear.
• Importance of each input variable.
• Contribution of each variable in the result.
Applications of Regression Analysis:
• Sales of products or services
• Value of bonds in finance
• Insurance premium calculation
• Crop yield in agriculture
• Real estate price prediction
INTRODUCTION TO LINEARITY, CORRELATION & CAUSATION
Regression & Correlation
Types of Correlation:
1. Positive Correlation:
→ If one variable increases, the other also increases.
Example:
Study time ↑ → Marks ↑
2. Negative Correlation:
→ If one variable increases, the other decreases.
Example:
Increase in exercise time ↑ → Body weight ↓
3. No Correlation (Random):
→ No relationship between variables.
→ Data is scattered randomly.
Example:
Roll number ↑ → No effect on marks.
Regression & Causation
Causation means — One thing happens because of another thing.
If x causes y → it means x is the reason for y happening.
This is written as:
x → y (x implies y)

Difference between Correlation and Causation:


• Correlation means → Two things are related (but not necessary that one causes the
other).
• Causation means → One thing directly causes the other to happen.

Example 1:
There may be a correlation between a student's family background (economic
status) and marks scored.
→ But this does not mean that having a rich family directly causes higher marks.
(Marks depend on hard work, study, etc.)
Linearity & Non-Linearity Relationship
A Linear Relationship between variables means:
→ When one variable increases, the other also increases in a
constant manner.
Formula of Linear Relation:
y=ax+b
Where:
• a = slope (rate of change)
• b = intercept (starting point)
Graph:
• It looks like a straight line.
A Non-linear Relationship means:
→ The increase in variables is not constant.
→ The graph is not a straight line.
→ It may curve or change direction.
Examples of Non-linear Functions:
Common examples include the quadratic y = ax² + bx + c, the exponential
y = a·e^(bx), and the logarithmic y = a·log x.
Types of Regression Methods
Limitation of Regression Methods
1. Outliers
• Outliers are abnormal or unusual data points.
• They can distort the regression line because the model tries to fit them.
Problem:
→ Makes the prediction less accurate.
2. Number of Cases (Sample Size)
• There should be enough data points.
• Recommended Ratio:
→ 20:1 (20 samples for each independent variable)
→ Minimum 5 samples in extreme cases.
Problem:
→ Less data → Poor model performance.
3. Missing Data
• If there are missing values in the dataset, the model may not work properly.
Problem:
→ Model becomes unreliable or inaccurate.
Introduction to Linear Regression
Linear regression is a way to find the best straight line that fits through a set of
points on a graph. This line helps us understand how one variable (like study
hours) affects another variable (like exam scores).
Assumptions in Linear Regression:
• Observations are random – Each data point (like a student's score) is
randomly selected and doesn't affect the others.
• Errors are independent – The gaps between predicted and actual values
(errors) are independent of each other and follow a regular pattern (like a
normal distribution).
• Errors do not depend on the variables – The size of the errors isn't
affected by the values of the variables themselves.
• Parameters stay the same – The slope and intercept are fixed constants
for the model.
OLS (Ordinary Least Squares) Method:
This method finds the best line by minimizing the total error.
It works like this:
• For each point, measure the vertical gap (error) between it
and the line.
• Square each error (to get rid of negatives) and add them all
up.
• The line that gives the lowest total squared error is the
best-fitting line.
This total is called the sum of squared errors; the individual gaps are called
residuals.

Example from the Graph:
In the figure, three points are shown with errors e₁, e₂, e₃. The vertical lines
between the points and the straight line show these errors.
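For a single input, the OLS line has a closed form; a minimal sketch, with illustrative study-hours data:

```python
def ols_fit(xs, ys):
    """Closed-form OLS for y = a*x + b: minimizes the sum of squared errors."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    a = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))    # slope
    b = ybar - a * xbar                          # intercept
    return a, b

# illustrative data: exam score vs. study hours
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
a, b = ols_fit(hours, scores)
print(a, b)  # slope 4.1, intercept 47.7: each extra hour adds about 4 marks
```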
Multiple Linear Regression
Multiple Linear Regression is just like simple linear regression, but instead of
using one independent variable (input) to predict the output, we use two or
more inputs.

Think of it like this:


If you’re trying to predict weekly sales, you might look at:
• How many units of Product 1 were sold (x₁)
• How many units of Product 2 were sold (x₂)
Assumptions in Multiple Linear Regression
• To make sure the model works well, we assume:
• No Multicollinearity: The input variables shouldn’t be strongly related to
each other.
• (If x1 and x2 are too similar, the model can get confused.)
• Normal Distribution of Errors: The difference between predicted and actual
results (residuals) should follow a normal (bell curve) distribution.
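Under these assumptions, fitting y = b₀ + b₁x₁ + b₂x₂ is a least-squares problem. A sketch using NumPy (the sales figures are fabricated from a known rule purely so the recovered coefficients can be checked):

```python
import numpy as np

# illustrative weekly data: units of Product 1 (x1) and Product 2 (x2)
x1 = np.array([10, 12, 8, 15, 11])
x2 = np.array([4, 5, 3, 7, 5])
# generate sales from a known rule so we can verify the fitted coefficients
sales = 20 + 3 * x1 + 5 * x2

# design matrix with a column of ones for the intercept: y = b0 + b1*x1 + b2*x2
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coeffs)  # recovers approximately [20, 3, 5]
```

Note that x1 and x2 here are correlated but not collinear; if one were an exact linear function of the other, the no-multicollinearity assumption would be violated and the coefficients would not be uniquely determined.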
Polynomial Regression
Polynomial Regression is a type of regression that can model curved
relationships between input (x) and output (y). Unlike linear regression that
fits a straight line, polynomial regression can fit a curve.

Why Use It?


Sometimes data doesn’t follow a straight line. For example, if your data looks
like a U-shape or an S-curve, a straight line won’t capture the pattern well.
That’s when polynomial regression is useful.
When to Use What?
•Use transformation when the data can be reshaped (like
exponential or power relationships).
•Use polynomial regression when the data naturally follows a curve
(like a parabola).
Consider the data provided in Table 5.8 and fit it using the
second-order polynomial.
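The data in Table 5.8 is not reproduced here, so the sketch below fits a second-order polynomial to illustrative U-shaped data instead, using NumPy's `polyfit`:

```python
import numpy as np

# illustrative U-shaped data generated from y = 2x^2 - 3x + 1
x = np.array([-2, -1, 0, 1, 2, 3], dtype=float)
y = 2 * x**2 - 3 * x + 1

# fit a second-order polynomial; polyfit returns coefficients highest power first
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # approximately [2, -3, 1]

# use the fitted curve to predict a new value
y_hat = np.polyval(coeffs, 1.5)
```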
Logistic Regression
Linear regression predicts a numerical response but is not suitable
for predicting categorical variables. When the output is a categorical
variable, the task is called a classification problem. Logistic regression
is suitable for binary classification problems, where the output is a
categorical variable with two possible values. For example, the following
scenarios involve predicting categorical variables.
1. Is the mail spam or not spam? The answer is yes or no. Thus, the
categorical dependent variable is a binary response of yes or no.
2. Whether a student should be admitted or not is based on entrance
examination marks. Here, the categorical response is admitted or not
admitted.
3. Whether a student passes or fails is based on the marks secured.
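A minimal sketch of scenario 2: the sigmoid function maps a linear score to a probability, and a simple stochastic gradient loop fits the weights. The marks data is illustrative (scaled down for stable training), and the learning rate and epoch count are arbitrary choices:

```python
import math

def sigmoid(z):
    # squashes any real number into (0, 1), read as P(class = 1)
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(w*x + b) with simple stochastic gradient ascent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w += lr * (y - p) * x   # gradient of the log-likelihood w.r.t. w
            b += lr * (y - p)       # ... and w.r.t. b
    return w, b

# illustrative data: entrance-exam marks (divided by 10) vs. admitted (1) or not (0)
marks = [3.5, 4.0, 4.5, 5.0, 6.0, 6.5, 7.0, 8.0]
admitted = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(marks, admitted)
print(sigmoid(w * 7.5 + b))   # high probability of admission for 75 marks
```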
Introduction to Decision Tree Learning Model
• The decision tree learning model, one of the most popular supervised
predictive learning models, classifies data instances with high
accuracy and consistency.
• This model is widely used for solving complex classification
applications.
• Decision tree is a concept tree which summarizes the information
contained in the training dataset in the form of a tree structure.
Once the concept model is built, test data can be easily
classified.
How Does It Work?
• You give it a dataset (called X), and the model tries to learn a rule or function
(called f(X)) that explains how to decide the output.
• The input data has different features (like age, salary, etc.), and the output is a
decision tree that uses these features to predict the answer.
• Features / Attributes: The inputs (also called independent variables) — e.g.,
age, marks, gender.
• Target / Output: What we are trying to predict — also called the response
variable — e.g., "Pass/Fail", "Spam/Not Spam".
How It Builds the Tree:
• The model checks all possible ways to split the data and builds a full tree
(called a hypothesis space).
• While using the tree, it tries to find the best and shortest path to make a
decision.
• This process may favor certain types of trees over others — this is known as
preference bias
Structure of a Decision Tree
A decision tree looks like a tree, but upside down:
• Root Node (circle): This is where the decision process starts.
It's based on the most important question.
• Decision/Internal Nodes (diamond): These are tests or
decisions based on different features (e.g., "Is age > 30?").
• Leaf Nodes (rectangle): These are the final outcomes or
decisions (e.g., "Approve Loan").
Every path from the top (root) to a leaf is a series of decisions that
lead to a conclusion. Each path is like following a trail of "yes" or
"no" answers to reach a result.
How Do You Build a Decision Tree?
• Start at the top (root node).
• Pick the best question (feature) to split the data.
• Keep splitting until all outcomes are clear (reaching the leaf nodes).
• Done! The final tree helps predict future answers.
This process continues automatically until no more useful splits are left. The
final tree represents all possible rules based on the training data.
How Does It Make Predictions? (Classification)
When you want to predict something using the tree:
• You start at the root.
• You follow the path based on the data's values (like answering questions).
• You reach a leaf node which tells you the prediction.
For example, if you want to predict whether a student will pass an exam, the
tree might check their study hours, attendance, etc., and finally say "Pass" or
"Fail."
Advantages of Decision Trees
1. Easy to model and interpret
2. Simple to understand
3. The input and output attributes can be discrete or
continuous predictor variables.
4. Can model a high degree of nonlinearity in the relationship
between the target variables and the predictor variables
5. Quick to train
Disadvantages of Decision Trees
Some of the issues that generally arise with decision tree learning are:
1. It is difficult to determine how deeply a decision tree should be grown or
when to stop growing it.
2. If the training data has errors or missing attribute values, the decision tree
constructed may become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is
computationally complex, and they have to be discretized.
4. A complex decision tree may also overfit the training data.
5. Decision tree learning is not well suited for classifying multiple output
classes.
Fundamentals of Entropy
What are we trying to do?
When we build a decision tree, we want to find the best way to split
the data so we can easily figure out the correct answer or class
(like Pass/Fail, Yes/No, etc.).

How do we decide the best way to split?


We use a concept called Entropy to help us decide.
Think of entropy as a measure of "messiness" or confusion in the
data.
• If the data is very mixed up (like half Pass, half Fail), it's high
entropy.
• If the data is all one type (like all Pass), it's low entropy — and
that's good!
What is Entropy in simple terms?
Entropy = Uncertainty
• Imagine flipping a coin → 2 outcomes (Head or Tail) → Some uncertainty.
• Now imagine rolling a dice → 6 outcomes → Even more uncertainty.
So, more possible outcomes = more entropy (more confusion)
In decision trees, we prefer splits that reduce entropy — in other words, splits
that make the groups more "pure" (where all the examples are the same
class).

How is entropy calculated?


Entropy is based on probabilities — how likely different outcomes are.
Example:
• If 100 students, and 50 Pass and 50 Fail → Very mixed = High entropy.
• If 90 Pass and 10 Fail → Mostly Pass = Lower entropy.
• If all 100 Pass → No confusion = Entropy = 0 (Perfect!)
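The three student examples can be checked directly with Shannon's formula, H = −Σ pᵢ log₂ pᵢ:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # skip empty classes: 0*log(0) = 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([50, 50]))   # 1.0 -> maximally mixed
print(entropy([90, 10]))   # about 0.47 -> mostly one class
print(entropy([100, 0]))   # 0.0 -> perfectly pure
```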
Why do we care in decision trees?
Each time we split the data in the tree, we want to pick the attribute (feature)
that gives the purest groups (lowest entropy). This helps the tree make the
clearest decisions.
Why is Entropy Important in Decision Trees?
• Helps us choose the best feature to split on.
• The goal is to reduce entropy after each split.
• If entropy becomes zero, it means the data is perfectly classified — and the
split is complete.
Step 1: Choose the Best Attribute
• From your dataset, find the attribute that splits the data best using a
selection measure (like entropy or information gain).
• Place that attribute at the root of the tree.
Think of this like asking the most important question first.
Step 2: Split the Data
• Divide the dataset into branches based on the values of the selected
attribute.
• Each branch contains data with the same value for that attribute.
Like dividing students based on whether they submitted an assignment
(Yes or No).
Step 3: Repeat the Process
• For each branch (subset), go back to Step 1 and find the next best attribute to
split that smaller group.
Now ask the next best question on each branch.
Step 4: Stop When a Condition is Met
• The process keeps going until we hit a stopping condition.
• Once a stopping condition is reached, that branch becomes a leaf node with the final
result.
Stopping Criteria (When to Stop Splitting)
• All data instances in a branch are the same class
• For example: All students passed or all failed.
• Entropy = 0 → No need to split further.
• Too few data instances
• If the number of items in a node is very small (between 0.25% to 1% of total dataset),
splitting may not make sense.
• This node becomes a leaf.
• Maximum tree depth is reached
• We set a limit on how deep the tree can go to avoid overfitting or making it too
complex.
• Once that depth is reached, stop splitting.
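Steps 1 to 4 and the stopping criteria fit into one recursive function. A compact sketch with entropy as the selection measure; the assignment/attendance dataset is illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def weighted_split_entropy(rows, labels, attr):
    # group labels by the attribute's value, then average branch entropies by size
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    return sum(len(g) / len(labels) * entropy(g) for g in by_value.values())

def build_tree(rows, labels, attributes, max_depth=5):
    # stopping criteria: pure node, no attributes left, or depth limit reached
    if len(set(labels)) == 1 or not attributes or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class
    # Step 1: pick the attribute whose split leaves the least entropy behind
    attr = min(attributes, key=lambda a: weighted_split_entropy(rows, labels, a))
    rest = [a for a in attributes if a != attr]
    tree = {"attribute": attr, "branches": {}}
    # Step 2: split on that attribute's values; Step 3: recurse on each branch
    for value in {row[attr] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree["branches"][value] = build_tree(sub_rows, sub_labels, rest, max_depth - 1)
    return tree

# illustrative data: did the student submit the assignment and attend classes?
rows = [{"submitted": "yes", "attended": "yes"}, {"submitted": "yes", "attended": "no"},
        {"submitted": "no", "attended": "yes"}, {"submitted": "no", "attended": "no"}]
labels = ["Pass", "Pass", "Pass", "Fail"]
tree = build_tree(rows, labels, ["submitted", "attended"])
```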
Decision Tree Induction Algorithms.
Popular Decision Tree Algorithms
The content lists many decision tree algorithms:
• ID3 (Iterative Dichotomiser 3): Created by J.R. Quinlan in 1986.
• C4.5: Also by Quinlan, developed in 1993, it's an improved version of ID3.
• CART (Classification and Regression Trees): Created by Breiman et al. in
1984.
• Others include: CHAID, QUEST, GUIDE, CRUISE, CTREE.
But the most commonly used and well-known ones are:
• ID3
• C4.5
• CART
How Do These Algorithms Work?
ID3:
• Uses Information Gain.
• This measures how much "information" or "purity" we gain by splitting on a particular
feature.
• It picks the attribute that reduces uncertainty (entropy) the most.
C4.5:
• Uses Gain Ratio.
• It's a modified version of Information Gain that avoids bias toward features with many
categories.
• Gain Ratio = Information Gain / Split Information (which normalizes the result)
CART:
• Uses the GINI Index.
• This measures the impurity of a dataset.
• The GINI Index tells us how often a randomly chosen element from the dataset would be
incorrectly labeled if it were labeled randomly according to the distribution of labels in the
subset.
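That description corresponds to the formula Gini = 1 − Σ pᵢ², sketched below on the same kinds of class counts used earlier:

```python
def gini(counts):
    """Gini impurity: the chance a random element gets the wrong label
    if labeled at random according to the subset's class distribution."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([50, 50]))   # 0.5  -> maximally impure for two classes
print(gini([100, 0]))   # 0.0  -> pure subset
print(gini([90, 10]))   # 0.18 -> mostly one class
```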
ID3 Tree Construction
• ID3 (Iterative Dichotomiser 3) is a supervised learning algorithm.
• It constructs a decision tree using a labeled training dataset.
• The main goal is to use the training data to build a model that can classify
new (unseen) data.
Univariate Decision Tree (Axis-Aligned Splits):
• ID3 considers only one attribute at a time to split the data at each node.
• Because of this, the decision boundaries are axis-aligned (perpendicular to
one of the features).
• This type of tree is called a univariate decision tree.
Tree Construction Strategy:
• ID3 builds the tree using a top-down, greedy approach.
• At each step (or node), it selects the best attribute to split the data based
on a splitting criterion.
Splitting Criterion – Information Gain:
• Information Gain is a measure of purity.
• It tells us how much "information" or "certainty" we gain by splitting on a particular
attribute.
• ID3 chooses the attribute that gives the highest Information Gain at each node.
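Information Gain is the parent's entropy minus the weighted entropy of the children. A minimal sketch with an illustrative Pass/Fail split:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# illustrative split: a 10 Pass / 10 Fail parent divided into two branches
parent = [10, 10]
children = [[9, 1], [1, 9]]   # each branch is nearly pure after the split
print(round(information_gain(parent, children), 3))  # 0.531 bits gained
```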
Types of Data Suitable for ID3:
• ID3 works best with categorical or nominal data (discrete values).
• If the dataset has continuous attributes (like age, height, etc.), those attributes need to
be:
• Discretized (converted into categories or ranges)
• or partitioned manually into categorical bins.
Handling of Missing Values:
• ID3 does not handle missing values well.
• It assumes the training set is clean and fully complete (no missing attribute values).
Performance with Different Dataset Sizes:
• ID3 works well with large datasets.
• For small datasets, ID3 is more likely to overfit, meaning it may memorize
the training data and perform poorly on new data.
Pruning and Outliers:
• No pruning is performed in ID3.
• Pruning means removing branches that do not contribute much to prediction accuracy
to avoid overfitting.
• Since pruning is not done, ID3 is prone to overfitting and sensitive to
outliers.
Comparison with C4.5 and CART:

Feature                        | ID3                          | C4.5                           | CART
Splitting Criterion            | Information Gain             | Gain Ratio                     | Gini Index
Handles Continuous Attributes  | No (requires discretization) | Yes                            | Yes
Handles Missing Values         | No                           | Yes                            | Yes
Pruning                        | No                           | Yes (post-pruning)             | Yes (post-pruning)
Handles Outliers               | No                           | No (C4.5 is prone to outliers) | Yes
C4.5 Construction
• C4.5 is an improved version of the ID3 algorithm.
• It can work with:
• Continuous (numerical) and discrete (categorical) attributes.
• Missing values in the data.
• Post-pruning (removes unnecessary branches after the tree is built).
Successor of C4.5:
• C5.0 is an advanced version of C4.5.
• It is faster, uses less memory, and builds smaller and more efficient trees.
Handling Missing Values:
• C4.5 can handle missing data by marking missing values with ‘?’.
• However, those missing values are not included in certain calculations
during tree building.
Occam’s Razor Principle:
C4.5 follows Occam’s Razor, which means: When multiple correct solutions
exist, choose the simplest one. So, C4.5 prefers smaller, simpler decision trees
that still perform accurately.

Splitting Criterion – Gain Ratio (Not Information Gain):


• ID3 uses Information Gain for choosing attributes.
• But it can be biased toward attributes with many distinct values.
• Example: “Register Number” is unique for each student and might appear useful but
doesn’t actually help in classification.
• To avoid this bias, C4.5 uses a new measure called Gain Ratio.

What is Split Info?


• Split Info tells us how well an attribute splits the data. It checks how evenly
the data is divided based on the values of a particular attribute.
What is Gain Ratio?
• After calculating Split Info, we use it to correct the bias of Information Gain
(used in ID3). This helps avoid choosing attributes with many unique values
(like IDs).

Why is this helpful?


• Attributes with many values may seem useful (high Info Gain), but actually
may overfit.
• Gain Ratio balances it by dividing by Split_Info.
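The correction can be seen numerically: an ID-like attribute that splits every instance into its own branch gets a perfect Information Gain, but its large Split Info pulls the Gain Ratio down. The toy counts below are illustrative:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_info(branch_sizes):
    """Entropy of the branch-size distribution: high when a split makes many small branches."""
    return entropy(branch_sizes)

def gain_ratio(parent_counts, child_counts_list):
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    info_gain = entropy(parent_counts) - weighted
    return info_gain / split_info([sum(c) for c in child_counts_list])

parent = [2, 2]                                   # 2 Pass, 2 Fail
id_split = [[1, 0], [0, 1], [1, 0], [0, 1]]       # ID-like: one instance per branch
useful_split = [[2, 0], [0, 2]]                   # two pure, balanced branches
print(gain_ratio(parent, id_split), gain_ratio(parent, useful_split))  # 0.5 vs 1.0
```

Both splits have the same Information Gain (1 bit), but dividing by Split Info makes the balanced, genuinely informative split score higher.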
Classification & Regression Tree Construction
Regression Trees
