Machine Learning
Unit-1
Vani K S
Assistant Professor
Department of ISE
Nitte Meenakshi Institute of Technology
Introduction to Machine
Learning
• Machine Learning Definition:
The name machine learning was coined in 1959 by Arthur Samuel.
• Arthur Samuel defined machine learning as a "Field of study that
gives computers the ability to learn without being explicitly
programmed".
• Tom M. Mitchell provided a widely quoted, more formal definition of
the algorithms studied in the machine learning field:
• "A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with
experience E."
23/07/2025 Vani K S, Assistant Professor, Department of ISE, NMIT 2
Examples of Machine Learning
Applications
1) Learning Associations
2) Classification
3) Regression
4) Unsupervised Learning (anomaly detection, customer segmentation,
and recommendation engines)
5) Reinforcement Learning
* Refer "Introduction to Machine Learning", Second Edition, Ethem Alpaydın, The MIT Press, Cambridge, Massachusetts, London, England (Textbook 1) for more details.
Learning Problems
• A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.
• Ex: A computer program that learns to play checkers might improve its performance, as measured by its ability to win at the class of tasks involving playing checkers games, through experience obtained by playing games against itself.
• To have a well-defined learning problem, we must identify these three features: the class of tasks, the measure of performance to be improved, and the source of experience.
Successful Applications of ML
Designing a Learning System
• Consider designing a program to learn to play checkers, with the goal
of entering it in the world checkers tournament.
• The obvious performance measure: the percent of games it wins in
this world tournament.
A checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won in the world
tournament
• Training experience E: games played against itself
Choosing the training
experience
• The type of training experience available can have a significant impact
on success or failure of the learner.
• First key attribute is whether the training experience provides direct
or indirect feedback regarding the choices made by the performance
system.
• Note: Learning from direct training feedback is typically easier than
learning from indirect feedback
Choosing the training
experience
• A second important attribute of the training experience is the degree
to which the learner controls the sequence of training examples.
(i) The learner might rely on the teacher to select informative board
states and to provide the correct move for each.
(ii) The learner might itself propose board states that it finds
particularly confusing and ask the teacher for the correct move.
(iii) The learner may have complete control over both the board states
and (indirect) training classifications, as it does when it learns by
playing against itself with no teacher present.
Choosing the training
experience
• A third important attribute of the training experience is how well it
represents the distribution of examples over which the final system
performance P must be measured.
• For example, the learner might never encounter certain crucial board
states that are very likely to be played by the human checkers
champion.
• In practice, it is often necessary to learn from a distribution of
examples that is somewhat different from those on which the final
system will be evaluated
Choosing the target function
• The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program.
• Begin with a checkers-playing program that can generate the legal
moves from any board state.
• The program needs only to learn how to choose the best move from
among these legal moves.
• Call this function ChooseMove and use the notation ChooseMove : B → M to indicate that it accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
Choosing the target function
• An alternative target function, one that will turn out to be easier to learn in this setting, is an evaluation function that assigns a numerical score to any given board state.
• Call this target function V and use the notation V : B → R to denote that V maps any legal board state from the set B to some real value.
• What exactly should be the value of the target function V for any
given board state?
• Of course any evaluation function that assigns higher scores to better
board states will do.
Choosing the target function
• Define the target value V(b) for an arbitrary board state b in B, as
follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b′), where b′ is the best final board state that can be achieved starting from b and playing optimally until the end of the game
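The four-case definition above can be sketched directly. The board representation here is a hypothetical stub (a dict with `final`/`result` flags); real checkers logic, including the game-tree search needed to find b′, is assumed away:

```python
# Hypothetical sketch of the target value V(b) for checkers.

def target_value(b, best_final_state=None):
    """Return V(b): +100 for a won final state, -100 lost, 0 drawn;
    for a non-final state, V of the best reachable final state b'."""
    if b["final"]:
        if b["result"] == "won":
            return 100
        if b["result"] == "lost":
            return -100
        return 0  # drawn
    # Non-final state: defer to the best final state b' reachable from b
    # (finding b' would require optimal play to the end of the game).
    return target_value(best_final_state)

won = {"final": True, "result": "won"}
mid = {"final": False, "result": None}
print(target_value(won))        # 100
print(target_value(mid, won))   # 100: the best reachable final state is a win
```

Note that case 4 makes V well defined but not efficiently computable, which is exactly why the program must learn an approximation V′ instead.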
Choosing a Representation for the
Target Function
• Now that we have specified the ideal target function V,
• We must choose a representation that the learning program will use to describe the function V′ that it will learn.
• For any given board state, the function V′ will be calculated as a linear combination of the following board features:
• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
Contd..
• x5: the number of black pieces threatened by red (i.e., which can be
captured on red's next turn)
• x6: the number of red pieces threatened by black
• Thus, our learning program will represent V’(b) as a linear function of
the form
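Assuming the standard linear form from Mitchell's treatment, V′(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6, the representation can be sketched as a small function; the weights and feature values below are purely illustrative:

```python
# Linear evaluation function V'(b) over the six board features x1..x6.

def v_hat(weights, features):
    """weights: [w0, w1, ..., w6]; features: [x1, ..., x6]."""
    w0, ws = weights[0], weights[1:]
    return w0 + sum(w * x for w, x in zip(ws, features))

# Illustrative weights, and a board with 12 black pieces, 11 red pieces,
# 1 black king, 0 red kings, 2 black and 3 red pieces threatened.
weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
features = [12, 11, 1, 0, 2, 3]
print(v_hat(weights, features))   # 4.0
```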
Partial design of a checkers
learning program:
Choosing a Function
Approximation Algorithm
Adjusting the weights
• All that remains is to specify the learning algorithm for choosing the weights wi to best fit the set of training examples {(b, Vtrain(b))}.
• As a first step we must define what we mean by the best fit to the training data.
• One common approach is to define the best hypothesis, or set of weights, as that which minimizes the squared error E between the training values and the values predicted by the hypothesis V′.
LMS Weight Update Rule
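The rule itself (from Mitchell) adjusts each weight in proportion to the prediction error: wi ← wi + η (Vtrain(b) − V′(b)) xi, where η is a small learning rate. A minimal sketch on a toy one-feature example (the data are illustrative, not checkers boards):

```python
# One LMS step over the weights, with x0 = 1 standing in for the bias w0.

def lms_update(weights, features, v_train, eta=0.1):
    """w_i <- w_i + eta * (v_train - v_hat) * x_i."""
    xs = [1.0] + list(features)                 # prepend x0 = 1 for w0
    v_hat = sum(w * x for w, x in zip(weights, xs))
    error = v_train - v_hat
    return [w + eta * error * x for w, x in zip(weights, xs)]

w = [0.0, 0.0]                # bias weight and one feature weight
for _ in range(200):          # repeatedly fit one example (x=2, Vtrain=10)
    w = lms_update(w, [2.0], 10.0, eta=0.05)
print(round(w[0] + w[1] * 2.0, 3))   # prediction converges to 10.0
```

Each update nudges the prediction toward the training value; repeated small corrections drive the squared error down, which is why LMS performs stochastic gradient descent on E.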
The Final Design
• The final design is described by four distinct program modules that represent the central components in many learning systems.
Performance System
• Many machine learning systems can be usefully characterized in terms of these four generic modules.
• The Performance System is the module that must solve the given
performance task.
• It takes an instance of a new problem (new game) as input and
produces a trace of its solution (game history) as output.
• In our case, the strategy used by the Performance System to select its next move at each step is determined by the learned V′ evaluation function.
• Therefore, we expect its performance to improve as this evaluation
function becomes increasingly accurate.
The Critic
• The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function.
• As shown in the diagram, each training example in this case corresponds to some game state in the trace, along with an estimate Vtrain(b) of the target function value for this example.
• In our example, the Critic corresponds to the training rule.
The Generalizer
• The Generalizer takes as input the training examples and produces an
output hypothesis that is its estimate of the target function.
• It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples.
• In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V′ described by the learned weights w0, …, w6.
The Experiment Generator
• The Experiment Generator takes as input the current hypothesis and
outputs a new problem (i.e., initial board state) for the Performance
System to explore.
• Its role is to pick new practice problems that will maximize the
learning rate of the overall system.
• In our example, the Experiment Generator follows a very simple
strategy:
• It always proposes the same initial game board to begin a new game.
• More sophisticated strategies could involve creating board positions
designed to explore particular regions of the state space.
Design Choices
Contd..
• We have restricted the type of knowledge that can be acquired to a
single linear evaluation function.
• Furthermore, we have constrained this evaluation function to depend
on only the six specific board features provided.
• Would the program we have designed be able to learn well enough to
beat the human checkers world champion?
• Probably not. In part, this is because the linear function
representation for V’ is too simple a representation to capture well
the nuances of the game.
Perspectives in Machine
Learning
• One useful perspective on machine learning is that it involves searching a very large space of possible hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner.
• Algorithms search a hypothesis space defined by some underlying
representation
• We take the perspective of learning as a search problem in order to characterize learning methods by their search strategies and by the underlying structure of the search spaces they explore.
• We can then analyze the relationship between the size of the hypothesis space to be searched and the number of training examples available.
Issues in Machine Learning
• What algorithms exist for learning general target functions from
specific training examples? In what settings will particular algorithms
converge to the desired function, given sufficient training data?
• Which algorithms perform best for which types of problems and
representations?
• How much training data is sufficient?
• When and how can prior knowledge held by the learner guide the
process of generalizing from examples?
• What is the best strategy for choosing a useful next training experience?
• How can the learner automatically alter its representation to improve
its ability to represent and learn the target function?
Concept Learning
• We consider the problem of automatically inferring the general
definition of some concept, given examples labeled as members or
nonmembers of the concept.
• This task is commonly referred to as concept learning, or
approximating a boolean-valued function from examples.
• Concept learning: inferring a Boolean-valued function from training examples of its input and output.
Concept Learning Task
• If some instance x satisfies all the constraints of hypothesis h, then h
classifies x as a positive example (h(x) = 1).
• c can be any Boolean-valued function defined over the instances X; that is, c : X → {0, 1}.
• Instances for which c(x) = 1 are called positive examples, or members of the target concept.
• Instances for which c(x) = 0 are called negative examples, or
nonmembers of the target concept.
Notations
• When learning the target concept, the learner is presented a set of
training examples, each consisting of an instance x from X, along with
its target concept value c(x).
• Instances for which c(x)=1 are called positive examples or members of
the target concept.
• Instances for which c(x)=0 are called negative examples or non-
members of the target concept.
• The ordered pair (x, c(x)) is used to describe a training example.
• D is used to represent the set of available training examples.
Inductive Learning
Hypothesis
• The learning task is to determine a hypothesis h identical to the target
concept c over the entire set of instances X
• The only information available about c is its value over the training
examples.
• Our assumption is that the best hypothesis regarding unseen
instances is the hypothesis that best fits the observed training data
• The inductive learning hypothesis: Any hypothesis found to
approximate the target function well over a sufficiently large set of
training examples will also approximate the target function well over
other unobserved examples.
Concept Learning as Search
• Concept learning can be viewed as the task of searching through a
large space of hypotheses implicitly defined by the hypothesis
representation.
• The goal of this search is to find the hypothesis that best fits the
training examples.
• Instance space X contains exactly 3·2·2·2·2·2 = 96 distinct instances.
• A similar calculation shows that there are 5·4·4·4·4·4 = 5120 syntactically distinct hypotheses within H.
Contd..
• Every hypothesis containing one or more "∅" symbols represents the empty set of instances; that is, it classifies every instance as negative.
• Therefore, the number of semantically distinct hypotheses is only 1 + (4·3·3·3·3·3) = 973.
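These counts can be checked directly. For EnjoySport, the six attributes take 3, 2, 2, 2, 2, and 2 values; adding "?" and "∅" to each gives the syntactic count, while all hypotheses containing "∅" collapse to one semantically:

```python
# Verifying the instance and hypothesis-space sizes for EnjoySport.
from functools import reduce
from operator import mul

values = [3, 2, 2, 2, 2, 2]           # values per attribute

instances = reduce(mul, values)                      # one value per attribute
syntactic = reduce(mul, (v + 2 for v in values))     # add "?" and the empty symbol
semantic = 1 + reduce(mul, (v + 1 for v in values))  # all "∅"-hypotheses count once

print(instances, syntactic, semantic)   # 96 5120 973
```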
General-to-Specific Ordering of
Hypotheses
• Consider the two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
• The sets of instances that are classified positive by h1 are also
classified as positive by h2.
• Because h2 imposes fewer constraints on the instance, it classifies
more instances as positive.
• Any instance classified positive by h1 will also be classified positive by h2.
• Therefore, we say that h2 is more general than h1.
Contd..
• Given hypotheses hj and hk, hj is more_general_than_or_equal_to
hk if and only if any instance that satisfies hk also satisfies hj.
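For conjunctive hypotheses like h1 and h2 above, this relation can be checked attribute by attribute: a constraint in hj covers a constraint in hk if it is "?" or matches exactly. (The maximally specific "∅" constraint is omitted in this sketch for simplicity.)

```python
# more_general_than_or_equal_to for conjunctions of attribute constraints.
# Each hypothesis is a tuple of constraints: a specific value or "?".

def more_general_or_equal(hj, hk):
    """True iff every instance satisfying hk also satisfies hj."""
    return all(gj == "?" or gj == gk for gj, gk in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))   # True: h2 drops the Strong constraint
print(more_general_or_equal(h1, h2))   # False
```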
• Begin with the most specific possible hypothesis in H, then generalize
this hypothesis each time it fails to cover an observed positive training
example.
Terminologies
•Features: The number of features or distinct traits that can be used to
describe each item in a quantitative manner.
•Feature Vector : n-dimensional vector of numerical features that
represent some object.
•Concept c: subset of objects from X (c is unknown).
•Target function f or c: maps each instance x ∈ X to target label y ∈ Y.
•Training Data S: Collection of examples observed by learning
algorithm.
Find-S Algorithm
• Hypothesis: A hypothesis is a certain function that we believe (or
hope) is similar to the true function, the target function that we want
to model.
• In machine learning field, the terms hypothesis and model are often
used interchangeably.
• Many Possible representations for hypothesis h.
• h as conjunction of constraints on features.
• Each constraint can be:
• a specific value (e.g., Nose = Square)
• don't care (Nose = ?)
• no value allowed (Nose = ∅)
Find-S Algorithm
• Initialize h to the most specific hypothesis in H
• For each positive training instance x:
– For each attribute constraint ai in h:
• If the constraint ai in h is satisfied by x, then do nothing
• Else replace ai in h by the next more general constraint that is satisfied by x
• Output hypothesis h
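The steps above can be sketched for conjunctive hypotheses over nominal attributes; for such attributes, "the next more general constraint" that covers a mismatch is simply "?". The EnjoySport-style data below are Mitchell's four training examples:

```python
# A minimal Find-S over nominal attributes ("?" means any value).

def find_s(examples):
    h = None
    for x, positive in examples:
        if not positive:              # Find-S ignores negative examples
            continue
        if h is None:
            h = list(x)               # first positive: most specific h covering x
        else:                         # generalize each mismatched constraint to "?"
            h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return tuple(h)

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
print(find_s(data))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```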
Find-S Algorithm-2nd Dataset
Restaurant Meal Day Cost Reaction
Sam’s Breakfast Friday Cheap Yes
Hilton Lunch Friday Expensive No
Sam’s Lunch Saturday Cheap Yes
Denny’s Breakfast Sunday Cheap No
Sam’s Breakfast Sunday Expensive No
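Running Find-S on this dataset (treating Reaction = Yes as the positive class, attribute values as plain strings) generalizes the two positive rows:

```python
# Find-S applied to the restaurant dataset above.

def find_s(examples):
    h = None
    for x, positive in examples:
        if not positive:
            continue
        h = list(x) if h is None else [a if a == b else "?" for a, b in zip(h, x)]
    return tuple(h)

data = [
    (("Sam's",   "Breakfast", "Friday",   "Cheap"),     True),
    (("Hilton",  "Lunch",     "Friday",   "Expensive"), False),
    (("Sam's",   "Lunch",     "Saturday", "Cheap"),     True),
    (("Denny's", "Breakfast", "Sunday",   "Cheap"),     False),
    (("Sam's",   "Breakfast", "Sunday",   "Expensive"), False),
]
print(find_s(data))   # ("Sam's", '?', '?', 'Cheap')
```

The two positive examples agree on Restaurant and Cost but differ on Meal and Day, so those two attributes generalize to "?".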
Find-S Algorithm-3rd Dataset
Version Space-Introduction
• A second approach to concept learning, the Candidate Elimination Algorithm, addresses several of the limitations of FIND-S.
• Find-S algorithm – outputs a single hypothesis.
• Candidate Elimination – outputs the set of all hypotheses consistent with the training examples.
• Practical applications of the Find-S and Candidate Elimination
algorithms are limited by the fact that they both perform poorly when
given noisy training data.
Representation
• A hypothesis is consistent with the training examples if it correctly classifies these examples.
• The Candidate Elimination Algorithm represents the set of all
hypotheses consistent with the observed training examples.
• This subset of all hypotheses is called the version space with respect
to the hypothesis space H and the training examples D,
Version Space
Version Spaces
• Recall that given the four training examples of the EnjoySport data, FIND-S outputs the hypothesis
h = (Sunny, Warm, ?, Strong, ?, ?)
• In fact, this is just one of six different hypotheses from H that are
consistent with these training examples.
• All six hypotheses are shown below.
Data
The Type of Data
• Data sets differ in a number of ways.
• For example, the attributes used to describe data objects can be of
different types—quantitative or qualitative
• Data sets may have special characteristics; e.g., some data sets
contain time series or objects with explicit relationships to one
another.
• Not surprisingly, the type of data determines which tools and
techniques can be used to analyze the data.
The Quality of the Data
• Data is often far from perfect.
• While most data mining techniques can tolerate some level of imperfection in the data, a focus on understanding and improving data quality typically improves the quality of the resulting analysis.
• Data quality issues that often need to be addressed include the
presence of noise and outliers;
• Missing, inconsistent, or duplicate data;
• Data that is biased or, in some other way, unrepresentative of the
phenomenon or population that the data is supposed to describe.
Pre-processing Steps
• Preprocessing Steps to Make the Data More Suitable for Data
Mining
• The raw data must be processed in order to make it suitable for
analysis.
• While one objective may be to improve data quality, other goals focus
on modifying the data so that it better fits a specified data mining
technique or tool.
• Ex: A continuous attribute, e.g., length, may need to be transformed
into an attribute with discrete categories, e.g., short, medium, or long,
in order to apply a particular technique.
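The length example above can be sketched as a simple binning function; the cut points (50 cm and 150 cm) are illustrative assumptions, not values from the text:

```python
# Discretizing a continuous attribute (length, in cm) into categories.

def discretize_length(cm, cuts=(50.0, 150.0)):
    """Map a continuous length to short / medium / long via cut points."""
    if cm < cuts[0]:
        return "short"
    if cm < cuts[1]:
        return "medium"
    return "long"

print([discretize_length(v) for v in (20.0, 90.0, 200.0)])
# ['short', 'medium', 'long']
```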
Types of Data
• A data set can often be viewed as a collection of data objects.
• Other names for a data object are record, point, vector, pattern,
event, case, sample, observation, or entity.
• Data objects are described by a number of attributes that capture the
basic characteristics of an object.
• Examples include the mass of a physical object or the time at which an event occurred.
• Other names for an attribute are variable, characteristic, field,
feature, or dimension.
Example
• A data set is often a file in which the objects are records (or rows) and each field (or column) corresponds to an attribute.
• What Is an attribute?
• Definition. An attribute is a property or characteristic of an object
that may vary, either from one object to another or from one time to
another.
• Eye color (a discrete attribute) varies from person to person, while the temperature (a continuous attribute) of an object varies over time.
The Different Types of
Attributes
• A useful (and simple) way to specify the type of an attribute is to identify
the properties of numbers that correspond to underlying properties of
the attribute.
• Ex: An attribute such as length has many of the properties of numbers.
• It makes sense to compare and order objects by length, as well as to talk
about the differences and ratios of length.
• The following properties (operations) of numbers are typically used to
describe attributes.
1. Distinctness = and !=
2. Order <, ≤, >, and ≥
3. Addition + and −
4. Multiplication ∗ and /
The Different Types of
Attributes
• Given these properties, we can define four types of attributes:
nominal, ordinal, interval, and ratio.
The Different Types of
Attributes
Interval Attribute
• Definition: Numeric data where the difference between values is meaningful, but there is no true zero point.
• Examples:
• Temperature in Celsius or Fahrenheit (e.g., 10°C, 20°C, 30°C → The difference between 10°C and 20°C is
meaningful, but 0°C is not an absolute absence of temperature).
• IQ Scores (e.g., 100, 120, 140 → Differences are meaningful, but there is no true zero IQ).
• Years in a calendar (e.g., 2000, 2010, 2020 → The differences matter, but year 0 does not mean "no time").
Ratio Attribute
• Definition: Numeric data where both the difference and the ratio are meaningful, and there is a true zero point.
• Examples:
• Height (e.g., 150 cm, 180 cm → Someone who is 180 cm is 1.2 times taller than someone who is 150 cm, and
0 cm means no height).
• Weight (e.g., 50 kg, 100 kg → A person weighing 100 kg is twice as heavy as one weighing 50 kg, and 0 kg
means no weight).
• Age (e.g., 10 years, 20 years → Someone aged 20 is twice as old as someone aged 10, and 0 years means no
age).
• Income (e.g., $1000, $2000 → $2000 is twice as much as $1000, and $0 means no income).
• The key difference is that Ratio attributes have a true zero, whereas Interval attributes do not.
The Different Types of
Attributes
• Nominal and ordinal attributes are collectively referred to as
categorical or qualitative attributes.
• Qualitative attributes, such as employee ID, lack most of the
properties of numbers.
• Even if they are represented by numbers, i.e., integers, they should be
treated more like symbols.
• The remaining two types of attributes, interval and ratio, are
collectively referred to as quantitative or numeric attributes.
• Quantitative attributes are represented by numbers and have most of
the properties of numbers.
• Quantitative attributes can be integer-valued or continuous.
The Different Types of
Attributes
• The types of attributes can also be described in terms of
transformations that do not change the meaning of an attribute.
• Ex: The meaning of a length attribute is unchanged if it is measured in
meters instead of feet.
• The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed using a transformation that preserves the attribute's meaning.
• To illustrate, the average length of a set of objects is different when
measured in meters rather than in feet, but both averages represent
the same length.
Describing Attributes by the
Number of Values
• An independent way of distinguishing between attributes is by the
number of values they can take.
• Discrete: A discrete attribute has a finite or countably infinite set of values.
• Such attributes can be categorical, such as zip codes or ID numbers, or
numeric, such as counts.
• Discrete attributes are often represented using integer variables.
• Binary attributes are a special case of discrete attributes and assume
only two values, e.g., true/false, yes/no, male/female, or 0/1.
Describing Attributes by the
Number of Values
• Binary attributes are often represented as Boolean variables, or as
integer variables that only take the values 0 or 1.
• Continuous: A continuous attribute is one whose values are real numbers.
• Examples include attributes such as temperature, height, or weight.
• Continuous attributes are typically represented as floating-point
variables.
• Practically, real values can only be measured and represented with
limited precision.
Questions about different
types of data
Missing Values
• Values may be missing because the information was not collected; e.g., some people decline to give their age or weight.
• In other cases, some attributes are not applicable to all objects;
• Missing values should be taken into account during the data analysis.
• Strategies for dealing with missing data:
• Eliminate Data Objects or Attributes.
• Estimate Missing Values
• Ignore the Missing Value during Analysis
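The three strategies listed above can be illustrated on a small attribute with missing entries (represented here as None); the sample values are illustrative:

```python
# Three ways of handling missing values in an age attribute.

ages = [25, None, 40, 35, None]

# 1) Eliminate data objects with a missing value
complete = [a for a in ages if a is not None]

# 2) Estimate missing values, e.g., impute the mean of the observed ones
mean_age = sum(complete) / len(complete)
imputed = [a if a is not None else mean_age for a in ages]

# 3) Ignore missing values during analysis: compute statistics over
#    the observed values only (here, `complete` already does this)
print(complete, mean_age, imputed)
```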
Inconsistent Values
• Data can contain inconsistent values.
• Consider an address field where both a zip code and city are listed, but the specified zip code area is not contained in that city.
• It may be that the individual entering this information transposed two
digits, or perhaps a digit was misread when the information was
scanned.
• Some types of inconsistencies are easy to detect.
• Ex: A person’s height should not be negative.
• Once an inconsistency has been detected, it is sometimes possible to
correct the data.
Duplicate Data
• A data set may include data objects that are duplicates, or almost
duplicates, of one another.
• If there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.
• Second, care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates, such as two distinct people with identical names.
• The term deduplication is often used to refer to the process of dealing
with these issues.
Data Pre-processing (Data
Cleaning)
• Pre-processing steps should be applied to make the data more
suitable for data mining.
• Aggregation or Sampling
• Dimensionality reduction
• Feature subset selection, Feature creation
• Discretization and binarization
• Variable transformation
• These items fall into two categories: selecting data objects and
attributes for the analysis or creating/changing the attributes.
• In both cases the goal is to improve the data mining analysis with
respect to time, cost, and quality.
Sampling
• Sampling is a commonly used approach for selecting a subset of the
data objects to be analysed.
• Using a sampling algorithm can reduce the data size to the point
where a better, but more expensive algorithm can be used.
• Using a sample will work almost as well as using the entire data set if
the sample is representative.
• A sample is representative if it has approximately the same property
(of interest) as the original set of data.
• Ex: A sample is representative if it has a mean that is close to that of
the original data.
Sampling Approaches
• The simplest type of sampling is simple random sampling: there is an equal probability of selecting any particular item.
(1) sampling with replacement
(2) sampling without replacement
• Simple random sampling can fail to adequately represent those types
of objects that are less frequent.
• Stratified sampling: A sampling scheme that can accommodate
differing frequencies for the items of interest.
• In the simplest version, equal numbers of objects are drawn from
each group even though the groups are of different sizes.
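The contrast above can be sketched on a toy dataset with one rare group; the group sizes and sample sizes are illustrative:

```python
# Simple random vs. stratified sampling (equal draws per group).
import random

random.seed(0)
data = [("rare", i) for i in range(5)] + [("common", i) for i in range(95)]

# Simple random sampling without replacement: rare items may be missed
simple = random.sample(data, 10)

# Stratified sampling: draw the same number of objects from each group
groups = {}
for label, value in data:
    groups.setdefault(label, []).append((label, value))
stratified = [obj for members in groups.values()
              for obj in random.sample(members, 5)]

rare_in_stratified = sum(1 for label, _ in stratified if label == "rare")
print(rare_in_stratified)   # always 5, regardless of the random seed
```

The simple random sample may contain anywhere from zero to five rare objects, while the stratified sample always represents both groups equally.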
Sampling and Loss of
Information
Progressive Sampling
• The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used.
• These approaches start with a small sample, and then increase the
sample size until a sample of sufficient size has been obtained.
• Suppose, for instance, that progressive sampling is used to learn a
predictive model.
• Although the accuracy of predictive models increases as the sample
size increases, at some point the increase in accuracy levels off.
• We want to stop increasing the sample size at this leveling-off point.
• We can get an estimate as to how close we are to this leveling-off
point, and thus, stop sampling.
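A minimal sketch of this stopping rule, assuming a hypothetical `evaluate(n)` that trains a model on a sample of size n and returns its accuracy (here faked with a toy curve that levels off):

```python
def progressive_sample_size(evaluate, start=100, factor=2, tol=0.005, max_n=100_000):
    # Grow the sample until the accuracy gain from the next
    # increase falls below tol (the leveling-off point).
    n = start
    prev = evaluate(n)
    while n * factor <= max_n:
        nxt = evaluate(n * factor)
        if nxt - prev < tol:
            break
        n, prev = n * factor, nxt
    return n

# Toy accuracy curve that levels off near 0.9 as n grows.
acc = lambda n: 0.9 - 50.0 / (n + 100)
chosen = progressive_sample_size(acc)
```

The doubling factor and tolerance are arbitrary choices for the sketch; in practice they depend on how expensive each model-training run is.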
Dimensionality Reduction
• Data sets can have a large number of features.
• There are a variety of benefits to dimensionality reduction.
• A key benefit is that many data mining algorithms work better if the
dimensionality—the number of attributes in the data—is lower.
• Dimensionality reduction can eliminate irrelevant features and reduce
noise.
• A reduction of dimensionality can lead to a more understandable
model because the model may involve fewer attributes.
• The amount of time and memory required by the data mining
algorithm is reduced with a reduction in dimensionality.
Dimensionality Reduction
• The reduction of dimensionality by selecting new attributes that are a
subset of the old is known as feature subset selection or feature
selection.
• The Curse of Dimensionality
• The curse of dimensionality refers to the phenomenon that many
types of data analysis become significantly harder as the
dimensionality of the data increases.
• For clustering, the definitions of density and the distance between
points, which are critical for clustering, become less meaningful.
Linear Algebra Techniques for
Dimensionality Reduction
• Principal Components Analysis (PCA) is a linear algebra technique for
continuous attributes that finds new attributes (principal
components) that
(1) are linear combinations of the original attributes,
(2) are orthogonal (perpendicular) to each other,
(3) capture the maximum amount of variation in the data.
• Ex: The first two principal components capture as much of the
variation in the data as is possible with two orthogonal attributes that
are linear combinations of the original attributes.
• Singular Value Decomposition (SVD) is a linear algebra technique that
is related to PCA and is also commonly used for dimensionality
reduction.
Feature Selection
• There are three standard approaches to feature selection: embedded,
filter, and wrapper.
• Embedded approaches: Feature selection occurs naturally as part of
the data mining algorithm. The algorithm itself decides which
attributes to use and which to ignore. Ex: decision tree classifiers.
• Filter approaches: Features are selected before the data mining
algorithm is run, using some approach that is independent of the data
mining task. Ex: select sets of attributes whose pairwise correlation is
as low as possible.
• Wrapper approaches: These methods use the target data mining
algorithm as a black box to find the best subset of attributes.
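The wrapper idea can be sketched as a greedy forward search, where `score(subset)` stands in for the black-box data mining algorithm (in practice it would be something like cross-validated accuracy; here it is a toy additive utility invented for the example):

```python
def greedy_forward_selection(features, score, k):
    # Wrapper approach: repeatedly add the feature that most
    # improves the black-box score of the current subset.
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in for "run the mining algorithm and measure performance":
# each feature contributes a fixed utility to the subset score.
utility = {"a": 0.3, "b": 0.1, "c": 0.5}
chosen = greedy_forward_selection(utility,
                                  score=lambda s: sum(utility[f] for f in s),
                                  k=2)
```

The greedy search evaluates the scorer many times, which is why wrapper approaches are the most expensive of the three.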
Discretization and
Binarization
• Some data mining algorithms, especially certain classification
algorithms, require that the data be in the form of categorical
attributes.
• It is often necessary to transform a continuous attribute into a
categorical attribute (discretization).
• Both continuous and discrete attributes may need to be transformed
into one or more binary attributes (binarization).
Discretization of Continuous
Attributes
• Transformation of a continuous attribute to a categorical attribute
involves two subtasks:
• Deciding how many categories to have and determining how to map
the values of the continuous attribute to these categories.
• In the first step, after the values of the continuous attribute are
sorted, they are then divided into n intervals by specifying n−1 split
points.
• In the second, rather trivial step, all the values in one interval are
mapped to the same categorical value.
• A basic distinction between discretization methods for classification is
whether class information is used (supervised) or not (unsupervised).
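The two subtasks can be sketched for the simplest unsupervised method, equal-width binning (the function name is invented for the example): n − 1 split points divide the value range into n equal intervals, and each value maps to the index of its interval.

```python
def equal_width_discretize(values, n_bins):
    # Unsupervised discretization: n_bins equal-width intervals,
    # each value is mapped to the index of the interval it falls in.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    cats = []
    for v in values:
        k = int((v - lo) / width) if width > 0 else 0
        cats.append(min(k, n_bins - 1))  # put the maximum value into the last bin
    return cats

cats = equal_width_discretize([1, 2, 5, 9, 10], 3)  # bins: [1,4), [4,7), [7,10]
```

A supervised method would instead choose the split points using the class labels (e.g., by maximizing information gain).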
Measures of Similarity and
Dissimilarity
• Similarity and dissimilarity are important because they are used by a
number of Machine learning techniques, such as clustering, nearest
neighbor classification, and anomaly detection.
• The term proximity is used to refer to either similarity or dissimilarity.
• The similarity between two objects is a numerical measure of the
degree to which the two objects are alike.
• Similarities are higher for pairs of objects that are more alike.
• Similarities are usually non-negative and are often between 0 (no
similarity) and 1 (complete similarity).
• The dissimilarity between two objects is a numerical measure of the
degree to which the two objects are different.
Measures of Similarity and
Dissimilarity
• Dissimilarities are lower for more similar pairs of objects.
• The term distance is used as a synonym for dissimilarity.
• Distance is often used to refer to a special class of dissimilarities.
• Dissimilarities sometimes fall in the interval [0, 1], but it is also
common for them to range from 0 to ∞.
• Transformations
• Transformations are often applied to convert a similarity to a
dissimilarity, or vice versa,
• To transform a proximity measure to fall within a particular range,
such as [0,1].
Similarity and Dissimilarity between
Simple Attributes
• Consider objects described by one nominal attribute.
• What would it mean for two such objects to be similar?
• Nominal attributes only convey information about the distinctness of
objects:
• We can say that two objects either have the same value or they do
not.
• Similarity is traditionally defined as 1 if attribute values match, and as
0 otherwise.
• A dissimilarity would be defined in the opposite way: 0 if the attribute
values match, and 1 if they do not.
Similarity Between Nominal
Attributes
• The dissimilarity is d = (p − m) / p, where m is the number of
matching attribute values and p is the number of attributes (the
length of x or y).
• x and y are some answers from Polly and Molly:
• x = [‘high’, ‘A’, ‘yes’, ‘Asian’]
• y = [‘high’, ‘A’, ‘yes’, ‘Latino’]
• Here, there are four elements and 3 of them are the same. The
distance between x and y is:
• d = (4–3)/4 = 1/4 = 0.25
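The matching-based dissimilarity above is a one-liner; this sketch reproduces the Polly/Molly example:

```python
def nominal_dissimilarity(x, y):
    # d = (p - m) / p, where m = number of matching attribute values
    # and p = total number of attributes.
    assert len(x) == len(y)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (len(x) - m) / len(x)

x = ["high", "A", "yes", "Asian"]
y = ["high", "A", "yes", "Latino"]
d = nominal_dissimilarity(x, y)  # (4 - 3) / 4 = 0.25
```

The corresponding similarity is simply 1 − d, i.e., the fraction of matching attributes.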
Objects with a single ordinal
attribute
• Consider an attribute that measures the quality of a product:
• Ex: A candy bar, on the scale {poor, fair, OK, good, wonderful}.
• A product, P1, which is rated wonderful, would be closer to a product
P2, which is rated good, than it would be to a product P3, which is
rated OK.
• The values of the ordinal attribute are often mapped to successive
integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3,
wonderful=4}.
• d(P1, P2) = 3 − 2 = 1 or, if we want the dissimilarity to fall between 0
and 1, d(P1, P2) = (3 − 2)/4 = 0.25.
• A similarity for ordinal attributes can then be defined as s = 1 − d.
Similarity and Dissimilarity between
Simple Attributes
• For interval or ratio attributes, the natural measure of dissimilarity
between two objects is the absolute difference of their values.
• Ex: We might compare our current weight and our weight a year ago
by saying “I am ten pounds heavier.”
• In cases such as these, the dissimilarities typically range from 0 to ∞,
rather than from 0 to 1.
Dissimilarities between Data
Objects
• Distances:
• The Euclidean distance, d, between two points, x and y, in one, two,
three, or higher dimensional space, is given by the following familiar
formula:
• d(x, y) = √( Σ_{k=1}^{n} (x_k − y_k)² )
• where n is the number of dimensions and x_k and y_k are, respectively,
the kth attributes (components) of x and y.
• The Minkowski distance metric generalizes this:
• d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}, where r is a parameter.
Distances
• The following are the three most common examples of Minkowski
distances.
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
• A common example is the Hamming distance, which is the number of
bits that are different between two objects that have only binary
attributes, i.e., between two binary vectors.
• r = 2. Euclidean distance (L2 norm).
• r = ∞. Supremum (Lmax or L∞ norm) distance.
• This is the maximum difference between any attribute of the objects.
• The L∞ distance is defined as d(x, y) = max_k |x_k − y_k|, the limit of
the Minkowski distance as r → ∞.
Examples-Distances
1) Two points x and y : (5, 1), (9, -2).
• Euclidean Distance: d = √((9 − 5)² + (−2 − 1)²) = √(16 + 9) = √25 = 5
2) x and y:
• x = [3, 6, 11, 8]
• y = [0, 9, 5, 3]
• The Manhattan distance between x and y:
• d = |3–0| + |6–9| + |11–5| + |8–3| =3+3+6+5 = 17
Examples-Distances
• x = [9, 2, -1, 12]
• y = [-4, 5, 10, 13]
• When h = 1, the formula becomes the Manhattan distance formula,
which is referred to as the L1 norm:
• d = |9 − (−4)| + |2 − 5| + |−1 − 10| + |12 − 13| = 13 + 3 + 11 + 1 = 28
• More Examples on Distances
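The worked examples above can be reproduced with one Minkowski function parameterized by r (a sketch; the function name is invented for the example):

```python
def minkowski(x, y, r):
    # L_r norm distance: r=1 Manhattan, r=2 Euclidean, r=inf supremum.
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = [9, 2, -1, 12], [-4, 5, 10, 13]
l1 = minkowski(x, y, 1)               # 13 + 3 + 11 + 1 = 28
l2 = minkowski([5, 1], [9, -2], 2)    # sqrt(16 + 9) = 5
linf = minkowski(x, y, float("inf"))  # largest single-attribute difference = 13
```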
Distances
• If d(x, y) is the distance between two points, x and y, then the
following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
• d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
• d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
• Measures that satisfy all three properties are known as metrics.
Similarities between Data
Objects
• For similarities, the triangle inequality typically does not hold, but
symmetry and positivity typically do.
• To be explicit, if s(x, y) is the similarity between points x and y, then
the typical properties of similarities are the following:
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry).
• A similarity measure can easily be converted to a metric distance.
• Ex: The cosine and Jaccard similarity.
Similarity Measures for
Binary Data
• Similarity measures between objects that contain only binary
attributes are called similarity coefficients, and typically have values
between 0 and 1.
• A value of 1 indicates that the two objects are completely similar,
while a value of 0 indicates that the objects are not at all similar.
• Let x and y be two objects that consist of n binary attributes.
• The comparison of two such objects, i.e., two binary vectors, leads to
the following four quantities (frequencies):
• f00 = the number of attributes where x is 0 and y is 0
• f01 = the number of attributes where x is 0 and y is 1
• f10 = the number of attributes where x is 1 and y is 0
• f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient
• One commonly used similarity coefficient is the simple matching
coefficient (SMC), which is defined as
• SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
• This measure counts both presences and absences equally.
• The SMC could be used to find students who had answered questions
similarly on a test that consisted only of true/false questions.
Jaccard Coefficient
• The Jaccard coefficient is frequently used to handle objects consisting
of asymmetric binary attributes.
• The Jaccard coefficient, which is often symbolized by J, is given by the
following equation:
• J = f11 / (f01 + f10 + f11)
• Example: Calculate the SMC and the Jaccard coefficient for
• x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
• y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
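Working the example in code: counting the four frequencies gives f11 = 0, f10 = 1, f01 = 2, f00 = 7, so SMC = (0 + 7)/10 = 0.7 while J = 0/3 = 0. A sketch (helper names invented for the example):

```python
def binary_counts(x, y):
    # f["ab"] = number of positions where x is a and y is b.
    f = {"00": 0, "01": 0, "10": 0, "11": 0}
    for a, b in zip(x, y):
        f[f"{a}{b}"] += 1
    return f

def smc(x, y):
    f = binary_counts(x, y)
    return (f["11"] + f["00"]) / sum(f.values())

def jaccard(x, y):
    # Ignores 0-0 matches: suited to asymmetric binary attributes.
    f = binary_counts(x, y)
    return f["11"] / (f["01"] + f["10"] + f["11"])

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
# smc(x, y) = 0.7, jaccard(x, y) = 0.0
```

The contrast shows why the Jaccard coefficient is preferred for asymmetric attributes: the many shared absences inflate the SMC but contribute nothing to J.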
Cosine Similarity
• Documents are often represented as vectors.
• Each attribute represents the frequency with which a particular term
(word) occurs in the document.
• The cosine similarity is one of the most common measure of
document similarity.
• If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖),
where · denotes the dot product and ‖x‖ is the length (Euclidean norm)
of vector x.
Cosine Similarity-Example
• Two data objects, which might represent document vectors:
• Cosine similarity really is a measure of the (cosine of the) angle between x and y.
• If the cosine similarity is 1, the angle between x and y is 0◦, and x and y are the same except for
magnitude (length).
• If the cosine similarity is 0, then the angle between x and y is 90◦, and they do not share any
terms (words).
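The definition translates directly into code; the document vectors below are invented for the example:

```python
import math

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Toy term-frequency vectors for two documents.
x = [3, 2, 0, 5]
y = [1, 0, 0, 0]
s = cosine_similarity(x, y)
```

Two vectors that share no terms (e.g., [1, 0] and [0, 1]) give similarity 0, and any vector compared with itself gives 1, matching the angle interpretation above.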
Extended Jaccard Coefficient
(Tanimoto Coefficient)
• The extended Jaccard coefficient can be used for document data and
reduces to the Jaccard coefficient in the case of binary attributes.
• The extended Jaccard coefficient is also known as the Tanimoto
coefficient.
Correlation
• The correlation between two data objects that have binary or
continuous variables is a measure of the linear relationship between
the attributes of the objects.
Comparison
Covariance and Correlation
• Covariance and correlation are two closely related terms, both used in
statistics and regression analysis.
• Covariance shows you how two variables vary together, whereas
correlation shows you how strongly the two variables are related.
• The covariance value can range from -∞ to +∞, with a negative value
indicating a negative relationship and a positive value indicating a
positive relationship.
• Covariance tells whether both variables vary in the same direction
(positive covariance) or in the opposite direction (negative
covariance).
Covariance and Correlation
• Covariance tells us only the direction of the relationship, which is not
enough to understand it completely.
• We therefore divide the covariance by the standard deviations of x and
y, obtaining the correlation coefficient, which varies between −1 and +1.
• If the correlation coefficient is zero, so is the covariance.
• The change in location does not affect correlation and covariance
measurements.
Pearson’s correlation
• Pearson’s correlation coefficient between two data objects, x and y, is
defined by the following equation:
• corr(x, y) = covariance(x, y) / (std_dev(x) · std_dev(y))
  = Σ(x_k − x̄)(y_k − ȳ) / √( Σ(x_k − x̄)² · Σ(y_k − ȳ)² )
Example for Pearson
Correlation
• Example 1: A survey was conducted in your city.
Given is the following sample data containing a
person's age and their corresponding income.
Find out whether the increase in age has an
effect on income using the correlation coefficient
formula.

Age    | 25    | 30    | 36    | 43
Income | 30000 | 44000 | 52000 | 70000
• Solution:
• To simplify the calculation, we divide y by 1000.
Example for Pearson Correlation

Age (xi) | Income/1000 (yi) | xi − x̄ | yi − ȳ | (xi − x̄)² | (yi − ȳ)² | (xi − x̄)(yi − ȳ)
25       | 30               | −8.5    | −19    | 72.25      | 361        | 161.5
30       | 44               | −3.5    | −5     | 12.25      | 25         | 17.5
36       | 52               | 2.5     | 3      | 6.25       | 9          | 7.5
43       | 70               | 9.5     | 21     | 90.25      | 441        | 199.5
x̄ = 33.5 | ȳ = 49          |         |        | Σ = 181    | Σ = 836    | Σ = 386

Pearson correlation coefficient for the sample = 386 / √(181 × 836) ≈ 0.99
Answer: Yes, with the increase in age a person's income
increases as well, since the Pearson correlation coefficient
between age and income is very close to 1.
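The computation can be checked in a few lines of Python (the function name is invented for the example; the data are the Age/Income values above, with income in thousands):

```python
import math

def pearson(x, y):
    # corr(x, y) = covariance(x, y) / (std_dev(x) * std_dev(y));
    # the shared 1/(n-1) factors cancel, so plain sums suffice.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

age = [25, 30, 36, 43]
income = [30, 44, 52, 70]  # in thousands
r = pearson(age, income)   # = 386 / sqrt(181 * 836) ≈ 0.99
```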
Perfect Correlation
• Correlation is always in the range −1 to 1.
• A correlation of 1 (−1) means that x and y have a perfect positive
(negative) linear relationship;
• That is, xk = ayk + b, where a and b are constants.
• The following two sets of values for x and y indicate cases where the
correlation is −1 and +1, respectively.
Principal Component Analysis
• Principal component analysis (PCA) is an important technique to
understand in the fields of statistics and data science.
• Let’s say that you want to predict what the gross domestic product
(GDP) of India will be for 2026.
• Do you have so many variables that you are in danger of overfitting
your model to your data?
• You want to “reduce the dimension of your feature space.”
• By reducing the dimension of your feature space, you have fewer
relationships between variables to consider
• And you are less likely to overfit your model.
Introduction
• Reducing the dimension of the feature space is called
“dimensionality reduction.”
• There are many ways to achieve dimensionality reduction, but most
of these techniques fall into one of two classes:
• Feature Elimination
• Feature Extraction
• By eliminating features, we’ve also entirely eliminated any benefits
those dropped variables would bring.
• Feature extraction, however, doesn’t run into this problem.
PCA
• Principal component analysis is a technique for feature extraction —
• It combines our input variables in a specific way, then we can drop the
“least important” variables while still retaining the most valuable
parts of all of the variables!
• As an added benefit, each of the “new” variables after PCA are all
independent of one another.
• This is a benefit because the assumptions of a linear model require
our independent variables to be independent of one another.
• The main goal of a PCA analysis is to identify patterns in data; PCA
aims to detect the correlation between variables.
• Attempting to reduce the dimensionality only makes sense if a strong
correlation between variables exists.
PCA Introduction
• The main idea of principal component analysis (PCA) is to reduce the
dimensionality of a data set consisting of many variables correlated
with each other, either heavily or lightly, while retaining the variation
present in the dataset to the maximum extent.
• The same is done by transforming the variables to a new set of
variables, which are known as the principal components (or simply,
the PCs) and are orthogonal,
• ordered such that the retention of variation present in the original
variables decreases as we move down in the order.
PCA Introduction
• The 1st principal component retains maximum variation that was
present in the original components.
• The principal components are the eigenvectors of a covariance
matrix, and hence they are orthogonal.
Properties of PCA
• A principal component can be defined as a linear combination of
optimally-weighted observed variables.
• The PCs possess some useful properties, which are listed below:
1. PCs are essentially linear combinations of the original variables; the
weights vector in this combination is the eigenvector, which in turn
satisfies the principle of least squares.
2. PCs are orthogonal, as already discussed.
3. The variation present in the PCs decreases as we move from the 1st
PC to the last one, hence the ordering by importance.
• The least important PCs are also sometimes useful in regression,
outlier detection, etc.
Terminologies
• Dimensionality: It is the number of random variables in a dataset or
simply the number of features, or rather more simply, the number of
columns present in your dataset.
• Correlation: It shows how strongly two variable are related to each
other.
• Its value ranges from −1 to +1.
• Positive indicates that when one variable increases, the other
increases as well, while negative indicates the other decreases on
increasing the former.
• Orthogonal: Uncorrelated to each other, i.e., correlation between
any pair of variables is 0.
Terminologies
• Covariance Matrix: This matrix consists of the covariances between
the pairs of variables.
• The (i,j)th element is the covariance between i-th and j-th variable.
• Eigenvectors: Consider a non-zero vector v.
• It is an eigenvector of a square matrix A if Av is a scalar multiple of v,
or simply: Av = λv
• Here, v is the eigenvector and λ is the eigenvalue associated with it.
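A hand-checkable instance of Av = λv: for the symmetric matrix below (chosen for the example), v = (1, 1) is an eigenvector with eigenvalue λ = 3.

```python
# Verify Av = λv for A = [[2, 1], [1, 2]], v = (1, 1), λ = 3.
A = [[2, 1], [1, 2]]
v = (1, 1)
Av = tuple(sum(A[i][j] * v[j] for j in range(2)) for i in range(2))
lam = 3
scaled = tuple(lam * c for c in v)  # λv = (3, 3)
```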
Eigen Vectors and Eigen Values
Visualization
Eigen Vectors and Principal
Components
• An eigenvector is a direction, in the example above the eigenvector
was the direction of the line (vertical, horizontal, 45 degrees etc.) .
• An eigenvalue is a number, telling you how much variance there is in
the data in that direction,
• In the example above the eigenvalue is a number telling us how
spread out the data is on the line.
• The eigenvector with the highest eigenvalue is therefore the principal
component.
Eigen and PCA
• Eigenvalues are coefficients applied to eigenvectors that give the
vectors their length or magnitude.
• So, PCA is a method that:
• Measures how each variable is associated with one another using a
Covariance matrix
• Understands the directions of the spread of our data using
Eigenvectors
• Brings out the relative importance of these directions using
Eigenvalues
Eigen and PCA-Relation
(Additional Information)
• PCA attempts to draw straight, explanatory lines through data, like
linear regression.
• Each straight line represents a “principal component,” or a
relationship between an independent and dependent variable.
• While there are as many principal components as there are
dimensions in the data, PCA’s role is to prioritize them.
Implement PCA on a 2D
Dataset
• Eigenvectors are unit vectors with length or magnitude equal
to 1.
• They are often referred to as right vectors, which simply
means a column vector.
• If there are eigenvalues close to zero, they represent components
that may be discarded.
Implement PCA on a 2D
Dataset
• Step 1: Normalize the data: The first step is to normalize the data so
that PCA works properly.
• This is done by subtracting the respective mean from the numbers in
the respective column.
• So if we have two dimensions X and Y, all X become x − x̄ and all
Y become y − ȳ.
• This produces a dataset whose mean is zero.
• Step 2: Calculate the covariance matrix:
• Since the dataset we took is 2-dimensional, this will result in a 2x2
Covariance matrix.
• Note that Var[X1] = Cov[X1,X1]
• Var[X2] = Cov[X2,X2].
Finding Covariance Matrix
• The covariance matrix is a square matrix, of d x d dimensions,
where d stands for “dimension” (or feature or column, if our data is
tabular).
• It shows the covariance between each pair of features.
Finding Covariance Matrix — Covariance Calculation

x − x̄ | y − ȳ | (x − x̄)² | (y − ȳ)² | (x − x̄)(y − ȳ)
0.69  | 0.49  | 0.4761   | 0.2401   | 0.3381
−1.31 | −1.21 | 1.7161   | 1.4641   | 1.5851
0.39  | 0.99  | 0.1521   | 0.9801   | 0.3861
0.09  | 0.29  | 0.0081   | 0.0841   | 0.0261
1.29  | 1.09  | 1.6641   | 1.1881   | 1.4061
0.49  | 0.79  | 0.2401   | 0.6241   | 0.3871
0.19  | −0.31 | 0.0361   | 0.0961   | −0.0589
−0.81 | −0.81 | 0.6561   | 0.6561   | 0.6561
−0.31 | −0.31 | 0.0961   | 0.0961   | 0.0961
−0.71 | −1.01 | 0.5041   | 1.0201   | 0.7171
Sum         |  | 5.549    | 6.449    | 5.539
Sum/(n − 1) |  | 0.616556 | 0.716556 | 0.615444

So Var(x) = 0.616556, Var(y) = 0.716556, and Cov(x, y) = 0.615444.
Implement PCA on a 2D
Dataset
• Step 3: Calculate the eigenvalues and eigenvectors:
• Calculate the eigenvalues and eigenvectors for the covariance matrix.
• The same is possible because it is a square matrix.
• λ is an eigenvalue for a matrix A if it is a solution of the characteristic
equation:
• det( λI − A ) = 0
• For each eigenvalue λ, a corresponding eigenvector v can be found
by solving:
• ( λI − A )v = 0
Implement PCA on a 2D
Dataset
• Step 4: Choosing components and forming a feature vector:
• If we have a dataset with n variables, then we have the
corresponding n eigenvalues and eigenvectors.
• It turns out that the eigenvector corresponding to the highest
eigenvalue is the principal component of the dataset.
• To reduce the dimensions, we keep the eigenvectors corresponding to
the p largest eigenvalues and ignore the rest.
• We do lose out some information in the process, but if the
eigenvalues are small, we do not lose much.
Contd..
• We form a feature vector which is a matrix of vectors, in our case, the
eigenvectors.
• Since we just have 2 dimensions in the running example, we can either
choose the one corresponding to the greater eigenvalue or simply take
both.
• Feature Vector = (eigvector1, eigvector2)
• Step 5: Forming Principal Components:
• This is the final step, where we actually form the principal components.
• We take the transpose of the feature vector and left-multiply the
transpose of the scaled (mean-centred) dataset by it:
• NewData = FeatureVectorT × ScaledDataT
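The five steps can be run end-to-end on the 10-point dataset whose covariance matrix was computed earlier (the x and y columns are recovered from the mean-subtracted values shown in that table, with x̄ = 1.81 and ȳ = 1.91). A sketch using NumPy, assuming it is available:

```python
import numpy as np

# The 2-D dataset from the covariance-calculation slide.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)            # Step 1: subtract the column means
C = np.cov(Xc, rowvar=False)       # Step 2: 2x2 covariance matrix
vals, vecs = np.linalg.eigh(C)     # Step 3: eigenvalues and eigenvectors
order = np.argsort(vals)[::-1]     # Step 4: sort descending; PC1 first
vals, vecs = vals[order], vecs[:, order]
scores = Xc @ vecs                 # Step 5: project data onto the PCs
```

Projecting with `Xc @ vecs` is the row-oriented equivalent of the transposed product above; each row of `scores` gives one object's coordinates along PC1 and PC2.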
Example-1
Example-1 — Covariance Calculation

x − x̄ | y − ȳ | (x − x̄)² | (y − ȳ)² | (x − x̄)(y − ȳ)
−1    | −2    | 1        | 4        | 2
1     | 2     | 1        | 4        | 2
Sum         |  | 2 | 8 | 4
Sum/(n − 1) |  | 2 | 8 | 4

Final Covariance Matrix:
[ 2  4 ]
[ 4  8 ]
Example-2
Apply the PCA algorithm and calculate the principal
components
Solution
Applications of PCA
• PCA is predominantly used as a dimensionality reduction technique in
domains like facial recognition, computer vision and image
compression.
• It is also used for finding patterns in data of high dimension in the field
of finance, data mining, bioinformatics, psychology, etc.
• PCA for images:
• We will be restricting our discussion to square images only.
• Any square image of size NxN pixels can be represented as
an NxN matrix where each element is the intensity value of a pixel.
• A single vector is formed by placing the rows of pixels one after the
other.
• So if you have a set of images, we can form a matrix out of these
vectors.
PCA for images
• Say you are given an image to recognize which is not a part of the
previous set.
• The machine checks the differences between the to-be-recognized
image and each of the principal components.
• It turns out that the process performs well if PCA is applied and the
differences are taken from the ‘transformed’ matrix.
• Also, applying PCA gives us the liberty to leave out some of the
components without losing out much information and thus reducing
the complexity of the problem.
Normalization (Only for Lab
Syllabus)
• Some features, such as latitude or longitude, are bounded in value.
• Other numeric features, such as counts, may increase without bound.
• Models that are smooth functions of the input, such as linear
regression, logistic regression, or anything that involves a matrix, are
affected by the scale of the input.
• If your model is sensitive to the scale of input features, feature scaling
could help.
• As the name suggests, feature scaling changes the scale of the
feature.
• Sometimes people also call it feature normalization.
• Feature scaling is usually done individually to each feature.
Min-Max Scaling
(Normalization)
• Let x be an individual feature value (i.e., a value of the feature in
some data point),
• min(x) and max(x), respectively, be the minimum and maximum
values of this feature over the entire dataset.
• Min-max scaling squeezes (or stretches) all feature values to be within
the range [0, 1]: x̃ = (x − min(x)) / (max(x) − min(x)).
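The scaling is a one-liner per feature; this sketch applies it to one feature's values (the function name is invented for the example):

```python
def min_max_scale(values):
    # Map each value to (v - min) / (max - min), i.e., into [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 50])  # min -> 0, max -> 1
```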
Standardization (Variance
Scaling)
• Feature standardization is defined as: x̃ = (x − mean(x)) / √(var(x))
• It subtracts off the mean of the feature (over all data points) and
divides by the standard deviation, the square root of the variance.
• Hence, it can also be called variance scaling.
• The resulting scaled feature has a mean of 0 and a variance of 1.
• If the original feature has a Gaussian distribution, then the scaled
feature does too.
Thank You