INTRODUCTION TO
MACHINE LEARNING
Ingredients of Machine Learning
▪ Fundamentally, machine learning is about taking data in some form and aggregating it in a
way that lets us make predictions.
▪ The main focus of machine learning is making decisions or predictions based on data.
▪ One crucial aspect of machine learning approaches to solving problems is that human
engineering plays an important role.
▪ A human still has to frame the problem: acquire and organize data, design a space of
possible solutions, select a learning algorithm and its parameters, apply the algorithm to
the data, validate the resulting solution to decide whether it's good enough to use, etc.
▪ These steps are of great importance.
▪ In general, we need to solve these two problems:
• estimation: When we have data that are noisy reflections of some underlying quantity of
interest, we have to aggregate the data and make estimates or predictions about the
quantity. How do we deal with the fact that, for example, the same treatment may end up
with different results on different trials? How can we predict how well an estimate may
compare to future results?
• generalization: How can we predict results of a situation or experiment that we have never
encountered before in our data set?
▪ We can describe problems and their solutions using six characteristics, three of which
characterize the problem and three of which characterize the solution:
1. Problem class: What is the nature of the training data and what kinds of queries will be made at
testing time?
2. Assumptions: What do we know about the source of the data or the form of the solution?
3. Evaluation criteria: What is the goal of the prediction or estimation system? How will the
answers to individual queries be evaluated? How will the overall performance of the system be
measured?
4. Model type: Will an intermediate model be made? What aspects of the data will be modeled?
How will the model be used to make predictions?
5. Model class: What particular parametric class of models will be used? What criterion will we use
to pick a particular model from the model class?
6. Algorithm: What computational process will be used to fit the model to the data and/or to make
predictions?
▪ Without making some assumptions about the nature of the process generating the data, we
cannot perform generalization.
▪ Tasks: the problems that can be solved with machine learning (looking for
structure)
▪ Models: the output of machine learning
▪ Geometric models
▪ Probabilistic models
▪ Logical models
▪ Grouping and grading
▪ Features: the workhorses of machine learning
▪ Many uses of features
▪ Feature construction and transformation
▪ The most common machine learning tasks are predictive, in the sense that they
concern predicting a target variable from features.
▪ Binary and multi-class classification: categorical target
▪ Regression: numerical target
▪ Clustering: hidden target
▪ Descriptive tasks are concerned with exploiting underlying structure in the data.
▪ If our e-mails are described by word-occurrence features as in the text
classification example, the similarity of e-mails would be measured in terms of the
words they have in common. For instance, we could take the number of common
words in two e-mails and divide it by the number of words occurring in either e-
mail (this measure is called the Jaccard coefficient).
▪ Suppose that one e-mail contains 42 (different) words and another contains 112
words, and the two e-mails have 23 words in common; then their similarity would
be 23/(42+112−23) = 23/131 = 0.18. We can then cluster our e-mails into groups,
such that the average similarity of an e-mail to the other e-mails in its group is
much larger than the average similarity to e-mails from other groups.
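A minimal sketch of this similarity measure. The word sets below are synthetic placeholders chosen only to reproduce the sizes in the example (42 words, 112 words, 23 in common):

```python
def jaccard(words_a, words_b):
    """Jaccard coefficient: |A intersect B| / |A union B| for two word sets."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Synthetic word sets with the sizes from the example:
# |A| = 42, |B| = 112, |A intersect B| = 23, so |A union B| = 131.
a = {f"w{i}" for i in range(42)}
b = {f"w{i}" for i in range(19, 131)}
print(round(jaccard(a, b), 2))  # 0.18
```

In a clustering setting, this function would be evaluated on every pair of e-mails to build the similarity matrix the groups are derived from.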
▪ Consider the following matrix:
▪ Imagine these represent ratings by six different people (in rows), on a scale of 0 to
3, of four different films – say Parasite, Avatar, Inception, Home Alone, (in columns,
from left to right). Inception seems to be the most popular of the four with an
average rating of 1.5, and Parasite is the least appreciated with an average rating of
0.5. Can you see any structure in this matrix?
▪ The right-most matrix associates films (in columns) with genres (in rows): Parasite
and Avatar belong to two different genres, say drama and adventure, Inception
belongs to both, and Home Alone belongs to adventure genre and also introduces
a new genre (say comedy).
▪ The tall, 6-by-3 matrix then expresses people’s preferences in terms of genres.
▪ Finally, the middle matrix states that the adventure genre is twice as important as
the other two genres in terms of determining people’s preferences.
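The slide's rating matrix itself is not reproduced here, so the sketch below uses hypothetical ratings that are merely consistent with the facts quoted in the text (Inception averages 1.5, Parasite 0.5, ratings on a 0 to 3 scale, adventure weighted twice as much). It shows how the three smaller matrices multiply back into the full rating matrix:

```python
import numpy as np

# Hypothetical people x genres matrix (drama, adventure, comedy):
# which genres each of the six people likes.
U = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
# Genre importance: adventure counts twice as much as the other two.
D = np.diag([1, 2, 1])
# Genres x films (Parasite, Avatar, Inception, Home Alone):
# Parasite = drama, Avatar = adventure, Inception = both,
# Home Alone = adventure + comedy.
V = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 0, 1]])

M = U @ D @ V          # the reconstructed 6 x 4 rating matrix
print(M)
print(M.mean(axis=0))  # per-film averages: 0.5, 1.0, 1.5, ~1.33
```

The point of the decomposition is that the 6 x 4 matrix of ratings is explained by far fewer numbers: genre memberships and genre weights.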
▪ We can draw a distinction between whether the model output involves the target
variable or not: we call it a predictive model if it does, and a descriptive model if
it does not. This leads to four different machine learning settings.
▪ The rows refer to whether the training data is labelled with a target variable, while
the columns indicate whether the models learned are used to predict a target
variable or rather describe the given data.
Machine learning models can be distinguished according to their main intuition:
▪ Geometric models use intuitions from geometry such as separating (hyper-
)planes, linear transformations and distance metrics.
▪ Probabilistic models view learning as a process of reducing uncertainty,
modelled by means of probability distributions.
▪ Logical models are defined in terms of easily interpretable logical expressions.
Alternatively, they can be characterised by their modus operandi:
▪ Grouping models divide the instance space into segments; in each segment a
very simple (e.g., constant) model is learned.
▪ Grading models learn a single, global model over the instance space.
The basic linear classifier constructs a decision
boundary by half-way intersecting the line
between the positive and negative centres of mass.
It is described by the equation w· x = t, where w is
a vector perpendicular to the decision boundary, x
points to an arbitrary point on the decision
boundary, and t is the decision threshold. A good
way to think of the vector w is as pointing from the
‘centre of mass’ of the negative examples, n, to the
centre of mass of the positives p with w = p−n; the
decision threshold can be found by noting that
(p+n)/2 is on the decision boundary, and hence t =
(p−n)·(p+n)/2 = (||p||² − ||n||²)/2, where ||x||
denotes the length of vector x.
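The construction above translates almost line by line into code. This is a minimal sketch with a tiny made-up data set, not a production classifier:

```python
import numpy as np

def train_basic_linear(pos, neg):
    """Basic linear classifier: w points from the negative centre of
    mass n to the positive centre of mass p, and the threshold t puts
    the midpoint (p + n)/2 on the decision boundary."""
    p = pos.mean(axis=0)                     # centre of mass of positives
    n = neg.mean(axis=0)                     # centre of mass of negatives
    w = p - n
    t = (np.dot(p, p) - np.dot(n, n)) / 2    # (||p||^2 - ||n||^2) / 2
    return w, t

def predict(w, t, x):
    """Classify by which side of the hyperplane w.x = t the point falls."""
    return 'positive' if np.dot(w, x) > t else 'negative'

# Tiny illustrative data set (hypothetical points).
pos = np.array([[2.0, 2.0], [3.0, 3.0]])
neg = np.array([[0.0, 0.0], [1.0, 0.0]])
w, t = train_basic_linear(pos, neg)
print(w, t, predict(w, t, np.array([3.0, 3.0])))
```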
▪ The objective of the support vector machine
algorithm is to find a hyperplane in an
N-dimensional space (N is the number of
features) that distinctly classifies the data
points.
▪ To separate the two classes of data points,
there are many possible hyperplanes that
could be chosen. Our objective is to find a
plane that has the maximum margin, i.e., the
maximum distance between data points of
both classes. Maximizing the margin
distance provides some reinforcement so
that future data points can be classified with
more confidence.
▪ Hyperplanes are decision boundaries that help classify the data points. Data points falling on
either side of the hyperplane can be attributed to different classes. Also, the dimension of the
hyperplane depends upon the number of features. If the number of input features is 2, then
the hyperplane is just a line. If the number of input features is 3, then the hyperplane
becomes a two-dimensional plane. It becomes difficult to imagine when the number of
features exceeds 3.
▪ Support vectors are the data points closest to the hyperplane; they influence the position
and orientation of the hyperplane. Using these support vectors, we maximize the margin of
the classifier. Deleting the support vectors will change the position of the hyperplane. These
are the points that help us build our SVM.
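The SVM training problem itself needs an optimizer, but the quantity being maximized is easy to state: the distance from the hyperplane w·x = t to the nearest data point. A minimal sketch of that margin computation, on hypothetical points:

```python
import numpy as np

def geometric_margin(w, t, X):
    """Smallest distance from any row of X to the hyperplane w.x = t.
    The distance of a point x to the hyperplane is |w.x - t| / ||w||."""
    return np.min(np.abs(X @ w - t)) / np.linalg.norm(w)

# Hypothetical hyperplane (the vertical axis, x1 = 0) and two points.
w = np.array([1.0, 0.0])
t = 0.0
X = np.array([[2.0, 0.0], [-1.0, 5.0]])
print(geometric_margin(w, t, X))  # 1.0: the closest point is one unit away
```

An SVM solver searches over (w, t) for the separating hyperplane that makes this value as large as possible; the points achieving the minimum are the support vectors.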
Free Ipod  lottery  P(Y = spam|Free Ipod, lottery)  P(Y = ham|Free Ipod, lottery)
    0         0                0.31                            0.69
    0         1                0.65                            0.35
    1         0                0.80                            0.20
    1         1                0.40                            0.60
▪ Assuming that X and Y are the only variables we know and care about, the posterior
distribution P(Y |X) helps us to answer many questions of interest.
▪ For instance, to classify a new e-mail we determine whether the words ‘Free Ipod’
and ‘lottery’ occur in it, look up the corresponding probability P(Y = spam|Free
Ipod,lottery), and predict spam if this probability exceeds 0.5 and ham otherwise.
▪ Such a recipe to predict a value of Y on the basis of the values of X and the
posterior distribution P(Y |X) is called a decision rule.
▪ Suppose we skimmed an e-mail and noticed that it contains the word ‘lottery’ but
we haven’t looked closely enough to determine whether it uses the words ‘Free
Ipod’. This means that we don’t know whether to use the second or the fourth row in
previous Table to make a prediction.
▪ This is a problem, as we would predict spam if the e-mail contained the word ‘Free
Ipod’ (second row) and ham if it didn’t (fourth row). The solution is to average these
two rows, using the probability of ‘Free Ipod’ occurring in any e-mail (spam or not):
P(Y |lottery) = P(Y |Free Ipod = 0, lottery) P(Free Ipod = 0) + P(Y |Free Ipod = 1, lottery) P(Free Ipod = 1)
▪ For instance, suppose for the sake of argument that one in ten e-mails contains the
word ‘Free Ipod’; then P(Free Ipod = 1) = 0.10 and P(Free Ipod = 0) = 0.90. Using
the above formula,
we obtain
P(Y = spam|lottery = 1) = 0.65 · 0.90+0.40 · 0.10 = 0.625 and
P(Y = ham|lottery = 1) = 0.35 · 0.90+0.60 · 0.10 = 0.375.
Because the occurrence of ‘Free Ipod’ in any e-mail is relatively rare, the resulting
distribution deviates only a little from the second row of the table.
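The averaging step can be sketched directly from the table values. The marginal P(Free Ipod) is the one-in-ten figure assumed in the text:

```python
# Posterior P(Y = spam | Free Ipod, lottery) from the table,
# keyed by (free_ipod, lottery).
p_spam = {(0, 0): 0.31, (0, 1): 0.65, (1, 0): 0.80, (1, 1): 0.40}
p_free_ipod = {1: 0.10, 0: 0.90}  # assumed marginal from the text

def p_spam_given_lottery(lottery):
    """Average over the unobserved 'Free Ipod' feature."""
    return sum(p_spam[(fi, lottery)] * p_free_ipod[fi] for fi in (0, 1))

print(p_spam_given_lottery(1))  # 0.65*0.90 + 0.40*0.10 = 0.625
```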
▪ As a matter of fact, statisticians often work with a different conditional
probability, given by the likelihood function P(X|Y ).
▪ I like to think of these as thought experiments: if somebody were to send me a
spam e-mail, how likely would it be that it contains exactly the words of the e-mail
I’m looking at? And how likely if it were a ham e-mail instead?
▪ What really matters is not the magnitude of these likelihoods, but their ratio: how
much more likely is it to observe this combination of words in a spam e-mail than it
is in a non-spam e-mail.
▪ For instance, suppose that for a particular e-mail described by X we have P(X|Y =
spam) = 3.5 · 10−5 and P(X|Y = ham) = 7.4 · 10−6 , then observing X in a spam e-
mail is nearly five times more likely than it is in a ham e-mail.
▪ This suggests the following decision rule (maximum likelihood, ML): predict
spam if the likelihood ratio is larger than 1 and ham otherwise.
▪ Use likelihoods if you want to ignore the prior distribution or assume it uniform,
and posterior probabilities otherwise.
Using the data from Table, and assuming a uniform prior distribution, we arrive at the
following posterior odds:
P(Y = spam|Free Ipod = 0, lottery = 0) / P(Y = ham|Free Ipod = 0, lottery = 0) = 0.31/0.69
= 0.45
P(Y = spam|Free Ipod = 1, lottery = 1) / P(Y = ham|Free Ipod = 1, lottery = 1) = 0.40/0.60
= 0.67
P(Y = spam|Free Ipod = 0, lottery = 1) / P(Y = ham|Free Ipod = 0, lottery = 1) = 0.65/0.35
= 1.9
P(Y = spam|Free Ipod = 1, lottery = 0) / P(Y = ham|Free Ipod = 1, lottery = 0) = 0.80/0.20
= 4.0
Using a MAP decision rule we predict ham in the top two cases and spam in the bottom
two. Given that the full posterior distribution is all there is to know about the domain in a
statistical sense, these predictions are the best we can do: they are Bayes-optimal.
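A sketch of this MAP decision rule over the full posterior table: compute the odds spam : ham for each feature combination and predict spam exactly when the odds exceed 1.

```python
# Full posterior P(Y = spam | Free Ipod, lottery) from the table,
# keyed by (free_ipod, lottery).
p_spam = {(0, 0): 0.31, (0, 1): 0.65, (1, 0): 0.80, (1, 1): 0.40}

def map_decision(free_ipod, lottery):
    """Posterior odds spam : ham, and the MAP prediction."""
    ps = p_spam[(free_ipod, lottery)]
    odds = ps / (1 - ps)
    return odds, ('spam' if odds > 1 else 'ham')

for key in p_spam:
    odds, decision = map_decision(*key)
    print(key, round(odds, 2), decision)
```

Running this reproduces the four odds (0.45, 1.9, 4.0, 0.67) and the Bayes-optimal predictions in the text.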
Y P(Free Ipod = 1|Y ) P(Free Ipod = 0|Y )
SPAM 0.4 0.6
HAM 0.12 0.88
▪ Using the marginal likelihoods from the table in the previous slide, together with
P(lottery = 1|Y = spam) = 0.21, P(lottery = 0|Y = spam) = 0.79,
P(lottery = 1|Y = ham) = 0.13 and P(lottery = 0|Y = ham) = 0.87, we can approximate
the likelihood ratios (the previously calculated odds from the full posterior
distribution are shown in brackets):
▪ P(Free Ipod = 0|Y = spam) P(lottery = 0|Y = spam) / P(Free Ipod = 0|Y = ham)
P(lottery = 0|Y = ham) = (0.60 x 0.79) / (0.88 x 0.87) = 0.62 (0.45)
▪ P(Free Ipod = 0|Y = spam) P(lottery = 1|Y = spam) / P(Free Ipod = 0|Y = ham)
P(lottery = 1|Y = ham) = (0.60 x 0.21) / (0.88 x 0.13) = 1.1 (1.9)
▪ P(Free Ipod = 1|Y = spam) P(lottery = 0|Y = spam) / P(Free Ipod = 1|Y = ham)
P(lottery = 0|Y = ham) = (0.40 x 0.79) / (0.12 x 0.87) = 3.0 (4.0)
▪ P(Free Ipod = 1|Y = spam) P(lottery = 1|Y = spam) / P(Free Ipod = 1|Y = ham)
P(lottery = 1|Y = ham) = (0.40 x 0.21) / (0.12 x 0.13) = 5.4 (0.67)
▪ We see that, using a maximum likelihood decision rule, our very simple model
arrives at the Bayes-optimal prediction in the first three cases, but not in the fourth
(‘Free Ipod’ and ‘lottery’ both present), where the marginal likelihoods are actually
very misleading.
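A sketch of this naive approximation: multiply the per-word marginal likelihoods within each class, assuming the words occur independently given the class. The 'Free Ipod' likelihoods come from the table above; the 'lottery' likelihoods (0.21 for spam, 0.13 for ham) are the ones appearing in the calculations:

```python
# Marginal likelihoods P(word present | class) from the slides.
p_fi = {'spam': 0.40, 'ham': 0.12}   # P(Free Ipod = 1 | Y)
p_lo = {'spam': 0.21, 'ham': 0.13}   # P(lottery = 1 | Y)

def lik(table, value, y):
    """Likelihood of a single binary feature value given the class."""
    return table[y] if value == 1 else 1 - table[y]

def naive_ratio(free_ipod, lottery):
    """Naive Bayes likelihood ratio spam : ham, treating the two
    words as independent within each class."""
    num = lik(p_fi, free_ipod, 'spam') * lik(p_lo, lottery, 'spam')
    den = lik(p_fi, free_ipod, 'ham') * lik(p_lo, lottery, 'ham')
    return num / den

for fi in (0, 1):
    for lo in (0, 1):
        print(fi, lo, round(naive_ratio(fi, lo), 2))
```

Running this reproduces the four approximated ratios, including the misleading 5.4 for the case where both words are present.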
▪ Models of this type can be easily translated into rules that are understandable by
humans, such as: if Free Ipod = 1 then Class = Y = spam.
▪ Such rules are easily organised in a tree structure called a feature tree.
▪ The idea of such a tree is that features are used to iteratively partition the instance
space. The leaves of the tree therefore correspond to rectangular areas in the
instance space which we will call instance space segments, or segments for short.
▪ Depending on the task we are solving, we can then label the leaves with a class, a
probability, a real value, and so on. Feature trees whose leaves are labelled with
classes are commonly called decision trees.
▪ (left) A feature tree combining two Boolean features. Each internal node or split is
labelled with a feature, and each edge emanating from a split is labelled with a
feature value. Each leaf therefore corresponds to a unique combination of feature
values. Also indicated in each leaf is the class distribution derived from the training
set. (right) A feature tree partitions the instance space into rectangular regions, one
for each leaf. We can clearly see that the majority of ham lives in the lower left-
hand corner.
▪ The leaves of the tree in Figure could be labelled, from left to right, as ham – spam
– spam, employing a simple decision rule called majority class.
▪ Alternatively, we could label them with the proportion of spam e-mail occurring in
each leaf: from left to right, 1/3, 2/3, and 4/5.
▪ Or, if our task was a regression task, we could label the leaves with predicted real
values or even linear functions of some other, real-valued features.
▪ (left) A complete feature tree built from two Boolean features. (right) The
corresponding instance space partition is the finest partition that can be achieved
with those two features.
▪ A feature list is a binary feature tree which always branches in the same direction,
either left or right. The tree in Figure is a left-branching feature list.
▪ Such feature lists can be written as nested if–then–else statements
▪ For instance, if we were to label the leaves in Figure by majority class, we obtain the
following decision list:
if Free Ipod = 1 then Class = Y = spam
else if lottery = 1 then Class = Y = spam
else Class = Y = ham
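The nested if–then–else structure of a feature list maps directly onto code. A minimal sketch of the decision list above:

```python
def classify(free_ipod, lottery):
    """Decision list from the feature tree, leaves labelled by
    majority class: test 'Free Ipod' first, then 'lottery'."""
    if free_ipod == 1:
        return 'spam'
    elif lottery == 1:
        return 'spam'
    else:
        return 'ham'

print(classify(1, 0), classify(0, 1), classify(0, 0))  # spam spam ham
```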
▪ For each path from the root to a leaf:
• collect all comparisons from the intermediate nodes;
• join the comparisons using AND;
• use the majority class of the leaf as the decision.
▪ Consider the following rules:
if lottery = 1 then Class = Y = spam
if Peter = 1 then Class = Y = ham
▪ As can be seen in Figure, these rules overlap for lottery = 1 ∧ Peter = 1, for which
they make contradictory predictions. Furthermore, they fail to make any
predictions for lottery = 0 ∧ Peter = 0.
▪ A ‘map’ of some of the models.
Models that share characteristics are
plotted closer together: logical
models to the right, geometric
models on the top left and
probabilistic models on the bottom
left. The horizontal dimension
roughly ranges from grading models
on the left to grouping models on the
right.
▪ A taxonomy describing machine
learning methods in terms of the
extent to which they are grading or
grouping models, logical, geometric
or a combination, and supervised or
unsupervised. The colours indicate
the type of model, from left to right:
logical (red), probabilistic (orange)
and geometric (purple).
▪ Features determine much of the success of a machine learning application,
because a model is only as good as its features.
▪ A feature can be thought of as a kind of measurement that can be easily performed
on any instance.
▪ Mathematically, they are functions that map from the instance space to some set of
feature values called the domain of the feature.
▪ Since measurements are often numerical, the most common feature domain is the
set of real numbers.
▪ Other typical feature domains include the set of integers, for instance when the
feature counts something, such as the number of occurrences of a particular word;
the Booleans, if our feature is a statement that can be true or false for a particular
instance, such as ‘this e-mail is addressed to Irzam Shahid’; and arbitrary finite sets,
such as a set of colours, or a set of shapes.
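Since a feature is just a function from instances to feature values, the examples in the text can be sketched directly (the e-mail string here is a made-up instance):

```python
# Each feature maps an instance (here, an e-mail's text) to a value
# in that feature's domain.
def count_word(word):
    """Integer-valued feature: number of occurrences of a given word."""
    return lambda email: email.lower().split().count(word)

def addressed_to(name):
    """Boolean feature: does the e-mail mention this addressee?"""
    return lambda email: name.lower() in email.lower()

f1 = count_word('lottery')
f2 = addressed_to('Irzam Shahid')

email = "Win the lottery today! lottery tickets inside"
print(f1(email), f2(email))  # 2 False
```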
▪ Suppose we have a number of learning models that we want to describe in terms of
a number of properties:
▪ the extent to which the models are geometric, probabilistic or logical;
▪ whether they are grouping or grading models;
▪ the extent to which they can handle discrete and/or real-valued features;
▪ whether they are used in supervised or unsupervised learning; and
▪ the extent to which they can handle multi-class problems.
▪ The first two properties could be expressed by discrete features with three and two
values, respectively; or if the distinctions are more gradual, each aspect could be
rated on some numerical scale. A simple approach would be to measure each
property on an integer scale from 0 to 3, as in Table. This table establishes a data
set in which each row represents an instance and each column a feature.
▪ Suppose we want to approximate y = cos πx on the interval −1 ≤ x ≤ 1. A linear
approximation is not much use here, since the best fit would be y = 0. However, if
we split the x-axis into two intervals, −1 ≤ x < 0 and 0 ≤ x ≤ 1, we could find
reasonable linear approximations on each interval. We can achieve this by using x
both as a splitting feature and as a regression variable.
▪ (left) A regression tree combining a one-split feature tree with linear regression
models in the leaves. Notice how x is used as both a splitting feature and a
regression variable. (right) The function y = cos πx on the interval −1 ≤ x ≤ 1, and
the piecewise linear approximation achieved by the regression tree.
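The one-split regression tree can be sketched in a few lines: split at x = 0, then fit an ordinary least-squares line in each leaf.

```python
import numpy as np

# Approximate y = cos(pi * x) on [-1, 1] with one split at x = 0 and a
# linear regression model in each leaf: a tiny regression tree.
x = np.linspace(-1, 1, 201)
y = np.cos(np.pi * x)

yhat = np.empty_like(y)
for mask in (x < 0, x >= 0):
    a, b = np.polyfit(x[mask], y[mask], 1)  # least-squares line per leaf
    yhat[mask] = a * x[mask] + b

print(np.max(np.abs(y - yhat)))  # small piecewise-linear error
```

A single global line would leave an error close to 1 (the best fit is y = 0); the two leaf models shrink the worst-case error several-fold.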
▪ (left) Artificial data depicting a histogram of body weight measurements of people
with (blue) and without (red) diabetes, with eleven fixed intervals of 10 kilograms
width each. (right) By joining the first and second, third and fourth, fifth and sixth,
and the eighth, ninth and tenth intervals, we obtain a discretisation such that the
proportion of diabetes cases increases from left to right. This discretisation makes
the feature more useful in predicting diabetes.