
Assignment 1

Introduction to Machine Learning


Prof. B. Ravindran
1. Which of the following is a supervised learning problem?

(a) Grouping related documents from an unannotated corpus.


(b) Predicting credit approval based on historical data
(c) Predicting rainfall based on historical data
(d) Predicting if a customer is going to return or keep a particular product he/she purchased
from e-commerce website based on the historical data about the customer purchases and
the particular product.
(e) Fingerprint recognition of a particular person used in biometric attendance from the
fingerprint data of various other people and that particular person
Sol. (b), (c), (d), (e)
(a) does not have labels to indicate the groups.
2. Which of the following is not a classification problem?
(a) Predicting the temperature (in Celsius) of a room from other environmental features (such
as atmospheric pressure, humidity etc).
(b) Predicting if a cricket player is a batsman or bowler given his playing records.
(c) Predicting the price of a house (in INR) based on data consisting of the prices of other houses (in INR) and their features such as area, number of rooms, location, etc.
(d) Filtering of spam messages
(e) Predicting the weather for tomorrow as “hot”, “cold”, or “rainy” based on the historical
data wind speed, humidity, temperature, and precipitation.
Sol. (a),(c)

3. Which of the following is a regression task? (multiple options may be correct)

(a) Predicting the monthly sales of a cloth store in rupees.


(b) Predicting if a user would like to listen to a newly released song or not based on historical
data.
(c) Predicting the confirmation probability (in fraction) of your train ticket whose current
status is waiting list based on historical data.
(d) Predicting if a patient has diabetes or not based on historical medical records.
(e) Predicting if a customer is satisfied or unsatisfied with a product purchased from an e-commerce website, using the reviews he/she wrote for the purchased product.
Sol. (a) and (c)

4. Which of the following is an unsupervised task?

(a) Predicting if a new edible item is sweet or spicy based on the information of the ingredients, their quantities, and labels (sweet or spicy) for many other similar dishes.
(b) Grouping related documents from an unannotated corpus.
(c) Grouping of hand-written digits from their image.
(d) Predicting the time (in days) a PhD student will take to complete his/her thesis to earn a
degree based on the historical data such as qualifications, department, institute, research
area, and time taken by other scholars to earn the degree.
(e) all of the above
Sol. (b), (c)

5. Which of the following is a categorical feature?


(a) Number of rooms in a hostel.
(b) Minimum RAM requirement (in GB) of a system to play a game like FIFA, DOTA.
(c) Your weekly expenditure in rupees.
(d) Ethnicity of a person
(e) Area (in sq. centimeter) of your laptop screen.
(f) The color of the curtains in your room.
Sol. (d) and (f)
6. Let X and Y be uniformly distributed random variables over the intervals [0, 4] and [0, 6] respectively. If X and Y are independent, compute the probability
P(max(X, Y) > 3)
(a) 1/6
(b) 5/6
(c) 2/3
(d) 1/2
(e) 2/6
(f) 5/8
(g) None of the above
Sol. (f)

P(max(X, Y) > 3) = P(X > 3) + P(Y > 3) − P(X > 3 and Y > 3)
= 1/4 + 1/2 − (1/4 × 1/2)
= 5/8
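The same value can be checked numerically. The sketch below (a Monte Carlo estimate, with an arbitrary sample count and seed) compares the empirical probability against the analytic 5/8, using the complement form 1 − P(X ≤ 3)P(Y ≤ 3):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X ~ Uniform[0, 4], Y ~ Uniform[0, 6], independent
x = rng.uniform(0, 4, n)
y = rng.uniform(0, 6, n)

# Empirical estimate of P(max(X, Y) > 3)
p_emp = np.mean(np.maximum(x, y) > 3)

# Analytic value: 1 - P(X <= 3) * P(Y <= 3) = 1 - (3/4) * (1/2) = 5/8
p_exact = 1 - (3 / 4) * (3 / 6)
print(p_emp, p_exact)  # p_emp should be close to 0.625
```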
 
7. Let the trace and determinant of a matrix A = [a b; c d] be 6 and 16 respectively. The eigenvalues of A are

(a) (3 + i√7)/2, (3 − i√7)/2, where i = √−1
(b) 1, 3
(c) (3 + i√7)/4, (3 − i√7)/4, where i = √−1
(d) 1/2, 3/2
(e) 3 + i√7, 3 − i√7, where i = √−1
(f) 2, 8
(g) None of the above
(h) Can be computed only if A is a symmetric matrix.
(i) Cannot be computed as the entries of the matrix A are not given.
Sol. (e)
Use the facts that the trace and determinant of a matrix are equal to the sum and product of its eigenvalues respectively. This gives

λ1 + λ2 = 6, λ1 λ2 = 16

where λ1 and λ2 denote the eigenvalues. Solving these two equations (equivalently, λ² − 6λ + 16 = 0) gives λ = 3 ± i√7.
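As a concrete check, any 2×2 matrix with trace 6 and determinant 16 has these eigenvalues; the matrix below is an arbitrary choice satisfying both constraints:

```python
import numpy as np

# Any 2x2 matrix with trace 6 and determinant 16; the off-diagonal entries
# are an arbitrary choice: det = 3*3 - (-7)*1 = 16, trace = 3 + 3 = 6.
A = np.array([[3.0, -7.0],
              [1.0, 3.0]])

eig = np.linalg.eigvals(A)
print(np.sort_complex(eig))  # [3 - i*sqrt(7), 3 + i*sqrt(7)]
```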
8. What happens when your model complexity increases? (multiple options may be correct)
(a) Model Bias decreases
(b) Model Bias increases
(c) Variance of the model decreases
(d) Variance of the model increases
Sol. (a) and (d)
9. A new phone, E-Corp X1 has been announced and it is what you’ve been waiting for, all along.
You decide to read the reviews before buying it. From past experiences, you’ve figured out
that good reviews mean that the product is good 90% of the time and bad reviews mean that
it is bad 70% of the time. Upon glancing through the reviews section, you find out that the X1
has been reviewed 1269 times and only 172 of them were bad reviews. What is the probability
that, if you order the X1, it is a bad phone?
(a) 0.136
(b) 0.160
(c) 0.360
(d) 0.840
(e) 0.773
(f) 0.573
(g) 0.181
Sol. (g)
For the solution, let’s use the following abbreviations.
• BP - Bad Phone

• GP - Good Phone
• GR - Good Review
• BR - Bad Review
From the given data, P(BP | BR) = 0.7 and P(GP | GR) = 0.9. Using this, P(BP | GR) = 1 − P(GP | GR) = 0.1.
Hence,

P(BP) = P(BP | BR) · P(BR) + P(BP | GR) · P(GR)
= 0.7 × (172/1269) + 0.1 × ((1269 − 172)/1269)
= 0.1813
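The calculation above can be reproduced directly:

```python
# Direct restatement of the total-probability calculation above.
total, bad_reviews = 1269, 172
good_reviews = total - bad_reviews

p_bp_given_br = 0.7      # P(bad phone | bad review)
p_bp_given_gr = 1 - 0.9  # P(bad phone | good review) = 1 - P(good phone | good review)

p_bp = p_bp_given_br * bad_reviews / total + p_bp_given_gr * good_reviews / total
print(round(p_bp, 4))  # 0.1813
```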

10. Which of the following are false about bias and variance of overfitted and underfitted models?
(multiple options may be correct)
(a) Underfitted models have high bias.
(b) Underfitted models have low bias.
(c) Overfitted models have low variance.
(d) Overfitted models have high variance.
Sol. (b), (c)

Assignment 2
Introduction to Machine Learning
Prof. B. Ravindran
1. Given a training data set of 10,000 instances, with each input instance having 17 dimensions
and each output instance having 2 dimensions, the dimensions of the design matrix used in
applying linear regression to this data is
(a) 10000 × 17
(b) 10002 × 17
(c) 10000 × 18
(d) 10000 × 19
Sol. (c)
2. Suppose we want to add a regularizer to the linear regression loss function, to control the magnitudes of the weights β. We have a choice between Ω1(β) = Σ_{i=1}^p |β_i| and Ω2(β) = Σ_{i=1}^p β_i². Which one is more likely to result in sparse weights?
(a) Ω1
(b) Ω2
(c) Both Ω1 and Ω2 will result in sparse weights
(d) Neither of Ω1 or Ω2 can result in sparse weights
Sol. (a)
3. The model obtained by applying linear regression on the identified subset of features may differ
from the model obtained at the end of the process of identifying the subset during
(a) Forward stepwise selection
(b) Backward stepwise selection
(c) Forward stagewise selection
(d) All of the above
Sol. (c)
Let us assume that the data set has p features among which each method is used to select
k, 0 < k < p, features. If we use the selected k features identified by forward stagewise selection,
and apply linear regression, the model we obtain may differ from the model obtained at the end
of the process of applying forward stagewise selection to identify the k features. This is due to
the manner in which the coefficients are built in this method where at each step the algorithm
computes the coefficient for the variable identified as resulting in the maximum reduction in
error, but does not recompute the coefficients for the previously identified variables. Note that
there will be no difference in the other two methods, because in both forward and backward
stepwise selection, at each step of removing/adding a feature, linear regression is performed
on the retained subset of features to learn the coefficients.
4. Consider forward selection, backward selection and best subset selection with respect to the
same data set. Which of the following is true?

(a) Best subset selection can be computationally more expensive than forward selection
(b) Forward selection and backward selection always lead to the same result
(c) Best subset selection can be computationally less expensive than backward selection
(d) Best subset selection and forward selection are computationally equally expensive
(e) Both (b) and (d)

Sol. (a)
Explanation: Best subset selection has to explore all possible subsets, which takes exponential time. Forward selection and backward selection are not guaranteed to lead to the same result, and both are computationally much cheaper than best subset selection.

5. In the lecture on Multivariate Regression, you learn about using orthogonalization iteratively to obtain regression coefficients. This method is generally referred to as Multiple Regression using Successive Orthogonalization.
In the formulation of the method, we observe that in iteration k, we regress the entire dataset
on z0 , z1 , . . . zk−1 . It seems like a waste of computation to recompute the coefficients for z0 a
total of p times, z1 a total of p − 1 times and so on. Can we re-use the coefficients computed
in iteration j for iteration j + 1 for zj−1 ?
(a) No. Doing so will result in the wrong γ matrix, and hence the wrong βi's.
(b) Yes. Since zj is orthogonal to zl ∀ l ≤ j − 1, the multiple regression in each iteration is essentially a univariate regression on each of the previous residuals. Since the regression coefficients for the previous residuals don't change over iterations, we can re-use the coefficients for further iterations.
Sol. (b)
The answer is self-explanatory. Please refer to the section on Multiple Regression using Suc-
cessive Orthogonalization in Elements of Statistical Learning, 2nd edition for the algorithm.

6. Principal Component Regression (PCR) is an approach to find an orthogonal set of basis


vectors which can then be used to reduce the dimension of the input. Which of the following
matrices contains the principal component directions as its columns (follow notation from the
lecture video)
(a) X
(b) S
(c) Xc
(d) V
(e) U

Sol. (d)
7. (2 marks) Consider the following five training examples

x y
2 9.8978
3 12.7586
4 16.3192
5 19.3129
6 21.1351

We want to learn a function f (x) of the form f (x) = ax + b which is parameterised by (a, b).
Using squared error as the loss function, which of the following parameters would you use to
model this function to get a solution with the minimum loss.
(a) (4, 3)
(b) (1, 4)
(c) (4, 1)
(d) (3, 4)
Sol. (d)
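One way to verify the answer is to evaluate the squared-error loss for each candidate (a, b) pair; the unconstrained least-squares fit itself lands near a ≈ 2.9, b ≈ 4.3, which is also closest to option (d):

```python
import numpy as np

x = np.array([2, 3, 4, 5, 6], dtype=float)
y = np.array([9.8978, 12.7586, 16.3192, 19.3129, 21.1351])

def sse(a, b):
    """Sum of squared errors for f(x) = a*x + b."""
    return float(np.sum((y - (a * x + b)) ** 2))

candidates = {"(a)": (4, 3), "(b)": (1, 4), "(c)": (4, 1), "(d)": (3, 4)}
best = min(candidates, key=lambda k: sse(*candidates[k]))
print(best)  # (d)

# The unconstrained least-squares solution, for reference
a_hat, b_hat = np.polyfit(x, y, 1)
print(round(a_hat, 2), round(b_hat, 2))
```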
8. (2 marks) Here is a data set of words in two languages.

Word Language
piano English
cat English
kepto Vinglish
shaito Vinglish

Let us build a nearest neighbours classifier that will predict which language a word belongs
to. Say we represent each word using the following features.
• Length of the word
• Number of consonants in the word
• Whether it ends with the letter ’o’ (1 if it does, 0 if it doesn’t)
For example, the representation of the word ‘waffle’ would be [6, 4, 0]. For a distance function, use the Manhattan distance.
d(a, b) = Σ_{i=1}^n |a_i − b_i|, where a, b ∈ R^n

Take the input word ‘keto’. With k = 1, the predicted language for the word is?
(a) English
(b) Vinglish
(c) None of the above
Sol. (a)
Its nearest neighbour is ‘piano’. The representations for the 4 words are [5, 2, 1], [3, 2, 0], [5, 3, 1] and [6, 3, 1] respectively, and the representation for the input word is [4, 2, 1]. The distances are 1, 2, 2 and 3 respectively.
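The distance computation can be checked as follows (feature vectors taken from the solution above):

```python
import numpy as np

# Feature representations from the solution: [length, consonants, ends-with-'o']
words = {"piano": [5, 2, 1], "cat": [3, 2, 0], "kepto": [5, 3, 1], "shaito": [6, 3, 1]}
labels = {"piano": "English", "cat": "English", "kepto": "Vinglish", "shaito": "Vinglish"}
query = np.array([4, 2, 1])  # representation of 'keto'

# Manhattan (L1) distance to each training word
dists = {w: int(np.abs(np.array(v) - query).sum()) for w, v in words.items()}
nearest = min(dists, key=dists.get)
print(dists, labels[nearest])  # 'piano' is nearest at distance 1 -> English
```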

Assignment 3
Introduction to Machine Learning
Prof. B. Ravindran
1. Which of the following is false about a logistic regression based classifier?

(a) The logistic function is non-linear in the weights


(b) The logistic function is linear in the weights
(c) The decision boundary is non-linear in the weights
(d) The decision boundary is linear in the weights

Sol. (b), (c)

2. (2 marks) Consider the case where two classes follow Gaussian distributions that are centered at (3, 9) and (−3, 3) and have identity covariance matrices. Which of the following is the separating decision boundary using LDA, assuming the priors to be equal?

(a) y − x = 3
(b) x + y = 3
(c) x + y = 6
(d) both (b) and (c)
(e) None of the above
(f) Can not be found from the given information
Sol. (c)
As the distributions are Gaussian and have identity covariance matrices (which are equal), the separating boundary will be linear. The decision boundary will be orthogonal to the line joining the centers and will pass through the midpoint of the centers.
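A numerical sketch of this argument: the midpoint of the centers is (0, 6), the normal direction is μ1 − μ2 = (6, 6) ∝ (1, 1), giving x + y = 6; points on that line are equidistant from both means:

```python
import numpy as np

mu1 = np.array([3.0, 9.0])
mu2 = np.array([-3.0, 3.0])

mid = (mu1 + mu2) / 2      # (0, 6), lies on the boundary
normal = mu1 - mu2         # (6, 6), i.e. boundary normal direction (1, 1)

# Any point with x + y = 6 should be equidistant from both class means.
for p in [np.array([0.0, 6.0]), np.array([2.0, 4.0]), np.array([-5.0, 11.0])]:
    d1, d2 = np.linalg.norm(p - mu1), np.linalg.norm(p - mu2)
    print(p, round(d1, 4), round(d2, 4))  # d1 == d2 in every row
```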
3. Consider the following relation between a dependent variable and an independent variable
identified by doing simple linear regression. Which among the following relations between the
two variables does the graph indicate?

(a) as the independent variable increases, so does the dependent variable
(b) as the independent variable increases, the dependent variable decreases
(c) if an increase in the value of the dependent variable is observed, then the independent
variable will show a corresponding increase
(d) if an increase in the value of the dependent variable is observed, then the independent
variable will show a corresponding decrease
(e) the dependent variable in this graph does not actually depend on the independent variable
(f) none of the above

Sol. (e)
4. Given the following distribution of data points:

What method would you choose to perform Dimensionality Reduction?

(a) Linear Discriminant Analysis


(b) Principal Component Analysis
Sol. (a)
PCA does not use class labels and will treat all the points as instances of the same pool. Thus the principal component will be along the vertical axis, as the most variance is along that direction. However, projecting all the points onto the vertical axis means that critical information is lost and both classes are mixed completely. LDA, on the other hand, models each class with a Gaussian. This leads to a projection direction along the horizontal axis, which retains class information (the classes are still linearly separable).
5. In general, which of the following classification methods is the most resistant to gross outliers?

(a) Quadratic Discriminant Analysis (QDA)
(b) Linear Regression
(c) Logistic regression
(d) Linear Discriminant Analysis (LDA)
Sol. (c)
In general, a good way to tell if a method is sensitive to outliers is to look at the loss it incurs
upon ignoring outliers.
Linear Regression uses a square loss and thus, outliers that are far away from the hyperplane
contribute significantly to the loss.
LDA and QDA both use the L2-norm and are, for the same reason, sensitive to outliers.
Logistic Regression weights the points close to the boundary higher than points far away. This
is an implication of the Logistic loss function (beyond the boundary, roughly linear instead of
quadratic).
6. Suppose that we have two variables, X and Y (the dependent variable). We wish to find
the relation between them. An expert tells us that the relation between the two has the form Y = m + X² + c. Available to us are samples of the variables X and Y. Is it possible to apply linear regression to this data to estimate the values of m and c?

(a) no
(b) yes
(c) insufficient information

Sol. (a)
Note that linear regression can estimate (m+c) but not m and c separately.
7. In a binary classification scenario where x is the independent variable and y is the dependent
variable, logistic regression assumes that the conditional distribution y|x follows a

(a) Bernoulli distribution


(b) binomial distribution
(c) normal distribution
(d) exponential distribution

Sol. (a)
The dependent variable is binary, so a Bernoulli distribution is assumed.
8. Consider the following data:

Feature 1 Feature 2 Class


1 1 A
2 3 A
2 4 A
5 3 B
8 6 A
8 8 B
9 9 B
11 7 B

Assuming that you apply LDA to this data, what is the estimated covariance matrix?
 
(a) [1.875 0.3125; 0.3125 0.9375]
(b) [2.5 0.4167; 0.4167 1.25]
(c) [1.875 0.3125; 0.3125 1.2188]
(d) [2.5 0.4167; 0.4167 1.625]
(e) [3.25 1.1667; 1.1667 2.375]
(f) [2.4375 0.875; 0.875 1.7812]
(g) None of these

Sol. (e)

9. Given the following 3D input data, identify the principal component.

Feature 1 Feature 2 Feature 3


1 1 1
2 3 1
2 4 1
5 3 2
8 6 1
8 8 2
9 9 2
11 7 2

(Steps: center the data, calculate the sample covariance matrix, calculate the eigenvectors and
eigenvalues, identify the principal component)
 
(a) (−0.1022, 0.0018, 0.9948)ᵀ
(b) (0.5742, −0.8164, 0.0605)ᵀ
(c) (0.5742, 0.8164, 0.0605)ᵀ
(d) (−0.5742, 0.8164, 0.0605)ᵀ
(e) (0.8123, 0.5774, 0.0824)ᵀ
(f) None of the above

Sol. (e)
Refer to the solution of practice assignment 3
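The steps listed in the question can be carried out with numpy (a sketch; the eigenvector sign is fixed by convention, and the 1/n vs 1/(n−1) covariance convention does not change the eigenvectors):

```python
import numpy as np

X = np.array([[1, 1, 1], [2, 3, 1], [2, 4, 1], [5, 3, 2],
              [8, 6, 1], [8, 8, 2], [9, 9, 2], [11, 7, 2]], dtype=float)

# Step 1: center the data
Xc = X - X.mean(axis=0)

# Step 2: sample covariance matrix
S = np.cov(Xc, rowvar=False)

# Steps 3-4: eigendecomposition; the principal component is the eigenvector
# with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)
pc1 = eigvecs[:, np.argmax(eigvals)]
pc1 *= np.sign(pc1[0])  # fix the sign ambiguity

print(np.round(pc1, 4))  # approx [0.8123, 0.5774, 0.0824]
```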

10. For the data given in the previous question, find the transformed input along the first two
principal components.
 
(a)
 0.6100  −0.0196
−0.4487  −0.1181
−1.2651  −0.1163
 1.3345   0.5702
 0.5474  −0.7257
−1.0250   0.2727
−1.2672   0.1724
 1.5142  −0.0355

(b)
−0.1817   0.8944
−1.2404   0.7959
−2.0568   0.7977
 0.5428   1.4842
−0.2443   0.1884
−1.8167   1.1868
−2.0589   1.0864
 0.7225   0.8785

(c)
−6.2814   0.6100
−4.3143  −0.4487
−3.7368  −1.2651
−1.7950   1.3345
 2.2917   0.5474
 3.5289  −1.0250
 4.9186  −1.2672
 5.3883   1.5142

(d)
 1.4721  −0.1817
 3.4392  −1.2404
 4.0166  −2.0568
 5.9584   0.5428
10.0451  −0.2443
11.2823  −1.8167
12.6720  −2.0589
13.1418   0.7225

(e) None of the above

Sol. (c)
Refer to the solution of practice assignment 3
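For this question, projecting the centered data from the previous question onto the first two eigenvectors reproduces option (c) up to the sign of each column (a sketch; signs are compared via absolute values because eigenvector signs are arbitrary):

```python
import numpy as np

X = np.array([[1, 1, 1], [2, 3, 1], [2, 4, 1], [5, 3, 2],
              [8, 6, 1], [8, 8, 2], [9, 9, 2], [11, 7, 2]], dtype=float)
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]   # sort components by decreasing variance
V2 = eigvecs[:, order[:2]]          # first two principal directions

scores = Xc @ V2                    # transformed input (up to sign per column)
print(np.round(np.abs(scores[0]), 4))  # approx [6.2814, 0.61], first row of option (c)
```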

Assignment 4
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider a Boolean function in three variables, that returns True if two or more variables out
of three are True, and False otherwise. Can this function be implemented using the perceptron
algorithm?
(a) no
(b) yes
Sol. (b)
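The 2-of-3 majority function is linearly separable (fire when x1 + x2 + x3 ≥ 2), so a single perceptron can represent it. One valid choice of weights and threshold (not unique) is sketched below:

```python
from itertools import product

# One perceptron realizing the 2-of-3 majority function:
# output 1 exactly when the weighted sum exceeds the threshold.
w = [1, 1, 1]
theta = 1.5

def perceptron(x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) > theta)

for x in product([0, 1], repeat=3):
    print(x, perceptron(x))  # 1 exactly when two or more inputs are 1
```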

2. For a support vector machine model, let xi be an input instance with label yi. If yi(β̂0 + xiᵀβ̂) > 1, where β̂0 and β̂ are the estimated parameters of the model, then

(a) xi is not a support vector


(b) xi is a support vector
(c) xi is either an outlier or a support vector
(d) Depending upon other data points, xi may or may not be a support vector.

Sol. (a)
3. Suppose we use a linear kernel SVM to build a classifier for a 2-class problem where the
training data points are linearly separable. In general, will the classifier trained in this manner
be always the same as the classifier trained using the perceptron training algorithm on the
same training data?
(a) Yes
(b) No
Sol. (b) The hyperplane returned by the SVM approach will have a maximal margin, whereas
no such guarantee can be given for the hyperplane identified using the perceptron training
algorithm.
For Q4,5: Kindly download the synthetic dataset from the following link
[Link]
The dataset contains 1000 points and each input point contains 3 features.
4. (2 marks) Train a linear regression model (without regularization) on the above dataset. Re-
port the coefficients of the best fit model. Report the coefficients in the following format:
β0 , β1 , β2 , β3 . (You can round-off the accuracy value to the nearest 2-decimal point number.)
(a) -1.2, 2.1, 2.2, 1
(b) 1, 1.2, 2.1, 2.2
(c) -1, 1.2, 2.1, 2.2
(d) 1, -1.2, 2.1, 2.2
(e) 1, 1.2, -2.1, -2.2

Sol. (d)
Follow the steps given on the sklearn page.
5. Train an l2 regularized linear regression model on the above dataset. Vary the regularization
parameter from 1 to 10. As you increase the regularization parameter, absolute value of the
coefficients (excluding the intercept) of the model:

(a) increase
(b) first increase then decrease
(c) decrease
(d) first decrease then increase

Sol. (c)
Follow the steps given on the sklearn page.

For Q6,7: Kindly download the modified version of the Iris dataset from this link.
Available at: [Link]
The dataset contains 150 points and each input point contains 4 features and belongs to one
among three classes. Use the first 100 points as the training data and the remaining 50 as
test data. In the following questions, to report accuracy, use test dataset. You can round-off
the accuracy value to the nearest 2-decimal point number. (Note: Do not change the order of
data points.)

6. (2 marks) Train an l2 regularized logistic regression classifier on the modified iris dataset. We
recommend using sklearn. Use only the first two features for your model. We encourage you
to explore the impact of varying different hyperparameters of the model. Kindly note that the
C parameter mentioned below is the inverse of the regularization parameter λ. As part of the
assignment train a model with the following hyperparameters:
Model: logistic regression with one-vs-rest classifier, C = 1e4
For the above set of hyperparameters, report the best classification accuracy
(a) 0.88
(b) 0.86
(c) 0.98
(d) 0.68
Sol. (b)
The following code gives the desired result (the score method reports the test accuracy):
>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression(penalty='l2', C=1e4, multi_class='ovr').fit(X[0:100, 0:2], Y[0:100])
>>> clf.score(X[100:, 0:2], Y[100:])

7. (2 marks) Train an SVM classifier on the modified iris dataset. We recommend using sklearn.
Use only the first two features for your model. We encourage you to explore the impact
of varying different hyperparameters of the model. Specifically try different kernels and the
associated hyperparameters. As part of the assignment train models with the following set of
hyperparameters

RBF-kernel, gamma = 0.5, one-vs-rest classifier, no-feature-normalization.
Try C = 0.01, 1, 10. For the above set of hyperparameters, report the best classification
accuracy along with total number of support vectors on the test data.
(a) 0.92, 69
(b) 0.88, 40
(c) 0.88, 69
(d) 0.98, 41
Sol. (c)
The following code gives the desired result:
>>> from sklearn import svm
>>> clf = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', gamma=0.5).fit(X[0:100, 0:2], Y[0:100])
>>> clf.score(X[100:, 0:2], Y[100:])
>>> clf.n_support_

Assignment 5
Introduction to Machine Learning
Prof. B. Ravindran
1. You are given the N samples of input (x) and output (y) as shown in the figure below. What
will be the most appropriate model y = f (x)?

(a) y = wx with w > 0
(b) y = wx with w < 0
(c) y = x^w with w > 0
(d) y = x^w with w < 0

Sol. (c)

2. For training a binary classification model with five independent variables, you choose to use
neural networks. You apply one hidden layer with three neurons. What are the number of
parameters to be estimated? (Consider the bias term as a parameter)
(a) 16
(b) 21
(c) 3^4 = 81
(d) 4^3 = 64
(e) 12
(f) 22
(g) 25
(h) 26
(i) 4
(j) None of these

Sol. (f)
Number of weights from input to hidden layer = 5 × 3 = 15
Bias term for the three neurons in hidden layer = 3
Number of weights from hidden to output layer = 3
Bias term in the final output layer = 1
Summing the above = 22

3. Suppose the marks obtained by randomly sampled students follow a normal distribution with unknown µ. A random sample of 5 marks is 25, 55, 64, 7 and 99. Using the given samples, find the maximum likelihood estimate for the mean.
(a) 54.2
(b) 67.75
(c) 50
(d) Information not sufficient for estimation
Sol. (c)
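For a normal distribution with unknown mean, the maximum likelihood estimate of µ is the sample mean:

```python
marks = [25, 55, 64, 7, 99]

# MLE of the mean of a Gaussian is the sample mean
mle_mean = sum(marks) / len(marks)
print(mle_mean)  # 50.0
```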
4. You are given the following neural network, which takes two binary-valued inputs x1, x2 ∈ {0, 1}; the activation function is the threshold function (h(x) = 1 if x > 0; 0 otherwise). Which of the following logical functions does it compute?

(a) OR
(b) AND
(c) NAND
(d) None of the above.

Sol. (a)

5. Using the notations used in class, evaluate the value of the neural network with a 3-3-1 architecture (2-dimensional input with 1 node for the bias term in both the layers). The parameters are as follows:

α = [1 0.2 0.4; −1 0.8 0.5]

β = [0.8 0.4 0.5]
Using sigmoid function as the activation functions at both the layers, the output of the network
for an input of (0.8, 0.7) will be
(a) 0.6710
(b) 0.9617
(c) 0.6948
(d) 0.7052
(e) 0.2023
(f) 0.7977
(g) 0.2446
(h) None of these

Solution (f)
This is a straightforward computation task. First pad x with 1 and make it the X vector,

X = [1, 0.8, 0.7]ᵀ

The output of the first layer can be written as

o1 = αX

Next apply the sigmoid function and compute

a1(i) = 1 / (1 + e^(−o1(i)))

Then pad the a1 vector also with 1 for the bias, then compute the output of the second layer:

o2 = β a1
a2 = 1 / (1 + e^(−o2))
a2 = 0.7977
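The forward pass above can be reproduced in a few lines:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = np.array([[1.0, 0.2, 0.4],
                  [-1.0, 0.8, 0.5]])
beta = np.array([0.8, 0.4, 0.5])

x = np.array([1.0, 0.8, 0.7])     # input (0.8, 0.7) padded with 1 for the bias

a1 = sigmoid(alpha @ x)           # hidden-layer activations
a1 = np.concatenate(([1.0], a1))  # pad with 1 for the output-layer bias
a2 = sigmoid(beta @ a1)           # network output

print(round(a2, 4))  # 0.7977
```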

6. Which of the following statements are true:

(a) The chances of overfitting decrease with an increasing number of hidden nodes and an increasing number of hidden layers.
(b) A neural network with one hidden layer can represent any Boolean function given a sufficient number of hidden units and appropriate activation functions.
(c) A two-hidden-layer neural network can represent any continuous function (within a tolerance) as long as the number of hidden units is sufficient and appropriate activation functions are used.

Sol. (b), (c)
By increasing the number of hidden nodes or hidden layers we are increasing the number of parameters. An increased set of parameters is more capable of memorizing the training data. Hence it may result in overfitting.
7. We have a function which takes a two-dimensional input x = (x1, x2) and has two parameters w = (w1, w2), given by f(x, w) = σ(σ(x1 w1) w2 + x2), where σ(x) = 1/(1 + e^(−x)). We use backpropagation to estimate the right parameter values. We start by setting both the parameters to 1. Assume that we are given a training point x2 = 1, x1 = 0, y = 5. Given this information, answer the next two questions. What is the value of ∂f/∂w2?
(a) 0.150
(b) -0.25
(c) 0.125
(d) 0.098
(e) 0.0746
(f) 0.1604
(g) None of these
Solution: (e)
Write σ(x1 w1) w2 + x2 as o2 and x1 w1 as o1. Then

∂f/∂w2 = (∂f/∂o2) (∂o2/∂w2)
∂f/∂w2 = σ(o2)(1 − σ(o2)) × σ(o1)
8. If the learning rate is 0.5, what will be the value of w2 after one update using backpropagation
algorithm?
(a) 0.4197
(b) -0.4197
(c) 0.6881
(d) -0.6881
(e) 1.3119
(f) -1.3119
(g) 0.5625
(h) -0.5625
(i) None of these
Solution: (e)
The update equation would be

w2 = w2 − λ (∂L/∂w2)

where L is the loss function, here L = (y − f)². So

w2 = w2 − λ × 2(y − f) × (−1) × (∂f/∂w2)

Now putting in the given values we get the right answer.
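Both backpropagation questions can be checked numerically with the given values (w1 = w2 = 1, x1 = 0, x2 = 1, y = 5, learning rate 0.5):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, x1, x2, y, lr = 1.0, 1.0, 0.0, 1.0, 5.0, 0.5

o1 = x1 * w1                    # = 0
o2 = sigmoid(o1) * w2 + x2      # = 1.5
f = sigmoid(o2)                 # network output

# Q7: df/dw2 = sigma'(o2) * sigma(o1), with sigma'(z) = sigma(z) * (1 - sigma(z))
df_dw2 = f * (1 - f) * sigmoid(o1)
print(round(df_dw2, 4))  # 0.0746

# Q8: gradient of L = (y - f)^2 w.r.t. w2, then one gradient-descent step
dL_dw2 = 2 * (y - f) * (-1) * df_dw2
w2_new = w2 - lr * dL_dw2
print(round(w2_new, 4))  # 1.3119
```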


9. Which of the following are true when comparing ANNs and SVMs?

(a) ANN error surface has multiple local minima while SVM error surface has only one minimum
(b) After training, an ANN might land on a different minimum each time, when initialized
with random weights during each run.
(c) As shown for Perceptron, there are some classes of functions that cannot be learnt by an
ANN. An SVM can learn a hyperplane for any kind of distribution.
(d) In training, ANN’s error surface is navigated using a gradient descent technique while
SVM’s error surface is navigated using convex optimization solvers.
Sol. (a), (b) and (d)
By the universal approximation theorem, we can argue that option (c) is not true.
10. Which of the following are correct?

(a) A perceptron will learn the underlying linearly separable boundary in a finite number of training steps.
(b) XOR function can be modelled by a single perceptron.
(c) Backpropagation algorithm used while estimating parameters of neural networks actually
uses gradient descent algorithm.
(d) The backpropagation algorithm will always converge to global optimum, which is one of
the reasons for impressive performance of neural networks.
Sol. (a), (c)

Assignment 6
Introduction to Machine Learning
Prof. B. Ravindran
1. When building models using decision trees we essentially split the entire input space using
(a) axis parallel hyper-rectangles
(b) polynomials curves of order greater than two
(c) polynomial curves of the same order as the length of decision tree
(d) none of the above
Sol. (a)
2. In building a decision tree model, to control the size of the tree, we need to control the number
of regions. One approach to do this would be to split tree nodes only if the resultant decrease
in the sum of squares error exceeds some threshold. For the described method, which among
the following are true?
(a) it would, in general, help restrict the size of the trees
(b) it has the potential to affect the performance of the resultant regression/classification
model
(c) it is computationally infeasible
Sol. (a), (b)
While this approach may restrict the eventual number of regions produced, the main problem
with this approach is that it is too restrictive and may result in poor performance. It is very
common for splits at one level, which themselves are not that good (i.e., they do not decrease
the error significantly), to lead to very good splits (i.e., where the error is significantly reduced)
down the line. Think about the XOR problem.
3. Suppose we use the decision tree model for solving a multi-class classification problem. As we
continue building the tree, w.r.t. the generalisation error of the model,
(a) the error due to bias increases
(b) the error due to bias decreases
(c) the error due to variance increases
(d) the error due to variance decreases
Sol. (b) & (c)
As we continue to build the decision tree model, it is possible that we overfit the data. In
this case, the model is sufficiently complex, i.e., the error due to bias is low. However, due to
overfitting, the error due to variance starts increasing.
4. (2 marks) Having built a decision tree, we are using reduced error pruning to reduce the size
of the tree. We select a node to collapse. For this particular node, on the left branch, there are
3 training data points with the following outputs: 5, 7, 9.6 and for the right branch, there are
four training data points with the following outputs: 8.7, 9.8, 10.5, 11. The average value of the
outputs of data points denotes the response of a branch. The original responses for data points along the two branches (left and right respectively) were response left and response right, and the new response after collapsing the node is response new. What are the values of response left, response right and response new (numbers in the options are given in the same order)?
(a) 21.6, 40, 61.6
(b) 7.2, 10, 8.8
(c) 3, 4, 7
(d) depends on the tree height.
Sol. (b)
Original responses:
Left: (5 + 7 + 9.6)/3 = 7.2
Right: (8.7 + 9.8 + 10.5 + 11)/4 = 10
New response: 7.2 × (3/7) + 10 × (4/7) = 8.8
5. (2 marks) Consider the following dataset:

feature1 feature2 output


11.7 183.2 a
12.8 187.6 a
15.3 177.4 a
13.9 198.6 a
17.2 175.3 a
16.8 151.1 b
17.5 171.4 b
23.6 162.8 b
16.9 179.5 b
19.1 173.8 b

Which among the following split-points for feature 1 would give the best split according to the information gain measure?

(a) 14.6
(b) 16.05
(c) 16.85
(d) 17.35

Sol. (b)
info_feature1(14.6)(D) = (3/10)(−(3/3) log2(3/3) − (0/3) log2(0/3)) + (7/10)(−(2/7) log2(2/7) − (5/7) log2(5/7)) = 0.6042
info_feature1(16.05)(D) = (4/10)(−(4/4) log2(4/4) − (0/4) log2(0/4)) + (6/10)(−(1/6) log2(1/6) − (5/6) log2(5/6)) = 0.39
info_feature1(16.85)(D) = (5/10)(−(4/5) log2(4/5) − (1/5) log2(1/5)) + (5/10)(−(1/5) log2(1/5) − (4/5) log2(4/5)) = 0.7219
info_feature1(17.35)(D) = (7/10)(−(5/7) log2(5/7) − (2/7) log2(2/7)) + (3/10)(−(0/3) log2(0/3) − (3/3) log2(3/3)) = 0.6042
(using the convention 0 log2 0 = 0)
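These weighted entropies can also be verified programmatically (a sketch; points equal to the split value go to the left partition, matching the convention above):

```python
import math

f1 = [11.7, 12.8, 15.3, 13.9, 17.2, 16.8, 17.5, 23.6, 16.9, 19.1]
cls = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']

def entropy(labels):
    """Shannon entropy of a list of class labels (0 log 0 taken as 0)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def weighted_entropy(split):
    """info_{feature1(split)}(D): size-weighted entropy of the two partitions."""
    lo = [c for v, c in zip(f1, cls) if v <= split]
    hi = [c for v, c in zip(f1, cls) if v > split]
    n = len(cls)
    return len(lo) / n * entropy(lo) + len(hi) / n * entropy(hi)

for split in [14.6, 16.05, 16.85, 17.35]:
    print(split, round(weighted_entropy(split), 4))
# 16.05 gives the lowest weighted entropy (~0.39), hence the highest information gain
```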
6. (2 marks) For the same dataset, which among the following split-points for feature2 would
give the best split according to the gini index measure?
(a) 172.6

(b) 176.35
(c) 178.45
(d) 185.4
Sol. (a)
gini_feature2(172.6)(D) = (7/10) × 2 × (5/7) × (2/7) + (3/10) × 2 × (0/3) × (3/3) = 0.2857
gini_feature2(176.35)(D) = (5/10) × 2 × (1/5) × (4/5) + (5/10) × 2 × (4/5) × (1/5) = 0.32
gini_feature2(178.45)(D) = (6/10) × 2 × (2/6) × (4/6) + (4/10) × 2 × (3/4) × (1/4) = 0.4167
gini_feature2(185.4)(D) = (2/10) × 2 × (2/2) × (0/2) + (8/10) × 2 × (3/8) × (5/8) = 0.375
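The weighted Gini indices can be verified programmatically (a sketch; points equal to the split value go to the left partition):

```python
f2 = [183.2, 187.6, 177.4, 198.6, 175.3, 151.1, 171.4, 162.8, 179.5, 173.8]
cls = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']

def gini(labels):
    """Gini impurity for two classes: 2 * p_a * (1 - p_a) = 1 - p_a^2 - p_b^2."""
    p_a = labels.count('a') / len(labels)
    return 2 * p_a * (1 - p_a)

def weighted_gini(split):
    """Size-weighted Gini impurity of the two partitions induced by the split."""
    lo = [c for v, c in zip(f2, cls) if v <= split]
    hi = [c for v, c in zip(f2, cls) if v > split]
    n = len(cls)
    return len(lo) / n * gini(lo) + len(hi) / n * gini(hi)

for split in [172.6, 176.35, 178.45, 185.4]:
    print(split, round(weighted_gini(split), 4))
# 172.6 gives the lowest weighted Gini index (~0.2857), so it is the best split
```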

7. In which of the following situations is it appropriate to introduce a new category ’Missing’ for
missing values? (multiple options may be correct)

(a) When values are missing because the 108 emergency operator is sometimes attending a
very urgent distress call.
(b) When values are missing because the attendant spilled coffee on the papers from which
the data was extracted.
(c) When values are missing because the warehouse storing the paper records went up in
flames and burnt parts of it.
(d) When values are missing because the nurse/doctor finds the patient’s situation too urgent.

Sol. (a),(d)
We typically introduce a 'Missing' category when the fact that a value is missing can itself be a relevant feature. In case (a), it can imply that the call was so urgent that the operator could not note the value down; this urgency could potentially be useful for predicting the target.
But a coffee spill corrupting the records is likely to be completely random, and we glean no new information from it. In such cases, a better method is to try to predict the missing values from the available data.

Assignment 7
Introduction to Machine Learning
Prof. B. Ravindran
1. For the given confusion matrix, compute the recall

                      Actual Positive   Actual Negative
Predicted Positive           6                 4
Predicted Negative           3                 7

(a) 0.73
(b) 0.7
(c) 0.6
(d) 0.67
(e) 0.78
(f) None of the above

Sol. (d)
Recall = TP/(TP + FN) = 6/(6 + 3) = 0.67
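A quick numerical check of the standard metrics on this matrix (variable names are ours):

```python
# Confusion-matrix entries from the question (columns are the actual classes).
tp, fn = 6, 3  # actual positives: predicted positive / predicted negative
fp, tn = 4, 7  # actual negatives: predicted positive / predicted negative

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(recall, 2), round(precision, 2), round(accuracy, 2))  # 0.67 0.6 0.65
```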

2. Pallavi is working on developing a binary classifier which has a huge class imbalance. Which
of the following metric should she optimize the classifier over to develop a good model?

(a) Accuracy
(b) Precision
(c) Recall
(d) F-Score

Sol. (d)
3. For large datasets, we should always choose a large k while doing k-fold cross validation to get better performance on the test set.
(a) True
(b) False
Sol. (b)
The data might have class imbalance.
4. While designing an experiment, which of these aspects should be considered?

(a) Floor/Ceiling Effects


(b) Order Effects
(c) Sampling Bias
Sol. (a), (b) and (c)

5. Which of the following are true?
TP - True Positive, TN - True Negative, FP - False Positive, FN - False Negative
(a) Precision = TP/(TP + FP)
(b) Recall = TP/(TP + FN)
(c) Accuracy = 2(TP + TN)/(TP + TN + FP + FN)
(d) Recall = FP/(TP + FP)

Sol. (a), (b)


6. In the ROC plot, what are the quantities along x and y axes respectively?
(a) Precision, Recall
(b) Recall, Precision
(c) True Positive Rate, False Positive Rate
(d) False Positive Rate, True Positive Rate
(e) Specificity, Sensitivity
(f) True Positive, True Negative
(g) True Negative, True Positive
Sol. (d)
7. How does bagging help in improving the classification performance?
(a) If the parameters of the resultant classifiers are fully uncorrelated (independent), then
bagging is inefficient.
(b) It helps reduce variance
(c) If the parameters of the resultant classifiers are fully correlated, then bagging is inefficient.
(d) It helps reduce bias

Sol. (b), (c)


The lecture clearly states that correlated weights generally mean that all the classifiers learn very similar functions, in which case bagging gives no extra stability.
Having many uncorrelated classifiers helps reduce variance, since the resultant ensemble is more resistant to a single outlier (the outlier likely affects only a small fraction of the classifiers in the ensemble).

8. Which method among bagging and stacking should be chosen in case of limited training data?
and What is the appropriate reason for your preference?
(a) Bagging, because we can combine as many classifier as we want by training each on a
different sample of the training data
(b) Bagging, because we use the same classification algorithms on all samples of the training
data
(c) Stacking, because we can use different classification algorithms on the training data

(d) Stacking, because each classifier is trained on all of the available data

Sol. (d)
9. (2 marks) Which of the following statements are false when comparing Committee Machines and Stacking?
(a) Committee Machines are, in general, special cases of 2-layer stacking where the second-
layer classifier provides uniform weightage.
(b) Both Committee Machines and Stacking have similar mechanisms, but Stacking uses
different classifiers while Committee Machines use similar classifiers.
(c) Committee Machines are more powerful than Stacking
(d) Committee Machines are less powerful than Stacking

Sol. (b), (c)


Both Committee Machines and Stacked Classifiers use sets of different classifiers. Assigning
constant weight to all first layer classifiers in a Stacked Classifier is simply the same as giving
each one a single vote (Committee Machines).
Since Committee Machines are a special case of Stacked Classifiers, they are less powerful than
Stacking, which can assign an adaptive weight depending on the region.

Assignment 8
Introduction to Machine Learning
Prof. B. Ravindran
1. The Naive Bayes classifier makes the assumption that the ______ are independent given the ______.

(a) features, class labels


(b) class labels, features
(c) features, data points
(d) there is no such assumption

Sol. (a)
2. Can the decision boundary produced by the Naive Bayes algorithm be non-linear?

(a) no
(b) yes
Sol. (b)
3. A major problem of using the one vs. rest multi-class classification approach is:

(a) class imbalance


(b) increased time complexity
Sol. (a)
4. (2 marks) Consider the problem of learning a function X → Y , where Y is Boolean. X is
an input vector (X1 , X2 ), where X1 is categorical and takes 3 values, and X2 is a continuous
variable (normally distributed). What would be the minimum number of parameters required
to define a Naive Bayes model for this function?
(a) 8
(b) 10
(c) 9
(d) 5
Sol. (c)
There are 3 possible values for X1 and 2 possible values for Y . We would have one parameter
for each P (X1 = x1 |Y = y), and there are 3 of these for each Y = y - however we would
only need 2, since the three probabilities have to sum to 1. Since there are 2 values for Y,
that gives us 4 parameters. For P (X2 = x2 |Y = y), which is continuous, we have the mean
and variance of a Gaussian for each Y = y - this gives 4 parameters. We also need the prior
probabilities P (Y = y); there are 2 of these since Y takes 2 values, but we only need one since
P (Y = 1) = 1 − P (Y = 0). The total is hence 4 + 4 + 1 = 9

5. In boosting, the weights of data points that were misclassified are ______ as training progresses.

(a) decreased
(b) increased
(c) first decreased and then increased
(d) kept unchanged
Sol. (b)

6. In a random forest model let m << p be the number of randomly selected features that are
used to identify the best split at any node of a tree. Which of the following are true? (p is the
original number of features)
(Multiple options may be correct)
(a) increasing m reduces the correlation between any two trees in the forest
(b) decreasing m reduces the correlation between any two trees in the forest
(c) increasing m increases the performance of individual trees in the forest
(d) decreasing m increases the performance of individual trees in the forest
Sol. (b) and (c)

7. (2 marks) Consider the following graphical model, which of the following are false about the
model? (multiple options may be correct)

(a) A is independent of B when C is known


(b) D is independent of A when C is known
(c) D is not independent of A when B is known

(d) D is not independent of A when C is known

Sol. (a), (b)


8. Consider the Bayesian network given in the previous question. Let 'A', 'B', 'C', 'D' and 'E' denote the random variables shown in the network. Which of the following can be inferred from the network structure?
(a) 'A' causes 'D'
(b) 'E' causes 'D'
(c) 'C' causes 'A'
(d) options (a) and (b) are correct
(e) none of the above can be inferred
Sol. (e)
As discussed in the lecture, in Bayesian Network, the edges do not imply any causality.

Assignment 9
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider the Bayesian network shown below.

Figure 1: Q1

Two students - Manish and Trisha make the following claims:


• Manish claims P (D|{S, L, C}) = P (D|{L, C})
• Trisha claims P (D|{S, L}) = P (D|L)
where P (X|Y ) denotes probability of event X given Y . Please note that Y can be a set. Which
of the following is true?
(a) Manish and Trisha are correct.
(b) Manish is correct and Trisha is incorrect.
(c) Manish is incorrect and Trisha is correct.
(d) Both are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (b)
D and S are independent events given two variables {L, C} but not in the case when only L is
given.
2. Consider the same Bayesian network shown in the previous question (Figure 1). Two other students
in the class - Trina and Manish make the following claims:
• Trina claims P (S|{G, C}) = P (S|C)
• Manish claims P (L|{D, G}) = P (L|G)
Which of the following is true?
(a) Both the students are correct.

(b) Trina is incorrect and Manish is correct.
(c) Trina is correct and Manish is incorrect.
(d) Both the students are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (a)
same as previous question
3. Consider the Bayesian graph shown below in Figure 2.

Figure 2: Q3

The random variables have the following notation: d - Difficulty, i - Intelligence, g - Grade, s -
SAT, l - Letter. The random variables are modeled as discrete variables and the corresponding
CPDs are as below.
      d0    d1
     0.6   0.4

      i0    i1
     0.6   0.4

              g1     g2     g3
  i0, d0     0.3    0.4    0.3
  i0, d1     0.05   0.25   0.7
  i1, d0     0.9    0.08   0.02
  i1, d1     0.5    0.3    0.2

         s0     s1
  i0    0.95   0.05
  i1    0.2    0.8

         l0     l1
  g1    0.2    0.8
  g2    0.4    0.6
  g3    0.99   0.01
What is the probability of P (i = 1, d = 0, g = 2, s = 1, l = 1)?

(a) 0.004608
(b) 0.006144
(c) 0.001536
(d) 0.003992
(e) 0.009216
(f) 0.007309
(g) None of these

Sol. (e)

P(i = 1, d = 0, g = 2, s = 1, l = 1) = P(i = 1)P(d = 0)P(g = 2|i = 1, d = 0)P(s = 1|i = 1)P(l = 1|g = 2)

= 0.4 × 0.6 × 0.08 × 0.8 × 0.6 = 0.009216

4. Using the data given in the previous question, compute the probability of following assignment,
P (i = 1, g = 1, s = 1, l = 0) irrespective of the difficulty of the course? (up to 3 decimal places)

(a) 0.160
(b) 0.371
(c) 0.662
(d) 0.047
(e) 0.037
(f) 0.066
(g) 0.189

Sol. (d)

P(i = 1, g = 1, s = 1, l = 0) = P(i = 1)P(s = 1|i = 1)P(l = 0|g = 1) Σ_{d=0,1} P(d)P(g = 1|i = 1, d)

= 0.4 × 0.8 × 0.2 × (0.6 × 0.9 + 0.4 × 0.5) = 0.04736
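Both computations (Q3 and Q4) can be reproduced with a short script; the dictionary layout of the CPDs is an implementation choice, not part of the question:

```python
# CPDs from Figure 2, stored as plain dictionaries (layout is our choice).
P_d = {0: 0.6, 1: 0.4}
P_i = {0: 0.6, 1: 0.4}
P_g = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}  # (i, d) -> [g1, g2, g3]
P_s = {0: [0.95, 0.05], 1: [0.2, 0.8]}                       # i -> [s0, s1]
P_l = {1: [0.2, 0.8], 2: [0.4, 0.6], 3: [0.99, 0.01]}        # g -> [l0, l1]

def joint(i, d, g, s, l):
    """Full joint probability via the network's chain-rule factorization."""
    return P_i[i] * P_d[d] * P_g[(i, d)][g - 1] * P_s[i][s] * P_l[g][l]

# Q3: fully specified assignment.
p3 = joint(i=1, d=0, g=2, s=1, l=1)
print(round(p3, 6))  # 0.009216

# Q4: marginalize out the unobserved difficulty d.
p4 = sum(joint(i=1, d=d, g=1, s=1, l=0) for d in (0, 1))
print(round(p4, 5))  # 0.04736
```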

5. Consider the Bayesian network shown below in Figure 3

Figure 3: Q5

Two students - Manish and Trisha make the following claims:


• Manish claims P (H|{S, G, J}) = P (H|{G, J})
• Trisha claims P (H|{S, C, J}) = P (H|{C, J})
Which of the following is true?
(a) Manish and Trisha are correct.
(b) Both are incorrect.
(c) Manish is incorrect and Trisha is correct.
(d) Manish is correct and Trisha is incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (d)

6. Consider the Markov network shown below in Figure 4

Figure 4: Q6

Which of the following variables are NOT in the markov blanket of variable “4” shown in the
above Figure 4 ? (multiple answers may be correct)
(a) 1
(b) 8
(c) 2
(d) 5
(e) 6
(f) 4
(g) 7
Sol. (d) and (g)

7. In the Markov network given in Figure 4, two students make the following claims:
• Manish claims variable “1” is dependent on variable “7” given variable “2”.
• Trina claims variable “2” is independent of variable “6” given variable “3”.

Which of the following is true?


(a) Both the students are correct.
(b) Trina is incorrect and Manish is correct.
(c) Trina is correct and Manish is incorrect.
(d) Both the students are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (d)

8. Four random variables are known to follow the given factorization

P(A1 = a1, A2 = a2, A3 = a3, A4 = a4) = (1/Z) ψ1(a1, a2) ψ2(a1, a4) ψ3(a1, a3) ψ4(a2, a4) ψ5(a3, a4)

The corresponding Markov network would be

(a)

(b)

(c)

(d)

(e)

(f): None of the above

Sol. (c)

9. Does there exist a more compact factorization involving fewer factors for the distribution given in the previous question?
(a) Yes
(b) No
(c) Insufficient information
Sol. (a)

10. Consider the following Markov Random Field.

Figure 11: Q10

Which of the following nodes will have no effect on H given the Markov Blanket of H?

(a) A
(b) B
(c) C
(d) D
(e) E
(f) F
(g) G
(h) I
(i) J

Sol. (c), (e) and (f)


The question requires you to select the random variables not in the Markov blanket of H. We see that the Markov blanket of H contains A, B, D, G, I, J. The only other variables, besides H, are C, E, F. These three variables can have no effect on H once the Markov blanket is known/given.

Assignment 10
Introduction to Machine Learning
Prof. B. Ravindran
1. (2 marks) Consider the following one dimensional data set: 12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8 and 18.
Given k = 3 and initial cluster centers to be 5, 6 and 31, what are the final cluster centres
obtained on applying the k-means algorithm?
(a) 5, 18, 30
(b) 5, 18, 32
(c) 6, 19, 32
(d) 4.8, 17.6, 32
(e) None of the above
Sol. (d)

2. (1 mark) For the previous question, in how many iterations will the k-means algorithm converge?
(a) 2
(b) 3
(c) 4
(d) 6
(e) 7
Sol. (c)
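Both Q1 and Q2 can be reproduced with a minimal 1-D k-means loop; counting an iteration as one assignment-plus-update pass and stopping when the centres no longer change are our conventions here:

```python
# 1-D k-means on the Q1 data, starting from the given initial centres.
data = [12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8, 18]
centres = [5.0, 6.0, 31.0]

iterations = 0
while True:
    iterations += 1
    # Assignment step: each point joins its nearest centre.
    clusters = [[] for _ in centres]
    for x in data:
        nearest = min(range(len(centres)), key=lambda k: abs(x - centres[k]))
        clusters[nearest].append(x)
    # Update step: each centre moves to the mean of its cluster.
    new_centres = [sum(c) / len(c) for c in clusters]
    if new_centres == centres:  # converged: centres unchanged
        break
    centres = new_centres

print(centres, iterations)  # [4.8, 17.6, 32.0] in 4 iterations
```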
3. (1 mark) In the lecture on the BIRCH algorithm, it is stated that using the number of points
N, sum of points SUM and sum of squared points SS, we can determine the centroid and
radius of the combination of any two clusters A and B. How do you determine the centroid of
the combined cluster? (In terms of N,SUM and SS of both the clusters)
(a) SUM_A + SUM_B
(b) SUM_A/N_A + SUM_B/N_B
(c) (SUM_A + SUM_B)/(N_A + N_B)
(d) (SS_A + SS_B)/(N_A + N_B)

Sol. (c)
Apply the centroid formula to the combined cluster points. It’s simply the sum of all points
divided by the total number of points.
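A small sketch of the merge: the cluster features (N, SUM, SS) of the union are simply the element-wise sums, from which the centroid and radius follow. The toy clusters and the radius definition used here (root mean squared distance to the centroid, as in BIRCH) are illustrative:

```python
# Cluster features (N, SUM, SS) for 1-D clusters; merging simply adds them.
def merge(cf_a, cf_b):
    na, sa, ssa = cf_a
    nb, sb, ssb = cf_b
    return (na + nb, sa + sb, ssa + ssb)

def centroid(cf):
    n, s, _ = cf
    return s / n  # option (c): (SUM_A + SUM_B) / (N_A + N_B)

def radius(cf):
    """Root mean squared distance of the cluster's points to its centroid."""
    n, s, ss = cf
    return (ss / n - (s / n) ** 2) ** 0.5

a = (3, 6.0, 14.0)    # points {1, 2, 3}:  N=3, SUM=6,  SS=1+4+9=14
b = (2, 30.0, 500.0)  # points {10, 20}:   N=2, SUM=30, SS=100+400=500
m = merge(a, b)
print(centroid(m))  # 7.2
```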
4. (1 mark) What assumption does the CURE clustering algorithm make with regards to the
shape of the clusters?

(a) No assumption
(b) Spherical

(c) Elliptical
Sol. (a)
Explanation: CURE does not make any assumption on the shape of the clusters.
5. (1 mark) What would be the effect of increasing MinPts in DBSCAN while retaining the same
Eps parameter? (Note that more than one statement may be correct)
(a) Increase in the sizes of individual clusters
(b) Decrease in the sizes of individual clusters
(c) Increase in the number of clusters
(d) Decrease in the number of clusters
Sol. (b), (c)
By increasing MinPts, we require more points in a neighborhood before including them in a cluster; in effect, we are looking for denser clusters. This can break not-so-dense clusters into more than one part, which reduces cluster sizes and increases the number of clusters.

For the next question, kindly download the dataset - DS1. The first two columns in the dataset correspond to the co-ordinates of each data point. The third column corresponds to the actual cluster label.
DS1: [Link]
6. (2 marks) Visualize the dataset DS1. Which of the following algorithms will be able to recover the true clusters? (First check by visual inspection, then write code to see if the result matches what you expected.)
(a) K-means clustering
(b) Single link hierarchical clustering
(c) Complete link hierarchical clustering
(d) Average link hierarchical clustering
Sol. (b)
The dataset contains spiral clusters. Single link hierarchical clustering can recover spiral
clusters with appropriate parameter settings.
7. (1 mark) Consider the similarity matrix given below. Which of the following shows the hierarchy of clusters created by the single link clustering algorithm?

P1 P2 P3 P4 P5 P6
P1 1.0000 0.7895 0.1579 0.0100 0.5292 0.3542
P2 0.7895 1.0000 0.3684 0.2105 0.7023 0.5480
P3 0.1579 0.3684 1.0000 0.8421 0.5292 0.6870
P4 0.0100 0.2105 0.8421 1.0000 0.3840 0.5573
P5 0.5292 0.7023 0.5292 0.3840 1.0000 0.8105
P6 0.3542 0.5480 0.6870 0.5573 0.8105 1.0000

Sol. (b)

8. (1 mark) For the similarity matrix given in the previous question, which of the following shows the hierarchy of clusters created by the complete link clustering algorithm?

Sol. (d)

Assignment 11
Introduction to Machine Learning
Prof. B. Ravindran
1. Given n samples x1, x2, . . . , xn drawn independently from an Exponential distribution with unknown parameter λ, find the MLE of λ.
(a) λ_MLE = Σ_{i=1}^n xi
(b) λ_MLE = n Σ_{i=1}^n xi
(c) λ_MLE = n / Σ_{i=1}^n xi
(d) λ_MLE = (Σ_{i=1}^n xi) / n
(e) λ_MLE = (n − 1) / Σ_{i=1}^n xi
(f) λ_MLE = (Σ_{i=1}^n xi) / (n − 1)

Sol. (c)

L(λ; x1, . . . , xn) = ∏_{i=1}^n f(xi; λ) = ∏_{i=1}^n λ e^{−λxi} = λ^n e^{−λ Σ_{i=1}^n xi}

d ln(L(λ; x1, . . . , xn))/dλ = d(n ln(λ) − λ Σ_{i=1}^n xi)/dλ = n/λ − Σ_{i=1}^n xi

Setting the above term to zero gives the MLE of λ:

λ = n / Σ_{i=1}^n xi
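As a numerical sanity check, the closed-form MLE should maximize the log-likelihood on any sample; the small sample below is hypothetical, not from the assignment:

```python
from math import log

# Small fixed sample (hypothetical, only for the check).
xs = [0.5, 1.2, 0.3, 2.0, 0.8]
n, s = len(xs), sum(xs)

def loglik(lam):
    """Exponential log-likelihood: n ln(lam) - lam * sum(x_i)."""
    return n * log(lam) - lam * s

mle = n / s  # closed-form MLE derived above
# The log-likelihood at the MLE should beat nearby values of lambda.
assert loglik(mle) > loglik(mle - 0.1)
assert loglik(mle) > loglik(mle + 0.1)
print(round(mle, 4))  # 1.0417
```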

2. Given n samples x1, x2, . . . , xn drawn independently from a Geometric distribution with unknown parameter p, given by the pmf Pr(X = k) = (1 − p)^{k−1} p for k = 1, 2, 3, · · · , find the MLE of p.
(a) p_MLE = Σ_{i=1}^n xi
(b) p_MLE = n Σ_{i=1}^n xi
(c) p_MLE = n / Σ_{i=1}^n xi
(d) p_MLE = (Σ_{i=1}^n xi) / n
(e) p_MLE = (n − 1) / Σ_{i=1}^n xi
(f) p_MLE = (Σ_{i=1}^n xi) / (n − 1)

Sol. (c)

3. (2 marks) Suppose we are trying to model a p dimensional Gaussian distribution. What is the
actual number of independent parameters that need to be estimated in mean and covariance
matrix respectively?
(a) 1, 1
(b) p − 1, 1
(c) p, p
(d) p, p(p + 1)
(e) p, p(p + 1)/2
(f) p, (p + 3)/2
(g) p − 1, p(p + 1)
(h) p − 1, p(p + 1)/2 + 1
(i) p − 1, (p + 3)/2
(j) p, p(p + 1) − 1
(k) p, p(p + 1)/2 − 1
(l) p, (p + 3)/2 − 1
(m) p, p2
(n) p, p2 /2
(o) None of these
Sol. (e)
Explanation: The mean vector has p parameters. The covariance matrix is symmetric (p × p) and hence has p(p + 1)/2 independent parameters.

4. (2 marks) Given n samples x1, x2, . . . , xn drawn independently from a Poisson distribution with unknown parameter λ, find the MLE of λ.
(a) λ_MLE = Σ_{i=1}^n xi
(b) λ_MLE = n Σ_{i=1}^n xi
(c) λ_MLE = n / Σ_{i=1}^n xi
(d) λ_MLE = (Σ_{i=1}^n xi) / n
(e) λ_MLE = (n − 1) / Σ_{i=1}^n xi
(f) λ_MLE = (Σ_{i=1}^n xi) / (n − 1)

Sol. (d)
Write the likelihood:

l(λ; x) = ∏_i λ^{xi} e^{−λ} / xi! = λ^{x1+x2+···+xn} e^{−nλ} / (x1! x2! · · · xn!)

Take the log, differentiate the log-likelihood with respect to λ, set it to 0, and solve to obtain λ = (Σ_{i=1}^n xi)/n.
5. (2 marks) In Gaussian Mixture Models, πi are the mixing coefficients. Select the correct
conditions that the mixing coefficients need to satisfy for a valid GMM model.
(a) −1 ≤ πi ≤ 1, ∀i
(b) 0 ≤ πi ≤ 1, ∀i
(c) Σi πi = 1
(d) Σi πi need not be bounded

Sol. (b), (c)

6. (2 marks) Expectation-Maximization, or the EM algorithm, consists of two steps - E step and


the M-step. Using the following notation, select the correct set of equations used at each step
of the algorithm.
Notation.
X: Known/Given variables/data
Z: Hidden/Unknown variables
θ: Total set of parameters to be learned
θk : Values of all the parameters after stage k
Q(, ): The Q-function as described in the lectures
(a) E-step: E_{Z|X,θ}[log(Pr(X, Z|θ_m))]
(b) E-step: E_{Z|X,θ_{m−1}}[log(Pr(X, Z|θ))]
(c) M-step: argmax_θ Σ_Z Pr(Z|X, θ_{m−2}) · log(Pr(X, Z|θ))
(d) M-step: argmax_θ Q(θ, θ_{m−1})
(e) M-step: argmax_θ Q(θ, θ_{m−2})
Sol. (b), (d)
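A minimal E-M loop for a two-component 1-D Gaussian mixture illustrates options (b) and (d): the E-step computes responsibilities under the previous parameters θ_{m−1}, and the M-step maximizes the resulting Q-function in closed form. The data and initial parameters below are illustrative, not from the assignment:

```python
import math
import random

# Illustrative data: two well-separated 1-D Gaussian clusters.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)] + \
       [random.gauss(5, 1) for _ in range(100)]

pi = [0.5, 0.5]   # mixing coefficients: 0 <= pi_k <= 1 and they sum to 1
mu = [-1.0, 6.0]  # initial means
var = [1.0, 1.0]  # initial variances

def pdf(x, m, v):
    """Gaussian density N(x; mean m, variance v)."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for _ in range(50):
    # E-step: responsibilities under the previous parameters (theta_{m-1}).
    r = [[0.0] * len(data) for _ in range(2)]
    for i, x in enumerate(data):
        w = [pi[k] * pdf(x, mu[k], var[k]) for k in range(2)]
        for k in range(2):
            r[k][i] = w[k] / sum(w)
    # M-step: closed-form argmax of Q(theta, theta_{m-1}).
    for k in range(2):
        nk = sum(r[k])
        mu[k] = sum(r[k][i] * x for i, x in enumerate(data)) / nk
        var[k] = sum(r[k][i] * (x - mu[k]) ** 2 for i, x in enumerate(data)) / nk
        pi[k] = nk / len(data)

print([round(m, 2) for m in mu], [round(p, 2) for p in pi])
```

After convergence the means land near 0 and 5 and the mixing coefficients remain a valid distribution, consistent with conditions (b) and (c) of the previous question.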
