
ML Practice Questions and Solutions

The document contains practice questions and solutions related to various machine learning topics, including linear regression, regularization, logistic regression, naive Bayes, support vector machines, decision trees, regression trees, and ensemble learning. Each section presents specific problems or scenarios, requiring the application of concepts and formulas to derive answers. The document also includes discussions on algorithm complexities and assumptions made by different machine learning models.


Practice Questions ML

November 2024

Linear Regression
1. Suppose we have a dataset with five predictors, X1 = GPA, X2 = IQ, X3 =
Level (1 for College and 0 for High School), X4 = Interaction between GPA
and IQ, and X5 = Interaction between GPA and Level. The response is
starting salary after graduation (in thousands of dollars). Suppose we use
least squares to fit the model and get the following estimated coefficients:
β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10. Justify your
answers. (ISLP 3.7, Q3)

(a) Which answer is correct, and why?


i. For a fixed value of IQ and GPA, high school graduates earn
more, on average, than college graduates.
ii. For a fixed value of IQ and GPA, college graduates earn more,
on average, than high school graduates.
iii. For a fixed value of IQ and GPA, high school graduates earn
more, on average, than college graduates provided that the GPA
is high enough.
iv. For a fixed value of IQ and GPA, college graduates earn more,
on average, than high school graduates provided that the GPA
is high enough.
(b) Predict the salary of a college graduate with an IQ of 110 and a GPA
of 4.0.
(c) True or false: Since the coefficient for the GPA/IQ interaction term
is very small, there is very little evidence of an interaction effect
between GPA and IQ.
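Part (b) is a direct plug-in of the fitted model; a quick numeric check (coefficients taken from the question, salary in thousands of dollars):

```python
# Coefficients as given in the question
b0, b1, b2, b3, b4, b5 = 50, 20, 0.07, 35, 0.01, -10

gpa, iq, level = 4.0, 110, 1  # college graduate, IQ 110, GPA 4.0
salary = (b0 + b1*gpa + b2*iq + b3*level
          + b4*gpa*iq + b5*gpa*level)
print(salary)  # ≈ 137.1 thousand dollars
```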

2. Consider a linear regression problem with N samples where the input is in
D-dimensional space, and all output values are yi ∈ {−1, +1}. Which of the
following statements is correct?

(a) linear regression cannot “work” if N ≫ D


(b) linear regression cannot “work” if N ≪ D

(c) linear regression can be made to work perfectly if the data is linearly
separable
Solution: Answer c is correct.
3. Consider a data matrix X ∈ R^{D×N} of N data points in D dimensions, and
target values yn for n = 1, …, N. We perform least squares linear regression,
without the use of any regularizer.

(a) Write down the normal equations.


(b) Give the expression to predict a new unseen point x_m. Do not assume
knowledge of w; compute it.

Solution: Refer to class notes.
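Since the solution defers to class notes, here is a minimal NumPy sketch under the column-wise convention of the question (X is D × N, columns are data points; the data below are synthetic and noiseless for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 50
X = rng.normal(size=(D, N))          # columns are the data points
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true                     # noiseless targets, for illustration

# (a) Normal equations in this layout: (X X^T) w = X y
w = np.linalg.solve(X @ X.T, X @ y)

# (b) Prediction for a new unseen point x_m: y_m = w^T x_m
x_m = rng.normal(size=D)
y_m = w @ x_m
```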

Regularization
It is well-known that ridge regression tends to give similar coefficient values to
correlated variables, whereas the lasso may give quite different coefficient values
to correlated variables. We will now explore this property in a very simple
setting. (ISLP 6.6, Q5)
Suppose that n = 2, p = 2, x11 = x12 , x21 = x22 . Furthermore, suppose
that y1 + y2 = 0 and x11 + x21 = 0 and x12 + x22 = 0, so that the estimate for
the intercept in a least squares, ridge regression, or lasso model is zero: β̂0 = 0.

(a) Write out the ridge regression optimization problem in this setting.

(b) Argue that in this setting, the ridge coefficient estimates satisfy β̂1 = β̂2 .
(c) Write out the lasso optimization problem in this setting.

(d) Argue that in this setting, the lasso coefficients β̂1 and β̂2 are not unique—in
other words, there are many possible solutions to the optimization problem
in (c). Describe these solutions.
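Part (b) can also be checked numerically; a small sketch with concrete numbers satisfying the stated constraints (x11 = x12 = 1, x21 = x22 = −1, y = (2, −2); the value of λ is arbitrary):

```python
import numpy as np

lam = 0.5                                   # any positive ridge penalty
X = np.array([[1.0, 1.0], [-1.0, -1.0]])    # x11 = x12, x21 = x22, columns sum to 0
y = np.array([2.0, -2.0])                   # y1 + y2 = 0

# Closed-form ridge solution: (X^T X + lam I) beta = X^T y
beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
# the two ridge coefficients coincide, as argued in part (b)
```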

Logistic Regression
Suppose we collect data for a group of students in a statistics class with variables
X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a
logistic regression and produce estimated coefficients: β̂0 = −6, β̂1 = 0.05,
β̂2 = 1. (ISLP 4.8, Q6)

(a) Estimate the probability that a student who studies for 40 hours and has
an undergrad GPA of 3.5 gets an A in the class.
(b) How many hours would the student in part (a) need to study to have a
50% chance of getting an A in the class?
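Both parts follow from the logistic form p = 1 / (1 + e^{−(β0 + β1 X1 + β2 X2)}); a quick numeric check:

```python
import math

b0, b1, b2 = -6, 0.05, 1
gpa = 3.5

# (a) probability of an A after 40 hours of study
z = b0 + b1*40 + b2*gpa
p = 1 / (1 + math.exp(-z))       # ≈ 0.378

# (b) a 50% chance means the linear predictor is zero:
#     b0 + b1*hours + b2*gpa = 0
hours = -(b0 + b2*gpa) / b1      # 50.0 hours
```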

Naive Bayes
Suppose that we wish to predict whether a given stock will issue a dividend this
year ("Yes" or "No") based on X, last year's percent profit. We examine a large
number of companies and discover that the mean value of X for companies that
issued a dividend was X̄ = 10, while the mean for those that didn't was X̄ = 0.
In addition, the variance of X for these two sets of companies was σ² = 36.
Finally, 80% of companies issued dividends. Assuming that X follows a normal
distribution, predict the probability that a company will issue a dividend this
year given that its percentage profit was X = 4 last year. (ISLP 4.8, Q7)
Hint: Recall that the density function for a normal random variable is
    f(x) = (1 / √(2πσ²)) · exp(−(x − µ)² / (2σ²))
You will need to use Bayes’ theorem.
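Putting the hint together with Bayes' theorem gives the posterior directly; a short numeric sketch:

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu)**2 / (2*var)) / math.sqrt(2*math.pi*var)

prior_yes, prior_no = 0.8, 0.2
lik_yes = normal_pdf(4, 10, 36)   # companies that issued a dividend
lik_no  = normal_pdf(4, 0, 36)    # companies that did not

p_yes = prior_yes*lik_yes / (prior_yes*lik_yes + prior_no*lik_no)
# ≈ 0.752
```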

Support Vector Machines (SVM)


Here we explore the maximal margin classifier on a toy data set. (ISLP 9.7, Q3)

(a) We are given n = 7 observations in p = 2 dimensions. For each observation,
there is an associated class label.

Obs. X1 X2 Y
1 3 4 Red
2 2 2 Red
3 4 4 Red
4 1 4 Red
5 2 1 Blue
6 4 3 Blue
7 4 1 Blue

Sketch the observations.


(b) Sketch the optimal separating hyperplane, and provide the equation for
this hyperplane (in the form β0 + β1 X1 + β2 X2 = 0).
(c) Describe the classification rule for the maximal margin classifier. It should
be something along the lines of “Classify to Red if β0 + β1 X1 + β2 X2 > 0,
and classify to Blue otherwise.” Provide the values for β0 , β1 , and β2 .
(d) On your sketch, indicate the margin for the maximal margin hyperplane.
(e) Indicate the support vectors for the maximal margin classifier.
(f) Argue that a slight movement of the seventh observation would not affect
the maximal margin hyperplane.

(g) Sketch a hyperplane that is not the optimal separating hyperplane, and
provide the equation for this hyperplane.
(h) Draw an additional observation on the plot so that the two classes are no
longer separable by a hyperplane.
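Parts (b)–(e) can be checked numerically. One consistent answer (a candidate the reader should verify against their own sketch, not reproduced from the text) is the hyperplane −0.5 + X1 − X2 = 0, which passes midway between observations 2, 3 and observations 5, 6:

```python
obs = [((3, 4), "Red"), ((2, 2), "Red"), ((4, 4), "Red"), ((1, 4), "Red"),
       ((2, 1), "Blue"), ((4, 3), "Blue"), ((4, 1), "Blue")]
b0, b1, b2 = -0.5, 1.0, -1.0      # candidate hyperplane: -0.5 + X1 - X2 = 0

def classify(x1, x2):
    # classify to Red if b0 + b1*X1 + b2*X2 < 0, Blue otherwise
    return "Red" if b0 + b1*x1 + b2*x2 < 0 else "Blue"

all_correct = all(classify(*x) == y for x, y in obs)
# the support vectors are the observations with |b0 + b1*X1 + b2*X2| = 0.5
```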

Decision Trees
Draw an example (of your own invention) of a partition of two dimensional
feature space that could result from recursive binary splitting. Your example
should contain at least six regions. Draw a decision tree corresponding to this
partition. Be sure to label all aspects of your figures, including the regions
R1,R2,..., the cutpoints t1,t2,..., and so forth. (ISLP 8.4, Q1)

Regression Trees
Provide a detailed explanation of the algorithm that is used to fit a regression
tree. (ISLP 8.4, Q6)
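As a companion to that explanation, a minimal sketch of greedy recursive binary splitting for a regression tree (1-D input, squared-error criterion; the toy data are invented for illustration):

```python
import numpy as np

def best_split(x, y):
    """Greedy 1-D split minimizing the total squared error of both halves."""
    best_sse, best_s = np.inf, None
    for s in np.unique(x)[1:]:
        left, right = y[x < s], y[x >= s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_sse, best_s = sse, s
    return best_s

def fit_tree(x, y, depth, min_size=2):
    """Recursive binary splitting; leaves predict the mean of their region."""
    if depth == 0 or len(y) < 2 * min_size or np.unique(x).size < 2:
        return float(y.mean())
    s = best_split(x, y)
    return (s, fit_tree(x[x < s], y[x < s], depth - 1, min_size),
               fit_tree(x[x >= s], y[x >= s], depth - 1, min_size))

def predict(tree, v):
    while isinstance(tree, tuple):
        s, left, right = tree
        tree = left if v < s else right
    return tree

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
tree = fit_tree(x, y, depth=1)   # one split, at the obvious gap before x = 10
```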

Bagging and Boosting


Suppose we produce ten bootstrapped samples from a data set containing red
and green classes. We then apply a classification tree to each bootstrapped
sample and, for a specific value of X, produce 10 estimates of P (Class is Red|X):

0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75.

There are two common ways to combine these results together into a single
class prediction. One is the majority vote approach. The second approach is
to classify based on the average probability. In this example, what is the final
classification under each of these two approaches? (ISLP 8.4, Q5)
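The two combination rules can be evaluated in a few lines:

```python
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Majority vote: each tree predicts Red when its probability exceeds 0.5
votes_red = sum(p > 0.5 for p in probs)                 # 6 of 10
majority = "Red" if votes_red > len(probs) / 2 else "Green"

# Average probability
avg = sum(probs) / len(probs)                           # 0.45
average_rule = "Red" if avg > 0.5 else "Green"
```

The two rules disagree here, which is the point of the exercise.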

Miscellaneous
1. Assume that you initialize all weights in a neural net to the same value
and you do the same for the bias terms. Which of the following statements
is correct?

(a) This is a good idea since it treats every edge equally.


(b) This is a bad idea

Solution: Answer b is correct, since in this case every node in a
particular layer will learn the same feature.
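The symmetry argument can be made concrete with a tiny 2-2-1 network (squared-error loss; all numbers are arbitrary illustrative choices):

```python
import numpy as np

x = np.array([0.5, -1.0]); t = 1.0                 # one training example
W1 = np.full((2, 2), 0.3); b1 = np.full(2, 0.1)    # identical initial weights
w2 = np.full(2, 0.3); b2 = 0.1

h = np.tanh(W1 @ x + b1)       # both hidden units compute the same value
yhat = w2 @ h + b2

# Backprop for squared error: the gradient rows for the two hidden units
# match, so after any gradient step the units remain identical
delta2 = yhat - t
delta1 = delta2 * w2 * (1 - h**2)
gW1 = np.outer(delta1, x)
```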

2. What is the complexity of the back-propagation algorithm for a neural
net with L layers and K nodes per layer?
Solution: O(K²L). The dominant term is the multiplication of a
vector with a K × K matrix, and this has to be done in each of
the L layers.
3. Which of the following are typical benefits of ensemble learning in its basic
form (that is, not AdaBoost and not with randomized decision bound-
aries), with all weak learners having the same learning algorithm and an
equal vote?

(a) Ensemble learning tends to reduce the bias of your classification al-
gorithm.
(b) Ensemble learning can be used to avoid overfitting.
(c) Ensemble learning tends to reduce the variance of your classification
algorithm.
(d) Ensemble learning can be used to avoid underfitting.

Solution: In ensemble learning, increasing the number of classi-


fiers reduces the variance of our model but generally has little
effect on the bias. Therefore, basic ensembling can be used to
avoid overfitting but would generally not be used to avoid un-
derfitting. (By contrast, AdaBoost reduces bias.)
4. Newton’s method always converges, but sometimes it is slower than gradient
descent. [T/F]
Solution: False. Newton’s method is not guaranteed to converge; it can also diverge.
5. What is the loss function and regularizer of what is typically referred to
as “SVM”?
Solution: hinge loss + l2 regularization
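Written out, that objective is (a sketch with a hypothetical linear model f(x) = w·x + b):

```python
def svm_objective(w, b, data, lam):
    """Hinge loss plus l2 penalty for a linear model (illustrative sketch)."""
    hinge = sum(max(0.0, 1 - y * (sum(wi*xi for wi, xi in zip(w, x)) + b))
                for x, y in data)
    return hinge + lam * sum(wi*wi for wi in w)
```

For a point classified correctly with margin at least 1 the hinge term is zero, so only the regularizer contributes.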
6. Name one scenario when you want to use the Huber loss over L2 and vice
versa.
Solution: Huber Loss: if you have bad outliers. L2 loss: if you
want to estimate the mean labels for all inputs
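For reference, the Huber loss is quadratic for small residuals (like L2) and linear in the tails (like L1); a minimal definition with the usual threshold parameter δ:

```python
def huber(r, delta=1.0):
    """0.5*r^2 for |r| <= delta, linear beyond: robust to large residuals."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)
```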
7. For each algorithm name one assumption that it makes on the data:

• Naive Bayes
• Logistic Regression
• SVM
• Regression
• Decision Trees

Solution: refer to notes

8. Let x, y ∈ R^d be two points (e.g. sample or test points). Consider the
function k(x, y) = x^T rev(y), where rev(y) reverses the order of the
components in y, e.g. rev([1, 2, 3]) = [3, 2, 1]. Show that k cannot be a
valid kernel function.
Solution: We have that k((−1, 1), (−1, 1)) = −2, but this is impossible:
if k is a valid kernel, then there is some function ϕ such
that k(x, x) = ϕ(x)^T ϕ(x) ≥ 0.
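The counterexample in the solution is easy to verify:

```python
def k(x, y):
    """k(x, y) = x^T rev(y): dot product with the reversed vector."""
    return sum(a * b for a, b in zip(x, reversed(y)))

val = k((-1, 1), (-1, 1))   # -2 < 0, so k(x, x) can be negative: not a kernel
```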
