Practice Questions ML
November 2024
Linear Regression
1. Suppose we have a dataset with five predictors, X1 = GPA, X2 = IQ, X3 =
Level (1 for College and 0 for High School), X4 = Interaction between GPA
and IQ, and X5 = Interaction between GPA and Level. The response is
starting salary after graduation (in thousands of dollars). Suppose we use
least squares to fit the model and get the following estimated coefficients:
β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10. Justify your
answers. (ISLP 3.7, Q3)
(a) Which answer is correct, and why?
i. For a fixed value of IQ and GPA, high school graduates earn
more, on average, than college graduates.
ii. For a fixed value of IQ and GPA, college graduates earn more,
on average, than high school graduates.
iii. For a fixed value of IQ and GPA, high school graduates earn
more, on average, than college graduates provided that the GPA
is high enough.
iv. For a fixed value of IQ and GPA, college graduates earn more,
on average, than high school graduates provided that the GPA
is high enough.
(b) Predict the salary of a college graduate with an IQ of 110 and a GPA
of 4.0.
(c) True or false: Since the coefficient for the GPA/IQ interaction term
is very small, there is very little evidence of an interaction effect
between GPA and IQ.
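A quick numerical sketch of part (b), plugging the fitted coefficients into the model (Level = 1 for a college graduate; the helper name `predict_salary` is just for illustration):

```python
# Worked numerical check for part (b). Each interaction term is the
# product of the corresponding predictors.
def predict_salary(gpa, iq, level):
    # salary = b0 + b1*GPA + b2*IQ + b3*Level + b4*GPA*IQ + b5*GPA*Level
    return 50 + 20*gpa + 0.07*iq + 35*level + 0.01*gpa*iq - 10*gpa*level

salary = predict_salary(gpa=4.0, iq=110, level=1)
print(salary)  # ≈ 137.1 (thousands of dollars)
```

Note that the GPA/Level interaction contributes −10 · 4.0 · 1 = −40 to this prediction.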
2. Consider a linear regression problem with N samples where the input is in
D-dimensional space, and all output values satisfy yi ∈ {−1, +1}. Which of the
following statements is correct?
(a) linear regression cannot “work” if N ≫ D
(b) linear regression cannot “work” if N ≪ D
(c) linear regression can be made to work perfectly if the data is linearly
separable
Solution: Answer (c) is correct. Answers (a) and (b) are too strong: a
least squares solution can still be computed in either regime (via the
pseudoinverse when the normal-equation matrix is singular).
3. Consider a data matrix X ∈ RD×N of N data points in D dimensions, and
target values yn for n = 1, . . . , N . We perform least squares linear regression,
without the use of any regularizer.
(a) Write down the normal equations.
(b) Give the expression to predict a new unseen point xm . Do not assume
knowledge of w; substitute the expression you computed for it.
Solution: Refer to class notes.
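As a sanity check, the normal-equation solution can be sketched numerically. The D × N data-matrix convention follows the question; the noise-free synthetic data below are an illustrative assumption:

```python
import numpy as np

# With X of shape D x N (columns are data points), the normal
# equations read (X X^T) w = X y.
rng = np.random.default_rng(0)
D, N = 3, 50
X = rng.normal(size=(D, N))
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true                      # noise-free targets

w = np.linalg.solve(X @ X.T, X @ y)   # normal equations

# Prediction for a new unseen point x_m: y_m = w^T x_m
x_m = rng.normal(size=D)
y_m = w @ x_m
print(np.allclose(w, w_true))  # True (no noise, so recovery is exact)
```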
Regularization
It is well-known that ridge regression tends to give similar coefficient values to
correlated variables, whereas the lasso may give quite different coefficient values
to correlated variables. We will now explore this property in a very simple
setting. (ISLP 6.6, Q5)
Suppose that n = 2, p = 2, x11 = x12 , x21 = x22 . Furthermore, suppose
that y1 + y2 = 0 and x11 + x21 = 0 and x12 + x22 = 0, so that the estimate for
the intercept in a least squares, ridge regression, or lasso model is zero: β̂0 = 0.
(a) Write out the ridge regression optimization problem in this setting.
(b) Argue that in this setting, the ridge coefficient estimates satisfy β̂1 = β̂2 .
(c) Write out the lasso optimization problem in this setting.
(d) Argue that in this setting, the lasso coefficients β̂1 and β̂2 are not unique—in
other words, there are many possible solutions to the optimization problem
in (c). Describe these solutions.
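Part (b) can also be checked numerically with the closed-form ridge solution β̂ = (XᵀX + λI)⁻¹ Xᵀy. The specific numbers below are an illustrative choice consistent with the constraints above (no intercept, duplicated predictor columns):

```python
import numpy as np

# Ridge on perfectly correlated (duplicated) predictors: the penalty
# forces the weight to be split equally between the two columns.
x1 = np.array([1.0, -1.0])         # x11 = 1, x21 = -1 (sums to 0)
X = np.column_stack([x1, x1])      # x12 = x11, x22 = x21
y = np.array([2.0, -2.0])          # y1 + y2 = 0

lam = 0.1
beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta)  # the two entries are equal, as part (b) claims
```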
Logistic Regression
Suppose we collect data for a group of students in a statistics class with variables
X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a
logistic regression and produce estimated coefficients: β̂0 = −6, β̂1 = 0.05,
β̂2 = 1. (ISLP 4.8, Q6)
(a) Estimate the probability that a student who studies for 40 hours and has
an undergrad GPA of 3.5 gets an A in the class.
(b) How many hours would the student in part (a) need to study to have a
50% chance of getting an A in the class?
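Both parts follow from the logistic model P(A) = 1/(1 + e^(−z)) with logit z = β̂0 + β̂1·X1 + β̂2·X2. A quick numerical sketch (the helper name `prob_a` is just for illustration):

```python
import math

# Part (a): z = -6 + 0.05*40 + 1*3.5 = -0.5
def prob_a(hours, gpa):
    z = -6 + 0.05 * hours + 1.0 * gpa
    return 1 / (1 + math.exp(-z))

print(round(prob_a(40, 3.5), 4))  # ≈ 0.3775

# Part (b): a 50% chance means z = 0, i.e. 0.05*h = 6 - 3.5
hours_needed = (6 - 3.5) / 0.05
print(hours_needed)  # 50.0
```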
Naive Bayes
Suppose that we wish to predict whether a given stock will issue a dividend this
year (“Yes” or “No”) based on X, last year’s percent profit. We examine a large
number of companies and discover that the mean value of X for companies that
issued a dividend was X̄ = 10, while the mean for those that didn’t was X̄ = 0.
In addition, the variance of X for these two sets of companies was σ 2 = 36.
Finally, 80% of companies issued dividends. Assuming that X follows a normal
distribution, predict the probability that a company will issue a dividend this
year given that its percentage profit was X = 4 last year. (ISLP 4.8, Q7)
Hint: Recall that the density function for a normal random variable is

f (x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

You will need to use Bayes’ theorem.
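The computation is a direct application of Bayes’ theorem with the two Gaussian class densities stated above (means 10 and 0, shared variance 36, prior P(dividend) = 0.8):

```python
import math

# P(yes | X=4) = P(yes) f(4; 10, 36) / [P(yes) f(4; 10, 36) + P(no) f(4; 0, 36)]
def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior_yes, prior_no = 0.8, 0.2
num = prior_yes * normal_pdf(4, 10, 36)
den = num + prior_no * normal_pdf(4, 0, 36)
print(round(num / den, 3))  # ≈ 0.752
```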
Support Vector Machines (SVM)
Here we explore the maximal margin classifier on a toy data set. (ISLP 9.7, Q3)
(a) We are given n = 7 observations in p = 2 dimensions. For each observa-
tion, there is an associated class label.
Obs. X1 X2 Y
1 3 4 Red
2 2 2 Red
3 4 4 Red
4 1 4 Red
5 2 1 Blue
6 4 3 Blue
7 4 1 Blue
Sketch the observations.
(b) Sketch the optimal separating hyperplane, and provide the equation for
this hyperplane (in the form β0 + β1 X1 + β2 X2 = 0).
(c) Describe the classification rule for the maximal margin classifier. It should
be something along the lines of “Classify to Red if β0 + β1 X1 + β2 X2 > 0,
and classify to Blue otherwise.” Provide the values for β0 , β1 , and β2 .
(d) On your sketch, indicate the margin for the maximal margin hyperplane.
(e) Indicate the support vectors for the maximal margin classifier.
(f) Argue that a slight movement of the seventh observation would not affect
the maximal margin hyperplane.
(g) Sketch a hyperplane that is not the optimal separating hyperplane, and
provide the equation for this hyperplane.
(h) Draw an additional observation on the plot so that the two classes are no
longer separable by a hyperplane.
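For parts (b) and (c), a natural candidate hyperplane runs midway between the closest Red/Blue pairs, giving −0.5 + X1 − X2 = 0 (β0 = −0.5, β1 = 1, β2 = −1). A quick numerical check that this candidate separates all seven observations:

```python
# Evaluate f(x) = -0.5 + x1 - x2 at each observation; Blue points
# should land on the positive side and Red points on the negative side.
points = {
    1: ((3, 4), "Red"), 2: ((2, 2), "Red"), 3: ((4, 4), "Red"),
    4: ((1, 4), "Red"), 5: ((2, 1), "Blue"), 6: ((4, 3), "Blue"),
    7: ((4, 1), "Blue"),
}
for obs, ((x1, x2), label) in points.items():
    f = -0.5 + x1 - x2
    pred = "Blue" if f > 0 else "Red"
    assert pred == label, f"observation {obs} misclassified"
print("candidate hyperplane separates all 7 observations")
```

The support vectors are the points where |f| is smallest (observations 2, 3, 5, and 6, each at distance 0.5/√2 from the hyperplane).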
Decision Trees
Draw an example (of your own invention) of a partition of two dimensional
feature space that could result from recursive binary splitting. Your example
should contain at least six regions. Draw a decision tree corresponding to this
partition. Be sure to label all aspects of your figures, including the regions
R1,R2,..., the cutpoints t1,t2,..., and so forth. (ISLP 8.4, Q1)
Regression Trees
Provide a detailed explanation of the algorithm that is used to fit a regression
tree. (ISLP 8.4, Q6)
Bagging and Boosting
Suppose we produce ten bootstrapped samples from a data set containing red
and green classes. We then apply a classification tree to each bootstrapped
sample and, for a specific value of X, produce 10 estimates of P (Class is Red|X):
0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75.
There are two common ways to combine these results together into a single
class prediction. One is the majority vote approach. The second approach is
to classify based on the average probability. In this example, what is the final
classification under each of these two approaches? (ISLP 8.4, Q5)
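Both aggregation rules can be checked directly on the ten estimates:

```python
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Majority vote: each tree votes Red if its probability exceeds 0.5.
red_votes = sum(p > 0.5 for p in probs)
majority = "Red" if red_votes > len(probs) / 2 else "Green"
print(red_votes, majority)          # 6 Red

# Average probability: classify Red only if the mean exceeds 0.5.
avg = sum(probs) / len(probs)
average_rule = "Red" if avg > 0.5 else "Green"
print(round(avg, 2), average_rule)  # 0.45 Green
```

The two rules disagree here, which is the point of the exercise: six trees lean Red, but the Red-leaning probabilities are not large enough to pull the average above 0.5.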
Miscellaneous
1. Assume that you initialize all weights in a neural net to the same value
and you do the same for the bias terms. Which of the following statements
is correct?
(a) This is a good idea since it treats every edge equally.
(b) This is a bad idea
Solution: Answer b is correct since in this case every node on a
particular level will learn the same feature.
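A tiny numerical illustration of this symmetry problem (the network size, inputs, and initial values below are arbitrary):

```python
import numpy as np

# Two hidden units with identical initial weights compute the same
# activation and receive the same gradient, so they remain identical
# after every update.
x = np.array([1.0, 2.0])
W1 = np.full((2, 2), 0.3)   # both hidden units initialized alike
w2 = np.full(2, 0.5)
y = 1.0

h = np.tanh(W1 @ x)         # identical hidden activations
err = (w2 @ h) - y          # squared-loss residual
grad_W1 = np.outer(w2 * (1 - h**2) * err, x)
print(np.allclose(grad_W1[0], grad_W1[1]))  # True: both rows update identically
```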
2. What is the complexity of the back-propagation algorithm for a neural
net with L layers and K nodes per layer?
Solution: O(K²L). The dominant term is the multiplication of a
vector with a K × K matrix, and this has to be done in each of
the L layers.
3. Which of the following are typical benefits of ensemble learning in its basic
form (that is, not AdaBoost and not with randomized decision bound-
aries), with all weak learners having the same learning algorithm and an
equal vote?
(a) Ensemble learning tends to reduce the bias of your classification al-
gorithm.
(b) Ensemble learning can be used to avoid overfitting.
(c) Ensemble learning tends to reduce the variance of your classification
algorithm.
(d) Ensemble learning can be used to avoid underfitting.
Solution: In ensemble learning, increasing the number of classi-
fiers reduces the variance of our model but generally has little
effect on the bias. Therefore, basic ensembling can be used to
avoid overfitting but would generally not be used to avoid un-
derfitting. (By contrast, AdaBoost reduces bias.)
4. Newton’s method always converges, but sometimes it is slower than Gradient Descent. [T/F]
Solution: False; Newton’s method is not guaranteed to converge and
can even diverge.
5. What is the loss function and regularizer of what is typically referred to
as “SVM”?
Solution: hinge loss + ℓ2 regularization.
6. Name one scenario when you want to use the Huber loss over L2 and vice
versa.
Solution: Huber loss: if the data contain bad outliers. L2 loss: if you
want to estimate the conditional mean of the labels.
7. For each algorithm name one assumption that it makes on the data:
• Naive Bayes
• Logistic Regression
• SVM
• Regression
• Decision Trees
Solution: refer to notes
8. Let x, y ∈ Rd be two points (e.g. sample or test points). Consider the func-
tion k(x, y) = xT rev(y), where rev(y) reverses the order of the components
in y; for example, rev([1, 2, 3]) = [3, 2, 1]. Show that k cannot be a valid kernel function.
Solution: We have that k((−1, 1), (−1, 1)) = −2, but this is impos-
sible: if k is a valid kernel, then there is some function ϕ such
that k(x, x) = ϕ(x)T ϕ(x) ≥ 0.
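The counterexample can be verified directly (here rev is implemented with Python’s built-in reversed):

```python
# k(x, y) = x^T rev(y); for a valid kernel, k(x, x) must be >= 0.
def k(x, y):
    return sum(a * b for a, b in zip(x, reversed(y)))

x = (-1, 1)
print(k(x, x))  # -2, so k is not positive semidefinite and not a kernel
```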