University of Engineering and Technology
Department of Electrical Engineering
EE 439: Introduction to Machine Learning
Spring 2026
Problem Set 11
Issue Date: February 20, 2026    Due Date: February 27, 2026
1. Newton’s method for computing least squares.
In this problem, we will prove that if we use Newton’s method to solve the least squares optimization
problem, then we only need one iteration to converge to θ∗ .
(a) Find the Hessian of the cost function
J(θ) = (1/2) Σ_{i=1}^{m} (θ^T x^{(i)} − y^{(i)})^2 .
(b) Show that the first iteration of Newton’s method gives us
θ∗ = (X^T X)^{−1} X^T ⃗y ,
the solution to our least squares problem.
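The claim in (b) can be sanity-checked numerically: because J is quadratic, a single Newton step from an arbitrary starting point should land exactly on the normal-equations solution. A sketch on synthetic data (the data and starting point below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # synthetic m x n design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

theta0 = rng.normal(size=3)                      # arbitrary starting point

# For J(theta) = (1/2)||X theta - y||^2 the gradient is X^T (X theta - y)
# and the Hessian X^T X is constant, so Newton's step is exact.
grad = X.T @ (X @ theta0 - y)
H = X.T @ X
theta1 = theta0 - np.linalg.solve(H, grad)       # one Newton step

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # normal-equations solution
print(np.allclose(theta1, theta_star))           # True
```

Algebraically the step collapses to θ₁ = (X^T X)^{−1} X^T ⃗y regardless of θ₀, which is exactly what part (b) asks you to show.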
2. Locally-weighted logistic regression
In this problem you will implement a locally-weighted version of logistic regression, where we
weight different training examples differently according to the query point. The locally-weighted
logistic regression problem is to maximize
ℓ(θ) = −(λ/2) θ^T θ + Σ_{i=1}^{m} w^{(i)} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] .
The −(λ/2) θ^T θ here is a regularization term (λ is the regularization parameter), which will be
discussed in a future lecture, but which we include here because it is needed for Newton’s method
to perform well on this task. For the entirety of this problem you can use the value λ = 0.0001.
Using this definition, the gradient of ℓ(θ) is given by
∇_θ ℓ(θ) = X^T z − λθ
where z ∈ R^m is defined by
z_i = w^{(i)} (y^{(i)} − h_θ(x^{(i)}))
and the Hessian is given by
H = X^T D X − λI
where D ∈ R^{m×m} is a diagonal matrix with
D_{ii} = −w^{(i)} h_θ(x^{(i)}) (1 − h_θ(x^{(i)})) .
For the sake of this problem you can just use the above formulas, but you should try to derive
these results for yourself as well.
Questions 1, 2, 3 are from Stanford’s CS 229
Given a query point x, we compute the weights
w^{(i)} = exp( −‖x − x^{(i)}‖^2 / (2τ^2) ) .
Much like the locally weighted linear regression that was discussed in class, this weighting scheme
gives more weight to the “nearby” points when predicting the class of a new example.
(a) Implement the Newton–Raphson algorithm for optimizing ℓ(θ) for a new query point x, and
use this to predict the class of x.
You should implement a function of the form
y = lwlr(X_train, y_train, x, tau)
in Python. This function takes as input the training set (the X_train and y_train matrices,
in the form described in the class notes), a new query point x, and the weight bandwidth
tau.
Given this input, the function should 1) compute weights w^{(i)} for each training example
using the formula above, 2) maximize ℓ(θ) using Newton’s method, and finally 3) output
y = 1{h_θ(x) > 0.5} as the prediction.
No starter code will be provided. You may structure your program as you wish, but your
implementation must explicitly compute the gradient and Hessian using the formulas given
in the problem statement and perform iterative Newton updates until convergence.
You may use numerical libraries such as numpy for linear algebra operations. However,
you may not use high-level machine learning libraries (e.g., scikit-learn) to perform the
optimization. You may use the loadtxt() function from numpy to load the given data.
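One possible shape for such a function, using the gradient and Hessian formulas given above, is sketched below. This is a shape reference, not starter code or a reference solution: the sigmoid helper, the convergence tolerance, and the iteration cap are our own choices, and it assumes x_train already includes any intercept column you want.

```python
import numpy as np

def lwlr(x_train, y_train, x, tau, lam=1e-4, tol=1e-6, max_iter=50):
    """Locally weighted logistic regression prediction at a query point x.

    Maximizes l(theta) by Newton's method, using the gradient
    X^T z - lam*theta and Hessian X^T D X - lam*I from the problem statement.
    """
    m, n = x_train.shape
    # Weights w^(i) = exp(-||x - x^(i)||^2 / (2 tau^2))
    w = np.exp(-np.sum((x_train - x) ** 2, axis=1) / (2 * tau ** 2))

    theta = np.zeros(n)
    for _ in range(max_iter):
        h = 1.0 / (1.0 + np.exp(-x_train @ theta))        # sigmoid h_theta
        z = w * (y_train - h)
        grad = x_train.T @ z - lam * theta
        D = np.diag(-w * h * (1 - h))
        H = x_train.T @ D @ x_train - lam * np.eye(n)
        step = np.linalg.solve(H, grad)
        theta = theta - step                              # Newton update
        if np.linalg.norm(step) < tol:
            break
    # Predict y = 1{h_theta(x) > 0.5}
    return int(1.0 / (1.0 + np.exp(-x @ theta)) > 0.5)
```

Note that the Hessian here is negative definite (D has non-positive diagonal and the −λI term is strict), so the Newton system is always solvable.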
To visualize the classifier, you may evaluate your lwlr function over a grid of points and
plot the resulting predictions, coloring regions according to whether the classifier predicts
y = 0 or y = 1. Depending on how efficient your implementation is, generating a fine grid
may take some time. We recommend debugging with a coarse resolution (e.g., 50), and later
increasing it to at least 200 to better visualize the decision boundary.
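The grid sweep itself is independent of the classifier; a minimal sketch with a trivial stand-in predictor in place of lwlr (the stand-in rule, range, and resolution below are illustrative choices only):

```python
import numpy as np

def predict(x):
    # Stand-in for lwlr(X_train, y_train, x, tau); any 0/1 classifier fits here.
    return int(x[0] + x[1] > 0)

res = 50                                  # coarse resolution for debugging
xs = np.linspace(-1.0, 1.0, res)
ys = np.linspace(-1.0, 1.0, res)
grid = np.zeros((res, res), dtype=int)
for i, yv in enumerate(ys):               # rows index the y-axis
    for j, xv in enumerate(xs):
        grid[i, j] = predict(np.array([xv, yv]))

# With matplotlib, the two regions could then be drawn with, e.g.,
# plt.pcolormesh(xs, ys, grid).
```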
(b) Evaluate the system with a variety of different bandwidth parameters τ . In particular, try
τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying
this parameter? Can you predict what the decision boundary of ordinary (unweighted)
logistic regression would look like?
3. Multivariate least squares
So far in class, we have only considered cases where our target variable y is a scalar value. Suppose
that instead of trying to predict a single output, we have a training set with multiple outputs for
each example:
{(x^{(i)} , y^{(i)} ), i = 1, . . . , m},   x^{(i)} ∈ R^n ,  y^{(i)} ∈ R^p .
Thus for each training example, y^{(i)} is vector-valued, with p entries. We wish to use a linear
model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in
y = Θ^T x,
where Θ ∈ R^{n×p} .
(a) The cost function for this case is
J(Θ) = (1/2) Σ_{i=1}^{m} Σ_{j=1}^{p} ( (Θ^T x^{(i)})_j − y_j^{(i)} )^2 .
Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: Start
with the m × n design matrix X whose i-th row is (x^{(i)})^T , and the m × p target matrix
Y whose i-th row is (y^{(i)})^T , and then work out how to express J(Θ) in terms of these
matrices.]
(b) Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent of the
normal equations for the multivariate case.
(c) Suppose instead of considering the multivariate vectors y^{(i)} all at once, we instead compute
each variable y_j^{(i)} separately for each j = 1, . . . , p. In this case, we have p individual linear
models, of the form
y_j^{(i)} = θ_j^T x^{(i)} ,  j = 1, . . . , p.
(So here, each θ_j ∈ R^n .) How do the parameters from these p independent least squares
problems compare to the multivariate solution?
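Whatever closed form you derive in (b), it can be sanity-checked numerically, and the same check lets you test your answer to (c): the sketch below (synthetic data) fits all p outputs at once with numpy's least-squares routine and compares against p separate single-output fits. This checks an answer; it does not replace the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 100, 4, 3
X = rng.normal(size=(m, n))                  # design matrix
Y = rng.normal(size=(m, p))                  # one row of p targets per example

# Joint fit: minimize ||X Theta - Y||_F^2 over the n x p matrix Theta
Theta_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# p separate scalar-output fits, one per column of Y
Theta_sep = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(p)]
)

print(np.allclose(Theta_joint, Theta_sep))   # True
```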
4. Modeling and Estimating Emergency Call Rates
The city of Metroville records the number of emergency calls received per hour in one district.
Let X denote the number of calls in a randomly selected hour. A statistician proposes the model
X ∼ Poisson(λ), λ > 0,
where λ is the average number of calls per hour. Recall that for a Poisson random variable,
E[X] = λ, Var(X) = λ.
The city collects data over n = 200 hours and reports:
X̄ = 2.4 ,  S^2 = 2.5 ,
where
X̄ = (1/n) Σ_{i=1}^{n} X_i ,   S^2 = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)^2 .
Define the dispersion ratio
D = S^2 / X̄ .
For large n, under a Poisson model, D is approximately centered at 1 with standard deviation
√(2/n) .
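The reported summary statistics are all that is needed to evaluate D and the reference scale √(2/n); a quick numeric sketch:

```python
import math

n = 200
xbar = 2.4                   # reported sample mean
s2 = 2.5                     # reported sample variance

D = s2 / xbar                # dispersion ratio
ref_sd = math.sqrt(2 / n)    # approximate sd of D under the Poisson model

print(D, ref_sd)             # D is about 1.04; the reference scale is 0.1
```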
(a) Model Evaluation
i. What is the support of the Poisson distribution? Explain why this makes it a natural
(or unnatural) model for hourly call counts.
ii. Let X1 ∼ Poisson(λ1 ) and X2 ∼ Poisson(λ2 ) be independent. Prove that
X1 + X2 ∼ Poisson(λ1 + λ2 ).
iii. Suppose emergency calls arise independently from two different neighborhoods, or from
two independent sources (e.g., medical and fire calls), with hourly counts
X1 ∼ Poisson(λ1 ), X2 ∼ Poisson(λ2 ),
and total calls X = X1 +X2 . Using the result above, explain why the additivity property
is desirable in this context. Interpret λ1 + λ2 in terms of call intensities.
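A quick simulation can corroborate (though of course not prove) the additivity result in part ii; the rates below are arbitrary choices that happen to sum to 2.4:

```python
import numpy as np

rng = np.random.default_rng(0)
lam1, lam2, N = 1.5, 0.9, 200_000    # arbitrary rates; N simulated hours

total = rng.poisson(lam1, N) + rng.poisson(lam2, N)   # X1 + X2, hour by hour
direct = rng.poisson(lam1 + lam2, N)                  # Poisson(lam1 + lam2)

# Both samples should show mean and variance close to lam1 + lam2 = 2.4
print(total.mean(), total.var(), direct.mean(), direct.var())
```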
iv. Give two realistic situations in which the Poisson model might fail for emergency call
data. For each, clearly state which modeling assumption is violated.
v. Give two mathematical reasons why a normal distribution may be inappropriate for
modeling hourly call counts. Your answer must explicitly refer to (i) the support of the
normal distribution and (ii) at least one additional structural property of the normal
family.
vi. What additional assumptions about the data-generating process would be required to
justify a binomial model for hourly call counts? Would these assumptions be reasonable
in this setting? Briefly explain.
vii. Compute the dispersion ratio D. Using the approximation above, assess whether the
deviation of D from 1 is consistent with sampling variability when n = 200. Briefly
comment on whether the Poisson model appears reasonable.
(b) Maximum Likelihood Estimation
Assume for the remainder of the problem that the Poisson model is adopted.
i. Write down the likelihood function L(λ) and log-likelihood function ℓ(λ) for observations
X1 , . . . , Xn .
ii. Show that the maximum likelihood estimator is
λ̂MLE = X̄.
iii. Compute the numerical value of λ̂MLE .
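Since the raw counts are not provided, one can at least check on simulated Poisson data that maximizing the log-likelihood over a grid recovers the sample mean; the simulated rate and the grid below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(2.4, size=200)        # simulated hourly counts

def loglik(lam):
    # Poisson log-likelihood up to the constant -sum(log(x_i!))
    return np.sum(x * np.log(lam) - lam)

lams = np.linspace(0.5, 5.0, 4501)    # grid with spacing 0.001
lam_hat = lams[np.argmax([loglik(l) for l in lams])]

print(lam_hat, x.mean())              # maximizer matches X-bar to grid precision
```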
(c) Maximum A Posteriori Estimation
City planners believe, based on historical data from similar districts, that the typical call
rate is around 2 calls per hour and that extremely large rates are unlikely. They model this
belief with a Gamma prior:
p(λ) = (β^α / Γ(α)) λ^{α−1} e^{−βλ} ,   λ > 0,
with α = 4 and β = 2.
i. Write down the posterior distribution p(λ | data) up to proportionality.
ii. Derive the MAP estimator λ̂MAP .
iii. Compare λ̂MLE and λ̂MAP . Explain why they differ and interpret the MAP estimator as
a form of shrinkage.
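Without giving away the derivation in part ii, the MAP estimate can be located numerically by maximizing the log posterior over a grid, and on simulated counts (the raw data are not provided) one can watch it being pulled toward the prior relative to the MLE:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(2.4, size=200)   # simulated hourly counts (rate is arbitrary)
alpha, beta = 4.0, 2.0           # Gamma prior hyperparameters from the problem

def log_post(lam):
    # log posterior up to an additive constant:
    # Poisson log-likelihood plus the log Gamma(alpha, beta) prior density
    return np.sum(x * np.log(lam) - lam) + (alpha - 1) * np.log(lam) - beta * lam

lams = np.linspace(0.5, 5.0, 4501)
lam_map = lams[np.argmax([log_post(l) for l in lams])]
lam_mle = x.mean()

print(lam_map, lam_mle)          # MAP sits slightly below the MLE, toward the prior
```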
(d) Prediction
i. Using the MLE plug-in estimate, write an expression for the probability that in the next
hour there will be at least 5 calls.
ii. Write the analogous expression using the MAP plug-in estimate.
iii. Explain conceptually why a full Bayesian predictive distribution would generally differ
from both plug-in approaches. No computation is required.
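For parts i and ii, whichever plug-in rate λ̂ is used, the quantity P(X ≥ 5) = 1 − Σ_{k=0}^{4} e^{−λ̂} λ̂^k / k! is straightforward to evaluate; a sketch with a generic rate:

```python
import math

def prob_at_least_5(lam):
    # P(X >= 5) for X ~ Poisson(lam): one minus the Poisson CDF at 4
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(5))

# e.g., with a plug-in rate equal to the sample mean:
print(prob_at_least_5(2.4))
```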