CIS 519 Machine Learning Assignment 2

This document provides instructions for Assignment 2 of the CIS 419/519 Introduction to Machine Learning course. It outlines the collaboration policy, formatting requirements, and files needed to complete the assignment. The assignment consists of two parts - a problem set involving theoretical questions and programming exercises involving implementing polynomial regression and exploring the bias-variance tradeoff. Students are instructed to complete the assignment individually and are prohibited from sharing solutions. The programming exercises involve modifying linear regression code to fit polynomial models and exploring learning curves.

CIS 419/519 Introduction to Machine Learning

Assignment 2
Due: October 11, 2017 11:59pm

Instructions
Read all instructions in this section thoroughly.
Collaboration: Make certain that you understand the course collaboration policy, described on the course
website. You must complete this assignment individually; you are not allowed to collaborate with anyone
else. You may discuss the homework to understand the problems and the mathematics behind the various
learning algorithms, but you are not allowed to share problem solutions or your code with any
other students. You must also not consult code on the internet that is directly related to the programming
exercise. We will be using automatic checking software to detect academic dishonesty, so please don’t do it.
You are also prohibited from posting any part of your solution to the internet, even after the course is
complete. Similarly, please don’t post this PDF file or the homework skeleton code to the internet.
Formatting: This assignment consists of two parts: a problem set and programming exercises.
For the problem set, you must write up your solutions electronically and submit them as a single PDF document.
We will not accept handwritten or paper copies of the homework. Your problem set solutions must use
proper mathematical formatting. For this reason, we strongly encourage you to write up your responses
using LaTeX. (Alternative word processors, such as MS Word, produce very poorly formatted mathematics.)
Your solutions to the programming exercises must be implemented in Python, following the precise
instructions included in Part 2. Portions of the programming exercises will be graded automatically, so it is imperative
that your code follows the specified API. A few parts of the programming exercises ask you to create plots
or describe results; these should be included in the same PDF document that you create for the problem set.
Homework Template and Files to Get You Started: The homework zip file contains the skeleton
code and data sets that you will require for this assignment. Please read through the documentation
provided in ALL files before starting the assignment.
Citing Your Sources: Any sources of help that you consult while completing this assignment (other
students, textbooks, websites, etc.) *MUST* be noted in your README file. This includes anyone you
briefly discussed the homework with. If you received help from the following sources, you do not need to cite
them: course instructor, course teaching assistants, course lecture notes, course textbooks or other readings.
Submitting Your Solution: We will post instructions for submitting your solution one week before the
assignment is due. Be sure to check Piazza then for details.
CIS 519 ONLY Problems: Several problems are marked as “[CIS 519 ONLY]” in this assignment. Only
students enrolled in CIS 519 are required to complete these problems. However, we do encourage students
in CIS 419 to read through these problems, although you are not required to complete them.
All homeworks will receive a percentage grade, but CIS 519 homeworks will be graded out of a different
total number of points than CIS 419 homeworks. Students in CIS 419 choosing to complete CIS 519 ONLY
exercises will not receive any credit for answers to these questions (i.e., they will not count as extra credit
nor will they compensate for points lost on other problems).
Acknowledgements: Parts of the programming exercises have been adapted from course materials by
Andrew Ng.
Assignment Version 20170926a

PART I: PROBLEM SET
Your solutions to the problems will be submitted as a single PDF document. Be certain that your problems
are well-numbered and that it is clear what your answers are. Additionally, you will be required to duplicate
your answers to particular problems in the README file that you will submit.

1 Gradient Descent (5 pts)

Let $k$ be a counter for the iterations of gradient descent, and let $\alpha_k$ be the learning rate for the $k$-th step
of gradient descent. In one sentence, what are the implications of using a constant value for $\alpha_k$ in gradient
descent? In another sentence, what are the implications of setting $\alpha_k$ as a function of $k$?

2 Fitting an SVM by Hand (15 pts)

[Adapted from Murphy & Jaakkola] Consider a dataset with only 2 points in 1D: $(x_1 = 0, y_1 = -1)$ and
$(x_2 = \sqrt{2}, y_2 = +1)$. Consider mapping each point to 3D using the feature vector $\phi(x) = [1, \sqrt{2}x, x^2]^\top$ (i.e.,
use a 2nd-order polynomial kernel). The maximum margin classifier has the form
$$\min \|w\|_2^2 \quad \text{s.t.} \tag{1}$$
$$y_1 (w^\top \phi(x_1) + w_0) \geq 1 \tag{2}$$
$$y_2 (w^\top \phi(x_2) + w_0) \geq 1 \tag{3}$$
a.) Write down a vector that is parallel to the optimal vector $w$. (Hint: recall that $w$ is orthogonal to the
decision boundary between the two points in 3D space.)
b.) What is the value of the margin that is achieved by $w$? (Hint: think about the geometry of two points
in space, with a line separating one from the other.)
c.) Solve for $w$, using the fact that the margin is equal to $2 / \|w\|_2$.

d.) Solve for $w_0$, using your value of $w$ and the two constraints (2)–(3) for the max margin classifier.
e.) Write down the form of the discriminant $h(x) = w_0 + w^\top \phi(x)$ as an explicit function in terms of $x$.

3 VC Dimension (519 ONLY) (10 pts)

a.) Imagine that we are working in $\mathbb{R}^d$ space and we are using a hyper-dimensional sphere centered at the
origin as a classifier. Anything inside the sphere is considered positive, and the only thing we can do to train
the model is to adjust the radius of the sphere (it stays centered at the origin). What is the VC dimension
of this classifier?
b.) Now, imagine we are able to change the direction of the classification surface, so that we could have
anything inside the sphere be predicted positive or everything inside the sphere be predicted negative (our
choice). What is the VC dimension of this classifier now?

PART II: PROGRAMMING EXERCISES

1 Polynomial Regression (25 pts)

In the previous assignment, you implemented linear regression. In the first implementation exercise, you will
modify your implementation to fit a polynomial model and explore the bias/variance tradeoff.
Relevant Files in Homework Skeleton¹
• [Link]
• linreg [Link]
• test polyreg [Link]
• test polyreg [Link]
• data/[Link]

1.1 Implementing Polynomial Regression

Recall that polynomial regression learns a function $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_d x^d$. In this case, $d$
represents the polynomial’s degree. We can equivalently write this in the form of a generalized linear model
$$h_\theta(x) = \theta_0 \phi_0(x) + \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \ldots + \theta_d \phi_d(x) , \tag{4}$$
using the basis expansion that $\phi_j(x) = x^j$. Notice that, with this basis expansion, we obtain a linear model
where the features are various powers of the single univariate $x$. We’re still solving a linear regression
problem, but are fitting a polynomial function of the input.
Implement regularized polynomial regression in [Link]. You may implement it however you like, using
gradient descent or a closed-form solution. However, I would recommend the closed-form solution since the
data sets are small; for this reason, we’ve included an example closed-form implementation of linear regression
in linreg [Link] (you are welcome to build upon this implementation, but make CERTAIN you
understand it, since you’ll need to change several lines of it). You are also welcome to build upon your
implementation from the previous assignment, but you must follow the API below. Note that all matrices
are actually 2D numpy arrays in the implementation.
• __init__(degree=1, regLambda=1E-8): constructor with arguments of d and λ
• fit(X,Y): method to train the polynomial regression model
• predict(X): method to use the trained polynomial regression model for prediction
• polyfeatures(X, degree): expands the given n × 1 matrix X into an n × d matrix of polynomial
features of degree d. Note that the returned matrix will not include the zero-th power.
Note that the polyfeatures(X, degree) function maps the original univariate data into its higher order
powers. Specifically, X will be an n × 1 matrix ($X \in \mathbb{R}^{n \times 1}$) and this function will return the polynomial
expansion of this data, an n × d matrix. Note that this function will not add in the zero-th order feature (i.e., $x^0 = 1$).
You should add the $x^0$ feature separately, outside of this function, before training the model. Leaving
the $x^0$ column out of the matrix returned by polyfeatures() makes the function more
general, so it could be applied to multivariate data as well. (If it did add the $x^0$ feature, we’d end up with multiple
columns of 1’s for multivariate data.)
Also, notice that the resulting features will be badly scaled if we use them in raw form. For example, with a
polynomial of degree d = 8 and x = 20, the basis expansion yields $x^1 = 20$ while $x^8 = 2.56 \times 10^{10}$, an
absolutely huge difference in range. Consequently, we will need to standardize the data before solving linear
regression. Standardize the data in fit() after you perform the polynomial feature expansion. You’ll need to
apply the same standardization transformation in predict() before you apply the model to new data.

Figure 1: Fit of polynomial regression with λ = 0 and d = 8

1 Bold text indicates files or functions that you will need to complete; you should not need to modify any of the other files.

Run test polyreg [Link] to test your implementation, which will plot the learned function. In this case, the
script fits a polynomial of degree d = 8 with no regularization (λ = 0). From the plot, we see that the
function fits the data well, but will not generalize well to new data points. Try increasing the amount of
regularization, and examine the resulting effect on the function.
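The feature expansion and standardization steps described in this section can be sketched as follows. This is only an illustrative sketch, not the skeleton code: the helper names standardize_fit and standardize_apply are hypothetical, and the choice to standardize each column by its training mean and standard deviation is an assumption about what "standardize" means here.

```python
import numpy as np


def polyfeatures(X, degree):
    """Expand an n-by-1 array into the n-by-degree array [x, x^2, ..., x^degree]."""
    X = np.asarray(X, dtype=float).reshape(-1, 1)
    powers = np.arange(1, degree + 1)   # 1..degree, so no zero-th power column
    return X ** powers                  # broadcasting: (n, 1) ** (degree,) -> (n, degree)


def standardize_fit(F):
    """Hypothetical helper: column means and standard deviations of training features."""
    mean = F.mean(axis=0)
    std = F.std(axis=0)
    std[std == 0] = 1.0                 # avoid dividing by zero for constant columns
    return mean, std


def standardize_apply(F, mean, std):
    """Hypothetical helper: apply a previously computed standardization."""
    return (F - mean) / std


# Training time: expand, standardize, then add the column of ones (the x^0 feature).
X_train = np.array([[1.0], [2.0], [3.0]])
F_train = polyfeatures(X_train, 3)
mean, std = standardize_fit(F_train)
Z_train = np.c_[np.ones((len(F_train), 1)), standardize_apply(F_train, mean, std)]

# Prediction time: reuse the training mean and std, never recompute them on new data.
X_new = np.array([[2.5]])
Z_new = np.c_[np.ones((1, 1)), standardize_apply(polyfeatures(X_new, 3), mean, std)]
```

Storing the training mean and standard deviation on the model object in fit() is one way to make them available to predict().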

1.2 Examine the Bias-Variance Tradeoff through Learning Curves

Learning curves provide a valuable mechanism for evaluating the bias-variance tradeoff. We have provided
the learningCurve() function in [Link] to compute the learning curves for a given training/test
set. You don’t need to do anything to this function—it is ready to use—but be sure to read through and
understand how it works. The learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda)
function takes in the training data (Xtrain, ytrain), the testing data (Xtest, ytest), and values for the
polynomial degree d and regularization parameter λ.
The function returns two arrays, errorTrain (the array of training errors) and errorTest (the array of
testing errors). The ith index (starting from 0) of each array returns the training error (or testing error)
for learning with i + 1 training instances. Note that the 0th index actually won’t matter, since we start
displaying the learning curves with two or more instances.
When computing the learning curves, we learn on Xtrain[0:i] for i = 1, . . . , numInstances(Xtrain), each
time computing the testing error over the entire test set. Recall that the error for regression problems is
given by
$$\frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)^2 . \tag{5}$$
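The indexing convention described above (entry i holds the error after training on i + 1 instances) can be illustrated with a small stand-alone sketch. It is not the provided learningCurve() function: the names fit_ridge and learning_curve_sketch are hypothetical, the model is a plain closed-form regularized least-squares fit, and the design matrices are assumed to already contain the column of ones.

```python
import numpy as np


def mean_squared_error(theta, X, y):
    """Mean squared error of the linear model X @ theta against targets y."""
    residual = X @ theta - y
    return float(np.mean(residual ** 2))


def fit_ridge(X, y, reg_lambda):
    """Closed-form regularized least squares; column 0 (the ones column) is not penalized."""
    reg = reg_lambda * np.eye(X.shape[1])
    reg[0, 0] = 0.0
    # pinv keeps the solve well defined even when there are fewer instances than features
    return np.linalg.pinv(X.T @ X + reg) @ X.T @ y


def learning_curve_sketch(Xtrain, ytrain, Xtest, ytest, reg_lambda):
    n = len(Xtrain)
    error_train = np.zeros(n)
    error_test = np.zeros(n)
    for i in range(1, n + 1):
        theta = fit_ridge(Xtrain[:i], ytrain[:i], reg_lambda)
        # index i - 1 stores the errors obtained with i training instances
        error_train[i - 1] = mean_squared_error(theta, Xtrain[:i], ytrain[:i])
        error_test[i - 1] = mean_squared_error(theta, Xtest, ytest)
    return error_train, error_test


# Toy data lying exactly on the line y = 1 + 2x, so two or more points recover it.
x_tr = np.linspace(0.0, 1.0, 6)
x_te = np.array([0.1, 0.5, 0.9])
X_tr, y_tr = np.c_[np.ones(6), x_tr], 1.0 + 2.0 * x_tr
X_te, y_te = np.c_[np.ones(3), x_te], 1.0 + 2.0 * x_te
err_train, err_test = learning_curve_sketch(X_tr, y_tr, X_te, y_te, reg_lambda=0.0)
```

On real data the two arrays would be plotted against the number of training instances, starting from two instances as the text notes.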

Once you thoroughly understand the learningCurve() function, run the test polyreg [Link]
script to plot the learning curves for various values of λ and d. You should see plots similar to the following:

Notice the following:

• The y-axis is using a log-scale and the ranges of the y-scale are all different for the plots. The dashed
black line indicates the y = 1 line as a point of reference between the plots.

• The plot of the unregularized model with d = 1 shows poor training error, indicating a high bias (i.e.,
it is a standard univariate linear regression fit).
• The plot of the unregularized model (λ = 0) with d = 8 shows that the training error is low, but that
the testing error is high. There is a huge gap between the training and testing errors caused by the
model overfitting the training data, indicating a high variance problem.

• As the regularization parameter increases (e.g., λ = 1) with d = 8, we see that the gap between the
training and testing error narrows, with both the training and testing errors converging to a low value.
We can see that the model fits the data well and generalizes well, and therefore does not have either a
high bias or a high variance problem. Effectively, it has a good tradeoff between bias and variance.

• Once the regularization parameter is too high (λ = 100), we see that the training and testing errors
are once again high, indicating a poor fit. Effectively, there is too much regularization, resulting in
high bias.

Make absolutely certain that you understand these observations, and how they relate to the learning curve
plots. In practice, we can choose the value for λ via cross-validation to achieve the best bias-variance tradeoff.

2 Logistic Regression (35 points)
Now that we’ve implemented a basic regression model using gradient descent, we will use a similar technique
to implement the logistic regression classifier.
Relevant Files in Homework Skeleton2

• test [Link] - python script to test logistic regression

• test [Link] - python script to test logistic regression on non-linearly separable data (519 ONLY)
• data/[Link] - Student admissions prediction data, nearly linearly separable

• data/[Link] - Microchip data, non-linearly separable

• [Link] - Map instances to a higher dimensional polynomial feature space (519 ONLY).
• [Link]: implements the LogisticRegression class

2.1 Implementation
Implement regularized logistic regression by completing the LogisticRegression class in [Link]. Your
class must implement the following API:

• __init__(alpha, regLambda, epsilon, maxNumIters): the constructor, which takes in α, λ, ε, and
maxNumIters as arguments

• fit(X,y): train the classifier from labeled data (X, y)

• predict(X): return a vector of n predictions for each of n rows of X
• computeCost(theta, X, y, regLambda): computes the logistic regression objective function for the
given values of θ, X, y, and λ (“lambda” is a keyword in python, so we must call the regularization
parameter something different)
• computeGradient(theta, X, y, reg): computes the d-dimensional gradient of the logistic regression
objective function for the given values of θ, X, y, and reg = λ
• sigmoid(z): returns the sigmoid function of z

Note that these methods have already been defined correctly for you in [Link]; be very careful not to
change the API.

Sigmoid Function You should begin by implementing the sigmoid(z) function. Recall that the logistic
regression hypothesis $h_\theta()$ is defined as:

$$h_\theta(x) = g(\theta^\top x)$$
where $g()$ is the sigmoid function defined as:

$$g(z) = \frac{1}{1 + \exp(-z)}$$
The sigmoid function has the property that $g(+\infty) \approx 1$ and $g(-\infty) \approx 0$. Test your function by calling
sigmoid(z) on different test samples. Be certain that your sigmoid function works with both
vectors and matrices: for either a vector or a matrix, your function should apply the sigmoid
to every element.
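As an illustration of the element-wise behavior asked for above, here is a minimal sketch of a vectorized sigmoid. Writing it through np.logaddexp is a design choice made for this sketch (it avoids overflow warnings for very negative inputs); the direct form 1 / (1 + np.exp(-z)) computes the same function.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z)) for scalars, vectors, or matrices."""
    z = np.asarray(z, dtype=float)
    # log(1 + exp(-z)) computed stably, then exponentiated with a minus sign
    return np.exp(-np.logaddexp(0.0, -z))


values = sigmoid(np.array([[-1000.0, 0.0], [1000.0, 2.0]]))
```

The output has the same shape as the input, so the same function can be applied to a single score, a vector of scores, or a matrix.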
2 Bold text indicates files that you will need to complete; you should not need to modify any of the other files.

Cost Function and Gradient Implement the functions to compute the cost function and the gradient
of the cost function. Recall the cost function for logistic regression is a scalar value given by
$$J(\theta) = \sum_{i=1}^{n} \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2} \|\theta\|_2^2 .$$
The gradient of the cost function is a d-dimensional vector, where the $j$-th element (for $j = 1 \ldots d$) is given by
$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \lambda \theta_j .$$
We must be careful not to regularize the $\theta_0$ parameter (corresponding to the 1’s feature we add to each
instance), and so
$$\frac{\partial J(\theta)}{\partial \theta_0} = \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)}) .$$
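The cost and gradient formulas above can be sketched in vectorized form as follows. This is a hedged illustration rather than the required implementation: the snake_case names are hypothetical, θ and y are assumed to be 1-D arrays (the skeleton may use 2-D column arrays instead), and the penalty is assumed to skip θ0 so that the cost agrees with the unregularized θ0 gradient.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def compute_cost(theta, X, y, reg_lambda):
    """Logistic regression objective; X already contains the leading column of ones."""
    h = sigmoid(X @ theta)
    data_term = np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    # Assumption: theta[0] is left out of the penalty, matching the gradient for theta_0.
    penalty = 0.5 * reg_lambda * np.sum(theta[1:] ** 2)
    return float(data_term + penalty)


def compute_gradient(theta, X, y, reg_lambda):
    """Gradient of compute_cost with respect to theta."""
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y)                  # sum over instances of (h - y) * x_j
    grad[1:] += reg_lambda * theta[1:]    # regularize every parameter except theta_0
    return grad
```

A quick way to check such a pair of functions is to compare compute_gradient against central finite differences of compute_cost at a few random values of θ.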

Training and Prediction Once you have the cost and gradient functions complete, implement the fit
and predict methods. To make absolutely certain that the un-regularized θ0 corresponds to the 1’s feature
that we add to the input, we will augment both the training and testing instances within the fit and
predict methods (instead of relying on it being done externally to the classifier). Recall that you can do
this via:
X = np.c_[np.ones((n, 1)), X]

Your fit method should train the model via gradient descent, relying on the cost and gradient functions.
Instead of simply running gradient descent for a specific number of iterations (as in the linear regression
exercise), we will use a more sophisticated method: we will stop it after the solution has converged. Stop
the gradient descent procedure when θ stops changing between consecutive iterations. You can detect this
convergence when
$$\|\theta_{new} - \theta_{old}\|_2 \leq \epsilon , \tag{6}$$
for some small ε (e.g., ε = 10E-4). For readability, we’d recommend implementing this convergence test as
a dedicated function hasConverged. For safety, we will also set the maximum number of gradient descent
iterations, maxNumIters. The values of λ, ε, maxNumIters, and α (the gradient descent learning rate) are
arguments to LogisticRegression’s constructor. At the start of gradient descent, θ should be initialized
to random values with mean 0, as described in the linear regression exercise.
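The stopping rule in (6), combined with the iteration cap, might look like the sketch below. The function names and the generic grad_fn argument are assumptions made for illustration; in the assignment the loop would live inside fit() and call the class's own gradient method. The demo minimizes a simple quadratic instead of the logistic regression objective so that the result is easy to verify.

```python
import numpy as np


def has_converged(theta_new, theta_old, epsilon):
    """Convergence test: L2 norm of the parameter change is at most epsilon."""
    return np.linalg.norm(theta_new - theta_old) <= epsilon


def gradient_descent(grad_fn, theta_init, alpha, epsilon, max_num_iters):
    """Run gradient descent until convergence or until max_num_iters steps are taken."""
    theta = np.asarray(theta_init, dtype=float)
    for iteration in range(1, max_num_iters + 1):
        theta_new = theta - alpha * grad_fn(theta)
        if has_converged(theta_new, theta, epsilon):
            return theta_new, iteration
        theta = theta_new
    return theta, max_num_iters


# Demo on f(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
target = np.array([1.0, -2.0])
theta_hat, num_iters = gradient_descent(
    grad_fn=lambda t: 2.0 * (t - target),
    theta_init=np.zeros(2),
    alpha=0.1,
    epsilon=1e-6,
    max_num_iters=1000,
)
```

Returning the iteration count alongside θ makes it easy to see whether the loop stopped because of convergence or because it hit the cap.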

2.2 Testing Your Implementation

To test your logistic regression implementation, run python test [Link] from the command line. This
script trains a logistic regression model using your implementation and then uses it to predict whether or
not a student will be admitted to a school based on their scores on two exams. In the plot, the colors of the
points indicate their true class label and the background color indicates the predicted class label. If your
implementation is correct, the decision boundary should closely match the true class labels of the points,
with only a few errors.

2.3 Analysis
Varying λ changes the amount of regularization, which acts as a penalty on the objective function for complex
solutions. In test [Link], we provide the code to draw the decision boundary of the model. Vary the
value of λ in test [Link] and plot the decision boundary for each value. Include several plots in your
PDF writeup for this assignment, labeling each plot with the value of λ in your writeup. Write a brief
paragraph (2-3 sentences) describing what you observe as you increase the value of λ and explain why.

2.4 [CIS 519 ONLY] Learning a Nonlinear Decision Surface (10 pts)
We strongly encourage students in CIS 419 to read through this exercise in detail, even though you are not
required to complete it. This exercise is required only for students in CIS 519.
Some classification problems that are not linearly separable in a low-dimensional feature space become
separable in a higher-dimensional space. We can make our logistic regression classifier much more powerful by adding
additional features to the input. Imagine that we are given a data set with only two features, x1 and x2 , but
the decision surface is non-linear in these two features. We can expand the possible feature set into higher
dimensions by adding new features that are functions of the given features. For example, we could add a
new feature that is the product of the original features, or any other mathematical function of those two
features that we want.
In this example, we will map the two features into all polynomial terms of $x_1$ and $x_2$ up to the 6th power:
$$\text{mapFeature}(x_1, x_2) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \\ \vdots \\ x_1 x_2^5 \\ x_2^6 \end{bmatrix} \tag{7}$$
Given a 2-dimensional instance x, we can project that instance into a 28-dimensional feature space using
the transformation above. Then, instead of training the classifier on instances in the 2-dimensional space,
we train it on the 28-dimensional projection of those instances. The resulting decision boundary will then
be more complex and will appear non-linear in the 2D plot.
Complete the mapFeature function in [Link] to map instances from a 2D feature space to a 28-
dimensional feature space defined by all polynomial terms of x1 and x2 up to the sixth power. Your function
mapFeature(X1, X2) will take in two column matrices, X1 and X2, which correspond to the 1st and 2nd
columns respectively of the data set X (recall that X is n-by-d, and so both X1 and X2 will have n entries).
It will return an n-by-28 dimensional matrix, where each row represents the expanded feature set for one
instance.
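One way to enumerate the 28 terms in the order shown in (7) is to loop over the total degree and, within each degree, over the power of x2. The sketch below does this; the name map_feature and the max_degree parameter are hypothetical, and the inputs are assumed to be column arrays with n entries each.

```python
import numpy as np


def map_feature(X1, X2, max_degree=6):
    """Map two n-entry columns to all monomials x1^a * x2^b with a + b <= max_degree."""
    X1 = np.asarray(X1, dtype=float).reshape(-1)
    X2 = np.asarray(X2, dtype=float).reshape(-1)
    columns = []
    for total in range(max_degree + 1):       # total degree 0, 1, ..., max_degree
        for j in range(total + 1):            # power of x2 within that degree
            columns.append(X1 ** (total - j) * X2 ** j)
    return np.column_stack(columns)           # n-by-28 when max_degree is 6


features = map_feature(np.array([[2.0], [1.0]]), np.array([[3.0], [1.0]]))
```

The number of columns is 1 + 2 + ... + 7 = 28, one group of terms per total degree from 0 to 6.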
Once this function is complete, you can run python test [Link] from the command line. This script
will load a data set that is non-linearly separable, use your mapFeature function to project the instances
into a higher dimensional space, and then train a logistic regression model in that higher dimensional feature
space. When we project the resulting non-linear classifier back down into 2D, we get a decision surface that
appears similar to the following:

Figure 2: Nonlinear Logistic Regression Classifier

3 Support Vector Machines (20 points)
In this section, we’ll implement various kernels for the support vector machine (SVM). This exercise looks
long, but in practice you’ll be writing only a few lines of code. The scikit-learn package already includes
several SVM implementations; in this case, we’ll focus on the SVM for classification, [Link].
Before starting this assignment, be certain to read through the documentation for SVC, available at
http://[Link]/stable/modules/generated/[Link].
While we could certainly create our own SVM implementation, most people applying SVMs to real problems
rely on highly optimized SVM toolboxes, such as LIBSVM or SVMlight.³ These toolboxes provide highly
optimized SVM implementations that use a variety of optimization techniques to enable them to scale to
extremely large problems. Therefore, we will focus on implementing custom kernels for the SVMs, which is
often essential for applying these SVM toolboxes to real applications.
Relevant Files in Homework 2 Skeleton⁴
• example [Link]
• example [Link]
• [Link]
• test [Link]
• test [Link]
• data/[Link]
• data/[Link]

3.1 Getting Started

The SVC implementation provided with scikit-learn uses the parameter C to control the penalty for
misclassifying training instances. We can think of C as being similar to 1/λ, the inverse of the regularization
parameter λ that we used before for linear and logistic regression. Conceptually, C = 0 causes the SVM to incur no penalty for
misclassifications, which will encourage it to fit a larger-margin hyperplane, even if that hyperplane
misclassifies more training instances. As C grows large, it causes the SVM to try to classify all training examples
correctly, and so it will choose a smaller margin hyperplane if that hyperplane fits the training data better.
Examine example [Link], which fits a linear SVM to the data shown below. Note that most of the positive
and negative instances are grouped together, suggesting a clear separation between the classes, but there is
an outlier around (0.5,6.2). In the first part of this exercise, we will see how this outlier affects the SVM fit.
Run example [Link] with C = 0.01, and you can clearly see that the hyperplane follows the natural
separation between most of the data, but misclassifies the outlier. Try increasing the value of C and observe
the effect on the resulting hyperplane. With C = 1,000, we can see that the decision boundary correctly
classifies all training data, but clearly no longer captures the natural separation between the data.

3 The SVC implementation provided with scikit learn is based on LIBSVM, but is not quite as efficient.
4 Bold text indicates files that you will need to complete; you should not need to modify any of the other files.

3.2 Implementing Custom Kernels
The SVC implementation allows us to define our own kernel functions to learn non-linear decision surfaces
using the SVM. The SVC constructor’s kernel argument can be defined as either a string specifying one of
the built-in kernels (e.g., 'linear', 'poly' (polynomial), 'rbf' (radial basis function), 'sigmoid', etc.)
or it can take as input a custom kernel function, as we will define in this exercise.
For example, the following code snippet defines a custom kernel and uses it to train the SVM:
import numpy as np
from sklearn import svm

def myCustomKernel(X1, X2):
    """
    Custom kernel:
    k(X1, X2) = X1 (3 0) X2.T
                   (0 2)

    Note that X1 and X2 are numpy arrays, so we must use .dot to multiply them.
    """
    # a plain 2D array keeps the result an ordinary numpy array
    M = np.array([[3.0, 0], [0, 2.0]])
    return np.dot(np.dot(X1, M), X2.T)

# create SVM with custom kernel and train model
# (X and Y are the training instances and labels, loaded elsewhere)
clf = svm.SVC(kernel=myCustomKernel)
clf.fit(X, Y)

When the SVM calls the custom kernel function during training, X1 and X2 are both initialized to be
the same as X (i.e., $n_{train}$-by-$d$ numpy arrays); in other words, they both contain a complete copy of
the training instances. The custom kernel function returns an $n_{train}$-by-$n_{train}$ numpy array during the
training step. Later, when it is used for testing, X1 will be the $n_{test}$ testing instances and X2 will be the
$n_{train}$ training instances, and so it will return an $n_{test}$-by-$n_{train}$ numpy array. For a complete example, see
example [Link], which uses the custom kernel above to generate the following figure:

3.3 Implementing the Polynomial Kernel

We will start by writing our own implementation of the polynomial kernel and incorporate it into the SVM.⁵
Complete the myPolynomialKernel() function in [Link] to implement the polynomial kernel:
$$K(v, w) = (\langle v, w \rangle + 1)^d , \tag{8}$$
where d is the degree of polynomial and the “+1” incorporates all lower-order polynomial terms. In this
case, v and w are feature vectors. Vectorize your implementation to make it fast. Once complete, run
test [Link] to produce a plot of the decision surface. For comparison, it also shows the decision
surface learned using the equivalent built-in polynomial kernel; your results should be identical.
5 Although scikit learn already defines the polynomial kernel, defining our own version of it provides an easy way to get
started implementing custom kernels.

Figure 3: Sample output of test [Link]

For the built-in polynomial kernel, the degree is specified in the SVC constructor. However, in our custom
kernel we must set the degree via a global variable polyDegree. Therefore, be sure to use the value of the
global variable polyDegree as the degree in your polynomial kernel. The test [Link] script
uses polyDegree for the degree of both your custom kernel and the built-in polynomial kernel.
Vary both C and d and study how the SVM reacts.
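A vectorized form of the polynomial kernel in (8) can be sketched in a couple of lines. The function name below is a hypothetical snake_case stand-in, and the value assigned to the global polyDegree is arbitrary; the sketch only shows how the full kernel matrix follows from one matrix product, without any explicit loop over pairs of instances.

```python
import numpy as np

polyDegree = 3  # global degree, as the text describes; the value 3 is arbitrary


def my_polynomial_kernel(X1, X2):
    """Return the n1-by-n2 matrix with entries (<x1_i, x2_j> + 1) ** polyDegree."""
    # X1 @ X2.T holds every pairwise inner product between rows of X1 and rows of X2
    return (X1 @ X2.T + 1.0) ** polyDegree


K_example = my_polynomial_kernel(np.array([[1.0, 2.0]]), np.array([[3.0, 4.0], [0.0, 0.0]]))
```

With n1 rows in X1 and n2 rows in X2, the result is n1-by-n2, which matches the shapes the SVM expects from a custom kernel during training and testing.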

3.4 Implementing the Gaussian Radial Basis Function Kernel

Next, complete the myGaussianKernel() function in [Link] to implement the Gaussian kernel:
$$K(v, w) = \exp\left( -\frac{\|v - w\|_2^2}{2\sigma^2} \right) . \tag{9}$$
Be sure to use the gaussSigma for σ in your implementation. For computing the pairwise squared distances
between the points, you must write the method to compute it yourself; specifically you may not use the helper
methods available in [Link] or that come with scipy. You can test your implementation
and compare it to the equivalent RBF-kernel provided in sklearn by running test [Link].
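The pairwise squared distances needed by (9) can be computed with plain numpy through the identity ‖v − w‖² = ‖v‖² + ‖w‖² − 2⟨v, w⟩. The sketch below uses that identity; the helper name pairwise_squared_distances and the snake_case kernel name are hypothetical, and the value given to gaussSigma is arbitrary.

```python
import numpy as np

gaussSigma = 1.0  # global bandwidth, as the text describes; the value 1.0 is arbitrary


def pairwise_squared_distances(X1, X2):
    """Hypothetical helper: n1-by-n2 matrix of squared Euclidean distances."""
    sq1 = np.sum(X1 ** 2, axis=1).reshape(-1, 1)   # column of ||x1_i||^2
    sq2 = np.sum(X2 ** 2, axis=1).reshape(1, -1)   # row of ||x2_j||^2
    d2 = sq1 + sq2 - 2.0 * (X1 @ X2.T)
    return np.maximum(d2, 0.0)                     # clip tiny negatives from round-off


def my_gaussian_kernel(X1, X2):
    """Gaussian kernel matrix exp(-||v - w||^2 / (2 * sigma^2))."""
    return np.exp(-pairwise_squared_distances(X1, X2) / (2.0 * gaussSigma ** 2))


K_example = my_gaussian_kernel(np.array([[0.0, 0.0], [3.0, 4.0]]), np.array([[0.0, 0.0]]))
```

Clipping at zero is a defensive choice for this sketch: the expanded identity can produce values like -1e-16 for identical points, which would otherwise give kernel values slightly above 1.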

Figure 4: Sample output of test [Link]

Again, vary both C and σ and study how the SVM reacts.
Write a brief paragraph describing how the SVM reacts as both C and d vary for the polynomial kernel, and
as C and σ vary for the Gaussian kernel. Put this paragraph in your writeup, and also in the README file.

3.5 Choosing the Optimal Parameters

This exercise will help you gain further practical skill in applying SVMs with Gaussian kernels. Choosing
the correct values for C and σ can dramatically affect the quality of the model’s fit to data. Your task
is to determine the optimal values of C and σ for an SVM with your Gaussian kernel as applied to the data
in data/[Link], depicted below. You should use whatever method you like to search over the
space of possible combinations of C and σ, and determine the optimal fit as measured by accuracy. You may
use any built-in methods from scikit-learn you wish (e.g., [Link] [Link]). We recommend
that you search over multiplicative steps (e.g., ..., 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60, 100, ...).
Once you determine the optimal parameter values, report those optimal values and the corresponding es-
timated accuracy in the README file. For reference, the SVM with the optimal parameters we found
produced the following decision surface.

The resulting decision surface for your optimal parameters may look slightly different than ours.
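The multiplicative search grid recommended above, and an exhaustive search over it, can be sketched without any library support. Everything here is illustrative: multiplicative_grid, grid_search, and toy_score are hypothetical names, and toy_score is a made-up stand-in for whatever accuracy estimate (for example, cross-validated accuracy of the SVM) you decide to use.

```python
import itertools
import math


def multiplicative_grid(min_exponent, max_exponent, mantissas=(1, 3, 6)):
    """Values m * 10^e for each exponent e in the range and each mantissa m."""
    values = []
    for exponent in range(min_exponent, max_exponent + 1):
        for m in mantissas:
            values.append(m * 10.0 ** exponent)
    return values


def grid_search(score_fn, C_values, sigma_values):
    """Return (best_score, best_C, best_sigma) over every (C, sigma) combination."""
    best = None
    for C, sigma in itertools.product(C_values, sigma_values):
        score = score_fn(C, sigma)
        if best is None or score > best[0]:
            best = (score, C, sigma)
    return best


def toy_score(C, sigma):
    """Made-up score that peaks at C = 10 and sigma = 1; replace with a real accuracy estimate."""
    return -(math.log10(C) - 1.0) ** 2 - math.log10(sigma) ** 2


grid = multiplicative_grid(-2, 2)   # roughly 0.01, 0.03, 0.06, 0.1, ..., 100, 300, 600
best_score, best_C, best_sigma = grid_search(toy_score, grid, grid)
```

Because the grid is spaced multiplicatively, each decade gets the same number of candidate values, which suits parameters like C and σ whose useful range spans several orders of magnitude.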

3.6 Implementing the Cosine Similarity Kernel (CIS 519 ONLY) (10 pts)
Implement the cosine similarity kernel by completing myCosineSimilarityKernel() in [Link]. We
have not provided (quite on purpose!) a test script that compares your implementation to the cosine similarity
kernel already provided by scikit learn, but you are welcome to adapt the test [Link] script
for this purpose.
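For reference, the cosine similarity between two vectors is their inner product divided by the product of their Euclidean norms, K(v, w) = ⟨v, w⟩ / (‖v‖₂ ‖w‖₂). A vectorized sketch follows; the snake_case function name is hypothetical, and assigning similarity 0 to all-zero rows is an assumption made here to avoid division by zero.

```python
import numpy as np


def my_cosine_similarity_kernel(X1, X2):
    """Return the n1-by-n2 matrix of cosine similarities between rows of X1 and X2."""
    norms1 = np.linalg.norm(X1, axis=1).reshape(-1, 1)
    norms2 = np.linalg.norm(X2, axis=1).reshape(1, -1)
    denom = norms1 * norms2
    denom[denom == 0] = 1.0   # assumption: an all-zero row gets similarity 0 with everything
    return (X1 @ X2.T) / denom


K_example = my_cosine_similarity_kernel(
    np.array([[1.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 2.0], [3.0, 3.0]]),
)
```

Since the kernel depends only on the angle between vectors, rescaling any row of X1 or X2 by a positive constant leaves the result unchanged.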

Common questions

Convergence testing in gradient descent serves to determine when the model's parameters have reached a state where further iterations are unlikely to result in significant changes. By implementing a convergence test, such as checking when the difference between consecutive parameter values (θnew and θold) falls below a specified threshold (ϵ), training can be stopped efficiently once the algorithm has sufficiently minimized the cost function. This avoids unnecessary computations and accelerates the training process, ensuring that computational resources are used effectively without compromising model accuracy.

The optimal parameters for a Gaussian kernel in SVM can be determined by searching over a combination of potential values for C and sigma (σ), often using techniques like grid search with cross-validation to evaluate different combinations based on their accuracy. Incorrect parameter selection can lead to poor model performance; a very small σ can cause the model to become overly sensitive to the training data, resulting in overfitting, while a very large σ can cause underfitting by simplifying the model too much. Similarly, the value of C controls the trade-off between the smooth decision boundary and classification accuracy on the training data, affecting bias and variance directly.

Initially, the sphere-based classifier has a limited VC dimension as the only parameter to adjust is the radius. However, when the ability to change the direction of the classification surface is introduced, the VC dimension increases because it introduces an additional flexibility to classify points inside or outside the sphere as positive or negative. This effectively allows the classifier to handle more complex decision boundaries, thereby increasing its capacity.

Regularization introduces a penalty term that controls the complexity of the polynomial regression model, thereby helping manage the bias-variance trade-off. Without regularization, higher degree polynomials can lead to overfitting, resulting in a low training error but a high testing error, indicating high variance. Increasing the regularization parameter (lambda) to a moderate value brings the training and testing errors closer together, reducing the model's variance without significantly increasing the bias; if lambda is too large, both errors rise again because the model underfits. This way, a well-chosen amount of regularization helps achieve a good balance between bias and variance, leading to a model that generalizes well to new data.

Implementing custom kernels in SVMs is important because kernels allow the classifier to handle data that is not linearly separable in the original feature space. A kernel implicitly maps the data into a higher-dimensional space where it may become linearly separable. A polynomial kernel, specifically, takes each pair of data points and raises their dot product (typically plus a constant) to a power d, which corresponds to an inner product over all monomials of the input features up to degree d. This lets the SVM construct more complex decision boundaries for non-linear patterns without explicitly computing coordinates in the higher-dimensional space.
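The implicit mapping can be checked numerically. The sketch below assumes the kernel form K(x, z) = (x·z + 1)^d and, for d = 2 with two-dimensional inputs, compares it against an explicit six-dimensional feature map. Both function names and the sample points are hypothetical.

```python
import numpy as np

def polynomial_kernel(X1, X2, degree=2):
    """K[i, j] = (<X1[i], X2[j]> + 1)^degree, computed with .dot."""
    return (X1.dot(X2.T) + 1.0) ** degree

def explicit_degree2_map(X):
    """Explicit feature map for 2-D inputs that matches the degree-2 kernel."""
    x1, x2 = X[:, 0], X[:, 1]
    r2 = np.sqrt(2.0)
    return np.column_stack([np.ones(len(X)), r2 * x1, r2 * x2,
                            x1 ** 2, x2 ** 2, r2 * x1 * x2])

# Hypothetical sample points.
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
K = polynomial_kernel(X, X, degree=2)
Phi = explicit_degree2_map(X)
```

Here K agrees with Phi.dot(Phi.T), so the kernel returns the same inner products as the explicit map without ever building the higher-dimensional vectors.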

In SVMs, the penalty parameter C controls the trade-off between maximizing the margin and minimizing the classification error on the training data. A small C puts little weight on the misclassification penalty, so the SVM prefers a larger-margin hyperplane even if it misclassifies some training points, which makes it tolerant of outliers. A large C penalizes misclassifications heavily, pushing the SVM to classify as many training examples correctly as possible, which may result in a smaller-margin hyperplane that is susceptible to outliers. As C increases, the decision boundary becomes more sensitive to outliers, potentially leading to overfitting.

The learning rate (alpha) defines the size of the step taken during each iteration of gradient descent. If the learning rate is too high, the algorithm may overshoot the minimum, causing divergence or oscillations. Conversely, if the learning rate is too low, the convergence will be slow, taking much longer to reach the minimum value. In the logistic regression model, balancing the learning rate is crucial for achieving convergence efficiently and finding the optimal parameters without overshooting.
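The three regimes can be illustrated on a one-dimensional stand-in problem. The sketch below minimizes f(θ) = θ² rather than the logistic regression cost, purely to keep the arithmetic transparent; the function name and the three alpha values are assumptions chosen for illustration.

```python
def run_gradient_descent(alpha, theta0=1.0, steps=50):
    """Minimize f(theta) = theta^2 (gradient 2 * theta) with a fixed step size."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta
    return theta

too_low = run_gradient_descent(alpha=0.001)   # barely moves from the start
moderate = run_gradient_descent(alpha=0.4)    # converges quickly toward 0
too_high = run_gradient_descent(alpha=1.1)    # overshoots and grows each step
```

Each update multiplies θ by (1 − 2·alpha), so the iterate shrinks only when that factor has magnitude below 1, and shrinks fastest when the factor is close to 0.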

Adding polynomial features of higher degree to logistic regression lets it fit non-linear decision boundaries by increasing model complexity. However, without regularization, high-degree polynomials can overfit, leading to a large gap between training and test error. Regularization (controlled by λ) mitigates this by penalizing large coefficients, balancing fitting capacity against generalization, and the higher the degree, the more carefully λ needs to be chosen. Cross-validation is typically used to tune both hyperparameters, aiming for a model that captures the non-linear boundary with low bias and low variance.

Using a custom cosine similarity kernel makes the SVM compare data points by the angle between them rather than by their magnitude or Euclidean distance. This can be particularly useful in applications where the direction of data vectors is more informative than their magnitude, such as text classification, where a longer document has a larger feature vector but not necessarily a different direction, or other high-dimensional domains where magnitudes vary widely. In such cases the cosine kernel can improve classification performance when traditional kernel functions fail to capture the structure of the data. The advantage is specific to tasks where directional information is relevant.
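A minimal sketch of such a kernel follows, assuming the usual definition K(x, z) = x·z / (‖x‖ ‖z‖) and that no input row is the zero vector. The function name and sample points are hypothetical. Rescaling the inputs leaves the kernel values unchanged, which is the magnitude invariance described above.

```python
import numpy as np

def cosine_kernel(X1, X2):
    """K[i, j] = <X1[i], X2[j]> / (||X1[i]|| * ||X2[j]||).

    Assumes no row of X1 or X2 is the zero vector.
    """
    norms1 = np.linalg.norm(X1, axis=1)[:, np.newaxis]
    norms2 = np.linalg.norm(X2, axis=1)[np.newaxis, :]
    return X1.dot(X2.T) / (norms1 * norms2)

# Hypothetical sample points.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 3.0]])
K = cosine_kernel(X, X)
K_scaled = cosine_kernel(10.0 * X, X)  # same directions, larger magnitudes
```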

Adding features through basis expansion in polynomial regression lets the model capture more complex patterns by fitting higher-order polynomial functions. This can improve accuracy when the relationship between the features and the target variable is nonlinear. However, it also increases the risk of overfitting, since the more flexible model may start fitting noise in the training data. The added complexity therefore calls for regularization to prevent overfitting and balance bias against variance.
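A basis expansion of this kind can be sketched in a few lines. The helper below maps a single input feature to its powers 1 through d; the function name and the choice to omit the constant x^0 column are assumptions for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D array x of length n to an n-by-degree matrix [x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.column_stack([x ** p for p in range(1, degree + 1)])

features = poly_features([1.0, 2.0, 3.0], degree=3)
```

Each row of the result holds one example's expanded features, so a linear model fitted on these columns is a degree-3 polynomial in the original input.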

CIS 419/519 Introduction to Machine Learning

1 Gradient Descent (5 pts)

2 Fitting an SVM by Hand (15 pts)

3 VC Dimension (519 ONLY) (10 pts)

1 Polynomial Regression (25 pts)

1.1 Implementing Polynomial Regression

1.2 Examine the Bias-Variance Tradeoff through Learning Curves

Notice the following:

• test [Link] - python script to test logistic regression

• data/[Link] - Microchip data, non-linearly separable

• fit(X,y): train the classifier from labeled data (X, y)

2.2 Testing Your Implementation

Figure 2: Nonlinear Logistic Regression Classifier

3.1 Getting Started

Note that X1 and X2 are numpy arrays, so we must use .dot to multiply them.

# create SVM with custom kernel and train model

3.3 Implementing the Polynomial Kernel

3.4 Implementing the Gaussian Radial Basis Function Kernel

Figure 4: Sample output of test [Link]

3.5 Choosing the Optimal Parameters

Common questions

What is the purpose of convergence testing in gradient descent, and how does it influence training efficiency?

How do you determine the optimal parameters for the Gaussian kernel in SVM, and what is the impact of incorrect parameter selection?

How does the ability to change the direction of the classification surface affect the VC dimension of a sphere-based classifier?

What role does regularization play in controlling the bias-variance trade-off in polynomial regression models?

Why is it important to implement custom kernels in SVM, and how does a polynomial kernel transform feature space?

How does the penalty parameter C in SVMs influence the decision boundary, particularly in the presence of outliers?

What is the effect of the learning rate on the convergence of the logistic regression model using gradient descent?

Evaluate the impact of various polynomial degrees and regularization parameters on the classification performance of logistic regression in non-linear decision boundary problems.

How does using a custom cosine similarity kernel impact the SVM's ability to classify data compared to using traditional kernel functions?

What are the implications of adding additional features through basis expansion in polynomial regression?
