CS 224n Assignment #2: Word2Vec and Dependency Parsing

Due Date: April 18th, Thursday, 4:30 PM PST.

In this assignment, you will review the mathematics behind Word2Vec and build a neural dependency
parser using PyTorch. For a review of the fundamentals of PyTorch, please check out the PyTorch review
session on Canvas. In Part 1, you will explore the partial derivatives involved in training a Word2vec
model using the naive softmax loss. In Part 2, you will learn about two general neural network techniques
(Adam Optimization and Dropout). In Part 3, you will implement and train a dependency parser using the
techniques from Part 2, before analyzing a few erroneous dependency parses.
If you are using LaTeX, you can use \ifans{} to type your solutions.
Please tag the questions correctly on Gradescope; the TAs will take points off for questions that
are not tagged.

1. Understanding word2vec (15 points)

Recall that the key insight behind word2vec is that ‘a word is known by the company it keeps’. Concretely,
consider a ‘center’ word c surrounded before and after by a context of a certain length. We term
words in this contextual window ‘outside words’ (O). For example, in Figure 1, the context window
length is 2, the center word c is ‘banking’, and the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’:

Figure 1: The word2vec skip-gram prediction model with window size 2

Skip-gram word2vec aims to learn the probability distribution P (O|C). Specifically, given a specific
word o and a specific word c, we want to predict P (O = o|C = c): the probability that word o is an
‘outside’ word for c (i.e., that it falls within the contextual window of c). We model this probability by
taking the softmax function over a series of vector dot-products:

$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)} \tag{1}$$

For each word, we learn vectors u and v, where $u_o$ is the ‘outside’ vector representing outside word o,
and $v_c$ is the ‘center’ vector representing center word c. We store these parameters in two matrices,
U and V. The columns of U are all the ‘outside’ vectors $u_w$; the columns of V are all of the ‘center’
vectors $v_w$. Both U and V contain a vector for every w ∈ Vocabulary.¹

Recall from lectures that, for a single pair of words c and o, the loss is given by:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c). \tag{2}$$
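As a rough numerical illustration of equations (1) and (2), the sketch below evaluates the softmax probability and the naive-softmax loss on a toy vocabulary of three words. The function names, the toy vectors, and the choice to store the outside vectors as a list of rows (rather than as columns of U) are assumptions made for this example only; subtracting the maximum score is a common numerical-stability step that does not change the value of the softmax.

```python
import math

def naive_softmax_prob(center_vec, outside_vecs, o):
    """P(O = o | C = c) as in equation (1).

    outside_vecs[w] plays the role of u_w and center_vec plays the role of v_c.
    """
    # Dot product u_w^T v_c for every word w in the toy vocabulary.
    scores = [sum(u_i * v_i for u_i, v_i in zip(u_w, center_vec))
              for u_w in outside_vecs]
    # Shifting by the max leaves the softmax unchanged but avoids overflow.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    return exps[o] / sum(exps)

def naive_softmax_loss(center_vec, outside_vecs, o):
    """J_naive-softmax(v_c, o, U) = -log P(O = o | C = c), equation (2)."""
    return -math.log(naive_softmax_prob(center_vec, outside_vecs, o))

# Hypothetical toy data: 3 words, 2-dimensional vectors.
toy_outside_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_center_vec = [0.5, -0.5]

probs = [naive_softmax_prob(toy_center_vec, toy_outside_vecs, w)
         for w in range(len(toy_outside_vecs))]
loss = naive_softmax_loss(toy_center_vec, toy_outside_vecs, 0)
```

Here `probs` is the full predicted distribution over the toy vocabulary, and `loss` is the loss for the pair where word 0 is the true outside word.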

1 Assume that every word in our vocabulary is matched to an integer number k. Bolded lowercase letters represent vectors.
uk is both the kth column of U and the ‘outside’ word vector for the word indexed by k. vk is both the kth column of V and
the ‘center’ word vector for the word indexed by k. In order to simplify notation we shall interchangeably use k to
refer to word k and the index of word k.


We can view this loss as the cross-entropy² between the true distribution y and the predicted distribution
ŷ, for a particular center word c and a particular outside word o. Here, both y and ŷ are vectors
with length equal to the number of words in the vocabulary. Furthermore, the kth entry in these vectors
indicates the conditional probability of the kth word being an ‘outside word’ for the given c. The true
empirical distribution y is a one-hot vector with a 1 for the true outside word o, and 0 everywhere else,
for this particular example of center word c and outside word o.³ The predicted distribution ŷ is the
probability distribution P (O|C = c) given by our model in equation (1).

Note: Throughout this homework, when computing derivatives, please use the method reviewed during
the lecture (i.e. no Taylor Series Approximations).

2 The cross-entropy loss between the true (discrete) probability distribution p and another distribution q is $-\sum_i p_i \log(q_i)$.
3 Note that the true conditional probability distribution of context words for the entire training dataset would not be one-hot.

(a) (2 points) Prove that the naive-softmax loss (Equation 2) is the same as the cross-entropy loss
between y and ŷ, i.e. (note that y (true distribution), ŷ (predicted distribution) are vectors and
ŷo is a scalar):
$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o). \tag{3}$$

Your answer should be one line. You may describe your answer in words.
(b) (6 points) i. Compute the partial derivative of Jnaive-softmax (vc , o, U) with respect to vc . Please
write your answer in terms of y, ŷ, U, and show your work to receive full credit.
• Note: Your final answers for the partial derivative should follow the shape convention: the
partial derivative of any function f(x) with respect to x should have the same shape as x.⁴
• Please provide your answers for the partial derivative in vectorized form. For example,
when we ask you to write your answers in terms of y, ŷ, and U, you may not refer to
specific elements of these terms in your final answer (such as y1 , y2 , . . . ).
ii. When is the gradient you computed equal to zero?
Hint: You may wish to review and use some introductory linear algebra concepts.
iii. The gradient you found is the difference between two terms. Provide an interpretation of
how each of these terms improves the word vector when this gradient is subtracted from the
word vector $v_c$.
(c) (1 point) In many downstream applications using word embeddings, L2 normalized vectors (e.g.
$u / \|u\|_2$ where $\|u\|_2 = \sqrt{\sum_i u_i^2}$) are used instead of their raw forms (e.g. u). Let’s consider
a hypothetical downstream task of binary classification of phrases as being positive or negative,
where you decide the sign based on the sum of individual embeddings of the words. When would
L2 normalization take away useful information for the downstream task? When would it not?
Hint: Consider the case where $u_x = \alpha u_y$ for some words x ≠ y and some scalar α. When α is
positive, what will be the value of normalized $u_x$ and normalized $u_y$? How might $u_x$ and $u_y$ be
related for such a normalization to affect or not affect the resulting classification?
(d) (5 points) Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the
‘outside’ word vectors, $u_w$’s. There will be two cases: when w = o, the true ‘outside’ word vector,
and w ≠ o, for all other words. Please write your answer in terms of y, ŷ, and $v_c$. In this subpart,
you may use specific elements within these terms as well (such as $y_1, y_2, \ldots$). Note that $u_w$ is a
vector while $y_1, y_2, \ldots$ are scalars. Show your work to receive full credit.
(e) (1 point) Write down the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to U. Please
break down your answer in terms of the column vectors
$\frac{\partial J(v_c, o, U)}{\partial u_1}, \frac{\partial J(v_c, o, U)}{\partial u_2}, \cdots, \frac{\partial J(v_c, o, U)}{\partial u_{|\text{Vocab}|}}$.
No derivations are necessary, just an answer in the form of a matrix.

4 This allows us to efficiently minimize a function using gradient descent without worrying about reshaping or dimension
mismatching. While following the shape convention, we’re guaranteed that $\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$ is a well-defined update rule.

2. Machine Learning & Neural Networks (8 points)

(a) (4 points) Adam Optimizer
Recall the standard Stochastic Gradient Descent update rule:

$$\theta_{t+1} \leftarrow \theta_t - \alpha \nabla_{\theta_t} J_{\text{minibatch}}(\theta_t)$$

where t + 1 is the current timestep, θ is a vector containing all of the model parameters ($\theta_t$ is
the model parameter at time step t, and $\theta_{t+1}$ is the model parameter at time step t + 1), J is the
loss function, $\nabla_{\theta} J_{\text{minibatch}}(\theta)$ is the gradient of the loss function with respect to the parameters
on a minibatch of data, and α is the learning rate. Adam Optimization⁵ uses a more sophisticated
update rule with two additional steps.⁶
i. (2 points) First, Adam uses a trick called momentum by keeping track of m, a rolling average
of the gradients:

$$m_{t+1} \leftarrow \beta_1 m_t + (1 - \beta_1) \nabla_{\theta_t} J_{\text{minibatch}}(\theta_t)$$

$$\theta_{t+1} \leftarrow \theta_t - \alpha m_{t+1}$$

where β1 is a hyperparameter between 0 and 1 (often set to 0.9). Briefly explain in 2–4
sentences (you don’t need to prove mathematically, just give an intuition) how using m stops
the updates from varying as much and why this low variance may be helpful to learning, overall.

ii. (2 points) Adam extends the idea of momentum with the trick of adaptive learning rates by
keeping track of v, a rolling average of the magnitudes of the gradients:

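$$m_{t+1} \leftarrow \beta_1 m_t + (1 - \beta_1) \nabla_{\theta_t} J_{\text{minibatch}}(\theta_t)$$

$$v_{t+1} \leftarrow \beta_2 v_t + (1 - \beta_2) \left( \nabla_{\theta_t} J_{\text{minibatch}}(\theta_t) \odot \nabla_{\theta_t} J_{\text{minibatch}}(\theta_t) \right)$$

$$\theta_{t+1} \leftarrow \theta_t - \alpha m_{t+1} / \sqrt{v_{t+1}}$$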

where ⊙ and / denote elementwise multiplication and division (so z ⊙ z is elementwise squaring)
and $\beta_2$ is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update
by $\sqrt{v}$, which of the model parameters will get larger updates? Why might this help with
learning?
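To make the three update rules in part (a) concrete, here is a minimal sketch that applies them to plain Python lists. It follows the simplified rules written above, not the full Adam algorithm. The function names, the toy quadratic loss, and the small constant `eps` (added only so the sketch never divides by zero) are assumptions for illustration and do not appear in the assignment.

```python
import math

def sgd_step(theta, grad, lr):
    """Plain SGD: theta <- theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]

def momentum_step(theta, m, grad, lr, beta1=0.9):
    """Momentum: update the rolling average m, then step along m."""
    m_next = [beta1 * m_i + (1 - beta1) * g_i for m_i, g_i in zip(m, grad)]
    theta_next = [t - lr * m_i for t, m_i in zip(theta, m_next)]
    return theta_next, m_next

def simplified_adam_step(theta, m, v, grad, lr, beta1=0.9, beta2=0.99, eps=1e-8):
    """Simplified Adam: momentum plus elementwise division by sqrt(v).

    eps is an assumption of this sketch (it is not in the rule above).
    """
    m_next = [beta1 * m_i + (1 - beta1) * g_i for m_i, g_i in zip(m, grad)]
    v_next = [beta2 * v_i + (1 - beta2) * g_i * g_i for v_i, g_i in zip(v, grad)]
    theta_next = [t - lr * m_i / (math.sqrt(v_i) + eps)
                  for t, m_i, v_i in zip(theta, m_next, v_next)]
    return theta_next, m_next, v_next

def toy_gradient(theta):
    """Gradient of the hypothetical toy loss J(theta) = 0.5 * sum(theta_i ** 2)."""
    return list(theta)

# A few simplified-Adam steps on the toy loss, starting from m = v = 0.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for _ in range(3):
    theta, m, v = simplified_adam_step(theta, m, v, toy_gradient(theta), lr=0.01)
```

Swapping `simplified_adam_step` for `sgd_step` or `momentum_step` in the loop is one way to compare how the step sizes differ across the three rules on the same gradients.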
(b) (4 points) Dropout⁷ is a regularization technique. During training, dropout randomly sets units
in the hidden layer h to zero with probability $p_{\text{drop}}$ (dropping different units each minibatch), and
then multiplies h by a constant γ. We can write this as:

$$h_{\text{drop}} = \gamma d \odot h$$

where $d \in \{0, 1\}^{D_h}$ ($D_h$ is the size of h) is a mask vector where each entry is 0 with probability
$p_{\text{drop}}$ and 1 with probability $(1 - p_{\text{drop}})$. γ is chosen such that the expected value of $h_{\text{drop}}$ is h:

$$\mathbb{E}_{p_{\text{drop}}}[h_{\text{drop}}]_i = h_i$$

for all $i \in \{1, \ldots, D_h\}$.
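The masking operation above can be sketched in a few lines of Python. This is only an illustration of the mechanics: `apply_dropout` and its arguments are hypothetical names, and γ is deliberately left as a caller-supplied parameter (set to a placeholder value of 1.0 below), since choosing it is the subject of the first question that follows.

```python
import random

def apply_dropout(h, p_drop, gamma, rng):
    """Return (h_drop, mask) where h_drop = gamma * d ⊙ h.

    Each mask entry d_i is 0 with probability p_drop and 1 otherwise.
    """
    mask = [0 if rng.random() < p_drop else 1 for _ in h]
    h_drop = [gamma * d_i * h_i for d_i, h_i in zip(mask, h)]
    return h_drop, mask

# Hypothetical hidden layer; a seeded generator keeps the sketch reproducible.
rng = random.Random(0)
h = [0.5, -1.0, 2.0, 0.0, 1.5]
h_drop, mask = apply_dropout(h, p_drop=0.4, gamma=1.0, rng=rng)
```

Calling `apply_dropout` again with the same `rng` draws a fresh mask, which mirrors dropping different units on each minibatch.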

i. (2 points) What must γ equal in terms of pdrop ? Briefly justify your answer or show your math
derivation using the equations given above.
ii. (2 points) Why should dropout be applied during training? Why should dropout NOT be
applied during evaluation? Hint: it may help to look at the dropout paper linked.

5 Kingma and Ba, 2015, [Link]

6 The actual Adam update uses a few additional tricks that are less important, but we won’t worry about them here. If you
want to learn more about it, you can take a look at: [Link]
7 Srivastava et al., 2014, [Link] hinton/absps/[Link]

3. Neural Transition-Based Dependency Parsing (54 points)

In this section, you’ll be implementing a neural-network based dependency parser with the goal of
maximizing performance on the UAS (Unlabeled Attachment Score) metric.

Before you begin, please follow the README to install all the needed dependencies for the assignment.
We will be using PyTorch 2.1.2 from [Link] with the
CUDA option set to None, and the tqdm package – which produces progress bar visualizations throughout
your training process. The official PyTorch website is a great resource that includes tutorials for
understanding PyTorch’s Tensor library and neural networks.

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between
head words, and words which modify those heads. There are multiple types of dependency parsers,
including transition-based parsers, graph-based parsers, and feature-based parsers. Your implementation
will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step
it maintains a partial parse, which is represented as follows:
• A stack of words that are currently being processed.
• A buffer of words yet to be processed.
• A list of dependencies predicted by the parser.
Initially, the stack only contains ROOT, the dependencies list is empty, and the buffer contains all words
of the sentence in order. At each step, the parser applies a transition to the partial parse until its buffer
is empty and the stack size is 1. The following transitions can be applied:
• SHIFT: removes the first word from the buffer and pushes it onto the stack.
• LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of
the first item and removes the second item from the stack, adding a first word → second word
dependency to the dependency list.
• RIGHT-ARC: marks the first (most recently added) item on the stack as a dependent of the second
item and removes the first item from the stack, adding a second word → first word dependency to
the dependency list.
On each step, your parser will decide among the three transitions using a neural network classifier.
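The stack, buffer, and dependency-list mechanics described above can be sketched as a small class. This is an illustrative toy, not the assignment's starter code: the class name `ToyPartialParse`, the transition strings, and the `(head, dependent)` tuple format are assumptions chosen for this example.

```python
class ToyPartialParse:
    """Minimal partial parse: a stack, a buffer, and a list of dependencies."""

    def __init__(self, sentence):
        self.stack = ["ROOT"]          # initially only ROOT
        self.buffer = list(sentence)   # all words, in order
        self.dependencies = []         # (head, dependent) pairs

    def parse_step(self, transition):
        if transition == "SHIFT":
            # Move the first buffer word onto the stack.
            self.stack.append(self.buffer.pop(0))
        elif transition == "LEFT-ARC":
            # Second item on the stack becomes a dependent of the first item.
            dependent = self.stack.pop(-2)
            self.dependencies.append((self.stack[-1], dependent))
        elif transition == "RIGHT-ARC":
            # First item on the stack becomes a dependent of the second item.
            dependent = self.stack.pop()
            self.dependencies.append((self.stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {transition}")

    def is_complete(self):
        return not self.buffer and len(self.stack) == 1

# Replay three transitions on the start of an example sentence.
parse = ToyPartialParse(
    ["I", "presented", "my", "findings", "at", "the", "NLP", "conference"]
)
for transition in ["SHIFT", "SHIFT", "LEFT-ARC"]:
    parse.parse_step(transition)
```

After these three steps the stack holds ROOT and presented, and the only recorded dependency is presented → I.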
(a) (4 points) Go through the sequence of transitions needed for parsing the sentence “I presented my
findings at the NLP conference”. The dependency tree for the sentence is shown below. At each
step, give the configuration of the stack and buffer, as well as what transition was applied this
step and what new dependency was added (if any). The first three steps are provided below as an
example.

| Stack | Buffer | New dependency | Transition |
|---|---|---|---|
| [ROOT] | [I, presented, my, findings, at, the, NLP, conference] | | Initial Configuration |
| [ROOT, I] | [presented, my, findings, at, the, NLP, conference] | | SHIFT |
| [ROOT, I, presented] | [my, findings, at, the, NLP, conference] | | SHIFT |
| [ROOT, presented] | [my, findings, at, the, NLP, conference] | presented→I | LEFT-ARC |

(b) (2 points) A sentence containing n words will be parsed in how many steps (in terms of n)? Briefly
explain in 1–2 sentences why.
(c) (6 points) Implement the __init__ and parse_step functions in the PartialParse class in
parser [Link]. This implements the transition mechanics your parser will use. You
can run basic (non-exhaustive) tests by running python parser [Link] part_c.
(d) (8 points) Our network will predict which transition should be applied next to a partial parse. We
could use it to parse a single sentence by applying predicted transitions until the parse is complete.
However, neural networks run much more efficiently when making predictions about batches of data
at a time (i.e., predicting the next transition for many different partial parses simultaneously). We
can parse sentences in minibatches with the following algorithm.

Algorithm 1 Minibatch Dependency Parsing

Input: sentences, a list of sentences to be parsed, and model, our model that makes parse decisions

Initialize partial_parses as a list of PartialParses, one for each sentence in sentences
Initialize unfinished_parses as a shallow copy of partial_parses
while unfinished_parses is not empty do
    Take the first batch_size parses in unfinished_parses as a minibatch
    Use the model to predict the next transition for each partial parse in the minibatch
    Perform a parse step on each partial parse in the minibatch with its predicted transition
    Remove the completed (empty buffer and stack of size 1) parses from unfinished_parses
end while

Return: The dependencies for each (now completed) parse in partial_parses.

Implement this algorithm in the minibatch_parse function in parser [Link]. You
can run basic (non-exhaustive) tests by running python parser [Link] part_d.
Note: You will need minibatch_parse to be correctly implemented to evaluate the model you will
build in part (e). However, you do not need it to train the model, so you should be able to complete
most of part (e) even if minibatch_parse is not implemented yet.
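The control flow of Algorithm 1 can be sketched as follows, under stated assumptions. Everything named here is hypothetical: `ToyPartialParse` is a compact stand-in for a partial parse, `ShiftThenRightArcModel` is a dummy model that shifts while the buffer is non-empty and otherwise applies RIGHT-ARC, and its `predict` method is an invented interface that may differ from the one in the starter code.

```python
class ToyPartialParse:
    """Compact stand-in for a partial parse (stack, buffer, dependencies)."""

    def __init__(self, sentence):
        self.stack, self.buffer, self.dependencies = ["ROOT"], list(sentence), []

    def parse_step(self, transition):
        if transition == "SHIFT":
            self.stack.append(self.buffer.pop(0))
        else:
            # LEFT-ARC removes the second stack item, RIGHT-ARC the first.
            dependent = self.stack.pop(-2 if transition == "LEFT-ARC" else -1)
            self.dependencies.append((self.stack[-1], dependent))

    def is_complete(self):
        return not self.buffer and len(self.stack) == 1


class ShiftThenRightArcModel:
    """Dummy model: SHIFT while the buffer has words, otherwise RIGHT-ARC."""

    def predict(self, partial_parses):
        return ["SHIFT" if p.buffer else "RIGHT-ARC" for p in partial_parses]


def minibatch_parse_sketch(sentences, model, batch_size):
    partial_parses = [ToyPartialParse(s) for s in sentences]
    unfinished = partial_parses[:]  # shallow copy: same parse objects
    while unfinished:
        minibatch = unfinished[:batch_size]
        transitions = model.predict(minibatch)
        for parse, transition in zip(minibatch, transitions):
            parse.parse_step(transition)
        # Keep only the parses that still have work to do.
        unfinished = [p for p in unfinished if not p.is_complete()]
    return [p.dependencies for p in partial_parses]


toy_sentences = [["a", "b", "c"], ["x"], ["p", "q"]]
all_dependencies = minibatch_parse_sketch(
    toy_sentences, ShiftThenRightArcModel(), batch_size=2
)
```

Because the copy is shallow, steps applied to parses in the minibatch also update the objects in `partial_parses`, so the returned dependencies stay in the original sentence order.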

(e) (20 points) We are now going to train a neural network to predict, given the state of the stack,
buffer, and dependencies, which transition should be applied next.
First, the model extracts a feature vector representing the current state. We will be using the feature
set presented in the original neural dependency parsing paper: A Fast and Accurate Dependency
Parser using Neural Networks.⁸ The function extracting these features has been implemented for
you in utils/parser [Link]. This feature vector consists of a list of tokens (e.g., the last
word in the stack, first word in the buffer, dependent of the second-to-last word in the stack if there
is one, etc.). They can be represented as a list of integers $w = [w_1, w_2, \ldots, w_m]$ where m is the
number of features and each $0 \le w_i < |V|$ is the index of a token in the vocabulary (|V| is the
vocabulary size). Then our network looks up an embedding for each word and concatenates them
into a single input vector:

$$x = [E_{w_1}, \ldots, E_{w_m}] \in \mathbb{R}^{dm}$$

where $E \in \mathbb{R}^{|V| \times d}$ is an embedding matrix with each row $E_w$ as the vector for a particular word w
with dimension d.

8 Chen and Manning, 2014, [Link]

We then compute our prediction as:

$$h = \text{ReLU}(xW + b_1)$$
$$l = hU + b_2$$
$$\hat{y} = \text{softmax}(l)$$

where h is referred to as the hidden layer, l is referred to as the logits, ŷ is referred to as the
predictions, and ReLU(z) = max(z, 0). We will train the model to minimize cross-entropy loss:

$$J(\theta) = CE(y, \hat{y}) = -\sum_{j=1}^{3} y_j \log \hat{y}_j$$

where yj denotes the jth element of y. To compute the loss for the training set, we average this
J(θ) across all training examples.
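To help keep track of the shapes involved, here is a dependency-free sketch of the computation just described: embedding lookup and concatenation, the hidden layer, the logits, the softmax, and the cross-entropy loss for one example. It uses plain Python lists instead of PyTorch tensors, and the helper names and all toy numbers (|V| = 4, d = 2, m = 2, hidden size 3) are assumptions for illustration only.

```python
import math

def embedding_lookup(E, w):
    """Concatenate the rows of E selected by the feature indices w."""
    x = []
    for index in w:
        x.extend(E[index])
    return x

def affine(x, weight, bias):
    """Row vector times matrix plus bias: x @ weight + bias."""
    return [sum(x_i * row[j] for x_i, row in zip(x, weight)) + bias[j]
            for j in range(len(bias))]

def relu(z):
    return [max(z_i, 0.0) for z_i in z]

def softmax(logits):
    shift = max(logits)  # stability shift; does not change the result
    exps = [math.exp(l_i - shift) for l_i in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(E, w, W, b1, U, b2):
    x = embedding_lookup(E, w)      # length d * m
    h = relu(affine(x, W, b1))      # hidden layer
    logits = affine(h, U, b2)       # one logit per transition
    return softmax(logits)

def cross_entropy(y, y_hat):
    """CE(y, y_hat) = -sum_j y_j log(y_hat_j); zero entries of y are skipped."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)

# Hypothetical toy parameters.
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # |V| x d
w = [2, 0]                                             # m feature indices
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1],
     [0.2, 0.2, 0.2], [-0.3, 0.1, 0.4]]                # (d*m) x hidden
b1 = [0.0, 0.1, -0.1]
U = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 1.0, 0.0]]  # hidden x 3
b2 = [0.0, 0.0, 0.0]

y_hat = forward(E, w, W, b1, U, b2)
loss = cross_entropy([0, 1, 0], y_hat)
```

Working through the list lengths here (4, then 3, then 3) is a small-scale version of tracking matrix and vector sizes in the real model.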
i. Compute the derivative of $h = \text{ReLU}(xW + b_1)$ with respect to x. For simplicity, you only
need to show the derivative $\frac{\partial h_i}{\partial x_j}$ for some index i and j. You may ignore the case where the
derivative is not defined at 0.
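ii. Recall in part 1b, we computed the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$. Likewise, please
compute the partial derivative of J(θ) with respect to the ith entry of l, which is denoted as $l_i$.
Specifically, compute $\frac{\partial CE(y, \hat{y})}{\partial l_i}$, assuming that $l \in \mathbb{R}^3$, $\hat{y} \in \mathbb{R}^3$, $y \in \mathbb{R}^3$, and the true label is c.
Hints: You may recall from part 1a, $\frac{\partial CE(y, \hat{y})}{\partial l_i} = \sum_j \frac{\partial CE(y, \hat{y})}{\partial \hat{y}_j} \frac{\partial \hat{y}_j}{\partial l_i}$, and $\frac{\partial CE(y, \hat{y})}{\partial \hat{y}_j} = 0$ if j ≠ c.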
iii. We will use UAS as our evaluation metric. UAS refers to Unlabeled Attachment Score,
which is computed as the ratio between the number of correctly predicted dependencies and the
total number of dependencies, regardless of the relation labels (our model doesn’t predict these).
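The ratio just described can be sketched in a few lines. The representation is an assumption made for this example: each sentence is a list in which entry i holds the index of the head of word i + 1 (with 0 standing for ROOT), and the function name is hypothetical. The function returns a fraction between 0 and 1, whereas the scores quoted later in this assignment are on a 0 to 100 scale.

```python
def unlabeled_attachment_score(predicted_heads, gold_heads):
    """Fraction of words whose predicted head matches the gold head."""
    correct = 0
    total = 0
    for predicted, gold in zip(predicted_heads, gold_heads):
        for predicted_head, gold_head in zip(predicted, gold):
            correct += int(predicted_head == gold_head)
            total += 1
    return correct / total

# Two hypothetical sentences with 3 and 2 words; one head is predicted wrongly.
gold = [[2, 0, 2], [0, 1]]
predicted = [[2, 0, 1], [0, 1]]
uas = unlabeled_attachment_score(predicted, gold)
```

Relation labels never enter the comparison, which is what makes the score unlabeled.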

In parser [Link] you will find skeleton code to implement this simple neural network using
PyTorch. Complete the __init__, embedding_lookup and forward functions to implement
the model. Then complete the train_for_epoch and train functions within the [Link]
file.
Finally execute python [Link] to train your model and compute predictions on test data
from Penn Treebank (annotated with Universal Dependencies).
Note:
• For this assignment, you are asked to implement Linear layer and Embedding layer. Please
DO NOT use [Link] or [Link] module in your code, otherwise
you will receive deductions for this problem.
• Please follow the naming requirements in our TODO if there are any, e.g. if there are
explicit requirements about variable names you have to follow them in order to receive full
credits. You are free to declare other variable names if not explicitly required.
Hints:
• Each of the variables you are asked to declare ([Link] to hidden weight,
[Link] to hidden bias, [Link] to logits weight,
[Link] to logits bias) corresponds to one of the variables above (W, b1 , U,
b2 ).
• It may help to work backwards in the algorithm (start from ŷ) and keep track of the
matrix/vector sizes.
• Once you have implemented embedding_lookup (e) or forward (f) you can call
python parser [Link] with flag -e or -f or both to run sanity checks with each
function. These sanity checks are fairly basic and passing them doesn’t mean your code is
bug free.
• When debugging, you can add a debug flag: python [Link] -d. This will cause the
code to run over a small subset of the data, so that training the model won’t take as long.
Make sure to remove the -d flag to run the full model once you are done debugging.
• When running with debug mode, you should be able to get a loss smaller than 0.2 and a
UAS larger than 65 on the dev set (although in rare cases your results may be lower, because
there is some randomness when training).
• It should take up to 15 minutes to train the model on the entire training dataset, i.e.,
when debug mode is disabled.
• When debug mode is disabled, you should be able to get a loss smaller than 0.08 on the train
set and an Unlabeled Attachment Score larger than 87 on the dev set. For comparison, the
model in the original neural dependency parsing paper gets 92.5 UAS. If you want, you can
tweak the hyperparameters for your model (hidden layer size, hyperparameters for Adam,
number of epochs, etc.) to improve the performance (but you are not required to do so).
Deliverables:
• Working implementation of the transition mechanics that the neural dependency parser
uses in parser [Link].
• Working implementation of minibatch dependency parsing in parser [Link].
• Working implementation of the neural dependency parser in parser [Link]. (We’ll
look at and run this code for grading).
• Working implementation of the functions for training in [Link]. (We’ll look at and run
this code for grading).
• Report the best UAS your model achieves on the dev set and the UAS it
achieves on the test set in your written submission.
(f) (12 points) We’d like to look at example dependency parses and understand where parsers like ours
might be wrong. For example, in this sentence:
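[Figure: dependency parse of the sentence below; the arcs were lost in extraction. Arc labels shown: root, punct, nmod, nsubj, dobj, case. In this parse, Afghanistan is attached to troops.]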

Moscow sent troops into Afghanistan .

PROPN VERB NOUN ADP PROPN PUNCT

the dependency of the phrase into Afghanistan is wrong, because the phrase should modify sent (as
in sent into Afghanistan) not troops (because troops into Afghanistan doesn’t make sense, unless
there are somehow weirdly some troops that stan Afghanistan). Here is the correct parse:
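[Figure: dependency parse of the sentence below; the arcs were lost in extraction. Arc labels shown: root, punct, nmod, nsubj, dobj, case. In this parse, Afghanistan is attached to sent.]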

Moscow sent troops into Afghanistan .

PROPN VERB NOUN ADP PROPN PUNCT

More generally, here are four types of parsing error:

• Prepositional Phrase Attachment Error: In the example above, the phrase into Afghanistan
is a prepositional phrase.⁹ A Prepositional Phrase Attachment Error is when a prepositional
phrase is attached to the wrong head word (in this example, troops is the wrong head word and
sent is the correct head word). More examples of prepositional phrases include with a rock,
before midnight and under the carpet.
• Verb Phrase Attachment Error: In the sentence Leaving the store unattended, I went
outside to watch the parade, the phrase leaving the store unattended is a verb phrase.¹⁰ A Verb
Phrase Attachment Error is when a verb phrase is attached to the wrong head word (in this
example, the correct head word is went).
• Modifier Attachment Error: In the sentence I am extremely short, the adverb extremely is
a modifier of the adjective short. A Modifier Attachment Error is when a modifier is attached
to the wrong head word (in this example, the correct head word is short).
• Coordination Attachment Error: In the sentence Would you like brown rice or garlic naan?,
the phrases brown rice and garlic naan are both conjuncts and the word or is the coordinating
conjunction. The second conjunct (here garlic naan) should be attached to the first conjunct
(here brown rice). A Coordination Attachment Error is when the second conjunct is attached
to the wrong head word (in this example, the correct head word is rice). Other coordinating
conjunctions include and, but and so.

9 For examples of prepositional phrases, see: [Link]
In this question are four sentences with dependency parses obtained from a parser. Each sentence
has one error type, and there is one example of each of the four types above. For each sentence,
state the type of error, the incorrect dependency, and the correct dependency. While each sentence
should have a unique error type, there may be multiple possible correct dependencies for some of
the sentences. To demonstrate: for the example above, you would write:
• Error type: Prepositional Phrase Attachment Error
• Incorrect dependency: troops → Afghanistan
• Correct dependency: sent → Afghanistan
Note: There are lots of details and conventions for dependency annotation. If you want to
learn more about them, you can look at the UD website: [Link]
org¹¹ or the short introductory slides at: [Link]
p/[Link]. Note that you do not need to know all these details in order to do
this question. In each of these cases, we are asking about the attachment of phrases and it should
be sufficient to see if they are modifying the correct head. In particular, you do not need to look at
the labels on the dependency edges – it suffices to just look at the edges themselves.

i.

[Figure: dependency parse of the sentence below; the arcs were lost in extraction. Arc labels shown: root, punct, nmod, obj, advcl, case, det, nsubj, acl.]

The university blocked the acquisition , citing concerns about the risks involved .
DET NOUN VERB DET NOUN PUNCT VERB NOUN ADP DET NOUN VERB PUNCT

ii.

10 For examples of verb phrases, see: [Link]

11 But note that in the assignment we are actually using UDv1, see: [Link]

[Figure: dependency parse of the sentence below; the arcs were lost in extraction. Arc labels shown: root, punct, nmod, case, nsubj, det, aux, advmod, amod, compound.]

Brian has been one of the most crucial elements to the success of Mozilla software .
PROPN AUX AUX NUM ADP DET ADV ADJ NOUN ADP DET NOUN ADP PROPN NOUN PUNCT

iii.

[Figure: dependency parse of the sentence below; the arcs were lost in extraction. Arc labels shown: root, punct, nmod, obl, case, xcomp, det, compound, nsubj, mark.]

Investment Canada declined to comment on the reasons for the government decision .
NOUN PROPN VERB PART VERB ADP DET NOUN ADP DET NOUN NOUN PUNCT

iv.

[Figure: dependency parse of the sentence below; the arcs were lost in extraction. Arc labels shown: root, conj, obl, obj, case, nummod, det, acl:relcl, compound, nmod, nsubj, amod, cc.]

People benefit from a separate move that affects three US car plants and one in Quebec
NOUN VERB ADP DET ADJ NOUN PRON VERB NUM PROPN NOUN NOUN CCONJ NUM ADP PROPN

(g) (2 points) Recall that in part (e), the parser uses features which include words and their part-of-speech
(POS) tags. Explain the benefit of using part-of-speech tags as features in the parser.

Submission Instructions
You shall submit this assignment on GradeScope as two submissions – one for “Assignment 2 [coding]” and
another for “Assignment 2 [written]”:

1. Run the collect [Link] script to produce your [Link] file.

2. Upload your [Link] file to GradeScope to “Assignment 2 [coding]”.

3. Upload your written solutions to GradeScope to “Assignment 2 [written]”.
