Understanding Basic IR Models and TF-IDF

The document discusses the vector space model in information retrieval. It explains how documents and queries are represented as weighted term vectors, with weights typically calculated as term frequency-inverse document frequency (tf-idf). Similarity between documents and queries is measured by calculating the cosine similarity between their vector representations. The document provides an example calculation of tf-idf weights, vector representations, and cosine similarities for a sample term-document matrix and query.


Basic IR: Modeling

 Basic IR task:
 Match a subset of documents to the user's query
 Slightly more complex:
 …and rank the resulting documents by predicted relevance

 The derivation of relevance leads to the different IR models.
Concepts:
Term-Document Incidence

 Imagine a matrix of terms × documents, with 1 when the term appears in the document and 0 otherwise:

        search   segment   select   semantic   …
 MIR      1         0         1         1
 AI       1         1         0         1

 Queries satisfied how?
 Problems?
Concepts:
Term Frequency

 To support document ranking, we need more than just term incidence.
 Term frequency records the number of times a given term appears in each document.
 Intuition: the more times a term appears in a document, the more central it is to the topic of the document.
Concept: Term Weight

 Weights represent the importance of a given term for characterizing a document.
 wij is the weight of term i in document j.
Mapping Task and Document Type to Model

                         Index Terms   Full Text        Full Text + Structure
 Searching (Retrieval)   Classic       Classic          Structured
 Surfing (Browsing)      Flat          Flat Hypertext   Structure Guided Hypertext
IR Models

 (Taxonomy of IR models, reconstructed from the figure in the MIR text.)

 User Task: Retrieval (Adhoc, Filtering)
  Classic Models: Boolean, Vector, Probabilistic
   Set Theoretic: Fuzzy, Extended Boolean
   Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
   Probabilistic: Inference Network, Belief Network
  Structured Models: Non-Overlapping Lists, Proximal Nodes
 User Task: Browsing
  Flat, Structure Guided, Hypertext
Classic Models: Basic Concepts

 ki is an index term
 dj is a document
 t is the total number of index terms
 K = (k1, k2, …, kt) is the set of all index terms
 wij >= 0 is a weight associated with the pair (ki, dj)
 wij = 0 indicates that the term does not belong to the doc
 vec(dj) = (w1j, w2j, …, wtj) is the weighted vector associated with document dj
 gi(vec(dj)) = wij is a function that returns the weight associated with the pair (ki, dj)
Classic: Boolean Model

 Based on set theory: map queries with Boolean operations to set operations
 Select documents from the term-document incidence matrix (see the sketch below)
 Pros: exact matching
 Cons: ignores…
  term frequency in the document
  term scarcity in the corpus
  size of the document
  …and provides no ranking
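To make the mapping to set operations concrete, here is a minimal Python sketch of Boolean retrieval over the incidence matrix from the earlier slide (the postings-dictionary layout is an illustration of the idea, not something from the slides):

```python
# Each term maps to the set of documents in which it appears
# (the rows of the incidence matrix, read column-wise).
postings = {
    "search":   {"MIR", "AI"},
    "segment":  {"AI"},
    "select":   {"MIR"},
    "semantic": {"MIR", "AI"},
}
all_docs = {"MIR", "AI"}

# Boolean query "search AND select AND NOT segment" becomes
# intersection and set difference over the postings sets.
hits = postings["search"] & postings["select"] & (all_docs - postings["segment"])
print(hits)  # {'MIR'}
```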
Vector Model

 Vector of term weights based on term frequency
 Compute similarity between query and document, where both are vectors
 vec(dj) = (w1j, w2j, ..., wtj) and vec(q) = (w1q, w2q, ..., wtq)
 Similarity is the cosine of the angle between the vectors.
Cosine Measure

 Similarity is the cosine of the angle θ between the document vector dj and the query vector q:

 $$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

 Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1.
from MIR notes
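
A minimal sketch of the cosine measure in Python (the function name and the zero-vector convention are choices of this sketch, not from the notes); the sample vectors are d1 and q from the worked example later in the deck:

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between weight vectors d and q; lies in [0, 1]
    when all weights are non-negative."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_d * norm_q)

# d1 and q from the worked example below:
print(round(cosine_sim([0.33, 0.0, 0.42], [0.22, 0.47, 0.85]), 2))  # 0.81
```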
How to Set wij Weights?
TF-IDF

 Within a document: term frequency
 tf measures term density within a document
 Across documents: inverse document frequency
 idf measures the informativeness or rarity of a term across the corpus:

 $$idf_i = \log\left(\frac{n}{df_i}\right)$$
TF * IDF Computation

 $$w_{i,d} = tf_{i,d} \times \log\left(\frac{n}{df_i}\right)$$

 where:
 tf_{i,d} = frequency of term i in document d
 n = total number of documents
 df_i = the number of documents that contain term i

 What happens as the number of occurrences of a term in a document increases?
 What happens as a term becomes more rare?
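A small sketch of this computation, assuming the normalized tf from the next slide and a natural log (the base is an assumption, but it is what matches the numbers in the worked example below):

```python
import math

def tfidf(freq, max_freq, n_docs, df):
    """w_{i,d} = (freq / max_freq) * log(n / df), natural log assumed."""
    return (freq / max_freq) * math.log(n_docs / df)

# Term k3 in document d1 of the example below:
# freq 1, max freq in d1 is 2, 7 documents, k3 appears in 3 of them.
print(round(tfidf(1, 2, 7, 3), 2))  # 0.42
```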
TF * IDF

 TF may be normalized:
  tf(i,d) = freq(i,d) / max_l freq(l,d)
 IDF is computed…
  normalized to the size of the corpus
  as a log, to make TF and IDF values comparable
 IDF requires a static corpus.
How to Set wiq Weights?

 1. Create the vector directly from the query
 2. Use a modified tf-idf:

 $$w_{i,q} = \left(0.5 + \frac{0.5 \times freq(i,q)}{\max_l freq(l,q)}\right) \times \log\left(\frac{n}{df_i}\right)$$
The Vector Model: Example

      k1  k2  k3
 d1    2   0   1
 d2    1   0   0
 d3    0   1   3
 d4    2   0   0
 d5    1   2   4
 d6    1   2   0
 d7    0   5   0
 q     1   2   3

 (Figure: the seven documents and the query plotted in the k1–k2–k3 term space.)

from MIR notes


The Vector Model: Example (cont.)

 1. Compute the tf-idf vector for each document (natural log, normalized tf).

 For the first document (d1):
  k1: (2/2) * log(7/5) = .33
  k2: 0 * log(7/4) = 0
  k3: (1/2) * log(7/3) = .42
 so vec(d1) = [.33 0 .42]

 For the rest (d2 … d7):
  [.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0]

from MIR notes


The Vector Model: Example (cont.)

 2. Compute the tf-idf vector for the query [1 2 3]:
  k1: (.5 + (.5 * 1)/3) * log(7/5) = .22
  k2: (.5 + (.5 * 2)/3) * log(7/4) = .47
  k3: (.5 + (.5 * 3)/3) * log(7/3) = .85
 so vec(q) = [.22 .47 .85]
The Vector Model: Example (cont.)

 3. Compute sim(dj, q) for each document:

 For d1:
  d1 · q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
  |d1| = sqrt(.33^2 + .42^2) = .53
  |q| = sqrt(.22^2 + .47^2 + .85^2) = 1.0
  sim(d1, q) = .43 / (.53 * 1.0) = .81

 For the rest:
  d2: .22   d3: .93   d4: .23   d5: .97   d6: .51   d7: .47
 (d2 and d4 have identical weight vectors, so their similarities are equal; the .22 vs. .23 difference is only rounding of intermediate values.)
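
The whole example can be reproduced with a short script. This is a sketch assuming natural log and the normalized tf and query weighting defined above; last-digit differences from the slide values (e.g. d2/d4 printing .23) come from the slides rounding intermediate values to two decimals:

```python
import math

# Raw term frequencies from the example (rows d1..d7, columns k1..k3).
docs = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
query = [1, 2, 3]
n = len(docs)

# df_i: number of documents containing term i; idf_i = log(n / df_i).
df = [sum(1 for f in docs.values() if f[i] > 0) for i in range(3)]
idf = [math.log(n / df_i) for df_i in df]

def doc_weights(freqs):
    m = max(freqs)                                   # normalized tf
    return [(f / m) * idf[i] for i, f in enumerate(freqs)]

def query_weights(freqs):
    m = max(freqs)                                   # modified tf-idf for queries
    return [(0.5 + 0.5 * f / m) * idf[i] for i, f in enumerate(freqs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

wq = query_weights(query)
for name, freqs in docs.items():
    print(name, round(cosine(doc_weights(freqs), wq), 2))
# d1 0.81, d2 0.23, d3 0.93, d4 0.23, d5 0.97, d6 0.51, d7 0.47
```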
Vector Model Implementation Issues

 Sparse Term × Document matrix
 Store the term count, the term weight, or the count weighted by idfi?
 What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
 How to efficiently compute the cosine for a large index?
Heuristics for Computing Cosine for a Large Index

 Consider only documents with non-zero cosines
  Focus on non-zero cosines for rare (high-idf) words
 Pre-compute document adjacency
  For each term, pre-compute the k nearest docs
  For a t-term query, compute cosines from the query to the union of the t pre-computed lists; choose the top k (see the sketch below)
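
A sketch of the pre-computation idea (often called champion lists). The data shapes here are hypothetical, chosen only to illustrate the heuristic:

```python
import heapq

def build_champion_lists(postings, k):
    """For each term, pre-compute the k documents where its weight is largest.
    postings: {term: {doc_id: tf-idf weight}} (hypothetical shape)."""
    return {term: heapq.nlargest(k, docs, key=docs.get)
            for term, docs in postings.items()}

def candidate_docs(query_terms, champions):
    """Union of the pre-computed lists for the t query terms; only these
    candidates then need a full cosine computation."""
    docs = set()
    for term in query_terms:
        docs.update(champions.get(term, ()))
    return docs

postings = {"search": {"MIR": 0.8, "AI": 0.3}, "semantic": {"AI": 0.9}}
champions = build_champion_lists(postings, k=1)
print(candidate_docs(["search", "semantic"], champions))  # {'MIR', 'AI'}
```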
The TF-IDF Vector Model: Pros/Cons

 Pros:
  term weighting improves retrieval quality
  the cosine ranking formula sorts documents by degree of similarity to the query
 Cons:
  assumes independence of index terms
