Basic IR: Modeling
Basic IR task:
match a subset of documents to the user's query.
Slightly more complex:
also rank the matching documents by predicted relevance.
How relevance is derived is what distinguishes the different IR models.
Concepts:
Term-Document Incidence
Imagine a matrix of terms × documents, with 1 when the
term appears in the document and 0 otherwise:

        search  segment  select  semantic  ...
MIR       1        0       1        1
AI        1        1       0        1
...
How are queries satisfied? What problems arise? (See the sketch below.)
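As a concrete illustration, here is a minimal sketch of building an incidence matrix and answering a conjunctive query with it; the two-document corpus and its tokens are made up for the example.

```python
# Build a term-document incidence matrix and answer a Boolean AND query.
docs = {
    "MIR": "search select semantic",
    "AI":  "search segment semantic",
}

# incidence[term][doc] = 1 if term occurs in doc, else 0
terms = sorted({t for text in docs.values() for t in text.split()})
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()}
             for t in terms}

def boolean_and(*query_terms):
    """Return documents whose incidence bit is 1 for every query term."""
    return [d for d in docs
            if all(incidence.get(t, {}).get(d, 0) for t in query_terms)]

print(boolean_and("search", "semantic"))  # ['MIR', 'AI'] -- both match
print(boolean_and("segment", "select"))   # [] -- no document has both terms
```

This also hints at the problems the slide asks about: for a realistic vocabulary the matrix is almost entirely zeros, and every matching document is equally "relevant".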
Concepts:
Term Frequency
To support document ranking, need
more than just term incidence.
Term frequency records number of
times a given term appears in each
document.
Intuition: the more times a term appears in
a document, the more central it is to the
topic of the document.
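For instance, counting raw term frequencies for one toy document (illustrative tokens):

```python
from collections import Counter

# Term frequency: how often each term appears in a (tokenized) document.
doc = ["search", "semantic", "search", "select", "search"]
print(Counter(doc))  # Counter({'search': 3, 'semantic': 1, 'select': 1})
```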
Concept: Term Weight
Weights represent the importance of a
given term for characterizing a document.
w_ij is the weight for term i in document j.
Mapping Task and Document Type to Model

                       Index Terms   Full Text        Full Text + Structure
Searching (Retrieval)  Classic       Classic          Structured
Surfing (Browsing)     Flat          Flat Hypertext   Structure Guided Hypertext
IR Models

User task:
  Retrieval (ad hoc, filtering) -> Classic Models and Structured Models
  Browsing -> Flat, Structure Guided, Hypertext

Classic Models: Boolean, Vector, Probabilistic
  Set Theoretic extensions: Fuzzy, Extended Boolean
  Algebraic extensions: Generalized Vector, Latent Semantic Indexing, Neural Networks
  Probabilistic extensions: Inference Network, Belief Network

Structured Models: Non-Overlapping Lists, Proximal Nodes

from MIR text
Classic Models: Basic Concepts
k_i is an index term
d_j is a document
t is the total number of index terms
K = (k_1, k_2, ..., k_t) is the set of all index terms
w_ij >= 0 is a weight associated with the pair (k_i, d_j)
w_ij = 0 indicates that the term does not occur in the document
vec(d_j) = (w_1j, w_2j, ..., w_tj) is the weighted vector
associated with document d_j
g_i(vec(d_j)) = w_ij is a function which returns the weight
associated with the pair (k_i, d_j)
Classic: Boolean Model
Based on set theory: map queries with
Boolean operators to set operations
Select documents from the term-document
incidence matrix
Pros: simple, with clean set-based semantics;
easy to implement (a sketch follows below)
Cons: exact matching ignores...
  term frequency in the document
  term scarcity in the corpus
  size of the document
  ...and provides no ranking
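A minimal sketch of the set-theoretic view, assuming a toy postings index (term -> set of document ids); the postings are illustrative:

```python
# Boolean model as set operations: AND -> intersection, OR -> union,
# NOT -> set difference, over postings sets (term -> set of doc ids).
postings = {
    "search":   {1, 2, 4},
    "segment":  {2},
    "select":   {1},
    "semantic": {1, 2, 3},
}

# Query: search AND semantic AND NOT segment
result = (postings["search"] & postings["semantic"]) - postings["segment"]
print(sorted(result))  # [1] -- an unranked set: every hit is equally "relevant"
```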
Vector Model
Vector of term weights based on term
frequency
Compute similarity between query and
document where both are vectors
vec(d_j) = (w_1j, w_2j, ..., w_tj)    vec(q) = (w_1q, w_2q, ..., w_tq)
Similarity is the cosine of the angle between
the vectors.
Cosine Measure

[Figure: vectors d_j and q with the angle θ between them]

sim(d_j, q) = \cos\theta = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}

Since w_{i,j} >= 0 and w_{i,q} >= 0 for all i, j:
0 <= sim(q, d_j) <= 1

from MIR notes
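The formula transcribes directly into code; a minimal sketch, with dense lists of weights standing in for the term vectors:

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot   = sum(wd * wq for wd, wq in zip(d, q))
    len_d = math.sqrt(sum(wd * wd for wd in d))
    len_q = math.sqrt(sum(wq * wq for wq in q))
    if len_d == 0 or len_q == 0:  # empty document or query vector
        return 0.0
    return dot / (len_d * len_q)

# Non-negative weights guarantee 0 <= sim <= 1.
print(cosine_sim([0.33, 0.0, 0.42], [0.22, 0.47, 0.85]))  # ~0.81 (d1 in the example below)
```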
How to Set w_ij Weights?
TF-IDF
Within a document: Term Frequency
  tf measures term density within a document
Across documents: Inverse Document Frequency
  idf measures the informativeness, or rarity, of a term
  across the corpus:

  idf_i = \log \frac{n}{df_i}
TF * IDF Computation

  w_{i,d} = tf_{i,d} \times \log \frac{n}{df_i}

where
  tf_{i,d} = frequency of term i in document d
  n = total number of documents
  df_i = the number of documents that contain term i

What happens as the number of occurrences of a term in a document increases?
What happens as a term becomes more rare?
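A minimal sketch of this computation over a toy corpus (documents as token lists; tf is left unnormalized here, with the normalized variant on the next slide):

```python
import math

def tf_idf(term, doc, corpus):
    """w_{i,d} = tf_{i,d} * log(n / df_i), with raw term counts as tf."""
    tf = doc.count(term)                          # occurrences of term in doc
    df = sum(1 for d in corpus if term in d)      # documents containing term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [["search", "select", "search"], ["segment"], ["search", "semantic"]]
print(tf_idf("search", corpus[0], corpus))  # 2 * log(3/2) ~= 0.81
```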
TF * IDF
TF may be normalized by the most frequent term in the document:
  tf(i,d) = freq(i,d) / max_l freq(l,d)
IDF is computed
  normalized to the size of the corpus
  as a log, to make the TF and IDF values comparable
IDF requires a static corpus.
How to Set w_iq Weights?
1. Create the vector directly from the query, or
2. Use a modified tf-idf:

  w_{i,q} = \left( 0.5 + 0.5 \cdot \frac{freq(i,q)}{\max_l freq(l,q)} \right) \cdot \log \frac{n}{df_i}
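The same formula as a sketch in code (the query is a token list; n and df_i are as defined above):

```python
import math

def query_weight(term, query, n, df):
    """w_{i,q} = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(n / df_i)"""
    freq = query.count(term)
    max_freq = max(query.count(t) for t in set(query))
    return (0.5 + 0.5 * freq / max_freq) * math.log(n / df)

# k3 in the worked example below: freq 3 of max 3, n = 7 docs, df_i = 3
print(round(query_weight("k3", ["k1", "k2", "k2", "k3", "k3", "k3"], 7, 3), 2))  # 0.85
```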
The Vector Model: Example

Term frequencies:

      k1  k2  k3
d1     2   0   1
d2     1   0   0
d3     0   1   3
d4     2   0   0
d5     1   2   4
d6     1   2   0
d7     0   5   0
q      1   2   3

[Figure: documents d1-d7 and the query q plotted in the k1-k2-k3 term space]

from MIR notes
The Vector Model: Example (cont.)

1. Compute the tf-idf vector for each document
(n = 7; df: k1 is in 5 docs, k2 in 4, k3 in 3; tf is normalized by the
document's maximum frequency; logs are natural logs):
For the first document:
  k1: (2/2) * log(7/5) = .33
  k2:  0    * log(7/4) = 0
  k3: (1/2) * log(7/3) = .42
For the rest (d2-d7):
  [.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0]

from MIR notes
The Vector Model: Example (cont.)

2. Compute the tf-idf vector for the query [1 2 3] (maximum frequency is 3):
  k1: (.5 + (.5 * 1/3)) * log(7/5)
  k2: (.5 + (.5 * 2/3)) * log(7/4)
  k3: (.5 + (.5 * 3/3)) * log(7/3)
which is: [.22 .47 .85]
The Vector Model: Example (cont.)

3. Compute the similarity for each document:
D1:
  D1 · q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
  |D1| = sqrt(.33^2 + .42^2) = .53
  |q| = sqrt(.22^2 + .47^2 + .85^2) = 1.0
  sim = .43 / (.53 * 1.0) = .81
For the rest:
  D2: .22   D3: .93   D4: .23
  D5: .97   D6: .51   D7: .47
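The whole worked example can be reproduced with a short script; a sketch assuming natural logs and the weighting schemes above (it recovers the listed similarities up to rounding: it gives .23 for both d2 and d4, since their vectors are identical):

```python
import math

tf = {"d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
      "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0]}
q_tf, n = [1, 2, 3], len(tf)
df  = [sum(1 for f in tf.values() if f[i] > 0) for i in range(3)]  # [5, 4, 3]
idf = [math.log(n / d) for d in df]

def doc_vec(f):
    """tf normalized by the document's maximum frequency, times idf."""
    return [(c / max(f)) * idf[i] for i, c in enumerate(f)]

# Query weights use the (0.5 + 0.5 * tf/max_tf) smoothing from above.
q = [(0.5 + 0.5 * c / max(q_tf)) * idf[i] for i, c in enumerate(q_tf)]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

for name, f in tf.items():
    print(name, round(cos(doc_vec(f), q), 2))
# d1 0.81, d2 0.23, d3 0.93, d4 0.23, d5 0.97, d6 0.51, d7 0.47
```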
Vector Model
Implementation Issues
  The Term×Document matrix is sparse
  Store term counts, term weights, or weights scaled by idf_i?
  What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
  How can cosine be computed efficiently for a large index?
Heuristics for Computing
Cosine for a Large Index
  Consider only the non-zero cosines (documents sharing at least one term with the query)
    focus on non-zero cosines for rare (high-idf) words
  Pre-compute document adjacency
    for each term, pre-compute its k nearest docs
    for a t-term query, compute cosines from the query
    to the union of the t pre-computed lists, choose the top k
A sketch of the first heuristic follows below.
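A minimal sketch of the first heuristic, assuming an inverted index that maps each term to the documents (and weights) containing it, so only documents sharing a term with the query are ever scored; the index contents are illustrative:

```python
import math
from collections import defaultdict

# Inverted index: term -> {doc_id: weight}; doc_norms holds precomputed |d|.
index = {
    "k1": {"d1": 0.34, "d2": 0.34},
    "k3": {"d1": 0.42, "d3": 0.85},
}
doc_norms = {"d1": 0.54, "d2": 0.34, "d3": 0.87}

def top_k(query_weights, k=10):
    """Accumulate dot products only over docs that share a term with the query."""
    scores = defaultdict(float)
    for term, wq in query_weights.items():
        for doc, wd in index.get(term, {}).items():
            scores[doc] += wq * wd
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    ranked = sorted(((s / (doc_norms[d] * q_norm), d) for d, s in scores.items()),
                    reverse=True)
    return ranked[:k]

print(top_k({"k1": 0.22, "k3": 0.85}))  # d3 and d1 score highest; d2 trails
```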
The TF-IDF Vector Model:
Pros/Cons
Pros:
  term weighting improves retrieval quality
  the cosine ranking formula sorts documents
  by degree of similarity to the query
Cons:
  assumes index terms are mutually independent