INFORMATION RETRIEVAL SYSTEM AND
THE PAGERANK ALGORITHM
OUTLINE
Information retrieval system
Data retrieval versus information retrieval
Basic concepts of information retrieval
Retrieval process
Classical models of information retrieval
Boolean model
Vector model
Probabilistic model
Web information retrieval
Features of Google’s search system
Google’s architecture
A brief analysis of PageRank algorithm
PageRank versus HITS algorithm
WHAT IS INFORMATION RETRIEVAL?
Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items [1].
The user must first translate his information need into a query which can
be processed by the IR system.
The key goal of an IR system is to retrieve information which might be
useful or relevant to the user.
DATA VERSUS INFORMATION RETRIEVAL
DATA RETRIEVAL:
Determines which documents of a collection contain the keywords in the user query
All objects which satisfy clearly defined conditions are retrieved
A single erroneous object means total failure
Data has a well-defined structure and semantics
INFORMATION RETRIEVAL:
Retrieves information about a subject rather than data which satisfies a given query
The IR system somehow 'interprets' the contents of the documents in a collection and ranks them according to a degree of relevance to the user query
The retrieved objects might be inaccurate, and small errors are ignored
Data is natural language text which is not always well structured and could be semantically ambiguous
BASIC CONCEPTS OF IR
The effective retrieval of relevant information is directly affected by:
User task – The task of the user might be:
Information or data retrieval
Browsing
Filtering
Figure 1: User tasks in an IR system [1]
Logical View-The way the index words might be extracted from the
document can be of 2 types:
Full Text
Index term
Figure 2: Text operations for Index Term Logical View [1]
RETRIEVAL PROCESS
Step 1: Before the retrieval process can even be initiated, it is necessary
to define the text database. This is usually done by the manager of the
database, who specifies the following:
(a) the documents to be used
(b) text operations
(c) the text model
Step 2: Once the logical view of the documents is defined, the database
manager builds an index of the text. An index is a critical data structure
because it allows fast searching over large volumes of data (e.g., an
inverted file).
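The inverted file mentioned in Step 2 can be sketched in a few lines. This is a minimal illustration (the document IDs and texts are hypothetical), not the actual index format of any production system:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection.
docs = {1: "web information retrieval",
        2: "boolean retrieval model",
        3: "web search"}
index = build_inverted_index(docs)
print(index["retrieval"])  # [1, 2]
print(index["web"])        # [1, 3]
```

At query time, looking up a term costs a single dictionary access instead of a scan over every document, which is why the index enables fast query processing.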
Figure 3 : Retrieval Process[1]
Step 3: Then, the user first specifies a user need which is then parsed
and transformed by the same text operations applied to the text. Then,
query operations are applied to the actual query which is then
processed to obtain the retrieved documents. Fast query processing is
made possible by the index structure previously built.
Step 4: Before being sent to the user, the retrieved documents are ranked
according to a likelihood of relevance.
Step 5: The user then examines the set of ranked documents in the
search for useful information. At this point, he might pinpoint a subset
of the documents seen as definitely of interest and initiate a user
feedback cycle[1].
IR MODELS
The central problem regarding IR systems is the issue of predicting which
documents are relevant and which are not.
A ranking algorithm operates according to basic premises regarding the
notion of document relevance.
The IR model adopted determines the predictions of what is relevant and
what is not.
Figure 4 : Classification of the various IR models[1]
FORMAL DEFINITION OF IR
An information retrieval model is a quadruple :
[D, Q, F, R(qi, dj)]
where:
D is a set composed of logical views (or representations) for the
documents in the collection.
Q is a set composed of logical views (or representations) for the user
information needs (called queries).
F is a framework for modeling document representations, queries, and
their relationships.
R(qi, dj) is a ranking function which associates a real number with a
query qi ∈ Q and a document representation dj ∈ D. Such a ranking
defines an ordering among the documents with regard to the query qi.
CLASSICAL MODEL
Classic IR models consider that each document is described by a set of
representative keywords called index terms, which are used to index and
summarize the document contents.
Distinct index terms have varying relevance when used to describe
document contents.
In these models, this effect is captured through the assignment of
numerical weights to each index term of a document.
The main classical models are:
Boolean Model
Vector Model
Probabilistic Model
STRUCTURED MODEL
Retrieval models which combine information on text content with
information on the document structure are called structured text retrieval
models [1].
There are two models for structured text retrieval:-
Non-overlapping lists model
Proximal nodes model
Figure 5: List structure for (a) Non-overlapping lists model (b) Proximal nodes model [1]
BROWSING MODEL
Browsing is a process of retrieving information whose main objectives
are not clearly defined in the beginning and whose purpose might
change during the interaction with the system.
For browsing, there are 3 models :-
Flat model
Structure guided model
Hypertext model
BOOLEAN MODEL
The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.
The queries are specified as Boolean expressions which have precise
semantics.
The Boolean model considers that index terms are either present or absent in
a document. As a result, the index term weights are assumed to be all
binary, i.e., Wi,j ∈ {0, 1}.
A query q is composed of index terms linked by three connectives: not,
and, or.
A query is essentially a conventional Boolean expression which can be
represented as a disjunction of conjunctive vectors, i.e., in disjunctive
normal form (DNF). The binary weighted vectors are called the conjunctive
components of qdnf.
ADVANTAGES:-
1. Clean formalism behind the model
2. Simplicity
DISADVANTAGES:-
1. Its retrieval strategy is based on a binary decision criterion and
behaves more as data retrieval model.
2. The exact matching may lead to retrieval of too few or too many
documents.
3. It is not simple to translate an information need into a Boolean
expression
4. The Boolean expressions actually formulated by users often are
quite simple.
APPLICATIONS:-
Commercial document database systems
VECTOR MODEL
The vector model was proposed by Gerard Salton and McGill.
This model proposes to apply partial matching strategy by assigning
non-binary weights to index terms in queries and in documents.
These term weights are ultimately used to compute the degree of
similarity between each document stored in the system and the user
query.
In the vector model,
The weight Wi,j associated with an index term–document pair (ki, dj)
is positive and non-binary.
The index terms in the query are also weighted.
ADVANTAGES:
Its term-weighting scheme improves retrieval performance.
Its partial matching strategy allows retrieval of documents that
approximate the query conditions.
Its cosine ranking formula sorts the documents according to their
degree of similarity to the query.
It is a simple and resilient ranking strategy.
DISADVANTAGE:
Index terms are assumed to be mutually independent.
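The cosine similarity at the heart of the vector model's ranking can be sketched as follows; the term weights here are hypothetical tf-idf values, chosen only for illustration:

```python
import math

def cosine(d, q):
    """Cosine of the angle between two term-weight vectors (dicts)."""
    dot = sum(d[t] * q[t] for t in set(d) & set(q))
    norm = (math.sqrt(sum(w * w for w in d.values())) *
            math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0

doc1 = {"web": 0.5, "retrieval": 0.8}      # hypothetical tf-idf weights
doc2 = {"boolean": 0.9, "retrieval": 0.3}
query = {"web": 0.7, "retrieval": 0.7}

# doc1 shares more weighted terms with the query, so it ranks first.
print(cosine(doc1, query) > cosine(doc2, query))  # True
```

Because the cosine measures the angle between vectors rather than their length, a document is not penalized merely for being short, and partial matches still receive a non-zero score.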
PROBABILISTIC MODEL
The classic probabilistic model was introduced in 1976 by Robertson and
Sparck Jones.
The probabilistic model attempts to capture the IR problem within a
probabilistic framework.
BASIC IDEA: Given a user query, there is a set of documents which
contains exactly the relevant documents referred as the ideal answer
set. Given the description of this ideal answer set, we retrieve the
documents that satisfy this condition.
Thus the querying process will be a process of specifying the
properties of an ideal answer set.
Assumption (Probabilistic Principle) -
‘Given a user query q and a document dj in the collection, the
probabilistic model tries to estimate the probability that the user will
find the document dj relevant.
— The model assumes that this probability of relevance depends on the
query and the document representations only.
— Further, the model assumes that there is a subset of all documents
which the user prefers as the answer set for the query q, called an
ideal answer set is labeled R which should maximize the overall
probability of relevance to the user.
— Documents in the set R are predicted to be relevant to the query.
Documents not in this set are predicted to be non-relevant.’
This assumption does not state explicitly:
How to compute the probabilities of relevance
What the sample space is
ADVANTAGES:-
The documents are ranked in decreasing order of their probability of
being relevant.
DISADVANTAGES:-
There is a need to guess the initial separation of documents into
relevant and non-relevant sets.
It does not take into account the frequency with which an index term
occurs inside a document
The adoption of the independence assumption for index terms.
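One concrete instance of this model is the Robertson–Sparck Jones term weight, which estimates, with 0.5 smoothing, the log odds that a term separates the relevant set from the rest of the collection. A sketch with hypothetical counts:

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson-Sparck Jones relevance weight with 0.5 smoothing.

    r: relevant documents containing the term, R: relevant documents,
    n: documents containing the term,          N: collection size.
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Hypothetical counts: the term occurs in 8 of 10 relevant documents
# but in only 100 of 10,000 documents overall, so its weight is positive.
print(rsj_weight(8, 10, 100, 10000) > 0)  # True
```

The 0.5 terms reflect the guessed initial separation mentioned above: before any relevance feedback, they keep the estimate defined even when r = 0.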
COMPARISON OF THE CLASSICAL MODELS
BOOLEAN MODEL:
It evaluates queries by evaluating a Boolean expression.
Weights are binary; the document is either relevant or irrelevant.
It is simple to evaluate based on the query and the document.
Performance is not that good.
VECTOR MODEL:
It uses the concept of index weights and partial matching to match a document to a query.
Index terms are weighted, so a ranking is created based on these weights (using similarity).
It is more complex than the Boolean model, as the index term weighting needs to be done.
Performance is considered to be optimal.
PROBABILISTIC MODEL:
It evaluates queries using the ideal set and probabilistic index terms.
Weights are binary; initially, the document either belongs to the ideal set or is considered irrelevant.
This is the most complex model, since neither the weights nor the ideal set is initially defined.
Performance is proved to be optimal; however, in practice it may become impractical.
WEB IR VERSUS TRADITIONAL IR
The differences between the modeling for the web and the traditional
document collections are because of the following reasons:
o Web is huge
o Dynamic nature of Web
o Web is self organized
o Web growth is fast
o Web is hyperlinked
GOOGLE SEARCH ENGINE
Google, the most popular search engine, came into existence in 1998.
It was developed by Sergey Brin and Lawrence Page as a solution for
the problem of Web information retrieval.
DESIGN GOALS OF GOOGLE
Improved search quality
Academic search engine
Usage
Architecture
HOW GOOGLE SEARCH WORKS
STEP 1: CRAWLING
STEP 2: COMPRESSING
STEP 3: INDEXING
STEP 4: PAGERANK CALCULATION
STEP 5: SORTING
STEP 6: SEARCHING
GOOGLE SYSTEM FEATURES
1. ANCHOR TEXT-Google associates the text of the link with two things:
The page that the link is on
The page the link points to
2. THE PAGERANK ALGORITHM- PageRank extends the idea of
citations by not counting links from all pages equally and by
normalizing by the number of links on a page.
We assume page A has pages T1, T2, ..., Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be set
between 0 and 1; we usually set d to 0.85. Also, C(A) is defined as the
number of links going out of page A. The PageRank of a page A is
given as follows:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
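This formula can be computed iteratively. A minimal sketch with a hypothetical three-page graph (an illustration of the recurrence, not an implementation of Google's actual system):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links out to.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}        # start every page at PageRank 1
    for _ in range(iters):
        pr = {p: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in pages if p in links[t])
              for p in pages}
    return pr

# Hypothetical toy graph: A -> B, C;  B -> C;  C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
print(max(pr, key=pr.get))  # C: it is cited by both A and B
```

Notice the normalization by C(T): a citation from page A, which has two outlinks, counts only half as much as B's single citation.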
ADVANTAGES OF USING PAGERANK ALGORITHM-
Random Surfer Model is used as the intuitive justification of
PageRank.
Pages that are well cited from many places around the Web are
worth looking at. Also, pages that have perhaps only one citation
from a well known site are also generally worth looking at.
ADVANTAGES OF USING ANCHOR TEXT-
Anchors often provide more accurate descriptions of Web pages than
the pages themselves.
Anchors may exist for documents which cannot be indexed by a
text-based search engine, such as images, programs, and
databases.
MATHEMATICS OF PAGERANK
The PageRank Thesis: A page is important if it is pointed to by other
important pages.
ORIGINAL FORMULA- The PageRank of a page Pi, denoted r(Pi), is
r(Pi) = sum over Pj in B(Pi) of r(Pj)/|Pj|
where B(Pi) is the set of pages pointing to Pi and |Pj| is the number of outlinks of page Pj.
Figure 6: Example of PageRank calculation on web pages
The problem is that the PageRanks of pages inlinking to page Pi are
unknown. So, an iterative procedure was used.
INITIAL ASSUMPTION: In the beginning, all pages have equal
PageRank of 1/n, where n is the number of pages in Google's index
of the Web. So, the iterative formula is:
r(k+1)(Pi) = sum over Pj in B(Pi) of r(k)(Pj)/|Pj|
This can also be written as:
π(k+1)T = π(k)T H
where H is the row-normalized hyperlink matrix such that Hij = 1/|Pi|
if there is a link from page Pi to page Pj, and 0 otherwise.
OBSERVATIONS:
Each iteration of the equation involves one vector-matrix
multiplication, which generally requires O(n^2) computation, where
H is an n x n matrix.
H is very sparse because most web pages link to only a handful
of other pages. Hence, an iteration actually requires only O(nnz(H))
computation, where nnz(H) is the number of non-zeros in H, which
reduces the effort to O(n).
The iterative method is the classical power method applied to the
matrix H.
H looks a lot like a stochastic transition probability matrix for a
Markov chain. The dangling nodes of the network (those nodes
with no outlinks) create 0 rows in the matrix. All the other rows,
which correspond to the non-dangling nodes, are stochastic.
Thus, H is called substochastic [2].
PROBLEMS WITH THE ITERATIVE PROCESS-
1. Problem of Rank Sinks-
Rank sinks are those pages that accumulate more and more
PageRank at each iteration.
This is exploited by SEOs and link farms.
Thus, ranking nodes by their PageRank values is difficult when a
majority of the nodes are tied with PageRank 0.
It is preferable to have all PageRanks positive.
Figure 7: (a) Rank Sink (b) Cycle
2. Problem of Cycles-
In a cycle, page 1 points only to page 2 and vice versa, which
creates an infinite loop.
The iterates will not converge no matter how long the process
is run, since π(k)T will flip-flop indefinitely.
ADJUSTMENTS TO THE MODEL-
So, to counter the problems, Brin and Page made use of the Random
Surfer Model.
Imagine a web surfer who bounces along randomly, following the
hyperlink structure of the Web: when he arrives at a page with several
outlinks, he chooses one at random, hyperlinks to this new page, and
continues this random decision process indefinitely.
In the long run, the proportion of time the random surfer spends on a
given page is a measure of the relative importance of that page.
Unfortunately, this random surfer encounters a problem: he gets
caught whenever he enters a dangling node, e.g., PDF files, image files,
data tables, etc. [3].
To fix this, Brin and Page define their first adjustment, which we call the
stochasticity adjustment: the 0T rows of H are replaced with
(1/n)eT, thereby making H stochastic. Now, the random surfer can
hyperlink to any page at random. The resulting stochastic matrix is called S.
So,
S = H + a((1/n)eT)
where a is the dangling node vector (ai = 1 if page i is a dangling node,
and 0 otherwise).
This adjustment guarantees that S is stochastic, but it alone cannot
guarantee the convergence results desired. So a primitivity
adjustment was made to make the matrix irreducible and aperiodic
(so that a unique PageRank vector exists).
When the random surfer abandons the hyperlink method by
entering a new destination directly, he "teleports" to the new page,
where he begins hyperlink surfing again until the next teleportation,
and so on.
To model this activity mathematically, Brin and Page invented a new
matrix G, such that-
G = αS + (1 - α)(1/n)eeT
where:
α is the teleportation factor (damping factor), with α ∈ (0, 1)
G is called the Google matrix
E = (1/n)eeT is the teleportation matrix
The teleporting is random because E is uniform, meaning the
surfer is equally likely, when teleporting, to jump to any page.
So, Google's adjusted PageRank method is:
π(k+1)T = π(k)T G
which is simply the power method applied to G.
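The whole adjustment chain, H to S to G, followed by the power method, can be sketched with NumPy on a hypothetical three-page web (page 2 is a dangling node):

```python
import numpy as np

def google_pagerank(H, alpha=0.85, iters=100):
    """Power method on G = alpha*S + (1 - alpha)*(1/n)*e*e^T."""
    n = H.shape[0]
    S = H.copy()
    S[S.sum(axis=1) == 0] = 1.0 / n   # stochasticity adjustment: fix 0 rows
    G = alpha * S + (1 - alpha) / n   # primitivity adjustment (adds (1/n)ee^T)
    pi = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(iters):
        pi = pi @ G                   # one power-method step
    return pi

# Hypothetical web: 0 -> 1;  1 -> 0, 2;  2 has no outlinks (dangling).
H = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
pi = google_pagerank(H)
print(round(pi.sum(), 6))  # 1.0: the result is a probability distribution
```

Because G is stochastic and strictly positive, every component of pi stays positive, which resolves both the rank-sink and the cycle problems discussed above.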
HITS ALGORITHM
HITS (Hypertext Induced Topic Search) was invented by Jon Kleinberg
in 1998 and uses the Web's hyperlink structure to rank pages.
HITS produces two popularity scores and is query-dependent. HITS
thinks of web pages as authorities and hubs.
An authority is a page with many inlinks, and a hub is a page with
many outlinks.
The main criterion of HITS is: good authorities are pointed to by good
hubs, and good hubs point to good authorities.
Every page i has both an authority score xi and a hub score yi. If E is
the set of all directed edges in the web graph, then, for k = 1, 2, 3, ...:
xi(k) = sum over j with (j,i) in E of yj(k-1)
yi(k) = sum over j with (i,j) in E of xj(k)
given that each page has somehow been assigned an initial authority
score x(0) and hub score y(0).
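The mutually reinforcing iteration above can be sketched with an adjacency matrix: x(k) = L^T y(k-1) and y(k) = L x(k), normalized each round. The three-page graph is hypothetical:

```python
import numpy as np

def hits(L, iters=50):
    """Iterate authority scores x and hub scores y on adjacency matrix L,
    where L[i, j] = 1 iff page i links to page j."""
    n = L.shape[0]
    x = np.ones(n)
    y = np.ones(n)
    for _ in range(iters):
        x = L.T @ y        # a page is authoritative if good hubs point to it
        x /= x.sum()
        y = L @ x          # a page is a good hub if it points to authorities
        y /= y.sum()
    return x, y

# Hypothetical graph: pages 0 and 1 both point to page 2; page 2 points to 0.
L = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
x, y = hits(L)
print(int(x.argmax()))  # 2: page 2 is the strongest authority
```

Unlike PageRank, this computation would be run per query, on the neighbourhood graph built from the query's result set.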
HITS VERSUS PAGERANK
Scoring criteria: HITS holds that good authorities are pointed to by good hubs and good hubs point to good authorities; PageRank holds that a webpage is important if it is pointed to by other important pages.
Number of scores: HITS gives dual rankings, (a) one with the most authoritative documents related to the query and (b) another with the most "hubby" documents; PageRank presents only one score.
Query independence: the HITS score is calculated after building the neighbourhood graph for the query; the PageRank score is query-independent.
Resilience to spamming: HITS is susceptible to spamming, since the addition of pages can easily affect the rankings; since PageRank is able to isolate spam, it is resilient to spamming.
FUTURE WORK
Creating spam-resistant ranking algorithms-
The proposed algorithm considers each page one at a time and asks,
"What proportion of this page's outlinked pages point back to it?" If
this value exceeds a threshold, we can suspect the presence of a
link farm.
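That proportion test can be sketched directly; the graph and the page names here are hypothetical, and the threshold itself is left to the caller:

```python
def reciprocal_link_fraction(page, links):
    """Fraction of the pages that `page` links out to which link straight back.

    A value above some chosen threshold could flag a suspected link farm.
    """
    out = links.get(page, [])
    if not out:
        return 0.0
    back = sum(1 for q in out if page in links.get(q, []))
    return back / len(out)

# Hypothetical graph: "farm" and its satellites all point back at each other.
links = {"farm": ["p1", "p2"], "p1": ["farm"], "p2": ["farm"],
         "normal": ["p1"]}
print(reciprocal_link_fraction("farm", links))    # 1.0
print(reciprocal_link_fraction("normal", links))  # 0.0
```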
The second proposal is to build a score that is the "opposite" of
PageRank, called BadRank, for each page. The actual ranking would
then be done using the difference of these two quantities.
Intelligent Agent-
An intelligent agent is a software robot designed to retrieve specific
information automatically. Such crawlers need to be designed so that
they do not cause privacy issues.