Introduction to Information Retrieval
Introduction to
Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 20: Crawling
1
Introduction to Information Retrieval
Overview
❶ Recap
❷ A simple crawler
❸ A real crawler
2
Introduction to Information Retrieval
Outline
❶ Recap
❷ A simple crawler
❸ A real crawler
3
Introduction to Information Retrieval
Search engines rank content pages and ads
4
Introduction to Information Retrieval
Google’s second price auction
▪ bid: maximum bid for a click by advertiser
▪ CTR: click-through rate: when an ad is displayed, what
percentage of time do users click on it? CTR is a measure of
relevance.
▪ ad rank: bid × CTR: this trades off (i) how much money the
advertiser is willing to pay against (ii) how relevant the ad is
▪ paid: Second price auction: The advertiser pays the minimum
amount necessary to maintain their position in the auction
(plus 1 cent). 5
Introduction to Information Retrieval
What’s great about search ads
▪ Users only click if they are interested.
▪ The advertiser only pays when a user clicks on an ad.
▪ Searching for something indicates that you are more likely
to buy it . . .
▪ . . . in contrast to radio and newpaper ads.
6
Introduction to Information Retrieval
Near duplicate detection: Minimum of
permutation
document 1: {sk} document 2: {sk}
Roughly: We use as a test for: are
d1 and d2 near-duplicates?
7
Introduction to Information Retrieval
Example
h(x) = x mod 5
g(x) = (2x + 1) mod 5
final sketches
8
Introduction to Information Retrieval
Outline
❶ Recap
❷ A simple crawler
❸ A real crawler
9
Introduction to Information Retrieval
How hard can crawling be?
▪ Web search engines must crawl their documents.
▪ Getting the content of the documents is easier for many
other IR systems.
▪ E.g., indexing all files on your hard disk: just do a recursive
descent on your file system
▪ Ok: for web IR, getting the content of the documents takes
longer . . .
▪ . . . because of latency.
▪ But is that really a design/systems challenge?
10
Introduction to Information Retrieval
Basic crawler operation
▪ Initialize queue with URLs of known seed pages
▪ Repeat
▪ Take URL from queue
▪ Fetch and parse page
▪ Extract URLs from page
▪ Add URLs to queue
▪ Fundamental assumption: The web is well linked.
11
Introduction to Information Retrieval
Exercise: What’s wrong with this crawler?
urlqueue := (some carefully selected set of seed urls)
while urlqueue is not empty:
myurl := [Link]()
mypage := [Link]()
[Link](myurl)
newurls := [Link]()
for myurl in newurls:
if myurl not in fetchedurls and not in urlqueue:
[Link](myurl)
addtoinvertedindex(mypage)
12
Introduction to Information Retrieval
What’s wrong with the simple crawler
▪ Scale: we need to distribute.
▪ We can’t index everything: we need to subselect. How?
▪ Duplicates: need to integrate duplicate detection
▪ Spam and spider traps: need to integrate spam detection
▪ Politeness: we need to be “nice” and space out all requests
for a site over a longer period (hours, days)
▪ Freshness: we need to recrawl periodically.
▪ Because of the size of the web, we can do frequent recrawls
only for a small subset.
▪ Again, subselection problem or prioritization
13
Introduction to Information Retrieval
Magnitude of the crawling problem
▪ To fetch 20,000,000,000 pages in one month . . .
▪ . . . we need to fetch almost 8000 pages per second!
▪ Actually: many more since many of the pages we attempt to
crawl will be duplicates, unfetchable, spam etc.
14
Introduction to Information Retrieval
What a crawler must do
Be polite
▪ Don’t hit a site too often
▪ Only crawl pages you are allowed to crawl: [Link]
Be robust
▪ Be immune to spider traps, duplicates, very large pages, very
large websites, dynamic pages etc
15
Introduction to Information Retrieval
[Link]
▪ Protocol for giving crawlers (“robots”) limited access to a
website, originally from 1994
▪ Examples:
▪ User-agent: *
Disallow: /yoursite/temp/
▪ User-agent: searchengine
Disallow: /
▪ Important: cache the [Link] file of each site we are crawling
16
Introduction to Information Retrieval
Example of a [Link] ([Link])
User-agent: PicoSearch/1.0
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
User-agent: *
Disallow: /news/information/knight/
Disallow: /nidcd/
...
Disallow: /news/research_matters/secure/
Disallow: /od/ocpl/wag/
Disallow: /ddir/
Disallow: /sdminutes/
17
Introduction to Information Retrieval
What any crawler should do
▪ Be capable of distributed operation
▪ Be scalable: need to be able to increase crawl rate by adding
more machines
▪ Fetch pages of higher quality first
▪ Continuous operation: get fresh version of already crawled
pages
18
Introduction to Information Retrieval
Outline
❶ Recap
❷ A simple crawler
❸ A real crawler
19
Introduction to Information Retrieval
URL frontier
20
Introduction to Information Retrieval
URL frontier
▪ The URL frontier is the data structure that holds and
manages URLs we’ve seen, but that have not been crawled
yet.
▪ Can include multiple pages from the same host
▪ Must avoid trying to fetch them all at the same time
▪ Must keep all crawling threads busy
21
Introduction to Information Retrieval
Basic crawl architecture
22
Introduction to Information Retrieval
URL normalization
▪ Some URLs extracted from a document are relative URLs.
▪ E.g., at [Link] we may have [Link]
▪ This is the same as: [Link]
▪ During parsing, we must normalize (expand) all relative URLs.
23
Introduction to Information Retrieval
Content seen
▪ For each page fetched: check if the content is already in the
index
▪ Check this using document fingerprints or shingles
▪ Skip documents whose content has already been indexed
24
Introduction to Information Retrieval
Distributing the crawler
▪ Run multiple crawl threads, potentially at different nodes
▪ Usually geographically distributed nodes
▪ Partition hosts being crawled into nodes
25
Introduction to Information Retrieval
Google data centers (wazfaring. com)
26
Introduction to Information Retrieval
Distributed crawler
27
Introduction to Information Retrieval
URL frontier: Two main considerations
▪ Politeness: Don’t hit a web server too frequently
▪ E.g., insert a time gap between successive requests to the
same server
▪ Freshness: Crawl some pages (e.g., news sites) more often
than others
▪ Not an easy problem: simple priority queue fails.
28
Introduction to Information Retrieval
Mercator URL frontier
29
Introduction to Information Retrieval
Mercator URL frontier
▪ URLs flow in from the top
into the frontier.
30
Introduction to Information Retrieval
Mercator URL frontier
▪ URLs flow in from the top
into the frontier.
▪ Front queues manage
prioritization.
31
Introduction to Information Retrieval
Mercator URL frontier
▪ URLs flow in from the top
into the frontier.
▪ Front queues manage
prioritization.
▪ Back queues enforce
politeness.
32
Introduction to Information Retrieval
Mercator URL frontier
▪ URLs flow in from the top
into the frontier.
▪ Front queues manage
prioritization.
▪ Back queues enforce
politeness.
▪ Each queue is FIFO.
33
Introduction to Information Retrieval
Mercator URL frontier: Front queues
34
Introduction to Information Retrieval
Mercator URL frontier: Front queues
▪ Prioritizer assigns to URL
an integer priority
between 1 and F.
35
Introduction to Information Retrieval
Mercator URL frontier: Front queues
▪ Prioritizer assigns to URL
an integer priority
between 1 and F.
▪ Then appends URL to
corresponding queue
36
Introduction to Information Retrieval
Mercator URL frontier: Front queues
▪ Prioritizer assigns to URL
an integer priority
between 1 and F.
▪ Then appends URL to
corresponding queue
▪ Heuristics for assigning
priority: refresh rate,
PageRank etc
37
Introduction to Information Retrieval
Mercator URL frontier: Front queues
▪ Selection from front
queues is initiated by
back queues
▪ Pick a front queue from
which to select next
URL: Round robin,
randomly, or more
sophisticated variant
▪ But with a bias in favor
of high-priority front
queues
38
Introduction to Information Retrieval
Mercator URL frontier: Back queues
39
Introduction to Information Retrieval
Mercator URL frontier: Back queues
▪ Invariant 1. Each back
queue is kept non-
empty while the crawl is
in progress.
▪ Invariant 2. Each back
queue only contains
URLs from a single host.
▪ Maintain a table from
hosts to back queues.
40
Introduction to Information Retrieval
Mercator URL frontier: Back queues
▪ In the heap:
▪ One entry for each back
queue
▪ The entry is the earliest
time te at which the host
corresponding to the
back queue can be hit
again.
▪ The earliest time te is
determined by (i) last
access to that host (ii)
time gap heuristic
41
Introduction to Information Retrieval
Mercator URL frontier: Back queues
▪ How fetcher interacts
with back queue:
▪ Repeat (i) extract
current root q of the
heap (q is a back queue)
▪ and (ii) fetch URL u at
head of q . . .
▪ . . . until we empty the q
we get.
▪ (i.e.: u was the last URL
in q)
42
Introduction to Information Retrieval
Mercator URL frontier: Back queues
▪ When we have emptied
a back queue q:
▪ Repeat (i) pull URLs u
from front queues and
(ii) add u to its
corresponding back
queue . . .
▪ . . . until we get a u
whose host does not
have a back queue.
▪ Then put u in q and
create heap entry for it.
43
Introduction to Information Retrieval
Mercator URL frontier
▪ URLs flow in from the
top into the frontier.
▪ Front queues manage
prioritization.
▪ Back queues enforce
politeness.
44
Introduction to Information Retrieval
Spider trap
▪ Malicious server that generates an infinite sequence of
linked pages
▪ Sophisticated spider traps generate pages that are not easily
identified as dynamic.
45
Introduction to Information Retrieval
Resources
▪ Chapter 20 of IIR
▪ Resources at [Link]
▪ Paper on Mercator by Heydon et al.
▪ Robot exclusion standard
46