Prepared By : Mayank Yadav Big Data Analytics

4 - Link Analysis
⭐ Explain Page Rank Algorithm.
PageRank is an algorithm used by search engines (most famously, Google) to determine the
importance or relevance of web pages.

It assigns a numerical weight to each page, representing its relative importance within the web's
hyperlink structure. A higher PageRank indicates a more important page. It's a key factor in how
search engines rank search results.

PageRank is based on the idea that important pages are linked to by other important pages. A
page that receives many links from high-ranking pages is considered more important than a
page that receives few links or links from low-ranking pages. It's a recursive definition – a page's
importance depends on the importance of the pages that link to it.

The PageRank Algorithm:

1.​ Initial PageRank: Initially, each page is assigned an equal PageRank value (e.g., 1/N, where N is the total number of pages).

2.​ Iteration: The PageRank of each page is iteratively updated based on the PageRank of the pages that link to it. The basic formula is:

PR(A) = (1-d) + d * (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))

○​ PR(A) is the PageRank of page A.
○​ d is a damping factor (usually between 0 and 1, often around 0.85). It represents the probability that a user will continue clicking on links rather than jumping to a random page.
○​ T1, T2, ..., Tn are the pages that link to page A.
○​ PR(Ti) is the PageRank of page Ti.
○​ C(Ti) is the number of outgoing links from page Ti.

3.​ Normalization: After each iteration, the PageRank values are normalized so that they sum up to 1 (or some other constant). This is important for convergence.

4.​ Convergence: The iterations continue until the PageRank values converge, meaning they don't change significantly from one iteration to the next.

Intuition Behind the Formula:

●​ (1-d): This represents the probability that a user will jump to a random page. Even if a
page has no incoming links, it still has a small chance of being visited.
●​ d * (...): This represents the probability that a user will reach page A by clicking on links.
The sum is over all pages that link to A.
●​ PR(Ti)/C(Ti): This term represents the contribution of page Ti to the PageRank of page
A. A page with a high PageRank and few outgoing links will contribute more to the
PageRank of the pages it links to.

Example:

Let's say we have three pages: A, B, and C.

●​ A links to B and C.
●​ B links to C.
●​ C links to A.

Initially, each page has a PageRank of 1/3.

After one iteration (and simplifying the formula for this small example, ignoring the damping
factor for illustration):

●​ PR(A) = PR(C)/1 = 1/3


●​ PR(B) = PR(A)/2 = 1/6
●​ PR(C) = PR(A)/2 + PR(B)/1 = 1/6 + 1/6 = 1/3

We would continue iterating until the PageRank values stabilize.
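The iteration above can be written as a short sketch in plain Python. This is only a minimal illustration of the formula PR(A) = (1-d) + d * Σ PR(T)/C(T); the graph dictionary, tolerance, and iteration cap are chosen for this small example, not taken from any real system.

```python
# Minimal PageRank sketch for the three-page example (A -> B, C; B -> C; C -> A).
# Applies PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over pages q linking to p
# until the values stop changing.

def pagerank(links, d=0.85, tol=1e-8, max_iter=200):
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}      # equal initial rank
    for _ in range(max_iter):
        new_pr = {}
        for p in pages:
            # contributions PR(q) / C(q) from every page q that links to p
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        converged = max(abs(new_pr[p] - pr[p]) for p in pages) < tol
        pr = new_pr
        if converged:
            break
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

With d = 0.85 the values settle with C ranked highest (it receives links from both A and B) and B lowest (it receives only half of A's rank).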

Importance of PageRank:

●​ Search Engine Ranking: PageRank is a crucial factor in how search engines rank
search results. Pages with higher PageRank are more likely to appear higher in the
search results.
●​ Web Navigation: PageRank can be used to understand the importance of different
pages within a website or the web as a whole.
●​ Information Retrieval: PageRank can be used to identify relevant documents in a
collection.
●​ Social Network Analysis: The PageRank concept can be applied to social networks to
identify influential users.

⭐ Explain how Page Rank is used in a search engine.


How PageRank Works (Simplified):

1.​ Initial PageRank: Every page starts with an equal PageRank score. Imagine distributing
a fixed amount of "importance" evenly across all pages.​

2.​ Iteration: The algorithm repeatedly updates each page's PageRank based on the links it
receives. Here's the basic idea:​

○​ A page's PageRank increases if it's linked to by pages with high PageRank.


○​ A page's PageRank decreases if it links to many other pages (it's "sharing" its
importance).
3.​ Mathematical Formula: The actual calculation is a bit more complex, involving a
damping factor (representing a user's likelihood of continuing to click links) and
normalization (to keep the total "importance" constant).​

4.​ Convergence: The algorithm keeps iterating until the PageRank scores stabilize,
meaning they don't change much from one round to the next.​

PageRank in Search Engines:

Search engines use PageRank as one of many factors to determine the order of search results.
When you search for something:

1.​ The search engine finds pages containing your keywords.


2.​ It then considers factors like:
○​ Content relevance: How well the page's content matches your search.
○​ PageRank: The importance of the page, as determined by the PageRank
algorithm.
○​ Other factors: Hundreds of other signals, including website quality, user
experience, and more.
3.​ The search engine combines these factors to rank the pages and display the most
relevant and important ones first.
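As a toy illustration of step 3, a hypothetical engine might blend a relevance score with PageRank using fixed weights. The weights, page names, and scores below are invented for illustration only; real engines combine hundreds of signals in proprietary ways.

```python
# Hypothetical blend of content relevance and PageRank into a final score.
# The 0.7 / 0.3 weights are illustrative assumptions, not real values.

def final_score(relevance, pagerank, w_rel=0.7, w_pr=0.3):
    return w_rel * relevance + w_pr * pagerank

# (relevance, pagerank) pairs for two candidate pages
candidates = {"page1": (0.9, 0.2), "page2": (0.6, 0.8)}
ranked = sorted(candidates, key=lambda p: final_score(*candidates[p]), reverse=True)
```

Here page1 wins on relevance despite its lower PageRank, showing why PageRank is one factor among several rather than the whole ranking.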

⭐ Explain the efficient computation of Page Rank using MapReduce.


The iterative nature of PageRank calculations makes it well-suited for distributed processing.

Data Representation:

We represent the web graph as a set of key-value pairs.

●​ Key: A web page URL (or ID).


●​ Value: A list of URLs (or IDs) that the key page links to (its outgoing links).

Map Phase:

●​ Input: Each mapper receives a chunk of the web graph data (a set of key-value pairs
representing pages and their outgoing links). It also receives the current PageRank
values for all pages.
●​ Processing: For each page i in its input, the mapper does the following:
1.​ For each outgoing link from page i to page j, the mapper emits a key-value pair:
■​ Key: Page j (the page being linked to).
■​ Value: PR(i) / C(i) (where PR(i) is the current PageRank of page i,
and C(i) is the number of outgoing links from page i).
2.​ The mapper also emits a key-value pair for page i itself, to carry over its current
PageRank value. This will be used in the reduce phase. The key is i and the
value can be a special marker or the existing PR(i).

Shuffle and Sort:

The MapReduce framework shuffles and sorts the intermediate key-value pairs emitted by the
mappers. All pairs with the same key (a page URL) are grouped together and sent to the same
reducer.

Reduce Phase:

●​ Input: Each reducer receives all the values associated with a particular page j. These
values will be of the form PR(i) / C(i) (contributions from pages linking to j) and
possibly the page j's current PR value.​

●​ Processing: The reducer calculates the new PageRank for page j using the PageRank formula:

PR(j) = (1-d) + d * Σ (PR(i) / C(i))

Where the sum is over all pages i that link to page j. The reducer also receives the current PR(j) but uses the above formula to update it.

●​ Output: The reducer emits the updated PageRank for page j as a key-value pair:​

○​ Key: Page j.
○​ Value: PR(j) (the new PageRank).

Iteration:

The output of the reduce phase (the updated PageRank values) becomes the input for the next
iteration of the MapReduce job. This process is repeated until the PageRank values converge
(meaning they don't change significantly from one iteration to the next).

Normalization:

After each iteration (or periodically), the PageRank values are normalized to ensure they sum
up to 1 (or a constant). This is typically done in a separate MapReduce job.

Example : Let's say page A links to B and C, B links to C, and C links to A.

●​ Map:​

○​ Mapper for A: emits (B, PR(A)/2), (C, PR(A)/2), (A, PR(A))


○​ Mapper for B: emits (C, PR(B)/1), (B, PR(B))
○​ Mapper for C: emits (A, PR(C)/1), (C, PR(C))
●​ Shuffle & Sort:​

○​ (A, PR(A)), (A, PR(C)/1)


○​ (B, PR(B)), (B, PR(A)/2)
○​ (C, PR(C)), (C, PR(A)/2), (C, PR(B)/1)
●​ Reduce:​

○​ Reducer for A: calculates PR(A) = (1-d) + d * (PR(C)/1)


○​ Reducer for B: calculates PR(B) = (1-d) + d * (PR(A)/2)
○​ Reducer for C: calculates PR(C) = (1-d) + d * (PR(A)/2 + PR(B)/1)

This process repeats until convergence.
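The map, shuffle, and reduce steps above can be simulated in memory. This sketch mirrors the worked example (same graph: A links to B and C, B to C, C to A) rather than running on a real Hadoop cluster; the "rank"/"contrib" tags are an assumed convention for the special marker mentioned in the map phase.

```python
from collections import defaultdict

# One PageRank iteration written in MapReduce style (simulated in memory).
# graph: page -> outgoing links; pr: current PageRank values.

def map_phase(graph, pr):
    for page, outlinks in graph.items():
        yield page, ("rank", pr[page])                 # carry current PR forward
        for target in outlinks:
            yield target, ("contrib", pr[page] / len(outlinks))

def reduce_phase(grouped, d=0.85):
    new_pr = {}
    for page, values in grouped.items():
        contribs = sum(v for tag, v in values if tag == "contrib")
        new_pr[page] = (1 - d) + d * contribs
    return new_pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {p: 1 / 3 for p in graph}

grouped = defaultdict(list)                            # shuffle & sort
for key, value in map_phase(graph, pr):
    grouped[key].append(value)
pr = reduce_phase(grouped)
```

Starting from equal ranks of 1/3, one pass gives PR(C) = 0.15 + 0.85 * (1/6 + 1/3) = 0.575; in a real job this output would be fed back in as the next iteration's input.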

Advantages of MapReduce for PageRank:

●​ Scalability: Can handle massive web graphs by distributing the computation across a
cluster.
●​ Parallelism: The map and reduce phases can be executed in parallel, significantly
speeding up the calculation.
●​ Fault Tolerance: The MapReduce framework handles task failures automatically.

⭐ What is Link Spam? Explain the architecture of a Spam Farm.


Link spam refers to backlinks created with the purpose of manipulating the rankings of a website
in organic search results.

The goal is to trick search engines into ranking a page higher in search results than it deserves
based on its actual content and relevance. Link spam can take many forms, but the core idea is
to create a large number of low-quality or irrelevant links pointing to the target page.

Why is Link Spam a Problem? Link spam undermines the integrity of search engine results. It
can lead to:

●​ Lower quality search results: Users are presented with irrelevant or low-quality pages.
●​ Wasted time and effort: Users have to sift through spam to find what they're looking for.
●​ Unfair competition: Legitimate websites can be penalized by the presence of spam in
search results.

What is a Spam Farm?

A spam farm is a network of websites specifically created for the purpose of link spamming. These websites are often low-quality, containing duplicate content, automatically generated text, or other forms of spam. The spam farm owner controls all the sites and uses them to create a large number of links pointing to the target website.

Architecture of a Spam Farm: A spam farm typically has the following components:

1.​ Target Website: The website that the spam farm owner wants to boost in search
rankings.​

2.​ Source Websites (Spam Sites): A large number of websites created solely for the
purpose of linking to the target website. These sites often have:​

○​ Low-quality content: Duplicate content, spun content (automatically rewritten text), or completely nonsensical text.
○​ Hidden links: Links to the target website might be hidden in the code or placed
in inconspicuous locations.
○​ Lack of real value: These sites are not intended to be useful to users.

3.​ Link Network: The network of links connecting the source websites to the target
website. This network can be structured in various ways:​

○​ Direct Links: Each source website directly links to the target website.
○​ Indirect Links: Links might pass through intermediate websites to make the
spam less obvious.
○​ Reciprocal Links: Websites in the spam farm might link to each other to create
a dense network of interlinking, further attempting to manipulate link metrics.
4.​ Control Mechanism: The spam farm owner uses some mechanism to manage the large
number of source websites and their links. This could involve:​

○​ Automated tools: Software to create and manage the spam websites.


○​ Content generation tools: Software to create spun or duplicate content.
○​ Link management tools: Software to manage the links between the websites.

How Spam Farms Try to Manipulate Search Engines:

Spam farms try to exploit the algorithms used by search engines to determine the importance of
web pages. By creating a large number of links pointing to the target website, the spam farm
owner hopes to:

●​ Increase PageRank: Inflate the PageRank of the target website, making it appear more
important to search engines.
●​ Manipulate Anchor Text: Use specific keywords in the anchor text of the links to make
the target website appear relevant to those keywords.

How Search Engines Combat Spam Farms:

Search engines are constantly evolving their algorithms to detect and penalize spam farms.
Techniques include:

●​ Link analysis: Identifying suspicious link patterns that are characteristic of spam farms.
●​ Content analysis: Detecting low-quality or duplicate content.
●​ Identifying spam farm owners: Tracking down the individuals or organizations behind
spam farms.
●​ Manual review: Human reviewers might manually check websites to identify spam.

⭐ Explain Topic-Sensitive Page Rank.


Topic-Sensitive PageRank (TSPR) is an extension of the traditional PageRank algorithm that
takes into account the topical relevance of web pages.

While standard PageRank measures the overall importance of a page based on its link
structure, TSPR considers the importance of a page within a specific topic. This allows for
more personalized and contextually relevant search results.

Why Topic-Sensitive PageRank?

Traditional PageRank treats all links equally. However, a link from a page about "cats" to a page
about "dogs" might be less relevant than a link from a page about "dog breeds" to the same
"dogs" page. TSPR addresses this by weighting links based on the topical similarity between the
linking page and the linked page.

TSPR assigns different PageRank scores to each page for different topics. A page might have a
high PageRank for the "dogs" topic but a low PageRank for the "cats" topic. The algorithm
considers both the link structure and the topical relevance of the links when calculating these
topic-specific PageRank scores.

How Topic-Sensitive PageRank Works :

1.​ Topic Classification: Each web page is assigned to one or more topics. This can be
done using text classification techniques, analyzing the page's content, or using other
methods.​

2.​ Link Weighting: The weight of a link between two pages is determined based on the
topical similarity between the pages. Links between pages on the same or related topics
have higher weights than links between pages on unrelated topics. Various methods can
be used to measure topical similarity, such as cosine similarity between topic vectors or
shared keywords.

3.​ Topic-Specific PageRank Calculation: The PageRank algorithm is then modified to incorporate these topic-specific link weights. The basic formula is similar to the standard PageRank formula, but the contribution of each link is now weighted by the topical similarity. For a given topic t:

PR_t(A) = (1-d) + d * Σ (similarity(Ti, A, t) * PR_t(Ti) / C(Ti))

○​ PR_t(A) is the PageRank of page A for topic t.
○​ similarity(Ti, A, t) is the topical similarity between page Ti and page A with respect to topic t. This could be 1 if the pages are both relevant to topic t and 0 otherwise, or it could be a more nuanced similarity score.
○​ C(Ti) is the number of outgoing links from page Ti, as in standard PageRank.

4.​ Iteration and Convergence: The algorithm iterates until the topic-specific PageRank values converge.

Example : Imagine three pages:

●​ Page A: About "dog breeds."


●​ Page B: About "dogs" (general).
●​ Page C: About "cats."

Pages A and B are topically related, while A and C are not. Page A links to B, and Page B links to C.

●​ Standard PageRank would treat the link from B to C the same as the link from A to B.
●​ TSPR, for the "dogs" topic, would give more weight to the link from A to B because they
are both about dogs. The link from B to C would receive less weight (or perhaps zero
weight if "cats" is considered completely unrelated to "dogs").
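The similarity-weighted update can be sketched for the "dogs" topic in this example. The similarity values below are assumptions chosen to match the scenario (1.0 for the dog-related link from A to B, 0.0 for the link into the cats page), and the starting scores are arbitrary.

```python
# One topic-sensitive update, following the similarity-weighted formula:
#   PR_t(A) = (1 - d) + d * sum(similarity(Ti, A, t) * PR_t(Ti) / C(Ti))
# All similarity and starting values here are illustrative assumptions.

def tspr_update(page, in_links, pr_t, out_count, sim, d=0.85):
    total = sum(sim[(t, page)] * pr_t[t] / out_count[t]
                for t in in_links.get(page, []))
    return (1 - d) + d * total

in_links = {"B": ["A"], "C": ["B"]}           # A -> B, B -> C
out_count = {"A": 1, "B": 1}
pr_t = {"A": 1.0, "B": 1.0}                   # current "dogs" scores
sim = {("A", "B"): 1.0, ("B", "C"): 0.0}      # A and B about dogs; C about cats

pr_b = tspr_update("B", in_links, pr_t, out_count, sim)   # full weight from A
pr_c = tspr_update("C", in_links, pr_t, out_count, sim)   # zero-weight link from B
```

Page B keeps the full contribution from A, while page C receives only the (1-d) baseline because its incoming link has zero topical weight.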

Use in Search Engines:

When a user searches for something, the search engine can use the topic of the search query
to retrieve the topic-specific PageRank scores for the pages in its index. This allows the search
engine to rank pages not only by their general importance but also by their relevance to the
user's search topic.

⭐ Explain Hubs and Authorities.


Hubs and Authorities is a link analysis algorithm used to determine the importance of web
pages. It was developed by Jon Kleinberg and is also known as the HITS algorithm
(Hyperlink-Induced Topic Search).

The algorithm is based on the idea that there are two types of important pages on the web:

●​ Hubs: Pages that link to many authoritative pages on a particular topic. They act as
directories or compilations of useful resources.
●​ Authorities: Pages that contain high-quality information on a particular topic and are
linked to by many hubs.

The HITS algorithm assigns two scores to each page: a hub score and an authority score.
These scores are calculated iteratively based on the link structure of the web.

How it Works:

1.​ Initial Scores: Each page is initially assigned a hub score and an authority score of 1.​

2.​ Iteration: The algorithm iteratively updates the hub and authority scores of each page
based on the following rules:​

○​ Authority Update: A page's authority score is the sum of the hub scores of the
pages that link to it.
○​ Hub Update: A page's hub score is the sum of the authority scores of the pages
it links to.
3.​ Normalization: After each iteration, the hub and authority scores are normalized to
prevent them from growing too large.​

4.​ Convergence: The iterations continue until the hub and authority scores converge,
meaning they don't change significantly from one iteration to the next.​

Intuition:

●​ A good hub is a page that links to many good authorities.


●​ A good authority is a page that is linked to by many good hubs.

This is a recursive definition, as the hub score of a page depends on the authority scores of the
pages it links to, and the authority score of a page depends on the hub scores of the pages that
link to it.
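The mutually recursive updates can be written as a short sketch; the tiny link graph below is invented purely for illustration.

```python
import math

# Minimal HITS sketch: alternate the authority and hub updates,
# normalizing each round so the scores stay bounded.

def hits(links, iters=50):
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority update: sum of hub scores of pages that link in
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        # hub update: sum of authority scores of pages linked to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # normalize so the scores do not grow without bound
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"hub1": ["page1", "page2"], "hub2": ["page1"]}
hub, auth = hits(links)
```

Here page1, linked to by both hubs, ends up with the highest authority score, and hub1, which links to both authorities, ends up the stronger hub.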

Use in Search Engines:

The HITS algorithm can be used in search engines to identify relevant and authoritative pages
for a given query. When a user submits a query, the search engine can:

1.​ Identify a set of relevant pages: This can be done using traditional information retrieval
techniques.
2.​ Construct a subgraph: A subgraph is created that includes the relevant pages and the
pages that link to them or are linked to by them.
3.​ Run the HITS algorithm: The HITS algorithm is run on the subgraph to calculate the
hub and authority scores for each page.
4.​ Rank the pages: The pages are ranked based on their authority scores, with pages with
higher authority scores being considered more relevant.

Advantages of Hubs and Authorities:

●​ Considers both link structure and topical context: Because the algorithm is run on a query-specific subgraph, its scores reflect both the link structure and the topic of the query.
●​ Identifies different types of important pages: The algorithm can identify both hubs
(pages that link to many authoritative pages) and authorities (pages that contain
high-quality information).

Limitations of Hubs and Authorities:

●​ Computationally expensive: The algorithm can be computationally expensive to run on large graphs.
●​ Query-dependent: The hub and authority scores of a page can vary depending on the
query.
●​ Susceptible to manipulation: The algorithm can be manipulated by creating spam
farms or other link schemes.

⭐ Explain SimRank Algorithm.


SimRank is a similarity measure algorithm used to assess the similarity between nodes in a
graph.

Unlike algorithms like PageRank, which focus on influence or importance, SimRank measures
how structurally similar two nodes are based on their connections and the connections of their
neighbors. The core idea is: "two nodes are similar if their neighbors are similar."

Two nodes are considered similar if they are connected to similar nodes. This is a recursive definition because the similarity between two nodes depends on the similarity between their neighbors, which in turn depends on the similarity between their neighbors, and so on.

How SimRank Works :

1.​ Initialization: The similarity between a node and itself is initialized to 1. The similarity between any two distinct nodes is initialized to 0.

2.​ Iteration: The algorithm iteratively updates the similarity scores between nodes based on the following rule: the similarity between two nodes a and b is proportional to the average similarity between their neighbors.

S(a, b) = C * Σ_{(ia, ib)} (S(ia, ib) / (|N(a)| * |N(b)|))

○​ S(a, b) is the similarity between nodes a and b.
○​ C is a damping factor (between 0 and 1, similar to PageRank). It controls how much weight is given to distant neighbors.
○​ N(a) is the set of neighbors of node a.
○​ |N(a)| is the number of neighbors of node a.
○​ Σ_{(ia, ib)} represents the sum over all pairs of neighbors ia of a and ib of b.
○​ S(ia, ib) is the similarity between the neighbors ia and ib, which has been calculated in the previous iteration.

3.​ Normalization (Optional): Similarity scores can be normalized to a range (e.g., 0 to 1).

4.​ Convergence: The iterations continue until the similarity scores converge, meaning they don't change significantly from one iteration to the next.

Intuition: Imagine two users on a social network. If they both have many friends in common,
they are likely to be similar. SimRank captures this idea by considering not only the direct
connections but also the connections of their connections, and so on.

Example :

Let's say we have four nodes: A, B, C, and D.

●​ A is connected to C.
●​ B is connected to C and D.
●​ C is connected to A and B.
●​ D is connected to B.

Initially, S(A,A) = S(B,B) = S(C,C) = S(D,D) = 1, and all other similarities are 0.

After one iteration (simplified, ignoring the damping factor and normalization for illustration):

●​ S(A,B) would be proportional to S(C,C) / (1 * 2) = 1/2 (because C is a common neighbor).
●​ S(A,C) would be influenced by S(A,A) and S(C,C).
●​ S(B,C) would be influenced by S(B,B) and S(C,C).
●​ S(B,D) would be influenced by S(B,B) and S(D,D).

These calculations would continue iteratively until the similarity scores converge.
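The update rule can be sketched on the four-node example above. This is a plain-Python illustration; C = 0.8 is an assumed damping value, and the neighbor lists are taken directly from the example.

```python
# SimRank sketch on the four-node example (edges A-C, B-C, B-D, undirected).
# S(a, b) = C * sum over neighbor pairs of S(ia, ib) / (|N(a)| * |N(b)|)

def simrank(neighbors, C=0.8, iters=10):
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0          # a node is fully similar to itself
                elif neighbors[a] and neighbors[b]:
                    total = sum(sim[(ia, ib)]
                                for ia in neighbors[a] for ib in neighbors[b])
                    new[(a, b)] = C * total / (len(neighbors[a]) * len(neighbors[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

neighbors = {"A": ["C"], "B": ["C", "D"], "C": ["A", "B"], "D": ["B"]}
sim = simrank(neighbors)
```

A and B end up similar because they share the neighbor C, and the scores are symmetric by construction, matching the hand calculation above.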

Use Cases of SimRank:

●​ Recommendation Systems: Recommending items to users based on the similarity of their preferences to other users.
●​ Social Network Analysis: Identifying users with similar interests or connections.
●​ Link Prediction: Predicting potential connections between nodes in a network.
●​ Duplicate Detection: Identifying similar documents or web pages.

⭐ What is the Use of Combiners to Consolidate the Result Vector? (Unit-3)


Combiners are a powerful optimization technique in MapReduce that can significantly
improve performance, especially when dealing with large datasets.

They act as a "mini-reducer" at the map stage, consolidating intermediate key-value pairs
before they are sent to the reducers. This reduces the amount of data that needs to be shuffled
across the network, which is often a major bottleneck in MapReduce jobs.

The Problem: In a typical MapReduce job, mappers generate intermediate key-value pairs. All
pairs with the same key are then shuffled across the network to the appropriate reducer. If the
mappers produce a large number of values for the same key, this shuffling can be very
time-consuming.

The Solution: Combiners

A combiner is a function that runs on the output of each mapper before the shuffle phase. It
performs a local aggregation or reduction on the intermediate key-value pairs generated by that
specific mapper. The combiner's output is then sent to the reducer, rather than all the individual
key-value pairs from the mapper.

How Combiners Consolidate the Result Vector (with Example):

Let's say you're doing a word count. Your mappers are processing documents, and they emit
key-value pairs where the key is a word and the value is 1 (representing one occurrence of the
word).

●​ Without Combiners: If the word "the" appears 1000 times in a document processed by
one mapper, that mapper will emit 1000 key-value pairs: ("the", 1), ("the", 1), ...,
("the", 1). All 1000 of these pairs will be shuffled to the reducer.​

●​ With Combiners: The combiner running on that mapper's output will see all the
("the", 1) pairs. It can consolidate them into a single pair: ("the", 1000). Now,
only one key-value pair needs to be shuffled to the reducer, dramatically reducing the
network traffic.​

Benefits of Using Combiners:

●​ Reduced Network Traffic: The most significant benefit. By reducing the amount of data
shuffled, you decrease network congestion and improve overall job performance.
●​ Faster Shuffle Phase: Less data to shuffle means the shuffle phase completes more
quickly.
●​ Faster Reduce Phase (Indirectly): Reducers have less data to process, which can
speed up the reduce phase as well.

When to Use Combiners:

Combiners are most effective when the reduce operation is commutative and associative. This
means that the order in which the values are combined doesn't matter, and the combination can
be done in chunks. Examples include:

●​ Sum: sum(1, 1, 1, ...) = sum(sum(1, 1), sum(1, 1), ...)
●​ Count: a total count is the sum of partial counts, so each combiner can emit its local count and the reducer simply adds them.
●​ Max/Min: max(10, 5, 20) = max(max(10, 5), 20)

Combiners are not suitable for operations where the order matters, like calculating a median or
standard deviation.
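The word-count scenario above can be sketched with the combiner as a local pre-aggregation step. The documents and function names below are made up for illustration; a real Hadoop combiner would be a Reducer subclass, not a plain function.

```python
from collections import Counter

# Word count with a combiner: each mapper's pairs are pre-aggregated locally,
# so one ("the", n) pair per mapper crosses the network instead of n pairs.

def mapper(document):
    return [(word, 1) for word in document.split()]

def combiner(pairs):
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())        # local partial counts

def reducer(all_pairs):
    totals = Counter()
    for word, n in all_pairs:
        totals[word] += n              # summing partial counts is associative
    return dict(totals)

docs = ["the cat sat on the mat", "the dog"]
shuffled = [pair for d in docs for pair in combiner(mapper(d))]
result = reducer(shuffled)
```

For the first document the mapper emits six pairs but the combiner forwards only five (one per distinct word), and the reducer still arrives at the correct totals.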

See you in the next unit ! 👋
