Chapter 6: Query Languages
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Keyword-based Querying
Queries are combinations of words.
The document collection is searched for documents that contain
these words.
Word queries are intuitive, easy to express and provide fast
ranking.
The concept of word must be defined:
A word is a sequence of letters terminated by a separator (period,
comma, space, etc).
Definition of letter and separator is flexible; e.g., hyphen could be
defined as a letter or as a separator.
Usually, common words (such as “a”, “the”, “of”, …) are ignored.
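A minimal sketch of this word definition, assuming letters are a-z only and using a tiny illustrative stop-word list (the names `STOP_WORDS`, `tokenize`, and `hyphen_is_letter` are hypothetical, not from the slides):

```python
import re

# Illustrative stop-word list; real systems use longer, language-specific lists.
STOP_WORDS = {"a", "the", "of"}

def tokenize(text, hyphen_is_letter=False):
    """Split text into lowercase words and drop common words."""
    # The hyphen is treated either as a letter or as a separator.
    pattern = r"[a-z\-]+" if hyphen_is_letter else r"[a-z]+"
    words = re.findall(pattern, text.lower())
    return [w for w in words if w not in STOP_WORDS]

text = "The state-of-the-art index, a summary."
as_separator = tokenize(text)
as_letter = tokenize(text, hyphen_is_letter=True)
```

With the hyphen as a separator, "state-of-the-art" yields the words "state" and "art"; with the hyphen as a letter, it stays one word.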
Single-word Queries
A query is a single word.
Usually used for searching in document images.
Simplest form of query.
What are the possible documents retrieved as relevant?
All documents that include this word are retrieved.
On what basis are documents ranked?
Documents may be ranked by the frequency of the query word in
the document.
Documents with more occurrences of the query word are given
higher priority.
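Frequency-based ranking for a single-word query might look like the sketch below. It assumes whitespace-separated, case-insensitive words; the function name and the sample documents are made up for illustration.

```python
def rank_single_word(query_word, documents):
    """Return (doc_id, frequency) pairs for documents that contain the
    query word, most frequent first."""
    scores = {}
    for doc_id, text in documents.items():
        count = text.lower().split().count(query_word.lower())
        if count > 0:  # documents without the word are not retrieved
            scores[doc_id] = count
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": "query languages and query processing",
    "d2": "indexing and searching",
    "d3": "a query is a single word",
}
ranking = rank_single_word("query", docs)
```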
Phrase Queries
A query is a sequence of words treated as a single unit. Also
called “literal string” or “exact phrase” query.
Phrase is usually surrounded by quotation marks.
All documents that include this phrase are retrieved.
Usually, separators (commas, colons, ...) & common words
(“a”, “the”, “of”, “for”…) in the phrase are ignored.
In effect, this query is for a set of words that must appear in
sequence.
Allows users to specify a context and thus gain precision.
Ex.: “Information Processing for Document Retrieval”.
What are the possible documents retrieved as relevant?
All documents that include the query phrase are retrieved.
On what basis are documents ranked?
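A minimal sketch of phrase matching as described on this slide: separators and common words are dropped, then the remaining words must appear in sequence. The stop-word list and function names are assumptions for illustration; only retrieval is covered, not ranking.

```python
import re

STOP_WORDS = {"a", "the", "of", "for"}  # illustrative list only

def content_words(text):
    """Lowercase words with separators and common words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def contains_phrase(document, phrase):
    """True if the phrase's content words appear consecutively and in order."""
    doc_words = content_words(document)
    phrase_words = content_words(phrase)
    n = len(phrase_words)
    if n == 0:
        return False
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))

phrase = "Information Processing for Document Retrieval"
hit = contains_phrase(
    "Notes on information processing, for document retrieval systems.", phrase)
miss = contains_phrase("document retrieval and information processing", phrase)
```

The second document contains every word of the phrase but not in sequence, so it is not retrieved.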
Multiple-word Queries
A query is a set of words (or phrases).
Ex.: What is the result for the query “Data Mining and Intelligent
Database Design”?
What are the possible documents retrieved as relevant?
Two options: A document is retrieved if it includes:
Any of the query words, or
each of the query words.
Multiple-word Queries
On what basis are documents ranked under the best-match
principle?
Documents are ranked by the number of query words they contain.
A document containing n query words is ranked higher than a
document containing m < n query words.
Documents are ranked in decreasing order:
Those containing all the query words are ranked at the top;
those containing only one query word are ranked at the bottom.
Frequency counts may be used to break ties among documents that
contain the same query words.
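The two retrieval options and the ranking rule can be sketched as follows. This assumes lowercase query words and whitespace-separated documents; `rank_multi_word`, `require_all`, and the sample collection are hypothetical.

```python
def rank_multi_word(query_words, documents, require_all=False):
    """Rank documents by the number of distinct query words they contain,
    breaking ties by the total frequency of the matched words."""
    results = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        matched = [q for q in query_words if q in words]
        if not matched:
            continue  # no query word at all: not retrieved
        if require_all and len(matched) < len(query_words):
            continue  # "each of the query words" option
        frequency = sum(words.count(q) for q in matched)
        results.append((doc_id, len(matched), frequency))
    # Sort by matched-word count, then by frequency, both descending.
    return sorted(results, key=lambda r: (r[1], r[2]), reverse=True)

docs = {
    "d1": "intelligent database design for data mining",
    "d2": "intelligent systems",
    "d3": "data data mining",
    "d4": "mining data",
}
query = ["data", "mining", "intelligent", "database", "design"]
any_word = rank_multi_word(query, docs)
all_words = rank_multi_word(query, docs, require_all=True)
```

Here d3 and d4 both contain two query words; d3 is ranked above d4 because its frequency count is higher.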
Boolean Queries
Queries are formulated based on concepts from logic: AND, OR,
NOT.
A Boolean query describes the information need by relating multiple
words with Boolean operators.
Semantics: For each query word w a corresponding set Dw is
constructed that includes the documents that contain w.
The Boolean expression is then interpreted as an expression on
the corresponding document sets with corresponding set
operators:
AND: Finds only documents containing all of the specified words
or phrases.
OR: Finds documents containing at least one of the specified words
or phrases.
NOT: Excludes documents containing the specified word or
phrase.
Examples: Boolean Queries
1. computer OR server
Finds documents containing either computer, server or both.
2. (computer OR server) NOT mainframe
Select all documents that discuss computers or servers, but do not
select any documents that discuss mainframes.
3. Computer NOT (server OR mainframe)
Select all documents that discuss computers, and do not discuss
either servers or mainframes.
4. Computer OR server NOT mainframe
Select all documents that discuss computers, or documents that
discuss servers but do not discuss mainframes.
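The set semantics behind these four examples can be sketched with Python sets: each word w gets its document set Dw, and AND, OR, NOT become intersection, union, and set difference. The five-document collection is invented for illustration.

```python
documents = {
    1: "computer networks",
    2: "server administration",
    3: "computer and server clusters",
    4: "mainframe and server history",
    5: "mainframe computer",
}

def doc_set(word):
    """D_w: the set of ids of documents that contain the word."""
    return {doc_id for doc_id, text in documents.items() if word in text.split()}

computer = doc_set("computer")    # {1, 3, 5}
server = doc_set("server")        # {2, 3, 4}
mainframe = doc_set("mainframe")  # {4, 5}

q1 = computer | server                # 1. computer OR server
q2 = (computer | server) - mainframe  # 2. (computer OR server) NOT mainframe
q3 = computer - (server | mainframe)  # 3. computer NOT (server OR mainframe)
q4 = computer | (server - mainframe)  # 4. computer OR server NOT mainframe
```

Query 4 follows the reading given above, in which NOT applies to "server" before OR is evaluated, so document 5 (computer and mainframe) is still retrieved.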
Weighted Queries
Each of the words is assigned a different weight, expressing the
relative importance of the word within the query.
A query is then a set of word-weight pairs:
(q1, w1), (q2, w2), …, (qn, wn).
The ranking of a document is the sum of the weights for the
query words that it satisfies.
Example: given the query (A, 0.8), (B, 0.9), (C, 0.3), and
Document 1: (A, B, D) and Document 2: (A, C, D), which
document is ranked first?
Score of Document 1: 0.8+0.9 = 1.7
Score of Document 2: 0.8+0.3 = 1.1
Each document includes two words from the query, but
Document1 is ranked higher because it includes more important
words.
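The scoring rule in this example can be checked with a short sketch. The function name `weighted_score` is hypothetical; scores are rounded only to hide floating-point noise.

```python
def weighted_score(query, document_words):
    """Sum the weights of the query words that the document contains."""
    return sum(weight for word, weight in query if word in document_words)

query = [("A", 0.8), ("B", 0.9), ("C", 0.3)]
doc1 = {"A", "B", "D"}
doc2 = {"A", "C", "D"}

score1 = round(weighted_score(query, doc1), 2)  # A and B match
score2 = round(weighted_score(query, doc2), 2)  # A and C match
```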
Penalizing Documents
When interpreting queries, the Boolean model does not “penalize”
documents with extra (non-requested) keywords.
Some models demote documents that include keywords that were not
requested.
The vector model with the cosine measure
The probabilistic Bayesian network model
Ex.: Assume the vector model with the cosine measure and the
simple case that both documents and queries use binary values.
Consider the following two documents and a query:
d1 = (0,1,0,1,0), d2= (0,1,1,1,0), q= (0,1,0,1,0)
sim(q, d1) = 1.0, sim(q, d2) = 0.82
d2 is demoted because it includes an extra keyword not requested
by q.
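The two similarity values can be reproduced with a small cosine function, written here for plain Python lists under the binary-vector assumption of the example:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

d1 = [0, 1, 0, 1, 0]
d2 = [0, 1, 1, 1, 0]
q = [0, 1, 0, 1, 0]

sim_d1 = round(cosine(q, d1), 2)
sim_d2 = round(cosine(q, d2), 2)
```

For d2 the dot product is still 2, but the extra keyword raises the document norm from √2 to √3, giving 2 / (√2 · √3) ≈ 0.82.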
Pattern Queries
What is a pattern?
An expression that defines a set of objects. A pattern shows the
internal representation of an object.
What is the pattern of a word?
Pattern matching: A word matches a pattern if it is equal to one
of the words defined by the pattern.
In other words,
The semantics are those of disjunction: a pattern P that defines the
words (c1, c2, …, cn) is interpreted as c1 ∨ c2 ∨ … ∨ cn.
Pattern Queries
Similarity pattern: specifies a string and a radius.
It defines all the words whose distance from the string is within the
radius.
Assume the distance between two strings is measured by the
number of one-character changes (insertions, deletions,
replacements) required to transform one string into the other.
The similarity pattern (king, 2) defines kin, kong, knig, kings, cling,
…
Useful to compensate for typing or scanning (OCR) errors.
One of the techniques used for pattern matching is string editing.
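A similarity pattern such as (king, 2) can be evaluated against a vocabulary with a unit-cost edit distance. The sketch below counts each insertion, deletion, and replacement as 1, as assumed above; the function names and the vocabulary are illustrative.

```python
def edit_distance(s, t):
    """Number of one-character insertions, deletions, and replacements
    needed to transform s into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # replace, free if equal
            ))
        previous = current
    return previous[-1]

def match_similarity_pattern(string, radius, vocabulary):
    """Words whose distance from the string is within the radius."""
    return [w for w in vocabulary if edit_distance(string, w) <= radius]

vocabulary = ["kin", "kong", "knig", "kings", "cling", "queen"]
matches = match_similarity_pattern("king", 2, vocabulary)
```

Note that "knig" is at distance 2 under this measure, because a transposition counts as two replacements.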
String Editing
The problem: given two sequences of symbols, X = x1 x2 … xn
and Y = y1 y2 … ym, transform X into Y using a sequence of
three operations: Delete, Insert and Replace, where each
operation incurs a cost.
The objective of string editing is to identify a minimum-cost
sequence of edit operations that will transform X into Y.
Example: consider the sequences:
X = {a a b a b} and Y = {b a b b}
Identify a minimum-cost sequence of edit operations that
transforms X into Y.
Assume a change costs 2 units, a delete 1 unit and an insert 1 unit.
Dynamic programming
The minimum cost of any edit sequence that transforms x1 x2 … xi into y1 y2 … yj (for i>0 and j>0) is the minimum of the three costs: delete, replace, or
insert operations.
The following recurrence equation is used for COST(i,j).
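\[
\mathrm{COST}(i,j) =
\begin{cases}
0 & \text{if } i = 0,\ j = 0 \\
\mathrm{COST}(i-1, 0) + D(x_i) & \text{if } i > 0,\ j = 0 \\
\mathrm{COST}(0, j-1) + I(y_j) & \text{if } i = 0,\ j > 0 \\
\mathrm{COST}'(i,j) & \text{if } i > 0,\ j > 0
\end{cases}
\]
where
\[
\mathrm{COST}'(i,j) = \min \{\, \mathrm{COST}(i-1,j) + D(x_i),\ \mathrm{COST}(i-1,j-1) + C(x_i,y_j),\ \mathrm{COST}(i,j-1) + I(y_j) \,\}
\]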
Example
Transform the sequence X = {a a b a b} into Y = {b a b b}
with a minimum-cost sequence of edit operations, using the dynamic
programming approach. Assume that a change costs 2 units, and a delete
or an insert costs 1 unit each.
COST(i,j) table (rows i = 0..5, columns j = 0..4):
      j   0  1  2  3  4
  i
  0       0  1  2  3  4
  1       1  2  1  2  3
  2       2  3  2  3  4
  3       3  2  3  2  3
  4       4  3  2  3  4
  5       5  4  3  2  3
The value 3 at (5,4) is the optimal solution.
By tracing back, one can determine which operations lead to the
optimal solution:
Delete x1, Delete x2 and Insert y4, or
Change x1 to y1 and Delete x4.
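The table and one optimal edit sequence can be reproduced with the sketch below. It assumes the costs of the example (delete 1, insert 1, change 2, and 0 when the symbols are equal); `edit_cost_table` and `trace_back` are hypothetical names. The trace-back prefers the diagonal move, so it returns only one of the possible optimal sequences.

```python
def edit_cost_table(x, y, delete=1, insert=1, change=2):
    """Fill COST(i, j) for transforming x[:i] into y[:j]."""
    n, m = len(x), len(y)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + delete
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if x[i - 1] == y[j - 1] else change
            cost[i][j] = min(
                cost[i - 1][j] + delete,   # delete x_i
                cost[i - 1][j - 1] + c,    # change x_i to y_j (free if equal)
                cost[i][j - 1] + insert,   # insert y_j
            )
    return cost

def trace_back(x, y, cost, delete=1, insert=1, change=2):
    """Recover one minimum-cost edit sequence from the filled table."""
    ops = []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c = 0 if x[i - 1] == y[j - 1] else change
            if cost[i][j] == cost[i - 1][j - 1] + c:
                if c:
                    ops.append(f"Change x{i} to y{j}")
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + delete:
            ops.append(f"Delete x{i}")
            i -= 1
        else:
            ops.append(f"Insert y{j}")
            j -= 1
    ops.reverse()
    return ops

table = edit_cost_table("aabab", "babb")
operations = trace_back("aabab", "babb", table)
```

For this input the sketch recovers "Change x1 to y1" followed by "Delete x4", the second of the two optimal sequences listed above.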
Natural language
Using natural language for querying is very attractive.
Example: “Find all the documents that discuss
campaign finance reforms, including documents that discuss
violations of campaign financing regulations.
Do not include documents that discuss campaign contributions
by the gun and the tobacco industries.”
Natural language queries are converted to a formal language for
processing against a set of documents.
Such translation requires intelligence and is still a challenge.
Natural language
Pseudo NL processing: System scans the text and extracts
recognized terms and Boolean connectors.
The grammaticality of the text is not important.
Often used by search engines.
Problem: Recognizing the negation in the search statement
(“Do not include...”).
Compromise: Users enter natural language clauses connected
with Boolean operators.
In the above example: “campaign finance reforms” or
“violations of campaign financing regulations” and not
“campaign contributions by the gun and the tobacco
industries”.
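The compromise form can be split into clauses and connectors with a small scanner. This is only a sketch: it assumes clauses are enclosed in straight double quotes and that the connectors are "or", "and", "not", and "and not". The function name and token labels are hypothetical.

```python
import re

def parse_clauses(query):
    """Extract quoted clauses and the Boolean connectors between them.
    Any other text is ignored, since grammaticality is not important."""
    tokens = []
    pattern = r'"([^"]+)"|\b(and not|or|and|not)\b'
    for clause, connector in re.findall(pattern, query, flags=re.IGNORECASE):
        if clause:
            tokens.append(("CLAUSE", clause))
        else:
            tokens.append(("OP", connector.upper()))
    return tokens

query = ('"campaign finance reforms" or '
         '"violations of campaign financing regulations" and not '
         '"campaign contributions by the gun and the tobacco industries"')
tokens = parse_clauses(query)
```

Because a quoted clause is consumed as a whole, the "and" inside the last clause is not mistaken for a connector.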
Question & Answer
04/25/24