Query Languages
Berlin Chen 2005
Reference:
1. Modern Information Retrieval, chapter 4
The Kinds of Queries
• Data retrieval
– Pattern-based querying
– Retrieve docs that contain (or exactly match) the objects that
satisfy the conditions clearly specified in the query
– A single erroneous object implies failure!
• Information retrieval
– Keyword-based querying
– Retrieve relevant docs in response to the query
(the formulation of a user information need)
– Allow the answer to be ranked
IR – Berlin Chen 2
The Kinds of Queries
• On-line databases or CD-ROM archives
– High level software packages should be viewed as query
languages
– Named “protocols”
Different query languages are formulated and then
used in different situations, by considering
- The underlying retrieval models (ranking algorithms)
- The content (semantics) and structure (syntax) of the text
Models: Boolean, vector-space, HMM ….
Formulations/word-treating machineries: stop-word list,
stemming, query-expansion, ….
The Retrieval Units
• The retrieval unit: the basic element which can be
retrieved as an answer to a query
– A set of such basic elements with ranking information
• The retrieval unit can be a file, a doc, a Web page, a
paragraph, a passage, or some other structural units
• Simply referred to as “docs”
[Figure: kinds of retrieval units and kinds of queries]
Keyword-based Querying
• Keywords
– Words that can be used for retrieval by a query
– A small set of words extracted from the docs
• Preprocessing is needed
• Characteristics of keyword-based queries
– A query is composed of keywords, and the docs containing such
keywords are searched for
– Intuitive, easy to express, and allowing for fast ranking
– A query can be a single keyword, multiple keywords (basic
queries), or a more complex combination of operations involving
several keywords
• AND, OR, BUT, …
Keyword-based Querying (cont.)
• Single-word queries
– Query: The elementary query is a word
– Docs: The docs are long sequences of words
– What is a word in English ?
• A word is a sequence of letters surrounded by separators
• Some characters are not letters but do not split a word, e.g.
the hyphen in ‘on-line’
• Words possess semantic/conceptual information
Keyword-based Querying (cont.)
• Single-word queries (cont.)
– The use of word statistics for IR ranking (similarity between
a query and a doc)
• Word occurrences inside texts
– Term frequency (tf): number of times a word occurs in a doc
– Inverse document frequency (idf): decreases with the number
of docs in which a word appears
– Word positions in the docs (see next slide)
• May be required, e.g., an interface that highlights each
occurrence of a specific word
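The word statistics listed above can be sketched as follows. This is a minimal illustration, not the formulation used in the reference text: the toy collection is hypothetical, and log(N/df) is assumed as one common way to turn the document count into an inverse document frequency.

```python
import math
from collections import Counter

# Hypothetical toy collection; the texts are illustrative only.
docs = {
    "d1": "information retrieval ranks docs by relevance",
    "d2": "data retrieval needs exact matches",
    "d3": "ranking uses word statistics",
}

def term_frequencies(text):
    """tf: how many times each word occurs in one doc."""
    return Counter(text.split())

def inverse_document_frequency(word, collection):
    """Assumed idf = log(N / df); rarer words get larger values."""
    df = sum(1 for text in collection.values() if word in text.split())
    return math.log(len(collection) / df) if df else 0.0

tf = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
idf_retrieval = inverse_document_frequency("retrieval", docs)  # in 2 of 3 docs
idf_ranking = inverse_document_frequency("ranking", docs)      # in 1 of 3 docs
```

A word that appears in fewer docs (here "ranking") gets a larger idf than a more widespread one (here "retrieval").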
Keyword-based Querying (cont.)
[Figure: word positions in the docs]
Keyword-based Querying (cont.)
• Context queries
– Complement single-word queries with ability to search words
in a given context, i.e., near other words
– Words appearing near each other may signal a higher
likelihood of relevance than if they appear apart
– E.g., phrases, or words that are close to each other in the text
Keyword-based Querying (cont.)
• Context queries (cont.)
– Two types of queries
• Phrase
– A sequence of single-word queries
Q: “enhance” and “retrieval”
D: “…enhance the retrieval….”
– Features:
1. Separators in the text or query may not be the same
2. Uninteresting words are not considered
– Not all systems implement it!
• Proximity
– A relaxed version of the phrase query
– A sequence of single words (or phrases) is given
together with a maximum allowed distance between
them
– E.g., two keywords occur within four words
Q: “enhance” and “retrieval”
D: “…enhance the power of retrieval…”
– Features:
1. May not consider word ordering
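The two context queries above can be sketched over word positions. This is an illustrative sketch under stated assumptions: the stop-word list is hypothetical, distance is measured as the difference between word positions, and the function names are not from the source.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of"}  # assumed list of "uninteresting" words

def positions(text, drop_stop_words):
    """Map each word to the list of its positions in the text."""
    words = text.lower().split()
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index

def phrase_match(text, first, second):
    """True if `second` directly follows `first` once stop words are dropped."""
    index = positions(text, drop_stop_words=True)
    return any(p + 1 in index[second] for p in index[first])

def proximity_match(text, first, second, max_distance, ordered=False):
    """True if the two words occur within `max_distance` positions."""
    index = positions(text, drop_stop_words=False)
    for p in index[first]:
        for q in index[second]:
            gap = q - p if ordered else abs(q - p)
            if 0 < gap <= max_distance:
                return True
    return False
```

With these definitions, "enhance the retrieval" satisfies the phrase query, while "enhance the power of retrieval" only satisfies the proximity query with a maximum distance of four.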
Keyword-based Querying (cont.)
• Context queries (cont.)
– Ranking
• Phrases: analogous to single words
• Proximity queries: the same way if physical proximity is not
used as a parameter in ranking
– Just as a hard-limiter
– But physical proximity has semantic value !
How to do better ranking ?
Keyword-based Querying (cont.)
• Boolean Queries
– Have a syntax composed of atoms (basic queries) that
retrieve docs, and of Boolean operators which work on their
operands (sets of docs)
[Figure: a query syntax tree for translation AND (syntax OR syntactic).
Leaves: basic queries. Internal nodes: operators.]
Keyword-based Querying (cont.)
• Boolean Queries (cont.)
– Commonly used operators (e1 and e2 are basic queries)
• OR, e.g. (e1 OR e2)
– Select all docs which satisfy e1 or e2. Duplicates are
eliminated
• AND, e.g. (e1 AND e2)
– Select all docs which satisfy both e1 and e2
• BUT, e.g. (e1 BUT e2)
– Select all docs which satisfy e1 but not e2
– Can use the inverted file to filter out undesired docs
Example:
e1: d3, d7, d10
e2: d4, d7, d8
e1 OR e2: d3, d4, d7, d8, d10
e1 AND e2: d7
e1 BUT e2: d3, d10
No partial matching between a doc and a query
No ranking of retrieved docs is provided!
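The three operators map directly onto set operations over the answer sets of the basic queries. The sketch below uses the doc sets from the example on this slide; the tuple encoding of the syntax tree and the name `evaluate` are assumptions made for illustration.

```python
# Answer sets of the two basic queries, as in the slide's example.
INDEX = {
    "e1": {"d3", "d7", "d10"},
    "e2": {"d4", "d7", "d8"},
}

def evaluate(node, index):
    """Evaluate a query syntax tree.

    A leaf is the name of a basic query; an internal node is a tuple
    (operator, left_subtree, right_subtree).
    """
    if isinstance(node, str):
        return index.get(node, set())
    operator, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if operator == "OR":
        return left_docs | right_docs   # union, duplicates vanish in a set
    if operator == "AND":
        return left_docs & right_docs   # intersection
    if operator == "BUT":
        return left_docs - right_docs   # in left but not in right
    raise ValueError(f"unknown operator: {operator}")
```

The result is an unordered set: a doc either satisfies the query or it does not, which is why no ranking comes out of this model.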
Keyword-based Querying (cont.)
• Boolean Queries (cont.)
– A relaxed version: a “fuzzy Boolean” set of operators
• The meaning of AND and OR can be relaxed
– all : the AND operator
– one: the OR operator (at least one)
– some: retrieve elements appearing in more
operands than the OR requires (but not necessarily all)
• Docs are ranked higher when having a larger number of
elements in common with the query
– Naïve users have trouble with Boolean Queries
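A relaxed operator of this kind can be sketched by counting how many query elements each doc shares with the query. This is only one possible reading: the `min_matches` parameter and the toy docs are hypothetical, chosen so that the same function covers "all", "one", and "some".

```python
def fuzzy_rank(query_terms, docs, min_matches=1):
    """Rank docs by the number of query terms they contain.

    min_matches == len(query_terms) behaves like "all" (AND),
    min_matches == 1 behaves like "one" (OR),
    values in between behave like "some".
    """
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matches = sum(1 for term in query_terms if term in words)
        if matches >= min_matches:
            scored.append((doc_id, matches))
    # More elements in common with the query means a higher rank.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical toy collection.
toy_docs = {
    "d1": "query languages for text retrieval",
    "d2": "text retrieval",
    "d3": "pattern matching",
}
toy_query = ["text", "retrieval", "query"]
```

Unlike the strict Boolean operators, the answer here is an ordered list, so partial matches are kept and ranked below fuller ones.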
Keyword-based Querying (cont.)
• Natural language
– Push the fuzzy Boolean model even further
• The distinction between AND and OR is completely blurred
– A query can be an enumeration of words and/or context queries
– Typically, a query is treated as a bag of words (ignoring the
context) for the vector space model
• Term-weighting, relevance feedback, etc.
– All the documents matching a portion of the user query are
retrieved
• Docs matching more parts of the query are assigned a higher
ranking
– Negation can also be handled by penalizing the ranking score
• E.g. some words are not desired
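The bag-of-words treatment with a penalty for undesired words can be sketched as below. Raw word counts stand in for proper term weights, and the penalty value is arbitrary; both are assumptions for illustration rather than anything prescribed by the source.

```python
from collections import Counter

def score(doc_text, wanted, unwanted=(), penalty=2.0):
    """Score one doc against a bag-of-words query.

    Each occurrence of a wanted word adds 1 to the score; each
    occurrence of an unwanted word subtracts `penalty` (assumed value).
    """
    bag = Counter(doc_text.lower().split())
    reward = sum(bag[word] for word in wanted)
    cost = penalty * sum(bag[word] for word in unwanted)
    return reward - cost
```

A doc that matches only part of the query still receives a positive score, so it is retrieved but ranked lower; a doc containing undesired words is pushed down rather than excluded outright.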
Pattern Matching
• Pattern matching: allow the retrieval of docs based on
some patterns
– A pattern is a set of syntactic features that must occur in a text
segment
• Segments satisfying the pattern specifications are said to
“match the pattern”
• E.g. the prefix of a word
– A kind of data retrieval
• Pattern matching (data retrieval) can be viewed as an
enhanced tool for information retrieval
– Require more sophisticated data structures and algorithms to
retrieve efficiently
Pattern Matching (cont.)
• Types of patterns
– Words: most basic patterns
– Prefixes: a string from the beginning of a text word
• E.g. ‘comput’: ‘computer’, ‘computation’,…
– Suffixes: a string from the termination of a text word
• E.g. ‘ters’: ‘computers’, ‘testers’, ‘painters’,…
– Substrings: A string within a text word
• E.g. ‘tal’: ‘coastal’, ‘talk’, ‘metallic’, …
– Ranges: a pair of strings matching any words lying between them
in lexicographic order
• E.g. between ‘held’ and ‘hold’: ‘hoax’ and ‘hissing’,…
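The pattern types above can be sketched as simple filters over a vocabulary. The word list reuses the slide's examples; the function names are hypothetical, and the range is assumed to include both endpoints. A real system would use an index rather than a linear scan.

```python
# Vocabulary built from the examples on this slide, in lexicographic order.
vocabulary = [
    "coastal", "computation", "computer", "computers", "held", "hissing",
    "hoax", "hold", "metallic", "painters", "talk", "testers",
]

def by_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def by_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def by_substring(words, substring):
    return [w for w in words if substring in w]

def by_range(words, low, high):
    """Words between `low` and `high` in lexicographic order (inclusive)."""
    return [w for w in words if low <= w <= high]
```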
Pattern Matching (cont.)
– Allowing errors: a word together with an error threshold
• Useful when the query or doc contains typos or misspellings
• Retrieve all text words which are ‘similar’ to the given word
• edit (or Levenshtein) distance: the minimum number of
character insertions, deletions, and replacements needed
to make two strings equal
– E.g. ‘flower’ and ‘flo wer’ are at edit distance 1 (one inserted space)
• maximum allowed edit distance: query specifies the
maximum number of allowed errors for a word to match the
pattern
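Matching with an error threshold can be sketched with the textbook dynamic-programming computation of the Levenshtein distance. The helper `similar_words` and its word list are hypothetical; scanning the whole vocabulary is a simplification.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements
    needed to turn string `a` into string `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # delete char_a
                current[j - 1] + 1,                  # insert char_b
                previous[j - 1] + (char_a != char_b),  # replace or match
            ))
        previous = current
    return previous[-1]

def similar_words(word, vocabulary, max_errors):
    """All vocabulary words within `max_errors` edits of `word`."""
    return [w for w in vocabulary if edit_distance(word, w) <= max_errors]
```

For example, `edit_distance("flower", "flo wer")` is 1, so the two strings match under a maximum allowed edit distance of one.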
Pattern Matching (cont.)
• String Alignment: Using Dynamic Programming
[Figure: dynamic-programming grid. Horizontal axis: doc string (test),
positions 0 to n. Vertical axis: query string (reference), positions 0 to m.
Cell (i,j) is reached from (i-1,j) by an insertion, from (i,j-1) by a
deletion, or diagonally from (i-1,j-1). The path ends at (n,m).]
Pattern Matching (cont.)
• String Alignment: Using Dynamic Programming
Note: the penalties for substitution, deletion, and insertion errors
are all set to 1 here
LT: test (doc) string of length n; LR: reference (query) string of length m
Step 1: Initialization:
  G[0][0] = 0;
  for i = 1,..., n { //test
    G[i][0] = G[i - 1][0] + 1;
    B[i][0] = 1; //Insertion (Horizontal Direction)
  }
  for j = 1,..., m { //reference
    G[0][j] = G[0][j - 1] + 1;
    B[0][j] = 2; //Deletion (Vertical Direction)
  }
Step 2: Iteration:
  for i = 1,..., n { //test
    for j = 1,..., m { //reference
      G[i][j] = min of
        G[i - 1][j] + 1      (Insertion)
        G[i][j - 1] + 1      (Deletion)
        G[i - 1][j - 1] + 1  (if LR[j] != LT[i], Substitution)
        G[i - 1][j - 1]      (if LR[j] = LT[i], Match)
      B[i][j] = the case that gave the minimum:
        1; //Insertion (Horizontal Direction)
        2; //Deletion (Vertical Direction)
        3; //Substitution (Diagonal Direction)
        4; //Match (Diagonal Direction)
    } //for j, reference
  } //for i, test
Step 3: Measure and Backtrace:
  String Error Rate = 100% × G[n][m] / m
  String Accuracy Rate = 100% − String Error Rate
  Optimal backtrace path = (B[n][m] → ..... → B[0][0])
    if B[i][j] = 1 print "LT[i]"; //Insertion, then go left
    else if B[i][j] = 2 print "LR[j]"; //Deletion, then go down
    else print "LR[j] LT[i]"; //Hit/Match or Substitution, then go down diagonally
Pattern Matching (cont.)
• String Alignment: Using Dynamic Programming
Note: the penalties for substitution, deletion, and insertion errors
are all set to 1 here
[Figure: alignment grid for the example below, with the test string on the
i axis and the correct string on the j axis. Each cell is labeled with its
(Ins,Del,Sub,Hit) counts. The figure marks three alignments (Alignment 1,
2, 3), each with WER = 80%.]
Correct: A C B C C
Test: B A A C
Operation sequences recoverable from the figure:
Ins B, Hit A, Del C, Sub B, Hit C, Del C (WER = 80%)
Ins B, Hit A, Sub C, Del B, Hit C, Del C (WER = 80%)
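The alignment procedure can be sketched in Python and run on the example strings (Correct: A C B C C, Test: B A A C). The function name `align` and the tie-breaking order (diagonal first, then insertion, then deletion) are assumptions; a different tie-breaking rule may return a different alignment with the same cost.

```python
def align(reference, test):
    """Return (distance, operations) aligning `test` to `reference`.

    All penalties are 1. G holds accumulated costs and B holds
    backpointers: 1 insertion, 2 deletion, 3 substitution, 4 match.
    """
    m, n = len(reference), len(test)
    G = [[0] * (m + 1) for _ in range(n + 1)]
    B = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # first column: insertions only
        G[i][0], B[i][0] = i, 1
    for j in range(1, m + 1):           # first row: deletions only
        G[0][j], B[0][j] = j, 2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = reference[j - 1] == test[i - 1]
            candidates = [
                (G[i - 1][j - 1] + (0 if hit else 1), 4 if hit else 3),
                (G[i - 1][j] + 1, 1),
                (G[i][j - 1] + 1, 2),
            ]
            # min keeps the first candidate on ties (assumed tie-breaking).
            G[i][j], B[i][j] = min(candidates, key=lambda c: c[0])
    operations = []
    i, j = n, m
    while i > 0 or j > 0:               # backtrace from (n, m) to (0, 0)
        move = B[i][j]
        if move == 1:
            operations.append(("Ins", test[i - 1]))
            i -= 1
        elif move == 2:
            operations.append(("Del", reference[j - 1]))
            j -= 1
        else:
            operations.append(("Hit" if move == 4 else "Sub", reference[j - 1]))
            i, j = i - 1, j - 1
    operations.reverse()
    return G[n][m], operations

distance, operations = align("ACBCC", "BAAC")
error_rate = 100.0 * distance / len("ACBCC")
```

Here `distance` is 4 and `error_rate` is 80.0, in agreement with the WER reported for the example.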
Pattern Matching (cont.)
– Regular Expressions
• General patterns are built up from simple strings and several
operations
• union: if e1 and e2 are regular expressions, then (e1 | e2) matches
what e1 or e2 matches
• concatenation: if e1 and e2 are regular expressions, the
occurrences of (e1 e2) are formed by the occurrences of e1
immediately followed by those of e2
• repetition (Kleene closure): if e is a regular expression, then (e*)
matches a sequence of zero or more contiguous occurrences of e
• Example:
– ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ matches words
‘problem2’, ‘proteins’, etc.
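The example pattern can be tried with Python's `re` module. The spaces in the slide's notation are assumed to be for readability only, and ε is written as an empty alternative; `matches` is a hypothetical helper that tests whole words.

```python
import re

# 'pro (blem | tein) (s | ε) (0 | 1 | 2)*' in Python's regex syntax.
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

def matches(word):
    """True if the whole word matches the pattern."""
    return PATTERN.fullmatch(word) is not None
```

Union appears as `|`, concatenation as juxtaposition, and repetition as `*`, mirroring the three operations listed above.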
Pattern Matching (cont.)
– Extended Patterns
• Subsets of the regular expressions expressed with a simpler
syntax
• System can convert extended patterns into regular expressions,
or search them with specific algorithms
• E.g.: classes of characters:
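The conversion of a simpler pattern syntax into a regular expression can be illustrated with shell-style wildcards, which Python's standard `fnmatch.translate` turns into a regex. This syntax is used here only as a stand-in example of an extended-pattern language with character classes; it is not the syntax described in the source, and the helper names are hypothetical.

```python
import fnmatch
import re

def compile_extended(pattern):
    """Convert a shell-style pattern into a compiled regular expression.

    Supported by fnmatch: `*` (any string), `?` (any single character),
    and `[...]` (a class of characters).
    """
    return re.compile(fnmatch.translate(pattern))

def search_words(pattern, words):
    """Return the words that match the extended pattern as a whole."""
    regex = compile_extended(pattern)
    return [w for w in words if regex.match(w)]
```

For instance, `search_words("[cC]omput*", ["computer", "Computation", "compact"])` keeps the first two words.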
Structural Queries
• Docs are allowed to be queried with respect to both their
text content and structural constraints
– Text content: words, phrases, or patterns
– Structural constraints: containment, proximity, or other
restrictions on the structural elements (e.g., chapters, sections,
etc.)
• Standardization of languages used to represent structured
text, e.g., HTML…
Mixing contents and structures in queries
– Built on top of basic queries
[Figure: a query on text content goes through the text retrieval model to
give a set of retrieved documents; a structural query then applies
structural constraints to give the final set of retrieved documents. The
figure also carries the label “Boolean model”.]
Structural Queries (cont.)
• Three main (text) structures discussed here
– Form-like fixed structure (simple)
– Hierarchical structure
– Hypertext structure (complex)
What structure may a text have?
What can be queried about that
structure? (the query model)
How to rank docs?
Form-like Fixed Structure (cont.)
• Docs have a fixed set of fields, much like a filled form
– Each field has some text inside
– Some fields are not present in all docs
– Text has to be classified into a field
– Fields are not allowed to nest or overlap
– A given pattern can only be associated with a specified field
(so the text hierarchy cannot be represented)
– E.g., a mail archive (sender, receiver, date, subject, body ..)
• Search for the mail sent to a given person with “football” in
the subject field
• Compared with the relational database systems
– Different fields with different data types (more rigid!)
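The mail-archive query can be sketched with one dict per doc and one key per field. The messages, addresses, and the `search` helper are hypothetical; a condition is assumed to mean "the named field contains this word".

```python
# Hypothetical mail archive; the third message has no body field.
mails = [
    {"sender": "bob@example.org", "receiver": "alice@example.org",
     "subject": "football on sunday", "body": "see you at the pitch"},
    {"sender": "carol@example.org", "receiver": "alice@example.org",
     "subject": "budget", "body": "football tickets are expensive"},
    {"sender": "bob@example.org", "receiver": "dave@example.org",
     "subject": "football scores"},
]

def search(archive, **conditions):
    """Return the mails where every named field contains the given word.

    A mail that lacks one of the named fields does not match.
    """
    results = []
    for mail in archive:
        if all(word.lower() in mail.get(field, "").lower().split()
               for field, word in conditions.items()):
            results.append(mail)
    return results

to_alice_about_football = search(
    mails, receiver="alice@example.org", subject="football")
```

Each pattern is tied to one field, so the word "football" in the body of the second message does not satisfy a condition on the subject.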
Hypertext Structure (cont.)
• A hypertext is a directed graph where
– Nodes hold some text (content)
– The links represent connections (structural connectivity)
between nodes or between positions inside the nodes
• Retrieval from a hypertext began as a merely
navigational activity
– Manually traverse the hypertext nodes following links to search
for what one wanted
– It’s still difficult to query the hypertext based on its structure
• An interesting proposal to combine browsing and
searching on the web: WebGlimpse
– Allow classical navigation plus the ability to search by content in
the neighborhood of the current node
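Searching by content in the neighborhood of the current node can be sketched as a hop-limited breadth-first traversal followed by a word test. This is not WebGlimpse's implementation: the graph, the node texts, and the reading of "neighborhood" as "reachable within a few links" are all assumptions.

```python
from collections import deque

# Hypothetical hypertext: node texts and directed links.
nodes = {
    "A": "query languages overview",
    "B": "pattern matching with regular expressions",
    "C": "structural queries on hypertext",
    "D": "more on pattern matching",
}
links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}

def neighborhood(start, links, max_hops):
    """Nodes reachable from `start` in at most `max_hops` links,
    mapped to their hop distance."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue                    # do not expand beyond the limit
        for target in links.get(node, []):
            if target not in hops:
                hops[target] = hops[node] + 1
                queue.append(target)
    return hops

def search_nearby(start, word, nodes, links, max_hops):
    """Nodes near `start` whose text contains `word`."""
    nearby = neighborhood(start, links, max_hops)
    return sorted(n for n in nearby if word in nodes[n].split())
```

Widening `max_hops` trades navigation for search: with one hop only node B is found for "pattern", with three hops node D is found as well.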
Hierarchical Structure (cont.)
• An intermediate structuring model which lies between
form-like fixed structure and hypertext structure
• Represent a recursive decomposition of the text and is a
natural model for many text collections
– E.g., books, articles, legal documents,…
[Figure: a parsed query used to retrieve the figure]
Issues of Hierarchical Structure
• Static or dynamic structure
– Static: one or more explicit hierarchies can be queried, e.g., by
ancestry
– Dynamic: not really a hierarchy, the required elements are built
on the fly
• Implemented over a normal text index
• Restrictions on the structure
– The text or the answers may have restrictions about nesting
and/or overlapping for efficiency reasons
– In other cases, the query language is restricted to avoid
restricting the structure
The more powerful the model, the less efficiently it can be implemented
Issues of Hierarchical Structure (cont.)
• Integration with text
– Effective integration of queries on text content with queries on
text structure
– From the perspectives of classical IR models
and structural models, respectively
• Classical model: primary -> text, secondary -> structure
• Structural model: primary -> structure, secondary -> text
• Query language
– Some features for queries on structure include selection of
areas that
• Contain (or not) other areas
• Are contained (or not) in other areas
• Follow (or are followed by) other areas
• Are close to other areas
– Also including set manipulation
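The structural features above can be sketched by treating each area as an interval of text positions. The `Area` type, the half-open interval convention, and the example areas are assumptions made for illustration, not a model taken from the source.

```python
from typing import NamedTuple

class Area(NamedTuple):
    name: str
    start: int  # first text position of the area
    end: int    # one past the last position (half-open interval)

def contains(outer, inner):
    """True if `inner` lies within `outer` and is a different area."""
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end

def follows(area, other):
    """True if `area` starts at or after the end of `other`."""
    return area.start >= other.end

def close_to(area, other, max_gap):
    """True if the gap between the two areas is at most `max_gap`."""
    gap = max(area.start - other.end, other.start - area.end, 0)
    return gap <= max_gap

# Hypothetical document structure.
chapter = Area("chapter", 0, 100)
section1 = Area("section 1", 0, 40)
section2 = Area("section 2", 40, 100)
figure = Area("figure", 50, 60)
sections = [section1, section2]

# Selection: sections that contain a figure.
with_figure = [s for s in sections if contains(s, figure)]
# Set manipulation: sections that contain a figure AND follow section 1.
after_first = {s for s in sections if follows(s, section1)}
combined = set(with_figure) & after_first
```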
Query Protocols
• The query languages used automatically by software
applications to query text databases
– Standards for querying CD-ROMs
– Or, intermediate languages to query library systems
• Important query protocols
– Z39.50
• For bibliographical information systems
• Protocols for not only the query language but also the client-
server connection
– WAIS (Wide Area Information Service)
• A network publishing protocol
• For querying databases through the Internet
Query Protocols (cont.)
• CD-ROM publishing protocols
– Provide “disk interchangeability”: flexibility in data
communication between primary information providers and end
users
– Some example protocols
• CCL (Common Command Language)
• CD-RDx (Compact Disk Read only Data exchange)
• SFQL (Structured Full-text Query Language)
Trends and Research Issues
• Types of queries and how they are structured