
Types of Query Languages Explained

The document discusses various types of query languages used in data and information retrieval, emphasizing the differences between pattern-based and keyword-based querying. It covers the characteristics of keyword queries, including single-word, context, and Boolean queries, as well as the importance of ranking and retrieval units. Additionally, it explores pattern matching techniques and regular expressions for enhanced data retrieval capabilities.


Query Languages

Berlin Chen 2005

Reference:
1. Modern Information Retrieval, chapter 4
The Kinds of Queries
• Data retrieval
– Pattern-based querying

– Retrieve docs that contain (or exactly match) the objects that
satisfy the conditions clearly specified in the query

– A single erroneous object implies failure!

• Information retrieval
– Keyword-based querying
– Retrieve relevant docs in response to the query
(the formulation of a user information need)

– Allow the answer to be ranked

IR – Berlin Chen 2
The Kinds of Queries

• On-line databases or CD-ROM archives


– High level software packages should be viewed as query
languages
– Named “protocols”

Different query languages are formulated and then
used in different situations, by considering
- The underlying retrieval models (ranking algorithms)
- The content (semantics) and structure (syntax) of the text

Models: Boolean, vector-space, HMM ….


Formulations/word-treating machineries: stop-word list,
stemming, query-expansion, ….
The Retrieval Units

• The retrieval unit: the basic element which can be


retrieved as an answer to a query
– A set of such basic elements with ranking information

• The retrieval unit can be a file, a doc, a Web page, a


paragraph, a passage, or some other structural units

• Simply referred to as “docs”

[Figure: kinds of retrieval units versus kinds of queries]


Keyword-based Querying

• Keywords
– Words that can be used for retrieval by a query
– A small set of words extracted from the docs
• Preprocessing is needed

• Characteristics of keyword-based queries


– A query is composed of keywords, and the docs containing such
keywords are searched for
– Intuitive, easy to express, and allowing for fast ranking
– A query can be a single keyword, multiple keywords (basic
queries), or a more complex combination of operations involving
several keywords
• AND, OR, BUT, …

Keyword-based Querying (cont.)

• Single-word queries
– Query: The elementary query is a word

– Docs: The docs are long sequences of words

– What is a word in English ?


• A word is a sequence of letters surrounded by separators
• Some characters are not letters but do not split a word, e.g.
the hyphen in ‘on-line’
• Words possess semantic/conceptual information

Keyword-based Querying (cont.)
• Single-word queries (cont.)
– The use of word statistics for IR ranking (similarity between a query and a doc)
• Word occurrences inside texts
– Term frequency (tf): number of times a word appears in a doc
– Inverse document frequency (idf): inversely related to the number of docs in
which a word appears

– Word positions in the docs (see next slide)
• May be required, e.g., an interface that highlights each
occurrence of a specific word
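
The word statistics above can be sketched as follows. This is a minimal illustration rather than the slides' own formulation: the toy collection `docs`, whitespace tokenization, and the `log(N / df)` form of idf are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical three-document collection used only for illustration.
docs = {
    "d1": "query languages for text retrieval",
    "d2": "retrieval of text by keyword query",
    "d3": "pattern matching in text",
}

def term_frequency(word, text):
    """Number of times `word` occurs in the whitespace-tokenized text."""
    return Counter(text.split())[word]

def document_frequency(word, collection):
    """Number of docs in which `word` appears at least once."""
    return sum(1 for text in collection.values() if word in text.split())

def inverse_document_frequency(word, collection):
    """log(N / df), one common idf variant; 0.0 for a word never seen."""
    df = document_frequency(word, collection)
    return math.log(len(collection) / df) if df else 0.0
```

A word that occurs in every doc (here `text`) gets an idf of 0, while a word confined to one doc (here `pattern`) gets the largest value, which is why idf rewards discriminating words.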

Keyword-based Querying (cont.)

Keyword-based Querying (cont.)
• Context queries
– Complement single-word queries with ability to search words
in a given context, i.e., near other words

– Words appearing near each other may signal a higher


likelihood of relevance than if they appear apart

– E.g., phrases, or words that are proximal in the text

Keyword-based Querying (cont.)
• Context queries (cont.)
– Two types of queries
• Phrase
– A sequence of single-word queries
Q: “enhance” and “retrieval”
D: “…enhance the retrieval….”
– Features:
1. Separators in the text or query may not be the same
2. Uninteresting words are not considered
– Not all systems implement it!
• Proximity
– A relaxed version of the phrase query
– A sequence of single words (or phrases) is given
together with a maximum allowed distance between
them
– E.g., two keywords occur within four words
Q: “enhance” and “retrieval”
D: “…enhance the power of retrieval…”
– Features:
1. May not consider word ordering
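
The two context-query types can be sketched over a tokenized doc as below. The function names, the stop-word handling, and the reading of "within four words" as a position difference of at most four are assumptions for this example, not definitions taken from the slides.

```python
def positions(word, tokens):
    """All indices at which `word` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]

def phrase_match(phrase, tokens, stopwords=frozenset()):
    """True if the phrase words occur consecutively once stopwords are dropped."""
    words = [w for w in phrase if w not in stopwords]
    kept = [t for t in tokens if t not in stopwords]
    n = len(words)
    return any(kept[i:i + n] == words for i in range(len(kept) - n + 1))

def proximity_match(w1, w2, tokens, max_distance, ordered=False):
    """True if w1 and w2 occur at most `max_distance` positions apart."""
    for i in positions(w1, tokens):
        for j in positions(w2, tokens):
            gap = j - i if ordered else abs(j - i)
            if 0 < gap <= max_distance:
                return True
    return False

doc_phrase = "enhance the retrieval".split()
doc_near = "enhance the power of retrieval".split()
```

With `stopwords={"the"}` the phrase query matches `doc_phrase`, mirroring the "uninteresting words are not considered" feature; `proximity_match("enhance", "retrieval", doc_near, 4)` succeeds while a limit of 3 does not.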
Keyword-based Querying (cont.)

• Context queries (cont.)


– Ranking
• Phrases: analogous to single words

• Proximity queries: ranked the same way if physical proximity is not
used as a parameter in ranking
– Proximity then acts just as a hard limiter
– But physical proximity has semantic value!

How to do better ranking?

Keyword-based Querying (cont.)

• Boolean Queries
– Have a syntax composed of atoms (basic queries) that
retrieve docs, and of Boolean operators which work on their
operands (sets of docs)
[Figure: a query syntax tree for translation AND (syntax OR syntactic). Leaves: basic queries; internal nodes: operators.]

Keyword-based Querying (cont.)
• Boolean Queries (cont.)
– Commonly used operators
• OR, e.g. (e1 OR e2), where e1 and e2 are basic queries
– Select all docs which satisfy e1 or e2. Duplicates are
eliminated
• AND, e.g. (e1 AND e2)
– Select all docs which satisfy both e1 and e2
• BUT, e.g. (e1 BUT e2)
– Select all docs which satisfy e1 but not e2
– Can use the inverted file to filter out undesired docs

Example: e1 = {d3, d7, d10}, e2 = {d4, d7, d8}
e1 OR e2 = {d3, d4, d7, d8, d10}
e1 AND e2 = {d7}
e1 BUT e2 = {d3, d10}

No partial matching between a doc and a query
No ranking of retrieved docs is provided!
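
A Boolean query tree can be evaluated with plain set operations, as in the sketch below. The nested-tuple representation of the tree and the `index` dictionary (a stand-in for an inverted file) are assumptions for illustration; the doc sets reuse the e1 and e2 example from this slide.

```python
# Hypothetical inverted-file stand-in: basic query -> set of matching docs.
index = {
    "e1": {"d3", "d7", "d10"},
    "e2": {"d4", "d7", "d8"},
}

def evaluate(node, index):
    """Evaluate a query tree: leaves are basic queries, tuples are operators."""
    if isinstance(node, str):
        return index.get(node, set())
    operator, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if operator == "OR":
        return a | b      # union; sets eliminate duplicates
    if operator == "AND":
        return a & b      # intersection
    if operator == "BUT":
        return a - b      # difference: satisfies left but not right
    raise ValueError(f"unknown operator: {operator}")
```

Because the result is a plain set, a doc either is in the answer or is not, which is exactly the "no partial matching, no ranking" limitation noted above. Trees nest naturally, e.g. `("AND", "e1", ("OR", "e1", "e2"))`.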
Keyword-based Querying (cont.)

• Boolean Queries (cont.)


– A relaxed version: a “fuzzy Boolean” set of operators
• The meaning of AND and OR can be relaxed
– all : the AND operator
– one: the OR operator (at least one)
– some: retrieve elements appearing in more
operands (sets of docs) than the OR requires

• Docs are ranked higher when having a larger number of


elements in common with the query

– Naïve users have trouble with Boolean Queries
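
One way to picture the relaxed operators is to count, for each doc, how many operands it satisfies. The sketch below is an assumption-laden illustration: the function name `fuzzy_rank` and the `min_matches` threshold are hypothetical, with `min_matches` equal to the number of operands playing the role of "all", 1 playing "one", and anything in between playing "some".

```python
from collections import Counter

def fuzzy_rank(operand_sets, min_matches=1):
    """Rank docs by how many operands (sets of docs) they appear in."""
    counts = Counter()
    for doc_set in operand_sets:
        counts.update(doc_set)  # each doc counted once per operand
    hits = [(doc, n) for doc, n in counts.items() if n >= min_matches]
    # Highest count first; doc id breaks ties for a stable order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

# Hypothetical answer sets of three basic queries.
operands = [{"d1", "d2", "d3"}, {"d2", "d3"}, {"d3", "d4"}]
```

Here `fuzzy_rank(operands, 3)` behaves like AND, `fuzzy_rank(operands, 1)` like OR, and `fuzzy_rank(operands, 2)` like "some", with docs sharing more elements with the query ranked higher in every case.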

Keyword-based Querying (cont.)
• Natural language
– Push the fuzzy Boolean model even further
• The distinction between AND and OR is completely blurred

– A query can be an enumeration of words and/or context queries

– Typically, a query is treated as a bag of words (ignoring the
context) for the vector space model
• Term-weighting, relevance feedback, etc.

– All the documents matching a portion of the user query are
retrieved
• Docs matching more parts of the query are assigned a higher
ranking

– Negation also can be handled by penalizing the ranking score


• E.g. some words are not desired
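
A bag-of-words scorer with a negation penalty might look like the sketch below. It is only one possible reading of the slide: the scoring rule (matched terms minus a fixed `penalty` per unwanted term), the function name `rank`, and the toy docs are all assumptions, and a real system would use term weighting rather than raw counts.

```python
def rank(query_terms, unwanted_terms, docs, penalty=1.0):
    """Retrieve docs matching any query term; penalize unwanted words."""
    results = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = sum(1 for term in query_terms if term in words)
        if matched == 0:
            continue  # a doc must match at least a portion of the query
        unwanted = sum(1 for term in unwanted_terms if term in words)
        results.append((doc_id, matched - penalty * unwanted))
    return sorted(results, key=lambda r: (-r[1], r[0]))

# Hypothetical collection for illustration.
news = {
    "d1": "football match results",
    "d2": "football transfer news",
    "d3": "cooking recipes",
}
```

For the query terms `football`, `results`, `news` with `transfer` marked as not desired, `d1` and `d2` both match two terms, but the penalty pushes `d2` below `d1`; `d3` matches nothing and is not retrieved.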
Keyword-based Querying (cont.)

• Natural language

Pattern Matching

• Pattern matching: allow the retrieval of docs based on


some patterns
– A pattern is a set of syntactic features that must occur in a text
segment
• Segments satisfying the pattern specifications are said to
“match the pattern”
• E.g. the prefix of a word
– A kind of data retrieval

• Pattern matching (data retrieval) can be viewed as an


enhanced tool for information retrieval
– Require more sophisticated data structures and algorithms to
retrieve efficiently

Pattern Matching (cont.)

• Types of patterns
– Words: most basic patterns

– Prefixes: a string from the beginning of a text word


• E.g. ‘comput’: ‘computer’, ‘computation’,…

– Suffixes: a string from the termination of a text word


• E.g. ‘ters’: ‘computers’, ‘testers’, ‘painters’,…

– Substrings: A string within a text word


• E.g. ‘tal’: ‘coastal’, ‘talk’, ‘metallic’, …

– Ranges: a pair of strings matching any words lying between them


in lexicographic order
• E.g. between ‘held’ and ‘hold’: ‘hoax’ and ‘hissing’,…
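
The pattern types listed above map directly onto simple string tests over a vocabulary. The sketch below assumes a small in-memory word list built from the slide's own examples and treats range bounds as inclusive; both are assumptions for illustration.

```python
# Vocabulary assembled from the example words on this slide.
vocabulary = sorted([
    "coastal", "computation", "computer", "computers", "held", "hissing",
    "hoax", "hold", "metallic", "painters", "talk", "testers",
])

def prefix_matches(prefix, words):
    return [w for w in words if w.startswith(prefix)]

def suffix_matches(suffix, words):
    return [w for w in words if w.endswith(suffix)]

def substring_matches(fragment, words):
    return [w for w in words if fragment in w]

def range_matches(low, high, words):
    """Words lying between low and high in lexicographic order (inclusive)."""
    return [w for w in words if low <= w <= high]
```

A linear scan keeps the example short; over a sorted vocabulary, prefix and range queries could instead be answered with binary search.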

Pattern Matching (cont.)
– Allowing errors: a word together with an error threshold
• Useful when the query or doc contains typos or misspellings

• Retrieve all text words which are ‘similar’ to the given word

• Edit (or Levenshtein) distance: the minimum number of
character insertions, deletions, and replacements needed
to make two strings equal
– E.g. ‘flower’ and ‘flo wer’ are at edit distance 1 (one inserted space)

• maximum allowed edit distance: query specifies the


maximum number of allowed errors for a word to match the
pattern
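
Matching with a maximum allowed edit distance can be sketched as below. The two-row dynamic-programming layout and the helper name `approximate_matches` are choices made for this example; all three operations cost 1, as in the definition above.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or replace
        prev = curr
    return prev[-1]

def approximate_matches(word, vocabulary, max_errors):
    """All vocabulary words within `max_errors` edits of `word`."""
    return [w for w in vocabulary if edit_distance(word, w) <= max_errors]
```

For instance, with a threshold of one error the query `flower` retrieves `flo wer` and `flowers` but not `floor` or `tower`, which are each two edits away.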

Pattern Matching (cont.)
• String Alignment: Using Dynamic Programming

[Figure: dynamic-programming grid. Horizontal axis: doc string (test), positions 0 … n; vertical axis: query string (reference), positions 0 … m. Cell (i,j) is reached from (i-1,j) by an insertion, from (i,j-1) by a deletion, or diagonally from (i-1,j-1); the path runs from (0,0) to (n,m).]

Pattern Matching (cont.)
• String Alignment: Using Dynamic Programming

Step 1: Initialization
G[0][0] = 0;
for i = 1,..., n { //test
    G[i][0] = G[i-1][0] + 1;
    B[i][0] = 1; //Insertion (Horizontal Direction)
}
for j = 1,..., m { //reference
    G[0][j] = G[0][j-1] + 1;
    B[0][j] = 2; //Deletion (Vertical Direction)
}

Step 2: Iteration
for i = 1,..., n { //test
    for j = 1,..., m { //reference
        G[i][j] = min of
            G[i-1][j] + 1      (Insertion)
            G[i][j-1] + 1      (Deletion)
            G[i-1][j-1] + 1    (if LR[j] != LT[i], Substitution)
            G[i-1][j-1]        (if LR[j] = LT[i], Match)
        B[i][j] = the code of the term that attains the minimum:
            1; //Insertion (Horizontal Direction)
            2; //Deletion (Vertical Direction)
            3; //Substitution (Diagonal Direction)
            4; //Match (Diagonal Direction)
    } //for j, reference
} //for i, test

Step 3: Measure and Backtrace
String Error Rate = 100% × G[n][m] / m
String Accuracy Rate = 100% − String Error Rate
Optimal backtrace path = (B[n][m] → ..... → B[0][0])
if B[i][j] = 1 print "LT[i]"; //Insertion, then go left
else if B[i][j] = 2 print "LR[j]"; //Deletion, then go down
else print "LR[j] LT[i]"; //Hit/Match or Substitution, then go down diagonally

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here
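
The three steps above can be sketched in Python as follows. This is an illustrative transcription, not the author's code: the function names are hypothetical, strings are indexed from 0 (so `test[i - 1]` plays the role of LT[i]), and ties between equally cheap moves are broken in favor of the smaller backtrace code, so other optimal alignments may exist.

```python
def align(test, ref):
    """Return (distance, operations) aligning `test` against `ref`."""
    n, m = len(test), len(ref)
    G = [[0] * (m + 1) for _ in range(n + 1)]  # accumulated cost
    B = [[0] * (m + 1) for _ in range(n + 1)]  # backtrace codes 1..4
    for i in range(1, n + 1):
        G[i][0], B[i][0] = i, 1                # insertions only
    for j in range(1, m + 1):
        G[0][j], B[0][j] = j, 2                # deletions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = test[i - 1] == ref[j - 1]
            G[i][j], B[i][j] = min(
                (G[i - 1][j] + 1, 1),                                 # insertion
                (G[i][j - 1] + 1, 2),                                 # deletion
                (G[i - 1][j - 1] + (0 if hit else 1), 4 if hit else 3),
            )
    ops, i, j = [], n, m
    while i > 0 or j > 0:                      # walk back to (0, 0)
        code = B[i][j]
        if code == 1:
            ops.append(("Ins", test[i - 1])); i -= 1
        elif code == 2:
            ops.append(("Del", ref[j - 1])); j -= 1
        else:
            ops.append(("Hit" if code == 4 else "Sub", ref[j - 1])); i -= 1; j -= 1
    ops.reverse()
    return G[n][m], ops

def error_rate(test, ref):
    """String error rate in percent: 100 * G[n][m] / m."""
    distance, _ = align(test, ref)
    return 100.0 * distance / len(ref)
```

For the reference `ACBCC` and the test string `BAAC`, the distance is 4 and the error rate is 80%.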
Pattern Matching (cont.)
• String Alignment: Using Dynamic Programming
Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

[Figure: alignment grid for Correct: A C B C C (axis j) and Test: B A A C (axis i). Each cell holds the counts (Ins,Del,Sub,Hit) accumulated along a best path to that cell.]

Three optimal alignments are shown, each with WER = 80%. Two of the operation sequences are fully recoverable:
Ins B, Hit A, Del C, Sub B, Hit C, Del C
Ins B, Hit A, Sub C, Del B, Hit C, Del C
Pattern Matching (cont.)
– Regular Expressions
• General patterns are built up by simple strings and several
operations

• union: if e1 and e2 are regular expressions, then (e1 | e2) matches


what e1 or e2 matches

• concatenation: if e1 and e2 are regular expressions, the


occurrences of (e1 e2) are formed by the occurrences of e1
immediately followed by those of e2

• repetition (Kleene closure): if e is a regular expression, then (e*)
matches a sequence of zero or more contiguous occurrences of e

• Example:
– ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ matches words
‘problem2’, ‘proteins’, etc.
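
The example pattern can be tried out with Python's `re` module, as a sketch. The translation is an assumption of this example: `(s | ε)` is written as the optional `s?`, `(0 | 1 | 2)*` as the character class `[0-2]*`, and `fullmatch` is used so that the whole word must match.

```python
import re

# 'pro (blem | tein) (s | ε) (0 | 1 | 2)*' in Python regex syntax.
pattern = re.compile(r"pro(blem|tein)s?[0-2]*")

def matches(word):
    """True if the entire word is matched by the pattern."""
    return pattern.fullmatch(word) is not None
```

Under this translation `problem2` and `proteins` match, as on the slide, while `problem3` does not because 3 is outside the allowed digits.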

Pattern Matching (cont.)

– Extended Patterns
• Subsets of the regular expressions expressed with a simpler
syntax
• System can convert extended patterns into regular expressions,
or search them with specific algorithms
• E.g.: classes of characters:
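
As a rough illustration of converting an extended pattern into a regular expression, the sketch below uses shell-style wildcards as a stand-in for an extended pattern syntax. That choice is an assumption of this example (the slides do not fix a syntax); `fnmatch.translate` from the Python standard library does the conversion, and `extended_to_regex` is a hypothetical helper name.

```python
import fnmatch
import re

def extended_to_regex(pattern):
    """Compile a shell-style pattern ('*', '?', '[...]') into a regex."""
    return re.compile(fnmatch.translate(pattern))

# '[ae]' and '[0-9]' are classes of characters; '*' stands for any string.
prefix_with_class = extended_to_regex("comput[ae]*")
word_plus_digit = extended_to_regex("problem[0-9]")
```

The simpler syntax covers only a subset of regular expressions, which matches the slide's point: the system can either translate such patterns, as here, or search them with specialized algorithms.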

Structural Queries

• Docs are allowed to be queried with respect to both their


text content and structural constraints
– Text content: words, phrases, or patterns
– Structural constraints: containment, proximity, or other
restrictions on the structural elements (e.g., chapters, sections,
etc.)
• Standardization of languages used to represent structured
text, e.g., HTML…
Mixing contents and structures in queries
– Built on top of basic queries

[Figure: a query on text content is processed by the text retrieval model, giving a set of retrieved documents; a structural query (structural constraints) is then applied with the Boolean model, giving the final set of retrieved documents.]
Structural Queries (cont.)

• Three main (text) structures discussed here, from simple to complex
– Form-like fixed structure
– Hierarchical structure
– Hypertext structure

• What structure may a text have?
• What can be queried about that structure? (the query model)
• How to rank docs?

Form-like Fixed Structure (cont.)
• Docs have a fixed set of fields, much like a filled form
– Each field has some text inside
– Some fields are not present in all docs
– Text has to be classified into a field
– Fields are not allowed to nest or overlap
– A given pattern can only be associated with a specified field
– Cannot represent the text hierarchy

– E.g., a mail archive (sender, receiver, date, subject, body ..)
• Search for the mail sent to a given person with “football” in
the subject field

• Compared with relational database systems
– Relational systems give different fields different data types and are more rigid!
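
The mail-archive query can be sketched with each doc as a flat dictionary of fields. The sample mails, the `search` helper, and word-level matching inside a field are all assumptions made for this example.

```python
# Hypothetical mail archive; the second mail has no subject field.
mails = [
    {"sender": "alice", "receiver": "bob",
     "subject": "football tickets", "body": "two seats left"},
    {"sender": "carol", "receiver": "bob", "body": "football tonight?"},
    {"sender": "bob", "receiver": "alice",
     "subject": "football schedule", "body": "see attached"},
]

def search(mails, **field_words):
    """Return mails in which every named field contains the given word."""
    hits = []
    for mail in mails:
        if all(word in mail.get(field, "").lower().split()
               for field, word in field_words.items()):
            hits.append(mail)
    return hits
```

`search(mails, receiver="bob", subject="football")` returns only the first mail: the second mentions football in the body but has no subject field, which shows that a pattern is tied to one specified field and that a missing field simply never matches.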
Hypertext Structure (cont.)
• A hypertext is a directed graph where
– Nodes hold some text (content)
– The links represent connections (structural connectivity)
between nodes or between positions inside the nodes

[Figure: a small hypertext graph with nodes A, B and C]

• Retrieval from a hypertext began as a merely
navigational activity
– Manually traverse the hypertext nodes following links to search
for what one wanted
– It’s still difficult to query the hypertext based on its structure

• An interesting proposal to combine browsing and
searching on the web: WebGlimpse
– Allow classical navigation plus the ability to search by content in
the neighborhood of the current node
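
Searching by content in the neighborhood of the current node can be pictured as a bounded breadth-first traversal followed by a keyword test. The sketch below is not WebGlimpse's actual implementation; the graph, the node texts, and the definition of "neighborhood" as nodes reachable within `max_hops` links are all assumptions for illustration.

```python
from collections import deque

# Hypothetical hypertext: directed links and the text held by each node.
links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
text = {
    "A": "home page",
    "B": "query languages",
    "C": "pattern matching",
    "D": "query protocols",
    "E": "query trends",
}

def neighborhood(start, links, max_hops):
    """Nodes reachable from `start` by following at most `max_hops` links."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the allowed radius
        for target in links.get(node, []):
            if target not in hops:
                hops[target] = hops[node] + 1
                queue.append(target)
    return hops

def search_neighborhood(start, word, links, text, max_hops):
    """Nodes near `start` whose text contains `word`."""
    nearby = neighborhood(start, links, max_hops)
    return sorted(n for n in nearby if word in text.get(n, "").lower().split())
```

From node A, a search for `query` finds only B within one hop, and B and D within two, so the radius controls how far the content search strays from the page being browsed.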
Hierarchical Structure (cont.)

• An intermediate structuring model which lies between


form-like fixed structure and hypertext structure
• Represent a recursive decomposition of the text and is a
natural model for many text collections
– E.g., books, articles, legal documents,…

[Figure: a parsed query used to retrieve the figure]

Issues of Hierarchical Structure

• Static or dynamic structure


– Static: one or more explicit hierarchies can be queried, e.g., by
ancestry
– Dynamic: not really a hierarchy, the required elements are built
on the fly
• Implemented over a normal text index

• Restrictions on the structure


– The text or the answers may have restrictions about nesting
and/or overlapping for efficiency reasons

– In other cases, the query language is restricted to avoid


restricting the structure
The more powerful the model, the less efficiently it can be implemented

Issues of Hierarchical Structure (cont.)

• Integration with text


– Effective integration of queries on text content with queries on
text structure

– From the perspectives of classical IR models
and structural models, respectively
• Classical model: primary -> text, secondary -> structure
• Structural model: primary -> structure, secondary -> text

• Query language

– Some features for queries on structure including selection of


areas that
• Contain (or not) other areas
• Are contained (or not) in other areas
• Follow (or are followed by) other areas
• Are close to other areas

– Also including set manipulation
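
The structural selection features can be sketched by modelling each area as a span of text positions. Everything here is an assumption for illustration: the `Area` type with an exclusive `end`, the helper names, and the reading of "close to" as a gap of at most `max_gap` positions between non-overlapping areas.

```python
from typing import NamedTuple

class Area(NamedTuple):
    name: str
    start: int  # first text position covered
    end: int    # one past the last position covered

def contains(outer, inner):
    """True if `inner` lies entirely within a different area `outer`."""
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end

def follows(later, earlier):
    """True if `later` begins at or after the end of `earlier`."""
    return later.start >= earlier.end

def close_to(a, b, max_gap):
    """True if the two areas do not overlap and are at most `max_gap` apart."""
    gap = max(a.start - b.end, b.start - a.end)
    return 0 <= gap <= max_gap

def select_containing(areas, targets):
    """Areas that contain at least one of the target areas."""
    return [a for a in areas if any(contains(a, t) for t in targets)]

chapter = Area("chapter", 0, 100)
section1 = Area("section 1", 0, 40)
section2 = Area("section 2", 40, 100)
figure = Area("figure", 50, 60)
```

Since each selection returns a list of areas, results can be combined with ordinary set manipulation, for example intersecting the sections that contain a figure with the sections that follow another section.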


Query Protocols
• The query languages used automatically by software
applications to query text databases
– Standards for querying CD-ROMs
– Or, intermediate languages to query library systems

• Important query protocols


– Z39.50
• For bibliographical information systems
• Protocols for not only the query language but also the client-
server connection
– WAIS (Wide Area Information Servers)
• A network publishing protocol
• For querying databases through the Internet
Query Protocols (cont.)

• CD-ROM publishing protocols


– Provide “disk interchangeability”: flexibility in data
communication between primary information providers and end
users

– Some example protocols


• CCL (Common Command Language)
• CD-RDx (Compact Disk Read only Data exchange)
• SFQL (Structured Full-text Query Language)

Trends and Research Issues

• Types of queries and how they are structured
