Information Retrieval: File Structures Explained

The document is a report from the American College of Technology on Information Retrieval, focusing on file and data structures. It covers various indexing methods, file structures, and data structures used in information retrieval systems, including inverted files, tries, and suffix trees. The report is authored by a group of students under the supervision of Mr. Philimon Tsige and was submitted on January 10, 2023.

American College of Technology

Information Retrieval

File and Data Structure

Group Member ID Number


1. Alazar Demmelash 004/BSc-B2/20
2. Hayat Hussien 011/BSc-B2/20
3. Selam Girmay 025/BSc-B2/20
4. Yemisrach Ermiyas 028/BSc-B2/20

Supervisor: Mr. Philimon Tsige


Submission Date: 10/01/2023
Content
Introduction
1. What is index and indexing?
2. File structure
2.1. Inverted files
2.1.1. Construction of inverted file
2.1.2. Inverted index construction
2.1.3. Compressed inverted files
2.2. Tries and Suffix trees
2.2.1. Tries
2.2.1.1. What are TRIE data structure usage or applications?
2.2.2. Suffix Trees
2.3. Sequential files
2.4. Signature files
2.5. Flat files
2.6. PAT trees
3. Data structure
3.1. Linear data structure
3.2. Non-linear data structure
4. Data structures for posting lists
4.1. Singly linked list
4.2. Variable length array
4.3. Hybrid scheme

Introduction

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored on
computers). Information retrieval technology has been central to the success of the Web.
- Information retrieval is the process of obtaining relevant information from a collection
of informational resources. A query does not identify a single object in the collection;
instead, several objects may match it, with varying degrees of relevance to the query.
- So, we have to think about what concepts IR systems use to model this data so that they
can return all the documents that are relevant to the query, ranked by certain importance
measures.
- These concepts include dimensionality reduction, data modeling, ranking measures, and
clustering. These tools help an IR system return results faster.
- When computing the results and their relevance, programmers use these concepts to
design their system and to choose the data structures and procedures that speed up
searches and handle data better.
An IR system accepts a query from a user and responds with a set of documents. The system
returns both relevant and non-relevant material, and a document organization approach is
applied to help the user find the relevant information in the retrieved set.

1. What is index and indexing?

An index is a data structure designed to make search faster. Text search has unique requirements,
which lead to unique data structures. The most common data structure is the inverted index:

- General name for a class of structures.
- “Inverted” because documents are associated with words, rather than words with
documents (similar to a concordance).

Indexing is the process of transforming items (documents) into a searchable data structure.

- The index is a data structure for computationally efficient retrieval.
- For each term t we store the list of all documents that contain t.
- Indexing creates document surrogates to represent each document.
- It requires analysis of the original documents:
  - Simple: identify meta-information.
  - Complex: linguistic analysis of content.

The search process involves correlating user queries with the documents represented in the index.

2. File structure

A file structure is a combination of representations for data in files and a collection of
operations for accessing the data. It enables applications to read, write, and modify data. File
structures may also help to find the data that matches certain criteria.

The main goal of developing file structures is to minimize the number of trips to the disk needed
to get the desired information. A fundamental decision in the design of information retrieval
systems is therefore which type of file structure to use for the underlying document database.

The file structures used in IR systems are:

a. Sequential Files
b. Flat Files
c. Inverted Files
d. Signature Files
e. Tries and Suffix Trees
f. PAT Trees

2.1. Inverted files

An inverted file extracts all the words from each field of each record entered into the database
and sorts them into alphabetical order. “Stop” words (such as the, an, of, and, that, is), which
are function words with no substantive meaning of their own and which occur very frequently, are
not included in the inverted file.

The structure of an inverted file entry is usually keyword, document-ID, and field-ID.

- Keyword is an indexing term that describes the document,
- Document-ID is a unique identifier for a document, and
- Field-ID is a unique name that indicates from which field in the document the keyword
came.

In computer science, an inverted index (also referred to as a postings file or inverted file) is a
database index storing a mapping from content, such as words or numbers, to its locations in a
table, in a document, or in a set of documents. It is named in contrast to a forward index, which
maps from documents to content.

2.1.1. Construction of Inverted file

An inverted file index has two main parts:

A. Vocabulary File (Search Structure)

The vocabulary file stores all the distinct terms (keywords) that appear in any of the documents.
The record kept for each term j in the word list contains the following:

- Term j
- Number of documents in which term j occurs
- Total frequency of term j

B. Posting file

For each distinct term in the vocabulary, the posting file stores a list of pointers to the
documents that contain the term.

- Each item in the list, which records that a term appeared in a document (and, later, often
the positions in the document), is conventionally called a posting.
- The list is then called a postings list, and all the postings lists taken together are
referred to as the postings.

To increase the speed of searching time:

2.1.2. Inverted index construction

The following steps construct an inverted index, illustrated with two sample documents:

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

1. Modify the documents: break each one into a sequence of (term, document-ID) entries.
2. Sort the entries by term.
3. Merge multiple entries of a term within a single document.
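
The construction described above (collect term and document-ID entries, sort them by term, merge repeated entries) could be sketched in Python roughly as follows. This is a minimal illustration, not the report's own implementation: the function name `build_inverted_index`, the record fields `doc_freq`, `total_freq`, and `postings`, and the letters-only tokenizer are all assumptions made for the example.

```python
import re
from itertools import groupby


def build_inverted_index(docs):
    """docs maps a document ID to its text; returns term -> vocabulary record."""
    # Step 1: turn every document into (term, doc_id) entries.
    # Assumption: a token is a lowercase run of letters.
    pairs = [
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in re.findall(r"[a-z]+", text.lower())
    ]
    # Step 2: sort by term (ties are ordered by document ID).
    pairs.sort()
    # Step 3: merge repeated entries into one postings list per term.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        doc_ids = [doc_id for _, doc_id in group]
        postings = sorted(set(doc_ids))
        index[term] = {
            "doc_freq": len(postings),    # number of documents containing the term
            "total_freq": len(doc_ids),   # occurrences over all documents
            "postings": postings,         # sorted document IDs
        }
    return index


docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
index = build_inverted_index(docs)
print(index["caesar"])  # {'doc_freq': 2, 'total_freq': 3, 'postings': [1, 2]}
```

Each record holds the vocabulary information (document count and total frequency) next to its postings list; a disk-based system would normally keep the two in separate files.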

2.1.3. Compressed inverted files


The inverted lists themselves are sequences of record identifiers, sorted to allow fast query
evaluation. Because the identifiers are sorted, a list can be stored as the differences (gaps)
between successive identifiers, and these small numbers can be represented compactly with
variable-length codes.

This approach has the disadvantage that inverted lists must be decoded as they are retrieved, but
such decompression can be fast. Moreover, by inserting a small amount of additional indexing
information in each list, a large part of the decompression can be avoided, so that on current
hardware the limiting factor is transfer time, not decompression time.
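
As an illustration of how a sorted inverted list can be compressed, the sketch below stores the gaps between successive identifiers and encodes each gap with a variable-byte code. The report does not name a particular code, so variable-byte encoding is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace each sorted identifier by its difference from the previous one."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])] if doc_ids else []


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the identifiers."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, high-order group first.
    The top bit is set only on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80          # chunk[0] holds the lowest 7 bits: the last byte written
        out.extend(reversed(chunk))
    return bytes(out)


def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:           # terminating byte of the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


postings = [824, 829, 215406]
encoded = vbyte_encode(to_gaps(postings))     # gaps are [824, 5, 214577]
print(encoded.hex())                          # 06b8850d0cb1
print(from_gaps(vbyte_decode(encoded)))       # [824, 829, 215406]
```

Here three identifiers occupy six bytes instead of the twelve needed with fixed four-byte integers, at the cost of a decoding pass when the list is read.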

There are several widely held beliefs about inverted files that are either fallacious or incorrect
once compression of index entries is taken into account:

- The assumption that sorting of inverted lists during query evaluation is an unacceptable
cost.
- The assumption that a random disk access will be required for each record identifier for
each term, as if inverted lists were stored as a linked list on disk.
- The assumption that, if the vocabulary is stored on disk, log N accesses are required to
fetch an inverted list, where N is variously the number of documents in the collection or
the number of distinct terms in the collection.

Example: Say there are three documents.

Doc 1: Milk is nutritious.
Doc 2: Bread and milk tastes good.
Doc 3: Brown bread is better.

After stop-word elimination and stemming, the inverted index looks like:

Term         Documents containing the term
Better       3
Bread        2, 3
Brown        3
Good         2
Milk         1, 2
Nutritious   1
Taste        2
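
The three-document example above can be reproduced with a few lines of Python. This is only a sketch: the stop-word list contains just the words needed here, and `stem` is a toy rule invented for this example (strip a trailing "s"), not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"is", "and"}  # assumption: just enough for this example


def stem(word):
    # Toy stemmer (assumption): drop a plural "s", but leave "-ss" and "-us" endings alone.
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


def index_documents(docs):
    index = {}
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOP_WORDS:
                continue                      # stop-word elimination
            index.setdefault(stem(token), set()).add(doc_id)
    # Sort terms alphabetically and document IDs numerically.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    1: "Milk is nutritious.",
    2: "Bread and milk tastes good.",
    3: "Brown bread is better.",
}
for term, doc_ids in index_documents(docs).items():
    print(term, doc_ids)
```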

2.2. Tries and suffix trees


2.2.1. Tries
In computer science, a trie, also called a digital tree and sometimes a radix tree or prefix
tree (as it can be searched by prefixes), is a kind of search tree: an ordered tree data structure
that is used to store a dynamic set or associative array where the keys are usually strings.

A trie is a tree data structure in which, unlike a binary tree, a node may have one child for each
symbol of the alphabet. It stores the data in a particular fashion so that retrieval becomes much
faster, which helps performance. The name "TRIE" is coined from the word retrieval.

2.2.1.1. What are TRIE data structure usage or applications?
1. Dictionary suggestions OR Auto complete dictionary

Auto suggestion of words while searching for anything in a dictionary is very common. If we
search for the word “tiny”, it suggests words starting with the same characters, such as “tine”,
“tin”, and “tinny”.

2. Searching contact from mobile contact list OR Phone directory

Retrieving data stored in a trie is very fast, so it is best suited to applications where
retrievals are performed frequently, such as a phone directory, where contact searching is a
frequent operation.

A prefix tree or trie is a tree whose nodes don’t hold keys, but rather hold partial keys. For
example, if you have a prefix tree that stores strings, then each node would be a character of a
string. If you have a prefix tree that stores arrays, each node would be an element of that array.
The elements are ordered from the root.

Prefix trees are good for looking up keys with a particular prefix.

Example: Represent the following map with a trie:

Key        Value
Instant    1
Internal   2
Internet   3
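
A minimal Python sketch of such a trie is given below, storing the example map and looking up keys by prefix. The class and method names (`TrieNode`, `Trie`, `keys_with_prefix`) are illustrative choices, and a dictionary of children per node is just one possible representation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.value = None    # set only on nodes that end a stored key


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def get(self, key):
        node = self._find(key)
        return None if node is None else node.value

    def keys_with_prefix(self, prefix):
        """All (key, value) pairs whose key starts with prefix, in sorted order."""
        results = []

        def walk(node, path):
            if node.value is not None:
                results.append((path, node.value))
            for ch in sorted(node.children):
                walk(node.children[ch], path + ch)

        start = self._find(prefix)
        if start is not None:
            walk(start, prefix)
        return results


trie = Trie()
for key, value in [("Instant", 1), ("Internal", 2), ("Internet", 3)]:
    trie.insert(key, value)

print(trie.get("Instant"))             # 1
print(trie.keys_with_prefix("Inter"))  # [('Internal', 2), ('Internet', 3)]
```

The three keys share the path "In", and "Internal" and "Internet" share "Intern", so a prefix lookup only has to walk the shared path once before collecting the keys below it.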

There are two categories of tries:

A. Non-compact tries: every edge of the underlying tree represents a symbol of the
alphabet.
B. Compact tries: unary nodes which lead to leaves are trimmed (removed).

A trie representing the set of strings given below:

Example: aeef, ad, bbfe, bbfg, c

[Figure: non-compact trie and compact trie for the strings above]

2.2.2. Suffix Trees


A suffix trie is an ordinary trie in which the input strings are all the possible suffixes of a
text.

The suffix tree, also referred to as a position tree, is another variation of the trie. As its name
suggests, this kind of trie is concerned with the suffixes of a given string. It is a static
structure that preprocesses a large string S to allow faster matching of any substring of S.

It stores the suffixes of a string as its keys and the positions of those suffixes in the string as
its values.

- To build the suffix trie we use these indices instead of the actual suffix strings.

Example: Suffix tree

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:

- $
- b$
- ab$
- bab$
- abab$
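
To make the idea concrete, the sketch below builds the uncompressed suffix trie of abab$ and uses it to find every position where a substring occurs. It is an illustration only: a real suffix tree compresses chains of single-child nodes, which this quadratic-size version does not, and the function names are assumptions made for the example.

```python
def suffixes(text):
    """Pair each suffix with its 1-based starting position in text."""
    return [(text[i:], i + 1) for i in range(len(text))]


def new_node():
    return {"children": {}, "positions": []}


def build_suffix_trie(text):
    root = new_node()
    for suffix, position in suffixes(text):
        node = root
        for ch in suffix:
            node = node["children"].setdefault(ch, new_node())
            # Every node remembers which suffixes pass through it.
            node["positions"].append(position)
    return root


def find_occurrences(root, pattern):
    """Starting positions (1-based) of pattern in the indexed text."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["positions"]


trie = build_suffix_trie("abab$")
print([s for s, _ in suffixes("abab$")])  # ['abab$', 'bab$', 'ab$', 'b$', '$']
print(find_occurrences(trie, "ab"))       # [1, 3]
print(find_occurrences(trie, "ba"))       # [2]
```

Matching a pattern takes time proportional to the pattern length, independent of the text length, which is the benefit the preprocessing buys.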

Example 2: the string "banana", with character positions 1 to 6:

b a n a n a
1 2 3 4 5 6

[Figure: suffix trie, suffix tree, and suffix array for "banana"]
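
For the suffix array, a short Python sketch is shown below. It sorts the starting positions of the suffixes of "banana" and then locates a pattern with two binary searches. The naive sort is used only for clarity (practical systems use faster construction algorithms), and the function names are hypothetical.

```python
def suffix_array(text):
    """1-based starting positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Positions where pattern occurs, found by binary search over the suffix array."""
    def prefix_at(k):
        start = sa[k] - 1
        return text[start:start + len(pattern)]

    # Lower bound: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "banana"
sa = suffix_array(text)
print(sa)                                 # [6, 4, 2, 1, 5, 3]
print(find_occurrences(text, sa, "ana"))  # [2, 4]
```

The array [6, 4, 2, 1, 5, 3] corresponds to the sorted suffixes a, ana, anana, banana, na, nana.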

2.3. Sequential Files
A sequential file is the most primitive file structure. It has no vocabulary (unique list of
words) and no linking pointers. The records are generally arranged serially, one after another, in
lexicographic order on the value of some key field.

- A particular attribute is chosen as the primary key, whose value determines the order of
the records.
- When the first key fails to discriminate among records (that is, when the attribute value is
the same for a large number of records), a second key is chosen to give an order.
- There is no directory and no linking pointer.

Its main advantages:
- Easy to implement.
- Fast access to the next record.

Its disadvantages:
- No weights are attached to terms. Individual words are treated independently.
- Random access is slow: since similar terms are indexed individually, we need to find all
terms that match the query.

2.4. Signature Files

Signature files contain signatures, bit patterns that represent documents. There are various ways
of constructing signatures. Using one common signature method, for example, documents are split
into logical blocks, each containing a fixed number of distinct significant (that is, non-stoplist)
words. Each word in the block is hashed to give a signature: a bit pattern with some of the bits
set to 1. The block signatures are then concatenated to produce the document signature.
Searching is done by comparing the signatures of queries with document signatures.

The main idea is to divide the document into blocks of fixed size and to assign each block a
signature (also of fixed size), which is used to search the document for the queried pattern.

Consider:

- H(information) = 010001
- H(text) = 010010
- H(data) = 110000
- H(retrieval) = 100010

The block signatures of a document D containing the text “textual retrieval and information
retrieval” (after removing stop words and stemming), for a block size of two terms, are obtained
by OR-ing the signatures of the terms in each block:

- B1D = H(text) OR H(retrieval) = 110010
- B2D = H(information) OR H(retrieval) = 110011

To search for a given term, we check whether the term’s bit string is “inside” a block signature,
that is, whether every bit set in the term’s signature is also set in the block signature.
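
The example can be checked with a short Python sketch. The hash values are taken directly from the list above; the function names and the fixed table `H` standing in for a real hash function are assumptions made for illustration.

```python
# Term signatures from the example (a real system would compute them with a hash function).
H = {
    "information": 0b010001,
    "text":        0b010010,
    "data":        0b110000,
    "retrieval":   0b100010,
}


def block_signatures(terms, block_size=2):
    """OR together the signatures of each run of block_size terms."""
    signatures = []
    for start in range(0, len(terms), block_size):
        signature = 0
        for term in terms[start:start + block_size]:
            signature |= H[term]
        signatures.append(signature)
    return signatures


def may_contain(block_signature, term):
    """True if every bit of the term's signature is set in the block signature."""
    term_signature = H[term]
    return (block_signature & term_signature) == term_signature


# "textual retrieval and information retrieval" after stop-word removal and stemming:
terms = ["text", "retrieval", "information", "retrieval"]
b1, b2 = block_signatures(terms)
print(format(b1, "06b"), format(b2, "06b"))  # 110010 110011
print(may_contain(b1, "information"))        # False
print(may_contain(b1, "data"))               # True, although "data" is not in the block
```

The last line shows a false match: the bits of H(data) happen to be covered by the bits of text and retrieval together, so a signature match only says that a block may contain the term, and the block itself still has to be checked.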

2.5. Flat files

Though it is possible to keep file structures in main memory, in practice IR databases are
usually stored on disk because of their size. Using a flat file approach, one or more
documents are stored in a file, usually as ASCII or EBCDIC text. Flat file searching is usually
done via pattern matching. On UNIX, for example, one can store a document collection one document
per file in a UNIX directory and search it using pattern searching tools such as grep or awk.
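
As a rough Python stand-in for that kind of pattern search, the sketch below scans every file in a directory line by line and reports the lines that match a regular expression. The function names, the file names, and the sample documents in `demo` are all made up for the illustration.

```python
import re
import tempfile
from pathlib import Path


def search_flat_files(directory, pattern):
    """Return (file name, line number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="ascii", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, line_no, line.rstrip("\n")))
    return hits


def demo():
    # One document per file, as in the UNIX directory layout described above.
    with tempfile.TemporaryDirectory() as directory:
        Path(directory, "doc1.txt").write_text("Milk is nutritious.\n", encoding="ascii")
        Path(directory, "doc2.txt").write_text("Bread and milk tastes good.\n", encoding="ascii")
        Path(directory, "doc3.txt").write_text("Brown bread is better.\n", encoding="ascii")
        return search_flat_files(directory, r"(?i)\bbread\b")


for hit in demo():
    print(hit)
```

Every query reads the whole collection, which is why this approach needs no index but becomes slow as the collection grows.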

2.6. PAT Trees

PAT trees are Patricia trees constructed over all strings in a text. If a document collection is
viewed as a sequentially numbered array of characters, a string is a subsequence of characters
from the array starting at a given point and extending an arbitrary distance to the right. A Patricia
tree is a digital tree where the individual bits of the keys are used to decide branching.

3. Data Structure

A data structure is a format for storing and organizing data. It is a way of arranging data
on a computer so that it can be accessed and updated efficiently.

Classification of Data Structure:

3.1. Linear data structure: A data structure in which data elements are arranged
sequentially or linearly, where each element is attached to its previous and next adjacent
elements, is called a linear data structure.

- Examples of linear data structures are array, stack, queue, linked list, etc.

A. Static data structure: A static data structure has a fixed memory size. It is easier to
access the elements in a static data structure.

- An example of this data structure is an array.

B. Dynamic data structure: In a dynamic data structure, the size is not fixed. It
can change during run time, which can make it efficient with respect to the memory
(space) complexity of the code.

- Examples of this data structure are queue, stack, etc.

3.2. Non-linear data structure: Data structures where data elements are not placed
sequentially or linearly are called non-linear data structures. In a non-linear data
structure, we cannot traverse all the elements in a single run.

- Examples of non-linear data structures are trees and graphs.



4. Data structures for Postings Lists

A posting list is a data structure that maintains the list of documents that contain a particular
term. Generally, a dictionary of terms is built, and then for each term in the dictionary a
posting list is formed containing the list of documents that contain that term. To traverse a
posting list faster, another data structure is used: a pointer called a skip pointer. A pointer
is generally a variable which holds the address of a data item.
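
One way skip pointers can speed up traversal is when two posting lists are intersected. The sketch below is an illustration under stated assumptions: the lists are sorted Python lists, a skip pointer is simulated at every position that is a multiple of the square root of the list length (a common heuristic, not something the report specifies), and the function name is hypothetical.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted posting lists, jumping ahead where a skip is safe."""
    skip1 = max(1, math.isqrt(len(p1)))   # assumed skip length for p1
    skip2 = max(1, math.isqrt(len(p2)))   # assumed skip length for p2
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result


print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))  # [2, 8]
```

A skip is taken only when the entry it lands on is still less than or equal to the value being sought in the other list, so no common document can be jumped over.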

4.1. Singly linked list

- Allows cheap insertion of documents into postings lists (e.g., when re-crawling)

- Naturally extends to skip lists for faster access


4.2. Variable length array

- Better in terms of space requirements

- Also better in terms of time requirements if memory caches are used, as they use
contiguous memory
4.3. Hybrid scheme

- Linked list of variable length arrays for each term.

- Write posting lists on disk as a contiguous block without explicit pointers

- Minimizes the size of postings lists and the number of disk seeks
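
The hybrid scheme could look roughly like the following Python sketch: a linked list whose nodes are small arrays of document IDs. The class names, the fixed `block_size`, and the `flatten` method (which stands in for writing the list to disk as one contiguous block) are assumptions made for illustration.

```python
class Block:
    """One node of the linked list: a small array of document IDs."""

    def __init__(self):
        self.doc_ids = []
        self.next = None


class HybridPostingsList:
    def __init__(self, block_size=4):
        self.block_size = block_size        # assumed capacity of each array
        self.head = self.tail = Block()

    def append(self, doc_id):
        # Start a new block only when the last one is full, so appends stay cheap.
        if len(self.tail.doc_ids) == self.block_size:
            new_block = Block()
            self.tail.next = new_block
            self.tail = new_block
        self.tail.doc_ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.doc_ids
            block = block.next

    def block_count(self):
        count, block = 0, self.head
        while block is not None:
            count += 1
            block = block.next
        return count

    def flatten(self):
        """Contiguous list with no pointers, as it would be laid out on disk."""
        return list(self)


postings = HybridPostingsList(block_size=3)
for doc_id in [1, 4, 7, 9, 12, 15, 20]:
    postings.append(doc_id)

print(postings.block_count())  # 3
print(postings.flatten())      # [1, 4, 7, 9, 12, 15, 20]
```

In memory the structure pays one pointer per block rather than one per posting, while the flattened form needs no pointers at all.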
