Information Retrieval: File Structures Explained

The document is a report from the American College of Technology on Information Retrieval, focusing on file and data structures. It covers various indexing methods, file structures, and data structures used in information retrieval systems, including inverted files, tries, and suffix trees. The report is authored by a group of students under the supervision of Mr. Philimon Tsige and was submitted on January 10, 2023.

Uploaded by

Alazar Demmelash Getahun

American College of Technology

Information Retrieval

File and Data Structure

Group Member ID Number

1. Alazar Demmelash 004/BSc-B2/20
2. Hayat Hussien 011/BSc-B2/20
3. Selam Girmay 025/BSc-B2/20
4. Yemisrach Ermiyas 028/BSc-B2/20

Supervisor: Mr. Philimon Tsige

Submission Date: 10/01/2023

Contents

Introduction

1. What is index and indexing?

2. File structure
2.1. Inverted files
2.1.1. Construction of inverted file
2.1.2. Inverted index construction
2.1.3. Compressed inverted files
2.2. Tries and Suffix trees
2.2.1. Tries
2.2.1.1. What are TRIE data structure usage or application?
2.2.2. Suffix Trees
2.3. Sequential files
2.4. Signature files
2.5. Flat files
2.6. PAT trees
3. Data structure
3.1. Linear data structure
3.2. Non-linear data structure
4. Data structures for posting lists
4.1. Singly linked list
4.2. Variable length array
4.3. Hybrid scheme

Introduction

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored on
computers). Information retrieval technology has been central to the success of the Web.
- Information retrieval is the process of obtaining relevant information from a collection
of information resources. A query does not identify a single object in the collection; it
matches several objects, which vary in their degree of relevance to the query.
- We therefore have to consider which concepts IR systems use to model this data, so that
they can return all the documents relevant to the query terms, ranked by some measure
of importance.
- These concepts include dimensionality reduction, data modeling, ranking measures, and
clustering. They help an IR system return results faster.
- When computing results and their relevance, programmers use these concepts to design
the system and to decide which data structures and procedures will speed up searches
and handle the data better.
An IR system accepts a query from a user and responds with a set of documents. The system
returns both relevant and non-relevant material, and a document organization approach is
applied to help the user find the relevant information in the retrieved set.

1. What is index and indexing?

An index is a data structure designed to make search faster. Text search has unique requirements,
which lead to unique data structures. The most common data structure is the inverted index:

- It is a general name for a class of structures.

- It is called “inverted” because documents are associated with words, rather than words with
documents (similar to a concordance).

Indexing is the process of transforming items (documents) into a searchable data structure.

- The index is a data structure for computationally efficient retrieval.
- For each term t, we store the list of all documents that contain t.
- Indexing creates document surrogates to represent each document.
- It requires analysis of the original documents:
  - Simple: identify meta-information.
  - Complex: linguistic analysis of content.

The search process involves correlating user queries with the documents represented in the index.

2. File structure

A file structure is a combination of representations for data in files and a collection of
operations for accessing that data. It enables applications to read, write, and modify data. File
structures may also help to find the data that matches certain criteria.

The main goal of developing file structures is to minimize the number of trips to the disk needed
to get the desired information. Choosing the file structure for the underlying document database
is therefore a fundamental decision in the design of IR systems.

The file structures used in IR systems are:

a. Sequential Files
b. Flat Files
c. Inverted Files
d. Signature Files
e. Tries and Suffix Trees
f. PAT Trees

2.1. Inverted files

An inverted file extracts all the words from each field of each record entered into the database
and sorts them into alphabetical order. “Stop” words such as the, an, of, and, that, and is, which
have no substantive meaning of their own (function words) and occur very frequently, are not
included in the inverted file.

The structure of an inverted file entry is usually keyword, document-ID, and field-ID:

- Keyword is an indexing term that describes the document,

- Document-ID is a unique identifier for a document, and

- Field-ID is a unique name that indicates from which field in the document the keyword
came.

In computer science, an inverted index (also referred to as a postings file or inverted file) is a
database index that stores a mapping from content, such as words or numbers, to its locations in
a table, in a document, or in a set of documents. It is named in contrast to a forward index,
which maps from documents to content.

2.1.1. Construction of Inverted file

An inverted file index has two main parts:

A. Vocabulary file (search structure)

Stores all the distinct terms (keywords) that appear in any of the documents. The record kept for
each term j in the word list contains the following:

- Term j
- Number of documents in which term j occurs
- Total frequency of term j
B. Posting file

For each distinct term in the vocabulary, the posting file stores a list of pointers to the
documents that contain the term.

- Each item in the list, which records that a term appeared in a document (and, later, often
the positions in the document), is conventionally called a posting.
- The list is then called a postings list, and all the postings lists taken together are referred
to as the postings.

To increase the speed of searching time:

2.1.2. Inverted index construction

An inverted index can be constructed in the following steps, illustrated with two documents:

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

1. Modify the documents: break each one into a sequence of (term, document ID) pairs.
2. Sort the pairs by term.
3. Merge multiple entries of a term in a single document.
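
The three construction steps can be sketched in a few lines of Python. This is only an illustrative sketch: the `tokenize` helper and the choice to keep a per-document term frequency in each posting are assumptions, not part of the report.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def tokenize(text):
    # Hypothetical tokenizer: lower-case the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # Step 1: one (term, doc_id) pair per token occurrence.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Step 2: sort the pairs by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated pairs into one posting (doc_id, frequency in that document).
    index = {}
    for (term, doc_id), group in groupby(pairs):
        index.setdefault(term, []).append((doc_id, len(list(group))))
    return index

index = build_index(docs)
print(index["caesar"])  # [(1, 1), (2, 2)]
print(index["killed"])  # [(1, 2)]
```

Sorting before merging means each postings list comes out already ordered by document ID, which later query processing relies on.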

2.1.3. Compressed inverted files

The inverted lists themselves are sequences of record identifiers, sorted to allow fast query
evaluation. Because the identifiers are sorted, each list can be stored in compressed form, for
example by coding the gaps between consecutive identifiers with a variable-length code.

This approach has the disadvantage that inverted lists must be decoded as they are retrieved, but
such decompression can be fast. Moreover, by inserting a small amount of additional
indexing information in each list, a large part of the decompression can be avoided, so
that on current hardware the limiting factor is transfer time, not decompression time.
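
As a rough illustration of how a sorted inverted list can be compressed, the sketch below stores the gaps between consecutive identifiers and codes each gap with variable-byte coding. The report does not say which code is used, so variable-byte coding is an assumption chosen for simplicity, and the sample identifiers are made up.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    # Replace each identifier by its distance from the previous one.
    return [cur - prev for prev, cur in zip([0] + doc_ids, doc_ids)]

def from_gaps(gaps):
    # Running sum restores the original identifiers.
    return list(accumulate(gaps))

def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]          # low 7 bits first
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80            # high bit marks the final byte of a number
        out.extend(reversed(chunk)) # emit most significant 7-bit group first
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:             # final byte: finish the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers

postings = [824, 829, 215406]       # made-up sorted record identifiers
encoded = vbyte_encode(to_gaps(postings))
print(len(encoded))                 # 6 bytes
print(from_gaps(vbyte_decode(encoded)) == postings)  # True
```

Small gaps fit in a single byte, so frequent terms, whose lists are dense, benefit the most.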

There are several widely held beliefs about inverted files that are either fallacious or incorrect
once compression of index entries is taken into account:

- The assumption that sorting of inverted lists during query evaluation is an unacceptable
cost.
- The assumption that a random disk access will be required for each record identifier for
each term, as if inverted lists were stored as a linked list on disk.
- The assumption that, if the vocabulary is stored on disk, log N accesses are required to
fetch an inverted list, where N is variously the number of documents in the collection or
the number of distinct terms in the collection.

Example: Say there are three documents.

Doc 1: Milk is nutritious.
Doc 2: Bread and milk tastes good.
Doc 3: Brown bread is better.

After stop-word elimination and stemming, the inverted index looks like:

Term        Documents containing the term
Better      3
Bread       2, 3
Brown       3
Good        2
Milk        1, 2
Nutritious  1
Taste       2
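
The same example can be reproduced with a small sketch that builds both parts of an inverted file: the vocabulary (term, number of documents, total frequency) and the posting file. The stop-word set and the one-entry stemming table are hypothetical stand-ins for a real stop list and stemmer.

```python
STOP_WORDS = {"is", "and"}          # assumed stop list for this example only
STEMS = {"tastes": "taste"}         # hypothetical stand-in for a real stemmer

docs = {
    1: "Milk is nutritious.",
    2: "Bread and milk tastes good.",
    3: "Brown bread is better.",
}

def terms(text):
    words = text.lower().replace(".", "").split()
    return [STEMS.get(w, w) for w in words if w not in STOP_WORDS]

postings = {}     # posting file: term -> sorted list of document IDs
total_freq = {}   # total number of occurrences of each term
for doc_id in sorted(docs):
    for term in terms(docs[doc_id]):
        plist = postings.setdefault(term, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
        total_freq[term] = total_freq.get(term, 0) + 1

# Vocabulary file: term -> (number of documents, total frequency), in sorted order.
vocabulary = {t: (len(postings[t]), total_freq[t]) for t in sorted(postings)}

for term, (doc_count, freq) in vocabulary.items():
    print(term, doc_count, freq, postings[term])
```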

2.2. Tries and suffix trees

2.2.1. Tries
In computer science, a trie, also called a digital tree and sometimes a radix tree or prefix
tree (as it can be searched by prefixes), is a kind of search tree: an ordered tree data structure
that is used to store a dynamic set or associative array where the keys are usually strings.

A trie is a tree structure like a binary tree, except that a node can have one child for each
symbol of the alphabet rather than at most two. It stores the data so that keys with a common
prefix share a path from the root, which makes retrieval fast. The name "trie" comes from the
word retrieval.

2.2.1.1. What are the usages or applications of the trie data structure?
1. Dictionary suggestions or auto-complete dictionary

Auto-suggestion of words while searching for anything in a dictionary is very common. If we
search for the word “tiny”, it suggests words starting with the same characters, such as “tine”,
“tin”, and “tinny”.

2. Searching for a contact in a mobile contact list or phone directory

Retrieving data stored in a trie is very fast, so it is best suited to applications where
retrieval is performed frequently, such as a phone directory, where contact search is used
often.

A prefix tree, or trie, is a tree whose nodes do not hold whole keys but partial keys. For
example, in a prefix tree that stores strings, each node holds one character of a string. In a
prefix tree that stores arrays, each node holds one element of an array. The elements are
ordered from the root.

Prefix trees are good for looking up keys with a particular prefix.

Example: Represent the following map with Trie:

Key Value
Instant 1
Internal 2
Internet 3
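
A minimal sketch of such a map, assuming a dictionary of children per node and lower-cased keys; the class and method names are illustrative, not taken from the report.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # one child per next character
        self.is_key = False  # True if a stored key ends at this node
        self.value = None

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True
        node.value = value

    def _find(self, prefix):
        # Follow the path spelled by `prefix`; None if it leaves the trie.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def get(self, key):
        node = self._find(key)
        return node.value if node is not None and node.is_key else None

    def keys_with_prefix(self, prefix):
        # Collect every stored key below the node reached by `prefix`.
        start = self._find(prefix)
        results = []
        stack = [(start, prefix)] if start is not None else []
        while stack:
            node, path = stack.pop()
            if node.is_key:
                results.append(path)
            for ch, child in node.children.items():
                stack.append((child, path + ch))
        return sorted(results)

trie = Trie()
for key, value in [("instant", 1), ("internal", 2), ("internet", 3)]:
    trie.insert(key, value)

print(trie.get("internet"))           # 3
print(trie.keys_with_prefix("inte"))  # ['internal', 'internet']
```

The three keys share the nodes for "in", and "internal" and "internet" share the nodes for "intern", which is what makes prefix lookups cheap.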

There are two categories of tries:

A. Non-compact tries: every edge of the underlying tree represents a single symbol of the
alphabet.
B. Compact tries: unary nodes that lead to leaves are trimmed (collapsed).

Example: tries representing the set of strings aeef, ad, bbfe, bbfg, c.

[Figure: non-compact trie and compact trie for the strings above]

2.2.2. Suffix Trees

A suffix trie is an ordinary trie in which the input strings are all possible suffixes of a text;
a suffix tree is its compressed form.

The suffix tree, also referred to as a position tree, is another variation of the trie. As its
name suggests, this kind of trie is built from the suffixes of a given string. It is a static
structure that preprocesses a large string S to allow faster matching of any substring of S.

It stores the suffixes of a string as its keys and the positions of the suffixes in the string as
its values.

- To build the suffix trie, we use these positions (indices) instead of the actual suffix strings.

Example: Suffix tree

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:

- $
- b$
- ab$
- bab$
- abab$
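
To make the idea concrete, here is a small sketch of the uncompressed suffix trie for abab$ (not the compressed suffix tree), built from nested dictionaries. Storing the 1-based start position under the key "pos" is an assumption made for this illustration.

```python
def build_suffix_trie(s):
    # Nested dicts: each single-character key is an edge; the key "pos"
    # (never a single character) stores the 1-based start of the suffix.
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root

def contains(root, pattern):
    # A pattern occurs in s exactly when it is a prefix of some suffix.
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

def suffix_position(root, suffix):
    node = root
    for ch in suffix:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("pos")

trie = build_suffix_trie("abab$")
print(contains(trie, "bab"))         # True
print(contains(trie, "bb"))          # False
print(suffix_position(trie, "ab$"))  # 3
```

Building the trie this way takes time quadratic in the length of the string; real suffix trees are built with more careful algorithms.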

Example 2: banana, with character positions numbered 1 to 6.

[Figure: suffix trie, suffix tree, and suffix array for "banana"]
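
The suffix array for this example can be computed with a short sketch: the 1-based start positions of all suffixes, sorted by the suffixes they point to. The naive sort and the `find` helper below are illustrative assumptions, not the construction used in the report.

```python
def build_suffix_array(s):
    # 1-based start positions, ordered by the suffix starting at each position.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def find(s, sa, pattern):
    # Binary search for the block of suffixes that start with `pattern`.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid] - 1:][:m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if s[sa[mid] - 1:][:m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(sa)                     # [6, 4, 2, 1, 5, 3]
print(find(text, sa, "ana"))  # [2, 4]
```

The array holds only positions, so it needs far less space than a trie over the same suffixes.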

2.3. Sequential Files
A sequential file is the most primitive file structure. It has no vocabulary (unique list of
words) and no linking pointers. The records are generally arranged serially, one after another,
in lexicographic order on the value of some key field.

- A particular attribute is chosen as the primary key, whose value determines the order of
the records.
- When the primary key fails to discriminate among records (the attribute value is the same
for a large number of records), a secondary key is chosen to order them.
- There is no directory and no linking pointer.

Its main advantages:
- Easy to implement.
- Fast access to the next record.

Its disadvantages:
- No weights are attached to terms; individual words are treated independently.
- Random access is slow: since similar terms are indexed individually, we need to find all
terms that match the query.
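
A minimal sketch of a sequential file, assuming made-up records of the form (surname, first name, record ID) ordered on a primary and a secondary key. The `scan` helper shows why random access is slow: a lookup reads records one after another from the start.

```python
# Hypothetical records: (surname, first_name, record_id).
records = [
    ("smith", "john", 3),
    ("adams", "zoe", 1),
    ("smith", "anna", 2),
]

# Order on the primary key (surname); the secondary key (first name)
# breaks ties when the primary key fails to discriminate.
sequential_file = sorted(records, key=lambda r: (r[0], r[1]))

def scan(file, primary, secondary):
    # Read records in order; stop early once the keys pass the target.
    target = (primary, secondary)
    reads = 0
    for record in file:
        reads += 1
        key = (record[0], record[1])
        if key == target:
            return record, reads
        if key > target:
            break
    return None, reads

print(sequential_file)
print(scan(sequential_file, "smith", "john"))  # (('smith', 'john', 3), 3)
print(scan(sequential_file, "brown", "al"))    # (None, 2)
```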

2.4. Signature Files

Signature files contain signatures, that is, bit patterns that represent documents. There are
various ways of constructing signatures. In one common method, for example, documents are split
into logical blocks, each containing a fixed number of distinct significant (non-stoplist)
words. Each word in the block is hashed to give a signature: a bit pattern with some of the bits
set to 1. The signatures of the words in a block are OR-ed together to give the block signature.
The block signatures are then concatenated to produce the document signature.
Searching is done by comparing the signatures of queries with document signatures.

The main idea is to divide the document into blocks of fixed size and to assign each block
a signature (also of fixed size), which is used to search the document for the queried pattern.

Consider:

- H(information) = 010001
- H(text) = 010010
- H(data) = 110000
- H(retrieval) = 100010
- The block signatures of a document D containing the text “textual retrieval and information
retrieval” (after removing stop words and stemming), for a block size of two terms, would
be:
  - B1D = H(text) OR H(retrieval) = 110010 and
  - B2D = H(information) OR H(retrieval) = 110011

To search for a given term, we check whether the term’s bit string is “inside” a block
signature, that is, whether every bit set in the term’s signature is also set in the block
signature.
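
The example can be checked with a short sketch. The hash values are the ones given above; the function names are illustrative assumptions.

```python
# Word signatures from the example above, written as 6-bit integers.
H = {
    "information": 0b010001,
    "text": 0b010010,
    "data": 0b110000,
    "retrieval": 0b100010,
}

def block_signatures(terms, block_size=2):
    # OR together the signatures of the terms in each block.
    signatures = []
    for i in range(0, len(terms), block_size):
        sig = 0
        for term in terms[i:i + block_size]:
            sig |= H[term]
        signatures.append(sig)
    return signatures

def may_contain(block_sig, term):
    # True if every bit of the term's signature is set in the block signature.
    return (block_sig & H[term]) == H[term]

# "textual retrieval and information retrieval" after stop-word removal and stemming.
doc_terms = ["text", "retrieval", "information", "retrieval"]
b1, b2 = block_signatures(doc_terms)
print(format(b1, "06b"), format(b2, "06b"))  # 110010 110011
print(may_contain(b1, "information"))        # False
print(may_contain(b1, "data"))               # True, although "data" is not in the block
```

The last line is a false drop: the bits of H(data) happen to be covered by the other words in the block, so a signature match must still be verified against the text.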

2.5. Flat files

Though it is possible to keep file structures in main memory, in practice IR databases
are usually stored on disk because of their size. Using a flat file approach, one or more
documents are stored in a file, usually as ASCII or EBCDIC text. Flat file searching is usually
done via pattern matching. On UNIX, for example, one can store a document collection one document
per file in a UNIX directory, and search it using pattern searching tools such as grep or awk.
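
A grep-like search over a flat-file collection can be sketched as follows. To keep the sketch self-contained, the collection is an in-memory dictionary of made-up file names and contents rather than real files on disk; that substitution is an assumption.

```python
import re

# Hypothetical collection: file name -> file contents (one document per file).
collection = {
    "doc1.txt": "Milk is nutritious.\nDrink it daily.",
    "doc2.txt": "Bread and milk taste good.",
    "doc3.txt": "Brown bread is better.",
}

def grep(pattern, collection):
    # Scan every line of every document and report the lines that match.
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name in sorted(collection):
        for line_no, line in enumerate(collection[name].splitlines(), start=1):
            if regex.search(line):
                hits.append((name, line_no, line))
    return hits

for hit in grep("milk", collection):
    print(hit)
```

Every query rereads the whole collection, which is why flat files only suit small collections.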

2.6. PAT Trees

PAT trees are Patricia trees constructed over all semi-infinite strings (sistrings) in a text. If
a document collection is viewed as a sequentially numbered array of characters, a sistring is a
subsequence of characters from the array starting at a given point and extending an arbitrary
distance to the right. A Patricia tree is a digital tree where the individual bits of the keys
are used to decide branching.

3. Data Structure

A data structure is a way of storing and organizing data on a computer so that it can be
accessed and updated efficiently.

Classification of Data Structure:

3.1. Linear data structure: A data structure in which data elements are arranged
sequentially or linearly, where each element is attached to its previous and next adjacent
elements, is called a linear data structure.

- Examples of linear data structures are array, stack, queue, linked list, etc.
A. Static data structure: A static data structure has a fixed memory size. It is easier to
access the elements in a static data structure.

- An example of this data structure is an array.

B. Dynamic data structure: In a dynamic data structure, the size is not fixed. It
can grow or shrink at runtime, which can make it more efficient in terms of
memory (space) usage.

- Examples of this data structure are queue, stack, etc.

3.2. Non-linear data structure: Data structures in which data elements are not placed
sequentially or linearly are called non-linear data structures. In a non-linear data
structure, we cannot traverse all the elements in a single run.

- Examples of non-linear data structures are trees and graphs.

4. Data structures for Postings Lists

A posting list is a data structure that maintains the list of documents that contain a particular
term. Generally, a dictionary of terms is built, and then for each term in the dictionary a
posting list is formed containing the list of documents that contain that term. To
traverse a posting list faster, another structure can be added: a skip pointer, which lets a
traversal jump ahead over several postings. A pointer is generally a variable that holds the
address of a data item.
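
The sketch below shows how skip pointers can speed up the intersection of two sorted posting lists. Placing a skip pointer every √n postings is a common heuristic and an assumption here; the report does not specify the spacing, and the sample lists are made up.

```python
import math

def intersect_with_skips(p1, p2):
    # Assumed layout: a skip pointer at every index that is a multiple of
    # about sqrt(len(list)), pointing that many postings ahead.
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if it does not jump past p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))  # [2, 8]
```

A skip is taken only when its target is still less than or equal to the current posting in the other list, so no common document can be jumped over.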

4.1. Singly linked list

- Allows cheap insertion of documents into postings lists (e.g., when re-crawling)

- Naturally extends to skip lists for faster access

4.2. Variable length array

- Better in terms of space requirements

- Also better in terms of time requirements if memory caches are used, as they use
contiguous memory
4.3. Hybrid scheme

- A linked list of variable-length arrays for each term.

- Posting lists are written to disk as contiguous blocks without explicit pointers.

- Minimizes the size of postings lists and the number of disk seeks.
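
One way to picture the hybrid scheme is a linked list whose nodes are small arrays of document IDs. The sketch below is an in-memory illustration only; the class names and the fixed block capacity are assumptions, and nothing is written to disk.

```python
class Block:
    def __init__(self):
        self.doc_ids = []  # contiguous array of postings inside this block
        self.next = None   # link to the next block

class BlockedPostingList:
    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity
        self.head = self.tail = Block()

    def append(self, doc_id):
        # Start a new block only when the last one is full.
        if len(self.tail.doc_ids) == self.block_capacity:
            new_block = Block()
            self.tail.next = new_block
            self.tail = new_block
        self.tail.doc_ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.doc_ids
            block = block.next

    def block_count(self):
        count, block = 0, self.head
        while block is not None:
            count += 1
            block = block.next
        return count

plist = BlockedPostingList(block_capacity=4)
for doc_id in range(1, 11):
    plist.append(doc_id)

print(list(plist))          # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(plist.block_count())  # 3
```

Ten postings need only two links instead of nine, while appending a new document stays cheap.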
