Optimizing Document Retrieval with Trie

This document describes the trie data structure and how it can be used to optimize document comparison. A trie is a tree-based data structure that stores keys (e.g. words) where each node represents a key character. It allows for efficient word insertion and searching in O(m) time, where m is the word length, independent of the file size. This improves upon standard substring matching which takes O(n*k) time, where n is the file size and k is the number of words to search. However, tries use a significant amount of memory due to node allocation.


TRIE DATA STRUCTURE

Conference Paper · September 2015


1 author:

Pallavraj Sahoo
VIT University

All content following this page was uploaded by Pallavraj Sahoo on 12 September 2015.


Introduction
This paper describes the trie, a data structure that puts the tree to efficient use. Given the huge
volume of documents in use around the world, a document is easier to retrieve when the collection is
properly organised, so it is important to classify the data into categories efficiently. A
documentation system today has to handle millions of documents at once, and doing this work manually
can introduce many errors: each file has to be compared by hand with every other file in the database
to check for similarity. This task is time-consuming and error-prone.

This paper proposes a solution to this problem: the trie data structure. A trie is based on the idea
of a tree; the simple tree data structure is augmented to make a trie. Every node of the tree
contains a value member and pointers to the children of the node.
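
The node described above can be sketched as follows. This is only an illustrative Python sketch, not the author's original representation: the names TrieNode, ALPHABET_SIZE, value and children are assumptions, and None stands in for a NULL pointer.

```python
ALPHABET_SIZE = 26  # assumed: one slot per lowercase English letter


class TrieNode:
    """One trie node: an integer value plus 26 child references."""

    def __init__(self):
        self.value = 0                          # set to 1 when a word ends at this node
        self.children = [None] * ALPHABET_SIZE  # None plays the role of a NULL pointer


root = TrieNode()
print(len(root.children), root.value)  # 26 0
```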

Each node holds 26 child pointers. These self-referential pointers represent the 26 letters of the
English alphabet. The set can be extended to upper-case letters if the application requires it.
The following sections describe how the data structure operates.

WORKING:
Like every other tree, the trie has a root.
The root consists of 26 pointers corresponding to the characters of the English alphabet,
indexed as 'a'=0, 'b'=1, 'c'=2, and so on.
These pointers are initialised to NULL.
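
The letter-to-index mapping can be sketched as below. The function name char_index and the decision to reject characters outside 'a' to 'z' are assumptions made for illustration; the paper only states the mapping itself.

```python
def char_index(ch):
    """Map 'a'..'z' to 0..25; reject anything else (an assumed policy)."""
    index = ord(ch) - ord("a")
    if not 0 <= index < 26:
        raise ValueError(f"unsupported character: {ch!r}")
    return index


# The root starts with all 26 slots empty (None stands in for NULL).
root_children = [None] * 26

print(char_index("a"), char_index("b"), char_index("c"), char_index("z"))  # 0 1 2 25
```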

INSERTING:
1. The first task is to extract the file content word by word.
This can be done in O(n) time, where n is the number of characters in the file.
2. The next step is to start from the root and follow the self-referential pointers according to the
characters of the word until a NULL pointer is reached.
3. When a NULL pointer is reached, we allocate memory for that pointer and carry on until the whole
word fits into the trie.
4. When the last character is reached, we set the value of the last node to 1.
This marks the end of the word.
5. This is continued until every word in the file is inserted into the trie.
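
The insertion steps above can be sketched as follows. This is a minimal Python sketch under stated assumptions: the names TrieNode, extract_words, insert and load are hypothetical, words are lowercased, and anything other than the letters a to z is treated as a word separator.

```python
import re


class TrieNode:
    def __init__(self):
        self.value = 0                 # 1 marks the end of a word
        self.children = [None] * 26    # None stands in for NULL


def extract_words(text):
    """Step 1: pull out lowercase words in a single pass over the text."""
    return re.findall(r"[a-z]+", text.lower())


def insert(root, word):
    """Steps 2-4: walk from the root, allocating a node wherever a slot is empty."""
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:   # reached a NULL pointer
            node.children[i] = TrieNode()
        node = node.children[i]
    node.value = 1                     # mark the end of the word


def load(text):
    """Step 5: insert every word of the file into one trie."""
    root = TrieNode()
    for word in extract_words(text):
        insert(root, word)
    return root
```

Calling load on the full file text returns the root of a trie that holds every word of the file.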

After the file is completely loaded into the trie comes the part of searching and comparing.
SEARCHING:
1. Searching follows the same traversal as inserting; the only difference is that no new memory is
allocated.
2. If a NULL is reached while travelling from the root before the extracted word has been completely
processed, the word does not exist in the loaded file.
3. If the word is processed completely and its last character leads to a node whose value attribute
is 1, the word is present in the file.
4. If the word's last character does not lead to a node whose value attribute is 1, the word does
not exist in the file.
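
The search rules above can be sketched as follows. As before, this is an illustrative Python sketch rather than the author's code: TrieNode, insert and search are assumed names, and the words are assumed to contain only the letters a to z.

```python
class TrieNode:
    def __init__(self):
        self.value = 0
        self.children = [None] * 26


def insert(root, word):
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.value = 1


def search(root, word):
    """Same walk as insert, but nothing is allocated."""
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord("a")]
        if node is None:        # rule 2: NULL reached before the word ended
            return False
    return node.value == 1      # rules 3 and 4: present only if marked as a word end


root = TrieNode()
for w in ("car", "card", "dog"):
    insert(root, w)

print(search(root, "card"))  # True
print(search(root, "ca"))    # False: a prefix only, value is not 1
print(search(root, "cat"))   # False: NULL reached at 't'
```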

HOW TO USE THIS TO OPTIMIZE THE DOCUMENT DISTANCE PROBLEM:

-THE DOCUMENT DISTANCE PROBLEM-
The document distance problem asks how similar two documents, usually text files, are. The idea of
document distance is as follows:
Take the set of distinct words that occur in either file, and let N be its size. Represent the first
file by an N-dimensional array A[N] holding the count of each word in that file, and the second file
by an array B[N] built the same way, so that the ith element of A corresponds to the same word as
the ith element of B (a word missing from a file has count 0).
Computing the dot product of the two arrays and dividing it by the product of their magnitudes gives
a value between 0 and 1, with 0 meaning the files share no words and 1 meaning the files have the
same relative word counts. This value is the cosine of the angle between the two arrays.
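
The computation can be sketched as follows. This is a minimal Python sketch that uses the standard library Counter rather than a trie, purely to show the arithmetic; the function name similarity and the choice to return 0.0 for an empty file are assumptions.

```python
import math
from collections import Counter


def similarity(words_a, words_b):
    """Dot product of the two count arrays divided by the product of their magnitudes."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    vocabulary = sorted(set(counts_a) | set(counts_b))  # shared word order for A and B
    a = [counts_a[w] for w in vocabulary]                # missing words count as 0
    b = [counts_b[w] for w in vocabulary]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:                       # assumed convention for empty input
        return 0.0
    return dot / (norm_a * norm_b)


print(round(similarity("the cat sat".split(), "the cat ran".split()), 3))  # 0.667
print(similarity(["cat"], ["dog"]))                                        # 0.0
```

In the first call the two files share "the" and "cat", so the dot product is 2 and each magnitude is the square root of 3, giving 2/3.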

OPTIMISATION
The problem requires finding the count of each word in file 1 and then checking which words of
file 2 match words in file 1.
Let us look at the time taken by each of these operations.
-inserting-
Since insertion places a word into the trie character by character, it requires O(m) time, where
m is the word length.
-searching-
Searching takes the same time as insertion, i.e. O(m), because both algorithms perform the same
traversal.
This time is independent of the size of the original file.
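
One way the two operations could be combined for document distance is sketched below. It rests on an assumption that goes beyond the text: the node stores a word count instead of the 0/1 end-of-word value, so that a lookup returns how often the word occurs in file 1. All names (TrieNode, words, add, count_of, dot_product) are hypothetical.

```python
import re


class TrieNode:
    def __init__(self):
        self.count = 0                 # assumed variant: occurrences instead of a 0/1 flag
        self.children = [None] * 26


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def add(root, word):
    """O(m) insertion that increments a count at the word's last node."""
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.count += 1


def count_of(root, word):
    """O(m) lookup; returns 0 when the word is absent from file 1."""
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord("a")]
        if node is None:
            return 0
    return node.count


def dot_product(text_1, text_2):
    """Load file 1 into the trie, then add up file-1 counts for every word of file 2.

    Summing over each occurrence in file 2 equals the sum over distinct words
    of (count in file 2) * (count in file 1), which is the dot product.
    """
    root = TrieNode()
    for w in words(text_1):
        add(root, w)
    return sum(count_of(root, w) for w in words(text_2))


print(dot_product("the cat and the dog", "the the bird"))  # 4
```

Here "the" occurs twice in each file and "bird" is absent from file 1, so the dot product is 2 * 2 + 1 * 0 = 4.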

COMPARING WITH SUBSTRING MATCHING

The usual method for string matching is substring matching with O(n) methods, where n is the
size of the original file.
If there are k words in the second file, the time complexity shoots up to O(n*k), as each
word has to be searched for separately.
The trie gives a faster result, as the length of a word is much smaller than the size of the file:
once file 1 has been loaded into the trie, each of the k words is looked up in O(m) time.
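
To make the gap concrete, the two operation counts can be compared for some made-up sizes. The figures below (a file of one million characters, ten thousand query words, ten characters per word) are hypothetical and chosen only for illustration; the trie cost assumes one O(n) pass to load file 1 plus k lookups of O(m) each.

```python
def substring_cost(n, k):
    """k separate O(n) scans of the file."""
    return n * k


def trie_cost(n, k, m):
    """One O(n) load of the file, then k lookups of O(m) each."""
    return n + k * m


n, k, m = 1_000_000, 10_000, 10   # hypothetical sizes
print(substring_cost(n, k))       # 10000000000
print(trie_cost(n, k, m))         # 1100000
```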

WHAT ARE ITS DISADVANTAGES?

1. Every time a node is created, the allocation takes up sizeof(int) + 26*sizeof(trie pointer) bytes.
This is a big problem if the number of words in the file is large.
2. Dynamic memory allocation has a limit, on the order of 2 GB on a 32-bit computer and 8 TB on a
64-bit computer, depending on the operating system.
3. The software would therefore take up a lot of memory.
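
The per-node cost in point 1 can be worked out as follows. The sizes used here (a 4-byte int, 8-byte pointers on a 64-bit machine, 4-byte pointers on a 32-bit machine) are typical values assumed for illustration, and the estimate ignores any alignment padding or allocator overhead.

```python
def node_bytes(int_size=4, pointer_size=8, fanout=26):
    """sizeof(int) + 26 * sizeof(pointer), ignoring padding and allocator overhead."""
    return int_size + fanout * pointer_size


print(node_bytes())                # 212  (assumed 64-bit sizes)
print(node_bytes(pointer_size=4))  # 108  (assumed 32-bit sizes)
print(node_bytes() * 1_000_000)    # 212000000 bytes for a million nodes
```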