TRIE DATA STRUCTURE
Conference Paper · September 2015
1 author:
Pallavraj Sahoo
VIT University
All content following this page was uploaded by Pallavraj Sahoo on 12 September 2015.
Introduction
This abstract describes an efficient use of the tree data structure. With the large amount of
documentation in use around the world, a document is easier to retrieve if it is organised properly, so
it is important to classify data into categories efficiently. Documentation systems nowadays have to
handle millions of documents at once, and doing this work manually can introduce many errors: a file
has to be compared by hand with every other file in the database to check for similarity. This task is
time-consuming and error-prone.
Our abstract tries to find a solution to this problem, and for this we propose the Trie data
structure. The Trie is based on the idea of a tree: the simple tree data structure is augmented
to make a Trie. Every node of the tree usually contains a value member and pointers to the children
of the node.
In the representation used here, each node holds 26 child pointers, which already hints at what the
data structure is doing. These self-referential pointers represent the 26 letters
of the English alphabet. The scheme can be extended to upper-case letters if the application needs them.
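As a minimal sketch of such a node, assuming lowercase English words only: the class name `TrieNode`, the constant `ALPHABET_SIZE` and the field names are illustrative choices, and Python's `None` stands in for the NULL pointer.

```python
ALPHABET_SIZE = 26  # assumption: one slot per lowercase English letter


class TrieNode:
    """One trie node: a value member plus 26 child references."""

    def __init__(self):
        # 0 means "no word ends here"; 1 marks the end of a word.
        self.value = 0
        # One child slot per letter; None plays the role of NULL.
        self.children = [None] * ALPHABET_SIZE


node = TrieNode()
```

Supporting upper-case letters would mean enlarging `ALPHABET_SIZE` (for example to 52) and changing how a character is mapped to a slot.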
The following sections describe how this data structure operates.
WORKING:
Like every other tree, this tree starts with a root.
The root consists of 26 pointers corresponding to the characters of the English alphabet,
such as 'a'=0, 'b'=1, 'c'=2, and so on.
These pointers are initialised to NULL.
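The letter-to-slot mapping can be sketched as follows. The helper name `char_index` is hypothetical, and rejecting characters outside 'a' to 'z' is an assumption made here for safety rather than something the text specifies.

```python
def char_index(ch):
    """Map 'a'..'z' to 0..25, as in 'a'=0, 'b'=1, 'c'=2, and so on."""
    idx = ord(ch) - ord("a")
    if not 0 <= idx < 26:
        # Assumption: anything outside the lowercase alphabet is an error.
        raise ValueError(f"unsupported character: {ch!r}")
    return idx


# The root starts with all 26 slots empty (None stands in for NULL).
root_children = [None] * 26
```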
INSERTING:
1. The first task is to extract the file content word by word.
This can be done in O(n) time, where n is the number of characters in the file.
2. The next step is to start from the root and follow the self-referential pointers according to the
characters of the word until a NULL pointer is reached.
3. When a NULL pointer is reached, we allocate memory
for that pointer and carry on until the whole word fits into the trie.
4. When the last character is reached, we set the value of the last node to 1.
This marks the end of the word.
5. This is continued until every word in the file is inserted into the trie.
Once the file is completely loaded into the trie, searching and comparing can begin.
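The insertion steps can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: `TrieNode`, `extract_words` and `insert` are hypothetical names, text is lower-cased first, and any run of the letters a to z counts as a word (so "don't" splits into "don" and "t").

```python
import re


class TrieNode:
    def __init__(self):
        self.value = 0               # becomes 1 where a word ends
        self.children = [None] * 26  # None stands in for NULL


def extract_words(text):
    """Step 1: pull out the words in one pass over the characters."""
    return re.findall(r"[a-z]+", text.lower())


def insert(root, word):
    """Steps 2 to 4: walk from the root, creating nodes where needed."""
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:     # reached NULL: allocate a node
            node.children[i] = TrieNode()
        node = node.children[i]
    node.value = 1                       # mark the end of the word


# Step 5: repeat for every word in the file.
root = TrieNode()
for w in extract_words("The cat sat. The cart!"):
    insert(root, w)
```

Words that share a prefix, such as "cat" and "cart", share the nodes for "c" and "a", so only the differing suffix allocates new nodes.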
SEARCHING:
1. Searching works the same way as inserting; the only difference is that no new memory needs to be
allocated.
2. If a NULL is reached while travelling from the root before the extracted word is completely
processed, the word does not exist in the loaded file.
3. If the word is processed completely and the last character leads to a node whose
value attribute is 1, the word is present in the file.
4. If the word's last character does not lead to a node with the value attribute set to 1,
the word does not exist in the file.
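A sketch of the search rules above, again with hypothetical names (`TrieNode`, `insert`, `search`) and the assumption that every word contains only the letters 'a' to 'z':

```python
class TrieNode:
    def __init__(self):
        self.value = 0               # 1 where a word ends
        self.children = [None] * 26  # None stands in for NULL


def insert(root, word):
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.value = 1


def search(root, word):
    """Return True only if `word` was inserted as a complete word."""
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord("a")]
        if node is None:        # rule 2: hit NULL before the word ended
            return False
    return node.value == 1      # rules 3 and 4: check the end-of-word mark


root = TrieNode()
for w in ["cat", "cart"]:
    insert(root, w)

found_cat = search(root, "cat")   # True: full word with value 1
found_ca = search(root, "ca")     # False: only a prefix, value is 0
found_dog = search(root, "dog")   # False: NULL reached at 'd'
```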
HOW TO USE THIS TO OPTIMIZE THE DOCUMENT DISTANCE PROBLEM:
-THE DOCUMENT DISTANCE PROBLEM-
The document distance problem is to measure the similarity between two documents, which are
usually text files. The idea of document distance is as follows:
Take the N distinct words that occur across the two files. The count of each word in the first file is
stored in an N-dimensional array A[N], and the count of each word in the second file in an array B[N]
of the same dimension, such that the ith element of A corresponds to the same word as the ith element
of B (a word absent from a file has count 0).
Computing the dot product of the two arrays and dividing it by the product of the magnitudes of those
arrays gives a score between 0 and 1, with 0 meaning the files share no words and 1 meaning they have
the same relative word frequencies.
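Written out, the score is (A . B) / (|A| * |B|). A small sketch of that computation, assuming the files have already been split into word lists; the function name `similarity` is hypothetical, and `collections.Counter` is used here for brevity in place of the explicit arrays A and B:

```python
import math
from collections import Counter


def similarity(words_a, words_b):
    """Dot product of the word-count vectors over the product of magnitudes."""
    a, b = Counter(words_a), Counter(words_b)
    # Words missing from b count as 0, so iterating over a is enough.
    dot = sum(count * b[word] for word, count in a.items())
    sq_a = sum(c * c for c in a.values())
    sq_b = sum(c * c for c in b.values())
    if sq_a == 0 or sq_b == 0:
        return 0.0  # assumption: an empty file is treated as fully different
    return dot / math.sqrt(sq_a * sq_b)


same = similarity(["the", "cat"], ["cat", "the"])
different = similarity(["cat"], ["dog"])
partial = similarity(["the", "cat"], ["the"])
```

Here `same` is 1 (up to floating-point rounding), `different` is 0, and `partial` lies in between.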
OPTIMISATION
The problem requires finding the count of each word in file 1 and then checking which words of
file 2 match the words of file 1.
Let us look at the time taken by each of the operations described above.
-inserting-
Insertion places a word into the trie character by character, so it requires O(m) time, where
m is the word length.
-searching-
Searching takes the same time as insertion, i.e. O(m) time, because both algorithms follow the same
path through the trie.
We see that this time is independent of the original file size.
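One possible way to put this to work for document distance is sketched below. It goes beyond what the text specifies: instead of a single 0/1 value, each node here keeps two counters (one per file), and the names `CountNode`, `add_word` and `trie_similarity` are hypothetical. Each word still costs O(m) steps to record.

```python
import math


class CountNode:
    def __init__(self):
        self.counts = [0, 0]         # assumption: occurrences in file 1 and file 2
        self.children = [None] * 26  # None stands in for NULL


def add_word(root, word, doc):
    """Record one occurrence of `word` in file `doc` (0 or 1) in O(len(word))."""
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:
            node.children[i] = CountNode()
        node = node.children[i]
    node.counts[doc] += 1


def trie_similarity(root):
    """Visit every node once and accumulate the dot product and magnitudes."""
    dot = sq1 = sq2 = 0
    stack = [root]
    while stack:
        node = stack.pop()
        c1, c2 = node.counts
        dot += c1 * c2
        sq1 += c1 * c1
        sq2 += c2 * c2
        stack.extend(child for child in node.children if child is not None)
    if sq1 == 0 or sq2 == 0:
        return 0.0
    return dot / math.sqrt(sq1 * sq2)


root = CountNode()
for w in ["the", "cat", "the"]:
    add_word(root, w, 0)
for w in ["the", "cat"]:
    add_word(root, w, 1)
score = trie_similarity(root)
```

Because matching words from both files end at the same node, the ith-element correspondence between the two count arrays comes for free.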
COMPARING WITH SUBSTRING MATCHING
The usual method for string matching is substring matching, with O(n) methods where n is the
size of the original file.
If there are k words in our second file, the time complexity rises to O(n*k), as each
word has to be searched for and matched separately.
The trie data structure gives a faster result, as the length of a word is much smaller than the size
of the file.
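A rough back-of-the-envelope comparison of the two step counts, with made-up example sizes; the function names are hypothetical, and the figures ignore constant factors as well as the one-off O(n) cost of loading the first file into the trie:

```python
def substring_matching_steps(file_size, num_words):
    """O(n*k): each of the k words is scanned against the whole file."""
    return file_size * num_words


def trie_lookup_steps(word_lengths):
    """Sum of O(m) lookups: one step per character of each searched word."""
    return sum(word_lengths)


# Illustrative numbers only: a 1,000,000-character file and 1,000 words
# of 5 characters each in the second file.
n = 1_000_000
lengths = [5] * 1000
substring_steps = substring_matching_steps(n, len(lengths))
trie_steps = trie_lookup_steps(lengths)
```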
WHAT ARE ITS DISADVANTAGES?
1. Every time a node is created, the allocation takes up sizeof(int) + 26*sizeof(trie pointer) bytes.
This is a big problem if the number of words in the file is large.
2. Dynamic memory allocation is limited, typically to about 2 GB on a 32-bit computer and 8 TB on some
64-bit systems.
3. The software would take up a lot of memory.
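To make the per-node cost concrete, the formula from point 1 can be evaluated for common sizes. The helper name `node_bytes` is hypothetical, the 4-byte int and 4- or 8-byte pointer sizes are assumptions about typical platforms, and a real compiler may add alignment padding on top.

```python
def node_bytes(int_size, pointer_size, fanout=26):
    """sizeof(int) + 26 * sizeof(pointer), before any alignment padding."""
    return int_size + fanout * pointer_size


bytes_32bit = node_bytes(4, 4)   # 4 + 26 * 4 = 108 bytes per node
bytes_64bit = node_bytes(4, 8)   # 4 + 26 * 8 = 212 bytes per node

# A trie with one million nodes on a 64-bit system, in bytes.
million_nodes = 1_000_000 * bytes_64bit
```

At roughly 212 bytes per node, a million nodes already need about 212 MB, which is why memory use dominates for large files.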