Trie Data Structure Implementation
Trie Data Structure Implementation
2. A trie is used to store character strings in encoded form, such that each character is a node of a tree
data structure, and a word is formed by concatenating the characters from the root of the trie to one of its
terminals. One of the advantages of using tries is that insertion and search can be done in time linear on
the size of the string.
The main restriction of this implementation is that it works only for ASCII characters the range [A-Z].
I am doing this mainly to test how well the algorithm perfoms. Real implementations should be able to
handle most character sets, and therefore they need more space to store child nodes at each trie node.
h trie.cpp 2 i ≡
h trie includes 3 i
h trie functions 4 i
3.
h trie includes 3 i ≡
#include "trie.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#define New (t) ( t ∗ ) malloc (sizeof (t))
#define NewV (t, s) ( t ∗ ) malloc (sizeof (t) ∗ (s))
This code is used in section 2.
4. A simple utility function that aborts the program with an error message to stderr.
h trie functions 4 i ≡
void abort (const char ∗msg )
{
fprintf (stderr , msg );
exit (1);
}
See also sections 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, and 19.
This code is used in section 2.
5. A function that checks if a character can be inserted as a child of a trie node. All characters stored are
in uppercase. The function returns an uppercase character, or 0 if the character is not between A and Z.
h trie functions 4 i +≡
char check char (char c)
{
if (’a’ ≤ c ∧ c ≤ ’z’) c −= 32;
if (’A’ ≤ c ∧ c ≤ ’Z’) return c;
return 0;
}
2 TRIE DATA STRUCTURE TRIECODE §6
6. The node initialization function just returns a new trie node, initialized to the character c. Everything
else in the node is NULL.
h trie functions 4 i +≡
static TrieNode ∗trie new node (char c)
{
TrieNode ∗ t = New (TrieNode );
if (¬t) return 0;
t~ data = c;
t~ children = Λ;
t~ n child = 0;
t~ size = 0;
t~ is terminal = 0;
return t;
}
8. Allocates a new array of pointers to child trie node, and initialize it to NULLs.
h trie functions 4 i +≡
static TrieNode ∗∗new child array (int size ){ TrieNode ∗ ∗nodes = NewV ( TrieNode ∗ , size ) ;
if (¬nodes ) return 0;
int i;
for (i = 0; i < size ; ++ i) nodes [i] = Λ;
return nodes ; }
9. searches an array of children of the first type (with 8 characters). Returns a trie node if the character
can be found, NULL otherwise.
h trie functions 4 i +≡
static TrieNode ∗in children8 (TrieNode ∗ t, char c)
{
int i, n = t~ n child ;
for (i = 0; i < n; ++ i)
if (t~ children [i]~ data ≡ c) return t~ children [i];
return 0;
}
§10 TRIECODE TRIE DATA STRUCTURE 3
13. add given nodes (usually from a list of the first type) into a list of the second type
h trie functions 4 i +≡
void add existing nodes26 (TrieNode ∗ t, TrieNode ∗ ∗nodes , int n)
{
assert (t~ size ≡ 26);
int i;
for (i = 0; i < n; ++ i) {
int pos = nodes [i]~ data − ’A’;
assert (¬t~ children [pos ]); /∗ this position must be NULL ∗/
t~ children [pos ] = nodes [i];
}
}
4 TRIE DATA STRUCTURE TRIECODE §14
14. A function that checks if a character c exists as a child of the given trie node. If insert is true, the
value should be inserted as a child if it is not present. modified is an output value, and will be true if there
was any insertion on the child list.
h trie functions 4 i +≡
static TrieNode ∗get or insert child for char (TrieNode ∗ t, char c, int insert , int ∗modified , int
is terminal ){ assert (’A’ ≤ c ∧ c ≤ ’Z’);
TrieNode ∗ node ; h check cases of child list 15 i
15. To reduce memory consumption, each node can have a child list of one of two types: 8-position or
26-position arrays. The first type stores at most 8 characters. They are unordered, so any search needs to
look at all characters currently stored (which on average is 4). The second type of list has an array for all
26 possible characters. The list is in alphabetical order, so it is very easy to find and insert elements
h check cases of child list 15 i ≡
if (t~ size ≡ 0) {
if (¬insert ) return 0; /∗ add memory, return position ∗/
t~ children = new child array (8);
if (¬t~ children ) abort ("cannot add child array");
t~ size = 8;
node = add children8 (t, c);
}
else if (t~ size ≡ 8) { /∗ check if it is there ∗/
node = in children8 (t, c);
if (node ) return node ; /∗ if not, add ∗/
if (¬insert ) return 0;
if (t~ n child < 8) node = add children8 (t, c);
else { /∗ we need more space ∗/
TrieNode ∗ ∗current = t~ children ;
t~ children = new child array (26);
if (¬t~ children ) abort ("cannot add child array");
t~ size = 26;
add existing nodes26 (t, current , 8);
node = add to children26 (t, c);
}
}
else if (t~ size ≡ 26) {
node = in children26 (t, c);
if (node ) return node ;
if (¬insert ) return 0;
node = add to children26 (t, c);
}
else {
abort ("wrong size of child list");
} /∗ we come here only in case the child list was modified ∗/
if (¬node ) abort ("cannot add node");
if (is terminal ) node~ is terminal = 1;
∗modified = 1;
return node ; }
This code is used in section 14.
§16 TRIECODE TRIE DATA STRUCTURE 5
16. we visit the trie recursively, because word lengths are usually small
h trie functions 4 i +≡
static int trie visit rec (TrieNode ∗ t, char ∗word , int len , int insert , int ∗modified )
{
if (len ≡ 0) { /∗ end of recursion ∗/
if (∗modified ) return 2;
return 1;
}
char c = check char (word [0]);
TrieNode ∗ child = get or insert child for char (t, c, insert , modified , len ≡ 1);
if (¬child ) {
if (insert ) abort ("cannot insert word");
else return 0;
}
return trie visit rec (child , word + 1, len − 1, insert , modified );
}
17. searches or inserts a word into a trie. If insert is true, the word will be inserted if it is not already in
the tree. Returns true if the operation is successful.
h trie functions 4 i +≡
int trie visit (Trie ∗ t, char ∗word , int insert )
{
int modified = 0;
int res = trie visit rec (t~ root , word , strlen (word ), insert , &modified );
if (insert ∧ res ≡ 2) t~ n words ++ ;
return res ;
}
6 TRIE DATA STRUCTURE TRIECODE §18
18. Print all words contained in a trie. the algorithm is recursive. If, at some iteration, the is terminal
flag is set, then print the current word. For each of the children of the current node, add the caracter they
represent to the word, and call the function recursively.
h trie functions 4 i +≡
static int trie print rec (TrieNode ∗ t, char ∗word , int pos , int max size )
{
if (pos ≡ max size ) { /∗ check buffer overrun ∗/
word [pos ] = ’\0’;
printf ("word truncated: %s\n", word );
return 0;
}
if (t~ is terminal ) { /∗ print current word ∗/
word [pos ] = ’\0’;
printf ("%s\n", word );
}
int i;
for (i = 0; i < t~ size ; ++ i) { /∗ print all children ∗/
TrieNode ∗ node = t~ children [i];
if (node ) {
word [pos ] = node~ data ;
trie print rec(node , word , pos + 1, max size );
}
}
return 1;
}
#define MAX_SIZE 64
20.
h main.cpp 20 i ≡
h main headers 21 i
h main functions 23 i
21.
h main headers 21 i ≡
#include "trie.h"
#include <stdio.h>
#include <string.h>
This code is used in section 20.
§22 TRIECODE STRING OPERATIONS 7
23. returns true if the string w is a word composed only of the characters from A-Z
h main functions 23 i ≡
int is word (char ∗w, int len )
{
int i;
for (i = 0; i < len ; ++ i)
if (¬check char (w[i])) return 0;
return i;
}
See also sections 24, 25, 26, 27, 28, and 29.
This code is used in section 20.
26.
h main functions 23 i +≡
#define MAX_WSIZE 64
8 STRING OPERATIONS TRIECODE §27
28. reads all words in the given file name, and inserts them on the trie.
h main functions 23 i +≡
int proc file (Trie ∗ t, char ∗file )
{
FILE ∗f = fopen (file , "r");
if (¬f ) abort ("error opening file");
char w[MAX_WSIZE + 1]; /∗ write format string ∗/
char format [6];
sprintf (format , "%%%ds", MAX_WSIZE); /∗ insert each word found in file ∗/
for ( ; ; ) {
if (EOF ≡ fscanf (f , format , w)) break;
char ∗word = adjust word (w);
if (is word (word , strlen (word ))) trie visit (t, word , 1);
}
return 1;
}
29.
h main functions 23 i +≡
int main (int argc , char ∗∗argv )
{
Trie ∗ t = trie new ( );
if (¬t) return 1;
if (argc < 2 ∨ argc ≡ 3 ∧ strcmp (argv [2], "−p")) {
printf ("usage:\n %s inputFile [−p]\n", argv [0]);
return 1;
}
proc file (t, argv [1]);
printf ("number of words: %d\n", t~ n words );
if (argc ≡ 3) trie print (t);
return 0;
}
§30 TRIECODE HEADER FILE 9
30. Header File. This is the header file for the trie module.
h trie.h 30 i ≡
#ifndef _TRIE_H_
#define _TRIE_H_
struct TrieNode {
char data ;
struct TrieNode ∗∗children ;
int n child ;
int size ;
int is terminal ;
};
typedef struct TrieNode TrieNode;
struct Trie {
int n words ;
TrieNode ∗root ;
};
typedef struct Trie Trie;
Trie ∗trie new ( );
int trie visit (Trie ∗t, char ∗word , int insert );
int trie print (Trie ∗t);
char check char (char c);
void abort (const char ∗msg );
#endif
10 INDEX TRIECODE §31
31. Index.
_TRIE_H_: 30. pos : 11, 13, 18.
abort : 4, 11, 15, 16, 28, 30. printf : 18, 19, 29.
add children8 : 10, 15. proc file : 28, 29.
add existing nodes26 : 13, 15. res : 17.
add to children26 : 12, 15. root : 7, 17, 19, 30.
adjust word : 27, 28. search add children26 : 11, 12.
argc : 29. size : 6, 8, 11, 13, 15, 18, 30.
argv : 29. sprintf : 28.
assert : 10, 11, 13, 14. stderr : 4.
c: 5, 6, 9, 10, 11, 12, 14, 16, 24, 25, 30. strcmp : 29.
check char : 5, 16, 23, 30. strlen : 17, 27, 28.
child : 16. t: 30.
children : 6, 9, 10, 11, 13, 15, 18, 30. Trie: 7, 17, 19, 28, 29, 30.
current : 15. Trie : 30.
data : 6, 9, 13, 18, 30. trie new : 7, 29, 30.
EOF: 28. trie new node : 6, 7, 10, 11.
exit : 4. trie print : 19, 29, 30.
e1 : 7. trie print rec: 18, 19.
f : 28. trie visit : 17, 28, 30.
file : 28. trie visit rec : 16, 17.
fopen : 28. TrieNode: 6, 8, 9, 10, 11, 12, 13, 14, 15,
format : 28. 16, 18, 30.
fprintf : 4. TrieNode : 30.
free : 7. w: 23, 27, 28.
fscanf : 28. word : 16, 17, 18, 19, 28, 30.
get or insert child for char : 14, 16.
i: 8, 9, 13, 18, 23, 27.
in children26 : 12, 15.
in children8 : 9, 15.
insert : 11, 14, 15, 16, 17, 30.
is punctuation : 24, 27.
is quote : 25, 27.
is terminal : 6, 14, 15, 18, 30.
is word : 23, 28.
last : 27.
len : 16, 23, 27.
main : 29.
malloc : 3.
MAX_SIZE: 18, 19.
max size : 18.
MAX_WSIZE: 26, 28.
modified : 14, 15, 16, 17.
msg : 4, 30.
n: 9, 13.
n child : 6, 9, 10, 11, 15, 30.
n words : 7, 17, 29, 30.
New : 3, 6, 7.
new child array : 8, 15.
NewV : 3, 8.
node : 11, 14, 15, 18.
nodes : 8, 13.
TRIECODE NAMES OF THE SECTIONS 11
Section Page
Trie Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1
String Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7
Header File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 9
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 10