CSI 2110 Summary, Fall 2012
Analysis of Algorithms
Algorithm - a step by step procedure for solving a problem in a finite amount of time. Analyzing an
algorithm means determining its efficiency.
Primitive operations - low-level computations independent from the programming language that can be
identified in the pseudocode.
Big-Oh - given two functions f(n) and g(n), we say f(n) is O(g(n)) if and only if there are positive
constants c and n₀ such that f(n) ≤ c·g(n) for all n ≥ n₀. It means that an algorithm has the complexity of
AT MOST g(n).
Just replace every term by the highest power of n, add the coefficients, and you have your c (the coefficient) and
g(n) (the function).
Logarithms always in base 2 in this class unless otherwise stated
Always want the lowest possible bound, approximation should be as tight as possible
Drop lower order terms and constant factors
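Worked example (illustrative numbers): f(n) = 3n³ + 2n + 5 is O(n³), since 3n³ + 2n + 5 ≤ 3n³ + 2n³ + 5n³ = 10n³ for all n ≥ 1, i.e. c = 10 and n₀ = 1.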
Big-Omega - f(n) is Ω(g(n)) if there are positive constants c and n₀ such that f(n) ≥ c·g(n) for all n ≥ n₀. It means an algorithm has
the complexity of AT LEAST g(n).
Big-Theta - f(n) is Θ(g(n)) if it is both O(g(n)) and Ω(g(n)). It means the complexity IS EXACTLY g(n).
Extendable Arrays
Growth function is f(N) = 2N (double the array size each time it fills).
Regular push: 1
Special push: cost of creating the new array (2N) + copying the old elements into the new array
(length of old array, N) + 1
Phase: starts at creation of the new array and ends with the last element being pushed. The one pushed
after that is a special push, and starts a new phase.
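A minimal sketch of the doubling strategy described above; the class and method names are illustrative, not from the course code:

```java
// Extendable array that doubles its capacity on a special push.
public class ExtendableArray {
    private Object[] data = new Object[1];
    private int size = 0;

    public void push(Object e) {
        if (size == data.length) {            // special push: grow first
            Object[] bigger = new Object[2 * data.length];
            for (int i = 0; i < size; i++)    // copy old elements: O(n)
                bigger[i] = data[i];
            data = bigger;
        }
        data[size++] = e;                     // regular push: O(1)
    }
}
```

A regular push costs 1; a special push additionally pays for allocating the new array and copying the old elements, which is exactly what the phase analysis above accounts for.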
Complexities
Method             ArrayList  LinkedList  Unsorted Seq  Sorted Seq
size               O(1)       O(1)        O(1)          O(1)
isEmpty            O(1)       O(1)        O(1)          O(1)
get                O(1)       O(n)        -             -
replace            O(1)       O(n)        -             -
insert             O(n)       O(1)        O(1)          O(n)
remove             O(n)       O(1)        O(n)          O(n)
minKey/minElement  -          -           O(n)          O(1)
removeMin          -          -           O(n)          O(1)
Selection Sort
Using an external data structure - insert your elements into the first data structure, find the smallest
one, add it into your second data structure, and repeat. Takes O(n²) time.
In place (Not using an extra data structure) - search through the array, find the smallest, add it to the
front of the array. Repeat until it's sorted.
Complexity: Average and worst case: performs O(n²) operations regardless of input, since the search for
the minimum is always executed in O(n).
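A minimal in-place selection sort sketch in Java (assuming int arrays for simplicity):

```java
// In-place selection sort, ascending order.
public static void selectionSort(int[] a) {
    for (int i = 0; i < a.length - 1; i++) {
        int min = i;
        for (int j = i + 1; j < a.length; j++)       // find the smallest in a[i..]
            if (a[j] < a[min]) min = j;
        int tmp = a[i]; a[i] = a[min]; a[min] = tmp; // move it to the front
    }
}
```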
Insertion Sort
Using an external data structure - insert the elements of the first data structure into the other one,
one at a time, adding each element before or after the existing elements according to size. Then
removeMin repeatedly and add the results back into the old ADT.
In place - Take the second element and switch it with the first if needed. Take the third element and
switch it with the second, and then the first if need be, and so on.
Complexity: Average and worst case is O(n²).
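A minimal in-place insertion sort sketch under the same assumptions:

```java
// In-place insertion sort, ascending order.
public static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i];                    // next element to place
        int j = i - 1;
        while (j >= 0 && a[j] > key) {     // shift larger elements right
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;                    // drop into its sorted position
    }
}
```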
Trees
Graph - consists of vertices and edges.
Tree - a connected graph that contains no cycles.
Root - a node without a parent.
Internal node - a node with at least one child.
External node - a node without any children.
Ancestor - parent, grandparent, great-grandparent, etc.
Descendant - child, grandchild, great-grandchild, etc.
Subtree - a tree consisting of a node and its descendants.
Distance - number of edges between two nodes.
Binary Trees
Each node has at most two children.
Examples: decision trees (yes/no), arithmetic expressions.
Full binary tree - each node is either a leaf, or has two children.
In the book, children are completed with dummy nodes, and all trees are considered full.
Perfect binary tree - a full binary tree with all the leaves at the same level.
Complete binary tree - perfect until level h − 1, then one or more leaves at level h (filled from the left).
Properties of Height
Binary: log₂(n+1) − 1 ≤ h ≤ n − 1
Binary (Full): log₂(n+1) − 1 ≤ h ≤ (n−1)/2
Binary (Complete): h = ⌊log₂ n⌋ (integer part of log₂ n)
Binary (Perfect): h = log₂(n+1) − 1
Heaps
Removal -
Remove the top element
Replace with last key in the heap.
Begin the downheap.
Downheap
Compares the parent with the smallest child.
If the child is smaller, switch the two.
Keep going.
Stops when the key is smaller than or equal to the keys of both its children, or the bottom of the heap is
reached.
Insertion
Add key into the next available position.
Begin upheap.
Upheap
Similar to downheap.
Swap parent-child keys that are out of order.
Regular Heap Construction - We could insert the n items one at a time with a sequence of heap insertions,
taking ∑ᵢ O(log i) = O(n log n) time. But we can do better with bottom-up heap construction, which takes O(n).
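A minimal array-based min-heap sketch illustrating insertion with upheap and removal with downheap; the class and method names are illustrative:

```java
import java.util.ArrayList;

// Array-based min-heap: parent of index i is (i-1)/2, children are 2i+1, 2i+2.
public class MinHeap {
    private final ArrayList<Integer> a = new ArrayList<>(); // a.get(0) is the min

    public void insert(int key) {           // add at next free position, then upheap
        a.add(key);
        int i = a.size() - 1;
        while (i > 0 && a.get((i - 1) / 2) > a.get(i)) {  // parent larger: swap
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }

    public int removeMin() {                // assumes the heap is non-empty
        int min = a.get(0);
        a.set(0, a.get(a.size() - 1));      // replace root with last key
        a.remove(a.size() - 1);
        int i = 0;
        while (2 * i + 1 < a.size()) {      // downheap
            int c = 2 * i + 1;                                    // left child
            if (c + 1 < a.size() && a.get(c + 1) < a.get(c)) c++; // smallest child
            if (a.get(i) <= a.get(c)) break;                      // order restored
            swap(i, c);
            i = c;
        }
        return min;
    }

    private void swap(int i, int j) {
        int t = a.get(i); a.set(i, a.get(j)); a.set(j, t);
    }
}
```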
Unordered Sequence
Searching and removing take O(n) time
Inserting takes O(1) time.
Applications to log files (frequent insertions, rare searches and removals)
Binary Search Trees
Complexity:
o Worst case: O(n), where all keys are to the right or to the left.
o Best case: O(log n), where leaves are on the same level or on an adjacent level.
Insertion
Always insert at the bottom of the search tree, where a search for the key ends up, based on the correct order
If you're adding a duplicate key, always insert to the right, not to the left of the tree (see the sketch below).
Deletion
If it’s at the end, you can just remove it.
If it's not at the end, replace it with the next in the inorder traversal.
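A minimal (unbalanced) BST insertion sketch, with duplicates sent to the right as described above; names are illustrative:

```java
// Node of a binary search tree over int keys.
class Node {
    int key; Node left, right;
    Node(int k) { key = k; }
}

class BST {
    Node root;

    void insert(int key) {          // walk down to a null child, attach there
        if (root == null) { root = new Node(key); return; }
        Node cur = root;
        while (true) {
            if (key < cur.key) {
                if (cur.left == null) { cur.left = new Node(key); return; }
                cur = cur.left;
            } else {                // duplicates go to the right
                if (cur.right == null) { cur.right = new Node(key); return; }
                cur = cur.right;
            }
        }
    }
}
```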
AVL Trees
AVL trees are balanced.
They are binary search trees such that, for every internal node v of T, the heights of the children of v can
differ by at most 1.
The height of an AVL tree storing n keys is always O(log n).
Insertion
Balanced - if for every node v, the heights of v's children differ by at most 1.
If the tree becomes unbalanced, we need to rebalance.
Rebalance
o Identify three nodes (grandparent, parent, child) and the 4 subtrees attached to them.
o Find the node whose grandparent is unbalanced. This node (child) is x, the parent is y, and the
grandparent is z.
Choice of x is not unique
o Identify the four subtrees, left to right, as T1, T2, T3, and T4
o The first of x, y, z that comes in the inorder traversal is a, the second is b, and the third is c. Then b
becomes the root of the restructured subtree, with a and c as its children and T1–T4 reattached in order.
Removal
Remove the same way as in a binary search tree. However, this may cause an imbalance.
Complexity
Searching, inserting, and removing are all O(log n); that's what makes AVL trees so nice.
(2,4) Trees
A (2,4) tree is a multi-way search tree with the following properties:
o Node size property: every internal node has at most four children
o Depth property: all external nodes have the same depth
Can't have more than four children or fewer than two children.
Depending on the number of children, an internal node of a (2,4) tree is called a 2-node, 3-node,
or 4-node.
Searching in a (2,4) tree with n items takes O(log n) time
Min number of items: when all internal nodes have 1 key and 2 children: n = 2^h − 1
Maximum number of items: when all internal nodes have three keys and four children:
n = ∑_{i=0}^{h−1} 3·4^i = 4^h − 1
Insertion
Insert similar to a binary search tree. Insert at the end after (binary) searching where it goes.
May cause overflow, since you're only allowed three elements in one node.
o Take the third key, send it up to the parent, and make new nodes out of the first two keys (one
node) and the fourth key (a second node).
Insertion takes O(log n) time
Deletion
Replace the deleted key with the inorder successor
Can cause underflow; we might need to fuse nodes together to fix this. To handle an underflow at
node v with parent u, we consider two cases:
o Case 1: the adjacent siblings of v are 2-nodes.
Fusion operation: we merge v with an adjacent sibling w and move an item from u to
the merged node v'.
After a fusion, the underflow may propagate to the parent u.
o Case 2: an adjacent sibling w of v is a 3-node or a 4-node.
Transfer operation:
1. We move a child from w to v
2. We move an item from u to v
3. We move an item from w to u
After a transfer, no underflow occurs.
Deleting takes O(log n) time. Note that searching, inserting, and deleting all take O(log n) time.
Hash Tables
Problem A / Address Generation: Construction of the hash function h. It needs to be simple to calculate,
and must uniformly distribute the elements in the table. For all keys k, h(k) is the position of k in the
table. This position is an integer. Also, h(k₁) = h(k₂) if k₁ = k₂. Searching for a key and inserting a key
(all dictionary ADT operations) takes O(1) expected time. We have the function h(k) = h₂(h₁(k)), where we
have the two following sub-functions:
1. Hash code map
o They reinterpret a key as an integer. They need to:
i. Give the same result for the same key
ii. Provide a good "spread"
o Polynomial accumulation
We partition the bits of the key into a sequence of components of fixed length,
a₀, a₁, …, a_{n−1}.
Then we evaluate the polynomial p(z) = a₀ + a₁z + a₂z² + … + a_{n−1}z^{n−1} at a fixed value
z, ignoring overflows.
Especially suitable for strings.
o Examples
Memory address (we reinterpret the memory address of the key object as an integer,
this is the default hash code for all Java objects)
Integer cast (Reinterpret the bits of the key as an integer)
Component sum (We partition the bits of the key into components of fixed length and
we sum the components).
2. Compression map
o They take the output of the hash code and compress into the desired range.
o If the result of the hash code was the same, the result of the compression map should be
the same.
o Compression maps should maximize "spread" so as to minimize collisions
o Examples
Division: h₂(y) = y mod N, where N is usually chosen to be a prime number (number
theory).
Multiply Add Divide (MAD): h₂(y) = (ay + b) mod N, where a and b are nonnegative
integers such that a mod N ≠ 0.
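A small sketch combining a polynomial hash code with a MAD compression map; the constants (z = 31, a, b, N) are illustrative choices, not prescribed by the course:

```java
// Polynomial hash code + MAD compression, on String keys.
public class HashDemo {
    static final int N = 109;           // table size, a prime
    static final int A = 33, B = 7;     // MAD parameters, A mod N != 0

    // Polynomial accumulation over the characters of s at fixed z = 31,
    // ignoring overflows (Java int arithmetic wraps around).
    static int hashCode(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);
        return h;
    }

    // MAD compression map: (a*y + b) mod N, folded into [0, N-1].
    static int compress(int y) {
        return Math.floorMod(A * y + B, N);
    }

    public static void main(String[] args) {
        System.out.println(compress(hashCode("stack")));  // a slot in [0, 108]
    }
}
```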
Problem B / Collision Resolution: What strategy do we use if two keys map to the same location i?
Load factor of a Hash table: λ = n/N, where n is the number of elements and N is the number of cells. The
smaller the load factor, the better.
Linear Probing: We have h(k, i) = (h(k) + i) mod N, for i = 0, 1, …, N − 1. Consider a hash table A that uses linear probing.
In order to search, we do the following:
get(k): We start at cell h(k), and we probe consecutive locations until one of the
following occurs:
o An item with key k is found
o An empty cell is found
o N cells have been unsuccessfully probed
To handle insertions and deletions, we introduce a special object, called AVAILABLE, which
replaces deleted elements.
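A minimal linear-probing lookup sketch with an AVAILABLE sentinel; the Entry layout and names are illustrative:

```java
// One cell of an open-addressing hash table.
class Entry {
    int key; Object value;
    Entry(int k, Object v) { key = k; value = v; }
}

class ProbingTable {
    static final Entry AVAILABLE = new Entry(-1, null); // marks deleted cells
    Entry[] table = new Entry[109];                     // N = 109 cells

    Object get(int k) {
        int start = Math.floorMod(Integer.hashCode(k), table.length);
        for (int i = 0; i < table.length; i++) {        // at most N probes
            int cell = (start + i) % table.length;
            Entry e = table[cell];
            if (e == null) return null;                 // empty cell: not present
            if (e != AVAILABLE && e.key == k) return e.value; // found
            // AVAILABLE or a different key: keep probing
        }
        return null;                                    // N cells probed
    }
}
```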
Quadratic Probing: We have h(k, i) = (h(k) + i²) mod N. The problem with this is that i² mod N is harder
to calculate and the probe sequence only visits half of the table, but it's not a big deal. Similar problem: you avoid linear
clustering, but every key that's mapped to the same cell will follow the same probe path. There's a more
distributed form of clustering, which should also be avoided, called secondary clustering.
Bubble Sort
You literally bubble up the largest element. Move from the front to the end, bubbling the largest
value to the end using pairwise comparisons and swapping. Once you go through the whole array,
you start from the first element again and go through it the same way.
In order to detect that an array is already sorted so we don't have to go through it again
unnecessarily, we can use a boolean flag. If no swaps occurred, we know that the collection is
already sorted. The flag needs to be reset after each "bubble up".
Complexity is O(n²).
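A minimal bubble sort sketch with the early-exit flag described above:

```java
// Bubble sort with a flag that stops early once a pass makes no swaps.
public static void bubbleSort(int[] a) {
    boolean swapped = true;
    for (int end = a.length - 1; end > 0 && swapped; end--) {
        swapped = false;                       // reset before each bubble-up pass
        for (int i = 0; i < end; i++) {
            if (a[i] > a[i + 1]) {             // pairwise compare and swap
                int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                swapped = true;
            }
        }
    }                                          // no swaps in a pass: already sorted
}
```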
Recursive Sorts
Divide and Conquer paradigm:
Divide: divide one large problem into two smaller problems of the same type.
Recur: solve the subproblems.
Conquer: combine the two solutions into a solution to the larger problem.
Merge Sort
Merge sort on an input sequence S with n elements consists of three steps. It is based on the divide-and-
conquer paradigm:
Divide: partition S into two sequences S1 and S2 of about n/2 elements each.
Recur: recursively sort S1 and S2.
Conquer: merge S1 and S2 into a unique sorted sequence.
The conquer step merges the two sorted sequences S1 and S2 into one sorted sequence by repeatedly comparing
the lowest element of each of S1 and S2 and inserting whichever is smaller. Merging two sorted sequences,
each with n/2 elements and implemented by means of a doubly linked list, takes O(n) time.
The execution is usually depicted in a binary-tree style.
The height h of the merge-sort tree is O(log n), since at each recursive call we divide the sequence in
half. The overall amount of work done at the nodes of depth i is O(n), since we partition and merge
2^i sequences of size n/2^i, and we make 2^(i+1) recursive calls. From this we get:
Complexity: O(n log n)
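A minimal top-down merge sort sketch on int arrays (copying subarrays rather than using linked lists, for brevity):

```java
import java.util.Arrays;

// Divide-and-conquer merge sort; returns a new sorted array.
public class MergeSortDemo {
    public static int[] mergeSort(int[] s) {
        if (s.length <= 1) return s;                            // base case
        int mid = s.length / 2;                                 // divide
        int[] s1 = mergeSort(Arrays.copyOfRange(s, 0, mid));    // recur
        int[] s2 = mergeSort(Arrays.copyOfRange(s, mid, s.length));
        return merge(s1, s2);                                   // conquer
    }

    // Merge two sorted arrays by repeatedly taking the smaller front element.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }
}
```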
Quick Sort
Quick sort is also based on the divide-and-conquer paradigm.
Divide: pick an element x, called the pivot, and partition S into
o L, elements less than x
o E, elements equal to x
o G, elements greater than x
Recur: sort L and G
Conquer: join L, E, and G.
Pivot can always be chosen randomly, or we can decide always to choose the first element of the array,
or the last.
Complexity:
o Worst case: O(n²)
o Average case: O(n log n)
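A minimal quick sort sketch using the L/E/G three-way partition above, with the first element as the pivot:

```java
import java.util.ArrayList;
import java.util.List;

// Quick sort by three-way partition into L (less), E (equal), G (greater).
public class QuickSortDemo {
    public static List<Integer> quickSort(List<Integer> s) {
        if (s.size() <= 1) return s;
        int pivot = s.get(0);                   // could also be chosen at random
        List<Integer> l = new ArrayList<>(), e = new ArrayList<>(), g = new ArrayList<>();
        for (int x : s) {                       // divide into L, E, G
            if (x < pivot) l.add(x);
            else if (x == pivot) e.add(x);
            else g.add(x);
        }
        List<Integer> out = new ArrayList<>(quickSort(l));  // recur on L and G
        out.addAll(e);                          // conquer: join L, E, G
        out.addAll(quickSort(g));
        return out;
    }
}
```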
Summary of Sorts
Sort       Complexity
Selection  O(n²)
Insertion  O(n²)
Bubble     O(n²)
Merge      O(n log n)
Quick      O(n log n) average, O(n²) worst
Radix-Sort
Crucial point of this whole idea is the stable sorting algorithm, which is a sorting algorithm that
preserves the relative order of items with identical keys.
Question: the best sorts that we have seen so far have been O(n log n), and there is no way to beat
that unless we're under certain circumstances, in which case we can reach O(n).
Bucket Sort
Let S be a sequence of n (key, element) items with keys in the range [0, N − 1]. Bucket sort uses the
keys as indices to an auxiliary array B of sequences (buckets).
Phase 1: Empty sequence S by moving each item (k, o) into its bucket B[k].
Phase 2: For i = 0, …, N − 1, move the items of bucket B[i] to the end of sequence S.
Takes O(n + N) time.
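A minimal bucket sort sketch; for brevity it stores bare integer keys rather than (key, element) pairs:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Bucket sort for integer keys in the range [0, N-1].
public class BucketSortDemo {
    public static void bucketSort(Deque<Integer> s, int N) {
        @SuppressWarnings("unchecked")
        Deque<Integer>[] buckets = new ArrayDeque[N];
        for (int i = 0; i < N; i++) buckets[i] = new ArrayDeque<>();
        while (!s.isEmpty()) {              // Phase 1: empty S into the buckets
            int k = s.removeFirst();
            buckets[k].addLast(k);
        }
        for (int i = 0; i < N; i++)         // Phase 2: concatenate buckets back
            while (!buckets[i].isEmpty())
                s.addLast(buckets[i].removeFirst());
    }
}
```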
Lexicographic Order
A d-tuple is a sequence of d keys (k1, k2, …, kd), where ki is said to be the i-th dimension of the tuple.
The lexicographic order of two d-tuples is recursively defined as
(x1, x2, …, xd) < (y1, y2, …, yd) ⇔ x1 < y1 ∨ (x1 = y1 ∧ (x2, …, xd) < (y2, …, yd)), i.e.
the tuples are compared by the first dimension, then the second, etc.
Lexicographic sort: Let Ci be the comparator that compares two tuples by their i-th dimension, i.e.
Ci says x < y if xi < yi. Let stableSort(S, C) be any stable sorting
algorithm that uses comparator C. Lexicographic sort sorts a sequence of d-tuples in lexicographic
order by executing algorithm stableSort d times, once per dimension.
You do it by starting from dimension d, putting those in order, then moving to dimension d − 1 and putting
them in order, then continuing this way until you reach dimension 1; put that in order, and you're
done.
Lexicographic sort runs in O(d·T(n)), where T(n) is the running time of the stable-sort algorithm.
Radix Sort Variation 1: This one uses bucket-sort as the stable sorting algorithm. Applicable to
tuples where the keys in each dimension are integers in the range [0, N − 1].
This one runs in O(d(n + N)) time.
Radix Sort Variation 2: Consider a sequence of n b-bit integers. We represent each
element as a b-tuple of integers in the range [0, 1] and apply radix sort with N = 2. It sorts Java integers
(32 bits) in linear time.
This one runs in O(bn) time.
Radix Sort Variation 3: The keys are integers in the range [0, N^d − 1]. We represent a key as a d-tuple
of digits in the range [0, N − 1] and apply variation 1, i.e. write it in base N notation.
This means write a number in this notation: (a_{d−1}, …, a1, a0), where the ai are the coefficients of the
following equation:
x = a_{d−1}·N^{d−1} + … + a1·N + a0, where x is the number you're putting in base N.
Examples
o If N = 10, write 345 as (3, 4, 5), since 345 = 3·10² + 4·10 + 5
o If N = 2, write 13 as (1, 1, 0, 1), since 13 = 1·2³ + 1·2² + 0·2 + 1
This one runs in O(d(n + N)) time.
Graph Traversals
Subgraphs
Subgraph: A subgraph S of a graph G is a graph such that the vertices of S are a subset of the vertices of G
and the edges of S are a subset of the edges of G.
Spanning subgraph: A subgraph that contains all the vertices of G.
Connected: When there is a path between every pair of vertices.
Connected component: A maximal connected subgraph of G.
(Free) Tree: An undirected graph T such that T is connected and T has no cycles.
Forest: A collection of trees. The connected components of a forest are trees.
Spanning tree: A spanning subgraph that is a tree.
Spanning forest: A spanning subgraph that is a forest.
Traversal: A traversal of a graph G visits all vertices and edges of G, determines whether G is connected,
computes the connected components of G, computes a spanning forest of G, and builds a spanning tree in a
connected graph.
With a stack: Start at a vertex, add it to your visited set V. Push all its incident edges onto the stack, then
pop the first one. Add the vertex it brings you to to V. Push that vertex's incident edges, then pop, visit
the vertex you reach if it hasn't been visited yet, push its incident edges, etc., until there are no edges left in the stack.
With recursion: DFS(v): mark v visited; for all vertices w that are adjacent to v, if w hasn't been
visited yet, DFS(w).
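A minimal recursive DFS sketch on an adjacency-list graph; the adjacency layout is illustrative:

```java
import java.util.List;

// Recursive DFS where adj.get(v) lists the neighbours of vertex v.
public class DFSDemo {
    static void dfs(List<List<Integer>> adj, boolean[] visited, int v) {
        visited[v] = true;                   // mark v visited
        for (int w : adj.get(v))             // for all vertices adjacent to v
            if (!visited[w])
                dfs(adj, visited, w);        // visit w if not yet visited
    }
}
```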
Properties of DFS:
DFS(G, v) visits all the vertices and edges in the connected component of v
The discovery edges labeled by DFS(G, v) form a spanning tree of the connected component of
v.
Setting/getting a vertex/edge label takes O(1) time.
Each vertex is labeled twice, once as unexplored and once as visited.
Each edge is labeled twice, once as unexplored and once as discovery (or back).
Complexity:
o Adjacency List
Average case is O(n + m)
Worst case is when m = n(n − 1)/2, so O(n²).
o Adjacency Matrix: O(n²)
Applications:
Path Finding: We can specialize the DFS algorithm to find a path between two given vertices u and
z using the template method pattern. We call DFS(G, u) with u as the starting vertex, using a
stack S to keep track of the path between the start vertex and the current vertex. As soon as we
reach z, we return the path, which is the contents of stack S.
Cycle Finding: We can specialize the DFS algorithm to find a simple cycle using the template
method pattern. We use a stack S to keep track of the path between the start vertex and the
current vertex. As soon as a back edge (v, w) is encountered, we return the cycle as the portion
of the stack from the top to vertex w.
Breadth-First Search
BFS is a graph traversal that can be further extended to solve other problems and, on a graph with n
vertices and m edges, takes O(n + m) time (with an adjacency list implementation).
Idea: Visit a vertex, then visit all unvisited vertices that are adjacent to it before visiting a vertex which is
two steps away from it.
With a queue: Add your starting vertex to the visited set. Enqueue its adjacent vertices, then dequeue and
visit that vertex if it hasn't been visited already. Enqueue its adjacent vertices. Dequeue the next one,
visit, enqueue adjacent… and so on until everything has been visited.
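A minimal BFS sketch in the same adjacency-list style; vertices are marked when enqueued so each is visited once:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Queue-based BFS from a start vertex.
public class BFSDemo {
    static void bfs(List<List<Integer>> adj, int start) {
        boolean[] visited = new boolean[adj.size()];
        Queue<Integer> q = new ArrayDeque<>();
        visited[start] = true;
        q.add(start);
        while (!q.isEmpty()) {
            int v = q.remove();              // visit the next vertex
            for (int w : adj.get(v))         // enqueue unvisited neighbours
                if (!visited[w]) {
                    visited[w] = true;
                    q.add(w);
                }
        }
    }
}
```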
Properties of BFS:
Notation: G_s is the connected component of s.
BFS(G, s) visits all the vertices and edges of G_s
The discovery edges labeled by BFS(G, s) form a spanning tree T_s of G_s
For each vertex v in level L_i, the path of T_s from s to v has i edges, and every path from s to v in
G_s has at least i edges.
Setting/getting a vertex/edge label takes O(1) time.
Each vertex is labeled twice (once as unexplored, once as visited).
Each edge is labeled twice (once as unexplored, once as discovery or cross)
Runs in O(n + m) time given the graph is represented by an adjacency list.
Applications: Using the template method pattern, we can specialize the BFS traversal of a graph G to
solve the following problems in O(n+m) time:
Compute connected components of G
Compute spanning forest of G
Find a simple cycle in G or report that G is a forest
Given two vertices of G, find a path between them with the minimum number of edges, or report
that no such path exists.
DFS vs BFS
Application                                     DFS        BFS
Spanning forest                                  X          X
Connected components                             X          X
Paths                                            X          X
Cycles                                           X          X
Shortest paths                                              X
Biconnected components (if the removal of any    X
single vertex, and all edges incident on that
vertex, cannot disconnect the graph)
Edges that lead to an already visited vertex     Back edge  Cross edge
Shortest Path
A shortest path is a path of minimum total edge weight between the starting vertex and another vertex.
Properties:
A subpath of a shortest path is itself a shortest path.
There is a tree of shortest paths from a start vertex to all other vertices.
Dijkstra's Algorithm
The distance of a vertex v from a vertex s is the length of a shortest path between s and v.
Dijkstra's algorithm computes the distances of all the vertices from a given start vertex s.
Assumptions: the graph is connected, the edges are undirected, and the edge weights are nonnegative.
We grow a cloud of vertices, beginning with s and eventually covering all the vertices. At each
vertex v, we store d(v), which is the best distance of v from s in the subgraph consisting of the
cloud and its adjacent vertices.
At each step, we add to the cloud the vertex u outside the cloud with the smallest distance label,
and we update the labels of the vertices adjacent to u, i.e. we change the labels if there's a better
way to get to those vertices through the newly augmented cloud.
We use a priority queue to store the vertices not in the cloud, where d(v) is the key of a vertex v
in the priority queue.
Using a heap: Add the starting vertex to the cloud (its distance in the priority queue is 0). Then removeMin
and update. Whatever removeMin returned is the new vertex in your cloud.
Updating means removing old keys and putting in new ones. Again, removeMin, add to cloud,
update, etc., until the heap is empty.
Complexity:
o In a heap, it is O((n + m) log n).
o In an unsorted sequence, it is O(n²)
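A minimal Dijkstra sketch using java.util.PriorityQueue; since that queue has no decrease-key operation, updated (distance, vertex) pairs are re-inserted and stale entries skipped — a common workaround rather than the key-update scheme described above. The Edge record (Java 16+) and the adjacency layout are illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Dijkstra's algorithm over an adjacency list of weighted edges.
public class DijkstraDemo {
    record Edge(int to, int weight) {}

    static int[] dijkstra(List<List<Edge>> adj, int s) {
        int[] d = new int[adj.size()];
        Arrays.fill(d, Integer.MAX_VALUE);    // all labels start at infinity
        d[s] = 0;
        PriorityQueue<int[]> pq =             // entries are {distance, vertex}
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        pq.add(new int[] {0, s});
        while (!pq.isEmpty()) {
            int[] top = pq.remove();          // vertex with smallest label
            int dist = top[0], u = top[1];
            if (dist > d[u]) continue;        // stale entry: skip
            for (Edge e : adj.get(u))         // relax edges out of u
                if (d[u] + e.weight() < d[e.to()]) {
                    d[e.to()] = d[u] + e.weight();
                    pq.add(new int[] {d[e.to()], e.to()});
                }
        }
        return d;
    }
}
```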
Prim-Jarnik Algorithm
We assume the graph is connected.
We pick an arbitrary vertex s, and we grow the minimum spanning tree as a cloud of vertices,
starting from s. We store with each vertex v a label d(v), which is the smallest weight of an edge
connecting v to any vertex in the cloud (not the distance from the root as in Dijkstra).
At each step, we add to the cloud the vertex u outside the cloud with the smallest distance label.
We update the labels of the vertices adjacent to u.
We use a priority queue whose keys are labels, and whose elements are vertex-edge pairs.
Key: distance, Element: vertex. Any vertex can be the starting vertex. We still initialize all the d(v)
values to infinity, and we also initialize P(v) (the edge associated with v) to null. It returns the
minimum spanning tree T.
It is an application of the cycle property.
Complexity is O((n + m) log n).
Kruskal's Algorithm
Each vertex is initially stored as its own cluster.
At each iteration, the minimum weight edge is added to the spanning tree if it joins two distinct
clusters.
The algorithm ends when all the vertices are in the same cluster.
Application of the partition property.
A priority queue stores the edges outside the cloud. Key: weight, Element: edge. At the end of the
algorithm, we are left with one cloud that encompasses the minimum spanning tree, and with a
tree which is our minimum spanning tree.
Essentially: start with the edge with the lowest weight, add it to the tree. Continue with the next
lowest weight, and add it to the tree (if it does not form a cycle with existing edges of the tree).
Continue until you've gone through all the edges. The resulting tree is your minimum spanning
tree.
Complexity is O(m log n)
Pattern Matching
Brute-Force
Compares the pattern P with the text T for each possible shift of P relative to T, until either a
match is found or all placements of the pattern have been tried.
Compare, shift over one, compare, shift over one…
Worst case: highly repetitive strings, e.g. T = aaaa…a and P = aaab.
Complexity O(nm), where n is the size of the text and m is the size of the pattern that we're
trying to find.
Boyer-Moore
Based on two heuristics
o Looking-glass heuristic: Compare P with a subsequence of T moving backwards.
o Character-jump heuristic: When a mismatch occurs at T[i] = c, where c is the character in T
at which the mismatch occurs (with P[j] ≠ c):
If P contains c, shift P to align the last occurrence of c in P with T[i].
If P does not contain c, shift P completely past T[i] to align P[0] with T[i + 1].
If a match occurs, compare the previous two characters. If they match, keep comparing
right to left. If a mismatch occurs, apply the character-jump heuristic above.
Worst case: e.g. T = aaaa…a and P = baaa.
Complexity is O(nm + |Σ|), where |Σ| is the size of the alphabet you're using.
KMP Algorithm
Compares the pattern to the text left to right, but shifts the pattern more intelligently than
brute force.
Compare each letter of P, left to right, to T. When you find a mismatch, look at the word to the
left of the mismatch. Find the largest prefix of it that is also a suffix, and shift P by
matching up the suffix with the prefix.
Failure function: F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of
P[1..j]. Usually organized into a table. It is computed in O(m) time.
The complexity of this algorithm is O(n + m).
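A minimal sketch of computing the KMP failure function as a table:

```java
// f[j] = length of the longest prefix of P[0..j] that is also a suffix of P[1..j].
public static int[] failureFunction(String p) {
    int m = p.length();
    int[] f = new int[m];                  // f[0] = 0
    int j = 1, k = 0;                      // k = length of current matched prefix
    while (j < m) {
        if (p.charAt(j) == p.charAt(k)) {  // prefix extends by one
            f[j] = k + 1;
            j++; k++;
        } else if (k > 0) {
            k = f[k - 1];                  // fall back to a shorter prefix
        } else {
            f[j] = 0;                      // no prefix matches here
            j++;
        }
    }
    return f;
}
```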