Algorithms
What follows are some notes on algorithms I've been reviewing from Algorithms by Robert Sedgewick and Kevin Wayne, The Algorithm Design Manual by Steven S. Skiena, and other sources around the Internet ^{1} ^{2} ^{3}. I wanted to write some notes on the material so that I could easily look back on it, but mainly so that I could be sure that I understand the material.
# Approaches
# Strategies
Some general strategies to approaching a given problem:
 concrete examples: manually solve concrete instances and build a general solution
 case analysis: split the input/execution into a number of cases and solve each in isolation
 iterative refinement: brute-force and improve upon it
 reduction: use a well-known solution to another problem as a subroutine
 graph modeling: model the problem as a graph
# Modeling
 permutations: arrangements, tours, orderings, or sequences
 subsets: clusters, collections, committees, groups, packagings, or selections
 trees: hierarchies, dominance relationships, ancestor/descendant relationships, or taxonomies
 graphs: networks, circuits, webs, relationships
 points: sites, positions, data records, locations
 polygons: shapes, regions, configurations, boundaries
 strings: text, characters, patterns, labels
# Techniques
 divide-and-conquer: decompose the problem into two or more subproblems until they are simple enough to solve directly, then combine the results
 greedy algorithms: make locally-optimal, seemed-like-a-good-idea-at-the-time decisions in stages and never change them (don't backtrack). This won't always yield the optimal solution.
In divide-and-conquer, the problem is often solved in the combination phase. For example, merge sort divides the problem and actually solves (i.e. sorts) it during the merge phase.
# Intractability
An intractable problem is one that has no efficient solution. It can be proved that a problem is intractable if a known intractable problem can be reduced to the given problem.
 change the problem formulation such that it still achieves the higher goal
 brute-force or dynamic programming: acceptable if instances or the exponential parameter is small
 search: prune the search-space via backtracking, branch-and-bound, hill-climbing
 heuristics: insight, common case analysis
 parallelization: solve subparts in parallel
 approximation: solution that is provably close to optimum
# Analysis
The goal of asymptotic analysis is to suppress the constant factors and lower-order terms. This is because the constant factors are very system-dependent, such as how many cycles a certain operation may take between different pieces of hardware. The lower-order terms are also not as important because they are rendered irrelevant for very large inputs, where the higher-order terms dominate.
# Summations
The summation of a constant is simply the product of the constant and the range:

$$\sum_{i=1}^{n} c = cn$$
The sum of the first $n$ integers can be visualized as folding the range of values at the middle so that the first integer is paired with the last, or more generally: the $i^\text{th}$ paired with the $(n - i + 1)^\text{th}$. Below, the bound of $n/2$ refers to the "folding at the middle," then each pair $(n + 1)$ is added. Note that the sum is quadratic!

$$\sum_{i=1}^{n} i = \sum_{i=1}^{n/2} (n + 1) = \frac{n (n + 1)}{2} = \frac{n^2 + n}{2}$$
The sum of a harmonic series is approximately equal to the logarithm of the bound:

$$H(n) = \sum_{i=1}^{n} \frac{1}{i} \approx \ln n$$
# Logarithms
The exponent of a logarithm operand can be extracted:

$$\log_b x^y = y \log_b x$$
# Bounds
The upper-bound $f(n) = O(g(n))$ means that there exists some constant $c$ such that $f(n)$ is always $\le c \cdot g(n)$ for a large enough $n$, that is, for some offset $n_0$ such that $n \ge n_0$.
The lower-bound $f(n) = \Omega(g(n))$ is similar except that it is a lower-bound, so that there exists some constant $c$ such that $f(n)$ is always $\ge c \cdot g(n)$ for $n \ge n_0$.
There is also $f(n) = \Theta(g(n))$, which means that $c_1 \cdot g(n)$ is an upper-bound and $c_2 \cdot g(n)$ is a lower-bound on $f(n)$ for $n \ge n_0$. This is a tighter bound on $f(n)$ than simply a lower or upper-bound alone would provide.
Constant factors are ignored since they can easily be beaten out by a different chosen value of $c$.
# Dominance Relations
A faster-growing function $f$ dominates a slower-growing one $g$, i.e. $f \gg g$. In order of increasing dominance:

| Name | Complexity | Examples |
| --- | --- | --- |
| Constant | $O(1)$ | adding two numbers |
| Logarithmic | $O(\log n)$ | binary search |
| Linear | $O(n)$ | incrementing each element in an array |
| Linearithmic | $O(n \log n)$ | QuickSort and MergeSort |
| Quadratic | $O(n^2)$ | looking at all/most pairs of an array |
| Cubic | $O(n^3)$ | looking at all/most triples of an array |
| Exponential | $O(2^n)$ | looking at all subsets of an array |
| Factorial | $O(n!)$ | looking at all permutations or orderings of $n$ items |
When analyzing an algorithm it is common to produce an expression of bounds which can easily be simplified by keeping in mind the principle of dominance relations.
# Sum of Bounds
For example, if an algorithm first sorts its input and then prints each element, then that's a sorting operation of $O(n \log n)$ followed by a linear printing operation of $O(n)$, essentially becoming $O(n \log n + n)$. However, the linearithmic term clearly dominates the linear term, so simplifying it to $O(n \log n)$ still leaves an accurate bound.
# Product of Bounds
Constant factors are ignored since a different value of the constant can be chosen to compensate for any arbitrary constant factor.
However, the product of functions is important. For example, a linear scan of an array in $O(n)$ where for each element another linear scan of the array is made in $O(n)$ produces a product of $O(n \cdot n) = O(n^2)$.
# Master Theorem
The master theorem provides a straightforward, "black-box" way of determining the running time of a recursive, divide-and-conquer algorithm. It's stated as:

$$T(n) = a \, T\!\left(\frac{n}{b}\right) + O(n^d)$$

where:

 $n$ is the size of the problem
 $a$ is the number of recursive calls per level
 $n/b$ is the size of each subproblem
 $O(n^d)$ is the work done outside of the recursive calls, e.g. the merge in merge sort

Then the runtime complexity of an algorithm can be determined based on the values of $a$, $b$, and $d$.
when $a = b^d$, the complexity is $O(n^d \log n)$
The same amount of work is being done at each level, of which there are $O(\log n)$.
when $a < b^d$, the complexity is $O(n^d)$
Most of the work is done at the root, as if only at a single level.
when $a > b^d$, the complexity is $O(n^{\log_b a})$
It's equivalent to the number of leaves in the recursion tree, since most of the work is done at the bottom of the tree.
Essentially, the master theorem is a tugofwar between:
 $a$: the rate of subproblem proliferation
 $b^d$: the rate of work shrinkage per subproblem
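As a worked example of the first case, merge sort fits the statement above with $a = 2$, $b = 2$, and $d = 1$ (linear merge work):

```latex
% Merge sort: two recursive calls on halves, plus linear merge work.
T(n) = 2\,T\!\left(\tfrac{n}{2}\right) + O(n)
\qquad a = 2,\quad b^d = 2^1 = 2
% Since a = b^d, the first case applies:
T(n) = O(n^d \log n) = O(n \log n)
```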
# Approximations
Oftentimes it's useful to use approximations instead of exact values.
Stirling's approximation:

$$n! \approx \sqrt{2 \pi n} \left(\frac{n}{e}\right)^n$$
# Bitwise Operations
Bits can be unconditionally turned on by ORing with 1: `x | mask`
Bits can be unconditionally turned off by ANDing with 0, i.e. with the mask's complement: `x & ~mask`
Bits can be toggled by XORing with 1: `x ^ mask`
XOR can be replicated without actually using XOR by simply writing out what XOR means, i.e. exclusive OR: `(a AND NOT b) OR (NOT a AND b)`
The least significant set bit can be obtained with: `x & ~(x - 1)`
The way this works is:

 `x - 1`: all bits up-to-and-including the least significant set bit are flipped
 `~(x - 1)`: flip all bits so that only the previously-least significant set bit will pass when ANDed with the original value
 `x & ~(x - 1)`: AND with the original value so that only the least significant set bit passes
Likewise, the least significant set bit can be unset with: `x & (x - 1)`
This works because all bits up-to-and-including the least significant set bit are flipped in `x - 1`, such that the least significant set bit is now unset (0) and all prior bits are set (1). All other more significant bits remain unchanged. This way, ANDing both causes the previously least significant set bit to fail.
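Both identities can be checked with a small sketch in Python:

```python
def lowest_set_bit(x: int) -> int:
    """Isolate the least significant set bit of x."""
    return x & ~(x - 1)  # equivalently x & -x in two's complement

def clear_lowest_set_bit(x: int) -> int:
    """Unset the least significant set bit of x."""
    return x & (x - 1)

# 0b10110 -> lowest set bit is 0b10
assert lowest_set_bit(0b10110) == 0b10
assert clear_lowest_set_bit(0b10110) == 0b10100
```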
# Parity
The parity can be computed by continuously right-shifting and ANDing the shifted bit with 1 and adding the result to the running parity. However, only the 1 bits will have an effect on the parity, so a quicker way is to continuously be accessing and turning off the next lowest set bit. Finally, lookup tables of e.g. 16-bit chunks can be precomputed so that on a given input, the parity for each 16-bit chunk of the input is looked up and added together.
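The "turn off the next lowest set bit" approach can be sketched as follows, so the loop runs once per set bit rather than once per bit:

```python
def parity(x: int) -> int:
    """Parity of x: 1 if the number of set bits is odd, else 0."""
    p = 0
    while x:
        p ^= 1
        x &= x - 1  # clear the least significant set bit
    return p

assert parity(0b1011) == 1  # three set bits: odd parity
assert parity(0b1001) == 0  # two set bits: even parity
```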
# Bitwise Set Operations
A very important property of XOR is that a number XORed with itself is 0.
For example, given a list of numbers where each number appears exactly twice except for one number, the number that appears only once can be obtained by XORing all of the elements together, since duplicate elements cancel each other out.
More generally, bytes can be treated as a set, where bitwise operations correspond to set operations.
A set can be tested for the presence of a value by using the AND operator: `set & value`. More generally, this can be seen as a set intersection operation, in which the operands may, after all, be singleton sets.
A value can be unconditionally added to the set using the OR operator: `set | value`
A value can be unconditionally removed from the set using the AND operator with the complement of the value to be removed: `set & ~value`
A value can be conditionally added to or removed from the set using the XOR operator: `set ^ value`. The value will be removed if it is already contained and added if it isn't.
For example, given an array of numbers, each of which appears exactly three times except for one number, the number that appears only once can be obtained using a combination of the above operations.
There will be two sets. The `ones` set will contain those numbers that have been observed to appear once so far. The `twos` set will contain those numbers that have been observed to appear twice so far. It's not necessary to track those numbers that appear three times, as that is implicit.

For each number in the array, it will be added to the `ones` set if it has been seen exactly once. If a number is already in the `ones` set and is seen again, it's removed from the `ones` set and added to the `twos` set. If a number is already in the `twos` set and is seen again, it's removed from the `twos` set. Ultimately this will leave the `ones` set as a singleton set containing the number that appears only once.

First, the number being considered is checked to see if it has already been seen once so far: `ones & n`

This would mean that this is the second time that this number has been seen. If the number is indeed in the `ones` set, the above expression evaluates to the number itself by nature of the AND operation, otherwise it evaluates to 0.
The result of the above expression can directly be used to add the number to the `twos` set using the OR operation, which adds the number to the set, or does nothing if the number is 0: `twos |= ones & n`

The number is then conditionally added to or removed from the `ones` set using the XOR operation: `ones ^= n`. If the `ones` set already contained the number, then the previous operation will have added the number to the `twos` set, in which case it's no longer needed inside the `ones` set, since the `ones` set only contains those numbers that have appeared exactly once. If the `ones` set didn't already contain the number, then the above presence check results in 0, which does nothing when added to the `twos` set.

Therefore, we can add the number to the `ones` set if it wasn't already contained, or remove it if it was contained.
However, if this has been the third time that this number has been seen, then the number will have been in the `twos` set already but not in the `ones` set. The above expression would therefore also add it to the `ones` set. Therefore we can be sure that if the number was present in both the `ones` and `twos` sets after the above operation, then this is the third time that this number has been observed.

In this case, the number should be removed from both sets. This would ensure that the `ones` set really does only contain those numbers that have appeared exactly once, which is what we're interested in.

However, the `ones` set and the `twos` set may contain various numbers. We know that the one that appears in both is the one that has appeared three times, and hence the one we want to remove, so we can obtain this number by intersecting both sets: `threes = ones & twos`

We then want to remove this number from both sets: `ones &= ~threes` and `twos &= ~threes`
Here is the full algorithm.
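A sketch of the full algorithm in Python, combining the operations above:

```python
def single_among_triples(nums):
    """Find the number appearing once when every other number appears
    exactly three times, using the ones/twos bit-set bookkeeping."""
    ones = twos = 0
    for n in nums:
        twos |= ones & n      # second sighting: promote from ones to twos
        ones ^= n             # toggle membership in ones
        threes = ones & twos  # bits seen three times appear in both sets
        ones &= ~threes       # remove them from both sets
        twos &= ~threes
    return ones

assert single_among_triples([2, 2, 3, 2]) == 3
assert single_among_triples([5, 7, 5, 5]) == 7
```

Note that the sets operate per bit: `ones` and `twos` together count, modulo 3, how many times each bit position has been seen.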
# Integers
It's possible to "shift" a number to the left by $n$ digits by multiplying it with $10^n$. For example, a number can be shifted to the left by 2 digits by multiplying it by $10^2$, e.g. $123 \cdot 10^2 = 12300$.
Similarly, it's possible to get the last $n$ digits of a number by getting the remainder of dividing it by $10^n$ (i.e. the modulo). For example, the last 2 digits can be obtained with $10^2$, e.g. $12345 \bmod 100 = 45$.
The square root of a number $n$ can be found via binary search by searching from $[1, n]$ for a number which, squared, is equal to the input (given some tolerance or until convergence). If the input is real and less than 1, the bounds change to $[n, 1]$.
The logarithm of a number can naively be computed by counting how many times the operand can be divided by the base without going below 1. For example, $\log_2 64 = 6$: 64 can divide by 2 a total of six times until it reaches 1; doing it a seventh time causes the value to go below 1.
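Both ideas can be sketched as follows; the bounds and tolerance here are assumptions for illustration:

```python
def sqrt_binary_search(n: float, tolerance: float = 1e-9) -> float:
    """Approximate the square root of n >= 0 by binary searching for
    the value whose square is n."""
    lo, hi = (n, 1.0) if n < 1 else (0.0, n)  # search [n, 1] when n < 1
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if mid * mid < n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def log_naive(n: float, base: float) -> int:
    """Floor of log_base(n): count divisions until going below 1."""
    count = 0
    while n / base >= 1:
        n /= base
        count += 1
    return count

assert abs(sqrt_binary_search(64) - 8) < 1e-6
assert log_naive(64, 2) == 6
```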
# Arrays
The 2-SUM problem is one where the input is a sorted array of integers and a target value $t$ and the objective is to determine if there is a pair of entries $A[i]$ and $A[j]$ such that:

$$A[i] + A[j] = t$$

The brute-force approach would be to check if this is true for all possible pairs of values in the array, of which there are $O(n^2)$. Alternatively, a hash table can be used.
Another approach is to iterate from both ends of the array. For example, at first, the first and last elements will be used. If their sum is greater than the target, the right end will be iterated leftward so as to attempt to obtain a smaller element. If instead their sum is less than the target, the left end will be iterated rightward in hopes of increasing their sum.
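The two-pointer approach can be sketched as:

```python
def two_sum_sorted(a, target):
    """Determine whether two entries of the sorted array a sum to
    target, by walking inward from both ends."""
    i, j = 0, len(a) - 1
    while i < j:
        s = a[i] + a[j]
        if s == target:
            return True
        elif s > target:
            j -= 1  # need a smaller sum: move the right end leftward
        else:
            i += 1  # need a larger sum: move the left end rightward
    return False

assert two_sum_sorted([1, 3, 5, 8], 9) is True
assert two_sum_sorted([1, 3, 5, 8], 10) is False
```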
# Sorting
Many problems can be reduced to sorting.
Answers: Ordering a sequence into a specified order.
Data Structure: Array or other sequence.
The following algorithms are described with the assumption that the sequence is an array of contiguous memory and constant access time. This is noteworthy because it is important to recognize algorithms can have different speeds depending on the underlying data structure.
For example, selection sort backed by a priority queue or balanced binary tree can help to speed up the operation of finding the smallest element in the unsorted region. Instead of being linear, the operation would be $O(\log n)$. Given that this is done at every element in the sequence, of which there are $n$, this means that selection sort backed by such a structure can be improved from $O(n^2)$ to $O(n \log n)$ ^{4}.
A sorting algorithm is known as stable if it maintains the same relative order of equal keys as it was before the sorting operation.
The best case complexity of comparison-based sorting is $O(n \log n)$. If the distribution of the data is known, sorting can be done much faster using counting or bucket sort, for example.
# Selection Sort
| Case | Growth |
| --- | --- |
| Any | $O(n^2)$ |
This is a pretty naive algorithm that is mainly useful for didactic purposes.
Algorithm operation:
 go through entire sequence to find smallest element
 swap element with the leftmost unsorted element
 repeat until the end of the sequence
This essentially splits the sequence into a left sorted region and a right unsorted region.
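The steps above can be sketched as:

```python
def selection_sort(seq):
    """Sort seq in place by repeatedly finding the smallest element of
    the unsorted region and swapping it with the leftmost unsorted
    element."""
    n = len(seq)
    for i in range(n):
        smallest = i
        for j in range(i + 1, n):  # scan the entire unsorted region
            if seq[j] < seq[smallest]:
                smallest = j
        seq[i], seq[smallest] = seq[smallest], seq[i]
    return seq

assert selection_sort([3, 1, 2]) == [1, 2, 3]
```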
# Insertion Sort
| Case | Growth |
| --- | --- |
| Best | $O(n)$ |
| Worst | $O(n^2)$ |
This is a stable algorithm that is still pretty straightforward but somewhat improves upon selection sort if the array is already sorted or if it's nearly sorted.
It operates as follows:
 go through the entire sequence until an element is found which is smaller than the previous element
 swap the smaller element with the one on the left until the element to its left is no longer larger than itself
 repeat until the end of the sequence
The benefit of insertion sort is that if the sequence is already sorted then the algorithm operates in linear time. Similarly, if the sequence is nearly sorted, the algorithm will perform better than the worst case.
Performance Factors: order of the items
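The procedure above can be sketched as:

```python
def insertion_sort(seq):
    """Sort seq in place by sliding each element leftward until the
    element to its left is no longer larger than itself."""
    for i in range(1, len(seq)):
        j = i
        while j > 0 and seq[j - 1] > seq[j]:
            seq[j - 1], seq[j] = seq[j], seq[j - 1]
            j -= 1
    return seq

assert insertion_sort([3, 1, 2]) == [1, 2, 3]
```

On an already-sorted input the inner loop never runs, which is why the best case is linear.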
# Shell Sort
| Case | Growth |
| --- | --- |
| Worst | $O(n^{3/2})$ |
While insertion sort can be faster than selection sort, one problem with it is that the swap operations are done one at a time. This means that in the worst case, when sorting position 1 of the array, the smallest element could be at the very end of the array, meaning a total of $n - 1$ swaps where $n$ is the length of the array.
Shell sort aims to mitigate this by doing the following:
 1. pick a large number $h$ some constant factor less than the length of the sequence
 2. consider every $h^\text{th}$ element in the sequence and apply insertion sort to those elements
 3. now consider every $h^\text{th}$ element starting at the next position and do the same
 4. repeat, incrementing the starting position, until the end of the array is reached
 5. repeat steps 2 - 4 but with $h$ reduced by some factor until the reduction reaches $1$
 6. ultimately do regular insertion sort, i.e. $h = 1$
The value picked for $h$ and the factor by which it is reduced form what is known as a gap sequence. The overall worst-case time complexity depends on the chosen gap sequence. A commonly chosen gap sequence with a worst-case time complexity of $O(n^{3/2})$ is:

$$1, 4, 13, 40, 121, 364, \ldots \qquad h_{k+1} = 3 h_k + 1$$

This sequence begins at the largest increment less than $n/3$ and decreases to 1. This means that for a sequence of length 16, the gap sequence is $4, 1$.
The effect of shell sort is that it sorts elements that are $h$ elements apart with one swap instead of $h$. The granularity of the sorting operation increases as $h$ itself decreases such that every element is eventually sorted, but with the added benefit that as $h$ decreases, the distance of the longest-distance swap decreases.
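A sketch using the $3h + 1$ gap sequence above:

```python
def shell_sort(seq):
    """Sort seq in place by h-sorting it with insertion sort for
    decreasing gaps h from the 3h + 1 sequence down to 1."""
    n = len(seq)
    h = 1
    while h < n // 3:
        h = 3 * h + 1  # 1, 4, 13, 40, ...
    while h >= 1:
        for i in range(h, n):
            j = i
            # slide seq[i] leftward among elements h apart
            while j >= h and seq[j - h] > seq[j]:
                seq[j - h], seq[j] = seq[j], seq[j - h]
                j -= h
        h //= 3
    return seq

assert shell_sort([5, 3, 8, 1, 9, 2]) == [1, 2, 3, 5, 8, 9]
```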
# Merge Sort
| Case | Growth |
| --- | --- |
| Worst | $O(n \log n)$ |
| Space | $O(n)$ |
This is a stable algorithm and the first algorithm that is linearithmic in complexity. The general idea is that the sequence is split into many pieces and then they're all merged back together. The sorting occurs during the merging phase. The merging algorithm works such that the resultant merged piece is sorted.
The main drawback is that it has $O(n)$ space complexity because an auxiliary sequence has to be created to facilitate the merging process.
The complexity is $O(n \log n)$ because the number of subproblems is doubling at each level (i.e. the two recursive calls), but the work to be done by those subproblems is halving. That is, for a given level $j$, the amount of work done is:

$$2^j \cdot O\!\left(\frac{n}{2^j}\right) = O(n)$$

Given an input size of $n$, the number of levels in the recursion tree is $\log_2 n$, which means that at each of the $\log_2 n$ levels in the tree there is $O(n)$ work being done, hence $O(n \log n)$.
The number of inversions in an array can be counted in $O(n \log n)$ by reducing the problem to merge sort. Specifically, during a merge, each time an element from the right half is merged while there are elements remaining in the left half, the chosen element from the right half represents an inversion with each of the elements remaining on the left half.
A possible implementation would have merge return the inversions it encountered, which has to be passed up the recursion tree by having the sort functions return the sum of the recursive sorts and merges.
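Such an implementation can be sketched as:

```python
def sort_and_count(seq):
    """Merge sort that also counts inversions: each time an element of
    the right half is merged while elements remain on the left, it forms
    an inversion with every remaining left element."""
    if len(seq) <= 1:
        return seq, 0
    mid = len(seq) // 2
    left, inv_left = sort_and_count(seq[:mid])
    right, inv_right = sort_and_count(seq[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inversions += len(left) - i  # pairs with remaining left items
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions

assert sort_and_count([2, 4, 1, 3, 5]) == ([1, 2, 3, 4, 5], 3)
```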
# Top-Down
This is a recursive approach that works by splitting the array into two pieces until the pieces consist of pairs of elements. On each recurrence, the two pieces that were split for that recurrence are merged back.
# Merge Sort Improvements
There are a couple of improvements that can be made to top-down merge sort:
 use insertion sort for small subarrays: create a cutoff, e.g. 15 elements, where the pieces are sorted with insertion sort instead of being broken down further
 test if sequence is already in order: skip the merging phase if `seq[mid] <= seq[mid + 1]`
# Bottom-Up
The other approach to merge sort is bottom-up, that is, starting with arrays consisting of one element and merging them together, then merging all of the arrays of size two, and so on until the entire array is merged.
 increment a counter `sz` through the series of powers of two until `sz >= n`
 at each `sz`, merge every adjacent pair of sub-arrays of length `sz` into sub-arrays of length `2 * sz`
One advantage of bottom-up merge sort is that it can be modified to perform on linked-lists in place.
# Quick Sort
| Case | Growth |
| --- | --- |
| Worst | $O(n^2)$ |
| Space | $O(\log n)$ |
QuickSort works by choosing an element in the array (the pivot) and partitioning the array such that all elements less than the pivot are moved to its left and all elements greater than the pivot are moved to its right. This has the effect that, at the end of this operation, the chosen element will be at its "sorted order position," i.e. the position in which it would be if the entire array were already sorted.
Note that the elements are simply moved to the correct side of the pivot, but the order of neither side is defined, i.e. neither the left nor the right side are necessarily sorted after partitioning.
The partition algorithm is similar to merge in merge sort in that it is what actually does the sorting.
 1. choose a partition element `seq[lo]`
 2. scan through the array from `lo` to `hi` in both directions
  while `seq[i] < seq[lo]` do `i++`
  while `seq[j] > seq[lo]` do `j--`
  swap `seq[i]` and `seq[j]`
 3. repeat step 2 until the iterators `i` and `j` cross
 4. swap the partition element `seq[lo]` with the final position of the right-side iterator, `seq[j]`
The sorting algorithm then recurses on the two partitions. Note that `i` is set to `lo` and not `lo + 1` to ensure that the pivot at `lo` is skipped, since the first operation is `++i`. However, `j` is set to `hi + 1` to ensure that `hi` is not skipped, since it's not the pivot.
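A sketch of the partition step and recursion; this variant guards the scans with bounds checks rather than sentinels, so it is not an exact transcription of any one textbook version:

```python
def partition(seq, lo, hi):
    """Partition seq[lo..hi] around pivot seq[lo]; smaller elements end
    up to its left, larger to its right. Returns the pivot's final
    position."""
    pivot = seq[lo]
    i, j = lo, hi + 1
    while True:
        i += 1
        while i < hi and seq[i] < pivot:
            i += 1
        j -= 1
        while seq[j] > pivot:
            j -= 1
        if i >= j:  # iterators crossed
            break
        seq[i], seq[j] = seq[j], seq[i]
    seq[lo], seq[j] = seq[j], seq[lo]  # pivot into sorted position
    return j

def quick_sort(seq, lo=0, hi=None):
    if hi is None:
        hi = len(seq) - 1
    if lo < hi:
        p = partition(seq, lo, hi)
        quick_sort(seq, lo, p - 1)
        quick_sort(seq, p + 1, hi)
    return seq

assert quick_sort([3, 6, 1, 5, 2, 4]) == [1, 2, 3, 4, 5, 6]
```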
# Quick Sort Improvements
 use insertion sort for small subarrays: adding a cutoff size for which to apply insertion sort to small subarrays can improve the performance of the algorithm. Instead of `if (hi <= lo) return;` use `if (hi <= lo + M) { insertion_sort(seq, lo, hi); return; }` where `M` is the cutoff. Recommended sizes for `M` are between 5 and 15.
 median-of-three partitioning: choose a sample of size 3 from the sequence and choose the middle element as the partitioning element.
# Three-way Partitioning
| Case | Growth |
| --- | --- |
| Best | $O(n)$ |
| Worst | $O(n \log n)$ |
| Space | $O(\log n)$ |
One problem with quick sort as it is implemented above is that items with keys equal to that of the partition item are swapped anyways, unnecessarily. Threeway partitioning aims to resolve this by partitioning into three separate subarrays, the middle of which corresponds to those items with keys equal to the partition point. E. W. Dijkstra popularized this as the Dutch National Flag problem.
Performance Factors: distribution of the keys
 1. perform a 3-way comparison between element `seq[i]` and the partition element `v`
  `seq[i] < v`: swap `seq[lt]` and `seq[i]`, then `lt++` and `i++`
  `seq[i] > v`: swap `seq[i]` and `seq[gt]`, then `gt--`
  `seq[i] == v`: `i++`
 2. repeat step 1 until `i` and `gt` cross, i.e. while `i <= gt`
 recurse on the left and right segments
Quick sort performs a lot better than merge sort in sequences that have duplicate keys. Its time is reduced from linearithmic to linear for sequences with large numbers of duplicate keys.
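The three-way scheme can be sketched as:

```python
def quick_sort_3way(seq, lo=0, hi=None):
    """Quick sort with Dijkstra's 3-way (Dutch national flag)
    partitioning: elements equal to the pivot are never re-swapped."""
    if hi is None:
        hi = len(seq) - 1
    if lo >= hi:
        return seq
    lt, i, gt = lo, lo + 1, hi
    v = seq[lo]
    while i <= gt:
        if seq[i] < v:
            seq[lt], seq[i] = seq[i], seq[lt]
            lt += 1
            i += 1
        elif seq[i] > v:
            seq[i], seq[gt] = seq[gt], seq[i]
            gt -= 1
        else:
            i += 1
    quick_sort_3way(seq, lo, lt - 1)  # elements < v
    quick_sort_3way(seq, gt + 1, hi)  # elements > v
    return seq

assert quick_sort_3way([4, 1, 4, 2, 4, 3]) == [1, 2, 3, 4, 4, 4]
```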
# Heaps
A priority queue is an abstract data type that allows adding elements and retrieving the smallest or largest element. Priority queues are useful for an unbounded sequence for which we want to retrieve the smallest elements at any given moment.
The data structure commonly used to back a priority queue is an array embedding the contents of a complete binary tree in levelorder that maintains two invariants:
 the parent of the node at index $k$ is at index $\lfloor k/2 \rfloor$
 the children of the node at index $k$ are at indices $2k$ and $2k + 1$

Note that these invariants assume the array is 1-indexed, with index 0 unused.
# Heap Insertion
| Case | Growth |
| --- | --- |
| Worst | $O(\log n)$ |
Swimming in a heap is when a node is checked to ensure the invariant that every node is smaller than its parent. If a node's value becomes larger than its parent, the node is swapped with its parent and the process is repeated at the new parent until the tree root is reached. This can be characterized as a new, larger node having to swim up the tree to its proper place.
To insert into the heap:
 add element to the end of the array
 increment heap size
 swim up the heap to restore heap order
# Heap Removal
| Case | Growth |
| --- | --- |
| Worst | $O(\log n)$ |
From a different perspective, if a node's key becomes smaller than one or both of its children, the heaporder invariant will also be violated, because it conversely means that one or more of its children are larger than the parent. In this case, the node is simply swapped with the larger of its two children, a process known as sinking. This process is repeated for the new child all the way down the tree until the invariant holds.
To remove the maximum from the heap:
 take the largest item off of the top
 put the item from the end of the heap at the top
 decrement heap size
 sink down the heap to restore heap order
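The swim and sink operations, together with insertion and removal, can be sketched as a 1-indexed max-heap:

```python
class MaxHeap:
    """Binary max-heap in a 1-indexed array: parent of k is k // 2,
    children of k are 2k and 2k + 1."""
    def __init__(self):
        self.a = [None]  # index 0 unused

    def swim(self, k):
        # swap upward while the node is larger than its parent
        while k > 1 and self.a[k // 2] < self.a[k]:
            self.a[k // 2], self.a[k] = self.a[k], self.a[k // 2]
            k //= 2

    def sink(self, k):
        n = len(self.a) - 1
        while 2 * k <= n:
            j = 2 * k
            if j < n and self.a[j] < self.a[j + 1]:
                j += 1  # pick the larger of the two children
            if self.a[k] >= self.a[j]:
                break
            self.a[k], self.a[j] = self.a[j], self.a[k]
            k = j

    def push(self, x):
        self.a.append(x)             # add element to the end
        self.swim(len(self.a) - 1)   # swim up to restore heap order

    def pop_max(self):
        top = self.a[1]              # take the largest item off the top
        self.a[1] = self.a[-1]       # move the last item to the top
        self.a.pop()                 # decrement heap size
        if len(self.a) > 1:
            self.sink(1)             # sink down to restore heap order
        return top

h = MaxHeap()
for x in [3, 9, 1, 7]:
    h.push(x)
assert [h.pop_max() for _ in range(4)] == [9, 7, 3, 1]
```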
# Heap Sort
| Case | Growth |
| --- | --- |
| Worst | $O(n \log n)$ |
Heap sort is a sorting algorithm facilitated by a priority queue which performs well when backed by a binary heap. Heap sort more or less amounts to:
 feeding the sequence into a priority queue
 extracting the sequence out of the priority queue
However, there are certain details involved to make it operate faster. Usually these operations are performed in place to avoid using extra space.
First, the sequence has to be put into heap order, which is accomplished by walking up the tree (bottom-up) and sinking every node with children. The starting point for this is always $\lfloor n/2 \rfloor$, which is the last node with children. It's important to note that "sinking a node" doesn't mean that the node will definitively be swapped.
Assuming a maximum-oriented priority queue, the sorting is then accomplished by:
 swap the maximum with the last item in the heap
 decrease logical heap size
 sink the new root to ensure or repair heap order
 repeat steps 1 - 3 until the priority queue becomes empty
Note that the notion of a logical heap size is important, as the sorted sequence is increasingly added to the end of the same array that backs the heap, so it's necessary to make a distinction between the two regions of the array, to prevent the heap sink or swim operations from corrupting the sorted region. This is accomplished in this code by using a hypothetical `sink` method that takes an upper bound parameter, which corresponds to the end of the heap region, i.e. the logical heap size.
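A sketch of the whole procedure; this version uses 0-based indexing (children at $2k + 1$ and $2k + 2$) and a `sink` helper bounded by the logical heap size, as described above:

```python
def heap_sort(seq):
    """In-place heap sort: heapify bottom-up, then repeatedly swap the
    maximum to the end of the logical heap and sink the new root."""
    n = len(seq)

    def sink(k, end):
        # restore heap order for seq[k] within the region seq[:end]
        while 2 * k + 1 < end:
            j = 2 * k + 1
            if j + 1 < end and seq[j] < seq[j + 1]:
                j += 1  # pick the larger child
            if seq[k] >= seq[j]:
                break
            seq[k], seq[j] = seq[j], seq[k]
            k = j

    for k in range(n // 2 - 1, -1, -1):  # heapify: sink every parent
        sink(k, n)
    for end in range(n - 1, 0, -1):
        seq[0], seq[end] = seq[end], seq[0]  # max into the sorted region
        sink(0, end)  # repair heap order within the shrunken heap
    return seq

assert heap_sort([5, 2, 9, 1, 7]) == [1, 2, 5, 7, 9]
```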
# Median Maintenance
The median of a streaming sequence of numbers can be computed in constant time by maintaining two heaps: a maxheap for the lower/left half and a minheap for the upper/right half. This has the effect of keeping the elements sorted, and the top of the maxheap yields one of the overall middle elements, whereas the top of the minheap yields the other middle element.
The heaps must be kept equal in size, or the min-heap may be larger by one element, in which case there is an odd number of elements, so the top of the min-heap is the middle element alone.
If one of the heaps grows larger, its top element should be popped and pushed onto the other heap to balance them.
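A sketch using Python's `heapq` (a min-heap, so the lower half is stored negated to simulate a max-heap):

```python
import heapq

class MedianMaintainer:
    """Streaming median: max-heap for the lower half (via negation),
    min-heap for the upper half, rebalanced so the min-heap holds the
    lone middle element when the count is odd."""
    def __init__(self):
        self.lower = []  # max-heap of the lower half, values negated
        self.upper = []  # min-heap of the upper half

    def add(self, x):
        if self.upper and x < self.upper[0]:
            heapq.heappush(self.lower, -x)
        else:
            heapq.heappush(self.upper, x)
        # rebalance: upper may exceed lower by at most one element
        if len(self.upper) > len(self.lower) + 1:
            heapq.heappush(self.lower, -heapq.heappop(self.upper))
        elif len(self.lower) > len(self.upper):
            heapq.heappush(self.upper, -heapq.heappop(self.lower))

    def median(self):
        if len(self.upper) > len(self.lower):
            return self.upper[0]
        return (-self.lower[0] + self.upper[0]) / 2

m = MedianMaintainer()
for x in [5, 2, 8, 1]:
    m.add(x)
assert m.median() == 3.5  # middle elements of [1, 2, 5, 8] are 2 and 5
```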
# Selection
| Case | Growth |
| --- | --- |
| Average | $O(n)$ |
Selecting the $i^\text{th}$ smallest item in a sequence can be accomplished by using QuickSort's partition algorithm. This is guaranteed by the invariant held by QuickSort's partition algorithm, which states that given the partition index $j$, all elements to the left are less than or equal to $seq[j]$ and all elements to the right are greater than or equal to $seq[j]$, effectively making the subsequence up to $j$ consist of the smallest $j$ elements in the sequence.
With that in mind, the desired index $i$ is input to QuickSelect. After partitioning in $O(n)$, the resulting position $j$ of the partition element is compared to the input $i$. If the resulting position $j$ is less than the desired $i$, then QuickSelect is repeated on the right region with a compensated $i$, i.e. $i - j - 1$, to discard the left region and the partition element. If the resulting position $j$ is greater than the desired $i$, then QuickSelect is repeated on the left region with the same $i$.
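A sketch of the idea; for clarity this version partitions into new lists around a random pivot rather than partitioning in place, trading the constant extra space for readability:

```python
import random

def quick_select(seq, i):
    """Return the i-th smallest (0-indexed) element of seq by
    partitioning around a pivot and recursing into one side only."""
    pivot = random.choice(seq)
    left = [x for x in seq if x < pivot]
    equal = [x for x in seq if x == pivot]
    right = [x for x in seq if x > pivot]
    if i < len(left):
        return quick_select(left, i)          # order statistic is on the left
    if i < len(left) + len(equal):
        return pivot                          # the pivot itself
    return quick_select(right, i - len(left) - len(equal))  # shift i

assert quick_select([9, 1, 8, 2, 7, 3], 0) == 1
assert quick_select([9, 1, 8, 2, 7, 3], 3) == 7
```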
# Binary Search Trees
| Case | Growth |
| --- | --- |
| Worst | $O(n)$ |
This is the classical data structure consisting of a binary tree where each node has at most two children. The subtree to the left of each node consists of elements smaller than the node and the subtree to the right of each node consists of elements greater than the node.
The performance of BSTs greatly depends on the shape of the tree, which is a result of the distribution and order of the elements that are input.
The rank operation counts how many keys are less than or equal to the given value.
The predecessor of any node can be obtained easily if it has a left child, in which case the predecessor of the node is the maximum of the subtree rooted at the left child. If there is no left child, the predecessor is the first ancestor larger than the node (in other words, the first parent found for which the child is the right child). The successor can be found similarly, flipping left⟷right and maximum⟷minimum.
# BST Structure
A perfect binary tree is one where every level has the maximum number of nodes.
The depth of a tree node is the number of edges from the root to the node, i.e. the level minus 1, so the depth of the root node is 0. The level of a node in a tree is its depth plus 1. The difference between levels and depth is that levels are the visual levels whereas depth is the number of edges from the root to the node.
The height of a tree node is the number of edges on the longest path from the node to a leaf. The height of an entire tree is the height of its root, i.e. the depth of the deepest node.
The number of levels in a binary tree with $n$ nodes is $\lfloor \log_2 n \rfloor + 1$.
In a perfect binary tree, the number of nodes at depth $d$ is $2^d$.
In a perfect binary tree with $L$ levels, the total number of nodes is $2^L - 1$. So if there are 3 levels, $2^3 - 1 = 7$ total nodes.
# BST Traversal
There are three main forms of traversing a BST. The order refers to the order in which the current node is visited relative to its children, that is, the time at which the node itself is visited is the only thing that varies, so the left child is always visited before the right child.

| Traversal | Order |
| --- | --- |
| pre-order | node, left, right |
| in-order | left, node, right |
| post-order | left, right, node |
# Morris Traversal
It's possible to perform an inorder traversal of a BST without using a stack or recursion by performing a Morris traversal. In essence, this traversal transforms the tree during the traversal so that the entire right branch of the BST forms the inorder traversal. The BST is returned to its original structure as the traversal takes place.
For each `current` node, starting at the root:

 if `current`'s left is `null`:
  visit `current`
  go right of `current`
 else:
  set `temp` to `current`'s left
  go right of `temp` until its right is `null` or `current`. This finds the maximum of `current`'s left subtree, i.e. the in-order predecessor of `current`, or it finds the node whose right is `current`.
  if `temp`'s right is `null`: This is the in-order predecessor of `current`. Transform the tree so that `temp` leads to `current` and is therefore in-order.
   set `temp`'s right to `current`
   go left of `current`. Following the new thread later "rewinds" back to `current`.
  if `temp`'s right is `current`: This is a node that was previously transformed to be in-order by making its right be `current`. Repair the transformation.
   visit `current`
   unset `temp`'s right (previously `current`)
   go right of `current`
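The traversal can be sketched as follows, collecting visited keys into a list:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def morris_inorder(root):
    """In-order traversal in O(1) extra space: temporarily thread each
    in-order predecessor's right pointer to the current node, then
    repair the threads as they are revisited."""
    out, current = [], root
    while current:
        if current.left is None:
            out.append(current.key)      # visit
            current = current.right
        else:
            temp = current.left
            while temp.right and temp.right is not current:
                temp = temp.right        # find the in-order predecessor
            if temp.right is None:
                temp.right = current     # thread predecessor -> current
                current = current.left
            else:
                temp.right = None        # repair the transformation
                out.append(current.key)  # visit
                current = current.right
    return out

tree = Node(4, Node(2, Node(1), Node(3)), Node(5))
assert morris_inorder(tree) == [1, 2, 3, 4, 5]
```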
# BST Deletion
Most operations such as insertion and lookup are very straightforward. Deletion is somewhat more involved.
To delete node $v$:

 1. $v$ has no children: remove it
 2. $v$ has just one child: swap it with the child and delete it
 3. $v$ has two children:
  compute $v$'s predecessor $p$, i.e. the maximum of its left subtree
  swap $v$ and $p$
  delete $v$
  $v$ now has no right child, so recurse starting at case 1
The transplant operation can be handled by simply associating the parent with the new child and vice versa:
# BST Select
The BST can be augmented so that each node contains the count of nodes rooted at it, including itself. Then the count for a node can be computed by adding the counts of its left child and right child, plus one for the node itself:

$$\text{size}(v) = \text{size}(v.\text{left}) + \text{size}(v.\text{right}) + 1$$
It's important to keep this augmented information up-to-date with the operations on the tree, such as insertion or deletion, by traversing up the parents from the affected node to increment or decrement their counts.
Selection of the $i^\text{th}$ order statistic can be found easily by guiding the traversal of the tree with the augmented size information. Given $r = \text{size}(v.\text{left})$, the size of the current node's left subtree:

 $i = r$: the node in question is itself the $i^\text{th}$ order statistic, because exactly $r = i$ keys precede it
 $i < r$: the $i^\text{th}$ order statistic is somewhere in the left subtree, recurse
 $i > r$: the $i^\text{th}$ order statistic is somewhere in the right subtree, recurse with $i - r - 1$. Since the right subtree only knows about itself, shift $i$ to discard the left subtree and the root node.
# 2-3 Search Trees
While 2-3 search trees can be implemented directly, they're mainly used to help understand the implementation of Red-Black Trees, which have better performance.
A 2-3 tree is either empty or:

 2-node: one key and two links
  left for keys smaller than the key
  right for keys larger than the key
 3-node: two keys and three links
  left for keys smaller than the left key
  middle for keys between the node's keys
  right for keys larger than the right key
# 2-3 Tree Searching
Searching follows simply from the structure of the tree.
 search hit if the key is in the node
 if not, recurse into the appropriate link
 search miss if a null link is reached
# 2-3 Tree Insertion
Insertion needs to take into consideration the fact that the tree must remain balanced after the operation. The general procedure is that the key is searched for until a node with a null link is reached at the bottom of the tree.
 single 2-node
 replace the 2-node with a 3-node containing the new key
 single 3-node
 create two 2-nodes out of the smallest and largest of the three keys (the two existing keys plus the new one)
 replace the 3-node with a 2-node consisting of the middle key
 set the 2-node's links to the two new 2-nodes
 3-node with 2-node parent (slight variation of above)
 create two 2-nodes out of the smallest and largest of the three keys
 move the middle key into the parent 2-node to make it a 3-node
 set the parent's links on either side of the moved-up key to the two new 2-nodes
 3-node with 3-node parent
 propagate the above operation until the root or a 2-node is encountered
 if the root is encountered, split it as in the case of a single 3-node
Perfect balance is preserved because any increase in tree height occurs at the root, and additions at the bottom of the tree are performed by splitting existing nodes such that the height remains the same.
The problem with implementing a direct representation of 2-3 trees is that there are many cases to handle and nodes have to be converted between various types. These operations can incur overhead that nullifies or even outweighs the performance benefits of 2-3 trees compared to regular BSTs.
# Red-Black Trees

Case | Growth
--- | ---
Worst | O(log n)
Red-Black trees are BSTs that guarantee near-perfect balance by maintaining 5 invariants:
 a node is either red or black
 root is black
 all leaves, represented as nil, are black
 both children of every red node are black, i.e. there must not be more than one red node in a row in any vertical path
 every path from a given node to any of its descendant leaves contains the same number of black nodes
These properties allow red-black trees to be nearly balanced in even the worst case, giving them better performance than regular BSTs. A very neat implementation is available here.
# Red-Black Tree Insertion
The inserted node is attached in the same manner as for BSTs, except that every node is painted red on insertion. However, the inserted node may violate any one of the 5 invariants, in which case the situation must be remedied. The remedies for the different cases are described below, split up for didactic purposes.
There are three main scenarios that may arise from adding a node:
 first node added creates a red root, violating property 2 (root is black)
 node is added as child of black node, operation completes successfully
 consecutive red nodes, violating property 4 (both children of red nodes are black)
Note that scenarios 1 and 3 violate the properties of red-black trees.
First, the inserted node may be the only node in the tree, making it the root. Since all nodes are inserted as red, it should be repainted black to satisfy property 2 (root is black):
Second, if the parent of the inserted node is black, the insertion is complete because it is not possible for that to have violated any of the properties:
Third, it is possible that the inserted node creates two consecutive red nodes, violating property 4 (both children of red nodes are black). For this, there are three different scenarios:
 parent and uncle are both red
 direction in which new node and parent lean differ
 new node and parent lean in the same direction
First, if the parent and its uncle are red, flip their colors and make the grandparent red instead. This allows the newly added red node to satisfy all properties, since its parent is black. However, making the grandparent red may possibly violate properties 2 (root is black) and 4 (both children of red nodes are black), so recurse the enforcement algorithm on the grandparent starting from case 1:
Second, the new node could be added diagonal to a red parent node. For example, the parent could be red and the left child of the grandparent, while the new node (red, as always) is the right child of that parent.
This is ultimately resolved by two rotations, but the first rotation is made to get the new node leaning in the same direction as its parent. This is accomplished by rotating the new node in the direction of the parent's direction from its parent. In the above example, the new node is its parent's right child and the parent is the grandparent's left child, so the new node is rotated left.
There are still consecutive red nodes after this rotation, albeit leaning in the same direction. This makes it simple for case 3c to handle, provided it is applied to the ex-parent, i.e. the now-bottom node, since case 3c operates in a more general sense from the perspective of the grandchild.
Third, the new node could be added below a red parent node and leaning in the same direction. For example, the new node is the left child of its parent and its parent is the left child of its parent (grandparent of the new node) as well.
This is resolved by rotating the grandparent in the direction opposite to the direction in which the consecutive red links lean. This has the effect of making the parent be the new root of the subtree previously rooted by the grandparent.
The grandparent was known to be black, since the red parent could not have been its child otherwise. Knowing this, the parent (now the root) switches colors with the grandparent, such that the subtree now consists of a black root and two red children.
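The insertion cases above can be sketched end-to-end. This is a hedged sketch under my own node layout and function names (not any particular library's API), covering case 1 (red root), case 2 (black parent), and the three consecutive-red scenarios 3a-3c:

```python
RED, BLACK = True, False

class Node:
    def __init__(self, key, color=RED, parent=None):
        self.key, self.color, self.parent = key, color, parent
        self.left = self.right = None

class RedBlackTree:
    def __init__(self):
        self.root = None

    def rotate(self, node, left):
        # rotate `node` left or right, keeping parent links consistent
        child = node.right if left else node.left
        if left:
            node.right = child.left
            if child.left: child.left.parent = node
            child.left = node
        else:
            node.left = child.right
            if child.right: child.right.parent = node
            child.right = node
        child.parent = node.parent
        if node.parent is None:
            self.root = child
        elif node.parent.left is node:
            node.parent.left = child
        else:
            node.parent.right = child
        node.parent = child

    def insert(self, key):
        parent, cur = None, self.root
        while cur:
            parent = cur
            cur = cur.left if key < cur.key else cur.right
        node = Node(key, parent=parent)   # every node is inserted red
        if parent is None:
            self.root = node
        elif key < parent.key:
            parent.left = node
        else:
            parent.right = node
        self._fix(node)

    def _fix(self, n):
        while True:
            p = n.parent
            if p is None:                 # case 1: n is the root, paint black
                n.color = BLACK
                return
            if p.color == BLACK:          # case 2: black parent, done
                return
            g = p.parent
            u = g.right if p is g.left else g.left   # the uncle
            if u and u.color == RED:      # case 3a: red parent and red uncle
                p.color = u.color = BLACK
                g.color = RED
                n = g                     # re-run enforcement from grandparent
                continue
            # case 3b: n and p lean in different directions; rotate to align
            if n is p.right and p is g.left:
                self.rotate(p, left=True)
                n, p = p, n
            elif n is p.left and p is g.right:
                self.rotate(p, left=False)
                n, p = p, n
            # case 3c: n and p lean the same way; rotate the grandparent
            self.rotate(g, left=(p is g.right))
            p.color, g.color = BLACK, RED
            return
```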
# Red-Black Tree Deletion
Deletion is handled similarly to deletion in BSTs, but is a lot more complicated because the tree has to be rebalanced if removing a node causes it to become unbalanced.
Every resource I looked at (books, sites, university slides) simply hand-waved the deletion process, presumably due to its complexity. The one place that managed to explain it somewhat well was the classic CLRS book, but its implementation consisted of a big, difficult-to-follow while-loop. Instead I decided to go with Wikipedia's long and dense explanation of its relatively simple implementation, which even the Linux kernel uses.
First, if the node to be deleted has two children, it is replaced by its successor. The successor then has to be deleted; by definition the successor has at most one non-leaf child, otherwise it would not be the minimum in that subtree and its left child would have been followed instead.
Second, if the node to be deleted has one child, simply replace the successor with its child.
Third, if the node to be deleted has no children, then it is possible to simply delete it.
# Red-Black Tree Balance
If the node is replaced with a successor, that successor is essentially removed from its original location, possibly causing the tree to become unbalanced. For this reason, the original successor node is removed using delete_one_child, which rebalances the tree if necessary.
 node M: the successor to the node to be deleted
 node C: M's child, prioritized to be a non-leaf child if possible
 node N: C in its new position
 node P: N's parent
 node S: N's sibling
 nodes S_L and S_R: S's left and right child respectively
First, if M is red, then simply replace it with its child C, which must be black by property 4 (both children of red nodes are black). Any paths that passed through the deleted node will simply pass through one fewer red node, maintaining balance.
Second, if M is black and C is red, paint C black and put it in M's place. This preserves the same number of black nodes along that path.
Third, the most complex case is when both M and C are black. Replacing one with the other effectively removes one black node along that path, unbalancing the tree. Begin by replacing M with its child C, then proceed to the first rebalancing case.
When both M and C are black, four situations ^{5} can arise that require rebalancing, unless C's new position N is the new root. If N becomes the root, it simply means that a black node was removed from all paths, effectively decreasing the black-height of every path by one, and the tree therefore requires no rebalancing.
First: N's sibling S is red. In this case, reverse the colors of P and S and rotate P left. Although all paths still have the same black-height, N's sibling is now black and its parent is red, allowing fall-through to case 4, 5, or 6.
Second: S and S's children are all black. Repaint S red so that all paths passing through S have the same black-height as those that go through N.
If P is red, then the tree is now violating property 4 (both children of red nodes are black); fix it by simply painting P black.
If P was already black, however, then after painting S red, P has effectively lost one level from its black-height, so case 1 should be applied to P.
Third: S is black, S's left child S_L is red, S_R is black, and N is the left child of its parent P. Rotate S right, then exchange the colors of S and its new parent. This case just prepares the tree for falling into case 6, since N now has a black sibling whose right child is red.
Fourth: S is black, S's right child S_R is red, and N is the left child of its parent P. Rotate P left, exchange the colors of P and S, and make S_R black.
This rebalances the tree by increasing the black-height of paths through N by one, because either P became black, or P was already black and S became N's black grandparent.
# Hash Tables
Hash tables consist of an array coupled with a hash function (such as MurmurHash or CityHash) and a collision resolution scheme, both of which help map the key to an index within the array.
Hash Tables can be used for deduplication, as well as keeping track of what states have already been seen in search algorithms, especially for those applications where it's not feasible to store all of the nodes.
In the 2-SUM problem, given an unsorted array A of n integers and a target sum t, we need to find whether there is a pair of integers in A that sums to t.
The brute-force approach is to check all possible pairs in A to see if they add up to the target t, which takes O(n²).
Alternatively, we can sort the array in O(n log n) and scan through it: for each element x, determine the required summand y = t − x, then look for y in the array using an O(log n) binary search. If y is found, then there's a match, i.e. x + y = t.
This can be improved further by using a hash table. Put each element of the array into a hash table, then for each element x in the array compute the required summand t − x and check if it is present in the hash table. If so, then there's a match. This brings the average running time down to O(n).
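A sketch of the hash-based approach, using a Python set as the hash table:

```python
def two_sum(numbers, target):
    """Return a pair (x, y) from `numbers` with x + y == target, else None."""
    seen = set()                    # hash table of elements seen so far
    for x in numbers:
        y = target - x              # the required summand
        if y in seen:
            return (y, x)
        seen.add(x)
    return None
```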
# Hash Functions
Hash functions need to be consistent, efficient, and should uniformly distribute the set of keys.
A popular and simple hash function is modular hashing of the form:

h(k) = k mod M

where k is the key and M is the array size, usually chosen to be prime; taking the remainder keeps the hash within array bounds and helps avoid integer overflow. Multiple pieces of data can be combined into one hash by doing:

hash = (R * hash + d) mod M

where R is a prime number such as 31, hash is the hash as constructed so far (initially set to some prime number), and d is the new piece of data.
For example, given three properties (day, month, and year) the following hash computation could be used:
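A sketch of such a computation; the initial value 17, the multiplier 31, and the table size M = 97 are arbitrary primes chosen for illustration:

```python
M = 97  # assumed table size; a prime

def hash_date(day, month, year):
    h = 17                        # arbitrary initial prime
    for piece in (day, month, year):
        h = (31 * h + piece) % M  # fold in one more piece of data
    return h
```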
Or to hash a given string:
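The same scheme applied to a string, folding in one character at a time:

```python
def hash_string(s, m=97):
    h = 17                          # arbitrary initial prime
    for ch in s:
        h = (31 * h + ord(ch)) % m  # fold in one character code at a time
    return h
```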
A simpler hashing scheme that doesn't account for integer overflow is:
So for example, given a day, month, and year:
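One possible sketch, weighting each field so that distinct dates map to distinct integers (in a language with fixed-width integers this could overflow; in Python it cannot):

```python
def simple_date_hash(day, month, year):
    # weight each field enough to keep distinct dates distinct
    return year * 10000 + month * 100 + day
```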
# Pathological Data Sets
It's possible to craft a pathological data set that can cause a denial of service attack on a hash table.
One way to mitigate this is to use a cryptographic hash function, which also has pathological data sets but it's less feasible to discover them.
Alternatively, design a family of hash functions and choose one randomly.
# Load Factor
The load factor is defined by α = N / M, where α is the fraction of table entries that are occupied, N is the number of objects in the hash table, and M is the number of buckets in the hash table.
Note that a load factor is still relevant in an open addressing scheme, in which case each bucket can only hold one value.
In linear probing, α can never be 1, because if the table became full, a search miss would go into an infinite loop. Instead, array resizing is performed to ensure that the load factor stays between 1/8 and 1/2.
The average number of compares, or probes, in a linear-probing hash table of size M with N = αM keys is approximately:

 ~ 1/2 (1 + 1/(1 − α)) for search hits
 ~ 1/2 (1 + 1/(1 − α)²) for search misses

Based on this, when α is about 0.5 there will be 1.5 compares for a search hit and 2.5 compares for a search miss, on average. For this reason, α should be kept under 0.5 through the use of array resizing.
# Separate Chaining
Case | Growth
--- | ---
Worst | O(n)
This collision resolution strategy involves storing a linked list at every entry in the array. The intent is to choose the size of the array large enough so that the linked lists are sufficiently short.
Separate chaining consists of a twostep process:
 hash the key to get the index to retrieve the list
 sequentially search the list for the key
A property of separate chaining is that the average length of the lists is always N/M in a hash table with M lists and N keys.
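The two-step process above can be sketched as a minimal separate-chaining table; it leans on Python's built-in `hash` and omits resizing:

```python
class ChainedHashTable:
    def __init__(self, m=11):
        self.buckets = [[] for _ in range(m)]   # one list per bucket

    def _bucket(self, key):
        # step 1: hash the key to get the index, retrieve the list
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)        # replace existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        # step 2: sequentially search the list for the key
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None
```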
# Double Hashing
Double hashing is a form of open addressing in which two hash functions are used. If the first hash function incurs a collision, then the result of the second hash function serves as an offset at which to try insertion. For example, if h_1(k) caused a collision and h_2(k) = 4, then it will try inserting at position h_1(k) + 4, then h_1(k) + 8, and so on.
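The probe sequence can be sketched generically; `h1` and `h2` here are caller-supplied hash functions, not any standard API:

```python
def probe_sequence(key, m, h1, h2):
    """Yield the table indices tried for `key` under double hashing."""
    start, step = h1(key) % m, h2(key) % m
    if step == 0:
        step = 1                    # the offset must be nonzero to make progress
    for i in range(m):
        yield (start + i * step) % m
```

With a prime table size, the step is coprime to m, so the sequence visits every slot.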
# Linear Probing
Case | Growth
--- | ---
Worst | O(n)
Linear probing is a form of open addressing that relies on empty entries in the array for collision resolution. Linear probing simply consists of:
 hash the key to get the index
 the element at the index determines three outcomes:
 if it's an empty position, insert the element
 if the position is not empty and the key is equal, replace the value
 if the key is not equal, try the next entry and repeat until it can be inserted
# Linear Probing Deletion
The insert and retrieval operations compute the index and then probe successive entries until a null entry is reached. This has the consequence that deleting a node cannot simply entail setting the entry to null, or it would prematurely stop the lookup of other keys.
As a result, after setting the entry to null, every key to the right of the removed key also has to be removed, i.e. set to null, and then reinserted into the hash table using the regular insertion operation.
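A sketch of linear probing with this deletion scheme; it uses a fixed-size table with no resizing, so it assumes the table never fills:

```python
class LinearProbingTable:
    def __init__(self, m=16):
        self.m = m
        self.keys = [None] * m
        self.vals = [None] * m

    def put(self, key, value):
        i = hash(key) % self.m
        while self.keys[i] is not None:
            if self.keys[i] == key:         # key present: replace the value
                self.vals[i] = value
                return
            i = (i + 1) % self.m            # try the next entry
        self.keys[i], self.vals[i] = key, value

    def get(self, key):
        i = hash(key) % self.m
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) % self.m
        return None

    def delete(self, key):
        i = hash(key) % self.m
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.m
        if self.keys[i] is None:
            return                          # key not present
        self.keys[i] = self.vals[i] = None
        # reinsert every key in the cluster to the right of the hole
        i = (i + 1) % self.m
        while self.keys[i] is not None:
            k, v = self.keys[i], self.vals[i]
            self.keys[i] = self.vals[i] = None
            self.put(k, v)
            i = (i + 1) % self.m
```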
# Sparse Vectors
An application of hash tables is to implement sparse vectors for the purpose of performing matrix-vector multiplications. In certain situations, the row vector from a matrix can have very few non-zero elements. If the matrix were stored in a naive array format, it would amount to an immense waste of space and computation.
Instead, sparse vectors are vectors backed by hash tables where the keys correspond to the index of a given element and the value corresponds to that element's value. This solution is used in Google's PageRank algorithm.
# Bloom Filters
Bloom filters are useful for remembering which values have been seen; they don't store the actual values or keys, so they use very little space. There are no deletions.
Membership lookups can yield false positives, but not false negatives. So a bloom filter can answer whether an element is possibly in the set or definitely not in the set.
For example, a bloom filter could be used to back a spell checker. All correctly spelled words are inserted into the bloom filter, then a word can be checked for correct spelling by checking if it's in the bloom filter. However, since there is a small possibility of a false positive, it may incorrectly determine that the word is correctly spelled even if it's not.
Bloom filters are often used in network routers for tasks such as keeping track of blocked IP addresses, the contents of a cache to avoid spurious lookups, and maintaining statistics to prevent denial of service attacks.
Bloom filters consist of a bitset of n bits, so each entry uses a small number of bits rather than storing the key itself. The bloom filter has k hash functions.
Insertion is accomplished by hashing the input with each of the hash functions and turning on the bit at that position, regardless of whether that bit was already on.
Lookup is accomplished by hashing the input with each hash function and checking if all of those positions are on.
Therefore, false positives come about when other insertions happen to have set all of the bits that the looked-up element hashes to.
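A small bloom filter sketch; deriving the k hash functions from salted SHA-256 digests is just one possible choice, and the sizes are arbitrary:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, k=3):
        self.n = n_bits
        self.k = k
        self.bits = [False] * n_bits

    def _positions(self, item):
        # derive k hash positions from salted digests of the item
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True       # set the bit, even if already on

    def might_contain(self, item):
        # True means "possibly in the set"; False means "definitely not"
        return all(self.bits[pos] for pos in self._positions(item))
```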
# Graphs
A graph is a set of vertices and a collection of edges that each connect a pair of vertices. This definition allows for selfloops (edges that connect a vertex to itself) and parallel edges (multiple edges connecting the same vertex pair).
Graphs with parallel edges are sometimes known as multigraphs, whereas graphs with no parallel edges or self-loops are simple graphs.
Two vertices connected by an edge are adjacent, and the edge is incident to both vertices. A vertex's degree is the number of edges connected to it. A subgraph is a subset of edges and associated vertices that still constitutes a graph.
Paths in graphs are sequences of vertices connected by edges. Simple paths have no repeated vertices. A path forms a cycle if its first and last vertices are the same, and a simple cycle if it contains no repeated edges or vertices other than those endpoints. The number of edges in a path determines its length.
A graph is connected if a path exists from every vertex to every other vertex. A graph that isn't connected consists of connected components which are connected subgraphs of the graph.
Acyclic graphs are graphs with no cycles. A tree is an acyclic connected graph, and a disjoint set of trees is a forest.
A graph G with V vertices is a tree if any of the following are satisfied:
 G has V − 1 edges and no cycles
 G has V − 1 edges and is connected
 G is connected, but removing a single edge disconnects it
 G is acyclic, but adding any edge creates a cycle
 exactly one simple path connects each pair of vertices in G
A spanning tree of a connected graph is a subgraph that contains all of the vertices as a single tree. A spanning forest of a graph is the union of all spanning trees of its connected components.
A graph's density is the proportion of possible pairs of vertices that are connected. A sparse graph has relatively few of the possible edges present compared to a dense one.
As a rule of thumb, a graph is considered sparse if its edge count is closer to the number of its vertices V, and dense if its edge count is closer to the number of vertices squared, V².
A bipartite graph is one whose vertices can be divided into two sets such that all edges connect a vertex in one set with a vertex in the other.
Oftentimes, the number of nodes/vertices is represented by V and the number of edges is represented by E.
Answers:
 is there a way to connect one item to another by following the connections?
 how many other items are connected to a given item?
 what is the shortest chain of connections between two items?
# Undirected Graphs
An undirected graph is one in which the connections don't have an associated direction. There are various data structures that can be used to represent graphs:
 adjacency matrix: a boolean array where row v and column w is set to true if vertices v and w are connected with an edge
 array of adjacency lists: a vertex-indexed array of lists of the vertices adjacent to each vertex, similar to hash tables with separate chaining
 array of edges: a collection of Edge objects each containing two instance variables for each of the connected vertices
Adjacency lists have the best balance between space and time performance. They have space usage proportional to V + E, constant time to add an edge, and time proportional to the degree of v to iterate through the vertices adjacent to v.
A connected undirected graph has a minimum of V − 1 edges, and any simple undirected graph has a maximum of V(V − 1)/2 edges.
# Depth-First Search
Depth-First Search (DFS) is a graph traversal algorithm that visits a vertex, marks that vertex as visited, then visits all unmarked adjacent vertices.
To trace the paths in the graph, an array of size V can be kept, indexed by a given vertex, whose value is the vertex that connects to it. This array of parent links represents a tree rooted at the source vertex.
# Breadth-First Search
Breadth-First Search (BFS) traversal aids in finding the shortest path between two vertices. Its basic operation consists of:
 enqueue the source vertex
 dequeue the current vertex
 mark and enqueue all adjacent vertices
 repeat steps 2 and 3 until the queue is empty
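The steps above can be sketched as a shortest-path search over an adjacency-list dict:

```python
from collections import deque

def bfs_path(graph, source, target):
    """Shortest path (fewest edges) from source to target in an
    adjacency-list graph, or None if the target is unreachable."""
    parent = {source: None}            # doubles as the marked set
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:                # reconstruct the path via parent links
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in graph[v]:
            if w not in parent:        # mark and enqueue adjacent vertices
                parent[w] = v
                queue.append(w)
    return None
```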
# Connected Components
Depth-First Search can also be used to find the connected components of a graph. This is accomplished by initiating DFS on every unmarked vertex, and each time it is called on a vertex, setting that vertex's connected component identifier.
A run of DFS finds, and thus marks, every vertex in a connected component. Upon completing such a run, a counter variable signifying the connected component identifier is incremented, and DFS is called on the next unmarked vertex in the graph, i.e. a vertex not in any connected component found so far.
Compared to Union-Find, the DFS approach is theoretically faster because it provides a constant-time query guarantee. However, in practice the difference is negligible, and Union-Find tends to be faster because it doesn't have to build a full representation of the graph. Perhaps more importantly, the DFS approach has to preprocess the graph by running DFS on the separate connected components. As a result, Union-Find is an online algorithm that can be queried even while new edges are added, without having to preprocess the graph again.
# Cycle Detection
DFS can also be used to determine if there are cycles present in a graph. This is accomplished by keeping track of the vertex previous to the one being focused on by the DFS. If one of the current vertex's neighbors is already marked and it is not the previous vertex, then there is an edge to an already marked vertex, forming a cycle.
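A sketch of this parent-tracking cycle check over an adjacency-list dict (which cannot represent parallel edges):

```python
def has_cycle(graph):
    """Detect a cycle in an undirected graph via DFS, tracking the
    vertex we arrived from."""
    marked = set()

    def dfs(v, previous):
        marked.add(v)
        for w in graph[v]:
            if w not in marked:
                if dfs(w, v):
                    return True
            elif w != previous:     # marked neighbor that isn't our parent
                return True
        return False

    return any(v not in marked and dfs(v, None) for v in graph)
```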
# Bipartite Detection
DFS can also be used to determine whether or not a graph is bipartite. Another way to frame the question is: can the vertices of the graph be assigned one of two colors such that no edge connects vertices of the same color?
This is accomplished by maintaining a vertex-indexed array that stores each vertex's color. As DFS traverses the graph, it alternates the color of every vertex it visits. The graph is assumed to be bipartite at the outset, and only if DFS encounters a marked vertex whose color is the same as the current vertex's does it conclude that the graph is not bipartite.
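A sketch of this two-coloring by DFS:

```python
def is_bipartite(graph):
    """Try to two-color an undirected adjacency-list graph via DFS."""
    color = {}                            # vertex -> 0 or 1

    def dfs(v):
        for w in graph[v]:
            if w not in color:
                color[w] = 1 - color[v]   # alternate the color
                if not dfs(w):
                    return False
            elif color[w] == color[v]:    # same color across an edge
                return False
        return True

    for v in graph:
        if v not in color:
            color[v] = 0
            if not dfs(v):
                return False
    return True
```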
# Directed Graphs
The edges in directed graphs have an associated oneway direction, such that edges are defined by an ordered pair of vertices that define a oneway adjacency. A directed graph (or digraph) is a set of vertices and a collection of directed edges, each connecting an ordered pair of vertices. The outdegree of a vertex is the number of edges pointing from it, while the indegree is the number of edges pointing to it.
The first vertex in a directed edge is the tail and the second vertex is the head. Edges are drawn as arrows pointing from tail to head, such as v → w.
Directed graphs can be represented by adjacency lists with the stricter property that if node w is present in the adjacency list corresponding to v, it simply means that there is a directed edge v → w, but not vice versa unless explicitly defined.
# Digraph Reachability
The same exact implementation of reachability testing by DFS used in undirected graphs can be used for digraphs, and can be expanded to allow for reachability testing from multiple sources, which has applications in regular expression matchers or mark-and-sweep garbage collection strategies, for example.
Mark-and-sweep garbage collection (GC) strategies typically reserve one bit per object for the purpose of garbage collection. The GC then periodically marks a set of potentially accessible objects by running digraph reachability tests on the graph of object references, then it sweeps through all of the unmarked objects, collecting them for reuse for new objects.
# Directed Cycle Detection
A digraph with no directed cycles is known as a directed acyclic graph (DAG). For this reason, checking a digraph for directed cycles answers the question of whether the digraph is a DAG.
Directed cycle detection is accomplished by maintaining a boolean array that marks whether a vertex is on the directed path currently being explored by DFS. If DFS encounters a vertex that is already marked in this array, it has found a directed cycle and can return the path from that vertex through the cycle back to itself. If no such cycle exists, the graph is a DAG.
Currency arbitrage can be discovered if the problem is modeled as a graph where the nodes are the different kinds of currency and the edge weights are the logarithm of the exchange rate. In this case, an instance of arbitrage is one where there is a cycle with positive weight.
# Topological Order
Topological sort puts the vertices of a digraph in an order such that all of its directed edges point from a vertex earlier in the order to a vertex later in the order. Three different orders are possible, accomplished by saving each vertex covered by the DFS in a queue or stack, depending on the desired order:
 preorder: put the vertex on a queue before the recursive calls
 postorder: put the vertex on a queue after the recursive calls
 reverse postorder, aka topological order: put the vertex on a stack after the recursive calls
This ability of DFS follows from the fact that DFS covers each vertex exactly once when run on digraphs.
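Reverse postorder, i.e. topological order, can be sketched as:

```python
def topological_order(digraph):
    """Reverse DFS postorder of a DAG given as an adjacency-list dict."""
    marked, postorder = set(), []

    def dfs(v):
        marked.add(v)
        for w in digraph.get(v, []):
            if w not in marked:
                dfs(w)
        postorder.append(v)        # postorder: record after the recursive calls

    for v in digraph:
        if v not in marked:
            dfs(v)
    return postorder[::-1]         # reversing the postorder yields topological order
```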
For example, consider an alien or unknown alphabet, and suppose we're given an array of words which are sorted according to the lexicographic order of that alphabet. In order to reconstruct, or extract, the lexicographic order of this unknown alphabet, first treat the lexicographic order simply as a "relationship". Graphs can model relationships, so start by creating a node for each character.
Information about the lexicographic order of the alphabet can be inferred from the sorted order of the input. Word A comes before word B because A mismatches with B at some character position i such that A[i] < B[i], by definition of a lexicographic sorted order.
What's necessary then is to determine the mismatching characters A[i] and B[i] for each pair of adjacent words in the input, and to establish a relationship between those two characters which denotes precedence, i.e. a directed edge A[i] → B[i] to mean that A[i] comes before B[i] in the alphabet.
Once this is all done, the topological order of the graph can be obtained to determine the full order of the alphabet.
# Strong Connectivity
Two vertices v and w are strongly connected if they are mutually reachable, i.e. there is a path from v to w and a path from w to v. Consequently, an entire digraph is strongly connected if all of its vertices are strongly connected to one another. Further, strong components are connected components of a graph whose vertices are all strongly connected to one another.
The Kosaraju-Sharir algorithm is able to find strongly connected components in digraphs in O(V + E). The algorithm operates as follows:
 given digraph G and its reverse digraph G^R, compute the reverse postorder of G^R
 run standard DFS on G on the vertices in the order generated by step 1
 all vertices visited on a single recursive DFS call from the constructor are a strong component, so identify them
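The steps above can be sketched over adjacency-list dicts; every vertex is assumed to appear as a key:

```python
def strong_components(digraph):
    """Kosaraju-Sharir: map each vertex to a strong-component id."""
    reverse = {v: [] for v in digraph}          # build the reverse digraph
    for v, ws in digraph.items():
        for w in ws:
            reverse[w].append(v)

    def dfs(graph, v, marked, postorder=None, component=None, cid=None):
        marked.add(v)
        if component is not None:
            component[v] = cid
        for w in graph[v]:
            if w not in marked:
                dfs(graph, w, marked, postorder, component, cid)
        if postorder is not None:
            postorder.append(v)

    # step 1: reverse postorder of the reverse digraph
    marked, postorder = set(), []
    for v in reverse:
        if v not in marked:
            dfs(reverse, v, marked, postorder)

    # step 2: DFS on the original digraph in that order, labeling components
    marked, component, cid = set(), {}, 0
    for v in reversed(postorder):
        if v not in marked:
            dfs(digraph, v, marked, component=component, cid=cid)
            cid += 1
    return component
```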
The algorithm can answer the following questions:
 are two given vertices strongly connected?
 how many strong components does the digraph contain?
The algorithm can be understood by considering a kernel DAG, or condensation digraph, associated with each digraph, formed by collapsing all vertices in each strong component to a single vertex. This DAG can then be put into reverse topological order. Remember that reverse postorder of a DAG is equivalent to topological sort.
The algorithm begins by finding a vertex that is in a sink component of the kernel DAG. A sink component is one that has no edges pointing from it. Running DFS from this vertex only visits the vertices in that component. DFS then marks the vertices in that component, effectively removing them from further consideration in that digraph. It then repeats this by finding another sink component in the resulting kernel DAG.
The first vertex in a reverse postorder of G is in a source component of the kernel DAG, whereas the first vertex in a reverse postorder of the reverse digraph G^R is in a sink component of the kernel DAG.
# All-Pairs Reachability
All-Pairs reachability asks: given a digraph, is there a directed path from a given vertex v to another given vertex w? This can be answered by creating a separate graph representation known as the transitive closure, which allows straightforward checking of which vertices are reachable from a given vertex.
The transitive closure of digraph G is another digraph with the same set of vertices, but with an edge from v to w in the transitive closure if and only if w is reachable from v in G. Transitive closures are generally represented as a matrix of booleans where the entry at row v and column w is true if w is reachable from v in the digraph.
Finding the transitive closure of a digraph can be accomplished by running DFS from every vertex of the digraph and storing the resulting reachability array for each vertex from which DFS was run. However, this can be impractical for large graphs because it uses space proportional to V² and time proportional to V(V + E).
# Dynamic Connectivity
Answers: is a pair of nodes connected?
Data Structure: an array, indexed by any given site, whose value is the component that site is a part of: id[site] = component. All sites are initially set to be members of their own component, i.e. id[5] = 5.
General Flow: sites are all partitioned into singleton sets. Successive union() operations merge sets together. The find() operation determines if a given pair of sites are from the same component.
A site is an element or node in a disjoint set. The disjoint set is known as a component, which typically models a set or graph. Two sites are connected if they are part of the same component.
# Quick-Find
Operation | Growth
--- | ---
Find | O(1)
Union | O(n)
This algorithm favors a quick find() operation by sacrificing the performance of the union() operation.
Union operates as follows:
 of the two sites p and q, arbitrarily choose one to merge under the other
 get the associated components of p and q
 go through the whole array, setting sites which were part of p's component to now be part of q's
 decrement the number of components in the disjoint-set
# Quick-Union
Operation | Growth
--- | ---
Find | O(n)
Union | O(n)
This algorithm aims to speed up the union() operation by avoiding the need to go through the whole array to change the component of every affected site.
This is accomplished by creating a treelike relationship between sites. With a tree representation, sites are added as direct leaves to the root node of the component to which they were merged.
As a result of this, the find() operation needs to walk up the tree from any given site to find the root node, which designates the component to which the given site belongs. The walk is terminated when it encounters a site whose parent is itself.
# Weighted Quick-Union
Operation | Growth
--- | ---
Find | O(log n)
Union | O(log n)
The problem with vanilla Quick-Union is that the trees are merged arbitrarily. This can cause bad performance depending on which tree is merged under the other.
Given the arbitrary form in which components are merged in Quick-Union, input of the form 0-1, 0-2, 0-3, ... 0-N can have worst-case effects:
 0-1 can connect component 0 under component 1
 0-2 can connect component 1 under component 2
 0-3 can connect component 2 under component 3
This input eventually creates a linked list, where the deepest node in the tree incurs the cost of having to traverse the entire list of sites before determining the component to which it belongs.
Weighted Quick-Union fixes this by keeping track of each component's size in a separate array. With this information, it then chooses to merge the smaller component under the larger one.
In the example above, by step 2, component 1 is size 2, so component 2, being size 1, is merged under component 1 and not the other way around.
# Path Compression
Operation | Growth
--- | ---
Union | ~O(1) amortized
A further improvement called path compression can be made, in which every site traversed during a call to find() is linked directly to the component root.
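Weighted quick-union with path compression might be sketched as:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))   # each site starts in its own component
        self.size = [1] * n

    def find(self, p):
        root = p
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[p] != root:  # path compression: relink to the root
            self.parent[p], p = root, self.parent[p]
        return root

    def union(self, p, q):
        rp, rq = self.find(p), self.find(q)
        if rp == rq:
            return
        if self.size[rp] < self.size[rq]:  # merge the smaller under the larger
            rp, rq = rq, rp
        self.parent[rq] = rp
        self.size[rp] += self.size[rq]

    def connected(self, p, q):
        return self.find(p) == self.find(q)
```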
# Minimum Cut
Adding an edge to a tree creates a cycle, and removing an edge from a tree breaks it into two separate subtrees. Knowing this, a cut of a graph is a partition of its vertices into two nonempty disjoint sets, joined by crossing edges. A graph with n vertices has 2^n cuts, because each vertex has two choices as to which set it's placed in, left or right, i.e. n blanks to be filled with one of two values.
A minimum cut (mincut) is the cut with the fewest number of crossing edges, with parallel edges allowed, i.e. edges which connect the same vertices. Mincuts are useful for identifying weaknesses in networks (i.e. hotspots), identifying tightlyknit communities in social networks, and image segmentation.
The minimum cut can (potentially) be obtained through a randomized algorithm known as random contraction. It works by, as long as more than 2 vertices remain in the graph, picking a random remaining edge and merging or "contracting" its two endpoints into a single vertex, removing any self-loops that result. When only 2 vertices remain, the cut they represent is returned.
It's possible that random contraction will not find the minimum cut. This is mitigated by running it a large number of times, since it is very fast, and returning the smallest cut found. The largest number of min-cuts that a graph with N vertices can have is N(N - 1)/2, i.e. N choose 2.
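A hedged sketch of random contraction, tracking a group label per vertex instead of physically merging vertices (the trial count is an arbitrary choice):

```python
import random

def random_contraction(edges, n_vertices, trials=200):
    """Return the smallest cut found over many contraction runs."""
    best = None
    for _ in range(trials):
        # Each vertex starts in its own group; contracting an edge merges groups.
        group = list(range(n_vertices))
        remaining = n_vertices
        edge_list = list(edges)
        while remaining > 2:
            u, v = random.choice(edge_list)
            gu, gv = group[u], group[v]
            group = [gu if g == gv else g for g in group]
            remaining -= 1
            # Drop self-loops created by the contraction.
            edge_list = [(a, b) for a, b in edge_list if group[a] != group[b]]
        cut = len(edge_list)
        if best is None or cut < best:
            best = cut
    return best

# Two triangles joined by a single bridge edge: the min-cut is 1.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
assert random_contraction(edges, 6) == 1
```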
# Minimum Spanning Trees
An edge-weighted graph is a graph where the edges have associated weights or costs. Edge-weighted graphs can be represented with adjacency lists containing edge objects which contain the two vertices, one of which is the index of the adjacency list, as well as the weight for that edge.
A spanning tree is a connected subgraph with no cycles that includes all of the vertices in the graph. A minimum spanning tree (MST) is a spanning tree whose weight (the sum of all of its edges' weights) is no larger than the weight of any other spanning tree for that graph.
# Prim's Algorithm
| Case | Growth |
|:-----|:-------|
| Worst | O(E lg E) |
| Space | O(E) |
This method of finding the MST operates by attaching a new edge to a growing tree at each step. It starts with any vertex from the graph as a single-vertex tree, then at each step takes the minimum-weight edge that connects a vertex on the tree to a vertex not yet on the tree.
The vertices in the tree being built are represented using a vertexindexed boolean array where an element is set to true if the vertex is in the tree. The edges in the tree can be represented with a queue that collects edges or a vertexindexed array of edge objects. Crossing edges are held in a minimum priority queue, making the operation of choosing the edge with the lowest weight particularly straightforward.
The act of adding an edge to the tree corresponds to adding a vertex to the tree. When this occurs, all edges from the newly added vertex to all vertices not in the tree must be added to the crossing-edges priority queue. Furthermore, any edges previously in the priority queue that connected the newly added vertex to a vertex already in the tree become ineligible (otherwise they would create a cycle) and should be ignored or removed.
Instead of storing edges in the priority queue, it's faster to store the not-yet-spanned vertices that sit on the other end of edges crossing the cut. If a newly visited vertex has an incident edge that reaches such a vertex more cheaply than the edge already in the heap for it, that heap entry must be replaced with the new, shorter edge. This way, the shortest crossing edge is always at the top of the heap.
This is very similar to what is done in Dijkstra's algorithm.
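A compact sketch of the lazy version just described, using Python's heapq as the minimum priority queue (the adjacency-list shape is an assumption):

```python
import heapq

def lazy_prim_mst(adj, n):
    """adj maps each vertex to a list of (weight, v, w) edge tuples."""
    marked = [False] * n
    mst, pq = [], []

    def visit(v):
        marked[v] = True
        for edge in adj[v]:
            # Only edges crossing from the tree to a non-tree vertex are eligible.
            if not marked[edge[2]]:
                heapq.heappush(pq, edge)

    visit(0)
    while pq and len(mst) < n - 1:
        weight, v, w = heapq.heappop(pq)
        if marked[v] and marked[w]:
            continue  # ineligible: both endpoints already in the tree
        mst.append((weight, v, w))
        visit(w if not marked[w] else v)
    return mst

# Square graph with two extra heavy edges; MST weight should be 1 + 1 + 2 = 4.
adj = {0: [], 1: [], 2: [], 3: []}
for weight, v, w in [(1, 0, 1), (1, 1, 2), (2, 2, 3), (5, 0, 3), (4, 0, 2)]:
    adj[v].append((weight, v, w))
    adj[w].append((weight, w, v))
assert sum(e[0] for e in lazy_prim_mst(adj, 4)) == 4
```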
# Eager Prim's Algorithm
| Case | Growth |
|:-----|:-------|
| Worst | O(E lg V) |
| Space | O(V) |
The above implementation is lazy about removing ineligible edges from the priority queue: it leaves them there until they're dequeued for consideration, discarding those that turn out to be ineligible.
By contrast, an eager approach would make sure those edges aren't present in the priority queue from the beginning. The eager version of Prim's algorithm uses two vertex-indexed arrays:
- an array for the shortest edges to vertices which are reachable from the tree within one edge
- an array for the weight of the shortest edge stored in the aforementioned array
For each vertex present in the above arrays, the vertex index associated with its weight is stored in an indexed minimum priority queue, such that removing the minimum weight returns the associated vertex index. The implication of maintaining the priority queue this way is that given the next minimum-weight crossing edge returned by the priority queue, its associated vertex is the next one to add to the MST.
An improvement over the lazy implementation is that the eager implementation uses space proportional to V, whereas the lazy implementation uses E.
# Kruskal's Algorithm
| Case | Growth |
|:-----|:-------|
| Worst | O(E lg E) |
| Space | O(E) |
An alternative method for finding the MST is to process the edges in increasing order of their weight values, each time taking an edge for the MST that doesn't form a cycle, stopping once V - 1 edges have been aggregated. The edges form a forest of trees, gradually growing into a single tree (the MST). The algorithm can be thought of as starting with a forest of single-vertex trees, and on each step finding an edge to connect two trees until there is only one left (the MST).
The implementation uses a priority queue of edges based on their weight, a union-find data structure to identify potential cycles, and a queue to collect edges for the MST.
Despite the simplicity of Kruskal's algorithm, it is generally slower than Prim's because it has to check whether an edge's endpoints are already connected, using the union-find data structure, for each edge considered for the MST.
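A sketch of Kruskal's algorithm along these lines, with a small inline union-find for cycle detection (sorting stands in for the priority queue):

```python
def kruskal_mst(edges, n):
    # Union-find parent array; detects whether an edge would form a cycle.
    parent = list(range(n))

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    mst = []
    for weight, v, w in sorted(edges):     # process edges by increasing weight
        rv, rw = find(v), find(w)
        if rv != rw:                       # endpoints in different trees: no cycle
            parent[rv] = rw
            mst.append((weight, v, w))
        if len(mst) == n - 1:              # stop once V - 1 edges are aggregated
            break
    return mst

edges = [(1, 0, 1), (1, 1, 2), (2, 2, 3), (5, 0, 3), (4, 0, 2)]
assert sum(e[0] for e in kruskal_mst(edges, 4)) == 4
```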
# Shortest Paths
The shortest path from vertex s to vertex t in an edge-weighted digraph is a directed path from s to t such that no other such path has a lower weight. A shortest-path tree (SPT) for a source vertex s is a subgraph containing s and all the vertices reachable from s that forms a directed tree rooted at s, such that every path in it is a shortest path in the digraph.
Edge relaxation refers to replacing the existing edge that reaches a vertex w with a new edge if the new edge makes the path from the source vertex to w lower in cost than it was previously.
Vertex relaxation is similar to edge relaxation except that it relaxes all of the edges pointing from a given vertex.
# Dijkstra's Algorithm
| Case | Growth |
|:-----|:-------|
| Worst | O(E lg V) |
| Space | O(V) |
Dijkstra's algorithm is similar to Prim's algorithm for finding the MST. Dijkstra's algorithm finds the SPT by finding the lowest-weight non-tree vertex, as provided by an indexed minimum priority queue, and relaxing that vertex.
Dijkstra's algorithm requires that edge weights be non-negative.
To specifically find the shortest path from the source vertex to an arbitrary vertex, simply terminate the search as soon as the target vertex comes off of the priority queue.
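A sketch of Dijkstra's algorithm with the early-exit behavior described above (a plain heapq with stale-entry skipping stands in for an indexed priority queue):

```python
import heapq

def dijkstra(adj, source, target=None):
    """adj maps each vertex to a list of (w, weight) pairs; weights non-negative."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, v = heapq.heappop(pq)
        if v == target:
            return d                       # early exit once the target is dequeued
        if d > dist.get(v, float('inf')):
            continue                       # stale entry; a shorter path was found
        for w, weight in adj[v]:
            nd = d + weight                # relax edge v -> w
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                heapq.heappush(pq, (nd, w))
    return dist

adj = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
assert dijkstra(adj, 0)[3] == 4            # 0 -> 2 -> 1 -> 3
assert dijkstra(adj, 0, target=3) == 4
```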
# Topological Sort
| Case | Growth |
|:-----|:-------|
| Worst | O(E + V) |
| Space | O(V) |
Shortest paths can be found much more efficiently in acyclic graphs: the single-source problem can be solved in linear time, negative edge weights are easily handled, and related problems such as finding longest paths become solvable. This is possible by relaxing vertices in topological order.
This approach can be used for finding the longest path between two vertices in a DAG, accomplished by creating a copy of the DAG and negating the weight of every edge; the shortest path in the negated copy corresponds to the longest path in the original.
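A sketch of vertex relaxation in topological order (the order is assumed to be given; computing it is a separate step):

```python
def acyclic_shortest_paths(adj, order, source):
    """Relax vertices in topological order; handles negative edge weights."""
    dist = {v: float('inf') for v in order}
    dist[source] = 0
    for v in order:                        # topological order guarantees each
        for w, weight in adj[v]:           # vertex is relaxed only after every
            if dist[v] + weight < dist[w]: # path into it has been finalized
                dist[w] = dist[v] + weight
    return dist

adj = {0: [(1, 5), (2, 3)], 1: [(3, -2)], 2: [(3, 4)], 3: []}
dist = acyclic_shortest_paths(adj, [0, 1, 2, 3], 0)
assert dist[3] == 3   # 0 -> 1 -> 3 beats 0 -> 2 -> 3 thanks to the negative edge
```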
# Parallel Job Scheduling
The critical path method for parallel job scheduling consists of encoding the constraints of the scheduling problem in a DAG. Both a source vertex and a sink vertex are created on either end of the graph. Jobs are encoded in the graph as a pair of nodes connected by an edge whose weight corresponds to that job's duration. For each precedence constraint between jobs v and w, add a zero-weight edge from v's end vertex to w's start vertex. Finally, add a zero-weight edge from the source to every job's start vertex and from every job's end vertex to the sink.
When the scheduling problem is encoded in this manner, it can be solved by scheduling each job at the time corresponding to its longest path from the source vertex.
Relative deadlines can be encoded as a negative weighted edge going from the constrained job (vertex) to the job (vertex) which the deadline is relative to. However, relative deadlines can quickly make solutions infeasible with the aforementioned algorithms (Dijkstra's and Acyclic Shortest Paths).
# Bellman-Ford Algorithm
| Case | Growth |
|:-----|:-------|
| Worst | O(EV) |
| Average | O(E + V) |
| Space | O(V) |
The problem of finding the shortest paths can be generalized to graphs containing negative cycles. The Bellman-Ford algorithm accomplishes this by adding the source vertex to a queue and entering a loop where a vertex is dequeued and relaxed, and any vertex affected by that relaxation gets enqueued.
A negative cycle is a directed cycle with net negative weight. No shortest path between s and t can contain a vertex that lies on a negative cycle; otherwise the weight of the path could be made arbitrarily low and a shortest path would "never" be achieved.
To prevent the Bellman-Ford algorithm from looping infinitely due to negative cycles, it has to ensure that it terminates after V passes, either by keeping track with a counter or by detecting negative cycles within a subgraph.
If the queue is not empty after V passes through each edge, then there is a negative cycle. By extension, if a negative cycle is present in a graph, the Bellman-Ford algorithm can end up in an infinite loop, continuously lowering the weight of each affected path.
This is mitigated by periodically checking for negative cycles during relaxation, e.g. on every V-th call to relax(). On every such interval, a cycle finder is run on the subgraph denoted by the edges so far considered by Bellman-Ford.
# Constraint Satisfaction Problems
Constraint Satisfaction Problems (CSP) ^{6} are a special subset of search problems where the state is defined by variables with corresponding values from a domain (which may depend on the variable), and the goal test is a set of constraints specifying the allowable combinations of values for the variables. A solution in this case is simply an assignment to all variables which satisfies the constraints.
Example problems that may be modeled as CSPs are map coloring, N-Queens, and Sudoku. Map coloring consists of coloring in different regions in a map such that their bordering regions don't have the same color. In this case, the variables would be the individual regions and the domain would consist of the possible set of colors, e.g. {red, green, blue}. The constraints could then be modeled implicitly in the form Region1 ≠ Region2 where Region2 borders Region1, or by explicitly specifying every legitimate configuration.
N-Queens looks for a possible configuration of an N×N chess board with N queens on it such that there is one queen on each row and none of them threaten each other, i.e. they cannot be on the same row, column, or diagonal. This problem can be modeled so that there is one variable for each queen taking on a value from the domain which corresponds to the column the queen is on. The constraints can be modeled implicitly: no two queens share a column, and no two queens share a diagonal.
# Backtracking Search
| Case | Growth |
|:-----|:-------|
| Worst | O(d^n) |
Given a state tree of the constraint satisfaction problem, all of the solutions would be at the bottom, so BFS would experience the worst case. DFS with its backtracking gets to the bottom quicker, but it must be adapted to the context of CSPs in order to be optimal.
This adaptation is known as backtracking search. Backtracking search only considers one variable at a time and checks the constraints at each step, so that only values that don't conflict with previous assignments are considered. Backtracking naturally occurs if there are no more successors. A naive implementation of this, that will be optimized later, follows:
1. start with an empty solution
2. if the solution is complete, return it
3. select an unassigned variable
4. try giving it a value from its domain that hasn't been tried:
    - if there are no more values in the domain, return failure (no successors). This goes back to the previous variable, i.e. backtracking, so that it may try another value for it (and backtrack again if there are no more)
    - if the value satisfies the constraints, set it
    - recurse starting at step 2 and get its result
    - if the result didn't fail, return it
    - otherwise unset the variable and go to step 4 to try another value
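The steps above can be sketched as follows, here instantiated for a small map-coloring problem (the problem instance and names are illustrative):

```python
def backtracking_search(variables, domains, neighbors, assignment=None):
    assignment = assignment or {}
    if len(assignment) == len(variables):
        return assignment                      # the solution is complete
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # Only consider values that don't conflict with previous assignments.
        if all(assignment.get(n) != value for n in neighbors[var]):
            assignment[var] = value
            result = backtracking_search(variables, domains, neighbors, assignment)
            if result is not None:
                return result
            del assignment[var]                # unset and try the next value
    return None                                # no successors: backtrack

# Three mutually bordering regions need three distinct colors.
variables = ['A', 'B', 'C']
domains = {v: ['red', 'green', 'blue'] for v in variables}
neighbors = {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}
solution = backtracking_search(variables, domains, neighbors)
assert len({solution[v] for v in variables}) == 3   # all colors distinct
```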
This algorithm can be optimized further by ordering the variables in a specific way, filtering out values from domains as other variables are set in order to detect failure earlier, and exploiting the problem's structure.
Forward checking keeps track of domains for unassigned variables and removes from them values which would violate a constraint when added to the existing assignment. This is done whenever a new variable is assigned. For example, in a map coloring problem, if the domain is {red, green, blue} and Region1 is set to red, then red would be removed from the domain of Region2 which borders it, since setting Region2 to red would violate the constraints.
Constraint propagation takes this further by propagating these effects farther, in order to detect potential failures earlier. This is done by having a notion of an arc which leads from other variables on the constraint graph to the variable in question, so that the head of the arc is the variable in question and the tail is the other variable. Then it is said that a given arc is consistent iff for every value x in the tail's domain, there is some value y in the head's domain which could be assigned without violating the constraint.
Forward checking uses this concept so that, when a new variable is assigned, arc consistency is enforced for each variable by removing values from their domains which would otherwise make them inconsistent. Naturally, when a value is removed from a variable's domain, all neighbors of that variable (incoming arcs) have to be re-enforced. Arc consistency is run after every assignment in backtracking search.
The algorithm, known as the AC-3 algorithm, for enforcing arc consistency follows (specifically for binary CSPs, where there are at most two variables per constraint):
1. create a queue containing all of the arcs in the CSP
2. while the queue is not empty:
    - retrieve an arc from the queue
    - for each value x in the tail's domain:
        - if no value y in the head's domain satisfies the constraints given x, delete x from the tail's domain
    - if any values were removed, add an arc to the queue for each neighbor (i.e. each incoming arc)
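A sketch of AC-3 for a binary CSP, with the constraint given as a predicate over (tail value, head value) pairs (this representation is an assumption of the sketch):

```python
from collections import deque

def ac3(domains, constraint, arcs):
    """Enforce arc consistency in place. Arcs are (tail, head) pairs;
    constraint(x, y) says whether tail value x is compatible with head value y."""
    queue = deque(arcs)
    while queue:
        tail, head = queue.popleft()
        removed = False
        for x in list(domains[tail]):
            # x survives only if some y in the head's domain is compatible.
            if not any(constraint(x, y) for y in domains[head]):
                domains[tail].remove(x)
                removed = True
        if removed:
            # Re-enqueue every arc coming into the tail.
            queue.extend((t, h) for t, h in arcs if h == tail)
    return domains

# Two variables constrained to be unequal; one is already fixed to 1.
domains = {'X': [1], 'Y': [1, 2]}
ac3(domains, lambda x, y: x != y, [('X', 'Y'), ('Y', 'X')])
assert domains['Y'] == [2]
```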
Variable Ordering refers to optimizing by prioritizing some variables over others. Minimum Remaining Values (MRV) consists of prioritizing variables which have the fewest legal values left in their domain. This is so that, if backtracking becomes necessary, the amount of backtracking will be much less.
Value Ordering refers to optimizing by prioritizing certain values in a domain. Least Constraining Value refers to choosing the value which rules out the fewest values in the remaining variables. Knowledge of this may require rerunning filtering.
# K-Consistency
There are increasing degrees of consistency. For example, 1-Consistency (Node Consistency) is when each single variable's (node's) domain has a value which meets that node's unary constraints. 2-Consistency (Arc Consistency) is when, for each pair of nodes, any consistent assignment to one variable can be extended to the other. K-Consistency is the generalized notion where, for each set of k nodes, any consistent assignment to k - 1 of the variables can be extended to the k-th node, i.e. whatever is done at the tail variables can be extended to the head.
Strong K-Consistency requires that all of the lower orders of consistency are also satisfied, e.g. k - 1, k - 2, etc. This would mean that the CSP could be solved without backtracking, since the constraints could be enforced further and further until the entire constraint graph is enforced. Naturally this is very difficult to accomplish, though a good middle ground is k = 3, referred to as path consistency.
# Tree-Structured CSPs
| Case | Growth |
|:-----|:-------|
| Worst | O(nd^2) |
The CSP can be solved much faster if there are no cycles in the constraint graph, specifically linear in the size of the graph and quadratic in the size of the domains.
The tree must first be reordered by choosing a root variable so that all parents precede children by replacing the undirected connections with directed connections. Once the constraint graph is structured in this manner, the algorithm is simple:
1. all nodes are traversed one level at a time, starting at the lowest level and going towards but not including the root
2. for a given node, its incoming arc's consistency is enforced
3. set all of the nodes starting at the root; each node is guaranteed by step 2 to have at least one valid value
# Cutset Conditioning
| Case | Growth |
|:-----|:-------|
| Worst | O(d^c (n - c)d^2) |
The previous optimization only applies to tree-structured CSPs, but not all problems are tree-structured. However, sometimes a constraint graph can easily be converted into a tree-structured CSP by removing a particular set of nodes. This is accomplished by setting the value of each removed variable and then severing the connections to its neighbors, imposing an additional unary constraint on the neighbors reflecting the value the node was set to, essentially removing the now-invalid values from the domains of the neighbors.
Cutset conditioning is an algorithm that accomplishes this transformation, which essentially works by instantiating (in all ways) a set of variables so that the remaining constraint graph is a tree.
1. choose a cutset
2. instantiate the cutset in all possible ways
3. compute the residual CSP by removing the instantiated nodes and replacing their constraints with smaller constraints over the remaining neighboring variables (NP-Hard)
4. solve the residual tree-structured CSP
# Iterative Algorithms
Iterative algorithms begin with a constraint graph where every variable is set to a value, whether or not the value satisfies the constraints.
- while not solved:
    - select a conflicted variable
    - choose a new value via the min-conflicts heuristic: the value that violates the fewest constraints (i.e. hill climb with h(n) = total number of violated constraints)
This approach to CSP solving is very performant for randomly-generated CSPs, particularly when there are many variables but few constraints or vice versa, but not when both quantities sit in a critical middle range.
# Strings
Strings have special properties which necessitate more efficient algorithms for sorting and searching. Other subjects concerning strings include tries, regular expressions, and data compression.
# String Rotation
A string such as "house" can be rotated as "sehou". A quick way to determine whether a string is a rotation of another is to concatenate the original string to itself and then look for the rotated string within it. For example, the concatenation would be "househouse", which clearly contains "sehou" as a substring.
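The doubling trick reduces to a one-line membership test:

```python
def is_rotation(original, candidate):
    # A rotation of `original` always appears inside `original` doubled.
    return (len(original) == len(candidate)
            and candidate in original + original)

assert is_rotation("house", "sehou")
assert not is_rotation("house", "horse")
```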
# String Sorting
Certain properties of strings and alphabets can make for more efficient sorting algorithms for strings.
# Counting Sort
Counting sort, also known as key-indexed counting, essentially involves computing a histogram of the number of occurrences of each character, then regenerating the array in sorted order using that information.
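A sketch of key-indexed counting over single characters (the radix of 256 assumes extended-ASCII keys):

```python
def counting_sort(keys, radix=256):
    # Histogram of occurrences, offset by one so the prefix sums
    # below directly yield each key's starting index.
    count = [0] * (radix + 1)
    for k in keys:
        count[ord(k) + 1] += 1
    # Convert counts to starting indices (key ranges).
    for r in range(radix):
        count[r + 1] += count[r]
    # Distribute keys into their sorted positions.
    out = [None] * len(keys)
    for k in keys:
        out[count[ord(k)]] = k
        count[ord(k)] += 1
    return out

assert counting_sort(list("sortexample")) == sorted("sortexample")
```

Note that the distribution pass is stable, which is exactly the property LSD sort relies on when it runs one counting-sort pass per character position.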
# Least Significant Digit Sort
| Case | Growth |
|:-----|:-------|
| Worst | O(NW) |
| Space | O(N + R) |
Least Significant Digit (LSD) sort works by sorting the strings based on the last character and then repeating this operation up until the first character. This is accomplished by modifying the counting sort algorithm so that it does a pass for every character position in the string. This is mainly useful when all strings have the same length W and the alphabet size R is relatively small.
# Most Significant Digit Sort
| Case | Growth |
|:-----|:-------|
| Best | O(N) |
| Worst | O(NW) |
| Space | O(N + WR) |
Most Significant Digit (MSD) sort is similar to LSD except that it operates in lefttoright order instead, meaning it works fine for variablelength strings. This is accomplished by performing counting sort to sort the array of strings based on their first character, then recursively performing the same operation on the subarray of strings with the same first letter.
Because MSD works lefttoright and strings may be of variable length, the possibility of reaching the end of the string requires special handling. This is solved by observing the fact that a smaller string that is a prefix of larger string should naturally come before it in lexicographically sorted order. For example, sea should come before seashore.
This order is maintained by keeping a separate count of such strings that have had all of their characters sorted; this count is held at count[1]. A string has had all of its characters sorted if the character position currently being sorted is past the length of the string currently being considered. Once the counts are converted to key ranges, such strings will naturally be inserted at the beginning of the sorted subarray.
On each recursion of the sorting operation, an array for counts is allocated whose size is proportional to the alphabet size, occurrences are counted, transformed to key ranges, and so on. The point is that these operations can come to dominate the sorting operation, which makes having a cutoff for small subarrays crucial. After the cutoff, insertion sort takes over, with the slight modification that it only operates on the character position onward.
# Three-way String Quicksort
| Case | Growth |
|:-----|:-------|
| Best | O(N) |
| Worst | O(NW) |
| Space | O(W + lg N) |
Three-way quicksort can be adapted to work on a per-character basis similar to MSD. The advantages of this are that the algorithm doesn't use extra space (unlike MSD) and that the number of subarrays per recursion is bounded at three.
A direct result of only splitting into three subarrays is that more data movements are required to get elements into their correct position compared to MSD. However, three-way quicksort's three-way splits adapt well to handling equal keys, keys with long common prefixes, keys that fall into a small range, and small subarrays.
Research has shown that no sorting algorithm can beat three-way string quicksort by more than a constant factor.
# Tries
Trie structures exploit string properties to provide much faster string search, with hits taking time proportional to the length of the key and where misses require examining only a few characters.
The structure of tries is comprised of a tree where every node has R links, where R is the size of the alphabet. Every node also has an associated label corresponding to the character value consumed to reach the node. The root node has no such label, as there is no link pointing to it. Every node also has an associated value corresponding to the value associated with the key denoted by the path ending at that particular node.
A search hit occurs when the trie search arrives at the final node and that node's value is not empty. A search miss occurs either if the final node's value is empty or if the search terminated on a null link.
Trie insertion simply consists of searching for the key and setting the value. If the key does not already exist, then create nodes for every character not yet in the trie.
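A sketch of trie search and insertion, using a dictionary per node in place of an R-way link array:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # dict standing in for an R-way link array
        self.value = None

def trie_put(root, key, value):
    node = root
    for ch in key:
        # Create nodes for characters not yet in the trie.
        node = node.children.setdefault(ch, TrieNode())
    node.value = value

def trie_get(root, key):
    node = root
    for ch in key:
        if ch not in node.children:
            return None      # search terminated on a null link: miss
        node = node.children[ch]
    return node.value        # None here means the final node is empty: miss

root = TrieNode()
trie_put(root, "sea", 1)
trie_put(root, "seashore", 2)
assert trie_get(root, "sea") == 1
assert trie_get(root, "seash") is None
```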
Tries also allow operations for collecting keys with a common prefix. This is accomplished by finding the node at the end of the prefix's path and then recursively visiting every node beneath it, enqueueing any node that has a non-empty value.
This can also be modified to allow wildcard pattern matches; for example, keys that match fin. could include fine, find, etc.
# Trie Deletion
Deletion is a straightforward process in tries, simply involving finding the node and emptying its value. If this operation makes the node's parent's children all be null, then the same operation must be run on the parent.
# Ternary Search Trees
Ternary Search Trees (TSTs) seek to avoid the excessive space cost of regular Rway tries demonstrated above. TSTs are structured such that each node has only three links for characters less than, equal to, and greater than the node.
R-way tries can provide the fastest search, finishing the operation with a constant number of compares. However, space usage increases rapidly with larger alphabets, in which case TSTs are preferable, sacrificing the constant number of compares for a logarithmic number of compares.
Insertion is similar to insertion with tries, except that only one of three links can be taken, instead of R links.
# Substring Search
Searching for a string within another string is a very common operation that can also benefit from exploiting certain properties of strings.
# BruteForce Substring Search
The most straightforward approach is a bruteforce algorithm where every character in the text is checked to see if the pattern's first character matches, and if so, checks to see if the second character in the pattern matches, and so on.
If any character in the pattern mismatches during this check, the pattern iterator is not incremented; instead, the text iterator is set back by the value of the pattern iterator, which essentially moves the text iterator one position past the position where the match checking was initiated. The pattern iterator is then reset to zero.
# Knuth-Morris-Pratt
The Knuth-Morris-Pratt (KMP) substring search algorithm considers that it's probably not necessary to backtrack all the way to the beginning, since the characters along that stretch of the sequence have already been seen. One way to know the correct distance to backtrack is to use a Deterministic Finite-State Automaton (DFA). There are other methods that instead build an NFA or a partial-match table.
# KMP DFA Composition
The DFA is constructed such that every state corresponds to a position in the pattern. At each state there exists a transition to the next state, corresponding to the character consumed in the pattern. At each state there are also transitions going back to previous states, corresponding to backtracking on a pattern mismatch. Finally, the end state corresponds to the halt state and as such has no transitions leaving it.
The DFA is essentially represented by a table dfa[c][j] such that c corresponds to the character in the text currently being considered and j corresponds to the position of the character currently being considered in the pattern, i.e. the state in the DFA. In effect, dfa[c][j] determines which state to proceed to when at state j considering character c.
The value stored at dfa[c][j] therefore is the identifier of the state that the algorithm should jump to, which could mean either backtracking in the case of a mismatch (when c is not the pattern character at position j) or a progression to the next state j + 1 on a match.
# Preventing Backtracking in KMP
In a normal brute-force algorithm, when a pattern matching a segment of the text starting at t[i] mismatches at position j, the entire pattern is re-checked starting on the character to the right, t[i + 1], effectively having to re-check characters t[i + 1] to t[i + j - 1].
For example, suppose the text mismatches the pattern at position 4. In a brute-force algorithm the pattern would then be shifted to the right by one position and the comparison restarted.
However, this essentially means that the text segment from position 1 to 3 has to be re-checked, which we would prefer to avoid. The important observation to make is that the text had already matched the pattern up to (but not including) position j where the mismatch occurred. That is, the text segment t[i .. i + j - 1] is equal to p[0 .. j - 1] where p is the pattern. Since we would have to shift to the right one character, the text that would have to be re-checked corresponds to p[1 .. j - 1]. Feeding this segment to the DFA takes us to the state where we can appropriately handle t[i + j].
Based on this observation, we can conclude that at every state we can add transitions for mismatch cases based on the transitions that would be made for the equivalent mismatch that would occur at the state we would arrive at if we had fed the input p[0 .. j  1]
to the DFA. For this reason, a "pointer" to this state is kept at every iteration of the DFA construction, where each iteration is comprised of defining all transitions for a given state.
# KMP DFA Construction
Given the important observation above, the construction of the DFA is very straightforward. A pointer to a fallback state X is maintained to appropriately establish transitions in the event of a mismatch.
1. the first transition is established: dfa[p[0]][0] = 1
2. for each character in the pattern, a state is created:
    - for every character in the alphabet, a transition is established based on the transition that would be taken at state X, since these are the mismatch transitions
    - a match transition is created for the current pattern character
    - the pointer to the fallback state is updated to the state arrived at by following the transition corresponding to the current pattern character from the previous fallback state
# KMP Search
Now that the DFA is constructed, a string can be searched easily. It simply iterates the text pointer on each iteration, while the pattern's pointer iterates based on the output from the DFA given the current text character as input. Iteration ends when the full length of either the text or the pattern is exhausted. If the full pattern was consumed then there was a match and the pointer to the start of the match is returned.
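The DFA construction and search can be sketched together (the alphabet is passed explicitly, since the table has one row per alphabet character):

```python
def build_kmp_dfa(pattern, alphabet):
    m = len(pattern)
    dfa = {c: [0] * m for c in alphabet}
    dfa[pattern[0]][0] = 1                  # first match transition
    x = 0                                   # fallback state X
    for j in range(1, m):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]           # copy mismatch transitions from X
        dfa[pattern[j]][j] = j + 1          # set the match transition
        x = dfa[pattern[j]][x]              # advance the fallback state
    return dfa

def kmp_search(text, pattern, alphabet):
    dfa = build_kmp_dfa(pattern, alphabet)
    j = 0
    for i, c in enumerate(text):
        j = dfa[c][j]                       # text pointer never backs up
        if j == len(pattern):
            return i - len(pattern) + 1     # match: return its start index
    return -1

assert kmp_search("aabacaababacaa", "ababac", "abc") == 6
```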
# Boyer-Moore
The Boyer-Moore substring search algorithm works by reading the pattern for comparison in reverse order while skipping through the text accordingly to facilitate this. When a comparison mismatches, the algorithm looks in a skip table to determine how far ahead to jump before beginning the next match attempt. This behavior is known as the mismatched character heuristic.
# BM Skip Table
The mismatched character heuristic makes use of the aforementioned skip table. The table is indexed by a character from the alphabet and gives the index of that character's rightmost occurrence in the pattern, or -1 if not present. That value determines how far ahead to skip when that character in the text causes a mismatch.
The table is constructed by first setting all entries to -1, then, for every character in the pattern, setting that character's entry to its position in the pattern.
# BM Search
The searching algorithm, as previously stated, iterates the text pointer i from left to right and the pattern pointer j from right to left. If there is a mismatch with character c in the text, then one of three things can occur:
- if c is not in the pattern: increment i by j + 1 to effectively skip that segment of the text that will not match
- if c is in the pattern: use the right array to line up the pattern with the text such that the rightmost occurrence of c in the pattern is lined up with c in the text
- if i is not increased due to the above case: just increment i instead, so that the pattern always slides at least one position to the right
The above cases are handled with the simple statement skip = j - right[text.charAt(i + j)]. Case 1 is handled because characters not present in the pattern are stored as -1 in the table, thereby turning the statement into skip = j + 1. Case 2 is handled normally by finding the position of the rightmost occurrence of c in the table and subtracting it from j. Case 3 is handled by checking whether skip is less than one and, if so, setting it to one. If skip was never changed from its initial value of zero, then a match was found.
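A sketch of the mismatched character heuristic, combining the skip table and the right-to-left scan described above:

```python
def boyer_moore_search(text, pattern, radix=256):
    n, m = len(text), len(pattern)
    # Skip table: rightmost occurrence of each character in the pattern, -1 if absent.
    right = [-1] * radix
    for j, ch in enumerate(pattern):
        right[ord(ch)] = j
    i = 0
    while i <= n - m:
        skip = 0
        for j in range(m - 1, -1, -1):      # compare the pattern right-to-left
            if pattern[j] != text[i + j]:
                # -1 entries yield skip = j + 1; otherwise line up the
                # rightmost occurrence, sliding at least one position.
                skip = max(1, j - right[ord(text[i + j])])
                break
        if skip == 0:
            return i                         # skip never changed: match found
        i += skip
    return -1

assert boyer_moore_search("findinahaystackneedle", "needle") == 15
```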
# Rabin-Karp
The Rabin-Karp algorithm conceptually works by computing a hash of the pattern and then hashing every equal-length substring in the text to find a match. The key idea is that a string of length M corresponds to an M-digit base-R number, where R is the alphabet size. A proper hash function would convert such an M-digit base-R number to an integer value between 0 and Q - 1, where Q is some very large prime number. This is possible with a simple modular hashing scheme, by taking the remainder of dividing the number by Q.
The problem with using the above approach naively on the text is that it incurs the cost of multiplication, addition, and remainder calculations for each character of every substring. Instead, for an M-character substring of the text starting at position i, where t_i corresponds to text.charAt(i), the hash can be computed as:

x_i = (t_i * R^(M-1) + t_(i+1) * R^(M-2) + ... + t_(i+M-1) * R^0) mod Q

From the above formula it's apparent that the hash is constructed from individual hash components derived from each character in the text. It stands to reason, then, that the hash of the text shifted one character to the right is:

x_(i+1) = ((x_i - t_i * R^(M-1)) * R + t_(i+M)) mod Q

That is, the original hash minus the hash component of the first character of the previous window, shifted up by one digit, plus the hash component of the new ending character.
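A sketch of the rolling hash in code, with an explicit substring comparison on hash hits to guard against collisions (the modulus here is small for illustration; in practice Q would be a much larger prime):

```python
def rabin_karp_search(text, pattern, radix=256, q=997):
    m = len(pattern)
    rm = pow(radix, m - 1, q)        # R^(M-1) mod Q, used to drop the leading digit

    def hash_of(s):
        h = 0
        for ch in s[:m]:             # Horner's method, mod Q at each step
            h = (h * radix + ord(ch)) % q
        return h

    pat_hash = hash_of(pattern)
    txt_hash = hash_of(text)
    if txt_hash == pat_hash and text[:m] == pattern:
        return 0
    for i in range(m, len(text)):
        # Subtract the leading character's component, shift, add the new one.
        txt_hash = (txt_hash - ord(text[i - m]) * rm) % q
        txt_hash = (txt_hash * radix + ord(text[i])) % q
        if txt_hash == pat_hash and text[i - m + 1:i + 1] == pattern:
            return i - m + 1
    return -1

assert rabin_karp_search("hello world", "world") == 6
```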
# Regular Expressions
A Regular Expression pattern can be represented as a Non-Deterministic Finite-State Automaton (NFA) where every character in the pattern corresponds to a state in the NFA, followed by an accept state. Characters from the alphabet have an outgoing edge (match transition) going to the next state (character) in the pattern. Metacharacters such as parentheses, pipes, and asterisks have at least one outgoing edge (transition) going to another state that represents their purpose.
NFA traversal in this context occurs as follows:
 match transitions: if current state corresponds to a character in the alphabet and the current character in the text matches it, the automaton can transition from it, i.e. consume the character
 ε-transitions: if no match is made in the pattern, any ε-transition can be taken from a metacharacter; they are so called because they effectively match the empty string ε
The traversal of the NFA is handled in the following manner:
 at the start state: find the set of states reachable via ε-transitions
 consume a text character if it matches the character in one of the possible states
 from each matched state:
 add the set of states reachable via match transitions
 add the set of states reachable via ε-transitions
 repeat at 2
As the text input is fed to the NFA, after each input character one of the following conditions can arise:
 the set of states contains the accept state: the NFA accepts the text, i.e. there was a match
 the set of states doesn't contain the accept state: feed the NFA the next character
 the end of the text has been reached: there was no match
The NFA is simply represented by the pattern string and a digraph representing the ε-transitions.
# Regex Match Checking
From this information, it is possible to create an algorithm that determines whether a regular expression matches the provided text. Reachability is determined by a directed DFS implementation ^{7}. This is straightforward because the DFS only operates on the digraph, which only represents ε-transitions.
First, the set of states reachable via ε-transitions from the start state is collected.
As the text is fed into the NFA one character at a time, the set of reachable states is checked for a match with the current character. For each match, its next state is added to the collection of matches representing the set of states reachable from the current state(s).
Each of the states reachable via ε-transitions from each of the collected states is then added to the collection.
Once the entire text has been consumed, the final iteration of the above loop would leave the final set of reachable states intact. If this set contains the final, accept state, then the NFA accepts the text. Otherwise, there wasn't a match.
# Regex NFA Construction
The construction of the NFA is accomplished similarly to how Dijkstra's shunting-yard algorithm evaluates mathematical expressions in infix notation, by using two stacks: one for operators and another for values.
In this context, a stack is maintained for the operators, and a digraph with as many vertices as the length of the pattern plus one (to account for the accept state) is maintained to represent the NFA's ε-transitions. Concatenation is already handled implicitly by nature of how the pattern is stored.
For parentheses and or-expressions, the position of the `(` or `|` is pushed onto the stack.
If a `)` is encountered and it signifies the end of an or-expression, then the appropriate edges must be created. A regex `(A|B)` is handled by adding two ε-transitions: one from the `(` to the first state of `B`, and the other from the `|` to the `)`. The position of the `|` is pushed after having previously pushed the `(`.
Closures are detected by looking one position ahead of the current state (if possible). If a `*` is found, then an ε-transition is created from the current state to the `*` and another is created from the `*` back to the current state.
Finally, `(`, `*`, and `)` each also have an ε-transition leading to the next state in the pattern.
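The construction and simulation described above can be sketched together, adapted from the approach in Sedgewick and Wayne. Using Java collections as adjacency lists in place of a dedicated digraph class is an assumption of the sketch; it supports concatenation, parentheses, `|`, and `*`:

```java
import java.util.*;

// Sketch of a regex NFA: the pattern string plus a digraph of epsilon-transitions
// on states 0..m, where state m is the accept state.
public class NFA {
    private final String re;                        // the pattern
    private final int m;                            // pattern length
    private final List<List<Integer>> graph;        // epsilon-transition digraph

    public NFA(String regexp) {
        re = regexp;
        m = re.length();
        graph = new ArrayList<>();
        for (int v = 0; v <= m; v++) graph.add(new ArrayList<>());
        Deque<Integer> ops = new ArrayDeque<>();    // operator stack of positions
        for (int i = 0; i < m; i++) {
            int lp = i;
            if (re.charAt(i) == '(' || re.charAt(i) == '|') ops.push(i);
            else if (re.charAt(i) == ')') {
                int or = ops.pop();
                if (re.charAt(or) == '|') {         // (A|B): ( -> first state of B, | -> )
                    lp = ops.pop();
                    graph.get(lp).add(or + 1);
                    graph.get(or).add(i);
                } else lp = or;
            }
            if (i < m - 1 && re.charAt(i + 1) == '*') {  // closure lookahead
                graph.get(lp).add(i + 1);
                graph.get(i + 1).add(lp);
            }
            if (re.charAt(i) == '(' || re.charAt(i) == '*' || re.charAt(i) == ')')
                graph.get(i).add(i + 1);            // metacharacters also step forward
        }
    }

    // States reachable via epsilon-transitions (a directed DFS on the digraph).
    private Set<Integer> epsilonReachable(Collection<Integer> from) {
        Deque<Integer> stack = new ArrayDeque<>(from);
        Set<Integer> reached = new HashSet<>(from);
        while (!stack.isEmpty())
            for (int w : graph.get(stack.pop()))
                if (reached.add(w)) stack.push(w);
        return reached;
    }

    public boolean recognizes(String txt) {
        Set<Integer> pc = epsilonReachable(Collections.singleton(0));
        for (int i = 0; i < txt.length(); i++) {
            List<Integer> matches = new ArrayList<>();
            for (int v : pc)                        // consume one text character
                if (v < m && (re.charAt(v) == txt.charAt(i) || re.charAt(v) == '.'))
                    matches.add(v + 1);
            pc = epsilonReachable(matches);
            if (pc.isEmpty()) return false;         // dead end: no reachable states
        }
        return pc.contains(m);                      // did we reach the accept state?
    }
}
```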
# Data Compression
Universally good lossless data compression is impossible because, for example, it would mean that data could be compressed over and over again until eventually reaching a compressed length of 0. Instead, lossless compression aims to exploit the known structure of the target data for the best compression ratio.
# Run-Length Encoding
Run-Length Encoding (RLE) is a classic method of compression that replaces repeated occurrences of characters with their repeat count. For example, the following consists of 15 zeros, 7 ones, 7 zeros, and 11 ones:
`0000000000000001111111000000011111111111`
With RLE, given a count size of 4 bits, it can be represented with the counts 15 (`1111`), 7 (`0111`), 7 (`0111`), and 11 (`1011`):
`1111011101111011`
In general, each count is encoded in one byte. If a run of repeated characters is longer than the maximum count representable in that byte (i.e. 255), the first 255 occurrences are encoded, then a zero-length run of the alternate character, then the next chunk of the original long run.
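A minimal sketch of the encoding step, assuming a binary input string, byte-sized counts, and the convention that the encoding starts with a run of zeros:

```java
import java.util.*;

// Sketch of run-length encoding over a binary string with 8-bit counts.
// Runs longer than 255 are split by an interposed zero-length run.
public class RunLength {
    public static List<Integer> encode(String bits) {
        List<Integer> counts = new ArrayList<>();
        char current = '0';                // by convention the encoding starts with zeros
        int run = 0;
        for (char b : bits.toCharArray()) {
            if (b == current) {
                if (++run == 256) {        // count would overflow one byte
                    counts.add(255);
                    counts.add(0);         // zero-length run of the alternate character
                    run = 1;               // the 256th occurrence starts a new run
                }
            } else {
                counts.add(run);
                current = b;
                run = 1;
            }
        }
        counts.add(run);
        return counts;
    }
}
```

Running the encoder on the 15/7/7/11 example above yields the counts 15, 7, 7, 11.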
# Huffman Compression
Huffman Compression exploits the frequency of individual characters. For example, in `ABRACADABRA!`, the most frequently occurring character `A` could be represented by `0`, `B` by `1`, `R` by `00`, `C` by `01`, `D` by `10`, and `!` by `11`, resulting in `01000010100100011`.
The problem with the above representation is that the interpretation of the encoded data is ambiguous, because the characters aren't delimited and some of the characters' codes are prefixes of others. For example, `A` is `0`, `B` is `1`, and `C` is `01`, so when `01` is read it isn't clear whether it is meant to be interpreted as `AB` or `C`.
Instead, a property known as prefix-free code is enforced for the encodings, which prevents any code from being a prefix of another. In the above, a possible representation could be `A` with `0`, `B` with `1111`, `C` with `110`, `D` with `100`, `R` with `1110`, and `!` with `101`, yielding the encoding `011111110011001000111111100101`. While this is a slightly longer representation, it is unambiguous.
Prefix-free codes can be easily represented using a trie where left links are `0` and right links are `1`. Leaf nodes contain the character represented by the bits along the path used to reach them. Each node in the trie has an associated frequency (used during construction) and character (for leaves).
Constructing the trie consists of first creating a forest of 1-node trees, all of which are leaves, one for each character in the input, with its frequency variable set to the number of times the character appears in the input. The trie is then constructed from the bottom up by merging the two least frequent nodes under a new parent node whose frequency is set to their sum. This is greatly facilitated by a priority queue.
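A sketch of the trie construction and the derivation of the code table follows. The class and method names are assumptions, and ties in the priority queue may produce different but equally optimal tries:

```java
import java.util.*;

// Sketch of Huffman trie construction using a priority queue keyed on frequency.
public class Huffman {
    public static class Node implements Comparable<Node> {
        final char ch;                      // meaningful only for leaves
        final int freq;
        final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node that) { return this.freq - that.freq; }
    }

    public static Node buildTrie(Map<Character, Integer> freqs) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freqs.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));  // forest of leaves
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();                       // two least frequent
            pq.add(new Node('\0', a.freq + b.freq, a, b));           // merged parent
        }
        return pq.poll();
    }

    // Derive the code table: left edges append '0', right edges append '1'.
    public static void buildCode(Map<Character, String> table, Node x, String path) {
        if (x.isLeaf()) { table.put(x.ch, path); return; }
        buildCode(table, x.left, path + '0');
        buildCode(table, x.right, path + '1');
    }
}
```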
Then label left edges with a 0 and right edges with a 1. The bitstring along the path to a leaf represents that leaf's Huffman code.
The way in which the trie is constructed ensures that the more frequent characters (nodes) are closer to the root, and as a result are encoded with fewer bits.
One thing to recognize is that the trie has to somehow be encoded in the compressed data so that it can then be decompressed. The trie can be encoded in a bitstream by performing a preorder traversal (root → left → right), and at each node:
 if the node is a leaf: output a `1` followed by the binary representation of the character
 otherwise: output a `0`, then recurse on the left node and then the right (i.e. preorder)
Reading the trie into an actual trie structure is just as straightforward, where the type of node to create is determined by the leading bit.
Decompression consists of simply traversing the trie as each bit is read. If a leaf is encountered, output the character and restart traversal from the root.
Compression requires the existence of a code table mapping each character to the appropriate code. This table is derived from the trie by traversing the trie, keeping track of the bitstring along its path, and when a leaf node is encountered, the bitstring is associated with that character in the code table. Compression then simply requires looking up each character from the data in the code table and outputting the appropriate code.
# LZW Compression
LZW compression works by having fixed-length code words for variable-length input patterns. Code words are kept in a trie as with Huffman compression. A code counter is maintained and incremented after each new code is added to the trie. The initial trie is constructed from the alphabet, one node being created for each character with its code stored within. The rest of the trie is constructed as the input is read:
 the longest prefix of the input present in the trie is found and its value output to the compressed stream
 if the length of the prefix is shorter than the remaining input length, a new code is added for the string consisting of the prefix concatenated with the next character in the input stream. This is a simple operation, essentially done by adding a new node with the new code to the node at which the prefix ends
Decompression depends on a table indexed by codes and valued by strings (prefixes); it is initially constructed from the alphabet. The code of the first character in the input stream is read and its associated string is retrieved from the table. Decompression continues until the EOF character is encountered, on each iteration doing the following:
 the string associated with the current code is output
 another code is read; break if it is EOF
 the string associated with that code is retrieved from the table
 if the read code is equal to the next code about to be assigned (making it impossible to know the new string's first character, since that string is still in the process of being constructed), the new string is the previous string with its own first character appended
 a new code is added to the table at the incremented code counter, mapping to the previously read string concatenated with the first character of the current string
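Both directions can be sketched as follows, assuming a 7-bit ASCII alphabet and using a HashMap in place of the trie for the prefix lookup. Integer codes stand in for fixed-width output, and no EOF code is modeled:

```java
import java.util.*;

// Sketch of LZW compression and expansion over a 7-bit ASCII alphabet.
public class LZW {
    static final int ALPHABET = 128;       // assume 7-bit ASCII input for the sketch

    public static List<Integer> compress(String input) {
        Map<String, Integer> codes = new HashMap<>();
        for (int c = 0; c < ALPHABET; c++) codes.put("" + (char) c, c);
        int next = ALPHABET;               // next free code
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Find the longest prefix of the remaining input present in the table.
            int j = i + 1;
            while (j <= input.length() && codes.containsKey(input.substring(i, j))) j++;
            String prefix = input.substring(i, j - 1);
            out.add(codes.get(prefix));
            if (j <= input.length())       // prefix + next character gets a new code
                codes.put(input.substring(i, j), next++);
            i += prefix.length();
        }
        return out;
    }

    public static String expand(List<Integer> in) {
        List<String> table = new ArrayList<>();
        for (int c = 0; c < ALPHABET; c++) table.add("" + (char) c);
        StringBuilder out = new StringBuilder();
        String prev = table.get(in.get(0));
        out.append(prev);
        for (int k = 1; k < in.size(); k++) {
            int code = in.get(k);
            String cur = (code == table.size())
                ? prev + prev.charAt(0)    // lookahead case: code not yet in the table
                : table.get(code);
            out.append(cur);
            table.add(prev + cur.charAt(0));
            prev = cur;
        }
        return out.toString();
    }
}
```

Compressing `ABABABA` exercises both the prefix lookup (`AB` is reused) and, on expansion, the lookahead case.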
# Greedy Algorithms
Greedy algorithms are ones which make "myopic" decisions, i.e. they seemed like good decisions at the time and there's a hope that everything works out in the end.
Dijkstra's shortest-path algorithm is greedy, for example, because it processes each vertex once and never backtracks to find a different path.
# B-Trees
A B-Tree of order M is a tree consisting of internal and external k-nodes, each holding k keys, where 2 ≤ k ≤ M - 1 at the root and M/2 ≤ k ≤ M - 1 at every other node. Internal nodes contain copies of keys, where every key is greater than or equal to its parent node's associated key, but not greater than the parent node's next largest key. External nodes are the leaves of the tree, which associate keys with data. A sentinel key is created to be less than all other keys and is the first key in the root node.
# B-Tree Insertion
To insert a key, the tree is recursively descended by following the link pertaining to the interval upon which the inserted key falls, until an external node is reached. The tree is balanced on the way back up after the recursive call. If a node is full, it is split into two M/2-nodes and attached to a parent 2-node (if at the root) or to a (k + 1)-node, where k was the original size of the full node's parent. Whenever a node is split, the smallest key in the new node (or both smallest keys from both nodes if at the root) is inserted into the parent node.
# Suffix Arrays
Suffix arrays are arrays of suffixes of a given text which help with procedures such as finding the longest repeated substring in some text.
Using a suffix array, the longest repeated substring in a text can be found efficiently.
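A minimal sketch of the idea: sort the suffixes, then the longest repeated substring is the longest common prefix of some adjacent pair in sorted order. Materializing suffixes as `String` objects keeps the sketch short but is less efficient than an index-based suffix array:

```java
import java.util.*;

// Sketch: sort all suffixes, then scan adjacent pairs for the longest
// common prefix, which is the longest repeated substring of the text.
public class SuffixArray {
    public static String longestRepeatedSubstring(String text) {
        int n = text.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = text.substring(i);
        Arrays.sort(suffixes);             // similar suffixes become adjacent
        String best = "";
        for (int i = 1; i < n; i++) {
            String lcp = longestCommonPrefix(suffixes[i - 1], suffixes[i]);
            if (lcp.length() > best.length()) best = lcp;
        }
        return best;
    }

    static String longestCommonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++)
            if (a.charAt(i) != b.charAt(i)) return a.substring(0, i);
        return a.substring(0, n);
    }
}
```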
# Network Flow
The network-flow problem concerns itself with finding the settings in a network that maximize the flow from source to sink. At each junction in the network there are switches that control the flow's distribution between its outgoing edges. The problem can be modeled as an edge-weighted digraph with a single source and sink pair, where the weights correspond to the capacities of the edges.
An st-flow is a set of edge flows for the network that represent the distribution of flow values for each edge. An st-flow's value is the sink's inflow. The network-flow problem can be described as finding an st-flow such that no other st-flow has a larger value. Such an st-flow is referred to as a maxflow.
# Ford-Fulkerson
The Ford-Fulkerson algorithm, also known as the augmenting-path algorithm, works by increasing flows incrementally along paths from the source to the sink. It works by considering that each edge consists of a forward edge and a backward edge.
A path is found in the network in which there are no full forward edges and no empty backward edges. The flow of the network can then be increased by an amount Δ, by increasing flow in forward edges by Δ and decreasing flow in backward edges by Δ along this path. The value of Δ is the minimum of the unused capacities of the forward edges and the flows of the backward edges in the path. Such a path that can be used to increase flow in the network is known as an augmenting path.
Following from this, the maxflow can be found by starting with zero flow everywhere and gradually increasing the flow along any augmenting path from source to sink until there are no more augmenting paths.
A residual network has the same vertices as the original. For every edge in the original network: if its flow is positive, an edge should be created in the residual with an opposite direction and capacity equal to the flow. Also, if its flow is less than its capacity, an edge should be added in the same direction as the original edge with capacity equal to the difference between its capacity and flow.
This means that if, in the original, an edge's flow is zero then there'll only be one edge (in the same direction) and if instead the flow is full there'll only be one edge (in the opposite direction).
The residual network is useful because any path in it from source to sink corresponds directly to an augmenting path in the original network. As an augmenting path's flow is incremented, when an edge in the path becomes full or empty, it corresponds to changing direction or disappearing in the residual network.
The shortestaugmentingpath method finds the maxflow by finding an augmenting path using BFS and incrementing it.
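The shortest-augmenting-path method can be sketched on an adjacency-matrix residual network; the representation and class name are assumptions of the sketch. BFS parent links identify each augmenting path, and updating the matrix in both directions maintains the residual network directly:

```java
import java.util.*;

// Sketch of the shortest-augmenting-path (BFS) maxflow method on an
// adjacency-matrix residual network: cap[v][w] holds remaining capacity.
// Note: the matrix is modified in place as flow is pushed.
public class MaxFlow {
    public static int maxFlow(int[][] cap, int s, int t) {
        int n = cap.length, flow = 0;
        while (true) {
            int[] parent = new int[n];
            Arrays.fill(parent, -1);
            parent[s] = s;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty() && parent[t] == -1) {  // BFS in the residual network
                int v = queue.poll();
                for (int w = 0; w < n; w++)
                    if (parent[w] == -1 && cap[v][w] > 0) { parent[w] = v; queue.add(w); }
            }
            if (parent[t] == -1) return flow;              // no augmenting path remains
            int bottleneck = Integer.MAX_VALUE;            // minimum residual capacity on the path
            for (int w = t; w != s; w = parent[w])
                bottleneck = Math.min(bottleneck, cap[parent[w]][w]);
            for (int w = t; w != s; w = parent[w]) {
                cap[parent[w]][w] -= bottleneck;           // forward edge loses capacity
                cap[w][parent[w]] += bottleneck;           // backward (residual) edge gains it
            }
            flow += bottleneck;
        }
    }
}
```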
# Geometric Algorithms
# Augmented BST as Interval Tree
An interval search tree stores ranges and provides operations for searching for overlaps of a given range within the tree. A binary search tree can be augmented into an interval search tree. The lower bound of a range is used as the node key. Each node also stores the maximum upper bound in its subtree, similar to the rank in a one-dimensional range count.
For searching, the input interval is checked to see if it overlaps with the current node's interval. If not, and the left child is null, search proceeds on the right child. Likewise, if the left child's max upper bound is less than the input interval's lower bound, search proceeds on the right child. Otherwise search proceeds on the left child:
 if the input interval overlaps the current node's interval: return it
 if the left child is null, or the left child's max upper bound < the input interval's lower bound: go right
 else: go left
# Interval Trees
Interval trees are useful for efficiently finding all intervals that overlap with any given interval or point.
To construct the tree, the median of the entire range of all of the set of ranges is found. Those ranges in the set that are intersected by the median are stored in the current node. Ranges that fall completely to the left of the median are stored in the left child node, and vice versa with the right node.
At any given node representing the set of ranges intersected by the median at that node, two sorted lists are maintained: one containing all beginning points and the other containing all end points.
# Intersection Queries
The general operation for queries is to test the set of ranges in a node and then test those in the appropriate child node if the query isn't equal to the median.
Given a point query, the query point is compared with the current node's median. If they're equal, then every range in that node matches and the search is complete. If the query is less than the median, then the list of beginning points is searched for those that start before the query point, all of which are matches; the search then continues into the left child. The symmetric procedure applies with the end points and the right child when the query is greater than the median.
Given an interval query, the set of beginning and end points are searched to see if they fall within the query interval. These ranges are matches, and they have potential for duplicates if the matched interval begins and ends within the query interval. Finally, to match for ranges which possibly contain the query interval, a point is chosen in the query interval, perhaps the begin or end point, and that point is used as a point query as in the aforementioned point query algorithm.
# One-Dimensional Range Count
This can be modeled as a BST where each node maintains a rank: the count of keys in the tree that are strictly less than the node's key. It's possible to determine how many keys fall within a given range by subtracting the rank of the node containing the lower bound from the rank of the node containing the upper bound, adding one if the upper bound itself is in the tree.
# Line Segment Intersection
Given a collection of line segments, it's possible to determine which pairs of line segments intersect by using a sweep-line algorithm. The coordinates could be sorted by x-coordinate or added to a priority queue. For each distinct x-coordinate encountered, its accompanying y-coordinate is added to a BST. If the same x-coordinate is encountered again, its accompanying y-coordinate is removed from the BST. If a vertical segment is encountered, a range search is performed on the BST to see if any y-coordinates fall within the vertical segment's y-coordinate endpoints.
# Rectangle Intersection
Checking for rectangle intersection is similar to line segment intersection. When the left edge of a rectangle is encountered, its vertical range is checked for overlaps in an interval search tree, and is added if none are detected. The rectangle's vertical range is removed from the interval search tree when the right edge of the rectangle is encountered.
# Dynamic Programming
It's important to reason about the structure of an optimal solution, in terms of optimal solutions of smaller subproblems. In other words, imagine that optimal solutions to two subproblems were already provided.
First it's important to identify a suitable small collection of subproblems.
Second it's important to quickly and correctly solve "larger" subproblems given solutions to "smaller" subproblems.
Third it's important to compute the final solution after solving all subproblems.
# Maximum-Weight Independent Set
Example problem: given a path graph with non-negative weights on the vertices, produce a subset of the graph's vertices such that no two vertices in the subset are adjacent and the subset has the maximum total weight of every such subset.
A brute-force search would be exponential in the number of vertices.
A greedy approach would be to iteratively choose the maximum-weight vertex that is not adjacent to any previously chosen vertex. However, this wouldn't always give the correct answer: on the path with weights 1, 4, 5, 4, for example, it would choose the third and then the first vertex (total 6), although the second and fourth (total 8) is the answer.
A divide-and-conquer approach would be problematic when attempting to combine the results of subproblems.
Let S be a max-weight independent set (IS) of the path graph G. Let v_n be the last/rightmost/final vertex of the path. Either v_n is in S or it isn't.
Suppose v_n ∉ S, and let G' be the graph G with v_n deleted off of the end (since v_n is the last vertex on the path). Then S is also an independent set of G', specifically a max-weight independent set.
Suppose v_n ∈ S. Then the penultimate vertex v_(n-1) ∉ S, since it is adjacent to v_n. Let G'' be the graph G with both v_n and v_(n-1) deleted.
Unlike the earlier claim that if v_n ∉ S then S is also an independent set of G', it's not true that if v_n ∈ S then S is also an independent set of G'', because v_n ∈ S but v_n ∉ G''. However, S - {v_n} is an independent set of G'', specifically a max-weight independent set.
This means that if we knew whether or not v_n was in the max-weight independent set, we could recursively compute the max-weight independent set of G' or G''. Since we don't know, we can try both possibilities and return the better solution.
Essentially: recursively compute S' as the max-weight independent set of G', as well as S'' as the max-weight independent set of G''. Then return the heavier of S' and S'' ∪ {v_n}.
However, this would take exponential time because very little work is done before recursing.
A realization is that there is only a linear number of distinct subproblems, one for each prefix of the graph, since the recursion only plucks vertices off from the right. The naive recursion repeats work for the same prefixes.
This can be mitigated by caching the solution to each subproblem in a global table for subsequent constant-time lookup, i.e. memoization, where there is an array of solutions to subproblems whose i-th entry holds the solution for the prefix consisting of the first i vertices.
This is more straightforward if it's reformulated as a bottom-up iterative algorithm.
Let G_i be the graph consisting of the first i vertices of G. Then populate an array `A` from left to right with `A[i]` set to the value of the max-weight independent set of G_i. Naturally `A[0]` corresponds to an empty graph, so it's set to weight 0, and `A[1]` corresponds to a single-vertex graph, so it's set to the weight of the first vertex.
After adding another vertex v_i with weight w_i, determine the max-weight independent set value for this new G_i. This will be the maximum of either A[i - 1] or A[i - 2] + w_i:

A[i] = max(A[i - 1], A[i - 2] + w_i)

If it's A[i - 1], then the max-weight independent set of G_i is that of G_(i-1), excluding v_i. If it's A[i - 2] + w_i, then it is that of G_(i-2) together with v_i.
One problem is that this only produces a total weight, but not the actual set of vertices. This can be reconstructed from the completed array by walking backwards from righttoleft starting at the last element, since that was the answer/result.
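The bottom-up table fill and the backwards reconstruction can be sketched as follows (the class and method names are illustrative):

```java
import java.util.*;

// Sketch of the bottom-up max-weight independent set DP on a path graph,
// including reconstruction of the chosen vertices from the completed table.
public class PathMWIS {
    // Returns the 0-based indices of the chosen vertices.
    public static List<Integer> solve(int[] w) {
        int n = w.length;
        long[] A = new long[n + 1];
        A[0] = 0;                          // empty graph
        if (n > 0) A[1] = w[0];            // single-vertex graph
        for (int i = 2; i <= n; i++)
            A[i] = Math.max(A[i - 1], A[i - 2] + w[i - 1]);

        // Walk backwards through the table to reconstruct the set itself.
        List<Integer> set = new ArrayList<>();
        for (int i = n; i >= 1; ) {
            if (i == 1 || A[i - 1] < A[i - 2] + w[i - 1]) {
                set.add(i - 1);            // vertex i was included
                i -= 2;                    // its neighbor must be excluded
            } else i -= 1;                 // vertex i was excluded
        }
        Collections.reverse(set);
        return set;
    }
}
```

On the weights 1, 4, 5, 4 from the greedy counterexample, this selects the second and fourth vertices.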
# Knapsack Problem
There are n items, each item i having:
 a value v_i (non-negative)
 a size w_i (non-negative and integral)
There is also a knapsack capacity W (non-negative and integral).
The output should be a subset of the items that maximizes the sum of the selected values while preventing the sum of the selected sizes from exceeding the capacity W.
This is useful whenever we have a budget of a resource and we want to use it in the smartest way possible.
First we'll formulate a recurrence (an optimal solution as a function of solutions to smaller subproblems) based on the structure of an optimal solution.
Let S be the max-value solution to an instance of knapsack, and consider the final item n.
Suppose n ∉ S; then S must be optimal with respect to the first n - 1 items.
Suppose n ∈ S; then S - {n} must be an optimal solution with respect to the first n - 1 items with capacity W - w_n.
Let V_(i, x) be the value of the best solution that:
 uses only the first i items
 has total size at most x
Therefore:

V_(i, x) = max(V_(i - 1, x), V_(i - 1, x - w_i) + v_i)

However, an edge case exists: if w_i > x, then item i cannot fit and we must use case 1.
Now given the above recurrence, the algorithm can be implemented.
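Given the recurrence, a bottom-up sketch (the names are illustrative):

```java
// Sketch of the bottom-up 0/1 knapsack DP implementing the recurrence above.
public class Knapsack {
    // values[i] and sizes[i] describe item i; capacity is the knapsack size W.
    public static int bestValue(int[] values, int[] sizes, int capacity) {
        int n = values.length;
        // V[i][x]: best value using only the first i items with total size <= x.
        int[][] V = new int[n + 1][capacity + 1];
        for (int i = 1; i <= n; i++)
            for (int x = 0; x <= capacity; x++) {
                V[i][x] = V[i - 1][x];              // case 1: item i excluded
                if (sizes[i - 1] <= x)              // case 2 applies only if item i fits
                    V[i][x] = Math.max(V[i][x], V[i - 1][x - sizes[i - 1]] + values[i - 1]);
            }
        return V[n][capacity];
    }
}
```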
# NP-Complete Problems
 clique problem: find complete subgraphs, or cliques, in a graph
 vertex cover: find a set of vertices in a graph such that each edge in the graph is incident to at least one vertex in the set
 travelling salesman problem: find the shortest possible cycle that visits every vertex in a graph
 graph coloring: color every vertex (or edge) in a graph such that no two adjacent vertices (or edges) have the same color
 knapsack: given a set of items with different values and a container of a maximum capacity, find the combination of items that fits in the container and has the largest total value

The Wikipedia implementation's 6 cases were condensed to 4, as was done in the Linux kernel Red-Black tree implementation. Cases 1 and 2 were merged since case 1 is simply a check to see if the node is the root. Cases 3 and 4 were merged because they handle the same scenario, with case 4 simply being a handler for a special case of 3.

See Week 3 of CS 188.1x for more information.