There are three ways to traverse a binary tree: pre-order, in-order, and post-order.
To numerically ascertain whether a binary tree is balanced, we define the balance factor of every node in the tree as the difference between the heights of its right and left subtrees. In a balanced binary tree, the absolute value of this difference is at most one at every node.
To balance an unbalanced binary tree, we need to look at the following two structures

       N             N
      / \             \
     /   \             C1
    C1    C2            \
                         C2

  Mountain           Stick
We know the stick is not balanced since the balance factor of N is 2. On the other hand, the mountain has a balance factor of 0 throughout and is balanced.
Balancing a binary tree in the smallest case is turning the stick into a mountain. In the stick, N < C1 < C2. Hence, converting it into a mountain is simply reordering to have C1 as the root, with N as the left child and C2 as the right child.
Another case is unbending the elbow, where we convert an elbow to a stick and then follow the procedure above to convert the stick to a mountain. The following set of tree structures shows the transformation from left to right.
    N           N            N
     \           \          / \
      C2          C1       /   \
     /             \      C1    C2
    C1              C2

  Elbow         Stick     Mountain
In the elbow, N has a balance factor of 2, and at this point we must balance the tree.
Suppose an already balanced tree becomes unbalanced due to the addition of one node. To balance such a tree, we can use the following rotations.
Suppose we had an addition to t3 or t4 in the left tree that causes b to have a balance factor of 2.
  \                        \
   b                        c
  / \                      / \
 t1  c       ------>      /   \
    / \                  b     d
   t2  d                / \   / \
      / \             t1  t2 t3  t4
     t3  t4
Notice the following

* b is the deepest node (measuring depth from the root node) that is unbalanced
* b has a balance factor of 2
* c has a balance factor of 1
* t3 or t4 caused the imbalance
* The linkage b-c-d forms a stick
To perform the rotation

* Take the stick b-c-d and use the same method used earlier to convert it to a mountain
* Promote c to be the root of this entire subtree under consideration
* Keep the right child d of c as is
* Attach b as the left node of c and keep the left subtree t1 of b as is
* Move t2 to be the right subtree of b to complete the rotation and achieve balance.

  \                   \                    \
   b                   b                    d
  / \                 / \                  / \
 t1  c    ------>   t1   d     ------>    /   \
    / \                 / \              b     c
   d   t4             t2   c            / \   / \
  / \                     / \         t1  t2 t3  t4
 t2  t3                 t3   t4
Notice the following

* b is the deepest node with the imbalance
* b has a balance factor of 2
* c, immediately below b, has a balance factor of -1
* t2 or t3 caused the imbalance

If we look at the linkage b-c-d, it is an unbalanced elbow as discussed above, which can be converted to a balanced mountain in two rotation steps: first convert it to a stick (a right rotation) and then apply a left rotation. The same is depicted in the set of diagrams above.
Balance factor of the lowest point of imbalance | Balance factor of the next node in the direction of imbalance | Type of Rotation |
---|---|---|
2 | 1 | Left Rotation |
-2 | -1 | Right Rotation |
2 | -1 | Right-Left Rotation |
-2 | 1 | Left-Right Rotation |
The final rotation (which is the only rotation in the first two cases) is determined by the sign of the first column. If the signs in both columns are the same, a single rotation is performed. The cases with differing signs correspond to an elbow.
All these rotations change only the local tree structure and are just rearrangements of a couple of pointers. They run in O(1) time.
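The pointer rearrangement for a single left rotation can be sketched as follows; the `Node` class and field names are illustrative assumptions, not tied to any particular library:

```python
class Node:
    """A bare binary tree node; field names are illustrative assumptions."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_left(b):
    """Left-rotate the subtree rooted at b; returns the new subtree root."""
    c = b.right        # c moves up to become the new root of this subtree
    b.right = c.left   # c's old left subtree becomes b's right subtree
    c.left = b         # b becomes the left child of c
    return c
```

Applied to the stick N -> C1 -> C2, this yields the mountain with C1 at the root; a right rotation is the mirror image.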
An AVL tree is an extension of a balanced binary search tree. All the operations related to find, insertion and removal remain the same. The following are the additional properties of an AVL tree
To do an insert in an AVL tree, we follow these steps
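The steps amount to an ordinary BST insert followed by walking back up, recomputing balance factors, and applying the rotations from the table above. The following is a minimal sketch under that reading; node fields are assumptions, and heights are recomputed on the fly (a real implementation would cache them instead):

```python
class AVLNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def height(n):
    if n is None:
        return -1
    return 1 + max(height(n.left), height(n.right))

def balance(n):
    # balance factor = height(right) - height(left), as defined earlier
    return height(n.right) - height(n.left)

def rotate_left(b):
    c = b.right
    b.right, c.left = c.left, b
    return c

def rotate_right(b):
    c = b.left
    b.left, c.right = c.right, b
    return c

def avl_insert(node, key):
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = avl_insert(node.left, key)
    else:
        node.right = avl_insert(node.right, key)
    bf = balance(node)
    if bf == 2:                        # right-heavy
        if balance(node.right) == -1:  # elbow: right-left rotation
            node.right = rotate_right(node.right)
        node = rotate_left(node)
    elif bf == -2:                     # left-heavy
        if balance(node.left) == 1:    # elbow: left-right rotation
            node.left = rotate_left(node.left)
        node = rotate_right(node)
    return node
```

Inserting 1, 2, 3 in order builds the stick from earlier, which the final rotation turns into the mountain rooted at 2.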
The removal operation is slightly more involved and proceeds as follows
A B-Tree of order m means that
Visually, a tree of order 3 could look like
[65, 89]
/ | \
/ | \
[23] [72] [99]
To do a search, at any node we first do a linear search through the node's key array to find the key; if we do not locate it, we identify the appropriate child node to search next and repeat this procedure recursively. The runtime is O(log_m(n)) node visits, with a linear scan of at most m keys at each node.
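A sketch of this search, using nested dicts as a stand-in for B-tree nodes; the `keys`/`children` layout is an assumption for illustration:

```python
def btree_search(node, key):
    """Linear scan within the node, then recurse into the matching child."""
    if node is None:
        return False
    i = 0
    # linear search through the node's sorted key array
    while i < len(node["keys"]) and key > node["keys"][i]:
        i += 1
    if i < len(node["keys"]) and node["keys"][i] == key:
        return True
    children = node.get("children")
    if not children:
        return False          # reached a leaf without finding the key
    return btree_search(children[i], key)
```

Run against the order-3 tree drawn above, searching for 72 descends into the middle child.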
A Priority Queue is an abstract data type optimized for insert and find-min (or find-max) operations. The min (or max) values are often referred to as priorities and are numerically comparable. A priority queue can be implemented in several ways, including
Data Structure | Insert | Get Min |
---|---|---|
Unsorted Array | O(1) (Amortized) | O(n) |
Unsorted Linked List | O(1) | O(n) |
Sorted Array | O(n) | O(1) |
Sorted Linked List | O(n) | O(1) |
A complete binary tree is a min heap if the following properties hold
When checking these properties for any subtree, we do not need to worry about the rest of the tree. Every subtree must itself satisfy the above properties for the whole tree to be a min heap.
An example of such a complete binary tree is
4
/ \
/ \
5 6
/ \ /
9 15 18
When thinking about the tree in an abstract fashion, the above structure seems intuitive. When actually working with the same as an implementation, we consider an array with the following structure
* The array is filled starting at index 1
* The left child of a node at index i is at index i * 2
* The right child of a node at index i is at index i * 2 + 1
The above tree implemented as an array will look like
Index --> | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
Elements --> | | 4 | 5 | 6 | 9 | 15 | 18 |
Thus, to traverse the tree, we can simply jump between indices using the above formulas.
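These index formulas translate directly into three one-line helpers:

```python
def parent(i):
    return i // 2        # integer division drops the remainder

def left_child(i):
    return 2 * i

def right_child(i):
    return 2 * i + 1
```

For the array above, `heap[left_child(1)]` is 5 and `heap[right_child(1)]` is 6, matching the children of 4.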
Inserting efficiently into a min heap is an intuitive algorithm that can be described as follows
As a working example, consider the above described tree and the following operations to insert 3
Index --> | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Elements --> | | 4 | 5 | 6 | 9 | 15 | 18 |
Add 3
Elements --> | | 4 | 5 | 6 | 9 | 15 | 18 | 3 |
heap property violated between 3 and its parent at index 7/2 = 3, swap
Elements --> | | 4 | 5 | 3 | 9 | 15 | 18 | 6 |
heap property violated between 3 and its new parent at index 3/2 = 1, swap
Elements --> | | 3 | 5 | 4 | 9 | 15 | 18 | 6 |
Note that any swap keeps the heap property satisfied for the subtree below it, so we only need to worry about the remaining upper part of the tree. We refer to this function as heapify up.
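The insert-and-heapify-up procedure above can be sketched as follows, assuming the 1-indexed array layout described earlier (index 0 unused):

```python
def heap_insert(heap, value):
    heap.append(value)            # place the new element at the next open slot
    i = len(heap) - 1
    while i > 1 and heap[i] < heap[i // 2]:
        # heap property violated with the parent at i // 2: swap and move up
        heap[i], heap[i // 2] = heap[i // 2], heap[i]
        i //= 2
```

Inserting 3 into the worked example reproduces the two swaps shown above.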
Removal from a min heap always returns the minimum element. Hence, we can simply return the element of the array at index 1. But after removing this element, we need to restructure the tree to satisfy the min heap property.
To do this,
As an example, consider remove min operation on the above tree
Index --> | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Elements --> | | 3 | 5 | 4 | 9 | 15 | 18 | 6 |
Swap root with the last element and reduce the size of array
Elements --> | | 6 | 5 | 4 | 9 | 15 | 18 |
heap property violated between 6 and its children, pick the minimum
of the two children and swap with that, (children at indices 1 * 2, 1 * 2 + 1)
Elements --> | | 4 | 5 | 6 | 9 | 15 | 18 |
heap property satisfied for the element at index 3, so it is satisfied throughout the tree
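The remove-min procedure with heapify down can be sketched as follows, using the same 1-indexed layout:

```python
def heap_remove_min(heap):
    minimum = heap[1]
    heap[1] = heap[-1]            # move the last element to the root
    heap.pop()                    # shrink the array
    i, n = 1, len(heap) - 1
    while 2 * i <= n:             # heapify down: sink until children are larger
        child = 2 * i
        if child + 1 <= n and heap[child + 1] < heap[child]:
            child += 1            # pick the smaller of the two children
        if heap[i] <= heap[child]:
            break                 # heap property restored
        heap[i], heap[child] = heap[child], heap[i]
        i = child
    return minimum
```

On the worked example it returns 3 and restores the original array.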
Given any data stored in an array-like structure, we can perform a build-heap operation in O(n) time by applying heapify down to every internal node from the bottom up.
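A sketch of build-heap, applying heapify down to each internal node from the last one up to the root:

```python
def build_heap(values):
    heap = [None] + list(values)       # shift into the 1-indexed layout
    n = len(heap) - 1
    for i in range(n // 2, 0, -1):     # internal nodes, bottom up
        j = i
        while 2 * j <= n:              # heapify down from node j
            child = 2 * j
            if child + 1 <= n and heap[child + 1] < heap[child]:
                child += 1             # pick the smaller child
            if heap[j] <= heap[child]:
                break
            heap[j], heap[child] = heap[child], heap[j]
            j = child
    return heap
```

Building from the shuffled values of the earlier example recovers the same heap array.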
Hashing is the process of mapping any input (an integer, a string, etc.) to a fixed integer in a given range. Suppose we have several key-value pairs. Using hashing, we can map every key to some integer/index and store the value of the key-value pair at that index of an array. When we later want to search for that key, we first run the key through the hash function to find the index to look at in the array, and then retrieve the value stored at that location in O(1).
This fixed integer in a given range determines the index of the array (or the key, in dictionary terms) where we put the new element. The function that enables such a mapping is called a hash function. It is possible that multiple inputs are mapped to the same integer, causing a collision. Any implementation of a hash map will have built-in logic to handle these for us.
A simple example of a hash function, when all inputs are strings, is taking the first character of the string. In that case, assuming sanitized strings, we can map each string to one of 26 possibilities. Clearly, this is not a great hash function, since many different strings will be mapped to the same integer, causing a lot of collisions.
Consider a hash function for integer inputs that determines the position of an input by taking the input modulo 7. Our key array is of fixed size 7. Multiple collisions will happen, since several inputs can leave the same remainder when divided by 7.
To handle the collisions, we simply make each index of the key array point to a linked list. As we encounter collisions, we keep adding elements to the beginning of the linked list, which gives O(1) insertion time. In the worst case, the find/remove operation is O(n). Let's denote the load factor of the hash table by alpha = n/N, where n is the number of elements in the hash table and N is the number of slots in the array.
A load factor < 1 is ideal, since it means there is at most one element per index on average. This ensures that remove and find operations on the linked lists are almost O(1).
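Separate chaining with the mod-7 example above can be sketched as follows; Python lists stand in for the linked lists, with new pairs prepended, and the table size 7 is just for illustration:

```python
N = 7                                  # number of slots, as in the mod-7 example
table = [[] for _ in range(N)]         # one chain (list) per slot

def ht_insert(key, value):
    table[key % N].insert(0, (key, value))   # O(1): prepend to the chain

def ht_find(key):
    for k, v in table[key % N]:              # O(chain length) in the worst case
        if k == key:
            return v
    return None
```

Keys 3 and 10 both hash to slot 3, so they share a chain, yet both remain findable.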
Linear probing is a method to handle collisions where, on a collision, we check the next positions in the array one by one until we find an available slot for the element. Clearly, in the worst case this can require looping over the entire array before we find an available slot.
This is an add-on to the linear probing method, where we utilize a second hash function to determine the step size while probing for an index to store the element. Since this method jumps around to find an available slot, there is less chance of creating local clusters. The runtime of this approach depends on the load factor and not on the size of the data itself.
The formula is h(k, i) = (h1(k) + i * h2(k)) % N, where h1 and h2 are two hash functions. We start with i = 0 and keep incrementing it as we encounter collisions. Thus, on the first try only h1 determines the index; h2 comes into play only after the first try has produced a collision.
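The probe sequence can be sketched as follows; the particular h1 and h2 below are illustrative assumptions (h2 is kept nonzero so the probe always advances):

```python
N = 7

def h1(k):
    return k % N

def h2(k):
    return 1 + (k % (N - 1))   # never 0, so each probe step actually moves

def probe(table, k):
    """Return the index where k is stored or would be placed, or -1 if full."""
    for i in range(N):
        idx = (h1(k) + i * h2(k)) % N
        if table[idx] is None or table[idx] == k:
            return idx
    return -1
```

With an empty table, 10 lands at slot 3 on the first try; inserting 3 afterwards collides there and the second probe jumps by h2(3) = 4 to slot 0.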
As seen above, maintaining the load factor is critical for an efficiently working hash table. Rehashing resizes the array whenever the load factor grows too large, reassigning indices for the already existing data and creating space for new entries. Usually, the array size is doubled.
A disjoint-set data structure keeps track of elements that have been partitioned into disjoint subsets. It supports two main operations:
Disjoint-Sets data structures are also sometimes called Union-Find data structures due to the primary operations associated with these, union and find.
However, note that union is a reserved keyword in C and C++, and thus the actual function name could be something like set_union.
A naive implementation of this data structure is using an array. We simply have the indices as the elements/keys in the disjoint subsets, and the stored array value as the representative element of the disjoint subset. This can simply be the first element of the subset.
Thus, if we have three subsets with elements (0,4), (1,3,5), (2,6,7), the corresponding representation as an array will be
Index --> | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Elements --> | 0 | 1 | 2 | 1 | 0 | 1 | 2 | 2 |
Clearly, find works in O(1), since we can directly fetch the value at the index. However, union is O(n), since we need to traverse the entire array and update every element belonging to one of the two subsets being merged.
In this implementation, we still continue using the array indices as the keys, but now the value stored at an index is the index of that element's parent, or -1 if the element is the root of its subset.
Suppose we have 4 elements, all in different subsets initially,
Index --> | 0 | 1 | 2 | 3 |
Elements --> | -1 | -1 | -1 | -1 |
Now if we want to merge 0 and 3, we simply make 0 as the parent of 3
Index --> | 0 | 1 | 2 | 3 |
Elements --> | -1 | -1 | -1 | 0 |
Thus, find(3) is 0, which is not -1 and we recurse. find(0) is -1 which means that the representative element of the subset in which 3 resides is 0. We can also refer to this element as the root of the corresponding UpTree.
Extending this idea, merging any two elements is the same as making the root node of one subset point to the root node of the other subset. This way, the union operation boils down to finding the root nodes of the two elements on which it is called.
For instance, if we want to merge 3 and 1, we make the representative element/root of the subset where 3 resides point to 1. The first part of the algorithm is to traverse the tree and get the root: find(3) gives 0, and find(0) gives -1, indicating that 0 is the root element. We then update the array as array[0] = 1
Index --> | 0 | 1 | 2 | 3 |
Elements --> | 1 | -1 | -1 | 0 |
The run time of this find algorithm is O(height), where the height can be n in the worst case, making it as slow as the naive implementation.
With this worst case complexity, an ideal UpTree is the one that has flat trees, i.e., every element of a subset points to the root. The complexity of find in such a case will be O(1).
Both approaches (union by height and union by size) ensure that the height of the trees is bounded by O(log n).
When we call find on an element, we traverse the entire path from that element to the root. Since we are traversing this path anyway, we can make subsequent find calls efficient via path compression: for every node on the path, we update its value in the array so that it points directly to the root element.
Any node pointing into the compressed path also has its depth reduced. This brings the data structure closer to the ideal UpTree, where the trees are as flat as possible.
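The up-tree with path compression can be sketched as follows, using the same -1-for-root array convention as the example above:

```python
parent = [-1, -1, -1, -1]   # four singleton sets, as in the example above

def find(x):
    """Return the root of x's set, pointing every node on the path at it."""
    if parent[x] == -1:
        return x
    root = find(parent[x])
    parent[x] = root        # path compression
    return root

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb     # root of a's set now points at root of b's set
```

Replaying the example: union(3, 0) yields [-1, -1, -1, 0], union(3, 1) then yields [1, -1, -1, 0], and a subsequent find(3) compresses 3's entry to point directly at 1.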
The iterative log is defined as

log*(n) = 0, if n <= 1
log*(n) = 1 + log*(log(n)), otherwise

For example, log*(2^65536) = 5.
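The definition translates directly into code; this sketch uses base-2 logarithms:

```python
import math

def log_star(n):
    """Iterated logarithm: how many times log2 must be applied until n <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

`log_star(2 ** 65536)` evaluates to 5, matching the example above.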
For disjoint-sets implemented with smart union and path compression, the runtime for m queries (a combination of union and find operations) is O(m * log*(n)), where n is the size of the disjoint-set. Since the iterative log is very small even for very large values of n, the amortized time can be said to be O(m), making one operation on the disjoint set constant time (amortized).
Graphs are a collection of vertices (nodes) and edges denoted by G = (V, E). The number of vertices is denoted by n and the number of edges by m.
Key terms
Term | Definition |
---|---|
Incident Edges (v) | For a vertex, it is the collection of all edges directly connected to it {(x,v) in E} |
Degree (v) | The number of incident edges |
Adjacent Vertices (v) | Collections of all vertices that are connected to v by an edge, {x : (x,v) in E} |
Path | Sequence of vertices connected by edges |
Cycle | Path with a common begin and end vertex |
Simple Graph | A graph with no self-loops or multiple edges |
Subgraph | A graph whose vertices and edges are a subset of another (parent) graph |
In an edgelist implementation, we store a list of vertices (as a vector), and a list of edges (as a vector). Every element in the list of edges contains information about the two vertices connected by the edge, and any other information, like the edge weight, stored as well.
Different operations on this implementation
Operation | Description |
---|---|
InsertVertex(v) | simply expand the vertex list to add the new node, O(1) amortized |
RemoveVertex(v) | O(m) since we will also remove all edges which have vertex v; this requires traversal through the entire list of edges |
areAdjacent(vertex v1, vertex v2) | O(m) since we need to traverse the entire edge list and check if the two vertices are part of the same edge anywhere in the list |
IncidentEdges(v) | O(m) since we traverse the entire list, checking all edges and collecting the ones which have the vertex v present in them |
In this implementation, we still store a list of vertices, along with a matrix whose rows and columns correspond to the vertices. The value at any location indicates whether an edge is present between the two vertices. Usually, 0 denotes the absence of an edge.
For a more exhaustive implementation, we could store edges and related information in a separate list of edges (as above), and store pointers to this edge list in the adjacency matrix. With this extra storage, we have the flexibility to traverse the list of edges for certain operations, while still retaining the ability to find edges corresponding to any vertex via the matrix.
Different operations and their runtimes
Operation | Description |
---|---|
InsertVertex(v) | this now becomes O(n) since a new row and column must be added to the adjacency matrix |
RemoveVertex(v) | O(m) + O(n) since we remove one row and column from the adjacency matrix |
areAdjacent(vertex v1, vertex v2) | O(1) because it is a simple lookup in the matrix |
IncidentEdges(v) | O(n) as we look up the corresponding row and column of the matrix to check for the existence of edges |
This implementation extends the idea of an edge list. In addition to the vertex list and the edge list, each element of the vertex list points to a linked list which is the list of all edges adjacent to that vertex in the graph. Each element of this list points to the corresponding edge in the list of edges.
Different operations and their runtimes
Operation | Description |
---|---|
InsertVertex(v) | Still O(1) amortized, since we add an element to the array |
RemoveVertex(v) | O(deg(v)) where deg denotes degree, since we need to traverse through all the edges having v as one of the vertices |
areAdjacent(vertex v1, vertex v2) | O(min(deg(v1), deg(v2))) as we traverse the shorter adjacency list of the two nodes and check all its edges |
IncidentEdges(v) | O(deg(v)) since the entire linked list of the node needs to be traversed |
The objective of traversal is to visit every vertex and edge in the graph. This helps find interesting sub-structures within the graph.
Graph traversal is quite different from tree traversal because there is no obvious order in the graph, no starting point, and possible cycles.
A minimum spanning tree of a graph is a spanning tree that has the minimum total edge weight across all possible spanning trees. By definition, an MST is connected and spans every vertex.
Consider an undirected graph with weighted edges. To get the MST with Kruskal’s algorithm, we will utilize a min-heap and a disjoint-set.
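A sketch of Kruskal's algorithm combining the two structures; the (weight, u, v) edge tuples and the plain path-halving find are illustrative choices:

```python
import heapq

def kruskal(num_vertices, edges):
    """edges: list of (weight, u, v) tuples; returns the MST edge list."""
    parent = list(range(num_vertices))

    def find(x):                      # disjoint-set find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    heap = list(edges)
    heapq.heapify(heap)               # min-heap ordered by edge weight
    mst = []
    while heap and len(mst) < num_vertices - 1:
        w, u, v = heapq.heappop(heap)
        ru, rv = find(u), find(v)
        if ru != rv:                  # edge joins two components: keep it
            parent[ru] = rv
            mst.append((u, v, w))
    return mst
```

On a 4-vertex cycle with one chord, the three cheapest non-cycle-forming edges are kept.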
The founding idea of this algorithm is a partition property of the graph:
Let U and V be any two sets partitioning the vertices of the graph G. The minimum-weight edge e connecting the two partitions is part of some MST.
We can build from this idea by starting from a single vertex as one partition:
This algorithm is very similar to Prim's algorithm except for the distance-update step. The distance of each neighbor is updated to the distance of the current vertex plus the weight of the connecting edge, if this newly calculated distance is less than the distance already present on the vertex.
At every iteration of the algorithm, the vertex with the least distance is picked from the min-heap and added to the connected graph created so far.
One major pitfall of Dijkstra's algorithm is that it does not work on graphs with negative edge weights. This is not limited to graphs with a negative-weight cycle; any negative edge weight can produce incorrect results.
The runtime of this algorithm is O(m + n * log(n)).
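The full algorithm can be sketched as follows, using Python's heapq as the min-heap; the adjacency-dict graph format is an assumption:

```python
import heapq

def dijkstra(graph, source):
    """graph maps a vertex to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)     # vertex with the least distance so far
        if u in visited:
            continue                   # stale heap entry, already finalized
        visited.add(u)
        for v, w in graph.get(u, []):
            nd = d + w                 # distance of self + connecting edge
            if nd < dist.get(v, float("inf")):
                dist[v] = nd           # update only when the new distance is smaller
                heapq.heappush(heap, (nd, v))
    return dist
```

On a small triangle graph, the path A-B-C of weight 3 beats the direct A-C edge of weight 4.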
Consider an undirected graph where all edges are equal in weight. We want to find the shortest path between two vertices A and F that passes through some fixed vertex L.
To get the shortest path, we can simply use BFS, since only the number of edges matters and not the weights. One simple approach is to first run BFS from A to get the shortest path from A to L, and then run it again to get the shortest path from L to F.
However, note that the shortest path from A to L is also the shortest path from L to A. Thus, when we run BFS from L, we get the shortest paths from L to F and from L to A (equivalently, A to L). Running BFS only once thus provides the required answer.
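Single-source BFS distances can be sketched as follows; running it once from L gives both required legs of the path:

```python
from collections import deque

def bfs_distances(graph, source):
    """graph maps a vertex to a list of neighbors; returns hop counts."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

The length of the shortest A-to-F path through L is then `dist["A"] + dist["F"]`.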