Study guide:
General:
To do:
Repeat the problem out loud, slowly but confidently.
Ask questions, check assumptions, know what the constraints are.
Talk out the problem!
Work out examples on the whiteboard, devising some kind of brute force
algorithm along the way.
Don't stay silent here.
byte -- 8 bits
E.g.: from −128 to 127.
short -- 16 bits
char -- unsigned 16 bits
int -- 32 bits
long -- 64 bits
Python supports 4 primary numeric types:
https://docs.python.org/2/library/stdtypes.html
^ (XOR)
x ^ 0 = x
x ^ 1 = ~x
~ = negation
x ^ x = 0
& (AND)
x & 0 = 0
x & 1 = x
x & x = x
| (inclusive OR)
x | 0 = x
x | 1 = 1
x | x = x
Swapping two values without a temporary variable:
a = a ^ b
b = a ^ b
a = a ^ b
E.g.: a = 2, b = 3; a = 0010, b = 0011.
a = a ^ b = 0010 ^ 0011 = 0001
b = a ^ b = 0001 ^ 0011 = 0010
a = a ^ b = 0001 ^ 0010 = 0011
a = 3, b = 2.
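The same swap in Python (just a sketch; in practice the idiomatic a, b = b, a is preferred):
a = 2; b = 3
a = a ^ b  # a = 0001
b = a ^ b  # b = 0010, i.e., the original a
a = a ^ b  # a = 0011, i.e., the original b
assert (a, b) == (3, 2)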
Common (and not-so-common) bit tasks:
Get the value of the bit at i in num. Specifically, return whether the ith bit is
1.
boolean getBit(int num, int i) {
    return ((num & (1 << i)) != 0);
}
First left-shift 1 by i to get something like 00100000. Then AND with num to clear (set to 0) all the bits except the ith bit. If that value is not 0, then the ith bit is 1 and return true. Else it's 0, so return false.
To be more precise, the ANDing doesn't quite clear. It just makes it so that there are only two cases: all the bits are 0, or all the bits except the ith bit are 0. In the first case, the ith bit must be 0; in the second, it must be 1.
Set num's ith bit to 1.
int setBit(int num, int i) {
    return num | (1 << i);
}
First left-shift 1 by i to again get something like 00100000. Then OR
with num so that the ith bit will become 1 and the other values will not
change. Recall that x OR-ing with 1 will always result in 1 and that x
OR-ing with 0 will not change x.
Clear num's ith bit. Specifically, set the ith bit to 0.
int clearBit(int num, int i) {
    int mask = ~(1 << i);
    return num & mask;
}
Do something like the inverse of set (naturally). Left-shift 1 by i to again
get something like 00100000. Then invert it so it looks like 11011111.
AND with num: ANDing with 1 doesn't change anything, so the only bit
that will change is the ith one, which will become a 0 if it isn't already.
Update num's ith bit with v. Specifically, set the ith bit to v.
int updateBit(int num, int i, int v) {
    int mask = ~(1 << i);
    return (num & mask) | (v << i);
}
The first part is equivalent to the clear bit method. Create a mask that
looks like 11011111 then AND with num to clear the ith bit. Next, left-
shift v by i bits so that v becomes a number where the ith bit is equal
to what v was and all the other bits are 0. Finally, OR with the modified
num (ith bit cleared) so that num's ith bit becomes 1 if v is 1 or is
otherwise left as 0--recall that no other bits in num are modified since
ORing with a 0 does not change the original.
Clear (unset) rightmost set bit. That is, change rightmost 1 to 0.
int clearRightmost(int num) {
    return num & (num - 1);
}
So, e.g., 12 (1100) becomes 8 (1000), since ANDing 1100 and 1011 gives 1000.
This forms the basis of the operation to determine parity.
Calculate parity. That is, return 1 (true) if the number of 1s is odd and 0
otherwise.
EPI 5.1. #link http://www.geeksforgeeks.org/write-a-c-program-to-find-the-parity-of-an-unsigned-integer/ and https://graphics.stanford.edu/~seander/bithacks.html#OperationCounting
Naive method:
boolean getParity(int num) {
    boolean parity = false;
    while (num != 0) {
        parity = !parity;
        num = num & (num - 1);
    }
    return parity;
}
Each time through the loop (until num is 0), clear the rightmost set bit and invert parity. After an odd number of clears, parity will be true / 1, so in the end, parity will be 1 if the number of 1s is odd and 0 otherwise.
Time complexity is the number of set bits.
We can also get the rightmost bit and xor with a 0/1 result. If 0,
result doesn't change. Otherwise, result flips.
Not-so-naive method uses a precomputed lookup table. So for
example the table could have the parities of the values from 0 through
15 inclusive. And then to find the parity of a 64 bit number, you would
go 16 bits at a time by right shifting and XORing the 16 bit parities
together.
This works because bitwise operations are associative (i.e., the
grouping doesn't matter).
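A sketch of the table-based idea, using 4-bit chunks to match a 0-15 table (the table size and helper name here are my choices, not from the source):
# Parity of every 4-bit value: parity(v) = parity(v >> 1) XOR lowest bit.
PARITY_TABLE = [0] * 16
for v in range(1, 16):
    PARITY_TABLE[v] = PARITY_TABLE[v >> 1] ^ (v & 1)

def parity(num):
    # XOR together the parities of each 4-bit chunk.
    result = 0
    while num:
        result ^= PARITY_TABLE[num & 0xF]
        num >>= 4
    return result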
Get the Hamming distance.
https://leetcode.com/problems/hamming-distance/#/description
def hammingDistance(self, x, y):
    # XOR leaves a 1 wherever the bits differ; count those 1s.
    z, ct = x ^ y, 0
    while z:
        ct += 1
        # Unsets the rightmost 1.
        z &= z - 1
    return ct
Arrays, Dictionaries, and Strings:
Hash tables are important! Note that insertion and find/lookup proceed through
identical starting steps: hash the key and go to that table location. This enables
the O(1) complexity for both.
Chaining is the common way to handle collisions. Just means each position
is actually the head of a linked list.
Open addressing is the other method. Deletions are tricky with this.
Java has built-in hashcodes.
Python's dictionaries are implemented using hash tables. They are simply arrays
whose indexes are obtained using a hash function on the keys.
Motivation:
"Think of a hash map as a "hack" on top of an array to let us use
flexible keys instead of being stuck with sequential integer "indices."
All we need is a function to convert a key into an array index (an
integer). That function is called a hashing function."
For the common case where I increment a key's count, defaulting to 0 if the key doesn't exist yet, use the get method:
my_dict[some_key] = my_dict.get(some_key, 0) + 1
Arrays offer easy random access but hard modification (have to move elements
around).
In the worst case a larger array has to be reallocated too (but amortizes out).
In Java:
Use StringBuilder for expensive string concatenation. And for easy string
reversal.
Fields and array elements are automatically initialized to 0 (local variables are not).
Remember that chars are also ints.
Python only really has lists; arrays are just thin wrappers over C arrays. Always use lists unless you have a really good explicit reason to need C-style arrays.
Linked Lists:
All a linked list really is is the head node (which points to the next node, and so
on).
To reverse a linked list, just need to remember to store next and update prev and curr pointers at each iteration:
nxt = curr.next
curr.next = prev
prev = curr
curr = nxt
To recursively reverse:
base case is a null node or a node whose next is null
remember to set head.next.next to head and then head.next = None
Deleting a node n is just setting the references to skip n: prev.next = n.next;
Can also delete a curr node even without a prev pointer: curr.data =
curr.next.data; curr.next = curr.next.next;
Basically, copy (and overwrite) next's data to curr and delete next.
Just make sure to check for the null pointer (!!) and be aware of the head pointer.
Linked lists offer easy modification but non-existent random access.
An important technique is the "runner" one:
Have two pointers iterating through the list, with one pointer either ahead
by a fixed amount or actually moving faster.
For example, to determine the midpoint of a linked list, have two pointers
such that one of them jumps 2 nodes at once and the other just 1. When the
fast pointer hits the end, the slow one will be in the middle of the list.
To determine if a linked list has a cycle, do the same thing where 1 pointer
moves ahead 2 nodes at a time. If there's a cycle, the runner will eventually
equal the normal curr node. If not, will hit null node.
https://www.wikiwand.com/en/Cycle_detection#/Tortoise_and_hare
def hasCycle(self, head):
    slow = fast = head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow == fast: return True
    return False
then, once fast == slow, reset slow to the head of the list and move both pointers forward 1 at a time. Once they're equal again, that's the start of the cycle. Then keep moving one pointer 1 at a time until they meet once more; the number of steps taken is the length of the cycle.
can also use this to detect duplicates in an array given certain constraints
http://keithschwarz.com/interesting/code/?dir=find-duplicate
https://leetcode.com/problems/find-the-duplicate-number/description/
def detectCycle(self, head):
    slow = fast = head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow == fast:
            # Meeting point found: reset slow to head, advance both
            # one at a time; where they meet is the cycle start.
            slow = head
            while slow != fast:
                slow = slow.next
                fast = fast.next
            return slow
    return None
Binary trees (branching factor = 2) are recursively built from nodes that have pointers to a left and right child.
Binary trees aren't necessarily binary search trees (BSTs), which require that the left children are less than or equal to the current node, which in turn is less than all the right children.
Traversals:
Depth-first traversal (DFS) prefixes are in reference to when the root/current
node is visited. So pre-order traversal means first root, then left children,
then right; post-order means first left, then right, then root. In-order traversal
will give a sorted list if it's a valid BST. Make sure to recurse fully when
traversing.
DFS generally will hit leaves first/faster.
In-order traversal pseudocode is simple:
inorder(left)
visit(root)
inorder(right)
Same for the other traversals:
visit(root)
preorder(left)
preorder(right)
Only one kind of breadth-first traversal—the level order traversal. This
traversal visits nodes by levels from top to bottom and from left to right.
Uses a queue.
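A minimal level-order sketch (assuming nodes with val/left/right attributes):
from collections import deque

def level_order(root):
    # Visit nodes level by level, left to right, using a FIFO queue.
    if not root: return []
    order, q = [], deque([root])
    while q:
        node = q.popleft()
        order.append(node.val)
        if node.left: q.append(node.left)
        if node.right: q.append(node.right)
    return order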
To choose between the two, we want to use DFS if we need to hit a leaf fast
and BFS if we're more concerned with a "middle" level closer to the root. But
if we need to print all leaf nodes, then by necessity they're the same.
Balanced trees are better. Note that balanced doesn't mean perfectly balanced.
Depth and height:
A node's depth = the number of edges from the node to the tree's root
node. A root node will have a depth of 0.
A node's height = the number of edges on the longest path from the node to a leaf. A leaf node will have a height of 0.
The depth and height of a tree are equal.
Full and complete is a tree where all leaves are at the bottom and each non-leaf node has exactly 2 children (sometimes called a perfect tree, as in http://www.cs.gettysburg.edu/~ilinkin/courses/Fall-2012/cs216/notes/bintree.pdf).
Rare, because it must then have exactly (2^n) - 1 nodes, where n is the number of levels, or (2^(h+1)) - 1, where h is the height.
Full = every node other than the leaves has 2 children. I.e., every node either
has 0 or 2 children.
Complete = every level (except maybe the last) is completely filled and all
nodes are as far left as possible. I.e., the tree is as compact as possible.
In a perfect tree, h = O(log n) because h = log2 (n + 1) − 1 and we drop the
less significant terms.
Algo tips:
For DFS, use a stack as always.
To validate BST, in-order traversal and comparing to sorted version is O(N lg
N)*, but the better O(N) way is to recurse down the tree and set floor and
ceiling conditions.
*Actually we can check if a list is sorted in O(N) time (we don't need to
resort) so same thing.
For lowest common ancestor (LCA), the LCA is just the first node that is in
between the keys I'm looking for. E.g., if a root node is greater than both key
nodes, then I know the LCA has to be on the left side of the tree.
In-order successor:
Important in a lot of binary tree operations.
The next node in an in-order traversal; i.e., the node with the smallest key
that's still greater than the current node's key.
Pseudo-pseudocode:
If current node has a right child, go to it then as far left as possible.
Must be smallest key in right subtree.
Otherwise, must travel up the tree (its left child / subtree is by
definition smaller, i.e., can't be successor):
If current node is the left child of its parent, then the parent must
be the successor.
Otherwise, keep going up until escape the right subtree and hit
the middle. I.e., when the current node is a left child, its parent
must be the in-order successor. We're looking for its ("upper-
right") parent.
Heaps:
A natural implementation of the priority queue ADT (the other being a
balanced binary tree).
The highest priority element (whether min or max) recursively sits at
the top of the heap.
Works by maintaining a partial ordering, so less expensive than fully sorted, but more valuable than no ordering.
A complete tree where the highest (or lowest) priority element is always
stored at the root—hence a heap. A max heap is one where a node is always
greater than or equal to its children; a min heap is one where a node is
always lesser than or equal to its children.
Are used to implement priority queues—queues where the ordering is
based on priority, not on position in queue—and to do heap sort.
There is no implied ordering within a level (that is, no ordering between
siblings) and so a heap is not a sorted structure.
Insertion and deletion (need to respect the heap ordering property) are both
O(log N) where N is the number of nodes.
(Binary) heaps are implemented in arrays.
Note that any binary tree can be so implemented, but that it makes
particular sense for heaps, because heaps are guaranteed to be
complete trees and thus the array implementation will be compact.
If a non-complete tree is implemented in an array, then there will be
empty slots because a node may have only 1 child and the tree may go
on quite a bit deeper than that level.
Saves space because can use array position to implicitly maintain the
ordering property, instead of storing pointers.
In the array implementation:
Let n be the number of elements in the heap and i be an arbitrary valid
index of the array storing the heap. If the tree root is at index 0, with
valid indices 0 through n − 1, then each element a at index i has:
children at indices 2i + 1 and 2i + 2
and its parent at floor((i − 1) / 2).
Alternatively, if the tree root is at index 1, with valid indices 1 through n, then each element a at index i has:
children at indices 2i and 2i + 1
and its parent at floor(i / 2).
1-basing sacrifices a tiny amount of space to simplify index arithmetic.
Up-heap (or sift-up) is used, e.g., to restore the heap ordering when a new element is appended:
Compare the added element with its parent; if they are in the correct order, stop.
If not, swap the element with its parent and return to the previous step.
Down-heap (or sift-down) is used, e.g., to restore the heap ordering
when the root is deleted/removed and replaced with the last element
(on the last level):
Compare the new root with its children; if they are in the correct
order, stop.
If not, swap the element with one of its children and return to the
previous step (for the newly ordered subtree). (Swap with its
smaller child in a min-heap and its larger child in a max-heap.)
So basically, swap with higher priority child.
Heapify, given a complete binary tree, turns the data structure into a
heap:
We want the heap property to obtain, so we need to ensure that
each parent node is greater (or lesser, if min heap) than its
children.
Starting at the last parent node (take the last node and use index
arithmetic to figure out the last parent), call down-heap on it. This
will make the subtree rooted at this parent node a heap.
Continue on for all the other parent nodes.
Mostly from visualgo.
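A sketch of down-heap and heapify on a 0-based array min-heap (the function names are my own):
def sift_down(a, i, n):
    # Swap a[i] with its smaller child until the subtree rooted
    # at i satisfies the min-heap property.
    while True:
        left, right, smallest = 2 * i + 1, 2 * i + 2, i
        if left < n and a[left] < a[smallest]: smallest = left
        if right < n and a[right] < a[smallest]: smallest = right
        if smallest == i: return
        a[i], a[smallest] = a[smallest], a[i]
        i = smallest

def heapify(a):
    # Start at the last parent and sift down, working back to the root.
    n = len(a)
    for i in xrange((n - 2) // 2, -1, -1):
        sift_down(a, i, n)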
Self-balancing trees:
A maximally unbalanced tree is just a linked list, where the height is 1 less than the number of nodes.
Specifically, a balanced tree has performance O(log n) (think binary
search), but an unbalanced tree has performance O(n) (think linked list).
Self-balancing trees thus guarantee balanced-ness and consequently
efficient performance.
Red-black trees:
http://www.eternallyconfuzzled.com/tuts/datastructures/jsw_tut_rbtree.aspx
AVL trees:
http://www.wikiwand.com/en/AVL_tree
Links:
http://www.wikiwand.com/en/Tree_(data_structure)
this is a pretty good general reference:
http://www.cs.cmu.edu/~adamchik/15-121/lectures/Trees/trees.html.
Graphs:
The graph ADT comprises a set of vertices (or nodes) together with a set of pairs
of vertices, the edges. A vertex can be any arbitrary abstract object.
Graphs are an extremely powerful concept. Linked lists, trees, state transition
diagrams, etc. are all just cases of graphs.
Terminology:
Direction:
Directed graph (or digraph) = a graph where the edges have a
direction associated with them. I.e., each edge is now an ordered pair
of vertices.
In conceptual terms, direction implies that relations between vertices
are not symmetric. For example, a graph of friends would be
undirected, since X being friends with Y implies that Y is friends with X.
And on the other hand, a graph of Twitter followers would be directed,
since X being a follower of Y does not imply that Y is a follower of X.
Directed acyclic graph (or DAG) = a digraph with no (directed) cycles.
https://www.wikiwand.com/en/Directed_acyclic_graph
Multiple (or parallel) edges = 2 or more edges that connect the same 2
vertices.
Loop = an edge that connects a vertex to itself.
Multigraph = a graph that can have multiple edges and (depending on
the convention) sometimes loops.
Simple graph = a graph that cannot have multiple edges or loops.
Connectivity:
Connected graph = graph with no vertices by themselves; i.e., there is
some path between every pair of vertices / every pair of the graph's
vertices is connected.
A connected component is a maximal subgraph that is (1) connected
and (2) not connected to any vertices in the rest of the supergraph.
Subgraph of a graph G = graph whose vertices are a subset of G's
vertices and whose edges are a subset of G's edges. Note that
this is subset, not proper subset.
Tracking the connected components of a changing graph is a
straightforward partitioning problem, which can be naturally
solved with the application of disjoint-set (or union-find) data
structures.
https://www.wikiwand.com/en/Disjoint-set_data_structure
Two paths between the same pair of vertices can have the same number of edges but different total weights (or distances) because their edges' weights differ.
Cut:
A partition of a graph's vertices into two disjoint subsets.
Also defines a cut-set, the set of edges that have 1 endpoint in each
subset of the partition. I.e., the edges that cross the cut (also called a
crossing edge). Sometimes a cut is identified as its cut-set.
The size / weight of a cut is the (total) weight of the cut-set, the edges
crossing the cut. If unweighted, the size is the number of such crossing
edges.
Degree of a node is the number of edges connected to it. If directed graph,
there is an indegree and an outdegree.
A Hamiltonian path is a path that visits each vertex exactly once. Similarly, a
Hamiltonian cycle (or circuit) is a Hamiltonian path that is a cycle.
Determining whether such a path exists in a graph is known as the NP-
complete Hamiltonian path problem.
Coloring:
A graph coloring is just assigning colors to the nodes in a graph.
A legal coloring is when no two adjacent / connected nodes have the
same color.
Finding the fewest number of colors we can use for a legal coloring
(the chromatic number) is an NP problem.
Three primary implementation methods / data structures:
https://www.wikiwand.com/en/Graph_(abstract_data_type) and
http://www.cs.rochester.edu/~nelson/courses/csc_173/graphs/implementation.html
Adjacency list:
https://www.wikiwand.com/en/Adjacency_list
Iterative:
def dfs(graph, start):
    visited, stack = set(), [start]
    while stack:
        vertex = stack.pop()
        if vertex not in visited:
            visited.add(vertex)
            stack.extend(graph[vertex] - visited)
    return visited
Recursive:
def dfs(graph, start, visited=None):
    if visited is None:
        visited = set()
    visited.add(start)
    for next in graph[start] - visited:
        dfs(graph, next, visited)
    return visited
Easy to iteratively implement with a stack. Just make sure to "mark"
visited nodes to prevent infinite loops.
Also easy to recursively implement (also via a stack, i.e., the program's
call stack).
Breadth-first search (BFS):
Breadth-first search is like throwing a stone in the center of a pond.
The nodes you explore "ripple out" from the starting point.
BFS explores the neighbor nodes first, before moving to the next
"level."
Pro: will always find the shortest path.
Con: usually requires more memory than DFS.
Pseudo-pseudocode:
Visit node r and then iterate through each of r's unvisited
neighbors.
When visiting a neighbor n, add its neighbors to queue, but don't
visit them yet.
Visit all of node r's neighbors before visiting any of r's "(grand)children."
Python code:
http://eddmann.com/posts/depth-first-search-and-breadth-first-search-in-python/
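In the same style as the DFS snippets above (graph as a dict mapping each vertex to a set of neighbors), a BFS sketch:
from collections import deque

def bfs(graph, start):
    visited, queue = set([start]), deque([start])
    while queue:
        vertex = queue.popleft()
        # Enqueue unvisited neighbors; they get visited level by level.
        for neighbor in graph[vertex] - visited:
            visited.add(neighbor)
            queue.append(neighbor)
    return visited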
The following method colors black all edges in the MST of any connected edge-weighted graph with V vertices: Starting with all edges colored gray, find a cut with no black edges, color its minimum-weight edge black, and continue until V-1 edges have been colored black.
http://algs4.cs.princeton.edu/43mst/
Prim's algorithm:
Prim's algorithm works by attaching a new edge to a single
growing tree at each step: Start with any vertex as a single-
vertex tree; then add V-1 edges to it, always taking next
(coloring black) the minimum-weight edge that connects a
vertex on the tree to a vertex not yet on the tree (a crossing
edge for the cut defined by tree vertices).
This is a relatively simple method, but the crucial step is
efficiently detecting the minimum-weight crossing edge to
be added at each step:
Lazy way:
Just maintain a priority queue (PQ) of the
crossing edges (priority is obviously by minimum
weight). Each time a crossing edge is added to
the MST, a vertex is also added. The new crossing
edges to be added to the PQ are just the edges
from the new vertex to all non-MST vertices.
However, there might be old edges in the PQ
which are now ineligible (i.e., they are non-
crossing and now connect two MST vertices). This
way is lazy since it leaves these edges in the PQ.
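A sketch of the lazy version with heapq (the adjacency representation, a dict mapping each vertex to a list of (weight, neighbor) pairs, is my assumption):
import heapq

def lazy_prim_mst_weight(graph, start):
    # Returns the total weight of an MST of a connected graph.
    visited, total = set([start]), 0
    pq = list(graph[start])  # crossing edges as (weight, vertex) pairs
    heapq.heapify(pq)
    while pq:
        weight, v = heapq.heappop(pq)
        if v in visited:
            continue  # ineligible edge that was lazily left in the PQ
        visited.add(v)
        total += weight
        for edge in graph[v]:
            if edge[1] not in visited:
                heapq.heappush(pq, edge)
    return total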
Eager way:
The first improvement is to delete ineligible
edges from the PQ, such that only crossing edges
from the growing MST to the rest of the graph
are present. But we can do even better. Since (1)
we are only concerned with the minimal crossing
edge and since (2) when we add a new edge /
vertex v to the MST, the only possible
improvement for each non-tree vertex w is that
adding v brings w closer to the tree, we really just
need to keep track of the minimum-weight edge
and check whether the addition of v to the tree
necessitates that we update that minimum
(because of an edge v-w that has lower weight),
which we can do as we process each edge.
Binary search:
def binary_search(nums, target):
    # Wrapper reconstructed; the "l = m / r = m" updates need the
    # l + 1 < r guard to terminate.
    l, r = 0, len(nums) - 1
    while l + 1 < r:
        m = l + ((r - l) / 2)
        curr = nums[m]
        if target == curr:
            return m
        elif target > curr:
            l = m
        else:
            r = m
    return l + 1
could also do:
def binary_search(nums, target):
    l, r = 0, len(nums) - 1
    while l <= r:
        m = l + (r - l) / 2
        curr = nums[m]
        if target == curr:
            return m
        elif target > curr:
            l = m + 1
        else:
            r = m - 1
    return -1
Bubble sort and selection sort suck. Both O(n^2).
Merge sort and quick sort are both good and, in the average case, O(n Log(n)).
Merge sort:
Worst case is O(n Log(n)).
From Algorithms class:
like quicksort in that it recursively splits array and builds it back in right
order (both also divide-and-conquer)
divide array into 2 halves, recursively sort halves, merge results
recursive calls then sorting action (contra quicksort); "on the way up";
only way sorting occurs is through these "on the way up" merges
pro: guaranteed N log N performance
con: extra space proportional to N (aux array for merging)
time efficiency: average, best, and worst are all O(N log N); does
require some aux space
All the work happens in the merging where consider both arrays and pick
the smallest element to copy to main array each time. Base case is merging
two 1-element arrays (or null arrays) because these are sorted by definition.
Simplified pseudocode:
if len(list) < 2: return list
sorted_first = mergesort(first_half)
sorted_second = mergesort(second_half)
return merge(sorted_first, sorted_second)
Most important part is merging the sorted lists. For this, just add smallest
element each time until one of the lists is empty, and then copy any
remaining elements over.
Usually stable.
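A sketch fleshing out that pseudocode:
def mergesort(lst):
    if len(lst) < 2: return lst
    mid = len(lst) // 2
    return merge(mergesort(lst[:mid]), mergesort(lst[mid:]))

def merge(a, b):
    # Take the smaller head element each time, then copy the rest.
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    return merged + a[i:] + b[j:]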
Quicksort:
Worst case is O(n^2), e.g., when the array is already sorted, but it has a small constant factor and can be done in-place. Can also mitigate / check for the worst degenerate case.
From Algorithms class:
array rearranged such that, when two subarrays sorted, the whole array
is ordered (contra merge where array is broken into 2 subarrays to be
sorted and then combined to make whole ordered array)
recursive calls happen after working on whole array
partition/pivot not necessarily in middle
Nor is the pivot necessarily the median value, which is what leads to the worst-case performance.
Difference is that in mergesort, the sorting action happens on the way up (when
ordered subarrays are merged), whereas in quicksort, the sorting happens on the
way down when the (sub)array is split and the low and high elements are
separated and the pivot element is placed in its correct final position.
"The difference between the two is basically which linear operation they
leverage. QuickSort is efficient because you can partition a list into items that
come before and items that come after a given item in linear time.
MergeSort is efficient because given two already sorted lists you can
combine the two, and maintain sorted order, in linear time."
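A minimal (not in-place) quicksort sketch to make the contrast concrete:
def quicksort(lst):
    if len(lst) < 2: return lst
    pivot = lst[len(lst) // 2]
    # The sorting action happens on the way down, at the split:
    # low and high elements are separated around the pivot.
    low = [x for x in lst if x < pivot]
    mid = [x for x in lst if x == pivot]
    high = [x for x in lst if x > pivot]
    return quicksort(low) + mid + quicksort(high)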
Dynamic Programming & Memoization:
Top-down recursion can be memory-intensive because of building up the call
stack. Can avoid by going bottom-up and using DP.
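A bottom-up, O(1)-space sketch in that spirit (climbing stairs / Fibonacci-style; the original example here was cut off in the export):
def climb_stairs(n):
    # f(n) = f(n-1) + f(n-2), computed iteratively with two variables
    # instead of a recursive call stack.
    prev1, prev2 = 1, 1
    for _ in xrange(n - 1):
        prev1, prev2 = prev1 + prev2, prev1
    return prev1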
Medium:
https://leetcode.com/problems/decode-ways/description/
code:
def numDecodings(self, s):
    # dp[i] = dp[i-1] if s[i] != '0'
    #         + dp[i-2] if 9 < int(s[i-2:i]) < 27
    if not s or s[0] == '0': return 0
    pre = cur = 1
    for i, c in enumerate(s[1:]):
        if c == '0': cur = 0
        if 9 < int(s[i:i+2]) < 27:
            cur = cur + pre
            pre = cur - pre
        else:
            pre = cur
    return cur
Knuth-Morris-Pratt:
the key is that if we have a prefix == suffix match in the substring up to pattern[i], where pattern[i] stopped matching with text[j], then the suffix suf must also be the last characters we saw in text, text[ j-len(suf) : j ]; and since this substring suf is equal to some prefix pre, we can slide the pattern over, match up pre with text[ j-len(suf) : j ], and start trying to match again at pattern[ len(pre) ] and text[j]
https://www.wikiwand.com/en/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
https://www.youtube.com/watch?v=GTJr8OvyEVQ
Bloom filters:
http://prakhar.me/articles/bloom-filters-for-dummies/
Tells you either that a value is possibly in the set or that it's definitely not in
the set.
I.e., some false positives, but guaranteed no false negatives. Probabilistic
data structure.
Trie and suffix trees:
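A common dict-of-dicts trie sketch (the function names and the '#' end-of-word marker are conventions I'm assuming, not from the source):
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for char in word:
            # Each node is a dict from character to child node.
            node = node.setdefault(char, {})
        node['#'] = True  # marks the end of a complete word
    return root

def has_prefix(root, prefix):
    node = root
    for char in prefix:
        if char not in node: return False
        node = node[char]
    return True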
Union-find / disjoint-set:
Network connectivity, connected components
Basic idea is that each distinct subset is identified by its root, so if two elements have the same root, they're in the same set; and to merge, just need to make one subset's root the other subset's root.
https://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
Useful for questions where I need to merge disjoint sets or network-like problems etc., like https://leetcode.com/problems/number-of-islands-ii/description/:
clear answer: https://leetcode.com/problems/number-of-islands-ii/discuss/75459/JavaPython-clear-solution-with-UnionFind-Class-(Weighting-and-Path-compression)
https://leetcode.com/problems/number-of-islands-ii/discuss/75468/Compact-Python.
also https://leetcode.com/problems/friend-circles/submissions/
Union-find template code:
class UnionFind(object):
    def __init__(self, n):
        self.count = n
        self.parent = [_ for _ in xrange(n)]
        self.rank = [1 for _ in xrange(n)]

    def find(self, k):
        while k != self.parent[k]:
            # Path compression: point k at its grandparent.
            self.parent[k] = self.parent[self.parent[k]]
            k = self.parent[k]
        return k

    def union(self, k1, k2):
        k1, k2 = self.find(k1), self.find(k2)
        if k1 == k2: return
        # ensure k1 is smaller component
        if self.rank[k1] > self.rank[k2]:
            k1, k2 = k2, k1
        self.parent[k1] = k2
        self.rank[k2] += self.rank[k1]
        self.count -= 1
A*:
LRU/LFU cache:
Least recently used item is invalidated / evicted when cache is full.
https://www.kunxi.org/blog/2014/05/lru-cache-in-python/
The idea here is to use an OrderedDict and, when returning an item, pop and re-insert it to maintain the order invariant. popitem(last=False) lets us pop off the oldest entry, i.e., in FIFO order.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache: return -1
        # Pop and re-insert to mark as most recently used.
        val = self.cache.pop(key)
        self.cache[key] = val
        return val

    def set(self, key, value):
        try:
            self.cache.pop(key)
        except KeyError:
            if len(self.cache) == self.capacity:
                # Evict in FIFO order: the least recently used entry.
                self.cache.popitem(last=False)
        self.cache[key] = value
Without OrderedDict, would have to create my own. Basically would have a normal dictionary and then a doubly linked list. The dictionary would map each key to its node; there would be dummy head and tail pointers; and I'd add to the tail.prev and remove/invalidate from the head.next: https://leetcode.com/problems/lru-cache/discuss/45926/Python-Dict-+-Double-LinkedList
like this:
class LLNode(object):
    def __init__(self, key=0, val=0):
        self.key = key
        self.val = val
        self.prev = None
        self.next = None

class LRUCache(object):
    def __init__(self, capacity):
        # dict mapping key to LLNode
        self.cache = {}

From an LFU variant, the eviction step:
            evicted = self.freq2node[self.least_freq].popitem(last=False)
            self.key2node.pop(evicted[0])
        else: self.remain -= 1
        self.least_freq = 1
Minimum edit distance:
https://www.youtube.com/watch?v=We3YDTzNXEk
Key ideas are that the edit distance of two strings x, y has to be at least the edit distance of their prefixes, e.g., x[:-1] and y; insertion and deletion are "symmetric"; if the current characters are the same, the edit distance at that point is just the edit distance of the two strings minus the current character; if not, it's the min of the diagonal, previous-column, and previous-row edit distances + 1 (to represent the operation itself)
i.e., assume I did a delete, so I look at the previous column for that edit distance, then add 1 for the current deletion; if I replaced, then it's the diagonal plus 1, etc.
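That recurrence as a DP-table sketch:
def edit_distance(x, y):
    m, n = len(x), len(y)
    # dist[i][j] = edit distance between x[:i] and y[:j].
    dist = [[0] * (n + 1) for _ in xrange(m + 1)]
    for i in xrange(m + 1): dist[i][0] = i  # i deletions
    for j in xrange(n + 1): dist[0][j] = j  # j insertions
    for i in xrange(1, m + 1):
        for j in xrange(1, n + 1):
            if x[i-1] == y[j-1]:
                # Same character: no new operation needed.
                dist[i][j] = dist[i-1][j-1]
            else:
                # Min of insert, delete, replace, + 1 for the operation.
                dist[i][j] = min(dist[i][j-1], dist[i-1][j],
                                 dist[i-1][j-1]) + 1
    return dist[m][n]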
Fisher-Yates shuffle:
import random

def inplace_shuffle(lst):
    n = len(lst)
    if n < 2: return lst
    for i in xrange(n - 1):
        # Pick a replacement from the remaining (unshuffled) suffix.
        repl = random.randint(i, n - 1)
        if i != repl: lst[i], lst[repl] = lst[repl], lst[i]
The naive algorithm of swapping every number with any other number fails
because the number of possible orderings it outputs is n^n, but the number
of actual distinct possible orderings is n!, so by the pigeonhole principle,
some distinct permutations / "buckets" will be overrepresented.
Dijkstra's:
Boyer-Moore majority vote algorithm:
O(n) time and O(1) space algo for finding majority element if it exists
used in LC #277 Find the Celebrity https://leetcode.com/problems/find-the-
celebrity/description/
https://www.wikiwand.com/en/Boyer%E2%80%93Moore_majority_vote_algorithm
Also in https://leetcode.com/problems/majority-element/description/
Basically have a candidate majority element and a count; increment the count if curr == candidate, if count is 0 then curr is the new candidate, else decrement the count.
Reverse a number (integer):
I.e., how to turn a number into its reverse without converting to a string.
Basically use mod (%) and integer division (//) to find each digit and
construct the reverse number.
def reverse(num):
    rev = 0
    while num:
        rev *= 10
        rev += num % 10
        num /= 10
    return rev
E.g. useful for testing numerical palindromes without converting to a string.
Kadane's algorithm (maximum subarray problem):
https://www.wikiwand.com/en/Maximum_subarray_problem
def max_subarray(A):
    max_ending_here = max_so_far = A[0]
    for x in A[1:]:
        max_ending_here = max(x, max_ending_here + x)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far
Sieve of Eratosthenes:
https://www.wikiwand.com/en/Sieve_of_Eratosthenes
https://leetcode.com/problems/count-primes/description/
def countPrimes(self, n):
    if n < 2: return 0
    arr = [1] * n
    # from 2 up to and including floor(sqrt(n))
    for i in xrange(2, int(n ** 0.5) + 1):
        # if prime
        if arr[i]:
            # mark multiples, starting at i squared
            for j in xrange(i ** 2, n, i):
                arr[j] = 0
    count = sum([1 for x in arr if x])
    # Subtract 2 because 0 and 1 are still marked but aren't prime.
    return count - 2
Heaps:
https://docs.python.org/2/library/heapq.html
Heaps can be very useful if I need to maintain some kind of min/max value,
but it needs to be dynamically updated over time among other possible
previous values. Notably of course, the invariant just applies to the start/min
of the list; the rest of the list remains unsorted.
Basically just initialize heap as a normal list (or use in-place heapify) and then use heapq.heappush(arr, x), heappop, and heapreplace to push, pop, and replace an item while maintaining the min-heap invariant.
Meeting room variant where getting fewest # of rooms needed:
https://leetcode.com/problems/meeting-rooms-ii/description/
intervals.sort(key=lambda x: x.start)
# keep track of min end time
heap = []
for n in intervals:
    # heap[0] is always the min end time
    # this means we can reuse this room but just need to update end
    if heap and n.start >= heap[0]:
        heapreplace(heap, n.end)
    # otherwise we need to allocate a new room
    else:
        heappush(heap, n.end)
return len(heap)
Common hard problem of merging k sorted (linked) lists (O(n k log(k)):
https://leetcode.com/problems/merge-k-sorted-lists/description/
From a word search (board DFS) solution; the surrounding function was cut off:
    if word_i + 1 == len(word): return True
    visited.add((x, y))
    # add backtracking node before forward nodes
    stk.append((x, y, word_i, True))
    cands = []
    for a, b in [(-1,0), (0,-1), (0,1), (1,0)]:
        xa, yb = x + a, y + b
        if (xa < 0 or xa >= len(board) or yb < 0 or yb >= len(board[0])
                or (xa, yb) in visited or board[xa][yb] != word[word_i+1]):
            continue
        stk.append((xa, yb, word_i+1, False))
    return False
N queens:
https://www.geeksforgeeks.org/n-queen-problem-backtracking-3/
http://www.thecodenote.com/2017/05/beginners-guide-to-solving-n-queens.html
code:
def isSafe(board, row, col):
    # Check this row on left side
    for i in range(col):
        if board[row][i] == 1:
            return False
    # Check upper diagonal on left side
    for i, j in zip(range(row,-1,-1), range(col,-1,-1)):
        if board[i][j] == 1:
            return False
    # Check lower diagonal on left side
    for i, j in zip(range(row,N,1), range(col,-1,-1)):
        if board[i][j] == 1:
            return False
    return True

def solveNQUtil(board, col):
    # base case: If all queens are placed
    # then return true
    if col >= N:
        return True
From the Android unlock patterns problem, the tail of the DFS:
        flags[prev] = False
        return num_patterns
    # many keys are symmetric
    return sum([dfs(1, i) * 4 + dfs(2, i) * 4 + dfs(5, i) for i in xrange(m, n + 1)])
DFS in detail:
Very common in island or flood fill-type problems. Make sure that I'm
resetting the node or point to avoid infinite loops. I.e., once something
passes the constraint, make sure I mark it in some way (could also just be a
visited set or flags[i] boolean arr).
https://leetcode.com/problems/flood-fill/description/
code:
def floodFill(self, image, sr, sc, newColor):
    def dfs(x, y, prev_color):
        # make sure we're within bounds
        if x < 0 or x >= len(image) or y < 0 or y >= len(image[0]):
            return
        # make sure we don't loop indefinitely
        if image[x][y] != prev_color or image[x][y] == newColor:
            return
        # fill ..
        image[x][y] = newColor
        # .. recursively
        dfs(x - 1, y, prev_color)
        dfs(x, y - 1, prev_color)
        dfs(x, y + 1, prev_color)
        dfs(x + 1, y, prev_color)
    dfs(sr, sc, image[sr][sc])
    return image
For island problems where I have to count, make sure I'm not double-
counting. Be careful with how I'm updating the return value. Either pass in
the value each time or just add incrementally, but not both.
https://leetcode.com/problems/max-area-of-island/description/
code:
def maxAreaOfIsland(self, grid):
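    # The body was cut off in the export; a DFS sketch in the same
    # spirit as the flood fill above (zero out visited land cells so
    # they aren't double-counted):
    def dfs(x, y):
        if (x < 0 or x >= len(grid) or y < 0 or y >= len(grid[0])
                or not grid[x][y]):
            return 0
        grid[x][y] = 0  # mark visited
        return (1 + dfs(x - 1, y) + dfs(x, y - 1)
                + dfs(x, y + 1) + dfs(x + 1, y))
    return max([dfs(i, j) for i in xrange(len(grid))
                for j in xrange(len(grid[0]))] or [0])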
Using a heap is kinda cheating, but can just use merge 2 linked lists function
recursively
code:
def mergeKLists(self, lists):
    if not lists: return None
    if len(lists) == 1: return lists[0]

    def merge(l1, l2):
        dummy = curr = ListNode(0)
        while l1 and l2:
            if l1.val <= l2.val:
                curr.next = l1
                l1 = l1.next
            else:
                curr.next = l2
                l2 = l2.next
            curr = curr.next
        curr.next = l1 or l2
        return dummy.next

    # Divide and conquer: merge each half, then merge the two halves.
    mid = len(lists) / 2
    l = self.mergeKLists(lists[:mid])
    r = self.mergeKLists(lists[mid:])
    return merge(l, r)
important for external sort when arrays can't fit into main memory:
https://www.wikiwand.com/en/External_sorting#/External_merge_sort
One example of external sorting is the external merge sort algorithm,
which is a K-way merge algorithm. It sorts chunks that each fit in RAM,
then merges the sorted chunks together.[1][2]
The algorithm first sorts M items at a time and puts the sorted lists back into external memory. It then recursively does a k-way merge on those sorted lists. To do this merge, B elements from each sorted list are loaded into internal memory, and the minimum is repeatedly outputted.
For example, for sorting 900 megabytes of data using only 100
megabytes of RAM:
Read 100 MB of the data in main memory and sort by some
conventional method, like quicksort.
Write the sorted data to disk.
Repeat steps 1 and 2 until all of the data is in sorted 100 MB
chunks (there are 900MB / 100MB = 9 chunks), which now need
to be merged into one single output file.
Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted
chunk into input buffers in main memory and allocate the
remaining 10 MB for an output buffer. (In practice, it might
provide better performance to make the output buffer larger and
the input buffers slightly smaller.)
Perform a 9-way merge and store the result in the output buffer.
Whenever the output buffer fills, write it to the final sorted file
and empty it. Whenever any of the 9 input buffers empties, fill it
with the next 10 MB of its associated 100 MB sorted chunk until
no more data from the chunk is available. This is the key step that
makes external merge sort work externally -- because the merge
algorithm only makes one pass sequentially through each of the
chunks, each chunk does not have to be loaded completely;
rather, sequential parts of the chunk can be loaded as needed.
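In Python, the k-way merge step itself can be sketched with heapq.merge, which lazily consumes k sorted iterators in one sequential pass each:
import heapq

# Three sorted runs standing in for sorted chunks read from disk.
chunks = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
merged = list(heapq.merge(*chunks))
assert merged == list(range(1, 10))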
Topological sort for DAGs / Kahn's algorithm:
https://en.wikipedia.org/wiki/Topological_sorting
First, find a list of "start nodes" which have no incoming edges and
insert them into a set S; at least one such node must exist in a non-
empty acyclic graph. Then:
L ← Empty list that will contain the sorted elements
S ← Set of all nodes with no incoming edge
while S is not empty do
    remove a node n from S
    add n to L
    for each node m with an edge e from n to m do
        remove edge e from the graph
        if m has no other incoming edges then
            insert m into S
if graph has edges then
    return error (graph has at least one cycle)
else
    return L (a topologically sorted order)
From the sliding-window template (this fragment shrinks the window from the left when it no longer fits the criteria):
        window[prev] -= 1
        if not window[prev]:
            del window[prev]
        j += 1
    max_len = max(i - j + 1, max_len)
return max_len
template explained here: https://leetcode.com/problems/find-all-anagrams-in-a-string/discuss/92007/sliding-window-algorithm-template-to-solve-all-the-leetcode-substring-search-problem
note that for the anagram problem (https://leetcode.com/problems/find-all-anagrams-in-a-string/) the window stays a constant size as it slides to the right. for the fruit problem, each time the window moves 1 to the right, if the window no longer fits the criteria, we shrink the window starting from the left end of it
Combinatorics:
permutations = order matters, arrangements
n permute r = n! / (n-r)!
combinations = order does not matter, selections of elements
n choose r = n! / (r! * (n-r)!)
Knuth-Morris-Pratt code (tail of the search loop; the rest was cut off):
        else: i += 1
    return -1
# Build next jump table.
def create_next(self, pattern):
    next_arr = [0] * len(pattern)
    pre_i, suf_i = 0, 1
    while suf_i < len(pattern):
        # Found prefix-suffix match.
        if pattern[pre_i] == pattern[suf_i]:
            next_arr[suf_i] = pre_i + 1
            pre_i, suf_i = pre_i + 1, suf_i + 1
        else:
            if pre_i:
                pre_i = next_arr[pre_i-1]
            else:
                next_arr[suf_i] = 0
                suf_i += 1
    return next_arr
Making change:
https://www.interviewcake.com/question/python/coin
Task:
Write a function that, given:
an amount of money
a list of coin denominations
computes the number of ways to make amount of money with coins of
the available denominations.
Code:
Bottom-up DP (tabulation):
def change_possibilities_bottom_up(amount, denominations):
    ways_of_doing_n_cents = [0] * (amount + 1)
    ways_of_doing_n_cents[0] = 1
    for coin in denominations:
        for higher_amount in xrange(coin, amount + 1):
            higher_amount_remainder = higher_amount - coin
            ways_of_doing_n_cents[higher_amount] += \
                ways_of_doing_n_cents[higher_amount_remainder]
    return ways_of_doing_n_cents[amount]
Maximum sum subarray / Kadane's algorithm:
https://leetcode.com/problems/best-time-to-buy-and-sell-stock/
Task:
find the contiguous subarray which has the largest sum
Algorithm:
"a scan through the array values, computing at each position the
maximum (positive sum) subarray ending at that position. This subarray
is either empty (in which case its sum is zero) or consists of one more
element than the maximum subarray ending at the previous position.
The algorithm only needs to keep track of the ending position because
the implied starting position is just after the last position at which the
sum went negative; a higher sum can always be found by dropping any
negative-sum prefix."
https://www.wikiwand.com/en/Maximum_subarray_problem
Wiki:
def max_subarray(A):
    max_ending_here = max_so_far = 0
    for x in A:
        max_ending_here = max(0, max_ending_here + x)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far
Not as clean:
from sys import maxint

def maxSubArraySum(a, size):
    max_so_far = -maxint - 1
    max_ending_here = 0
    for i in range(0, size):
        max_ending_here = max_ending_here + a[i]
        if max_so_far < max_ending_here:
            max_so_far = max_ending_here
        if max_ending_here < 0:
            max_ending_here = 0
    return max_so_far
Maximum size subarray sum equals k:
Task:
Given an array nums and a target value k, find the maximum length of a subarray that sums to k. If there isn't one, return 0 instead.
Maximum depth of binary tree:
https://leetcode.com/problems/maximum-depth-of-binary-tree/
Task:
Given a binary tree, find its maximum depth.
The maximum depth is the number of nodes along the longest path
from the root node down to the farthest leaf node.
Code:
Recursive:
# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, x):
#         self.val = x
#         self.left = None
#         self.right = None
class Solution:
    # @param {TreeNode} root
    # @return {integer}
    def maxDepth(self, root):
        # Base case: no node.
        if not root: return 0
        # Recursive case: the maximum of the depths of the left and
        # right sides plus the root depth (1).
        return max(Solution.maxDepth(self, root.left),
                   Solution.maxDepth(self, root.right)) + 1
Iterative:
from collections import deque

def maxDepth(self, root):
    """
    :type root: TreeNode
    :rtype: int
    """
    # Iterative method: count number of levels.
    if not root: return 0
    q, depth = deque(), 0
    q.append(root)
    while True:
        # Process one full level per pass.
        nodect = len(q)
        if not nodect: return depth
        depth += 1
        while nodect:
            node = q.popleft()
            if node.left: q.append(node.left)
            if node.right: q.append(node.right)
            nodect -= 1
Binary tree paths
https://leetcode.com/problems/binary-tree-paths/
Task:
Given a binary tree, return all root-to-leaf paths.
For example, given the following binary tree:
   1
 /   \
2     3
 \
  5
All root-to-leaf paths are:
["1->2->5", "1->3"]
Code:
DFS recursive:
def binaryTreePaths(self, root):
    """
    :type root: TreeNode
    :rtype: List[str]
    """
    if not root: return []
    res = []
    self.dfs(root, "", res)
    return res

def dfs(self, root, ret_str, res):
    # Leaf: the path is complete.
    if not root.left and not root.right:
        res.append(ret_str + str(root.val))
    if root.left:
        self.dfs(root.left, ret_str + str(root.val) + '->', res)
    if root.right:
        self.dfs(root.right, ret_str + str(root.val) + '->', res)
DFS iterative (with stack):
def binaryTreePaths1(self, root):
    if not root:
        return []
    res, stack = [], [(root, "")]
    while stack:
        node, ls = stack.pop()
        if not node.left and not node.right:
            res.append(ls + str(node.val))
        if node.right:
            stack.append((node.right, ls + str(node.val) + "->"))
        if node.left:
            stack.append((node.left, ls + str(node.val) + "->"))
    return res
BFS (with queue):
def binaryTreePaths2(self, root):
    if not root:
        return []
    res, queue = [], collections.deque([(root, "")])
    while queue:
        node, ls = queue.popleft()
        if not node.left and not node.right:
            res.append(ls + str(node.val))
        if node.left:
            queue.append((node.left, ls + str(node.val) + "->"))
        if node.right:
            queue.append((node.right, ls + str(node.val) + "->"))
    return res
Single number:
https://leetcode.com/problems/single-number/
Task:
Given an array of integers, every element appears twice except for one.
Find that single one.
Note:
Your algorithm should have a linear runtime complexity. Could you
implement it without using extra memory?
Code:
My non-xor one:
def singleNumber(self, nums):
    """
    :type nums: List[int]
    :rtype: int
    """
    # 1, 1, 2, 3, 3
    # 1, 2, 2
    # 1, 1, 2, 2, 3
    nums.sort()
    for i in xrange(0, len(nums) - 1, 2):
        # The first pair mismatch marks the single number.
        if nums[i] != nums[i+1]:
            return nums[i]
    # If odd number of elements and on last element, must be the odd
    # man out.
    return nums[-1]
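The canonical approach is XOR, since x ^ x = 0 and x ^ 0 = x, so paired elements cancel out (a sketch):
def singleNumber(self, nums):
    # Pairs cancel via XOR; only the unpaired number survives.
    return reduce(lambda acc, n: acc ^ n, nums, 0)
Reverse linked list:
https://leetcode.com/problems/reverse-linked-list/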
Task:
Reverse a singly linked list.
Hint: A linked list can be reversed either iteratively or recursively. Could
you implement both?
Code:
Concise iterative:
def reverseList(self, head):
    """
    :type head: ListNode
    :rtype: ListNode
    """
    prev = None
    curr = head
    # Change one link at a time.
    while curr is not None:
        # Hold on to next node since will redirect link.
        next = curr.next
        # Reverse link direction.
        curr.next = prev
        # Update prev pointer.
        prev = curr
        # Update curr pointer for next iteration.
        curr = next
    # Not curr since will be None.
    return prev
Concise recursive:
def reverseList(self, head, prev=None):
    """
    :type head: ListNode
    :rtype: ListNode
    """
    if not head: return prev
    # Reverse links, making sure to hold on to the next node.
    curr = head.next
    head.next = prev
    return self.reverseList(curr, head)
Worked out example:
# 4 -> 1 -> 3 -> None
# None <- 4 <- 1 <- 3
# = 3 -> 1 -> 4 -> None
#
# curr = 1
# head.next = 4.next = None -- None <- 4; curr = 1 -> 3 -> None
# return reverseList(curr, head) = reverseList(1 -> 3 -> None, None <- 4)
# curr = 3
# head.next = 1.next = 4 -- None <- 4 <- 1; curr = 3 -> None
# return reverseList(curr, head) = reverseList(3 -> None, None <- 4 <- 1)
# curr = None
# head.next = 3.next = 1 -- None <- 4 <- 1 <- 3; curr = None
# return reverseList(None, None <- 4 <- 1 <- 3)
# not head: return prev = None <- 4 <- 1 <- 3
Naive brute force iterative:
def reverseList(self, head):
    ...
Valid parentheses:
Code:
def isValid(self, s):
    map = {'(': ')', '{': '}', '[': ']'}
    # Use a list as a stack.
    stack = []
    for i, c in enumerate(s):
        if c in map:
            stack.append(c)
        elif c in map.values():
            # Also fails if stack is empty -> too many closing parens.
            if not stack or map[stack.pop()] != c: return False
    return not stack
Remove duplicates from unsorted singly linked list:
CTCI 2.1
Code:
With buffer / auxiliary data structure:
from linkedlist import Node

def remove_dupes(head):
    # Edge case where single node list.
    if not head.next_node: return head
    curr, prev = head, Node()
    dupes = {}
    while curr:
        if curr.data in dupes:
            prev.next_node = curr.next_node
        else:
            dupes[curr.data] = 1
            prev = curr
        curr = curr.next_node
Without buffer:
# Now do it without another data structure.
def remove_dupes2(head):
if not head.next_node: return head
curr, runner = head, head.next_node
# Curr, not curr.next, because otherwise would miss last
node.
while curr:
runner = curr
while runner.next_node:
# Compare using runner.next because no prev
pointer.
if curr.data == runner.next_node.data:
runner.next_node =
runner.next_node.next_node
else:
runner = runner.next_node
curr = curr.next_node
Two sum:
https://leetcode.com/problems/two-sum/
Task:
Given an array of integers, return indices of the two numbers such that
they add up to a specific target.
You may assume that each input would have exactly one solution.
Example:
Given nums = [2, 7, 11, 15], target = 9,
Because nums[0] + nums[1] = 2 + 7 = 9,
return [0, 1].
Code:
def twoSum(self, nums, target):
Task:
You are given two non-empty linked lists representing two non-
negative integers. The digits are stored in reverse order and each of
their nodes contain a single digit. Add the two numbers and return it as
a linked list.
You may assume the two numbers do not contain any leading zero,
except the number 0 itself.
Input: (2 -> 4 -> 3) + (5 -> 6 -> 4)
Output: 7 -> 0 -> 8
Code:
def addTwoNumbers(self, l1, l2):
    """
    :type l1: ListNode
    :type l2: ListNode
    :rtype: ListNode
    """
    l1_str = ''
    l2_str = ''
    while l1.next != None:
        l1_str += str(l1.val)
        l1 = l1.next
    l1_str += str(l1.val)
    while l2.next != None:
        l2_str += str(l2.val)
        l2 = l2.next
    l2_str += str(l2.val)
    # Digits are stored in reverse order, so un-reverse for the numbers.
    l1_num, l2_num = int(l1_str[::-1]), int(l2_str[::-1])
    sum = l1_num + l2_num
    sum_str_rev = str(sum)[::-1]
    # Need a copy of root node to be able to return at the end since
    # modifying while iterating.
    root = ListNode(0)
    curr = root
    for c in sum_str_rev:
        l_curr = ListNode(int(c))
        curr.next = l_curr
        curr = curr.next
    return root.next
Sparse matrix multiplication:
https://leetcode.com/problems/sparse-matrix-multiplication/
Task:
Given two sparse matrices A and B, return the result of AB.
You may assume that A's column number is equal to B's row number.
Example:
A = [
  [ 1, 0, 0],
  [-1, 0, 3]
]
B = [
  [ 7, 0, 0 ],
  [ 0, 0, 0 ],
  [ 0, 0, 1 ]
]
     |  1 0 0 |   | 7 0 0 |   |  7 0 0 |
AB = | -1 0 3 | x | 0 0 0 | = | -7 0 3 |
                  | 0 0 1 |
Code:
def multiply(self, A, B):
    """
    :type A: List[List[int]]
    :type B: List[List[int]]
    :rtype: List[List[int]]
    """
    # Matrix is a list of lists where each list is a row.
    if not A or not B: return None
    AB = [[0 for _ in B[0]] for _ in A]
    # Iterate through rows of A.
    for i, row_a in enumerate(A):
        # Iterate through cols of A (= rows of B).
        for j, ele_a in enumerate(row_a):
            # Skipping zero entries is the sparse optimization.
            if ele_a:
                # Iterate through cols of B's row j.
                for k, ele_b in enumerate(B[j]):
                    if ele_b:
                        AB[i][k] += ele_a * ele_b
    return AB
ZigZag conversion:
https://leetcode.com/problems/zigzag-conversion/
Task:
The string "PAYPALISHIRING" is written in a zigzag pattern on a given
number of rows like this: (you may want to display this pattern in a
fixed font for better legibility)
P A H N
A P L S I I G
Y I R
And then read line by line: "PAHNAPLSIIGYIR"
Write the code that will take a string and make this conversion given a
number of rows:
string convert(string text, int nRows);
convert("PAYPALISHIRING", 3) should return "PAHNAPLSIIGYIR".
Code:
def convert(self, s, numRows):
    # The key is that the distance down then up (or up then down) to
    # get to the next letter on the current row is symmetric
    # regardless of the horizontal distance traveled / the diagonal.
    linenum = 1
    ret_str = ''
    ind = 0
    # Have to alternate strategy for middle lines.
    mid_cycle_index = 1
    if numRows == 1: return s
    for i in xrange(len(s)):
        ret_str += s[ind]
        # First line.
        if linenum == 1:
            ind += 2 * (numRows - 1)
        elif linenum == numRows:
            ind += 2 * (linenum - 1)
        else:
            if mid_cycle_index % 2 == 1:
                ind += 2 * (numRows - linenum)
            else:
                ind += 2 * (linenum - 1)
            mid_cycle_index += 1
        # Go to next line.
        if ind >= len(s):
            linenum += 1
            ind = linenum - 1  # ind is 0-based, linenum 1-based
            mid_cycle_index = 1
    return ret_str
Meeting rooms:
https://leetcode.com/problems/meeting-rooms/
Task:
Given an array of meeting time intervals consisting of start and end
times [[s1,e1],[s2,e2],...] (si < ei), determine if a person could attend all
meetings.
For example,
Given [[0, 30],[5, 10],[15, 20]],
return false.
Code:
# Definition for an interval.
# class Interval(object):
#     def __init__(self, s=0, e=0):
#         self.start = s
#         self.end = e
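The solution code was cut off here; a minimal sketch (sort by start time, then no meeting may start before the previous one ends):
def canAttendMeetings(self, intervals):
    intervals.sort(key=lambda x: x.start)
    for prev, curr in zip(intervals, intervals[1:]):
        if curr.start < prev.end: return False
    return True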
Iterative in-order traversal with dummy nodes:
def inorder_traversal(root):
    stack = [root]
    while stack:
        curr = stack.pop()
        if not curr.left and not curr.right:
            print curr.val
        else:
            if curr.right: stack.append(curr.right)
            # Dummy leaf node so next time will print right away.
            stack.append(Node(curr.val))
            if curr.left: stack.append(curr.left)
Rotate a matrix 90 degrees:
A[:] = zip(*A[::-1])
Reverse then transpose (rows as cols and vice versa).
Stacks are easily implemented with lists: just use append() for push and pop() for
pop.
Append adds to end of list, pop (without explicit index) removes and returns from end of list.
Queues, however, should be implemented with deques, with append() for queue
and popleft() for dequeue.
Re: popleft(), Imagine the list going from right-to-left, where the right is the
end. (I.e., add elements at the right).
To initialize a deque with a tuple, need to do deque([(x, y)]):
but can append with just q.append((x, y))
If don't want to use a deque, can just do pop(0).
equivalent to popleft()
When updating / incrementing a dictionary, use get to simplify the null check
(whether the key already exists):
for c in s:
freqs[c] = freqs.get(c, 0) + 1
or: my_dict[some_key] = my_dict.get(some_key, 0) + 1
the set equivalent is my_dict.setdefault(some_key, default), but most of the
time, can either use (preferably) defaultdict or get
Dictionary comprehension and sorting:
iterating over a dict only iterates over its keys
i.e., for i in dict: i will be each key, not each key-value pair
similarly, to test for key membership, just do: if key in dict
to get the actual key-value items:
for i, j in dict.iteritems()
applies similarly for comprehensions: new_d = {k: v+1 for k, v in
d.iteritems()}
sorted list of items (k-v tuples) by key:
sorted(d.iteritems())
sorted list of keys:
sorted(d)
sorted list of items (tuples) by value:
sorted(d.iteritems(), key=lambda x: x[1])
sorted list of keys by value:
sorted(d, key=d.get)
When using reduce, might need to have an initializer / default value (e.g., when it's
an empty list):
prod = prev_prod * reduce(lambda x, y: x*y, list_ints[i+1:], 1)
Deep vs. shallow copying: https://www.wikiwand.com/en/Object_copying
https://www.cs.utexas.edu/~scottm/cs307/handouts/deepCopying.htm
Shallow copies duplicate as little as possible. A shallow copy of a collection is a copy of the collection structure, not the elements. With a shallow copy, two collections now share the individual elements, so a change to an underlying element shows up in the copy as well.
Deep copies duplicate everything. A deep copy of a collection is two
collections with all of the elements in the original collection duplicated. This
means that the elements are unique to each copy.
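A small demo with the copy module:
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)       # new outer list, shared inner lists
deep = copy.deepcopy(a)      # inner lists duplicated too

a[0].append(99)
print shallow[0]             # [1, 2, 99] -- shared element changed
print deep[0]                # [1, 2]     -- deep copy unaffected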
Can't use a list as a dict key (can use tuple, but can't have a mutable type inside
like a list):
Keys need to be immutable to be hashable.
Also note that extend() and append() mutate in place and return None, so this doesn't work:
c1 = [0].extend(curr)
c2 = curr.append(0)
This does work:
for i in xrange(2, k + 1):
    c1 = [0] + curr
    c2 = curr + [0]
    curr = [a + b for a, b in zip(c1, c2)]
Can just convert a string into a set of the chars directly: str_set = set(the_str)
In any case where I need to do 2 for loops going forward then backward (2
pointers), remember that I can always construct the reverse pointer index from the
"forward" one, i.e.:
for i in xrange(len(nums)):
j = -1 - i
will give me the opposite end pointer each time
can also do: new_j = -j - 1
Be careful when initializing a matrix:
do: self.board = [[None] * n for _ in xrange(n)]
not: self.board = [[None] * n] * n
The second way makes copies of / references to the inner list such that any
modification to one row changes the other rows
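Demonstrating the aliasing bug:
grid = [[None] * 2] * 2      # both rows reference the same inner list
grid[0][0] = 'x'
print grid                   # [['x', None], ['x', None]]

grid = [[None] * 2 for _ in xrange(2)]
grid[0][0] = 'x'
print grid                   # [['x', None], [None, None]]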
To modify a list in place, assign to nums[:]; using nums = ... just rebinds the name to a new list.
One common Python gotcha involves default mutable arguments:
If we try to do something like: def func(x, arr=[]):
"A new list is created once when the function is defined, and the same list is used in each successive call. Python's default arguments are evaluated once when the function is defined, not each time the function is called (like it is in, say, Ruby). This means that if you use a mutable default argument and mutate it, you will have mutated that object for all future calls to the function as well."
http://docs.python-guide.org/en/latest/writing/gotchas/
Instead, do:
def func(x, arr=None):
    if arr is None: arr = []
To compare dictionaries (i.e., whether the keys and values are the same), just use
==. Dictionaries are not ordered so it works.
or shortcircuits, so can do things like: cur.next = l1 or l2
This will add the rest of linked list 1 if it's not null, otherwise add l2.
Beware of using IndexError to bounds check, since negative indices are valid and
will just wrap around.
Note that abstract data types (e.g., stack, queue) are distinct from data structures
(e.g., array, heap). The former is a mathematical model, whereas the latter is a
concrete representation of data from the perspective of the implementer, not the
user.
From Wiki.
For example, a list is not the same sort of thing as a linked list -- one is an
ADT, the other a data structure.
Further, abstract data types specify a sort of contract: what operations can
be done (e.g., pop, push), what constraints exist on those operations (e.g.,
dequeue returns the first element), and what sort of data on which the
operations act (e.g., nodes). Data structures, then, are just the
implementations of ADTs.
Remember that algorithm efficiency analysis (i.e., Big O) represents the upper
bound of an algorithm's running time. Only the highest order term matters,
disregarding constants. For example, natural log running time has the same
complexity as logarithmic (base 10) running time: O(log n).
I.e., the runtime is expressed in terms of how quickly it grows relative to the
input, as the input gets arbitrarily large.
http://stackoverflow.com/questions/2307283/what-does-olog-n-mean-
exactly
Logarithms of all bases grow at the same asymptotic rate: by the change-of-base formula, log_a(n) = log_b(n) / log_b(a), so they differ only by a constant factor.
http://stackoverflow.com/questions/17121006/confusion-around-big-o-
with-natural-logs
Remember that parameters are passed by assignment (by reference but
references passed as values), so if pass a mutable argument (like list), can mutate
it in a function and have the changes reflected elsewhere / outside the scope.
http://stackoverflow.com/questions/986006/how-do-i-pass-a-variable-by-reference
In a nested list comprehension, the order of for- and if-expressions is the same as
it would be in a loop format.
I.e., if trying to duplicate:
for row in grid:
for x in row:
would just be:
[x for row in grid for x in row]
LinkedList:
remove(element/index)
get(index)
getFirst()
Dequeue if implementing queue.
list.size()
Stack:
empty()
peek()
pop()
push(element)
search(element)
Returns the 1-based offset from the top of the stack (-1 if not found).
Arrays.sort(primitiveArray);
Collections.sort(arrayList);
Primitive arrays have to be initialized with a capacity, ArrayLists don't -- they
dynamically grow.
currChar = theStr.charAt(i);
From char to String and back:
String s = String.valueOf(c);
char c = s.charAt(0);
char[] chars = s.toCharArray();
From int to String and back:
String str = String.valueOf(foo);
Note that this is the same as above!
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
When using a subquery with conditional logic, ensure the subquery is either
scalar (returns just 1 value) or the operator is IN:
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE Date = (SELECT MIN(date)
FROM tutorial.sf_crime_incidents_2014_01
)
or
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE Date IN (SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
)
Case details:
Useful to "bucket" values in a column (e.g., output 1 when the value is
X, 0 otherwise).
The CASE statement always goes in the SELECT clause
CASE must include the following components: WHEN, THEN,
and END. ELSE is an optional component.
You can make any conditional statement using any conditional
operator (like WHERE) between WHEN and THEN. This includes
stringing together multiple conditional statements using AND and OR.
You can include multiple WHEN statements, as well as
an ELSE statement to deal with any unaddressed conditions.
Example:
SELECT player_name,
CASE WHEN year = 'FR' AND position = 'WR' THEN 'frosh_wr'
WHEN year = 'SO' AND position = 'WR' THEN 'soph_wr'
ELSE NULL END AS sample_case_statement
FROM benn.college_football_players
Window functions:
E.g., show each state's population alongside the national total:
SELECT name AS state_name,
       popestimate2015 AS state_population,
       SUM(popestimate2015) OVER () AS national_population
FROM population
WHERE state > 0 -- only state-level rows
ORDER BY name
Without window function, would have to either get the national
population by itself (without the state-level information) or use a
subquery:
SELECT name AS state_name,
popestimate2015 AS state_population,
x.national_population
FROM population,
(
SELECT SUM(popestimate2015) AS national_population
FROM population
) x
WHERE state > 0 -- only state-level rows
ORDER BY name
Can apply conditions with PARTITION BY (a la GROUP BY) and ORDER
BY:
PARTITION BY specifies the subset of rows considered (in the
window).
ORDER BY orders the rows in the specified window frame, useful
for RANK() and running sum-type information.
ORDER BY can act as an implicit partition:
Query:
SELECT start_time, duration_seconds,
       SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
FROM tutorial.dc_bikeshare_q1_2012
Results:
start_time | duration_seconds | running_total
00:04:00   | 475              | 475
00:10:00   | 1162             | 2782
00:10:00   | 1145             | 2782
00:15:00   | 485              | 3738
00:15:00   | 471              | 3738
00:17:00   | 358              | 4096
00:18:00   | 1754             | 5850
Also note that the sum "goes up to" the current row based on the column being ordered by; that is, the sum is the current row plus all previous rows (hence running total). Tied rows (same start_time) share the same running total because the default window frame is RANGE, which includes all peers of the current row.
Evaluated after the join, group, and having clauses, at the same time as
other select statements—meaning window functions can't refer to
other fields in the select statement.
Running average of daily active users example:
Daily active users (DAU):
select date(created_at), count(distinct user_id)
from activity
group by 1
Lots of noise, so let's do a daily running average:
https://workflowy.com/s/study-guide/RD5kZ682pWX5oxiE 88/112
4/25/2021 Study guide: - WorkFlowy
select
date(created_at) d,
avg(count(distinct user_id)) over (
order by d
rows between 7 preceding and current row
)
from activity
group by 1
Specific unique window functions:
RANK vs DENSE_RANK vs ROW_NUMBER:
| V | ROW_NUM | RANK | DENSE_RANK |
|---|---------|------|------------|
| a | 1       | 1    | 1          |
| a | 2       | 1    | 1          |
| a | 3       | 1    | 1          |
| b | 4       | 4    | 2          |
| c | 5       | 5    | 3          |
| c | 6       | 5    | 3          |
| d | 7       | 7    | 4          |
| e | 8       | 8    | 5          |
NTILE:
You can use window functions to identify what percentile (or
quartile, or any other subdivision) a given row falls into. The
syntax is NTILE(*# of buckets*). In this case, ORDER BY determines
which column to use to determine the quartiles (or whatever
number of ‘tiles you specify).
LAG / LEAD:
It can often be useful to compare rows to preceding or following
rows, especially if you’ve got the data in an order that makes
sense. You can use LAG or LEAD to create columns that pull
values from other rows—all you need to do is enter which column
to pull from and how many rows away you’d like to do the pull.
LAG pulls from previous rows and LEAD pulls from following rows:
Using count and case:
SELECT
COUNT(CASE WHEN rsp_ind = 0 then 1 ELSE NULL END) as
"New",
1:1 relationships are rare; usually the two tables should just be merged. If not, use a primary key : unique foreign key setup.
1:1 = PK:FK (one record on the FK side holds the PK value).
A normalized (no dupes) schema is better for writes; denormalized is better for reads.
LIMIT:
SELECT * FROM tbl LIMIT 5,10;
returns rows 6-15
The offset (the first number) is 0-indexed.
Evaluation order: GROUP BY clause, then HAVING clause, then SELECT clause, then ORDER BY clause.
You can reference a SELECT alias in the ORDER BY because ORDER BY is evaluated after SELECT.
deptno) values
(7369, 'smith', 'clerk', 7902, '1980-12-17', 800, null, 20),
(7499, 'allen', 'salesman', 7698, '1981-02-20', 1600, 300, 30),
(7369, 'smith', 'clerk', 7902, '1980-12-17', 800, null, 20),
MapReduce:
purpose:
scalable, parallelizable computations on big data
abstracts away looping
simple to write programs, just need to supply mapper and reducer
functions
algorithm:
take a bunch (a list) of data items
system/master program splits list and gives a sublist to each (mapper)
node
mapper takes its list and applies some map function to it that
transforms the list of items into a list of key, value pairs
list of (k,v) pairs output to some buffer
system takes the lists of (k,v) pairs and shuffles/partitions them by key so that it can give
each reducer node a list where all the keys are the same
reducer aggregates values (since system ensured they all share the
same key)
example: see the word-count sketch below
operates on a file in HDFS in a similar way: in HDFS, a file is split into chunks, or blocks, so many mapper nodes can work on different blocks of the same file; then, reducer nodes can do the same. Scales essentially linearly.
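A toy word-count sketch of the mapper/reducer contract (plain Python, no Hadoop; it mimics the phase structure, not a real framework API):
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # transform one data item into a list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    # aggregate all values that share the same key
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the end"]
pairs = [kv for line in lines for kv in mapper(line)]        # map phase
pairs.sort(key=itemgetter(0))                                # shuffle/sort phase
counts = [reducer(k, [v for _, v in group])                  # reduce phase
          for k, group in groupby(pairs, key=itemgetter(0))]
print counts   # [('brown', 1), ('dog', 1), ..., ('the', 3)]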
Tez is Apache's MapReduce successor:
better performance
intermediate data doesn't need to be written to HDFS (can be
cached in-memory)
YARN containers can be reused
simpler "API"
computations expressed as workflow DAG vs multiple MR jobs
supports Hive, Pig, etc.
Hive:
Performance tips:
https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
SELECT qId, cId, vId
FROM answer
LATERAL VIEW explode(vIds) visitor AS vId
WHERE cId = 2
“vIds” is the field we're exploding, “visitor” is the LATERAL VIEW
TABLE alias and “vId” is the new column alias so that it can be
selected in the query.
To get:
qId  cId  vId
1    2    v1
1    2    v2
1    2    v3
Spark:
Spark is in-memory, more functional since it uses RDDs that distinguish
between transformations and actions (enabling lazy execution/evaluation),
and has playback abilities (owing to the lazy evaluation).
In-memory as in it doesn't write intermediate results to disk/HDFS like map-
reduce does.
Dimensional Modeling:
Facts vs. dimensions:
Facts are (almost always) numeric measurements / low-level metrics we
use for calculation purposes
Represent a granular measurement event
Dimensions are the (usually non-numeric) attributes we use to filter or
group by
The "by" nouns, e.g., "revenue by product", so product is a
dimension
Facts = business process metrics; dimensions = descriptive attributes
A fact table has foreign keys to dimension tables
In a star schema, key assumption is that the fact::dimension relationship is
many::1, e.g., many transactions can have a shared product dimension
If many::many, then aggregations will be wonky (think about the
cardinalities in the join: 1 * N = N, but M * N = MN)
Generally, need to represent many::many relationships as two many::1 and
1::many relationships using a bridge/junction table:
https://www.dropbox.com/s/kra73kh4l9tm468/Screenshot%202017-03-
04%2016.24.31.png?dl=0
"Think about a simple relationship like the one between Authors and
Books. An author can write many books. A book could have many
authors. Now, without a bridge table to resolve the many-to-many
relationship, what would the alternative be? You'd have to add multiple
Author_ID columns to the Books table, one for each author. But how
many do you add? 2? 3? 10? However many you choose, you'll
probably end up with a lot of sparse rows where many of the Author_ID
values are NULL and there's a good chance that you'll run across a case
where you need "just one more." So then you're either constantly
modifying the schema to try to accommodate or you're imposing some
artificial restriction ("no book can have more than 3 authors") to force
things to fit."
In other words, with a many::many relationship, there's no meaningful
parent-child relationship, no way to place foreign keys that makes
sense. So with a bridge/junction/linking table, each record of the
bridge table represents each unique combination of, say, author and
book. There can be many authors and many books, but each
combination only shows up once in the bridge.
select *
from author a join author_book ab on a.id = ab.author_id
join book b on b.id = ab.book_id;
Database modeling notation:
https://www.dropbox.com/s/dsld9gs3frocxxi/Screenshot%202017-03-
04%2016.34.02.png?dl=0
https://www.dropbox.com/s/24l3nwsogzxgxhm/Screenshot%202017-
03-04%2016.33.27.png?dl=0
https://www.smartdraw.com/entity-relationship-diagram/
http://www.vertabelo.com/blog/technical-articles/data-warehouse-
modeling-star-schema-vs-snowflake-schema
https://www.dropbox.com/s/l680pyvr8gxn9ue/Screenshot%20201
7-03-04%2016.55.32.png?dl=0
https://www.dropbox.com/s/573602bdme4vex2/Screenshot%202
017-03-04%2016.55.57.png?dl=0
Dealing with Slowly Changing Dimensions (SCDs):
https://www.dropbox.com/s/xfvf0556idivn75/Screenshot%202017-03-
05%2011.59.38.png?dl=0
http://www.vertabelo.com/blog/technical-articles/inheritance-in-a-
relational-database
http://searchsqlserver.techtarget.com/feature/Supertype-and-subtype-
tables-in-SQL-Server
Conditional join on column value:
http://stackoverflow.com/questions/25255240/sql-conditional-join-
column
http://www.mysqldiary.com/conditional-joins-in-mysql/
basically, do left joins then combine (e.g., COALESCE) the results into 1 column to get rid of the nulls
Freeform text comments should arguably be in a dimension table
Modeling groups (e.g., FB users and groups of users):
Incremental loading:
https://dwbi.org/etl/etl/53-methods-of-incremental-loading-in-data-
warehouse
System Design & Distributed Systems:
HiredInTech:
https://www.hiredintech.com/classrooms/system-design/lesson/59#
A strong process is crucial to successfully solving system design questions.
We broke it down into four steps:
Scope the problem: Don't make assumptions; Ask questions;
Understand the constraints and use cases.
Sketch up an abstract design that illustrates the basic components of
the system and the relationships between them.
Think about the bottlenecks these components face when the system
scales.
Address these bottlenecks by using the fundamental principles of
scalable system design.
Make sure to start with asking questions, what constraints are we under?
Typical access patterns? How much data? How many requests? Etc.
(Database) Sharding:
Just partition data by hashing the primary key.
https://www.hiredintech.com/classrooms/system-design/lesson/60
http://highscalability.com/blog/2009/8/6/an-unorthodox-approach-to-
database-design-the-coming-of-the.html
http://www.addsimplicity.com/adding_simplicity_an_engi/2008/08/shard-
lessons.html
https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/
To fix the problem of hotspots (incoming data not evenly distributed), use a
good hash function. To fix the problem of node failures or of being able to
scale up (with naive hashing, would be forced to migrate tons of data /
nearly all data to new nodes), use consistent (ring) hashing:
https://www.paperplanes.de/2011/12/9/the-magic-of-consistent-
hashing.html
http://michaelnielsen.org/blog/consistent-hashing/
"But unlike naive hashing, consistent hashing requires only a
relatively small amount of data to be moved: if you add a
machine to the cluster, only the data that needs to live on that
machine is moved there; all the other data stays where it is."
https://akshatm.svbtle.com/consistent-hash-rings-theory-and-
implementation
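A toy consistent-hash ring sketch (illustrative only; the class and node names are mine, and real implementations tune virtual-node counts to smooth the key distribution):
import bisect
import hashlib

class HashRing(object):
    def __init__(self, nodes, replicas=100):
        self.ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in xrange(replicas):      # virtual nodes per server
                h = self._hash('%s:%d' % (node, i))
                bisect.insort(self.ring, (h, node))

    def _hash(self, key):
        return int(hashlib.md5(key).hexdigest(), 16)

    def get_node(self, key):
        # walk clockwise to the first virtual node at or after the key's hash
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h,))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(['node-a', 'node-b', 'node-c'])
print ring.get_node('user:42')   # removing a node only remaps keys near it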
Could also use "dynamic" sharding where there's a specialized metadata
locator service that points an incoming query to the right partition (basically
just key:value store of key range:partition). Think, e.g., HDFS' NameNode.
Easy to modify mappings, but single point of failure.
Partitioning is vertical; sharding is horizontal on some attribute of the data, e.g., users sharded into a-m and n-z datasets; federation splits the data completely on some obvious attribute, like putting all user data on one node and all purchases data on another.
CAP:
Consistency, Availability, Partition tolerance: under a network partition, a distributed system has to trade consistency against availability.
A dismal guide to concurrency:
https://www.facebook.com/notes/facebook-engineering/a-dismal-guide-to-
concurrency/379717628919/
"...share you only have to tell one or two people and trust that in time it will reach everyone who cares. Spreading gossip among computers is a bit more reliable because they are endlessly patient and (usually) don't garble messages."
Eventually Consistent — Amazon.com paper
http://queue.acm.org/detail.cfm?id=1466448
ACID vs BASE:
ACID = atomicity, consistency, isolation, durability (traditional transactional guarantees); BASE = basically available, soft state, eventual consistency (the looser stance of many distributed stores).
Things to notice:
Notice the magnitude differences in the performance of different
options.
Datacenters are far away so it takes a long time to send anything
between them.
Memory is fast and disks are slow.
By using a cheap compression algorithm, a lot of network bandwidth (a factor of 2) can be saved.
Writes are 40 times more expensive than reads.
Global shared data is expensive. This is a fundamental limitation of
distributed systems. The lock contention in shared heavily written
objects kills performance as transactions become serialized and slow.
Architect for scaling writes.
Optimize for low write contention.
Optimize wide. Make writes as parallel as you can.
https://gist.github.com/jboner/2841832
http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-
envelope-calculations-to-choo.html
https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
http://highscalability.com/blog/2013/1/15/more-numbers-every-awesome-
programmer-must-know.html
Designing Data-Intensive Applications notes:
logs (not narrow concept of application logs, but any immutable append-
only key:value sequence structure) are a crucial abstraction for distributed
systems:
https://engineering.linkedin.com/distributed-systems/log-what-every-
software-engineer-should-know-about-real-time-datas-unifying
LSM-trees (log-structured merge trees) and SSTables (sorted string tables)
are an important under-the-hood concept:
write to memtable (in-memory [ordered] balanced tree structure like a
red-black or AVL tree)
when full, write to disk as an SSTable file/segment, becomes most
recent segment of DB; while being written to disk, writes continue on
new memtable instance
an SSTable is just sorted key:value pairs, but with some offsets
stored in an index file. so when searching, can check against index
to narrow down block that I have to look through
to serve reads, first try to find key in memtable, then in most recent on-
disk SSTable segment, then next most recent. can use a bloom filter to
avoid having to check all segments for nonexistent key
to avoid losing recent writes to memtable in event of DB crash,
separate log on disk where every write is immediately appended
(append-only immutable log etc.); not in order bc only purpose is
to restore memtable. each time successfully written to disk,
discard it
in background, from time to time, run merging and compaction
process to combine segments and discard overwritten/deleted values
(think tombstone for deletions)
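A toy sketch of the write/flush/read path (illustrative only: a dict stands in for the balanced-tree memtable, and there's no WAL, compaction, or bloom filter):
import bisect

class ToyLSM(object):
    def __init__(self, memtable_limit=4):
        self.memtable = {}      # stand-in for the in-memory balanced tree
        self.segments = []      # each segment: sorted list of (key, value), newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: write sorted key:value pairs out as the newest segment
            self.segments.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # 1. check the memtable
            return self.memtable[key]
        for seg in self.segments:           # 2. then newest segment first
            i = bisect.bisect_left(seg, (key,))
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None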
Links:
General:
https://www.palantir.com/2011/10/how-to-ace-a-systems-design-
interview/
https://www.youtube.com/watch?v=-W9F__D3oY4
Focusing on scalability.
https://github.com/donnemartin/system-design-primer
Concurrency:
https://www.facebook.com/notes/facebook-engineering/a-dismal-
guide-to-concurrency/379717628919/
http://cs.lmu.edu/~ray/notes/introconcurrency/
Machine Learning & Stats:
Algorithms:
Meta:
cheatsheets:
https://workflowy.com/s/study-guide/RD5kZ682pWX5oxiE 103/112
4/25/2021 Study guide: - WorkFlowy
https://machinelearningmastery.com/a-tour-of-machine-learning-
algorithms/
sci-kit learn flowchart: http://scikit-
learn.org/stable/tutorial/machine_learning_map/index.html
SAS more readable / general:
https://blogs.sas.com/content/subconsciousmusings/2017/04/12/
machine-learning-algorithm-use/
multi-class classification:
just use one-vs-all (aka one-vs-rest)
train a model for each class where the negative labels are all the
other classes
regression:
linear vs logistic: linear is continuous, logistic is categorical
finding coefficients for standard linear equation: y = a + bx
for logistic, we use the logistic function sigmoid(z) = 1 / (1 + e^-z) (an s-shaped curve) to squash the output into a 0-1 value, then we can apply a rule to get a binary label, like: if > 0.5 then yes, else no
decision trees:
start with most important feature, split based on that, then continue
splitting train set on successive features (where each split has the same
value for that feature) until get to leaf nodes (which are the output,
e.g., the yes/no decision)
just like any binary tree
https://medium.com/@chiragsehra42/decision-trees-explained-easily-
28f23241248
prone to overfitting (high variance), so some ways to account for that
are:
bagging:
similar to bootstrapping, construct multiple tree models for
different samples of input, then for each new data point,
average all the predictions
random forests:
bagged trees where each split considers only a random subset of the features (instead of always selecting the single optimal split), which decorrelates the trees
boosting (think gradient boosted decision trees / GBDT):
combine weak learners
each successive learner attempts to correct for the errors of
the previous one
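A hedged scikit-learn sketch of all three ideas (assumes X_train and y_train already exist; parameters are illustrative, not tuned):
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)
forest = RandomForestClassifier(n_estimators=100)    # bagging + random feature subsets
gbdt = GradientBoostingClassifier(n_estimators=100)  # boosting: sequential error-correcting

for model in (bagging, forest, gbdt):
    model.fit(X_train, y_train)   # X_train, y_train assumed defined elsewhere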
Confusion matrix (actual vs. predicted):
https://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png
           Predicted
           P    N
Actual P:  TP   FN
Actual N:  FP   TN
recall or sensitivity = TP / (TP + FN)
a.k.a. the true positive rate
Style in the Long Tail: Discovering Unique Interests with Latent Variable
Models in Large Scale Social E-commerce (KDD 2014)
Evaluation:
Large-scale validation and analysis of interleaved search evaluation
Deep Learning
Learning Deep Architectures for AI
Wide & Deep Learning for Recommender Systems
Deep Neural Networks for YouTube Recommendations
General Machine Learning
Learning Classifiers from Only Positive and Unlabeled Data (KDD 2008,
useful for lookalike)
Behavioral:
Situation -> Task -> Action -> Result
Haseeb:
http://haseebq.com/how-to-break-into-tech-job-hunting-and-interviews/
...answers. If you're at a loss for stories, it may help to sit down and just brainstorm every relevant story you can remember (for example, every bug you remember working on), and then narrow it down from a larger list.
It’s hard to practice this effectively in a vacuum, so this is a good thing to
work with someone else on.
Interview Cake:
https://www.interviewcake.com/coding-interview-tips#chitchat
Time complexity of recursive algorithms that make multiple calls per call is often exponential; think of the fib algorithm: f(n) = f(n-1) + f(n-2).
This isn't just O(2n), since 2 calls are being made for each call, not 2 calls in total. So it's O(2^n) instead.
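E.g., the branching that causes the exponential blowup, plus a memoized version that collapses it to O(n) calls:
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)   # two recursive calls per call -> O(2^n)

cache = {}
def fib_memo(n):
    # each n is computed once, so O(n) total calls
    if n < 2:
        return n
    if n not in cache:
        cache[n] = fib_memo(n - 1) + fib_memo(n - 2)
    return cache[n]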
On a similar note, even if a function doesn't take any extra space with local
variables, if it makes a recursive call, that state has to be added to the call stack.
(The caller function "pauses" and the callee is pushed to the call stack. When the
callee is done executing, it's popped and control is returned to the caller.)
If the recursion is tail recursive (the last act of the function is the recursive call), however, then it can be tail-call optimized so that the last call acts as a GOTO and there's no space overhead, i.e., no need to push a frame to the call stack. (Note that CPython does not perform tail-call optimization, though many functional languages do.)
Space used for a recursive function is proportional to the maximum depth of its
recursion tree, in other words, the length of the longest path in the tree.
https://www.dropbox.com/s/x7emb4iy4f8vsz6/Screenshot%202017-03-11%2010.30.52.png?
dl=0
It's not the number of function calls since the call stack is popping function
stack frames as it finishes executing each.