I read this article Retiring a Great Interview Problem, the author came up with a work break
problem and gave three solutions. The efficient one uses memoization
algorithm and the author said its worst case time complexity is O(n^2)
since the key insight is that SegmentString is only called on suffixes of the original input string, and that there are only O(n) suffixes
.
However, I find it difficult for me to understand why it is O(n^2)
. Could someone please give me a hint or proof?
Work Break Problem:
Given an input string and a dictionary of words,
segment the input string into a space-separated
sequence of dictionary words if possible. For
example, if the input string is "applepie" and
dictionary contains a standard set of English words,
then we would return the string "apple pie" as output.
The memoization algorithm from Retiring a Great Interview Problem
Map<String, String> memoized;
String SegmentString(String input, Set<String> dict) {
if (dict.contains(input))
return input;
if (memoized.containsKey(input) {
return memoized.get(input);
}
int len = input.length();
for (int i = 1; i < len; i++) {
String prefix = input.substring(0, i);
if (dict.contains(prefix)) {
String suffix = input.substring(i, len);
String segSuffix = SegmentString(suffix, dict);
if (segSuffix != null) {
return prefix + " " + segSuffix;
}
}
}
memoized.put(input, null);
return null;
}
Intuition
(it is clear that the worst case is the one where there are no solutions, so I focus on that)
Since the recursive call is made before putting values in the memoization-cache, the last (shortest) suffixes will get cached first. This is because the function is first called with a string of length N, which then calls itself with a string of length N-1, which then .... , with a string of len 0, which is cached and returns, then the string of length 1 is cached and returns, ..., length N is cached and returns.
As the hint suggests, only suffixes get cached, and there are only N of them. This means that by the time the top-level function gets the result of its first recursive call (i.e. on a suffix of length N-1), the cache is already populated with the N-1 suffixes.
Now, assuming the N-1 last suffixes are already cached, the for-loop needs to make N-1 recursive calls, each taking O(1) (since the answer is already cached), totalling O(N). However, (pre-) building the cache of the last N-1 takes O(N^2) (explained below), totalling with O(N)+O(N^2) = O(N^2).
Proof by mathematical induction
This explanation can easily be translated to a formal proof using induction. Here is the gist of it:
(f(N)
being the number of operations required for the function to complete on an input of length N)
Induction hypothesis -- there exists a constant c
s.t. f(N) < c*N^2
.
The base case is trivial -- for a string of length 1, you can find a constant c
such that f(1) < c*N^2 = c
Induction step
Recap of the order things happen:
Step 1: the function first calls itself on the suffix of length N-1, building a cache containing the answer for the last N-1 suffixes
Step 2: the function then calls itself O(N) more times, each taking O(1) (thanks to this test: if (memoized.containsKey(input))
, and the fact that the cache has already been populated in Step 1).
So we get:
f(N+1) = f(N) + a*N < (by the hypothesis)
< c*N^2 + a*N < (for c large enough)
< c*N^2 + 2*c*N + c =
= c*(N+1)^2
Thus we got f(N+1) < c*(N+1)^2
, which completes the proof.