My girlfriend got this question in an interview, and I liked it so much I thought I'd share it... Write an algorithm that receives a dictionary (Array of words). The array is sorted lexicographically, but the abc order can be anything. For example, it could be z, y, x, .., c, b, a. Or it could be completely messed up: d, g, w, y, ... It doesn't even need to include all the abc letters, and finally it doesn't have to be letters at all. It could be any symbols that form a string. For example it could be made up of 5, α, !, @, θ... You get the idea. It's up to your algorithm to discover what the letters are (easy part).
The algorithm should return the correct lexicographic order of the symbols.
Notes/things to consider:
1. For a given dictionary, can you always discover the full order of all the letters? Consider a dictionary that only has 1 word, with more than 1 symbol...
2. You CANNOT assume that the dictionary is without error. The algorithm should determine if the dictionary contains contradictions and output that there is an error.
3. HINT: Think of a good data structure to represent the relations you discover between symbols. This should make the problem much easier.
I'll post my solution probably tomorrow. By no means do I claim that it's the most efficient one. I want to see other people's thoughts first. Hope you enjoy the question.
P.S. I think the best format to post solutions is with pseudo code, but I leave this to your consideration.
This is topological sorting on a directed acyclic graph. You first need to build the graph: the vertices are the letters, and there's a directed edge between two letters whenever the dictionary shows that one is lexicographically less than the other. The topological order then gives you the answer.
A contradiction is when the directed graph is not acyclic. Uniqueness is determined by whether or not a Hamiltonian path exists, which is testable in polynomial time.
You do this by comparing each two consecutive "words" from the dictionary. Let's say you have these two words appearing one after another:
x156@
x1$#2z
Then you find the longest common prefix, x1 in this case, and check the characters immediately following this prefix. In this case, we have '5' and '$'. Since the words appear in this order in the dictionary, we can determine that '5' must be lexicographically smaller than '$'.
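The comparison step above can be sketched in Python like this. It's only an illustration: the function name first_difference is made up for this example, and it assumes each word is a plain string where every character is one symbol.

```python
from typing import Optional, Tuple


def first_difference(a: str, b: str) -> Optional[Tuple[str, str]]:
    """Compare two consecutive dictionary words.

    Returns (x, y), meaning symbol x sorts before symbol y, taken from the
    first position where the words differ. Returns None when one word is a
    prefix of the other, since no symbols can be compared in that case.
    """
    for x, y in zip(a, b):
        if x != y:
            return (x, y)
    return None


print(first_difference("x156@", "x1$#2z"))
```

For the two words above this yields the pair ('5', '$'), i.e. '5' comes before '$'.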
Similarly, given the following words (appearing one after another in the dictionary):
jhdsgf
19846
19846adlk
We can tell that 'j' < '1' from the first pair (where the longest common prefix is the empty string). The second pair doesn't tell us anything useful (since one is a prefix of the other, there are no characters to compare).
Now suppose later we see the following:
oi1019823
oij(*#@&$
Then we've found a contradiction, because this pair says that '1' < 'j'.
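Here's a rough sketch of collecting all the relations in one pass over the dictionary. The name build_relations and the choice of a set of (smaller, larger) pairs are my own assumptions for illustration; any graph representation would do. Note that it only gathers the relations. A contradiction in general is a cycle, which the topological sort has to detect.

```python
def build_relations(words):
    """Scan consecutive word pairs and collect the discovered relations.

    Returns (symbols, less_than), where symbols is the set of all symbols
    seen and less_than is a set of (x, y) pairs meaning x sorts before y.
    """
    symbols = set()
    for word in words:
        symbols.update(word)

    less_than = set()
    for a, b in zip(words, words[1:]):
        # The first differing position is the only one that carries information.
        for x, y in zip(a, b):
            if x != y:
                less_than.add((x, y))
                break
    return symbols, less_than


words = ["jhdsgf", "19846", "19846adlk", "oi1019823", "oij(*#@&$"]
symbols, less_than = build_relations(words)
# Both directions present means the dictionary contradicts itself.
print(("j", "1") in less_than and ("1", "j") in less_than)
```

With the example words from above, both ('j', '1') and ('1', 'j') end up in the set, which is the two-symbol cycle described in the text.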
There are two traditional ways to do topological sorting. The algorithmically simpler one is the depth-first search approach, where there's an edge from x to y if y < x.
The pseudocode of the algorithm is given in Wikipedia:
L ← Empty list that will contain the sorted nodes
S ← Set of all nodes with no incoming edges
function visit(node n)
    if n has not been visited yet then
        mark n as visited
        for each node m with an edge from n to m do
            visit(m)
        add n to L
for each node n in S do
    visit(n)
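Upon conclusion of the above algorithm, the list L would contain the vertices in topological order. Since edges point from larger symbols to smaller ones, a symbol is added to L only after every smaller symbol reachable from it, so L lists the alphabet from smallest to largest.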
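A possible Python version of the depth-first approach is below. It is a sketch rather than a direct transcription of the pseudocode: it starts a visit from every symbol instead of only from S, and it adds an "in progress" mark so that a cycle (a contradiction) is reported instead of silently ignored. The function name alphabet_order and the (smaller, larger) pair representation are assumptions made for this example.

```python
def alphabet_order(symbols, less_than):
    """Return the symbols from smallest to largest.

    less_than is a set of (x, y) pairs meaning x sorts before y.
    Raises ValueError if the relations contain a cycle.
    """
    # Edges go from the larger symbol to the smaller one, as in the text.
    smaller = {s: [] for s in symbols}
    for x, y in less_than:
        smaller[y].append(x)

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {s: UNSEEN for s in symbols}
    order = []

    def visit(n):
        if state[n] == IN_PROGRESS:
            # We came back to a symbol that is still on the DFS stack.
            raise ValueError("dictionary contains a contradiction")
        if state[n] == DONE:
            return
        state[n] = IN_PROGRESS
        for m in smaller[n]:
            visit(m)
        state[n] = DONE
        order.append(n)  # every smaller symbol reachable from n is already listed

    for s in sorted(symbols):  # sorted only to make the traversal deterministic
        visit(s)
    return order


print(alphabet_order({"a", "b", "c"}, {("c", "b"), ("b", "a")}))
```

The recursion is fine for alphabet-sized graphs; for a very large symbol set an explicit stack would be safer.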
The following is a quote from Wikipedia:
If a topological sort has the property that all pairs of consecutive vertices in the sorted order are connected by edges, then these edges form a directed Hamiltonian path in the DAG. If a Hamiltonian path exists, the topological sort order is unique; no other order respects the edges of the path. Conversely, if a topological sort does not form a Hamiltonian path, the DAG will have two or more valid topological orderings, for in this case it is always possible to form a second valid ordering by swapping two consecutive vertices that are not connected by an edge to each other. Therefore, it is possible to test in polynomial time whether a unique ordering exists, and whether a Hamiltonian path exists.
Thus, to check whether the order is unique, you simply check whether every two consecutive vertices in L (from the above algorithm) are connected by a direct edge. If they are, then the order is unique.
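The uniqueness test is short enough to sketch directly. This assumes the order lists symbols from smallest to largest and that the relations are stored as a set of (smaller, larger) pairs; both are choices made for this example, and the name is_unique_order is hypothetical.

```python
def is_unique_order(order, less_than):
    """Check whether every pair of neighbours in order is directly related.

    order lists the symbols from smallest to largest; less_than is a set of
    (x, y) pairs meaning x sorts before y. If every neighbouring pair is in
    the set, the edges form a Hamiltonian path and the order is unique.
    """
    return all((a, b) in less_than for a, b in zip(order, order[1:]))


# 'c' < 'b' < 'a' is fully determined by the two relations.
print(is_unique_order(["c", "b", "a"], {("c", "b"), ("b", "a")}))
# Here 'b' and 'a' are never compared, so they could be swapped.
print(is_unique_order(["c", "b", "a"], {("c", "b"), ("c", "a")}))
```

The first call gives True and the second False. Set membership plays the role of the constant-time edge test here.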
Once the graph is built, topological sort is O(|V|+|E|). The uniqueness check is O(|V| edgeTest), where edgeTest is the complexity of testing whether two vertices are connected by an edge. With an adjacency matrix, this is O(1).
Building the graph requires only a single linear scan of the dictionary. If there are W words, then it's O(W cmp), where cmp is the complexity of comparing two words. You always compare two consecutive words, so you can do all sorts of optimizations if necessary, but otherwise a naive comparison is O(L), where L is the length of the words.
You may also short-circuit reading the dictionary once you've determined that you have enough information about the alphabet, etc., but even a naive building step would take O(WL), which is the size of the dictionary.