I came across the following programming interview problem:
Challenge 1: N-grams
An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length Your task is to
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
Note that your function will receive the following arguments:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
Data constraints
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
Efficiency constraints
• your function is expected to print the result in less than 2 seconds
Example Input text: “aaaab a0a baaab c”
Output aaa ngramLength: 3
Explanation
For the input presented above the 3-grams sorted by frequency are:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
If I have only one hour to solve the problem and I chose to use the C language to solve it: is it a good idea to implement a Hash Table to count the frequency of the N-grams with that amount of time? because in the C library there is no implementation of a Hash Table...
If yes, I was thinking to implement a Hash Table using separate chaining with ordered linked lists. Those implementations reduce the time that you have to solve the problem....
Is that the fastest option possible?
Thank you!!!
If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of n-grams in the string, use qsort
to sort the pointers according to the n-gram that they are part of, and then loop over that sorted array and figure out counts.
This should execute fast enough, and there is no need to code any fancy data structures.