Build an ASCII chart of the most commonly used words in a given text.
The rules:
- Only a-z and A-Z (alphabetic characters) count as part of a word.
- Ignore case (She == she for our purpose).
- Ignore the following words: the, and, of, to, a, i, it, in, or, is
- Clarification: considering don't: this would be taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).
- Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).
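As an illustration of the word rules above, here is a minimal, un-golfed Python sketch. The names IGNORED and word_counts are hypothetical and not part of the challenge; the sketch keeps single-letter words, i.e. it does not use the optional relaxation.

```python
import re
from collections import Counter

# Ignore list taken from the rules above.
IGNORED = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def word_counts(text):
    """Count words made only of a-z/A-Z, case-insensitively (hypothetical helper)."""
    # Lowercase first so "She" and "she" are the same word, then take
    # maximal runs of letters; "don't" therefore yields "don" and "t".
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in IGNORED)
```

For example, word_counts("She said: don't! The she") counts she twice and said, don and t once each, while the is dropped.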
Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:
- The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
- bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possible differing bar and word lengths: e.g. the second most common word could be a lot longer than the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).

An example:
The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).
This specific text would yield the following chart:
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
For your information: these are the frequencies the above chart is built upon:
[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]
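The scaling constraint can be sketched in Python as follows. This is only one possible reading of the spec: it assumes that a bar's width includes its two | characters, and that widths are rounded down (the spec does not state a rounding mode). The name bar_widths and the max_line parameter are hypothetical. Exact fractions are used so that the limiting word does not lose a column to floating-point error.

```python
from fractions import Fraction

def bar_widths(freqs, max_line=80):
    """Scale (word, count) pairs to bar widths (hypothetical helper).

    Assumption: the width counts the two '|' characters of the bar.
    Each line is bar + space + word + space, so a bar may be at most
    max_line - 2 - len(word) characters wide.
    """
    # The largest common scale is set by whichever word hits its limit first.
    scale = min(Fraction(max_line - 2 - len(word), count) for word, count in freqs)
    # Round down (assumption) so no line can exceed max_line.
    return [(word, int(count * scale)) for word, count in freqs]
```

With [('she', 553), ('superlongstringstring', 481)] the long word is the limiting one: it gets a 57-character bar, which fills the line exactly, and she is scaled down to 65 even though it would otherwise fit a 75-character bar.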
A second example (to check if you implemented the complete spec): Replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:
 ________________________________________________________________
|________________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|______________________________| as
|___________________________| her
|_________________________| with
|_________________________| at
|________________________| s
|________________________| t
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|_________________| be
|_________________| not
|________________| they
|________________| so
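The drawing step used in both charts (a line of underscores closing the first bar on top, then one |____| bar per word) could look roughly like this in Python. The name render_chart and its input shape, a list of (word, width) pairs where width includes the two | characters, are assumptions made for this sketch; trailing spaces are omitted.

```python
def render_chart(bars):
    """Draw bars given as (word, width) pairs, widest first (hypothetical helper).

    Assumption: width includes the two '|' characters, so the number of
    underscores inside each bar is width - 2.
    """
    if not bars:
        return ""
    inner = [max(width - 2, 0) for _, width in bars]
    # The top line closes the first bar; it is indented by one column
    # so that it sits above the underscores rather than above the '|'.
    lines = [" " + "_" * inner[0]]
    for (word, _), n in zip(bars, inner):
        lines.append("|" + "_" * n + "| " + word)
    return "\n".join(lines)
```

For instance, print(render_chart([("she", 6), ("so", 4)])) draws a top line of four underscores, then a four-underscore bar labelled she and a two-underscore bar labelled so.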
The winner:
Shortest solution (by character count, per language). Have fun!
Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):
Language            Relaxed  Strict
=========           =======  ======
GolfScript            130     143
Perl                          185
Windows PowerShell    148     199
Mathematica                   199
Ruby                  185     205
Unix Toolchain        194     228
Python                183     243
Clojure                       282
Scala                         311
Haskell                       333
Awk                           336
R                     298
Javascript            304     354
Groovy                321
Matlab                        404
C#                            422
Smalltalk             386
PHP                   450
F#                            452
TSQL                  483     507
The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.
Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use a traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).