A word counting problem

Jun 16, 2013 16:31

A little programming problem from this blog post:
http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

The problem is:
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
A solution with shell scripting:
Here's the script that does that, with each command given its own line:

bash:
1: tr -cs A-Za-z '\n' |
2: tr A-Z a-z |
3: sort |
4: uniq -c |
5: sort -rn |
6: sed ${1}q
Explanation:
1) Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-s) multiple newlines.
2) Transliterate upper case to lower case.
3) Sort to bring identical words together.
4) Replace each run of duplicate words with a single representative and include a count (-c).
5) Sort in reverse (-r) numeric (-n) order.
6) Pass through a stream editor; quit (q) after printing the number of lines designated by the script's first parameter (${1}).
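
For example, assuming the six commands are saved in a script named wordfreq.sh (the name is made up here), the ten most frequent words of test_data.txt are printed with:

sh wordfreq.sh 10 < test_data.txt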

Even if data-filtering programs for these exact purposes were not at hand, these operations would well be implemented separately: for separation of concerns, for easier development, for piecewise debugging, and for potential reuse. The simple pipeline given above will suffice to get answers right now, not next week or next month. It could well be enough to finish the job.

Some comments on that blog post:
Andre is quite right that this is an example of the elegance of FP vs. imperative -- but the flip side is that FP often fails when it runs up against real-world edge cases. Knuth's version may not handle contractions -- but it could, easily. Could McIlroy's? What if I want British/US spellings (such as "color" and "colour") to be counted as one word? What about hyphenated words?

20+ years later, we still don't have sed or awk available as a library, AFAIK. We also don't have a tr-like function that works on streams, or a sort for text files, in standard libraries. That is why you simply can't use this approach when you write something that isn't completely solvable via shell scripting: it doesn't get compiled in, and your team leader/senior programmer/whatever will definitely not allow you to call shell commands from C, C++ or Pascal, even when it pays off.
This is a modern solution (from the blog post http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ , with several small changes; it can probably be improved further):

import System.Environment (getArgs)
import Data.List (sortBy)
import Data.Char (toLower, isLetter)
import qualified Data.HashMap.Strict as HM

createReport :: Int -> String -> String
createReport n text =
    unlines $
    map (\(count, w) -> w ++ " " ++ show count) $
    take n $
    sortBy (flip compare) $
    map (\(w, count) -> (count, w)) $
    HM.toList $
    HM.fromListWith (+) $
    map (\w -> (w, 1)) $
    words $
    map toLower $
    map (\c -> if isLetter c then c else '\n') $
    text

main = do
    [fileName, nstr] <- getArgs
    let n = read nstr
    text <- readFile fileName
    putStr $ createReport n text
Usage: runhaskell wfreqh.hs test_data.txt 10

A comment from that blog post:
The shell script operates on raw text and everything is just strings being parsed and reparsed by the respective Unix utility programs.

The Haskell program is statically typed. It is type-checked by the compiler, which generates native code. The program uses standard libraries and data types, such as lists and hash maps.

Also, the Haskell program could be refined, extended, and optimized in various ways. The most important optimizations I can think of off the top of my head:
- Using a better representation of strings than the default built-in "string as list of characters". Easily accessible advice can be found on Stack Overflow and by browsing the Haskell library documentation, such as that of the text package.
- Loop fusion and deforestation can be applied to deal with the apparent allocation of lots of new lists in the pipeline. One of the selling points of a language like Haskell is the opportunity for the compiler to perform radical optimizations that are impossible for languages with side effects.

I don't write many bash scripts these days. General-purpose programming languages can do a decent job of munging data. The situation was different decades ago, when there was C and few standard high-level libraries in the C world.
This is a D language version (it requires no installation of libraries, unlike the unordered-containers package from Cabal for the Haskell program). It works with Unicode text; if you only have ASCII text the code becomes simpler. The Haskell version above seems to have problems with Unicode text (on the other hand, don't expect the D code to perform a fully correct handling of Unicode sorting: that is a complex problem).

import std.stdio, std.conv, std.file, std.string,
       std.algorithm, std.range;

void main(string[] args) {
    args[1]
    .readText
    .toLower
    .tr("A-Za-z", "\n", "cs")
    .split
    .sort()
    .group
    .array
    .schwartzSort!q{ -a[1] }
    .take(args[2].to!uint)
    .map!q{ text(a[0], " ", a[1]) }
    .join("\n")
    .writeln;
}
D version: 17 nonblank lines, Haskell: 23 lines.
(The cast is needed because of a small D bug that will probably be fixed. The sort function is written with the parentheses, sort(), to avoid calling the built-in array sort, which will be deprecated.)

File sizes:
9,005,142  test_data.txt
      365  wfreqd.d
      701  wfreqh.hs
1,565,710  wfreqh.exe (stripped)
  510,492  wfreqd.exe
Haskell code compiled with:
ghc -O3 wfreqh.hs
Glasgow Haskell Compiler, Version 7.6.1, stage 2 booted by GHC version 7.4.2

D code compiled with:
dmd -O -release -inline -noboundscheck wfreqd.d
DMD Version 2.064alpha.

Runtime, best of 3 (top 100 words):
wfreqh: 7.45 seconds
wfreqd: 6.24 seconds
Tests performed on 32-bit Windows, on an English text file of about 9 MB.

In theory the shell program is fully lazy (even the sort can be done with an external sorting algorithm), while the D version (and probably the Haskell one) loads the whole file into memory.
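
For what it's worth, the D version could be made lazier on the input side. Here is a minimal sketch (not the benchmarked program; the structure and names are my own) that streams the file with File.byLine instead of loading it whole with readText; the word table itself still lives in memory:

import std.array, std.stdio, std.conv, std.string,
       std.algorithm, std.range;

void main(string[] args) {
    size_t[string] counts;
    // Read one line at a time instead of the whole file at once.
    foreach (line; File(args[1]).byLine) {
        foreach (word; line.toLower.tr("a-z", "\n", "cs").splitter('\n'))
            if (!word.empty)
                counts[word.idup]++; // idup: byLine reuses its buffer
    }
    // Top-n selection as in the main version.
    counts.byKey.zip(counts.byValue)
          .array
          .sort!q{ -a[1] < -b[1] }
          .take(args[2].to!uint)
          .map!q{ text(a[0], " ", a[1]) }
          .join("\n")
          .writeln;
}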

The D and Haskell programs are strongly, statically typed, so they are able to catch several programming errors at compile time.

The shell program works on text, so it has to convert numbers to text and then back from text to numbers, while the D and Haskell programs use lazy sequences of arbitrary types, which can be more powerful and save the time of textual serialization and deserialization. And unless floating-point values are serialized as hex floats, such textual conversions can also introduce some error.
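
A tiny illustration of that last point (the value and the "%.6f" precision are arbitrary, chosen just to show the effect): a double printed with a limited number of decimal digits doesn't always parse back to the same value, while the hex-float form does:

import std.stdio, std.conv, std.string;

void main() {
    immutable double x = 1.0 / 3.0;
    auto dec = format("%.6f", x); // "0.333333": digits were dropped
    writeln(to!double(dec) == x); // false
    auto hex = format("%a", x);   // hex float keeps every mantissa bit
    writeln(to!double(hex) == x); // true
}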

If you have to perform some operation not provided by the Unix tools, you can write a small program (in C or another language) that reads and writes text through pipes, and call it like the other shell commands. In Haskell or D you just write a function (in D often a lazy Range, implemented as a struct) and call it; this is simpler and more integrated, but it's less modular.
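
For example, here is a hedged sketch of such a custom stage in D (the function name and the tiny spelling table are made up for the illustration): a function that takes a lazy range of words and yields them with some British spellings folded onto the US ones, usable in a UFCS chain like any std.range stage. It is also one answer to the earlier comment asking for "colour" and "color" to be counted as one word:

import std.stdio, std.algorithm, std.range;

auto normalizeSpelling(R)(R words) if (isInputRange!R) {
    // A tiny, made-up spelling table; a real one would be loaded from a file.
    string[string] toUS = ["colour": "color", "flavour": "flavor"];
    return words.map!(w => toUS.get(w, w));
}

void main() {
    ["the", "colour", "and", "the", "flavour"]
        .normalizeSpelling
        .writeln; // ["the", "color", "and", "the", "flavor"]
}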

Update Jun 21 2013: wiz offers this Haskell version (gist.github.com/wiz/5826131):

{-# LANGUAGE TupleSections #-}

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import Data.Char (isLetter)
import Data.List (sortBy)
import Data.Function (on)
import qualified Data.HashMap.Strict as HM
import System.Environment (getArgs)

collect = sortBy (flip compare `on` snd)
        . HM.toList
        . HM.fromListWith (+)
        . map (,1)
        . filter (not . TL.null)
        . TL.split (not . isLetter)
        . TL.toLower

display (word, count) = do
    TL.putStr word
    putStr ": "
    print count

main = do
    [fname, count] <- getArgs
    text <- TL.readFile fname
    mapM_ display $ take (read count) (collect text)
This Haskell version is slower than the preceding one; it takes about 10.8 seconds to run. The D code compiled with the ldc2 compiler (which optimizes better than dmd) takes 4.5 seconds to run (if you use ldc2 2.063, then use .sort!q{ -a[1] < -b[1] } instead of schwartzSort, which was improved later in Phobos).

There are many ways to improve the performance of the D code, such as using a hash counter, which is missing in Phobos (runtime with ldc2 is about 3.17 seconds):

import std.stdio, std.conv, std.file, std.string,
       std.algorithm, std.range, std.traits;

auto hashCounter(R)(R items) if (isForwardRange!R) {
    size_t[ForeachType!R] result;
    foreach (x; items)
        result[x]++;
    return result.byKey.zip(result.byValue);
}

void main(string[] args) {
    args[1]
    .readText
    .toLower
    .tr("A-Za-z", "\n", "cs")
    .split
    .hashCounter
    .array
    .sort!q{ -a[1] < -b[1] }
    .take(args[2].to!uint)
    .map!q{ text(a[0], " ", a[1]) }
    .join("\n")
    .writeln;
}