How much speed are you willing to give up for pretty code?
After experimenting a bit with Haskell and using it for some toy projects, I decided to see if there was any chance it could be used for a non-toy task; i.e., one where it might one day be necessary to actually have fast code. As a proxy for this, I decided to try a very simple task: counting the number of lines in a file. As input, I generated a file with a million lines of random data and an average of 40 bytes per line; all the Haskell programs were compiled with the current GHC in Debian unstable and with optimizations enabled.
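Generating such input is straightforward; here's a minimal sketch of one way to do it in Haskell (this assumes the random package, and isn't necessarily the generator actually used -- any source of a million ~40-byte lines will do):
import Control.Monad (replicateM, replicateM_)
import System.Random (randomRIO)  -- from the random package

-- Write a million lines of random lowercase letters, 20 to 60
-- characters each, for an average of roughly 40 bytes per line.
main :: IO ()
main = replicateM_ 1000000 $ do
  len  <- randomRIO (20, 60)
  line <- replicateM len (randomRIO ('a', 'z'))
  putStrLn line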
The baseline is, of course, our friend "wc":
real 0m0.084s
user 0m0.043s
sys 0m0.040s
(this and all other timing data was taken after "cat /tmp/testwc.txt > /dev/null" to ensure that the file was cached) Now, it's probably too much to hope that a Haskell program will run as fast as a C program does, since the C program can run in a tight loop. But it seems reasonable that it should be possible for Haskell to be competitive with the trivial Python implementation of this program -- after all, Python is interpreted and Haskell is compiled.
import sys
n = 0
for line in sys.stdin:
    n += 1
print n
Running this simple Python program under "time" produces the following statistics, a bit less than 10 times slower than the C version:
real 0m0.698s
user 0m0.643s
sys 0m0.050s
The trivial Haskell program to count the lines of its input is:
main = do c <- getContents ; print $ length $ lines c
Note that this is basically identical to the Python program, but expressed in a functional style. Pretty, but how does it stack up?
real 0m6.744s
user 0m6.426s
sys 0m0.213s
OW! Ten times slower than the Python code, and a good 80 times slower than "wc"! The gap can be lowered to "only" 50 times with the following program --
import Data.List (elemIndices)
main = do c <- getContents; print $ length $ elemIndices '\n' c
-- but surely it's possible to do better. I'd expect that the expense here is in the use of lazy streams for IO; eliminating them gives us the following code:
wc :: Handle -> IO Int
wc h = do a <- mallocArray bufSize :: IO (Ptr Word8)
          loop a 0
    where loop a n = do amt <- hGetBuf h a bufSize
                        elts <- peekArray amt a
                        (if amt == 0
                            then return n
                            else loop a $! n + nlCount elts)
At this point, we're basically using Haskell as a (bad) imperative language. This function allocates a buffer, repeatedly reads into it, and counts the number of newlines in each chunk (I dropped some auxiliary code to keep the post short). The $! is important; without it, even though this routine is tail-recursive, Haskell will still allocate an unbounded amount of memory to store the return value (because, of course, it's terribly important to remember that you're actually returning "1+2+3", not "6").
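For completeness, the dropped auxiliary code might look something like this -- the buffer size and the exact definition of nlCount are illustrative guesses on my part, not necessarily what the original program used:
import Foreign.Marshal.Array (mallocArray, peekArray)
import Foreign.Ptr (Ptr)
import Data.Word (Word8)
import System.IO (Handle, hGetBuf, stdin)

-- An arbitrary buffer size; any reasonable power of two will do.
bufSize :: Int
bufSize = 4096

-- Count the newline bytes (10 is ASCII '\n') in the chunk just read.
nlCount :: [Word8] -> Int
nlCount = length . filter (== 10)

main :: IO ()
main = wc stdin >>= print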
So, how does this unreadable, hand-optimized version of the code perform?
real 0m1.980s
user 0m1.757s
sys 0m0.094s
That's right -- I've eliminated most uses of higher-level functional techniques, and the Haskell version is still nearly three times slower than an ultra-high-level interpreted language, while also being uglier and harder to read! Ouch. It appears the old performance albatross of functional languages is not quite dead yet.
12 Comments:
How does Python do for better code?
import sys
print len(sys.stdin)
My understanding of Haskell is almost zero, but from what I gather the three programs do different things: the C and Python programs read the file line by line, while the Haskell programs (all three of them, as far as I can tell) read the whole file in one go and then count the number of newlines in it.
This is a radically different thing from reading a file line by line: not only are the memory requirements different, but the kernel also has less information about how to prefetch the next few lines, nor can it easily see that the old lines can be swapped out.
I'm willing to bet that writing a Haskell program that reads line by line will be significantly faster.
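For what it's worth, a line-by-line counter in Haskell could be as simple as this (a sketch, not from the original post):
import System.IO (isEOF)

-- Read a line at a time, bumping the count; the $! keeps the
-- count evaluated so no chain of thunks builds up.
countLines :: Int -> IO Int
countLines n = do eof <- isEOF
                  if eof
                     then return n
                     else getLine >> (countLines $! n + 1)

main :: IO ()
main = countLines 0 >>= print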
Following up, in Common Lisp I get:
Base wc result:
real 0m0.018s
user 0m0.004s
sys 0m0.012s
Reading the file in one swoop, counting newlines:
Evaluation took:
2.89 seconds of real time
2.360147 seconds of user run time
0.152009 seconds of system run time
43 page faults and
122,947,360 bytes consed.
Reading line by line:
Evaluation took:
0.35 seconds of real time
0.260016 seconds of user run time
0.072004 seconds of system run time
0 page faults and
35,717,576 bytes consed.
The last function can be improved upon by using a static buffer for the lines, thus removing a lot of the consing. Oh, and by using a newer version of the Lisp that has faster unicode processing.
Sorry, that should be
import sys
print len(list(sys.stdin))
Or possibly,
import sys
print len(sys.stdin.readlines())
I'm not actually sure which will be faster for that size of file.
Out of interest, how well does this do:
perl -le '$i++ while <>; print $i'
The real problem is in the representation of strings: String is really a linked list of Char. Try the following, which uses the FastPackedString library:
module Main where
import IO
import qualified Data.FastPackedString as FPS
main :: IO ()
main = do fs <- FPS.hGetContents stdin
          putStrLn $ show $ length (FPS.lines fs)
This isn't really the kind of problem Haskell was designed for, of course, but it's nice to know that there are escapes if you really need speed.
This has been discussed on the haskell-cafe mailling-list: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/3907
I did some tests myself. My best Haskell program, though unidiomatic, was essentially as fast as wc -l.
I just tried out the FastPackedString version suggested by ecb. Its performance sucked worse than that of the trivial list-based one.
I'm surprised... I tried the naive lazy list version and the FPS version, along with wc. The FPS version gives (for a one million line file):
real 0m1.257s
user 0m0.810s
sys 0m0.418s
wc gives:
real 0m0.844s
user 0m0.451s
sys 0m0.275s
I didn't run the lazy list version for long enough to get results...
pvaneynd, the "getContents" call generates a lazy stream which ought to just be read as it is consumed, modulo buffering; so no, it doesn't read the whole file at once. I wasn't too surprised that this performed poorly, although I was surprised at just how dreadful it was (I would have naively guessed more like 10 times the C version).
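A minimal way to see this for yourself:
-- If getContents read everything up front, this could never print
-- anything when fed an endless stream (e.g. "yes | ./demo"); in
-- fact it prints the first line and exits immediately.
main :: IO ()
main = do c <- getContents
          putStrLn (head (lines c))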
The third version is an explicit, imperative, tail-recursive loop like you'd do in Lisp. An interesting note about it is that performance actually decreases as the buffer size increases, to the point that it's worse than the purely functional version! I'm still trying to work that one out.
I apologize for shutting comments on this post down, but it seems to have become a bit of a spam magnet. Email me if you have comments and I'll attach them here.