C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

© 2006-2015 Andrew Cooke (site) / post authors (content).

Optimising Clojure

From: andrew cooke <andrew@...>

Date: Wed, 31 Aug 2011 08:54:22 -0300

I had some Clojure code that needed optimising.  Eventually I achieved a 5x
speedup.  This is how.

Finding Something to Measure
----------------------------

First, I needed something to measure, so that I could tell how effective my
changes were.  This was complicated by three things: the need to generate test
data to process; the delayed action of the JIT; garbage collection.

Generating test data is quite expensive, so I needed to carefully tune
parameters so that my processing used more time than the test data generation
(not necessary for the simple timings, but important for profiling, since I
could find no simple way to delay profiling to start after the data was
ready).  Then I placed the processing in a loop, repeating it multiple times
(re-using the same test data).  Each pass through the loop was timed - the
successive times let me see when the JIT engaged (typically the first
calculation was 50% slower) and also reduce the effect of GC by selecting the
lowest value.

Profiling
---------

Once I had a good test running in my IDE, I needed to make standalone Java
classes to profile.  The simplest way to do this seems to be to use a build
tool called leiningen - https://github.com/technomancy/leiningen - which was
very easy to use.  You just download and run it, create an empty project, and
modify the project descriptor file (which I then copied into my existing
project).

My file contents are:

(defproject compressive "0.0.0"
:description "set-related experiments"
:aot [#"com.isti.compset.*"]
:main com.isti.compset.stack
:dependencies [[org.clojure/clojure "1.3.0-beta1"]
[org.clojure/clojure-contrib "1.2.0"]])

where:

- aot indicates which classes need compiling
- main indicates the location of the "main" file

I also added the (-main) method and put (:gen-class) in my ns at the top of
the file.

With all that, "lein compile" generated a set of classes that could be run
with java:

java -cp classes:lib/clojure-1.3.0-beta1.jar \
-agentlib:hprof=cpu=samples,depth=10,file=hprof com.isti.compset.stack

giving a profile in the file "hprof".

This profile was dominated by expected methods related to generating
and consuming sequences, and adding and multiplying generic numbers.

Vector of Doubles
-----------------

The central loop in my code sums various spectra and the square of their
values.  The spectra were stored as "vec" instances.  I hope to remove the
use of generic numbers by replacing this with a vector of doubles and, if
necessary, adding some type annotations to various functions.

So I replaced "vec" with

(defn d-vec [collection]
(apply conj (vector-of :double) collection))

However, this did not have the desired effect - the code was
significantly slower and the trace was dominated by calls to "Vec.count".

I asked for help on stack overflow http://stackoverflow.com/q/7223297/181772
but didn't get any very useful response, so I removed these changes.

Array of Doubles
----------------

At this point I was a little disheartened.  The traces didn't have much useful
information (that I could see, even using a nice little tool called PerfAnal
http://java.sun.com/developer/technicalArticles/Programming/perfanal/).

So I looked at my inner loop and decided that I would simply rewrite it as I
thought it should be written, relying on experience rather than profiling.  I

- replace the vector in which I was accumulating results with a native array
of doubles (using double-array)

- mutate the accumulator rather than creating a new instance each loop

- use additional indices to identify the "active range" of the accumulator
(which may get smaller over time, due to restricted data ranges in
spectra).    This had not been necessary when I was generating new
immutable instances - the instances simply decreased in size.

- change the spectra from vectors to arrays.

- replace some maps and reduces with explicit loops that mutate data.

This was, initially a complete disaster.  Running time increased by roughly a
hundredfold.

Type Annotations
----------------

Rather than reverting these changes, which seemed justified from experience, I
looked again at profiling.  The profile was now dominated by methods related
to java introspection.

This is a known issue with Clojure and has a simple solution - add

(set! *warn-on-reflection* true)

and remove all warnings by adding type hints.  I did not enable this for all
code - only for the central loop and related functions.  And I used the
"^doubles" hint which indicates a native array of doubles.

This, finally, gave a significant speedup - about 5x faster than the initial
code.

Conclusions
-----------

- Clojure code is simplest, and reasonably efficient, when you stay within
Clojure's own types (vec, seq etc).

- Profiling with hprof can give broad information, but I, at least, couldn't
understand it in detail (I normally can read profiling data!) - there
seemed to be information missing, or I simply do not understand how Clojure
works.

- It was fairly simple to change the code to use typed, mutable data in the
inner loop, which gave significant speedups.

- Dynamic interop with Java is apallingly slow, but easy to fix once you see
that it is happening.

- More generally, Clojure works fairly well for exploratory data processing.
It's nicely high-level and easy to work with, and can be optimised where
necessary.  I have also implemented this particular algorithm in C and on
GPUs, and I am sure it is faster in both cases, but (without doing any
formal measurements) the code feels close enough to C for me to experiment
with new ideas quickly.  I doubt this would have been possible in Python,
for example (even with numpy - the inner loop is annoyingly hard to
vectorize)

More on (vector-of :double)

From: andrew cooke <andrew@...>

Date: Wed, 31 Aug 2011 22:50:48 -0300

Justin Kramer at http://stackoverflow.com/q/7223297/181772 sugegsted using

(into (vector-of :double) collection)

to create vectors of doubles.  This seemed like a good idea for the immutable
spectra (it's probably not clear above but I introduced arrays of soubles in
two places - the mutable accumulator used to sum spectra, and the indvidual
spectra themselves).  However, when I tried it, I found that there was no way
to add type hints for these vectors.  That led to warnings of dynamic code
when the contents were used.

So using arrays of doubles is heping in two ways - it's supporting fast,
mutable access and also helping the compiler infer (from the ^doubles hint for
the array) the tyoe of the content.

Andrew

PS I posted this on HN where it gained some view, but no useful comments.