In our last excursion involving fractals and the Julia programming language I focused on the simplicity and features of the language itself, willfully ignoring any practical concerns about performance. However, generating fractal images can be quite time consuming, so optimization is worth caring about. Furthermore, this problem is ideally suited to parallel processing, so in this installment, we will examine one way of parallelizing our Julia code, and compare the results to similar approaches in C++, and for the coups de grâce, to an implementation leveraging OpenCL.
A Short Note About Parallelism
All too often, attempts to introduce parallelism turn out like this:
Fortunately, computing the Mandelbrot set resembles the first image much more than the second. This computation is an example par excellence of an embarassingly parallel problem — the computation at each point (pixel) is completely independent of the computation at other points, so it is straightforward to implement a parallel solution.
Julia
As a reminder, here is the Julia version of the main computation:
function boundedorbit(f, seed, bound, bailout=100, itmap=(n,zn,b)>n)
z = f(seed)
for k = 1:bailout
if abs(z) > bound
return itmap(k, z, bailout)
end
z = f(z)
end
return Inf
end
We also used a special function to transform the discrete iteration counts into a continuous range of values:
normalized_iterations(n, zn, bailout) = n + (log(log(bailout))log(log(abs(zn))))/(log(2))
For reference, you’ll also want to recall that we applied these functions via a map
over the input array:
values = map(z > boundedorbit(x > x^2 + z, 0, 2, 200, normalized_iterations)
> x > 5x
> grayvalue, linspace(2.51.25im, 1.0+1.25im, 3500, 2500));
Now it quickly became apparent in benchmarking this code that it was not optimal. Execution times were hovering around a full minute, which (as we shall soon see) is not even close to what native code can do. Based on the performance claims of the Julia project, I expected better. A little probing into the issue eventually revealed the culprit: anonymous functions. At the moment, anonymous functions like our cute x > x^2 + z
are not able to be inlined, which incurs a lot of overhead in a hot loop like ours. If we concede the genericity and rewrite a Mandelbrotspecific function to perform the iterations, like so:
function mandelbrotorbit(c, seed, bound, bailout=100, itmap=(n,zn,b)>n)
z = seed^2 + c # z = f(seed)
for k = 1:bailout
if abs(z) > bound
return itmap(k, z, bailout)
end
z = z^2 + c # z = f(z)
end
return Inf
end
Then we see performance improve to around six seconds! Now that is more in line with my expectations, and an impressive number — we’ll see why when we look at the results for C++. But first, we want to see how much further we can improve that time by introducing some parallelism.
In researching options for parallel processing in Julia, I stumbled on these slides which were quite helpful, in addition to the standard documentation. In short, Julia does not have anything like explicit threads, but provides a messagepassing system which allows multiple worker processes (running locally or on some other machine on a network, it doesn’t matter which) to coordinate their work. Julia has a standard DArray
type intended for splitting data among workers running on separate computers in a cluster, which is overkill for our purposes of just utilizing the multiple cores on one machine. However, there is an experimental SharedArray
type that allows multiple worker processes to access the same data on one machine, like shared memory. That is the approach we will try.
We need a SharedArray
for all workers to store their results to, and though it’s not strictly necessary we convert the input points to a SharedArray
as well.
# prepare input data
points = convert(SharedArray, linspace(2.51.25im, 1.0+1.25im, 3500, 2500))
# prepare results array
colors = SharedArray(Uint8, size(points))
You might be tempted (like I was) to then use pmap
as a parallel replacement for the map
as in our original function, but that would be a mistake as pmap
is intended to be used with functions that do a large amount of work for each invocation; for an individual point, mandelbrotorbit
doesn’t. On the other hand, the documentation points us to the @parallel
macro, which “can handle situations where each iteration is tiny”.
const BAILOUT = 200
f(z) = boundedorbit(x > x^2 + z, 0.0im, 2, BAILOUT, normalized_iterations)
@sync @parallel for j=1:size(points,2)
for k=1:size(points,1)
@inbounds colors[k,j] = grayvalue(f(points[k,j]))
end
end
The @sync
is there to prevent the program from continuing until all the workers are finished. Also, we are applying @parallel
to the outer loop, which means each execution of the loop is doing more than a tiny amount of work because it’s executing the whole inner loop, but the results are actually worse when applying @parallel
to the inner loop, similar to serial execution.
C++
The Julia implementation of boundedorbit
and normalized_iterations
is surprisingly simple to translate into C++ code, at least when we use a modern standard and boost (irange in particular):
template<typename Iterate, typename IterationMap, typename T>
int boundedorbit(Iterate f, std::complex<T> seed, int bound, int bailout=100,
IterationMap itmap = [](int n, std::complex<T> z, int bailout) { return n; })
{
auto z = f(seed);
for (auto k : boost::irange(1, bailout)) {
if (abs(z) > bound)
return itmap(k, z, bailout);
z = f(z);
}
return std::numeric_limits<int>::min();
}
template<typename T>
float normalized_iterations(int n, std::complex<T> zn, int bailout)
{
return n + (log(log(bailout))log(log(abs(zn))))/log(2);
}
Template parameters give us some of the flexibility of Julia’s polymorphism, but at compile time only. We also rely on C++11 lambda expressions and std::complex
datatype — the translation would be a little less transparent without them.
Things are rarely so simple in the world of C++, however, and as soon as we try to use expressions like a*z
where z
is a std::complex<T>
value and a
is of a type convertible to T
, we run into problems. So for convenience, we add a couple of conversion operators:
template <typename T, typename U>
inline std::complex<T> operator*(const std::complex<T>& lhs, const U& rhs)
{
return lhs * T(rhs);
}
template <typename T, typename U>
inline std::complex<T> operator*(const U& lhs, const std::complex<T>& rhs)
{
return T(lhs) * rhs;
}
We are now prepared to rewrite the main loop. Boost’s irange
iterator pairs well with C++11 foreach
expressions to give us something not much more verbose than the corresponding Julia version. The verbosity of C++ lambda functions does not begin to approach the elegance of Julia’s z>z^2+c
syntax, but use of a C++14 feature — generic lambdas — helps a little:
for (auto j : boost::irange(0, 2500)) {
for (auto k : boost::irange(0, 3500)) {
auto c = (2.5 + k*3.5/3500.0) + (1.25i + j*2.5i/2500.0);
iteration_counts[3500*j + k] = boundedorbit([&c](auto z) { return z*z + c; }, 0.0i, 2, 200, &normalized_iterations<double>);
}
}
Complex literals (another C++14 feature) also help declutter the syntax, admittedly at the risk of confusing the imaginary identifier i
with a loop index variable.
Now when it comes to parallelizing the execution of this C++ code, we have many options. For this exercise, I chose to use Intel’s Threading Building Blocks because it’s designed specifically for C++ in a very idiomatic way, as opposed to pthreads or OpenMP which are more idiomatic to C. Additionally, TBB has some nice features like work stealing, and I was curious to see how it would perform in this situation. Converting the serial version above to use TBBparallelism is pretty trivial (didn’t I say it was idiomatic?):
const int GRAIN = 100;
tbb::parallel_for(tbb::blocked_range2d<int>(0, 2500, GRAIN, 0, 3500, GRAIN), [&](auto r) {
for (auto j = r.rows().begin(); j != r.rows().end(); j++) {
for (auto k = r.cols().begin(); k != r.cols().end(); k++) {
auto c = (2.5 + k*3.5/3500.0) + (1.25i + j*2.5i/2500.0);
iteration_counts[3500*j + k] = boundedorbit([&c](auto z) { return z*z + c; }, 0.0i, 2, 200, &normalized_iterations<double>);
}
}
});
Essentially, we wrap the serial loops above into one parallel_for
call, which splits the entire index space up into chunks based on the grain size (GRAIN
). TBB takes care of creating and scheduling threads to run these chunks of work or work items. Whenever a worker thread finishes a work item, it dequeues the next available one, so no thread spends much time in an idle state, improving efficiency. We’ll take a look at how effective this is when we get to the results.
Once we’ve done all this computation, we’re going to want a way to write out the resulting image so we can verify its correctness. Julia had the handy (and extensive!) Images
package available to use; here we can at least get basic PNG writing using the handy stb headeronly library. A simple #define
and #include
later:
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
Now the stbi_write_png
function is available to output our results as a PNG image.
We’ll also want a way to measure the execution time of the calculation itself, by bracketing the calculation between
auto start = std::chrono::high_resolution_clock::now();
and
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds = end  start;
std::cout << "computation took " << elapsed_seconds.count() << "s" << std::endl;
In case you are unfamililar with std::chrono
, it too is a C++11 feature, and as you can see makes uses like ours quite simple; much simpler than it otherwise would be to use platformspecific highresolution timing facilities and the necessary conversions to standard units of time.
OpenCL
Strictly speaking, OpenCL is not a language, but rather a crossplatform API for highperformance parallel computing. If the bindings for OpenCL in Julia were compatible with the latest bleedingedge version of Julia (0.4.0dev), we could use it just as easily as from C++ or other languages. At this time of writing, that is not the case, so C++ will have to do .
To use the C++ bindings for OpenCL, you will need this header unless your system already provides it (OS X 10.10 where I tested this does not).
Remember how I said OpenCL is not a language? Well, it does specify a language for writing the actual computational code that will run on whichever supported device(s) we submit it to; this language is similar to C99. We specify kernels of computation in this language, which can then be compiled and executed on various devices, most commonly CPUs or GPUs. Rather than going into all the boilerplate necessary to take our kernel, execute it on some device(s), and gather the results, we will limit ourselves to a brief look at how the kernel code compares to the previous implementations in Julia and C++.
At this time, there is no complex datatype in OpenCL, but there are vector types, so for clarity we alias a 2vector of double
to Complex
. Note that not every OpenCL device is capable of doubleprecision floating point math; hence the prescence of the #pragma
requiring the cl_khr_fp64
extension. The OpenCL compiler on my Mac also prefers function declarations ahead of time, so we may as well include those here:
#pragma OPENCL_EXTENSION cl_khr_fp64 : enable
typedef double2 Complex;
Complex multiply(Complex a, Complex b);
float boundedorbit(Complex seed, Complex c, float bound, int bailout);
float normalized_iterations(int n, Complex zn, int bailout);
unsigned char grayvalue(float n);
We also need a function to perform complex multiplication, which you will recall is defined by:
\[ (a + b i)(c + d i) = (ac  bd) + (bc + ad)i \]
Simple enough, but note the syntax for constructing a vector (Complex
) literal; it might not be what you expect. Also, for the uninitiated, .s0
and .s1
are simply accessors for the first and second elements of the vector (the real and imaginary parts of our complex number).
Complex multiply(Complex a, Complex b) {
return (Complex)(a.s0*b.s0a.s1*b.s1, a.s1*b.s0+a.s0*b.s1);
}
It remains to convert our nowfamiliar functions boundedorbit
, normalized_iterations
, and grayvalue
, but with some changes. For one, OpenCL does not allow function pointers, so we won’t be able to pass normalized_iterations
as a parameter; it will have to be called directly instead. This is a loss in flexibility compared to our previous approaches, but there is no getting around it. normalized_iterations
is essentially unchanged, and grayvalue
just wraps another (strangelooking) function:
float normalized_iterations(int n, Complex zn, int bailout) {
return n + (log(log(convert_float(bailout)))log(log(length(zn))))/log(2.0);
}
float boundedorbit(Complex seed, Complex c, float bound, int bailout)
{
Complex z = multiply(seed, seed) + c;
for (int k = 0; k < bailout; k++) {
if (length(z) > bound)
return normalized_iterations(k, z, bailout);
z = multiply(z, z) + c;
}
return FLT_MIN;
}
unsigned char grayvalue(float n) {
return convert_uchar_sat_rte(n);
}
The function grayvalue
is just a glorified alias for a very specific explicit type conversion (convert_<type>_sat_rte
) with saturation (_sat
), i.e. values are clamped to the representable range of the destination type (uchar
) and roundtoeven (_rte
) behavior. This family of conversion functions is really quite nice, allowing explicit control over the conversion behavior and making grayvalue
superfluous except to better parallel (and perhaps better document) the implementations above.
The supporting machinery is now in place, all that remains it to define the actual kernel function, the entry point:
__kernel void mandelbrot(__global unsigned char* output, size_t offset)
{
int k = get_global_id(0);
int j = get_global_id(1);
// construct position
Complex c = (Complex)(2.5 + 3.5*(k/3500.0), 1.25 + 2.5*((j+offset)/2500.0));
float count = boundedorbit((Complex)(0,0), c, 2.0, 200);
//if (3500*j+k < 3500*2500) // debugging guard
output[3500*j+k] = grayvalue(count);
}
If you’ve never seen an OpenCL kernel function before, it may strike you as odd, for instance, that there is no loop. In fact what we have here looks like just the body of a loop that would perform our computation. With OpenCL, rather than any explicit loop, we schedule the kernel above to operate over an index space and the device is responsible for executing the kernel on every element of the index space, which may occur in parallel. Like the puppy analogy above, we setup a massive grid of “food bowls” (our index space) and turn a large number of “puppies” loose on them concurrently. It should come as no surprise that our index space for this computation is 3500x2500, and the kernel determines which point to work on based on the index in that space retreived from the calls to get_global_id()
. Because of the independence of the computation at each point of our grid, the translation into an OpenCL kernel is unusually simple and natural.
By the way, if you’re wondering about the “debugging guard” comment at the end of the kernel, that check was useful during development to avoid writing outofbounds, which is a great way to cause a kernel panic (ask me how I know).
Results
For benchmarking, I had access to three different Apple computers: a 2012 MacBook Air, 2013 MacBook Pro, and 2013 Mac Pro, which I’ll refer to as MBA, MBP, and MP for short. Each was spec’d as follows:
MBA 
2core Intel Core i7 @ 2.0GHz 

MBP 
4core Intel Core i7 @ 2.6GHz 
Nvidia GT750M 
MP 
6core Intel Xeon @ 2.5GHz 
2 x AMD D700 
The following table summarizes performance metrics for each of the approaches discussed above, on several platforms. All times recorded are in seconds.
Julia 
10.9 
6.15 

6.51 

Parallel Julia 
5.55 
2.53 

2.45 

Serial C++ 
10.05 
8.45 

7.52 

TBB C++ 
2.65 
1.21 

0.74 

OpenCL 
1.40 
0.45 
1.05 
0.28 
0.07 






So OpenCL on highend hardware blows everything else out of the water, but Julia slightly outstrips C++ in serial execution. On the other hand, it appears that C++ benefits the most from parallelization:
Julia: Parallel Speedup
MBA 
1.96 
2 
0.98 
MBP 
2.43 
4 
0.61 
MP 
2.66 
6 
0.44 
C++: Parallel Speedup
MBA 
3.79 
2 
1.90 
MBP 
6.98 
4 
1.75 
MP 
10.16 
6 
1.69 
We might expect a theoretical maxiumum speedup of 1x per core, but because of Intel’s Hyperthreading technology, each core is capable (at times) of running two threads, so our ideal speedup may be more like 2x per core. Clearly C++/TBB is currently able to better utilize multiple cores, perhaps because of better scheduling via work stealing.
One interesting possibility that remains to be explored is scaling up the number of computers participating in the computation. Because of Julia’s builtin distributed computing model, it stands to benefit most directly (and probably most easily) from scaling up to a computation distributed over many computers; however, new overhead costs for communication would be incurred as well. For now, we will leave that as a topic for a future post.
Source Code