Parallel Mandelbrot in Julia, C++, and OpenCL

In our last excursion involving fractals and the Julia programming language I focused on the simplicity and features of the language itself, willfully ignoring any practical concerns about performance. However, generating fractal images can be quite time consuming, so optimization is worth caring about. Furthermore, this problem is ideally suited to parallel processing, so in this installment, we will examine one way of parallelizing our Julia code, and compare the results to similar approaches in C++, and for the coups de grâce, to an implementation leveraging OpenCL.

A Short Note About Parallelism

All too often, attempts to introduce parallelism turn out like this:

Multithreaded programming - theory and practice. pic.twitter.com/cwRV2Ys1Id
— Alexander Ilyushin (@chronum) December 4, 2014

Fortunately, computing the Mandelbrot set resembles the first image much more than the second. This computation is an example par excellence of an embarassingly parallel problem — the computation at each point (pixel) is completely independent of the computation at other points, so it is straightforward to implement a parallel solution.

Julia

As a reminder, here is the Julia version of the main computation:

function boundedorbit(f, seed, bound, bailout=100, itmap=(n,zn,b)->n)
    z = f(seed)
    for k = 1:bailout
        if abs(z) > bound
            return itmap(k, z, bailout)
        end
        z = f(z)
    end

    return -Inf
end

We also used a special function to transform the discrete iteration counts into a continuous range of values:

normalized_iterations(n, zn, bailout) = n + (log(log(bailout))-log(log(abs(zn))))/(log(2))

For reference, you’ll also want to recall that we applied these functions via a map over the input array:

values = map(z -> boundedorbit(x -> x^2 + z, 0, 2, 200, normalized_iterations) 
               |> x -> 5x 
               |> grayvalue, linspace(-2.5-1.25im, 1.0+1.25im, 3500, 2500));

Now it quickly became apparent in benchmarking this code that it was not optimal. Execution times were hovering around a full minute, which (as we shall soon see) is not even close to what native code can do. Based on the performance claims of the Julia project, I expected better. A little probing into the issue eventually revealed the culprit: anonymous functions. At the moment, anonymous functions like our cute x -> x^2 + z are not able to be inlined, which incurs a lot of overhead in a hot loop like ours. If we concede the genericity and rewrite a Mandelbrot-specific function to perform the iterations, like so:

function mandelbrotorbit(c, seed, bound, bailout=100, itmap=(n,zn,b)->n)
    z = seed^2 + c     # z = f(seed)
    for k = 1:bailout
        if abs(z) > bound
            return itmap(k, z, bailout)
        end
        z = z^2 + c    # z = f(z)
    end

    return -Inf
end

Then we see performance improve to around six seconds! Now that is more in line with my expectations, and an impressive number — we’ll see why when we look at the results for C++. But first, we want to see how much further we can improve that time by introducing some parallelism.

In researching options for parallel processing in Julia, I stumbled on these slides which were quite helpful, in addition to the standard documentation. In short, Julia does not have anything like explicit threads, but provides a message-passing system which allows multiple worker processes (running locally or on some other machine on a network, it doesn’t matter which) to coordinate their work. Julia has a standard DArray type intended for splitting data among workers running on separate computers in a cluster, which is overkill for our purposes of just utilizing the multiple cores on one machine. However, there is an experimental SharedArray type that allows multiple worker processes to access the same data on one machine, like shared memory. That is the approach we will try.

We need a SharedArray for all workers to store their results to, and though it’s not strictly necessary we convert the input points to a SharedArray as well.

# prepare input data
points = convert(SharedArray, linspace(-2.5-1.25im, 1.0+1.25im, 3500, 2500))

# prepare results array
colors = SharedArray(Uint8, size(points))

You might be tempted (like I was) to then use pmap as a parallel replacement for the map as in our original function, but that would be a mistake as pmap is intended to be used with functions that do a large amount of work for each invocation; for an individual point, mandelbrotorbit doesn’t. On the other hand, the documentation points us to the @parallel macro, which “can handle situations where each iteration is tiny”.

const BAILOUT = 200
f(z) = boundedorbit(x -> x^2 + z, 0.0im, 2, BAILOUT, normalized_iterations)

@sync @parallel for j=1:size(points,2)
    for k=1:size(points,1)
        @inbounds colors[k,j] = grayvalue(f(points[k,j]))
    end
end

The @sync is there to prevent the program from continuing until all the workers are finished. Also, we are applying @parallel to the outer loop, which means each execution of the loop is doing more than a tiny amount of work because it’s executing the whole inner loop, but the results are actually worse when applying @parallel to the inner loop, similar to serial execution.

C++

The Julia implementation of boundedorbit and normalized_iterations is surprisingly simple to translate into C++ code, at least when we use a modern standard and boost (irange in particular):

template<typename Iterate, typename IterationMap, typename T>
int boundedorbit(Iterate f, std::complex<T> seed, int bound, int bailout=100, 
                 IterationMap itmap = [](int n, std::complex<T> z, int bailout) { return n; })
{
    auto z = f(seed);
    for (auto k : boost::irange(1, bailout)) {
        if (abs(z) > bound) 
            return itmap(k, z, bailout);
        z = f(z);
    }
    return std::numeric_limits<int>::min();
}

template<typename T>
float normalized_iterations(int n, std::complex<T> zn, int bailout)
{
    return n + (log(log(bailout))-log(log(abs(zn))))/log(2);
}

Template parameters give us some of the flexibility of Julia’s polymorphism, but at compile time only. We also rely on C++11 lambda expressions and std::complex datatype — the translation would be a little less transparent without them.

Things are rarely so simple in the world of C++, however, and as soon as we try to use expressions like a*z where z is a std::complex<T> value and a is of a type convertible to T, we run into problems. So for convenience, we add a couple of conversion operators:

template <typename T, typename U>
inline std::complex<T> operator*(const std::complex<T>& lhs, const U& rhs)
{
    return lhs * T(rhs);
}

template <typename T, typename U>
inline std::complex<T> operator*(const U& lhs, const std::complex<T>& rhs)
{
    return T(lhs) * rhs;
}

We are now prepared to rewrite the main loop. Boosts irange iterator pairs well with C++11 foreach expressions to give us something not much more verbose than the corresponding Julia version. The verbosity of C++ lambda functions does not begin to approach the elegance of Julia’s z->z^2+c syntax, but use of a C++14 feature — generic lambdas — helps a little:

for (auto j : boost::irange(0, 2500)) {
    for (auto k : boost::irange(0, 3500)) {
        auto c = (-2.5 + k*3.5/3500.0) + (-1.25i + j*2.5i/2500.0);
        iteration_counts[3500*j + k] = boundedorbit([&c](auto z) { return z*z + c; }, 0.0i, 2, 200, &normalized_iterations<double>);
    }
}

Complex literals (another C++14 feature) also help declutter the syntax, admittedly at the risk of confusing the imaginary identifier i with a loop index variable.

Now when it comes to parallelizing the execution of this C++ code, we have many options. For this exercise, I chose to use Intel’s Threading Building Blocks because it’s designed specifically for C++ in a very idiomatic way, as opposed to pthreads or OpenMP which are more idiomatic to C. Additionally, TBB has some nice features like work stealing, and I was curious to see how it would perform in this situation. Converting the serial version above to use TBB-parallelism is pretty trivial (didn’t I say it was idiomatic?):

const int GRAIN = 100;

tbb::parallel_for(tbb::blocked_range2d<int>(0, 2500, GRAIN, 0, 3500, GRAIN), [&](auto r) {
    for (auto j = r.rows().begin(); j != r.rows().end(); j++) {
        for (auto k = r.cols().begin(); k != r.cols().end(); k++) {
            auto c = (-2.5 + k*3.5/3500.0) + (-1.25i + j*2.5i/2500.0);
            iteration_counts[3500*j + k] = boundedorbit([&c](auto z) { return z*z + c; }, 0.0i, 2, 200, &normalized_iterations<double>);
        }
    }
});

Essentially, we wrap the serial loops above into one parallel_for call, which splits the entire index space up into chunks based on the grain size (GRAIN). TBB takes care of creating and scheduling threads to run these chunks of work or work items. Whenever a worker thread finishes a work item, it de-queues the next available one, so no thread spends much time in an idle state, improving efficiency. We’ll take a look at how effective this is when we get to the results.

Once we’ve done all this computation, we’re going to want a way to write out the resulting image so we can verify its correctness. Julia had the handy (and extensive!) Images package available to use; here we can at least get basic PNG writing using the handy stb header-only library. A simple #define and #include later:

#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

Now the stbi_write_png function is available to output our results as a PNG image.

We’ll also want a way to measure the execution time of the calculation itself, by bracketing the calculation between

auto start = std::chrono::high_resolution_clock::now();

and

auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;

std::cout << "computation took " << elapsed_seconds.count() << "s" << std::endl;

In case you are unfamililar with std::chrono, it too is a C++11 feature, and as you can see makes uses like ours quite simple; much simpler than it otherwise would be to use platform-specific high-resolution timing facilities and the necessary conversions to standard units of time.

OpenCL

Strictly speaking, OpenCL is not a language, but rather a cross-platform API for high-performance parallel computing. If the bindings for OpenCL in Julia were compatible with the latest bleeding-edge version of Julia (0.4.0-dev), we could use it just as easily as from C++ or other languages. At this time of writing, that is not the case, so C++ will have to do ¹.

To use the C++ bindings for OpenCL, you will need this header unless your system already provides it (OS X 10.10 where I tested this does not).

Remember how I said OpenCL is not a language? Well, it does specify a language for writing the actual computational code that will run on whichever supported device(s) we submit it to; this language is similar to C99. We specify kernels of computation in this language, which can then be compiled and executed on various devices, most commonly CPUs or GPUs. Rather than going into all the boilerplate necessary to take our kernel, execute it on some device(s), and gather the results, we will limit ourselves to a brief look at how the kernel code compares to the previous implementations in Julia and C++.

At this time, there is no complex datatype in OpenCL, but there are vector types, so for clarity we alias a 2-vector of double to Complex. Note that not every OpenCL device is capable of double-precision floating point math; hence the prescence of the #pragma requiring the cl_khr_fp64 extension. The OpenCL compiler on my Mac also prefers function declarations ahead of time, so we may as well include those here:

#pragma OPENCL_EXTENSION cl_khr_fp64 : enable

typedef double2 Complex;

Complex multiply(Complex a, Complex b);
float boundedorbit(Complex seed, Complex c, float bound, int bailout);
float normalized_iterations(int n, Complex zn, int bailout);
unsigned char grayvalue(float n);

We also need a function to perform complex multiplication, which you will recall is defined by:

\[ (a + b i)(c + d i) = (ac - bd) + (bc + ad)i \]

Simple enough, but note the syntax for constructing a vector (Complex) literal; it might not be what you expect². Also, for the uninitiated, .s0 and .s1 are simply accessors for the first and second elements of the vector (the real and imaginary parts of our complex number).

Complex multiply(Complex a, Complex b) {
    return (Complex)(a.s0*b.s0-a.s1*b.s1, a.s1*b.s0+a.s0*b.s1);
}

It remains to convert our now-familiar functions boundedorbit, normalized_iterations, and grayvalue, but with some changes. For one, OpenCL does not allow function pointers, so we won’t be able to pass normalized_iterations as a parameter; it will have to be called directly instead. This is a loss in flexibility compared to our previous approaches, but there is no getting around it. normalized_iterations is essentially unchanged, and grayvalue just wraps another (strange-looking) function:

float normalized_iterations(int n, Complex zn, int bailout) {
    return n + (log(log(convert_float(bailout)))-log(log(length(zn))))/log(2.0);
}

float boundedorbit(Complex seed, Complex c, float bound, int bailout)
{
    Complex z = multiply(seed, seed) + c;
    for (int k = 0; k < bailout; k++) {
        if (length(z) > bound) 
            return normalized_iterations(k, z, bailout);
        z = multiply(z, z) + c;
    }
    return FLT_MIN;
}

unsigned char grayvalue(float n) {
    return convert_uchar_sat_rte(n);
}

The function grayvalue is just a glorified alias for a very specific explicit type conversion (convert_<type>_sat_rte) with saturation (_sat), i.e. values are clamped to the representable range of the destination type (uchar) and round-to-even (_rte) behavior. This family of conversion functions is really quite nice, allowing explicit control over the conversion behavior and making grayvalue superfluous except to better parallel (and perhaps better document) the implementations above.

The supporting machinery is now in place, all that remains it to define the actual kernel function, the entry point:

__kernel void mandelbrot(__global unsigned char* output, size_t offset)
{
    int k = get_global_id(0);
    int j = get_global_id(1);

    // construct position
    Complex c = (Complex)(-2.5 + 3.5*(k/3500.0), -1.25 + 2.5*((j+offset)/2500.0));

    float count = boundedorbit((Complex)(0,0), c, 2.0, 200);

    //if (3500*j+k < 3500*2500) // debugging guard
    output[3500*j+k] = grayvalue(count);
}

If you’ve never seen an OpenCL kernel function before, it may strike you as odd, for instance, that there is no loop. In fact what we have here looks like just the body of a loop that would perform our computation. With OpenCL, rather than any explicit loop, we schedule the kernel above to operate over an index space and the device is responsible for executing the kernel on every element of the index space, which may occur in parallel. Like the puppy analogy above, we setup a massive grid of “food bowls” (our index space) and turn a large number of “puppies” loose on them concurrently. It should come as no surprise that our index space for this computation is 3500x2500, and the kernel determines which point to work on based on the index in that space retreived from the calls to get_global_id(). Because of the independence of the computation at each point of our grid, the translation into an OpenCL kernel is unusually simple and natural.

By the way, if you’re wondering about the “debugging guard” comment at the end of the kernel, that check was useful during development to avoid writing out-of-bounds, which is a great way to cause a kernel panic (ask me how I know).

Results

For benchmarking, I had access to three different Apple computers: a 2012 MacBook Air, 2013 MacBook Pro, and 2013 Mac Pro, which I’ll refer to as MBA, MBP, and MP for short. Each was spec’d as follows:

	CPU	GPU
MBA	2-core Intel Core i7 @ 2.0GHz
MBP	4-core Intel Core i7 @ 2.6GHz	Nvidia GT750M
MP	6-core Intel Xeon @ 2.5GHz	2 x AMD D700

The following table summarizes performance metrics for each of the approaches discussed above, on several platforms. All times recorded are in seconds.

	MBA	MBP CPU	MBP GPU	MP CPU	MP GPU
Julia	10.9	6.15		6.51
Parallel Julia	5.55 ³	2.53 ⁴		2.45 ⁵
Serial C++	10.05	8.45		7.52
TBB C++ ⁶	2.65	1.21		0.74
OpenCL	1.40	0.45	1.05	0.28	0.07

So OpenCL on high-end hardware blows everything else out of the water, but Julia slightly outstrips C++ in serial execution. On the other hand, it appears that C++ benefits the most from parallelization:

Julia: Parallel Speedup

	Speedup	Cores	Speedup/Core
MBA	1.96	2	0.98
MBP	2.43	4	0.61
MP	2.66	6	0.44

C++: Parallel Speedup

	Speedup	Cores	Speedup/Core
MBA	3.79	2	1.90
MBP	6.98	4	1.75
MP	10.16	6	1.69

We might expect a theoretical maxiumum speedup of 1x per core, but because of Intel’s Hyper-threading technology, each core is capable (at times) of running two threads, so our ideal speedup may be more like 2x per core. Clearly C++/TBB is currently able to better utilize multiple cores, perhaps because of better scheduling via work stealing.

One interesting possibility that remains to be explored is scaling up the number of computers participating in the computation. Because of Julia’s built-in distributed computing model, it stands to benefit most directly (and probably most easily) from scaling up to a computation distributed over many computers; however, new overhead costs for communication would be incurred as well. For now, we will leave that as a topic for a future post.

Source Code

`mandelbrot.jl`	Julia implementation
`mandelbrot.cpp`	Serial C++ implementation
`mandelbrot_tbb.cpp`	TBB implementation
`mandelbrot_cl.cpp`	OpenCL host/driver program
`mandelbrot.cl`	OpenCL kernel

OpenCL defines a host API for C and C++, but wrappers are available for many other languages↩︎
One could be forgiven for expecting that Complex(1.0, 2.0) — which is really just double2(1.0, 2.0) — would construct a vector literal, but it doesn’t. However, it will still compile, leading to surprising (and wrong) results. The correct way to initialize a vector type is (double2)(1.0, 2.0).↩︎
Run with -p 4 (4 worker processes)↩︎
Run with -p 8↩︎
Run with -p 12↩︎
Compiled with clang using -O3↩︎