Seven Ways from ADD

Mike Radin, 2013/04/13

I was asked recently about how I would use multiple threads to get a sum of a long list of numbers. The short answer is that this approach is not the best despite the appearance efficiency. Applications rarely run in complete isolation, the OS scheduler is constantly shuffling work to and from the CPU. This context switching can degrade performance. Even thread setup is not free, even with the use of a Threadpool which on its own likely does a good job of managing system resources. Then once the multiple threads have completed their own calculations the single result must be constructed from the returned parts. Contrived as this example may be, it offers a good segue to a discussion on list comprehension and native code.

Regular

The specific question was in the context of C#, which since introduction of .net 3.0 received a very powerful list comprehension tool — LINQ. Not only is there a library for it, but even special SQL-like syntax that can be mixed with more traditional C-style code.

Using LINQ there are two ways to sum a list:

array.Aggregate<float, double>(0, (current, i) => current + i);

array.Sum();

Since introduction C# also has a foreach statement that can iterate over IEnumerable collections. This yields another way to add the numbers up.

foreach(var i in array) { arraysum += i; }

Curiously, when you write this, code analysis tools suggest converting this loop into a LINQ statement. The resulting code, should you indulge Resharper, will look like the Aggregate call above. Then there is the usual indexed for loop and some prefer to unroll the loop for better performance. Though tricks like that should be performed by the JIT as the code is executing.

for(int i = 0; i < array.Length; i++) { arraysum += a[i]; }

for(int i = 0; i < array.Length; i++){
    arraysum += a[i];
    arraysum += a[++i];
    arraysum += a[++i];
    arraysum += a[++i];
}

These five snippets were tested with the following code. The idea is to create a dataset larger than the L3 cache of the CPU and fill it with values that in sum do not overflow an float.

var size = 9600000;
var array = new float[size];
var cap = int.MaxValue / size;
var random = new System.Random((int) (System.DateTime.Now.Ticks % int.MaxValue))
for(int i = 0; i < size; i++) { array[i] = random.Next(0, cap); }

The code in this article is listed in order of improving performance.

LINQ Aggregate: 100%
LINQ Sum: 77.04%
foreach: 30.48%
for x1: 30.70%
for x4: 30.72%

That's quite a penalty for writing somewhat more readable code. Perhaps more surprisingly the foreach loop is faster than an indexed for iteration, though only a decimal point. The unrolling presented above is not particularly efficient. It will generate four stores per loop where it only really needs one. The loop below is equivalent, but it executes much faster — 17.72%.

for(int i = 0; i < array.Length; i++){
    arraysum += a[i] + a[++i] + a[++i] + a[++i];
}

Even here though foreach was faster and the lower bound was also faster than any for loop.

Faster

With expertise, the best way to write fast code is in assembly. This is certainly not viable for large projects with large teams, but small optimizations in specific places may lead to significant performance gains. With SSE in 1999, x86 chips gained the ADDPS instruction. Add Packed Single treats the two 128-bit parameters as two arrays of four 32-bit float values and returns four sums of values at matching positions in the two arrays, a[0] + b[0], etc. Accessing this functionality from C# is a bit of work, but once the C wrapper is written, it's easy to use. Modern compilers provide libraries to access CPU instructions directly with intrinsics. _mm_add_ps will generate a ADDPS instruction. Later in SSE3, HADDPS became available which returns sums of consecutive pairs, a[0] + a[1]. For the purposes of this text it accomplishes the same task. The following code is meant to be compiled with MSC++ as part of Visual Studio 2010.

#include <x86intrin.h>

extern "C" __declspec(dllexport) __stdcall float sum(float* a, int len) {
    int i = 0;
    __m128 rr = _mm_setzero_ps();
    while (i < len) {
        rr = _mm_add_ps(rr, _mm_set_ps(a[i + 3], a[i + 2], a[i + 1], a[i]));
        i += 4;
    }

    return rr.m128_f32[0] + r.m128_f32[1] + r.m128_f32[2] + r.m128_f32[3];
}

Running this code through the same test produces 6.96% result, however an unrolled for loop in C written like the last example in C# above performs better, 6.07%. This is unexpected. There are some optimizations to be made here. First of all, _mm_set_ps is a rather inefficient way to load four contiguous values from an array, use _mm_load_ps(const float *v) instead. This way instead of multiple memory loads we should do just one MOVAPS. The problem here is the "A", the instruction expects 16-byte-aligned memory addresses and the CLR memory allocation mechanism makes no such guarantees. Instead we'll use MOVUPS which is the unaligned equivalent and is supposedly somewhat slower. The body of the new function will look as follows.

__m128 rr = _mm_loadu_ps((const float *)&a[0]);
for(int i = 4; i < len; i += 4) {
    r = _mm_add_ps(_mm_loadu_ps((__m128i *)&a[i]), rr)));
}

return rr.m128_f32[0] + rr.m128_f32[1] + rr.m128_f32[2] + rr.m128_f32[3];

The new loop runs in 5.96% time. Still slow. Applying loop unrolling technique we get a new function body below

__m128 rr = _mm_loadu_ps((const float *)&a[0]);
int ii = 8;
int iii = 12;
for(int i = 4; i < len; i += 12) {
    rr = _mm_add_ps(_mm_loadu_ps((const float*)&a[i]), _mm_add_ps(_mm_loadu_ps((const float*)&a[ii]), _mm_add_ps(_mm_loadu_ps((const float*)&a[iii]), rr)));
    i += 12;
    ii += 12;
    iii += 12;
}

This is where the expected gains appear, this code averages just over 4%. Curiously, replacing ADDPS with HADDPS in the last two examples yields better performance for the single iteration loop with 5.78% but not for the unrolled loop with 5.11%.

4.16% in this case translates into 4.98ms. That is well below the thread quantum threshold on most systems. That number does vary based on hardware and configuration and there may be cases where original LINQ Aggregate calculation will actually happen without a context switch, but that is far less likely than a thread being preempted in under 5ms.

Threaded

.net 4 introduced Task Parallel Library which aims to simplify and standardize implementation of parallel processing. Parallel.For is a feature of TPL that will execute a delegate in a loop using thread pool threads for the iteration.

public float aggregate(float[] array){
    float[] parts = new float[4];
    int partlength = array.Length / 4;

    Parallel.For(0, 4, i => {
        parts[i] = sum(array, partlength * i, partlength);
    });

    return parts[0] + parts[1] + parts[2] + parts[3];
}

public float sum(float[] array, int start, int loop){
    var arraysum = 0f;

    for(int i = start; i < start + loop; i++){
        arraysum += a[i] + a[++i] + a[++i] + a[++i];
    }

    return arraysum;
}

This code produces very good results. 4.88% which is faster than anything else presented here except for the optimized SIMD code. Since the target machine had 16 virtual (8 physical) cores, test were run for an 8-way split as well. While performance we better, it was not a significant improvement. About 8% faster than 4-way split which wouldn't get very close to the leading position. For those interested in other TPL features, there is also PLINQ, but that was not tested for this article.

TL;DR

There may be large performance benefits in avoiding LINQ. While the CLR puts up quite a show, C outperforms it by a wide margin. SIMD instructions may not provide serious improvement without detailed optimization.

Further optimizations are possible. Newer chips implement AVX which allows 8-way addition. By writing assembly it would be possible to save some MOV instructions for intermediate values allowing the aggregate to persist in a register rather than storing it to memory at every iteration that MSC++ compiler seems to want to do. The C compiler included in Visual Studio 2012 should auto-vectorize loops of this nature.

Appendix 1: Run Time Averages

LINQ Aggregate: 119.83ms; σ = 12.51
LINQ Sum: 92.32; σ = 7.91
foreach: 36.52; σ = 2.87
for x4 unrolled: 36.81; σ = 3.06
for: 36.79; σ = 3.00
for x4 inline: 21.23; σ = 1.84
ADDPS single, _mm_set_ps: 8.02; σ = 0.55
ADDPS single, _mm_loadu_ps: 7.14; σ = 0.74
for C, x4 unrolled: 7.28; σ = 0.65
HADDPS single: 6.93; σ = 0.82
HADDPS, x3 unrolled: 6.13; σ = 0.56
TPL x4: 5.85; σ = 0.41
ADDPS, x3 unrolled: 4.98; σ = 0.51

Though not mentioned specifically, the same code built in debug mode produces poorer results especially for the top six items of this list. Furthermore, foreach and for loop performance differ by a factor of two. The averages were computed based on 100 iterations of each operation. Raw data is available on Google Drive.

Appendix 2: Test Code

Note the the code assumes a list of size modulo 12 equals 0. This was done to simplify testing, furthermore the input data is guaranteed not to overflow a float value.

Appendix 3: Environment Description

These tests were conducted on Windows 7, SP1. All code was compiled in Visual Studio 2010, SP1 using release mode, optimized for speed, C# code was built targeting .net 4. CPU was an Intel Xeon E5520 at 2.27GHz. Resharper version 5.1.

Appendix 4: Sample Assembly

This assembly is generated for the single ADDPS loop.

mov     eax,dword ptr [ebp-40h]
cmp     eax,dword ptr [ebx+0Ch]
jge     sumx+2E8h
mov     eax,dword ptr [ebp-40h]
mov     ecx,dword ptr [ebx+8]
movaps  xmm0,xmmword ptr [ecx+eax*4]
movaps  xmmword ptr [ebp-280h],xmm0
movaps  xmm0,xmmword ptr [ebp-280h]
movaps  xmm1,xmmword ptr [ebp-60h]
addps   xmm1,xmm0
movaps  xmmword ptr [ebp-260h],xmm1
movaps  xmm0,xmmword ptr [ebp+260h]
movaps  xmmword ptr [ebp-60h],xmm0
mov     eax,dword ptr [ebp-40h]
add     eax,4
mov     dword ptr[ebp-40h],eax
jmp     sumx+2A4h

Appendix 5: A Warning on Optimization

Sometimes code that looks right, doesn't execute as expected. For example

for(int i = 0; i < len;){
    r += a[i++] + a[i++] + a[i++] + a[i++];
}

Where r is the accumulated result, a[] is the input array of length len. When written in C this will produce an incorrect result. While it may be obvious to some, post-increment operations won't actually be executed until the sum has completed. For an array of length four, this would be the equivalent of adding the first element to itself four times and then incrementing the index by four in four separate operations. Replacing the body of the loop with the following fixes the problem.

r += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
i += 4;

References

MSDN: Precedence and Order of Evaluation

Google Drive document: detailed summary results