User:Bjacob/ArithmeticTimingDifferences


This page is about which scalar floating-point arithmetic instructions fail to execute in constant time (independently of their operands) on various CPUs. We also study some multiple-instructions constructs such as math functions.

Terminology

x86 architecture

Instruction sets

In the x86 architecture, there are two competing floating-point arithmetic instruction sets: the legacy x87 instruction set, and SSE.

While SSE offers SIMD instructions, it also provides a full replacement for old x87 scalar (SISD) instructions. We are only concerned with scalar instructions in this document.

Here we use SSE to generically refer to all versions of SSE instruction sets. The original SSE instruction set (hereafter SSE1) was introduced in the Pentium III and only handles floats. The subsequent SSE2 instruction set, introduced in the Pentium 4, added support for doubles. We are not concerned with SSE3 and newer instruction sets in this document.

Handling of denormals

x86 CPUs have two distinct flags that can optionally be set to enable non-IEEE-compliant handling of denormals by SSE instructions:

  • FTZ (Flush To Zero) causes any denormal result to be flushed to zero.
  • DAZ (Denormals Are Zero) causes any denormal operand to be handled as if it were zero.


Not all CPUs have these flags. Even SSE2-capable CPUs can lack DAZ; see the dedicated section about that below.
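
Both flags are bits in the MXCSR register. As an illustration (not taken from the benchmark code), they can be enabled with the standard SSE intrinsics along these lines:

 #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (FTZ)
 #include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (DAZ)
 // Enable non-IEEE handling of denormals for SSE instructions.
 // Assumes a CPU that actually supports DAZ (see the caveat above).
 void enable_ftz_and_daz()
 {
   _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
   _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
 }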

x86-64 architecture

The only way in which x86-64 differs from x86-32 in these matters is that SSE2 is unconditionally available on x86-64. Hereafter, we let x86 refer generically to x86-32 or x86-64.

ARM architecture

Instruction sets

ARM has one hardware, scalar, floating-point instruction set: VFP.

NEON is purely a SIMD instruction set, so it doesn't concern us here.

Many ARM CPUs do not have NEON, and some ARM CPUs do not even have VFP. On these, software emulation of floating-point arithmetic is used. For our purposes, this can be considered another separate instruction set.

Handling of denormals

ARM has a single flag, Flush To Zero (FZ), which plays the role of both of x86's flags, FTZ and DAZ.
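
As an illustration (not from the benchmark code), on ARMv7 with VFP the flag can be enabled by setting bit 24 of the FPSCR register, for example with GCC inline assembly:

 // Enable Flush To Zero on ARMv7 VFP: FZ is bit 24 of FPSCR.
 void enable_flush_to_zero()
 {
   unsigned int fpscr;
   __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
   fpscr |= 1u << 24;
   __asm__ volatile("vmsr fpscr, %0" : : "r"(fpscr));
 }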

Getting experimental data

Methodology

The code performing the measurements is on GitHub.

For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f), we run the operation N times and compute the mean timing and its standard deviation. We then check how the overall median time for this operation, taken over all sets of operands, compares to that distribution: if it is more than two standard deviations away from the mean, the overall median is not a plausible timing for this particular set of operands, and therefore this set of operands exhibits abnormal timings for this operation.

The underlying assumption behind counting in standard deviations is that, for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) distribution. This is reasonable because a given computer operation likely behaves like a series of sub-operations, each of which can randomly either run in optimal time or take a little bit longer, so the central limit theorem suggests that the sum of the sub-operations' timings is approximately normally distributed. We verified on a few results that the timings' density plot looked plausibly similar to a Gaussian curve.

The reason to compare these normal distributions to the overall median, not the overall mean, is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, all actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual observed timing and is much more robust in the face of outliers.
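
To make the criterion concrete, here is a minimal sketch of the abnormality test described above (an illustration only; the actual benchmark code is in the repository mentioned above):

 #include <cmath>
 #include <vector>
 // A set of operands is flagged as abnormal if the overall median timing for
 // the operation is more than two standard deviations away from the mean of
 // the timings measured for that particular set of operands.
 bool timing_is_abnormal(const std::vector<double>& timings, double overall_median)
 {
   double mean = 0;
   for (double t : timings) mean += t;
   mean /= timings.size();
   double variance = 0;
   for (double t : timings) variance += (t - mean) * (t - mean);
   variance /= timings.size();
   return std::abs(overall_median - mean) > 2 * std::sqrt(variance);
 }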

Times are measured with clock_gettime using CLOCK_THREAD_CPUTIME_ID; in other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O and syscalls (no malloc) and touches at most two cache lines (it only accesses a few bytes on the stack, which in the worst case straddle one cache-line boundary).
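
For reference, the timing primitive boils down to something like the following sketch (illustrative only; on older glibc this needs linking with -lrt):

 #include <stdint.h>
 #include <time.h>
 // Per-thread CPU time in nanoseconds, as described above.
 uint64_t thread_cpu_time_ns()
 {
   struct timespec ts;
   clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
   return uint64_t(ts.tv_sec) * 1000000000u + uint64_t(ts.tv_nsec);
 }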

Caveats

Equality comparisons

One of the operation types that we time, "equality comparison", requires some explanation. For a given scalar type, by "equality comparison" we mean this:

 template <typename scalar>
 scalar equality_comparison(scalar x, scalar y)
 {
   return scalar(x == y);
 }

With integer types, this gets implemented with conditional moves and thus consistently runs in constant time. With floating-point types, conditional moves are not enough, because they have no floating-point variants. The compiler could still compute (x==y) as a boolean with a conditional move and then cast it to the floating-point type, but at least GCC 4.6 doesn't do that on x86; instead, it uses conditional jumps. This means that the results of the floating-point "equality comparison" operation are always reported as non-constant-time, since there is at least a timing difference between the (x==y) and (x!=y) cases. These reports are ignored in the analysis below.

Below, when we report significant timing differences in equality comparisons, we implicitly mean differences beyond the above-discussed issue. For example, if we find specific timing differences when one of the operands is NaN, that can't be explained by the above issue, and is thus an inherent non-constant-time characteristic of the floating-point equality comparison instruction that was used.

Running the code

You can either build the source code or use prebuilt binaries. Note that only Linux is supported at the moment.

The binaries should run on any Linux-based system: GNU/Linux, Android, Firefox OS. They are statically linked, so they have no dependencies, not even on the standard library.

On x86-32 machines, note that it is only worth trying out SSE2-capable CPUs. Most of the x86-32 binaries have 'sse2' in their filename, indicating that they require support for SSE2.

On x86-64 machines, it is very much worth running both the x86-64 and the x86-32 binaries.

On ARM machines, note that the binaries require ARMv7.

Recorded results

Test results so far are recorded there.

Analysis

Here are the conclusions that we can draw from our experimental results so far.

The boring parts

Let's start by listing a few things that "everybody knows" for which we got an experimental confirmation.

x87 instructions are highly non-constant-time

x86 code using x87 floating-point instructions exhibits highly non-constant-time behavior, especially but not only with denormals and NaNs, and there doesn't appear to be anything that can be done about it.

Here is a sample run.

ARM software floating-point emulation is highly non-constant-time

ARM code relying on software floating-point emulation should not expect any kind of constant-time characteristics, even on regular normal finite values.

Here is a sample run.

Math functions are highly non-constant-time

Note that for math functions we simply call the std:: math functions and rely on the compiler to do the right thing. There is often no hardware instruction for a given math function, so the compiler or standard library has to implement it in software using basic arithmetic operations; that is at least the case with SSE.

Our results show that on all architectures and instruction sets we tried, math functions are highly non-constant-time.

Divisions are highly non-constant-time, even with integer types

As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.

Integer addition, subtraction and multiplication are almost universally constant-time

But again, note that integer division isn't, and neither are any floating-point operations.

See below for a strange exception, where int64 subtraction is not constant-time on some 32-bit CPUs.

The hopefully interesting parts

Let's now get to the findings we made that contradict what we thought we knew.

Some Intel CPUs have slow multiplications with NaN values, even with SSE2 instructions

Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality.

In particular, this is not true for a wide range of Intel CPUs based on the NetBurst microarchitecture, which was popular until the mid-2000s. At least Intel family 15, models 2, 3 and 4 are affected:


A significant mass of SSE2-capable CPUs do not have DAZ

Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.

This turns out not to be the case even for Intel CPUs: this Pentium M supports SSE2 but evidently doesn't support DAZ. As a result, timings are all over the place, even on just additions, as soon as denormals are involved.

It is also reported that some early Pentium 4 CPUs were similarly affected. Conversely, not all Pentium Ms are affected (here is an unaffected one).
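
As an aside, the documented way to check for DAZ support is to execute FXSAVE and inspect the MXCSR_MASK field at byte offset 28 of the save area; if that field is zero, the default mask 0xFFBF applies, whose DAZ bit is clear. Here is a minimal sketch (our illustration using GCC inline assembly, not part of the benchmark):

 #include <stdint.h>
 #include <string.h>
 // Returns true if the CPU supports DAZ (bit 6 of MXCSR_MASK).
 bool cpu_supports_daz()
 {
   unsigned char fxsave_area[512] __attribute__((aligned(16))) = {0};
   __asm__ volatile("fxsave %0" : "=m"(fxsave_area));
   uint32_t mxcsr_mask;
   memcpy(&mxcsr_mask, fxsave_area + 28, sizeof(mxcsr_mask));
   if (mxcsr_mask == 0)
     mxcsr_mask = 0xFFBF;  // pre-MXCSR_MASK CPUs: default mask, DAZ bit clear
   return (mxcsr_mask & (1u << 6)) != 0;
 }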

Timing differences in equality comparisons

The SSE instructions used here are ucomiss and ucomisd.

The first problem is denormals: equality comparisons are just as affected as other operations by denormals.

The next problem is that many CPUs have abnormally slow or fast equality comparisons when the operands are non-finite.

Specifically:

Some CPUs have surprising timing differences on 64-bit integer arithmetic using 32-bit integer arithmetic instructions, when DAZ is enabled

On x86-32, our int64 subtraction benchmark compiles to this assembly:

 movl	(%ebx), %eax    # load low 32 bits of the first operand
 movl	4(%ebx), %edx   # load high 32 bits of the first operand
 movl	8(%ebx), %esi   # load low 32 bits of the second operand
 movl	12(%ebx), %edi  # load high 32 bits of the second operand
 subl	%esi, %eax      # subtract the low halves
 sbbl	%edi, %edx      # subtract the high halves with the borrow
 movl	%eax, 16(%ebx)  # store low 32 bits of the result
 movl	%edx, 20(%ebx)  # store high 32 bits of the result

On one Pentium 4 CPU that we used, this does not quite run in constant time, with timing differences of up to 25% and about 3 sigma from the overall median; and the weirdest part is that this only happens when DAZ is enabled! This is of course particularly surprising as DAZ is not supposed to have any relevance to integer arithmetic.
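
For reference, here is a hypothetical C++ source that would plausibly compile to the assembly above (the struct layout and names are our guesses based on the memory offsets, not the actual benchmark source):

 #include <stdint.h>
 // Hypothetical layout matching the offsets above, relative to %ebx.
 struct int64_sub_data {
   int64_t x;       // (%ebx), 4(%ebx)
   int64_t y;       // 8(%ebx), 12(%ebx)
   int64_t result;  // 16(%ebx), 20(%ebx)
 };
 void int64_subtraction(int64_sub_data* data)
 {
   data->result = data->x - data->y;  // compiles to the subl/sbbl pair on x86-32
 }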

Here are the results with DAZ, showing slower-than-normal operation when the right hand side of the subtraction is INT64_MIN, with the slowest case being INT64_MIN - INT64_MIN:


On the other hand, here are the results without DAZ, showing no such timing difference with INT64_MIN operands:


It is worth noting that this only happens on that particular Pentium 4 CPU (family 15, model 3) and not on other NetBurst-based CPUs that we tested.

Float to int64 conversion is non-constant-time on some CPUs

FILL ME

std::max<int64> is non-constant-time on some CPUs

FILL ME