- c performance x86 floating-point stencils
- Updated Sun, 07 Aug 2022 20:20:08 GMT

I am performing a stencil computation on a matrix I previously read from a file. I use two different kinds of matrices (NonZero type and Zero type). Both types share the value of the boundaries (1000 usually), whilst the rest of the elements are 0 for Zero type and 1 for NonZero type.

The code stores the matrix from the file in two allocated matrices of the same size. Then it performs an operation on every element of one matrix, using its own value and the values of its neighbours (4 additions and 1 multiplication), and stores the result in the second matrix. Once the computation is finished, the matrix pointers are swapped and the same operation is performed a fixed number of times. Here is the core code:

```
#define GET(I,J) rMat[(I)*cols + (J)]
#define PUT(I,J) wMat[(I)*cols + (J)]
for (cur_time=0; cur_time<timeSteps; cur_time++) {
    for (i=1; i<rows-1; i++) {
        for (j=1; j<cols-1; j++) {
            PUT(i,j) = 0.2f*(GET(i-1,j) + GET(i,j-1) + GET(i,j) + GET(i,j+1) + GET(i+1,j));
        }
    }
    // Change pointers for next iteration
    auxP = wMat;
    wMat = rMat;
    rMat = auxP;
}
```

The case I am describing uses a fixed 500 timeSteps (outer iterations) and a matrix of 8192 rows by 8192 columns, but the problem persists when changing the number of timeSteps or the matrix size. Note that I only measure the time of this specific part of the algorithm, so neither reading the matrix from the file nor anything else affects the measurement.

What happens is that I get different times depending on which type of matrix I use, with much worse performance for the Zero type (every other matrix performs the same as the NonZero type; I have also tried a matrix full of random values).

I am certain it is the multiplication, because if I remove it and leave only the additions, both types perform the same. Note that with the Zero matrix type, most of the time the result of the sum will be 0, so the operation will be "0.2*0".

This behaviour is certainly weird to me, as I thought the speed of floating point operations was independent of the values of the operands, which does not seem to be the case here. I have also tried to capture and show SIGFPE exceptions in case that was the problem, but I obtained no results.

In case it helps, I am using an Intel Nehalem processor and gcc 4.4.3.

The problem has already mostly been diagnosed, but I will write up exactly what happens here.

Essentially, the questioner is modeling diffusion; an initial quantity on the boundary diffuses into the entirety of a large grid. At each time step t, the value at the leading edge of the diffusion will be 0.2^t (ignoring effects at the corners).

The smallest normalized single-precision value is 2^-126; when `cur_time = 55`, the value at the frontier of the diffusion is 0.2^55, which is a bit smaller than 2^-127. From this time step forward, some of the cells in the grid will contain *denormal* values. On the questioner's Nehalem, operations on denormal data are about 100 times slower than the same operation on normalized floating point data, explaining the slowdown.

When the grid is initially filled with constant data of `1.0`, the data never gets too small, and so the denormal stall is avoided.

Note that changing the data type to `double` would delay, but not alleviate the issue. If double precision is used for the computation, denormal values (now smaller than 2^-1022) will first arise in the 441st iteration.
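Both iteration counts follow from the same inequality. As a worked check, using the same model in which the leading edge is 0.2^t, and writing e_min (a symbol introduced here) for the magnitude of the smallest normal exponent, 126 for `float` and 1022 for `double`:

```latex
\begin{align*}
0.2^{t} < 2^{-e_{\min}}
  &\iff t \log_2 5 > e_{\min}
   \iff t > \frac{e_{\min}}{\log_2 5}, \qquad \log_2 5 \approx 2.3219 \\
\text{float:}\quad  t &> \frac{126}{2.3219} \approx 54.27 \;\Rightarrow\; t = 55 \\
\text{double:}\quad t &> \frac{1022}{2.3219} \approx 440.15 \;\Rightarrow\; t = 441
\end{align*}
```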

At the cost of precision at the leading edge of the diffusion, you could fix the slowdown by enabling "Flush to Zero", which causes the processor to produce zero instead of denormal results in arithmetic operations. This is done by toggling a bit in the FPSCR or MXCSR, preferably via the functions defined in the `<fenv.h>` header in the C library.

Another (hackier, less good) "fix" would be to fill the matrix initially with very small non-zero values (`0x1.0p-126f`, the smallest normal number). This would also prevent denormals from arising in the computation.

- +0 – Should work too: `double FlushToZero(double x){return (x < 2e-126)?0:x;}` and call it: `y = FlushToZero(0.2 * ...);`. Should contain no overhead with proper inlining of the compiler. – Mar 03, 2011 at 21:37
- +0 – @Xeo: I suspect most compilers will (rightly or wrongly) compile that into a compare and branch, which will introduce overhead, but it's still a reasonable suggestion. Actually enabling the FTZ bit is the only way to do it with zero overhead. Also, the constant should be `0x1.0p-126f`, not `2e-126` (a rather different value). – Mar 03, 2011 at 21:41
- +0 – Woops, you're right with the constant, but my 5min for edits have passed. :| Fun fact: `0x1.0p` is `3,1415926535897932384626433832795`. Found that while pasting `0x1.0p-126` into Windows Calc. – Mar 03, 2011 at 21:47
- +0 – But what is the scope of this setting (in <fenv.h>)? Can it be turned off and on around specific calculations whose near-zero accuracy isn't important? And what is the processor behavior in other threads in my process, when they perform such calculations? – Dec 28, 2015 at 22:46
- +0 – @MottiShneor: The setting is per-thread. If the near-zero accuracy is unimportant, there's no reason to bother turning it off, because that's all that it affects. – Dec 28, 2015 at 22:49