- Tags: assembly, optimization, x86, average, micro-optimization
- Updated Sun, 04 Sep 2022 12:20:25 GMT

Suppose we have two register-length² signed¹ integers, say `a` and `b`. We want to compute the value `(a + b) / 2`, either rounded up, down, towards zero, or away from zero, whichever way is easier (i.e. we do not care about the rounding direction).

The result is another register-length signed integer (it is clear that the average must be within the range of a register-length signed integer).

What is the fastest way to perform this computation?

You may choose which registers the two integers will initially be in, and which register the average ends up being in.
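For reference, here is what the exact computation looks like in C if we allow ourselves a wide intermediate (a sketch only; the function name `avg_ref` and the use of GCC/Clang's `__int128` are my own, not part of the question):

```c
#include <stdint.h>

/* Reference semantics: widen to 128 bits so the addition is exact,
   then halve.  The ">> 1" on a negative value rounds toward minus
   infinity, which the question explicitly allows. */
int64_t avg_ref(int64_t a, int64_t b) {
    return (int64_t)(((__int128)a + b) >> 1);
}
```

The challenge, of course, is to get this result using only 64-bit registers and as few instructions as possible.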

Footnote 1: For unsigned integers, we can do it in two instructions. This is perhaps the fastest way, although rotate-through-carry is more than 1 uop on Intel CPUs (but only a couple when the rotate count is 1). An answer on a Q&A about the unsigned mean discusses the efficiency.

```
add rdi, rsi   # CF = unsigned carry-out of a + b
rcr rdi, 1     # rotate CF back in on top: 65-bit sum >> 1
```

The two numbers start in `rdi` and `rsi`, and the average ends up in `rdi`. But for signed numbers, `-1 + 3` would set CF and rotate a `1` into the sign bit, not giving the correct answer of `+1`.
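To make that failure concrete, here is a small C model of the `add`/`rcr` pair (the name `add_rcr_model` is invented, and the carry computation merely mimics what the flags do in hardware):

```c
#include <stdint.h>
#include <stdio.h>

/* Model of: add rdi, rsi ; rcr rdi, 1
   CF is the *unsigned* carry-out of the addition; rcr rotates it
   back in as the new top bit of the shifted sum. */
uint64_t add_rcr_model(uint64_t a, uint64_t b) {
    uint64_t sum = a + b;            /* wrapping add */
    uint64_t cf  = sum < a;          /* unsigned carry-out */
    return (cf << 63) | (sum >> 1);  /* rotate CF into the top */
}

int main(void) {
    /* Correct for unsigned inputs, even near the top of the range: */
    printf("%llu\n", (unsigned long long)
           add_rcr_model(UINT64_MAX, UINT64_MAX - 2));   /* UINT64_MAX - 1 */
    /* But for signed -1 + 3 the carry-out is 1, so a 1 lands in the
       sign bit and the result is hugely negative instead of +1: */
    printf("%lld\n", (long long)
           (int64_t)add_rcr_model((uint64_t)-1, 3));
    return 0;
}
```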

Footnote 2: I specified *register-length* signed integers so that we can't simply sign-extend the integers with a `movsxd` or `cdqe` instruction.
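To see why that restriction matters: with (say) 32-bit inputs the problem is trivial, because sign-extending to 64 bits (what `movsxd`/`cdqe` would do) makes the addition exact. A minimal sketch, assuming the usual arithmetic-shift behaviour of `>>` on negative values:

```c
#include <stdint.h>

/* 32-bit inputs: sign-extend to 64 bits, add exactly, then one
   arithmetic shift finishes the average. */
int64_t avg32(int32_t a, int32_t b) {
    return ((int64_t)a + b) >> 1;
}
```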

The closest I've got towards a solution uses four instructions, one of them an `rcr` that's 3 uops on Intel, 1 on AMD Zen (https://uops.info/):

```
add rdi, rsi
setge al
sub al, 1 # CF = !(ge) = !(SF==OF)
rcr rdi, 1 # shift CF into the top of (a+b)>>1
```

I think a shorter solution probably lies in combining the middle two instructions in some way, i.e. performing `CF = SF ⊕ OF` directly.
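As a sanity check on that reasoning, here is a rough C model of the four-instruction sequence compared against a wide-arithmetic reference (all names are invented for the sketch, and `__int128` assumes GCC/Clang):

```c
#include <assert.h>
#include <stdint.h>

/* Model of: add rdi, rsi ; setge al ; sub al, 1 ; rcr rdi, 1
   After the add, SF ^ OF is the sign of the true (unwrapped) sum,
   i.e. the 65th bit lost to wraparound; rcr shifts it back in on
   top of the logically shifted low 64 bits. */
static int64_t avg_rcr_model(int64_t a, int64_t b) {
    uint64_t sum = (uint64_t)a + (uint64_t)b;    /* wrapping add */
    int sf = (int64_t)sum < 0;                   /* sign flag */
    int of = ((a < 0) == (b < 0)) &&             /* signed overflow flag */
             (((int64_t)sum < 0) != (a < 0));
    uint64_t cf = (uint64_t)(sf ^ of);           /* sign of the true sum */
    return (int64_t)((cf << 63) | (sum >> 1));
}

int main(void) {
    int64_t t[] = { 0, 1, -1, 3, 7, -7, INT64_MAX, INT64_MIN };
    for (unsigned i = 0; i < sizeof t / sizeof t[0]; i++)
        for (unsigned j = 0; j < sizeof t / sizeof t[0]; j++)
            /* reference: exact 65-bit sum, arithmetic shift right */
            assert(avg_rcr_model(t[i], t[j]) ==
                   (int64_t)(((__int128)t[i] + t[j]) >> 1));
    return 0;
}
```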

I've seen this question, but that's not x86-specific and none of the answers seem to compile to something as good as my solution.

Depending on how we interpret your lax rounding requirements, the following may be acceptable:

```
sar rdi, 1   # a >> 1 (arithmetic), CF = old low bit of a
sar rsi, 1   # b >> 1 (arithmetic), CF = old low bit of b
adc rdi, rsi # (a >> 1) + (b >> 1) + (b & 1)
```

This effectively divides both inputs by 2, adds the results, and adds 1 more if `rsi` was odd. (Remember that `sar` sets the carry flag according to the last bit shifted out.)

Since `sar` rounds to minus infinity, the result of this algorithm is:

- exactly correct if `rdi` and `rsi` are both even or both odd
- rounded down (toward minus infinity) if `rdi` is odd and `rsi` is even
- rounded up (toward plus infinity) if `rdi` is even and `rsi` is odd

As a bonus, for random inputs, the average rounding error is zero.
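In C terms the three instructions compute something like the following (a model only; it assumes `>>` on a negative `int64_t` is an arithmetic shift, which is what the mainstream x86-64 compilers do):

```c
#include <stdint.h>

/* Model of: sar rdi, 1 ; sar rsi, 1 ; adc rdi, rsi
   The adc consumes the carry left by the second sar, i.e. the low
   bit of b, which is the "add 1 more if rsi was odd" step. */
int64_t avg_sar_adc(int64_t a, int64_t b) {
    return (a >> 1) + (b >> 1) + (b & 1);
}
```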

It should be 3 uops on a typical CPU, with a latency of 2 cycles, since the two `sar` instructions are independent.

- +3 – "up" here is always towards +Inf, not symmetric around 0. But yeah, for `avg(-3,-3)`, we get `-2 + -2 + CF(1)`, so we get the correct `-3`. For `avg(3,3)`, we get `1 + 1 + CF(1)`, which is also 3. So the always-upward of adding CF counteracts the towards -Inf rounding of arithmetic right shift. — Jul 25, 2022 at 20:06
- +0 – @PeterCordes Would something like this be useful? `lea eax, [edi+esi]` / `shr eax, 1` — Jul 26, 2022 at 12:50
- +0 – @vengy: For unsigned inputs that are known to not overflow, yes, i.e. that are already zero-extended from 31 bits or narrower. (Except you'd never use 32-bit address-size with LEA; you'd use `lea eax, [rdi+rsi]` to avoid an address-size prefix.) But this Q&A is about full-range signed inputs that may be negative, so even `sar eax, 1` wouldn't be sufficient. — Jul 26, 2022 at 12:55
- +0 – Seems like this is the only faster solution so far, but with the drawback that the operation is not commutative. — Aug 02, 2022 at 17:04
