Fix overflow in add_dpbusd_epi32x2
This patch fixes 16bit overflow in *_add_dpbusd_epi32x2 functions,
that can be triggered in rare cases depending on the NNUE weights.
While the code leads to some slowdown on affected architectures
(most notably avx2), the fix is simpler than some of the other
options discussed in
https://github.com/official-stockfish/Stockfish/pull/4394
Code suggested by Sopel97.
Result of "bench 4096 1 30 default depth nnue":
| Architecture | master | patch (gcc) | patch (clang) |
|---------------------|-----------|-------------|---------------|
| x86-64-vnni512 | 762122798 | 762122798 | 762122798 |
| x86-64-avx512 | 769723503 | 762122798 | 762122798 |
| x86-64-bmi2 | 769723503 | 762122798 | 762122798 |
| x86-64-ssse3 | 769723503 | 762122798 | 762122798 |
| x86-64 | 762122798 | 762122798 | 762122798 |
Following architectures will experience ~4% slowdown due to an
additional instruction in the middle of hot path:
* x86-64-avx512
* x86-64-bmi2
* x86-64-avx2
* x86-64-sse41-popcnt (x86-64-modern)
* x86-64-ssse3
* x86-32-sse41-popcnt
This patch clearly loses Elo against master with both STC and LTC.
Failed non-regression STC (256bit fix only):
LLR: -2.95 (-2.94,2.94) <-1.75,0.25>
Total: 33528 W: 8769 L: 9049 D: 15710
Ptnml(0-2): 96, 3616, 9600, 3376, 76
https://tests.stockfishchess.org/tests/view/63e6a5b44299542b1e26a485
60+0.6 @ 30000 games:
Elo: -1.67 +-1.7 (95%) LOS: 2.8%
Total: 30000 W: 7848 L: 7992 D: 14160
Ptnml(0-2): 12, 2847, 9436, 2683, 22
nElo: -3.84 +-3.9 (95%) PairsRatio: 0.95
https://tests.stockfishchess.org/tests/view/63e7ac716d0e1db55f35a660
However, a test against nn-a3dc078bafc7.nnue, which is the latest "safe"
network not causing the bug, passed with regular bounds.
Passed STC:
LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 160456 W: 42658 L: 42175 D: 75623
Ptnml(0-2): 487, 17638, 43469, 18173, 461
https://tests.stockfishchess.org/tests/view/63e89836d62a5d02b0fa82c8
closes https://github.com/official-stockfish/Stockfish/pull/4391
closes https://github.com/official-stockfish/Stockfish/pull/4394
No functional change