Shave some instructions off a hot loop in affine transform
On x86, GCC generates highly suboptimal code for the old form of this loop,
about 2x as many instructions as necessary. The extra instructions decrease
throughput, especially in an SMT setting. Clang does a better job, but this
change still yields some improvement. Note that the std::ptrdiff_t type is not
optional; using an unsigned type brings back the bad assembly. (Not sure why,
but it seems reliable on all the GCC versions I tested.)
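For illustration only, here is a minimal scalar sketch of the kind of loop
described above. It is not the actual Stockfish code: the function name
affine_sketch, the constants kInputs and kOutputs, and the column-major weight
layout are all assumptions made for the example. The one point it is meant to
show is the use of std::ptrdiff_t for the loop counters. One possible,
unverified explanation for the codegen difference is that signed overflow is
undefined behavior, so the compiler may assume the address arithmetic never
wraps, whereas unsigned arithmetic is defined to wrap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical dimensions, chosen only for this sketch.
constexpr std::ptrdiff_t kInputs  = 32;
constexpr std::ptrdiff_t kOutputs = 16;

// Simplified stand-in for an affine transform: output = biases + W * input.
// Weights are assumed to be stored one column (kOutputs values) per input.
void affine_sketch(const std::int8_t*  weights,
                   const std::int32_t* biases,
                   const std::uint8_t* input,
                   std::int32_t*       output) {
    assert(weights && biases && input && output);

    // Start every accumulator from its bias.
    for (std::ptrdiff_t o = 0; o < kOutputs; ++o)
        output[o] = biases[o];

    // Signed, pointer-sized counters: the offset i * kOutputs is computed in
    // std::ptrdiff_t, so no unsigned wrap-around has to be preserved.
    for (std::ptrdiff_t i = 0; i < kInputs; ++i) {
        const std::int8_t* column = weights + i * kOutputs;
        const std::int32_t in     = input[i];
        for (std::ptrdiff_t o = 0; o < kOutputs; ++o)
            output[o] += in * column[o];
    }
}
```

To check whether a given compiler is sensitive to the counter type, the
generated assembly for this sketch can be compared against a variant that uses
an unsigned counter; the results will depend on compiler version and flags.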
Passed STC:
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 44672 W: 11841 L: 11527 D: 21304
Ptnml(0-2): 165, 4625, 12415, 4993, 138
https://tests.stockfishchess.org/tests/view/68d8111efa806e2e8393b10e
closes https://github.com/official-stockfish/Stockfish/pull/6331
No functional change