Speedup movegen with VBMI2
Passed STC
LLR: 2.96 (-2.94,2.94) <0.00,2.00>
Total: 166720 W: 43191 L: 42701 D: 80828
Ptnml(0-2): 348, 18567, 45069, 18999, 377
https://tests.stockfishchess.org/tests/view/686ae98dfe0f2fe354c0c867
Refactor movegen to emit to a vector with 16-bit elements, which enables a
speedup with AVX512-VBMI2 when writing moves to the move list.
Very crude timing measurements of perft via timing ./stockfish "go perft 7"
demonstrates approximately 17% perft speedup:
Summary
./Stockfish-dev/src/stockfish 'go perft 7' ran
1.17 ± 0.04 times faster than ./Stockfish-base/src/stockfish 'go perft 7'
Estimated overall nps increase of 0.4% via speedtest: 33605229 -> 33749825
(many thanks JonathanHallstrom).
The corresponding arch is avx512icl as it is a good baseline for consumer
avx-512 feature set support; Intel Ice Lake was the first consumer AVX-512 CPU
and it is a decent subset of what AMD Zen 4 supports.
closes https://github.com/official-stockfish/Stockfish/pull/6153
No functional change