AVX-512 for smaller affine and feature transforms.
For the feature transformer, the code is analogous to the AVX2 path, since it was easy to adapt to the wider SIMD registers.
For the smaller affine transforms, which have a 32-byte stride, we keep 2 columns in one zmm register. We also unroll more aggressively, so that in the end we have to do 16 parallel horizontal additions on ymm slices, each consisting of 4 32-bit integers. The slices are embedded in 8 zmm registers.
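To illustrate the "2 columns in one zmm register" layout, here is a minimal sketch, not the code from this patch. The names `ColumnSums`, `hsum_epi32_256` and `dot_two_columns` are hypothetical. It assumes two weight columns stored back to back with a 32-byte stride, inputs in the range 0..127 (so the 16-bit intermediate sums do not saturate), and it handles a single zmm register rather than the 8-register unrolled version described above. A scalar fallback is used when AVX-512BW is not enabled at compile time.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>
#endif

// Hypothetical result type: one 32-bit sum per weight column.
struct ColumnSums {
    std::int32_t col0;
    std::int32_t col1;
};

#if defined(__AVX512F__) && defined(__AVX512BW__)
// Horizontal sum of all 32-bit lanes in one 256-bit slice.
static std::int32_t hsum_epi32_256(__m256i v) {
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E)); // swap 64-bit halves
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1)); // swap adjacent lanes
    return _mm_cvtsi128_si32(s);
}
#endif

// input:   32 unsigned bytes (assumed to be in 0..127)
// weights: 64 signed bytes, column 0 in bytes 0..31, column 1 in bytes 32..63
ColumnSums dot_two_columns(const std::uint8_t* input, const std::int8_t* weights) {
    assert(input != nullptr && weights != nullptr);
#if defined(__AVX512F__) && defined(__AVX512BW__)
    // Broadcast the 32 input bytes into both 256-bit halves of a zmm register.
    const __m256i in256 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(input));
    const __m512i in = _mm512_inserti64x4(_mm512_castsi256_si512(in256), in256, 1);
    // One 64-byte load fetches both columns: low half = column 0, high half = column 1.
    const __m512i w = _mm512_loadu_si512(weights);
    // u8 * s8 products, pairwise added to 16 bits, then widened to 32 bits.
    const __m512i prod16 = _mm512_maddubs_epi16(in, w);
    const __m512i prod32 = _mm512_madd_epi16(prod16, _mm512_set1_epi16(1));
    // Each 256-bit slice holds the partial sums of one column.
    return { hsum_epi32_256(_mm512_castsi512_si256(prod32)),
             hsum_epi32_256(_mm512_extracti64x4_epi64(prod32, 1)) };
#else
    // Portable reference with the same result for in-range inputs.
    ColumnSums r{0, 0};
    for (int i = 0; i < 32; ++i) {
        r.col0 += static_cast<std::int32_t>(input[i]) * weights[i];
        r.col1 += static_cast<std::int32_t>(input[i]) * weights[32 + i];
    }
    return r;
#endif
}
```

In the sketch each column is reduced separately. The point of the more aggressive unrolling in the patch is that the horizontal additions for many slices can be done together rather than one slice at a time.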
These changes provide a speedup of about 1.5% for AVX-512 builds.
Closes https://github.com/official-stockfish/Stockfish/pull/3218
No functional change.