Implement AffineTransformSparseInput for armv8
Implement the AffineTransformSparseInput layer of the NNUE evaluation
for the armv8 and armv8-dotprod architectures. Speed-ups measured over
10 runs of our benchmark:
armv8, Cortex-X1                  : 18.5% speed-up
armv8, Cortex-A76                 : 13.2% speed-up
armv8-dotprod, Cortex-X1          : 27.1% speed-up
armv8-dotprod, Cortex-A76         : 12.1% speed-up
armv8, Cortex-A72, Raspberry Pi 4 :  8.2% speed-up (thanks Torom!)
closes https://github.com/official-stockfish/Stockfish/pull/4719
No functional change