Use non-locking operations for nodes count and tbHits
fetch_add and friends must produce a locking instruction because they guarantee
consistency in the presence of multiple writers. Because only one thread ever
writes to nodes and tbHits, we may use relaxed load and store to emit regular
loads and stores (on both x86-64 and arm64), while avoiding undefined behavior
and miscounting nodes.
Credit must go to Arseniy Surkov of Reckless (codedeliveryservice) for
uncovering this performance gotcha.
failed yellow as a gainer on fishtest:
LLR: -2.95 (-2.94,2.94) <0.00,2.00>
Total: 219616 W: 57165 L: 57105 D: 105346
Ptnml(0-2): 732, 24277, 59720, 24357, 722
https://tests.stockfishchess.org/tests/view/69086abfea4b268f1fac2809
potential small speedup on large core system:
```
==== master ====
1 Nodes/second : 294269736
2 Nodes/second : 295217629
Average (over 2): 294743682
==== patch ====
1 Nodes/second : 299003603
2 Nodes/second : 298111320
Average (over 2): 298557461
```
closes https://github.com/official-stockfish/Stockfish/pull/6392
No functional change