Add sse4 if bmi2 is enabled
The only Makefile change needed to get a somewhat faster binary, as
discussed in #2291, is to add -msse4 to the compile options of the bmi2 build.
Since all processors that support bmi2 also support sse4, this is safe to do.
It is a useful step towards avoiding the circulation of custom, poorly tested builds.
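
As a rough illustration of the idea, a minimal Makefile sketch could look like
the following. The `bmi2` variable and the `show-flags` target are hypothetical
names chosen for this example and are not copied from the actual Makefile; only
the -msse4 and -mbmi2 compiler flags come from the change described here.

```make
# Hypothetical, simplified sketch; not the project's real Makefile.
# 'bmi2' is an assumed switch name used only for illustration.
bmi2 ?= no
CXXFLAGS ?= -O3

ifeq ($(bmi2),yes)
# Every CPU with BMI2 also has SSE4, so both can be enabled together.
CXXFLAGS += -msse4 -mbmi2
endif

# Helper target (also hypothetical) to inspect the resulting flags.
.PHONY: show-flags
show-flags: ; @echo "CXXFLAGS = $(CXXFLAGS)"
```

Saved under an arbitrary name such as sketch.mk, running
`make -f sketch.mk bmi2=yes show-flags` should list both flags, while omitting
`bmi2=yes` should leave the defaults untouched.
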
The speedup isn't enough to pass [0,4], but it is roughly 1.15 Elo with a LOS of 90%:
LLR: -2.95 (-2.94,2.94) [0.00,4.00]
Total: 93009 W: 20519 L: 20316 D: 52174
Also rewrite the help text shown to the user by `make help`, so that
the x86-64 architectures are listed with the most performant build on top.
Closes https://github.com/official-stockfish/Stockfish/pull/2300
No functional change