

On modern CPUs it doesn’t matter that much, and any hand optimization would have to be redone for each CPU type (Zen 4, Alder Lake, etc.). Modern CPUs have insane out-of-order execution that makes compiler-generated code nearly as fast as the most heavily optimized handwritten code. On older CPUs you’d see more of a performance bump.









99% of the time, memory is the bottleneck. That’s why DMA is such a huge feature: it moves data without tying up the CPU.
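
To put a rough number on the memory point, here is a back-of-envelope sketch in Python. The hardware figures (100 GFLOP/s peak, 50 GB/s memory bandwidth) and the helper name `attainable_gflops` are made up for illustration and are not measurements of any real CPU. The idea is just that throughput is capped by whichever runs out first, compute or memory traffic.

```python
# Back-of-envelope estimate of a compute-vs-memory cap.
# All hardware numbers below are illustrative assumptions, not measurements.

def attainable_gflops(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
    """Return the lower of the compute cap and the memory-traffic cap."""
    return min(peak_gflops, mem_bandwidth_gbs * flops_per_byte)

# Hypothetical machine: 100 GFLOP/s peak compute, 50 GB/s memory bandwidth.
PEAK_GFLOPS = 100.0
MEM_BANDWIDTH_GBS = 50.0

# Summing an array of 8-byte doubles does one add per 8 bytes loaded.
sum_intensity = 1 / 8

sum_rate = attainable_gflops(PEAK_GFLOPS, MEM_BANDWIDTH_GBS, sum_intensity)
print(sum_rate)                      # 6.25 GFLOP/s
print(sum_rate / PEAK_GFLOPS * 100)  # 6.25 percent of peak
```

Under these assumed numbers, a simple streaming loop that misses cache can use only about 6% of the core's peak arithmetic rate, so shaving instructions off the loop body would not change much.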