Speedup Rust's heapsort by 1.5x by making it branchless

by pykelloon 2/21/2023, 8:12 PMwith 1 comments

by nequoon 2/22/2023, 12:28 AM

This is the commit in the Rust repository that achieves the speedup by reducing branch misprediction:

https://github.com/rust-lang/rust/commit/b7089e0dd3e988270f3...

Do I understand correctly why this works?

In the `if` condition on line 201, `child + i < v.len()` is easy to predict but `is_less(...)` is hard. Because of this, `child += 1` on line 202 needs to be walked back with a non-negligible probability. This wastes time.

  200   // Choose the greater child.
  201   if child + 1 < v.len() && is_less(&v[child], &v[child + 1]) {
  202       child += 1;
  203   }
  204
  205   // Stop if the invariant holds at `node`.
  206   if !is_less(&v[node], &v[child]) {
After moving the `is_less(...)` call from the `if` condition to the consequent on line 205 below, we still need to fetch both `child += ...` and `is_less(...)` but we are much less likely to need to walk these instructions back. We waste less time and we reach the `if` condition on line 209 sooner.

  200   // Choose the greater child.
  201   if child + 1 < v.len() {
  202       // We need a branch to be sure not to out-of-bounds index,
  203       // but it's highly predictable.  The comparison, however,
  204       // is better done branchless, especially for primitives.
  205       child += is_less(&v[child], &v[child + 1]) as usize;
  206   }
  207
  208   // Stop if the invariant holds at `node`.
  209   if !is_less(&v[node], &v[child]) {