Comparing sqrt_ratio function implementations

Comparison of performance of sqrt_ratio implementations. The new implementation (new_method below) merges the exponentiations in the Tonelli–Shanks algorithm with those in the inversion (as described in Scott's paper). Thus 3 exponentiations (1 for each squareroot and 1 for the inversion) become 1 (computing the "projenitor" for num^3*den). When p - 1 is 2-adically small one expects this to perform much better. On the other hand when p - 1 = 2s t with t small, most of the time is spent on the loop and little is saved. The worst case of this is the Fermat primes, the best case is p = 3 (mod 4), but performance is pretty good when t is smallish.

Each instance consists of computing sqrt_ratio(&num, &den) for 1000 random pairs (num, den) (so that the average 6.75ms of new_method for 2255 - 19 equates to average 6.75μs per instance). This was on my machine, your milage may vary.

See individual benchmark pages below for more details.

The favourable cases

The unfavourable cases