Poly1305 revised
Poly1305 implementations are characterized by several parameters:
 radix, or base, of the input representation, i.e. how many digits represent the 130-bit value the algorithm operates on;
 vectorization factor, or how many input blocks are processed per [loop] iteration and in parallel;
 floating-point vs. integer/scalar arithmetic.
We are most interested in performing calculations in minimum time, and for this reason find all these parameters interdependent. For example, the algorithm specification favours base 2³² by structuring the key in a manner that minimizes carries. But that radix is impossible with IEEE 754 double-precision operations, because the 53-bit mantissa can't accommodate the product of a pair of corresponding digits. Nor is it practical for a vectorized implementation, because such an implementation uses not only the key itself but also powers of the key, and the carry-friendly structure just mentioned is lost once the key is multiplied by itself: one must either handle carries sequentially or make the radix smaller to preserve parallelism. Long story short, so far implementors have explored the following options:

base 2³² floating-point implementation utilizing the extended-precision operations supported by the Intel x87 ISA;

mixed base 2¹⁶/2³² floating-point implementation for double-precision arithmetic;

base 2²²-ish vectorized floating-point implementation targeting the x86_64 AVX unit;

base 2²⁶ vectorized implementation targeting the integer SIMD units in x86, x86_64 and ARM processors;

base 2²⁶ scalar implementation as a reference or fallback;
This list is obviously incomplete in the sense that there are no base 2³² or base 2⁶⁴ integer/scalar implementations. A floating-point approach is often assumed to be superior. That assumption is effectively disproved by the base 2²⁶ vectorized integer implementations, but what about processors that lack a sufficiently capable SIMD unit? This post addresses that situation.
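Before comparing the options, it helps to have the underlying arithmetic in front of us. All of the implementations above compute the same thing, which can be sketched in a few lines of Python thanks to its native big-integer support. This follows the public algorithm description (as in RFC 7539/8439); it is an illustrative sketch only, since Python integers are neither constant-time nor fast:

```python
# Reference Poly1305 sketch using Python's arbitrary-precision integers.
# Illustrative only: not constant-time, not performance-oriented.
P = (1 << 130) - 5  # the prime 2^130 - 5

def poly1305(key: bytes, msg: bytes) -> bytes:
    assert len(key) == 32
    # r is "clamped": specific bits are cleared by specification, which
    # is the carry-minimizing key structure referred to above.
    r = int.from_bytes(key[:16], "little") & 0x0ffffffc0ffffffc0ffffffc0fffffff
    s = int.from_bytes(key[16:], "little")
    acc = 0
    for i in range(0, len(msg), 16):
        block = msg[i:i + 16]
        # Append a 1 byte, so each block becomes a 129-bit (or shorter) value.
        n = int.from_bytes(block + b"\x01", "little")
        acc = ((acc + n) * r) % P
    return ((acc + s) % (1 << 128)).to_bytes(16, "little")
```

The per-block work is one 130-bit addition and one 130x124-bit multiplication modulo 2¹³⁰-5; everything discussed below is about how to represent those 130-bit quantities in machine words.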
As just mentioned, vectorized base 2²⁶ implementations turn out to be a better choice, at least in comparison to floating-point implementations, which allows us to say the following: as all contemporary processors in the x86 family have vector units, we can dismiss base 2³² floating point as hardly relevant. For the sake of completeness, though, its large-block performance varies from ~2.6 cycles per processed byte on the latest desktop/server Intel processors to ~4.6 on VIA Nano and ~7.7 on Atom Silvermont. Among pre-SSE2 processors it varies from ~9.0 on Pentium to ~4.6 on PIII and ~3.1 on AMD Athlon. These results were collected with poly1305util. Once again, the x87 code is the fastest option on pre-SSE2 processors, but those are stone-age and very uncommon now, hence the omission. Back to vectorized code, here are selected results for the OpenSSL implementations, also in cycles per processed byte out of a large block:
                 SSE2, 32-bit   AVX, 64-bit   AVX2, 64-bit
Core 2              1.80           -              -
Sandy Bridge        1.36          1.10            -
Haswell             1.18          1.11           0.65
Atom Silvermont     4.80           -              -
Bulldozer           1.31          0.97            -

                 NEON, 32-bit   NEON, 64-bit
Cortex-A8           2.36           -
Cortex-A15          1.25           -
Cortex-A53          1.61          1.47
Cortex-A57          1.29          1.14
Apple A7            0.78          0.72
The table mixes different optimizations, which might be a little confusing, but it does provide a useful overview and underlines the following point: even though 32-bit results are inherently worse because of the limited register bank capacity, they don't differ significantly from 64-bit ones on the same processor. This is because it's the same radix, and therefore the same number of computational operations.
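The reason the same radix works in both register widths can be sketched as follows. Five 26-bit limbs cover 130 bits, and any limb product fits in 52 bits, so even a five-way sum of products fits comfortably in a 64-bit lane (or in a register pair on 32-bit hardware) without intermediate carries. The helper names below are hypothetical, for illustration only:

```python
# Sketch of the base 2^26 limb representation used by the vectorized
# implementations (helper names are hypothetical).
MASK26 = (1 << 26) - 1

def to_limbs(x: int) -> list:
    """Split a 130-bit value into five 26-bit limbs, least significant first."""
    return [(x >> (26 * i)) & MASK26 for i in range(5)]

def from_limbs(limbs: list) -> int:
    """Recombine five 26-bit limbs into the integer they represent."""
    return sum(l << (26 * i) for i, l in enumerate(limbs))

x = (1 << 130) - 6              # largest value below the prime 2^130 - 5
assert from_limbs(to_limbs(x)) == x
# Any limb product is below 2^52, so a sum of five such products stays
# below 2^55 -- well within a 64-bit accumulator.
assert MASK26 * MASK26 * 5 < (1 << 55)
```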
Impressive as they are, the above results come with a "price tag": the precalculation of key powers, which "costs" several 16-byte blocks. We should also keep in mind that to process a single block with a vectorized procedure, one must perform "throwaway" processing of several blocks and discard all of the results but one. The bottom line is that the vectorized procedure is far from optimal for short inputs, for which scalar implementations are preferable, even if their large-block performance is measured to be several times worse:
                 base 2³²   base 2⁶⁴
Core 2              4.85       2.39
Sandy Bridge        3.90       1.39
Haswell             3.88       1.10
Atom Silvermont     11.0       2.83
Bulldozer           4.53       2.21
Cortex-A8           6.25        -
Cortex-A15          3.79        -
Cortex-A53          4.07       2.63
Cortex-A57          3.51       2.70
Apple A7            3.62       1.86
It was observed that a single block of up to 16 bytes is fully hashed, i.e. accounting even for initialization and finalization, about 4-5 times slower than one block out of a large buffer, which in turn costs 16 times the per-byte result from the table above. For example, on Haswell a single block is hashed in ~80 cycles, which is faster than the vectorized precomputation alone. Another factor that benefits small-block performance is that the scalar code is really compact: the x86_64 base 2⁶⁴ code is less than 500 bytes, and the ARMv8 code less than 400. Does all this mean that input length plays a role in which approach is applied? Yes. In OpenSSL the internal context is always initialized for scalar processing, but as soon as a long input is passed, long enough to justify the additional overhead, the current hash value is converted to a different radix and the necessary precalculations are performed.
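The precalculation that long inputs have to amortize can be sketched as follows. A 4-way vectorized procedure hashes four blocks per iteration using r⁴, then combines the four partial sums with r⁴, r³, r², r¹; the powers are computed once per long input. The function name and factor are hypothetical, for illustration:

```python
# Hypothetical sketch of the once-per-long-input precomputation for a
# 4-way vectorized Poly1305 (real code would also re-split each power
# into base 2^26 limbs).
P = (1 << 130) - 5

def key_powers(r: int, factor: int = 4) -> list:
    """Return [r, r^2, ..., r^factor] modulo 2^130 - 5."""
    powers = [r % P]
    for _ in range(factor - 1):
        powers.append((powers[-1] * r) % P)
    return powers
```

The identity being exploited is that the sequential Horner evaluation of four blocks n1..n4, i.e. (((n1*r + n2)*r + n3)*r + n4)*r, equals n1*r⁴ + n2*r³ + n3*r² + n4*r, so the four terms can be computed in parallel lanes and summed.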
Note that there are base 2⁶⁴ results that are no worse than vectorized SSE2 on some x86_64 CPUs, and some are even better, most notably on Atom Silvermont. This is basically why 64-bit SSE2 code is omitted from OpenSSL: it's not a better option on contemporary pre-AVX processors. One can argue that this is unfair to legacy systems, such as Westmere and its predecessors, but it's not an unreasonable compromise, because we're talking about processors that are approaching or have passed their end of life. Fortunately the performance gap is smaller on more recent processors.
As for floating point: in general, non-vectorized implementations are limited by a critical path length, which is a complex function of instruction latencies, pipeline throughput, and other obscure factors. The thing to recognize about floating point is that the critical path is likely to be dominated by floating-point operation latency, because, unlike its integer counterpart, floating-point addition has the same latency as multiplication. Given the algorithmic dependencies, an ideal processor with sufficient resources should perform a block operation in 10 latencies. For example, for a latency of 6, performance would be 10*6/16=3.75 cycles per byte. Once again, this is an asymptotic limit for an ideal processor, and in practice we are likely to measure worse performance. Either way, the key point is that I failed to identify a processor capable of supporting base 2⁶⁴ that performs worse than the floating-point estimate for the same processor. In other words, floating point makes sense only as an alternative to the base 2³² scalar option. Here are comparison results for a few PowerPC and z/Architecture processors to demonstrate the assertion:
                 base 2³²   floating-point   base 2⁶⁴
Freescale e300      14.8        9.78            -
PPC970              7.20        6.24           3.51
POWER7              3.67        3.50           1.87
POWER8               -          3.75           2.13
z10                 11.2        6.40            -
z196                7.30         -             2.20
The choice of IBM processors is rather circumstantial; the principle holds true for many other platforms.
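The asymptotic estimate used above can be reproduced mechanically. The function below is a hypothetical helper that just encodes the 10-latencies-per-16-byte-block model:

```python
# Hypothetical helper encoding the latency-bound estimate above:
# an ideal processor performs one block operation in 10 FP latencies.
def fp_estimate_cpb(latency_cycles: float,
                    latencies_per_block: int = 10,
                    block_bytes: int = 16) -> float:
    """Asymptotic cycles-per-byte floor for a latency-bound
    floating-point Poly1305 implementation."""
    return latencies_per_block * latency_cycles / block_bytes

# For a 6-cycle FP latency: 10*6/16 = 3.75 cycles per byte.
assert fp_estimate_cpb(6) == 3.75
```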
There are oddball cases that are tricky to evaluate; one example is SPARCv9. The floating-point option is distinctly superior on the original, pre-Tx UltraSPARC family, several times faster than integer base 2³². But the UltraSPARC T1 was a really poor floating-point "performer", and so are its successors. The Tx family is improving in that regard, but the VIS3 extension introduced in the T3 made a base 2⁶⁴ implementation possible, in both 32- and 64-bit builds, which turned out to be fast enough to render floating point irrelevant. As UltraSPARC I-IV is unlikely to be found in production, the floating-point implementation is omitted for SPARCv9.
In conclusion, let's note some processors of historical interest. Floating point would surely be optimal for PA-RISC and Itanium, because integer multiplications are performed by the floating-point unit on these processors, and transfers between the register banks are not exactly gratis. In addition, Itanium is capable of performing extended-precision operations, which makes a base 2³² floating-point implementation possible.
Summary and future work
Integer/scalar base 2³² and base 2⁶⁴ are implemented as a fallback and as the preferred choice for short inputs. The base 2⁶⁴ approach is shown to be the best choice for 64-bit processors that lack a SIMD unit. The vectorized base 2²⁶ implementation is optimized with all-round performance in mind, i.e. it attempts to accommodate multiple optimization strategies for different microarchitectures.
As for future work, note that Power ISA 2.07 adds instructions that make a vectorized base 2²⁶ implementation possible, and so does the z13 from the z/Architecture family. The base 2⁶⁴ approach might turn out to be suboptimal for MIPS64 processors because of a combination of two factors: carries must be handled programmatically, and it takes additional instructions to pull the high part of a multiplication result. Both mean an increased number of instructions, and in combination they might tip the balance toward a base 2³² implementation, but one expressed in 64-bit instructions.
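The "base 2³² in 64-bit instructions" idea rests on the fact that a product of two 32-bit limbs occupies at most 64 bits, so column sums can be formed with ordinary 64-bit multiplies and carries propagated in one final pass rather than after every addition. A hypothetical sketch, with Python integers standing in for machine registers:

```python
# Hypothetical sketch of schoolbook multiplication in base 2^32 with
# deferred carries, as a 64-bit build for a carry-flag-less ISA might
# express it. (A real implementation would accumulate each column in a
# high:low register pair, since four 64-bit products can exceed 2^64.)
MASK32 = (1 << 32) - 1

def mul_base32(a: list, b: list) -> list:
    """Multiply two 4-limb base 2^32 numbers; returns 8 limbs."""
    cols = [0] * 8
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            cols[i + j] += ai * bj      # each product fits in 64 bits
    out, carry = [], 0
    for c in cols:                      # single carry pass at the end
        carry += c
        out.append(carry & MASK32)
        carry >>= 32
    return out
```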
But the closest goal is AVX-512, which allows the vectorization factor to be doubled. This requires an even larger precomputed table. The intention is not to preserve the second half of the table, but instead to calculate it upon every call with a long enough input. That may appear excessive, but the most common usage pattern is to make one "big" call per TLS record anyway.