Google Highway SIMD library

I'm not affiliated with Google or this code, and I haven't tried it out locally (yet), but the runtime dispatch portion seems interesting:

Efficient and safe runtime dispatch is important. Modules such as image or video codecs are typically embedded into larger applications such as browsers, so they cannot require separate binaries for each CPU. Libraries also cannot predict whether the application already uses AVX2 (and pays the frequency throttling cost), so this decision must be left to the application. Using only the lowest-common denominator instructions sacrifices too much performance. Therefore, we provide code paths for multiple instruction sets and choose the most suitable at runtime. To reduce overhead, dispatch should be hoisted to higher layers instead of checking inside every low-level function. Generating each code path from the same source reduces implementation- and debugging cost.
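To make "dispatch should be hoisted to higher layers" concrete, here is a minimal, library-free sketch of the pattern in C++. It uses GCC/Clang's `__builtin_cpu_supports` for the CPU check; Highway itself wraps this idea in its own dispatch machinery rather than exposing raw builtins, and the AVX2 function below is a hypothetical stand-in (a real build would compile it with 256-bit intrinsics in a separate translation unit):

```cpp
#include <cstddef>

// Baseline path: plain scalar sum, valid on any CPU.
float SumScalar(const float* p, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += p[i];
  return s;
}

// Hypothetical AVX2 path. In real code this would live in a translation unit
// compiled with -mavx2 and use vector intrinsics; here it just forwards to
// the scalar loop so the sketch compiles and runs anywhere.
float SumAvx2(const float* p, std::size_t n) {
  return SumScalar(p, n);
}

using SumFn = float (*)(const float*, std::size_t);

// The dispatch decision is made once, at a high level; the chosen function
// pointer is then reused, so no per-call CPU check happens in the hot loop.
SumFn ChooseSum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}
```

A caller would run `ChooseSum()` once at startup (or module init) and hand the resulting pointer to its inner loops, which is the "hoisting" the quoted paragraph argues for.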


So what techniques are used to identify the bottlenecks? Back in the day, VTune would take you there on IA platforms, and then you could optimize away and develop code paths accordingly. This was a ghastly investment of dev resources (I know, been there, have the t-shirt, as I schlepped the SIMD/VTune story for Intel for the first decade of my career). IA platforms have long since become "fast enough" that screwing around with that overhead isn't worth it, though that's probably less true for non-desktop targets these days. Hence the question: how are you determining the bottlenecks?