The dsp::SIMDRegister class implementation asserts that load and store operations are aligned. Besides this constraint not being mentioned in the class documentation, there are cases where I explicitly want to perform an unaligned load even if it performs worse, e.g. when the load/store operation makes up only a small part of the vectorised operation and most of the work is done on the loaded SIMD registers.
On Intel architectures there are instructions for both aligned and unaligned loads and stores. On ARM, as far as my research revealed, all load instructions accept arbitrarily aligned memory (although aligned memory still loads faster).
So would you consider adding explicitly unaligned load and store functions to that class and the underlying platform-specific implementations?
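
To illustrate roughly what I have in mind, here is a minimal standalone sketch. It is not based on the actual SIMDRegister internals: the struct name Float4Sketch, the method names fromRawArrayUnaligned / copyToRawArrayUnaligned and the SKETCH_* macros are all made up for this example. It assumes SSE on Intel and NEON on ARM, and falls back to a plain memcpy elsewhere.

```cpp
#include <cstring>

// Pick a backend. The SKETCH_* macros are hypothetical and exist only in this example.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
 #include <xmmintrin.h>
 #define SKETCH_USE_SSE 1
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
 #include <arm_neon.h>
 #define SKETCH_USE_NEON 1
#endif

// Hypothetical stand-in for a four-float SIMD register wrapper.
struct Float4Sketch
{
   #if defined(SKETCH_USE_SSE)
    __m128 value;
   #elif defined(SKETCH_USE_NEON)
    float32x4_t value;
   #else
    float value[4];
   #endif

    // Loads four floats from src, which may have any alignment.
    static Float4Sketch fromRawArrayUnaligned (const float* src) noexcept
    {
        Float4Sketch r;
       #if defined(SKETCH_USE_SSE)
        r.value = _mm_loadu_ps (src);   // unaligned counterpart of _mm_load_ps
       #elif defined(SKETCH_USE_NEON)
        r.value = vld1q_f32 (src);      // the same intrinsic serves aligned and unaligned loads
       #else
        std::memcpy (r.value, src, sizeof (r.value));
       #endif
        return r;
    }

    // Stores the four floats to dst, which may have any alignment.
    void copyToRawArrayUnaligned (float* dst) const noexcept
    {
       #if defined(SKETCH_USE_SSE)
        _mm_storeu_ps (dst, value);     // unaligned counterpart of _mm_store_ps
       #elif defined(SKETCH_USE_NEON)
        vst1q_f32 (dst, value);
       #else
        std::memcpy (dst, value, sizeof (value));
       #endif
    }
};
```

With something like this, the caller could choose the unaligned variants explicitly where needed, while the existing aligned functions and their assertions stay as they are.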