@eniko @youen @lina would be interesting if you could convert the new fixed code into floating and it'd turn out to be faster than both the old floating code and the new fixed code
@youen @eniko @lina if the cache is so important, it means that the algorithm is memory bound, not ops bound, and this should be more visible if the FPU is involved, not less, unless something very funky is going on in the floating point code
@oblomov @youen @lina dunno. the float->fixed code isn't a trivial conversion so maybe i changed something somewhere in a way that was a bottleneck before but now is not
@youen @eniko @lina as I mentioned in the other reply, it is of course possible, but I'd find it extremely surprising. I'm assuming the data size and access patterns are similar, and the FPU should have more registers and should be able to retire more instructions per cycle than the ALU. Plus, with fixed point there's generally a higher op count for an equivalent abstract operation on the numbers (consider a multiplication in fixed point vs in floating point).
@eniko says "i won't infodump without permission, but here's the short version", writes two hundred words anyway. the struggle to contain oneself can sometimes really be immense, huh 🫠