fixed point was a mistake
-
@eniko I am swamped with work right now, otherwise I would ask for permission to infodump :D
TLDR is there are thousands (or at least several hundred) counters inside the CPU that tick up for all kinds of performance-relevant (and otherwise interesting) events, cache misses for example, and each OS gives you a way to read them out; plus the kernel can make sure they're properly accounted per process or thread and whatnot. some of these counters are preposterously specific, like uhhhhhh "Number of cycles dispatch is stalled for integer scheduler queue 3 tokens". But there's usually a little selection of commonly useful ones available with some simple extra command.
It's absolutely fascinating what you can, in theory, do if you want to dig really really really far down, but to be honest, I usually get very few actionable insights from anything more intricate than the most basic ones ;(
absolutely a skill issue from my end I'm convinced!
@eniko says "i won't infodump without permission, but here's the short version", writes two hundred words anyway. the struggle to contain oneself can sometimes really be immense, huh 🫠
-
so what's *probably* happening is that the worker threads have a way higher chance of hitting L1 cache for multiple triangles in a row since triangles fully outside their quadrant are instantly rejected
my L1 cache is 32k, and 160x100x4 bytes is 64k, so the whole screen still wouldn't fit in L1 cache. but at 80x50x4 it's 16k, which does fit entirely in L1 cache
this being some cache effect also explains why only this one threaded metric swings around by +/- 15% when no other metric does that
@eniko You can also try changing the way the data is organized: instead of structs, use large per-field arrays and arrays of indices.
-
@eniko @oblomov @lina I was mostly trying to find a potential explanation for the 5x gain instead of 4x.
There might be multiple factors playing together to make things even more confusing...
About the floating vs fixed point difference, it could be that the floating-point version was FPU-bound, while for fixed point the bottleneck was the memory cache? Though honestly I'm not experienced enough with such low-level considerations.
@youen @eniko @lina as I mentioned in the other reply, it is of course possible, but I'd find it extremely surprising. I'm assuming the data size and access patterns are similar, and the FPU should have more registers and should be able to retire more instructions per cycle than the ALU; plus, with fixed point there's generally a higher op count per equivalent abstract operation (consider a multiplication in fixed point vs in floating point).
-
How well would it work to do it in two passes at 160x50x4?
@Professor_Stevens that's a lot harder to do
-
@timotimo haha, it's okay, i get it
tbh i'm just relieved it's not some weird bug
-