fixed point was a mistake
-
@eniko@mastodon.gamedev.place @lina@vt.social @oblomov@sociale.network ..probably cache performance getting better due to more per-thread allocation of cache space??
-
To be clear, after switching from floating point to fixed point, everything improved 0-35% over the previous floating-point implementation
So the current threaded random triangles with flat color metric (fixed point) is 2x as fast as the previous threaded random triangles with flat color metric (floating point)
ah, it might actually be cache after all. if i drop resolution to 80x50 then scalar->threaded is only 4.4x
-
@oblomov @eniko @lina no idea if this applies here, but I remember reading that multi-threading can yield more speedup than the number of threads in some cases, because some CPU architectures have a memory cache per core, so more threads means more cache memory, which can overcome a bottleneck. As a rule of thumb, I believe multi-thread performance is hard to measure and hard to understand 😉
-
so what's *probably* happening is that the worker threads have a way higher chance of hitting L1 cache for multiple triangles in a row since triangles fully outside their quadrant are instantly rejected
my L1 cache is 32k, and 160x100x4 bytes is 64k so the whole screen still wouldn't fit in L1 cache. but at 80x50x4 it's 16k, which does fit entirely in L1 cache
this being some cache effect also explains why only this one threaded metric swings around by +/- 15% when no other metric does that
-
@eniko ooooooooh. Potential gotcha that you may be hitting: if you’re writing to the same pixel array from each thread, make sure each subtile is aligned to 64 bytes. Otherwise you will run into false sharing issues.
-
@eniko @oblomov @lina I was mostly trying to find a potential explanation for the 5x gain instead of 4x.
There might be multiple factors playing together to make things even more confusing...
About the floating vs fixed point difference: it could be that the floating-point version was FPU-bound, while for fixed point the bottleneck was the memory cache? Though honestly I'm not experienced enough with such low-level considerations.
-
How well would it work to do it in two passes at 160x50x4?
-
@timotimo I'm incredibly new at running benchmarks at this level so I don't really know what that is
@eniko I am swamped with work right now, otherwise I would ask for permission to infodump :D
TLDR is there are thousands (or at least several hundred) of counters inside the CPU that tick up for all kinds of performance- and otherwise relevant events (cache misses, for example), and each OS gives you a way to read them out; plus the kernel can make sure they're properly accounted per process or thread and whatnot. Some of these counters are preposterously specific, like uhhhhhh "Number of cycles dispatch is stalled for integer scheduler queue 3 tokens". But there's usually a little selection of commonly useful ones available with some simple extra command.
It's absolutely fascinating what you can, in theory, do if you want to dig really really really far down, but to be honest, I usually get very few genuinely actionable insights from anything more intricate than the most basic counters ;(
absolutely a skill issue on my end, I'm convinced!