fixed point was a mistake
-
@eniko @oblomov Did the conversion only affect the pure computation part, or did some buffer formats/sizes change too?
I'm thinking if you did something like f32 to u8 per channel for the framebuffer, the tiles now fit in cache and that can be a huge speedup.
Verify by testing just one of the four quadrant threads, or alternatively just the non-threaded version at a lower resolution.
-
@slyecho as in threaded flat color random triangles with fixed point is 2x as fast as threaded flat color random triangles with floating point
@eniko Yeah that makes sense, ints are faster.
-
@ataylor
random flat color triangles x9.3
random gouraud triangles x4.7
fullscreen flat color triangles x2.2
fullscreen gouraud triangles x2.8
@eniko that seems… weird. I would expect close to 4x for all of these, tbh.
-
@eniko@mastodon.gamedev.place @lina@vt.social @oblomov@sociale.network ..probably cache performance getting better due to more thread-wise allocation of space??
-
To be clear, everything improved 0-35% from the previous implementation that used floating point after switching to fixed point
So the current threaded random triangles with flat color metric (fixed point) is 2x as fast as the previous threaded random triangles with flat color metric (floating point)
ah, it might actually be cache after all. if i drop resolution to 80x50 then scalar->threaded is only 4.4x
-
@oblomov @eniko @lina no idea if this applies here, but I remember reading something about multi-threading yielding more performance improvement than the number of threads in some cases, because some CPU architectures have memory cache per-core, so more threads means more cache memory which can overcome a bottleneck. As a rule of thumb, I believe multi-thread performance is hard to measure and hard to understand 😉
-
so what's *probably* happening is that the worker threads have a way higher chance of hitting L1 cache for multiple triangles in a row since triangles fully outside their quadrant are instantly rejected
my L1 cache is 32k, and 160x100x4 bytes is 64k so the whole screen still wouldn't fit in L1 cache. but at 80x50x4 it's 16k, which does all fit in L1 cache
this being some cache effect also explains why only this one threaded metric swings around by +/- 15% when no other metric does that
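The sizes in this post can be checked with a quick back-of-the-envelope sketch. It uses only the numbers quoted above (32k L1, 4 bytes per pixel) and assumes, hypothetically, that the screen is split 2x2 with each worker thread running on its own core with its own L1. The function names are made up for illustration, and real L1 also has to hold triangle data and stack, so treat this as a rough bound rather than a measurement.

```python
L1_BYTES = 32 * 1024     # L1 size quoted in the thread
BYTES_PER_PIXEL = 4      # from the 160x100x4 figure in the thread


def framebuffer_bytes(width, height, bpp=BYTES_PER_PIXEL):
    """Size of the whole framebuffer in bytes."""
    return width * height * bpp


def quadrant_bytes(width, height, bpp=BYTES_PER_PIXEL):
    """Size of one quadrant, assuming a 2x2 split (one quadrant per thread)."""
    return (width // 2) * (height // 2) * bpp


for w, h in [(160, 100), (80, 50)]:
    full = framebuffer_bytes(w, h)
    quad = quadrant_bytes(w, h)
    print(f"{w}x{h}: full={full} B (fits L1: {full <= L1_BYTES}), "
          f"quadrant={quad} B (fits L1: {quad <= L1_BYTES})")
```

At 160x100 the full buffer (64000 bytes) does not fit in 32k but a single quadrant (16000 bytes) does, which is consistent with the threaded version seeing more than a 4x speedup there. At 80x50 the full buffer already fits, so the scalar version loses less to cache misses.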
-
@eniko ooooooooh. Potential gotcha that you may be hitting: if you’re writing to the same pixel array from each thread, make sure each subtile is aligned to 64 bytes. Otherwise you will run into false sharing issues.
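One way to check for this kind of false sharing is to count the 64-byte cache lines that contain pixels from more than one quadrant. The sketch below is only an illustration under assumptions not stated in the thread: a row-major framebuffer with no row padding, 4 bytes per pixel, a 2x2 quadrant split, and a buffer whose start address is `base` bytes past a 64-byte boundary. `straddled_lines` is a hypothetical helper name.

```python
CACHE_LINE = 64  # bytes, the alignment mentioned in the post above


def straddled_lines(width, height, bpp=4, base=0):
    """Count cache lines holding pixels from more than one quadrant.

    Assumes a row-major buffer without padding, split 2x2 into quadrants.
    `base` is the buffer's byte offset from a cache-line boundary.
    """
    stride = width * bpp
    owners = {}  # cache line index -> set of quadrant ids touching it
    for y in range(height):
        for x in range(width):
            quad = (y >= height // 2) * 2 + (x >= width // 2)
            start = base + y * stride + x * bpp
            first = start // CACHE_LINE
            last = (start + bpp - 1) // CACHE_LINE
            for line in {first, last}:
                owners.setdefault(line, set()).add(quad)
    return sum(1 for quads in owners.values() if len(quads) > 1)


print(straddled_lines(160, 100))  # 0: the split at x=80 lands on byte 320 = 5 * 64
print(straddled_lines(80, 50))    # 50: the split at x=40 lands on byte 160, mid-line
```

Under these assumptions the 160x100 split is already cache-line aligned as long as the buffer itself starts on a 64-byte boundary, while at 80x50 every row has one line shared between the left and right quadrants. Padding each quadrant's rows up to a multiple of 64 bytes, or giving each thread its own buffer, would avoid that.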
-