fixed point was a mistake
-
@eniko@mastodon.gamedev.place @lina@vt.social @oblomov@sociale.network ..probably cache performance getting better due to more per-thread allocation of cache space??
-
To be clear, after switching from floating point to fixed point, everything improved 0-35% over the previous floating-point implementation
So the current threaded random triangles with flat color metric (fixed point) is 2x as fast as the previous threaded random triangles with flat color metric (floating point)
ah, it might actually be cache after all. if i drop resolution to 80x50 then scalar->threaded is only 4.4x
-
@oblomov @eniko @lina no idea if this applies here, but I remember reading that multi-threading can yield more speedup than the number of threads in some cases, because some CPU architectures have a memory cache per core, so more threads means more cache memory, which can overcome a bottleneck. As a rule of thumb, I believe multi-thread performance is hard to measure and hard to understand 😉
-
so what's *probably* happening is that the worker threads have a way higher chance of hitting L1 cache for multiple triangles in a row since triangles fully outside their quadrant are instantly rejected
my L1 cache is 32k, and 160x100x4 bytes is 64k so the whole screen still wouldn't fit in L1 cache. but at 80x50x4 it's 16k, which does fit entirely in L1 cache
this being some cache effect also explains why only this one threaded metric swings around by +/- 15% when no other metric does that
-
@eniko ooooooooh. Potential gotcha that you may be hitting: if you’re writing to the same pixel array from each thread, make sure each subtile is aligned to 64 bytes. Otherwise you will run into false sharing issues.
-
@eniko @oblomov @lina I was mostly trying to find a potential explanation for the 5x gain instead of 4x.
There might be multiple factors playing together to make things even more confusing...
About the floating vs fixed point difference: it could be that the floating-point version was FPU-bound, while for fixed point the bottleneck was the memory cache? Though honestly I'm not experienced enough with such low-level considerations.
-
How well would it work to do it in two passes at 160x50x4?
-
@timotimo I'm incredibly new at running benchmarks at this level so I don't really know what that is
@eniko I am swamped with work right now, otherwise I would ask for permission to infodump :D
TLDR is there are thousands (or at least several hundred) of counters inside the CPU that tick up for all kinds of performance- and otherwise relevant events (cache misses, for example), and each OS gives you a way to read them out; plus the kernel can make sure they're properly accounted per process or thread and whatnot. Some of these counters are preposterously specific, like uhhhhhh "Number of cycles dispatch is stalled for integer scheduler queue 3 tokens". But there's usually a little selection of commonly useful ones available with some simple extra command.
It's absolutely fascinating what you can, in theory, do if you want to dig really really really far down, but to be honest, I usually get very few genuinely actionable insights from anything more intricate than the most basic counters ;(
absolutely a skill issue on my end, I'm convinced!