fixed point was a mistake
-
@eniko do you already have experience with the kind of profiler that lets you get performance counter values?
on linux my go-to first step is
perf stat -d ./myprogram
(-d for details gives a couple more numbers. gotta have numbers!) then you'll see a few numbers that may point at a drastic difference.
I'm thinking a higher instructions per cycle number probably means fewer instructions that take many cycles (though I hear integer division is much better nowadays?), or your cache hit rate for data or instruction cache may be a lot better, or maybe your code ends up with fewer total instructions for some reason?
@timotimo I'm incredibly new at running benchmarks at this level so I don't really know what that is
-
@eniko one would assume 4 times as fast with four threads, not 30% faster. But I don’t know exactly what the code is doing without seeing it
@slyecho the 30% improvement was over the same implementation with floating point
-
@slyecho as in threaded flat color random triangles with fixed point is 2x as fast as threaded flat color random triangles with floating point
-
I have completed the triangle rasterizer fixed point conversion
The benchmarks have all improved between 0 and 35%
Except for threaded random triangles with flat color. That metric has increased 106%. As in it's twice as fast as before
I have no idea why but I'm fairly sure I've ruled out bugs in my benchmarking
I am very confused
To be clear, everything improved 0-35% from the previous implementation that used floating point after switching to fixed point
So the current threaded random triangles with flat color metric (fixed point) is 2x as fast as the previous threaded random triangles with flat color metric (floating point)
-
@eniko from my experience the biggest upside of using fixed-point in rasterization is that you get exact sub-pixel precision with as many bits as you like (or need), and it doesn't depend on how far away from the origin you are. That alone would be worth it even without performance improvements
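A minimal sketch of the precision point above, assuming IEEE 754 single-precision floats and a hypothetical format with 4 fractional bits (1/16 pixel). The names `to_fixed`, `float_step`, and `edge` are made up for illustration and are not from the rasterizer being discussed. It shows that the float step size grows with distance from the origin, while the fixed-point step and the edge function result stay the same after translating the triangle.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for illustration: 4 fractional bits, i.e. 1/16 pixel steps. */
#define SUBPIXEL_BITS 4
#define SUBPIXEL_ONE  (1 << SUBPIXEL_BITS)

/* Convert a pixel coordinate to fixed point, rounding to the nearest step. */
int32_t to_fixed(double px) {
    double scaled = px * SUBPIXEL_ONE;
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Gap between x and the next float above it.
   Valid for positive finite x below FLT_MAX; assumes IEEE 754 binary32. */
float float_step(float x) {
    uint32_t bits;
    float next;
    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);
    bits += 1;                      /* next representable bit pattern */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}

/* Edge function on fixed-point coordinates. The result is exact as long as
   the coordinate differences fit in 32 bits, since the products use 64 bits. */
int64_t edge(int32_t ax, int32_t ay, int32_t bx, int32_t by,
             int32_t px, int32_t py) {
    return (int64_t)(bx - ax) * (py - ay) - (int64_t)(by - ay) * (px - ax);
}
```

With these definitions, `float_step(4096.0f)` is larger than `float_step(1.0f)`, whereas `to_fixed` always moves in steps of 1/16 pixel, and `edge` returns the same value for a triangle at the origin and for the same triangle shifted by 4096 pixels.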
-
@gabrielesvelto also helps if you wanna run it on really old CPUs >_>
-
@gabrielesvelto also to be clear getting a +100% performance boost is *good* I'm just having a hard time believing it's not a benchmarking bug. But if it is a bug I sure can't find it, and the threading code is only 300 lines so it's not like there's a lot of places it could be hiding
-
@eniko BTW are you using only scalar math or are you leveraging SIMD extensions? IIUC one of the advantages of fixed-point math is that you could implement some stuff on x86 even with the oldest, crustiest SIMD stuff (hello MMX!) and get at least some benefits
-
@midnaw not really no, they're rarely used nowadays
-
@gabrielesvelto i'm not using any SIMD atm so it's scalar only
-
@eniko it would take some staring at assembly to know for sure, but one possibility is that the lack of needing to care about floating point specials (inf, nan) lets the optimizer do a better job. Float semantics are hard for compilers to work around without fastmath (do not use fastmath.)
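A small sketch of why float semantics restrict the optimizer, assuming IEEE 754 floats and no fast-math flags. All function names here are hypothetical. Regrouping a float sum can change the result, and `x - x` is NaN rather than 0 when `x` is infinity, so a conforming compiler cannot apply rewrites that are always valid for unsigned integers.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* The same three values summed with two different groupings. */
float float_sum_left(float a, float b, float c)  { return (a + b) + c; }
float float_sum_right(float a, float b, float c) { return a + (b + c); }

/* Unsigned integer addition wraps, so any grouping gives the same result. */
uint32_t int_sum_left(uint32_t a, uint32_t b, uint32_t c)  { return (a + b) + c; }
uint32_t int_sum_right(uint32_t a, uint32_t b, uint32_t c) { return a + (b + c); }

/* x - x is 0 for finite x but NaN for infinity or NaN,
   so it cannot be folded to the constant 0. */
float float_self_diff(float x) { return x - x; }

/* NaN is the only float value that compares unequal to itself. */
int is_nan(float x) { return x != x; }

/* Builds +infinity by overflowing the largest finite float. */
float make_infinity(void) {
    volatile float big = FLT_MAX;
    float inf = big * 2.0f;
    assert(inf > FLT_MAX);          /* sanity check: we really overflowed */
    return inf;
}
```

For example, with `a = 1e20f`, `b = -1e20f`, `c = 1.0f` the left grouping yields 1 and the right grouping yields 0, because `b + c` rounds back to `-1e20f`. Whether this accounts for any of the speedup in the thread is something only the generated assembly can confirm.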
-
@ataylor but the single threaded random flat color triangles metric only improved by +15%
and all the threading does is take 4 worker threads, split the screen into 4 quadrants, and have each of them call the regular single-threaded renderer for every triangle on their quadrant
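A rough sketch of the quadrant split described above, not the actual threading code. `Rect`, `quadrant_rect`, and `clip_to_quadrant` are hypothetical names, rectangles are assumed to be half-open, and thread creation is left out: each worker would loop over every triangle and pass the clipped rectangle to the single-threaded renderer.

```c
#include <assert.h>

/* Half-open pixel rectangle: x in [x0, x1), y in [y0, y1). */
typedef struct { int x0, y0, x1, y1; } Rect;

/* Quadrant 0..3 of a width x height screen:
   bit 0 selects the right half, bit 1 selects the bottom half. */
Rect quadrant_rect(int width, int height, int index) {
    Rect r;
    int half_w = width / 2;
    int half_h = height / 2;
    assert(index >= 0 && index < 4);
    r.x0 = (index & 1) ? half_w : 0;
    r.x1 = (index & 1) ? width  : half_w;
    r.y0 = (index & 2) ? half_h : 0;
    r.y1 = (index & 2) ? height : half_h;
    return r;
}

/* Clip a triangle's bounding box to one quadrant.
   Returns 1 and writes *out if they overlap, 0 if the worker can skip it. */
int clip_to_quadrant(Rect bbox, Rect quad, Rect *out) {
    Rect r;
    r.x0 = bbox.x0 > quad.x0 ? bbox.x0 : quad.x0;
    r.y0 = bbox.y0 > quad.y0 ? bbox.y0 : quad.y0;
    r.x1 = bbox.x1 < quad.x1 ? bbox.x1 : quad.x1;
    r.y1 = bbox.y1 < quad.y1 ? bbox.y1 : quad.y1;
    if (r.x0 >= r.x1 || r.y0 >= r.y1) return 0;
    *out = r;
    return 1;
}

/* Number of pixels in a rectangle. */
int rect_area(Rect r) { return (r.x1 - r.x0) * (r.y1 - r.y0); }
```

In this scheme a triangle that straddles the screen center is visited by all four workers, but each one only touches the pixels inside its own quadrant, so the clipped areas add up to the original bounding box.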
-
@eniko that is quite odd. What is the relative speedup between threaded and unthreaded for each? (Like, float single threaded versus multi threaded and so on.)
-
@ataylor
random flat color triangles x9.3
random gouraud triangles x4.7
fullscreen flat color triangles x2.2
fullscreen gouraud triangles x2.8
-
@eniko @oblomov Did the conversion only affect the pure computation part, or did some buffer formats/sizes change too?
I'm thinking if you did something like f32 to u8 per channel for the framebuffer, the tiles now fit in cache and that can be a huge speedup.
Verify by testing just one of the four quadrant threads, or alternatively just non threaded version with lower resolution.
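To put rough numbers on the cache hypothesis above, here is a back-of-the-envelope sketch. The resolution, channel count, and cache size are all assumptions for illustration; the thread does not say what the real values are, or whether the buffer format changed at all.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a framebuffer region with the given per-channel storage. */
size_t framebuffer_bytes(size_t width, size_t height,
                         size_t channels, size_t bytes_per_channel) {
    assert(channels > 0 && bytes_per_channel > 0);
    return width * height * channels * bytes_per_channel;
}

/* Whether a working set of `bytes` fits in a cache of `cache_bytes`. */
int fits_in_cache(size_t bytes, size_t cache_bytes) {
    return bytes <= cache_bytes;
}
```

Taking a hypothetical 640x480 screen with 4 channels, one quadrant is 320x240 pixels. Stored as f32 per channel that is 320 * 240 * 4 * 4 = 1,228,800 bytes, while u8 per channel is 307,200 bytes. Against an assumed 1 MiB per-core cache, only the u8 quadrant fits, which is the kind of threshold effect that could produce a jump in the threaded case only.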
-
@eniko Yeah that makes sense, ints are faster.
-
@eniko that seems… weird. I would expect close to 4x for all of these, tbh.
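One way to read the "close to 4x" expectation is as parallel efficiency: measured speedup divided by thread count, where 1.0 is perfect linear scaling. A tiny sketch, with `parallel_efficiency` as a hypothetical helper name:

```c
#include <assert.h>

/* Speedup per thread: 1.0 is linear scaling, above 1.0 is superlinear. */
double parallel_efficiency(double speedup, int threads) {
    assert(threads > 0);
    return speedup / threads;
}
```

With the 4 worker threads mentioned in the thread, the reported x9.3 gives an efficiency above 1.0 (superlinear), and x2.2 gives an efficiency well below 1.0. Superlinear results usually mean something other than the extra cores changed between the two runs, such as the working set per thread.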