@eniko @oblomov Did the conversion only affect the pure computation part, or did some buffer formats/sizes change too?
I'm thinking that if you went from something like f32 to u8 per channel for the framebuffer, the tiles may now fit in cache, and that alone can be a huge speedup.
You could verify this by running just one of the four quadrant threads, or alternatively a non-threaded version at lower resolution.
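
To put rough numbers on the cache idea, here is a quick back-of-the-envelope sketch. Everything in it is an assumption on my part, not something I know about your setup: a 640x480 framebuffer, 4 channels, one quadrant per thread, and 1 MiB of cache per core. Plug in your real values.

```python
# Back-of-the-envelope check: does one quadrant of the framebuffer fit in cache?
# All constants below are hypothetical placeholders, not measured values.

def framebuffer_bytes(width, height, channels, bytes_per_channel):
    """Size in bytes of a tightly packed framebuffer region."""
    return width * height * channels * bytes_per_channel

def fits_in_cache(nbytes, cache_bytes):
    """True if the region is no larger than the given cache size."""
    return nbytes <= cache_bytes

WIDTH, HEIGHT, CHANNELS = 640, 480, 4   # assumed resolution and RGBA layout
CACHE_BYTES = 1024 * 1024               # assumed 1 MiB of per-core cache

# Each of the four threads works on one quadrant (half width, half height).
quad_f32 = framebuffer_bytes(WIDTH // 2, HEIGHT // 2, CHANNELS, 4)  # f32 = 4 bytes
quad_u8 = framebuffer_bytes(WIDTH // 2, HEIGHT // 2, CHANNELS, 1)   # u8 = 1 byte

print("f32 quadrant:", quad_f32, "bytes, fits:", fits_in_cache(quad_f32, CACHE_BYTES))
print("u8 quadrant: ", quad_u8, "bytes, fits:", fits_in_cache(quad_u8, CACHE_BYTES))
```

With these made-up numbers the f32 quadrant is a bit over 1 MiB and the u8 one is a quarter of that, so only the u8 version stays resident. If your real sizes land on opposite sides of your cache size like this, the buffer format change is a likely suspect.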