i'm going to write a software rendered occlusion culling mechanism for block game 😤
-
i wish i knew a better way to approximate the bounding rectangle of these AABB cubes. atm i'm using an extremely generous buffer to make sure nothing is considered offscreen when it's not actually offscreen in my first pass, but that means i wind up doing a lot of matrix multiplications for corners of subchunks that aren't on screen
lol. it is *much* more effective underground. at 12 chunk radius fps goes from 45 to 230 😊
(big occluders in your face help cull a lot of chunks)
-
@eniko thank you occluders
-
@eniko is there a reason why you're not using the typical hierarchical z buffer approach?
-
@eniko
So the obvious performance hack is to set your game in the 1870's and the player character is a horse wearing blinders.
-
@eniko so, disclaimer, I don't know if this is useful advice to your situation, but I found these to be handy on a project I worked on a while back:
https://lxjk.github.io/2018/03/25/Improve-Tile-based-Light-Culling-with-Spherical-sliced-Cone.html
https://wickedengine.net/2018/01/optimizing-tile-based-light-culling/
iirc we used a modified version of the spherical-sliced cone test combined with the tile depth bitmask strat for dealing with depth discontinuity. this was for culling reflection probes though, not objects.
-
@eniko I want to say we used AABB and sphere tests as coarse tests, and anything that survived the initial elimination got a more expensive test that miiiight have just been the spherical slice test, but I'm remembering something about calculating a culling primitive in object space instead of world space and getting a major win from that, but I can't remember what that was about and I don't think it was this.
-
@lritter as far as I know that requires reading data back from the gpu in a way that I don't have access to with FNA
-
@bnut yes, all cubes are 4x4x4 or 8x8x8 blocks in size
-
@bnut (as in I set a constant whether I'm doing 2x2x2 subchunks per chunk or 4x4x4)
-
ok so two ideas:
1. i should be culling in world or view space, which is more reliable and cheaper
2. for projecting corners i think i can cut out some operations when doing the world->screen transform, since all i care about is screen x/y and some relative depth, not actually normalized depth like for a depth buffer, so i can just use linear view space distance
-
simplified my world->screen transforms. instead of doing a 4x4 matrix multiply then dividing by w and multiplying by buffer size (21 multiplies) i switched to a view matrix, view transform the position, then project to the screen manually (13 multiplies) and use view distance instead of depth since i don't care
and it now runs in about 2/3rds the time it did before :D
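A minimal sketch of that "view transform, then project by hand" idea, written in C++ for illustration (the project itself is C#/FNA, and the author's actual code isn't shown). Everything named here is an assumption: the `Camera` struct, `makeAxisAlignedCamera`, `projectToScreen`, a view space that looks down -Z, square pixels, and using depth along the view axis as the "view distance". As written it uses 13 multiplies and one division per point.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical camera: the rotation rows and translation of a world->view
// matrix, plus precomputed screen-space scale factors.
struct Camera {
    Vec3 right, up, back;    // rows of the view rotation (view space looks down -Z)
    Vec3 translation;        // view-space position of the world origin
    float focalX, focalY;    // half buffer size divided by tan(half fov) per axis
    float centerX, centerY;  // buffer centre in pixels
};

struct ScreenPoint { float x, y, depth; bool inFront; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Builds a camera at `eye` with no rotation, purely to keep the example small.
Camera makeAxisAlignedCamera(Vec3 eye, float fovY, int width, int height) {
    const float halfH = 0.5f * static_cast<float>(height);
    const float halfW = 0.5f * static_cast<float>(width);
    const float focal = halfH / std::tan(0.5f * fovY);  // square pixels assumed
    return Camera{{1, 0, 0}, {0, 1, 0}, {0, 0, 1},
                  {-eye.x, -eye.y, -eye.z},
                  focal, focal, halfW, halfH};
}

// World -> screen without a 4x4 multiply or a normalized depth value.
ScreenPoint projectToScreen(const Camera& cam, const Vec3& world) {
    // View transform: three dot products (9 multiplies).
    const float vx = dot(cam.right, world) + cam.translation.x;
    const float vy = dot(cam.up, world) + cam.translation.y;
    const float vz = dot(cam.back, world) + cam.translation.z;

    const float depth = -vz;  // linear distance along the view axis
    if (depth <= 0.0f) {
        return ScreenPoint{0.0f, 0.0f, depth, false};  // behind the camera
    }
    const float invDepth = 1.0f / depth;  // one division
    // Perspective divide and scale to pixels (4 multiplies).
    return ScreenPoint{cam.centerX + vx * invDepth * cam.focalX,
                       cam.centerY - vy * invDepth * cam.focalY,
                       depth, true};
}
```

Points behind the camera are flagged rather than projected, since dividing by a non-positive depth would give a meaningless screen position; an occlusion pass would need to handle boxes that straddle the camera plane separately.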
-
changed the way i cull subchunks courtesy of @bnut. i extract the world space frustum planes from my view-projection matrix, normalize, then do `dot(plane.normal, subchunkCenter) + plane.dist` and check against the bounding sphere radius (`0.5 * sqrt(3) * side`)
occlusion culling now runs in 0.43x the time but also the culling is up from 43% to 74%, which means the game runs much faster. at 12 chunk distance fps above ground is up from 60 to 110!
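For anyone following along, here is a rough sketch of that plane extraction and sphere test in C++ (illustration only; the real code is C# and isn't shown in the thread). The assumptions are loud ones: a column-vector convention (`clip = M * v`), an OpenGL-style clip range of -w..w, and made-up names such as `makePerspective`, `extractFrustumPlanes` and `sphereInFrustum`. With a row-vector matrix like XNA-style ones the same sums would be taken over columns instead of rows, and a 0..1 depth range changes the near plane.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 normal; float dist; };         // inside when dot(normal, p) + dist >= 0
using Mat4 = std::array<std::array<float, 4>, 4>;  // m[row][col], clip = M * v

// Stand-in for a real view-projection matrix: a perspective projection with
// the camera at the origin looking down -Z (so the view matrix is identity).
Mat4 makePerspective(float fovY, float aspect, float zNear, float zFar) {
    Mat4 m{};  // all zeros
    const float f = 1.0f / std::tan(0.5f * fovY);
    m[0][0] = f / aspect;
    m[1][1] = f;
    m[2][2] = (zFar + zNear) / (zNear - zFar);
    m[2][3] = 2.0f * zFar * zNear / (zNear - zFar);
    m[3][2] = -1.0f;
    return m;
}

// Planes come out in the order left, right, bottom, top, near, far:
// each one is row 3 plus or minus one of rows 0..2, then normalized.
std::array<Plane, 6> extractFrustumPlanes(const Mat4& vp) {
    std::array<Plane, 6> planes{};
    for (int i = 0; i < 6; ++i) {
        const int row = i / 2;
        const float sign = (i % 2 == 0) ? 1.0f : -1.0f;
        const float a = vp[3][0] + sign * vp[row][0];
        const float b = vp[3][1] + sign * vp[row][1];
        const float c = vp[3][2] + sign * vp[row][2];
        const float d = vp[3][3] + sign * vp[row][3];
        const float invLen = 1.0f / std::sqrt(a * a + b * b + c * c);
        planes[i] = Plane{Vec3{a * invLen, b * invLen, c * invLen}, d * invLen};
    }
    return planes;
}

// Radius of the sphere that encloses a cube: half of its space diagonal.
float boundingSphereRadius(float side) {
    return 0.5f * std::sqrt(3.0f) * side;
}

// A sphere is rejected only if it lies entirely behind at least one plane.
bool sphereInFrustum(const std::array<Plane, 6>& planes, const Vec3& center, float radius) {
    for (const Plane& p : planes) {
        const float signedDist = p.normal.x * center.x + p.normal.y * center.y +
                                 p.normal.z * center.z + p.dist;
        if (signedDist < -radius) return false;
    }
    return true;
}
```

The test is conservative: a sphere near a frustum corner can pass all six planes while still being off screen, so it never culls something visible but will occasionally keep something that isn't.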
-
@eniko These probably count as optimizing more than needed, but a few suggestions:
1. test without the early out and see if it's faster. It's more ALU work, but because of how deeply pipelined CPUs are you really want to keep the code flow in small loops very predictable.
2. assuming you're targeting 64-bit platforms, you should have access to SSE2, which would let you test 4 chunks against the view frustum at once. Depending on how the memory latency works out, that can net another 2-4x.
-
interestingly, if i switch from 4 subchunks per axis (64 per chunk) to only 2 (8 per chunk) the entire occlusion culling pass finishes in just 7ms even on my underpowered workstation at 12 visible chunk radius
unfortunately the culling% does go down from 71% to 57% and so fps goes down from 110 to 90, but that might be worth the trade off 🤔
-
even if i set the visible chunk radius to 16 (which is quite far!) the occlusion culling runs in 11ms* and my fps above ground is 55
which i would say is solid for this early in the project
*with 8 subchunks per chunk
-
@ataylor @eniko does C# have something like loop unrolling directives? That static 4-times loop looks like a prime candidate for it, although I guess doing it by hand with only 4 elements might still be acceptable, assigning the comparison to 4 bools and skipping the last instruction if any is false, or something like that?
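One way to picture the "no early out" suggestion is a sketch like the following, in C++ rather than C# and with hypothetical names (`cullSpheres`, the separate `cx`/`cy`/`cz` arrays). Instead of bailing at the first failed plane, every sphere is tested against every plane and the comparison results are combined with a bitwise AND, so the inner loop has no data-dependent branch. Storing centres as separate x/y/z arrays is an assumption meant to make the loop friendlier to SIMD or compiler auto-vectorization; whether any of this is actually faster has to be measured.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Plane { float nx, ny, nz, dist; };  // inside when n . p + dist >= 0

// Tests all spheres against all planes without an early out.
// visible[i] ends up 1 if sphere i is not fully behind any plane, else 0.
// cx, cy and cz must all have the same length.
void cullSpheres(const std::array<Plane, 6>& planes,
                 const std::vector<float>& cx,
                 const std::vector<float>& cy,
                 const std::vector<float>& cz,
                 float radius,
                 std::vector<std::uint8_t>& visible) {
    const std::size_t count = cx.size();
    assert(cy.size() == count && cz.size() == count);
    visible.assign(count, 1);
    for (const Plane& p : planes) {
        // Straight-line inner loop: same work for every sphere, no branches
        // that depend on the data.
        for (std::size_t i = 0; i < count; ++i) {
            const float d = p.nx * cx[i] + p.ny * cy[i] + p.nz * cz[i] + p.dist;
            visible[i] &= static_cast<std::uint8_t>(d >= -radius);
        }
    }
}
```

The trade-off is the one described above: more arithmetic per sphere in exchange for a predictable loop.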
-
Oh. Most of the gains are from the frustum culling. I can cull a significant % of chunks if I use 64 subchunks per chunk, but that is also very very slow. But it does bring cull % up to like 75%
But just frustum culling chunks culls like 45-50% of chunks and is very very fast :|
8 subchunks per chunk is just a regular amount of slow but also not very effective
So hm. Well I guess at least I learned how to do efficient frustum culling? Womp womp 😔
-
oh hey i got it down to 54ms from 93 at 16 visible chunk radius with 64 subchunks per chunk by only processing subchunks that have visible faces
i was using visible *blocks* before. but a theoretically visible block (not void/air) surrounded by blocks has no visible faces, so can't actually be seen
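The difference between "has blocks" and "has visible faces" can be sketched like this, in C++ for illustration (the game's real data layout isn't shown, so `Chunk`, `kChunkSize = 16`, and treating everything outside the chunk as air are all assumptions). A solid block only counts if at least one of its six neighbours is not solid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kChunkSize = 16;  // assumed chunk edge length in blocks

// Hypothetical chunk storage: one solid/air flag per block.
struct Chunk {
    std::vector<bool> solid =
        std::vector<bool>(kChunkSize * kChunkSize * kChunkSize, false);

    static bool inBounds(int x, int y, int z) {
        return x >= 0 && y >= 0 && z >= 0 &&
               x < kChunkSize && y < kChunkSize && z < kChunkSize;
    }
    static std::size_t index(int x, int y, int z) {
        return static_cast<std::size_t>((z * kChunkSize + y) * kChunkSize + x);
    }
    // Assumption: anything outside this chunk is treated as air.
    bool isSolid(int x, int y, int z) const {
        return inBounds(x, y, z) && solid[index(x, y, z)];
    }
    void setSolid(int x, int y, int z, bool value) {
        solid[index(x, y, z)] = value;
    }
};

// A block can be seen only if it is solid and touches a non-solid neighbour.
bool blockHasVisibleFace(const Chunk& c, int x, int y, int z) {
    if (!c.isSolid(x, y, z)) return false;
    return !c.isSolid(x - 1, y, z) || !c.isSolid(x + 1, y, z) ||
           !c.isSolid(x, y - 1, z) || !c.isSolid(x, y + 1, z) ||
           !c.isSolid(x, y, z - 1) || !c.isSolid(x, y, z + 1);
}

// The older criterion: the subchunk contains at least one non-air block.
bool subchunkHasSolidBlocks(const Chunk& c, int sx, int sy, int sz, int side) {
    for (int z = sz * side; z < (sz + 1) * side; ++z)
        for (int y = sy * side; y < (sy + 1) * side; ++y)
            for (int x = sx * side; x < (sx + 1) * side; ++x)
                if (c.isSolid(x, y, z)) return true;
    return false;
}

// The stricter criterion: at least one block in the subchunk has an exposed face.
bool subchunkHasVisibleFaces(const Chunk& c, int sx, int sy, int sz, int side) {
    for (int z = sz * side; z < (sz + 1) * side; ++z)
        for (int y = sy * side; y < (sy + 1) * side; ++y)
            for (int x = sx * side; x < (sx + 1) * side; ++x)
                if (blockHasVisibleFace(c, x, y, z)) return true;
    return false;
}
```

In a fully solid chunk, an interior subchunk passes the first check but fails the second, which is exactly the kind of subchunk that can be skipped before any projection or depth testing happens. A real implementation would presumably look into neighbouring chunks at the borders and cache the result when the chunk is meshed rather than rescanning blocks every frame.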
-
oops i had a bug that was causing a bunch of subchunks to not get occluded when they should be. so now the number of depth occluded subchunks is up by 50% and there's 1 depth occluded subchunk for every 2 frustum culled subchunks