@eniko You can use SIMD ops for that too, but the biggest problem for software rendering is texture sampling plus edge culling. AVX2 has gather-load ops (to pull texture data from scattered addresses) and conditional (masked) store ops (to skip writes to pixels outside the triangle's edges), which make it much less problematic.
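To make those two ops concrete, here's a rough scalar sketch of what an 8-lane gather and a masked store do per lane. It's plain C++ so it runs anywhere; the names `kLanes`, `gather8` and `masked_store8` are made up for illustration, and this only emulates the semantics rather than using the actual instructions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count: a 256-bit register holds 8 x 32-bit values.
constexpr std::size_t kLanes = 8;

// Emulates a gather load: each lane reads a texel from its own index,
// so the 8 reads can land anywhere in the texture.
std::array<std::uint32_t, kLanes> gather8(
    const std::uint32_t* texels,
    const std::array<std::int32_t, kLanes>& index) {
    std::array<std::uint32_t, kLanes> out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = texels[index[i]];
    }
    return out;
}

// Emulates a masked store: lanes whose pixel is outside the triangle
// leave the framebuffer untouched.
void masked_store8(std::uint32_t* dst,
                   const std::array<bool, kLanes>& inside,
                   const std::array<std::uint32_t, kLanes>& color) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (inside[i]) {
            dst[i] = color[i];
        }
    }
}
```

With AVX2 the two loops would each collapse into a single intrinsic from `<immintrin.h>`, such as `_mm256_i32gather_epi32` for the load and `_mm256_maskstore_epi32` for the store.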
There are writeups out there on how to compute barycentric coordinates for each pixel, which let you interpolate the coordinates and matrices of each vertex across the triangle. Beyond that it's the same as using shaders.
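For reference, a minimal scalar sketch of the barycentric part, using the common edge-function approach. The names `Vec2`, `Bary`, `edge`, `barycentric` and `interpolate` are hypothetical, and it assumes 2D screen-space vertices and plain affine interpolation, so treat it as a starting point rather than a full rasterizer.

```cpp
// Hypothetical 2D screen-space point.
struct Vec2 { float x, y; };

// Twice the signed area of triangle (a, b, p). The sign tells which
// side of the edge a->b the point p lies on.
inline float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Per-pixel weights for the three vertices, plus the edge-cull result.
struct Bary { float w0, w1, w2; bool inside; };

Bary barycentric(Vec2 v0, Vec2 v1, Vec2 v2, Vec2 p) {
    const float area = edge(v0, v1, v2);
    if (area == 0.0f) {
        return {0.0f, 0.0f, 0.0f, false};  // degenerate triangle
    }
    // Dividing by the signed area makes the weights sum to 1 and
    // keeps them positive inside the triangle for either winding.
    const float w0 = edge(v1, v2, p) / area;
    const float w1 = edge(v2, v0, p) / area;
    const float w2 = edge(v0, v1, p) / area;
    const bool inside = (w0 >= 0.0f) && (w1 >= 0.0f) && (w2 >= 0.0f);
    return {w0, w1, w2, inside};
}

// Blend one scalar per-vertex attribute (e.g. a texture coordinate).
inline float interpolate(Bary b, float a0, float a1, float a2) {
    return b.w0 * a0 + b.w1 * a1 + b.w2 * a2;
}
```

The `inside` flag is what would drive the store mask per pixel, and `interpolate` would run once per attribute component. For perspective-correct results you'd interpolate each attribute divided by w along with 1/w, then divide per pixel.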