@eniko I am swamped with work right now, otherwise I would ask for permission to infodump :D
TL;DR: there are hundreds (if not thousands) of hardware events that the CPU's performance counters can tick up on, for all kinds of performance-relevant and otherwise interesting things (cache misses, for example). Each OS gives you a way to read these out, plus the kernel can make sure they're properly accounted per process or thread and whatnot. Some of these events are preposterously specific, like uhhhhhh "Number of cycles dispatch is stalled for integer scheduler queue 3 tokens". But there's usually a small selection of commonly useful ones available through a simple command.
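To make the "simple command" part concrete, here's a rough sketch assuming Linux with the `perf` tool installed (my pick for illustration, other OSes have their own tools). It runs a command under `perf stat` in CSV mode and pulls out a few of the common counters. The helper names `parse_perf_stat_csv` and `count_events` are made up for this example, and the event list is just a guess at what's generally useful.

```python
import csv
import subprocess


def parse_perf_stat_csv(text):
    """Parse `perf stat -x,` output into {event_name: count or None}.

    Each CSV row starts with: value, unit, event name, ...
    """
    counts = {}
    for row in csv.reader(text.splitlines()):
        # Skip blank lines, comment lines, and anything too short to be a counter row.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0], row[2]
        try:
            counts[event] = int(value)
        except ValueError:
            try:
                # Some events (e.g. task-clock) report fractional values.
                counts[event] = float(value)
            except ValueError:
                # perf prints "<not supported>" or "<not counted>" instead of a number.
                counts[event] = None
    return counts


def count_events(cmd, events=("cycles", "instructions", "cache-misses")):
    """Run `cmd` (a list of strings) under perf stat and return the parsed counts.

    Assumption: `perf` is on PATH and the kernel lets this user read counters.
    perf writes its report to stderr, so anything the measured command prints
    to stderr ends up in the same stream and may need filtering.
    """
    result = subprocess.run(
        ["perf", "stat", "-x,", "-e", ",".join(events), "--", *cmd],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=False,
    )
    return parse_perf_stat_csv(result.stderr)
```

Something like `count_events(["sleep", "1"])` should then hand back a dict keyed by event name as perf reports it (which may include a suffix such as `:u` when only user-space counting is permitted). Events the CPU or a VM doesn't expose come back as `None`.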
It's absolutely fascinating what you can, in theory, do if you want to dig really really really far down, but to be honest, I usually get very few actionable insights from anything more intricate than the most basic ones ;(
Absolutely a skill issue on my end, I'm convinced!