In the early days of personal computing CPU bugs were so rare as to be newsworthy.
-
This is a circuit with two sets of 8 wires going into it, plus one wire that selects which inputs go to the output, and a single set of 8 wires going out. Depending on the value of the select signal you'll get one or the other set of inputs. Guess what happens if the select signal arrives too late, for example right after the end of the clock cycle? You get the wrong set of bits at the output. 26/31
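To make that failure mode concrete, here's a toy behavioral model of such a multiplexer in Python. This is purely illustrative: real hardware is described in an HDL and the late-select failure is an analog timing effect, not explicit logic like this.

```python
# Toy behavioral model of the 8-bit 2:1 multiplexer described above.

def mux8(a: int, b: int, select: int) -> int:
    """Return one of two 8-bit inputs depending on the select signal."""
    return (b if select else a) & 0xFF

def clock_edge(register: dict, a: int, b: int, select_at_edge: int) -> None:
    """Latch the mux output into a pipeline register at the clock edge."""
    register["q"] = mux8(a, b, select_at_edge)

reg = {"q": 0}
a, b = 0x12, 0x34
intended_select = 1   # upstream logic meant to pick input b...

# ...but if the select signal arrives *after* the clock edge, the register
# latches the mux output computed with the stale select value instead:
stale_select = 0
clock_edge(reg, a, b, stale_select)
print(hex(reg["q"]))  # 0x12 -- the wrong set of bits ends up in the pipeline
```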
I can't be sure that this is exactly what's happening on Raptor Lake CPUs; it's just a theory. But a modern CPU core has millions upon millions of these types of circuits, and a timing issue in any of them can lead to these kinds of problems. And that's before considering that voltage delivery across a core is an exquisitely analog problem, with voltage fluctuations caused by all sorts of events: the instructions being executed, temperature, etc... 27/31
-
You might also remember that Raptor Lake CPU problems get worse over time. That's because circuits degrade, and applying the wrong voltage can make them degrade faster. Circuit degradation is a research field of its own, but its effects are broadly the same: the resistance of wires goes up, the capacitance of trench capacitors goes down, etc… and the combined effect of these changes is that circuits get slower and need more voltage to operate at the same frequency. 28/31
-
When CPUs ship, their most performance-critical circuits are supposed to come with a certain amount of timing slack that compensates for this effect. Over time this slack gets smaller. If a CPU is already operating near the edge, aging might cut the slack all the way down to zero, causing the core to fail consistently. 29/31
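As a back-of-the-envelope illustration of how that slack erodes, here's a tiny Python sketch. Every number in it is invented for the example; real aging models are far more sophisticated.

```python
# Toy model: timing slack = clock period - critical path delay,
# with the delay growing slowly as the circuit ages. Numbers are made up.

CLOCK_PERIOD_PS = 200          # hypothetical clock period
FRESH_DELAY_PS = 185           # hypothetical critical path delay at ship time
DEGRADATION_PS_PER_YEAR = 4    # hypothetical slowdown caused by aging

for year in range(6):
    delay = FRESH_DELAY_PS + DEGRADATION_PS_PER_YEAR * year
    slack = CLOCK_PERIOD_PS - delay
    status = "OK" if slack > 0 else "FAILS: the signal misses the clock edge"
    print(f"year {year}: delay {delay} ps, slack {slack} ps -> {status}")
```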
-
And remember, there are a lot of variables involved: timing broadly depends on transistor sizing and wire resistance. Higher voltages improve transistor performance but increase power dissipation and thus temperature. Temperature increases resistance, which decreases propagation speed in wires. It's a delicate dance to keep a dynamic equilibrium of optimal power consumption, adequate performance and reliability. 30/31
-
All in all, modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources on addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems keener on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31
-
Bonus end-of-thread post: when you encounter these bugs, try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines, under upper management that rarely understands what they're doing. Put the blame for these bugs where it's due: on executives who haven't allocated enough time, people and resources to make a quality product.
-
In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
@gabrielesvelto This is one of those cases where I wish I had a Mastodon client that let me like the whole thread.
-
Bonus end-of-thread post: when you encounter these bugs, try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines, under upper management that rarely understands what they're doing. Put the blame for these bugs where it's due: on executives who haven't allocated enough time, people and resources to make a quality product.
@gabrielesvelto I went to a lecture in the early 1990s by Tim Leonard, the formal methods guy at DEC. His story was that DEC had as-built simulators for every CPU they designed, and they had correct-per-the-spec simulators for these CPUs.
At night, after the engineers went home, their workstations would fire up tools that generated random sequences of instructions, throw those sequences at both simulators, and compare the results. This took *lots* of machines, but, as Tim joked, Equipment was DEC's middle name.
And they'd find bugs - typically with longer sequences, and with weird corner cases of exceptions and interrupts - but real bugs in real products they'd already shipped.
But here was the banger: sure, they'd fix those bugs. But there were still more bugs to find, and it took longer and longer to find them.
Leonard's empirical conclusion was that there is no "last bug" to be found and fixed in real hardware. There's always one more bug out there, and it'll take you longer and longer (and cost more and more) to find it.
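For readers who haven't seen this technique, here's a minimal sketch of that kind of differential testing in Python. The two "simulators" are trivial stand-ins with a planted corner-case bug, not anything resembling DEC's actual tools.

```python
# Differential testing: feed random inputs to two implementations that
# should agree, and report any divergence.
import random

def spec_simulator(seq):
    """Stand-in for the correct-per-the-spec simulator."""
    return sum(seq) & 0xFFFF

def as_built_simulator(seq):
    """Stand-in for the as-built simulator, with a planted corner-case bug."""
    result = sum(seq) & 0xFFFF
    if len(seq) > 90 and seq[-1] == 0:  # only triggers on long, rare inputs
        result ^= 1
    return result

random.seed(0)
for trial in range(50_000):
    seq = [random.randint(0, 255) for _ in range(random.randint(1, 100))]
    if spec_simulator(seq) != as_built_simulator(seq):
        print(f"divergence on trial {trial}, sequence length {len(seq)}")
        break
else:
    print("no divergence found in this run")
```

As in the DEC anecdote, the planted bug only fires on long sequences with an unusual final value, so it takes many random trials to stumble on it.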
-
In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
@gabrielesvelto Thank you for this detailed and specific explanation. Chris Hobbs discusses the relative unreliability of popular modern CPUs in "Embedded Systems Development for Safety-Critical Systems" but not to this depth.
I don't do embedded work but I do safety-related software QA. Our process has three types of test - acceptance tests which determine fitness-for-use, installation tests to ensure the system is in proper working order, and in-service tests which are sort of a mystery. There's no real guidance on what an in-service test is or how it differs from an installation test. Those are typically run when the operating system is updated or there are similar changes to support software.

Given the issue of CPU degradation, I wonder if it makes sense to periodically run in-service tests or somehow detect CPU degradation (that's probably something that should be owned by the infrastructure people vs the application people).
I've mainly thought of CPU failures as design or manufacturing defects, not in terms of "wear", so this has me questioning the assumptions our testing is based on.
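To sketch what a degradation-aware in-service test might look like, here's a hypothetical known-answer check in Python: redo a deterministic computation whose result is known in advance and flag any mismatch. This is an illustration of the idea only, not an established in-service test procedure.

```python
# Hypothetical in-service sanity check: recompute a known-answer workload
# and compare. A mismatch means the machine silently miscomputed something.
import hashlib

PAYLOAD = bytes(range(256)) * 4096          # 1 MiB of deterministic data
# In a real test this digest would be precomputed on known-good hardware:
KNOWN_DIGEST = hashlib.sha256(PAYLOAD).hexdigest()

def cpu_sanity_check(rounds: int = 100) -> bool:
    """Re-run a deterministic computation; a mismatch hints at bad hardware."""
    for _ in range(rounds):
        if hashlib.sha256(PAYLOAD).hexdigest() != KNOWN_DIGEST:
            return False                    # silent data corruption detected
    return True

if not cpu_sanity_check():
    print("in-service test failed: possible CPU degradation")
```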
-
All in all, modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources on addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems keener on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31
@gabrielesvelto Super interesting; thanks for writing this up!
-
@gabrielesvelto great read ty!
-
In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
@gabrielesvelto Fascinating thread, especially the degradation over time inherent to modern processors. That came up recently in an interesting viral video on a world where we forget how to make new CPUs.
Bit of an aside, but I assume this affects other architectures? The thread mentioned Intel and AMD, but I assume Arm and RISC-V are similarly prone to these sorts of problems?
-
Bonus end-of-thread post: when you encounter these bugs, try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines, under upper management that rarely understands what they're doing. Put the blame for these bugs where it's due: on executives who haven't allocated enough time, people and resources to make a quality product.
@gabrielesvelto that's the deep nerdy stuff I love about IT! Thanks a ton for sharing this!
-
The speed at which signals propagate in circuits is proportional to how much voltage is being applied. In older CPUs this voltage was fixed, but in modern ones it changes thousands of times per second to save power. Providing only as much voltage as is needed for a certain clock frequency can dramatically reduce power consumption, but providing too little voltage may cause a signal to arrive late, or the wrong signal to reach the pipeline register, causing in turn a cascade of failures. 24/31
@gabrielesvelto nitpick: the propagation velocity of a *signal* in a circuit is not affected by the voltage magnitude; that is a function of the (innate) dielectric constant of the material.
however, a higher core voltage does mean that a rising edge tends to reach the gate threshold voltage of a transistor more quickly, which reduces the time it takes for each asynchronous logic element's output to reach a well-defined state after a change in input, thus propagating logic *state* more quickly.
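For reference, the standard first-order model behind this clarification is the alpha-power law for gate delay (a textbook relation, added here for context rather than taken from the thread):

$$ t_d \propto \frac{C_L \, V_{dd}}{(V_{dd} - V_{th})^{\alpha}}, \qquad \alpha \approx 1.3\text{--}2 $$

where $C_L$ is the load capacitance, $V_{dd}$ the supply voltage and $V_{th}$ the transistor threshold voltage: raising $V_{dd}$ further above $V_{th}$ makes outputs settle faster, exactly as described above, while dynamic power grows roughly with $V_{dd}^2$.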
-
@gabrielesvelto (what you said is absolutely correct regarding "signals" in the HDL sense of the word, it just gets a bit muddled when we're simultaneously talking about the analogue behaviours of the actual electrical signals, hence the clarification ^^)
-
Bonus end-of-thread post: when you encounter these bugs, try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines, under upper management that rarely understands what they're doing. Put the blame for these bugs where it's due: on executives who haven't allocated enough time, people and resources to make a quality product.
@gabrielesvelto This was a phenomenal write-up, thank you!
-
In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
@gabrielesvelto fantastic thread thank you :D
-
@gabrielesvelto Nice thread!
You seem to imply that bugs have become considerably more frequent, largely due to the increased complexity. Right?
To me it's not obvious that the larger number of known issues isn't to a large degree due to much better visibility (we didn't have anywhere close to today's automatic crash collection systems in the past) and due to the vastly increased number of CPUs... Do you have any gut feeling about that?
-
@gabrielesvelto (what you said is absolutely correct regarding "signals" in the HDL sense of the word, it just gets a bit muddled when we're simultaneously talking about the analogue behaviours of the actual electrical signals, hence the clarification ^^)
@gsuberland thanks, I was playing a bit fast and loose with the terminology. As I was writing these toots I reminded myself that entire books have been written just to model transistor behavior and propagation delay, and my very crude wording would probably give their authors a heart attack.
-
@gabrielesvelto Nice thread!
You seem to imply that bugs have become considerably more frequent, largely due to the increased complexity. Right?
To me it's not obvious that the larger number of known issues isn't to a large degree due to much better visibility (we didn't have anywhere close to today's automatic crash collection systems in the past) and due to the vastly increased number of CPUs... Do you have any gut feeling about that?
@AndresFreundTec I've been in charge of Firefox stability for ten years now and some of my early work to detect hardware issues dates back to that time. In the years before 2020 we would get 2-3 bugs per year, usually across different CPUs. Now we get dozens; it's really on another level.