In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
-
@gabrielesvelto thank you for this great, informative overview.
Numerous times I have asked myself whether a reported crash could be caused by a hardware bug, and so far I don't think I have ever seen a real case, possibly because the software I work on runs in more controlled environments.
But I would be curious how a crash from a real hardware bug could be classified automatically. Do you have pointers to FOSS tools?
-
@gabrielesvelto let’s assume 0.1 major bugs per kByte of binary code. For a 6502 or Z80 with its 64 kByte address space, you get 6.4 bugs at any given time. Now with 16 GByte main memory … it’s the scale that ruins it.
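As a back-of-envelope check of that scaling argument (the 0.1-bugs-per-kByte rate is purely illustrative, as is reading the 6502/Z80 figure off their 64 kByte address space):

```rust
// Back-of-envelope scaling of the (purely illustrative) 0.1-bugs-per-kB rate.
fn main() {
    let bugs_per_kb = 0.1;
    let z80_address_space_kb = 64.0; // 6502/Z80: 16-bit addresses, 64 kB
    let modern_ram_kb = 16.0 * 1024.0 * 1024.0; // 16 GB expressed in kB
    println!("6502/Z80 era: {:.1} bugs", bugs_per_kb * z80_address_space_kb);
    println!("16 GB era:   {:.0} bugs", bugs_per_kb * modern_ram_kb); // ~1.7 million
}
```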
-
All in all, modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources on addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems more keen on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31
@gabrielesvelto that was super fascinating. Thanks for the thread!
-
@slink oh yes, we have tools for that. First however I'd point you to my thread about memory errors because those are even more common when analyzing crashes: https://fosstodon.org/@gabrielesvelto/112407741329145666
For crash analysis we have a Rust crate to analyze minidumps, which we generate when Firefox crashes. The crate can be used both as a tool and as a library:
-
@slink this crate can detect patterns that suggest a memory error was encountered or that the crash was inconsistent and thus most likely due to a hardware bug. If you check out the output schema of the tool you'll find two fields called "possible_bit_flips" and "crash_inconsistencies" that capture this information: https://github.com/rust-minidump/rust-minidump/blob/main/minidump-processor/json-schema.md
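For the curious, here is a minimal sketch of how those two schema fields could be consumed. It assumes you have already produced JSON output (e.g. with minidump-stackwalk --json crash.dmp > crash.json) and that both fields are arrays whose mere non-emptiness is meaningful; the exact shape of their contents is a guess, not taken from the schema:

```rust
// Minimal sketch: scan minidump-stackwalk's JSON output for the two
// hardware-error signals named in the rust-minidump output schema.
// Assumed Cargo dependency: serde_json = "1"
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = fs::read_to_string("crash.json")?;
    let report: serde_json::Value = serde_json::from_str(&raw)?;

    // Field names come from the schema linked above; treating them as
    // arrays and only testing for non-emptiness is this sketch's assumption.
    let suspicious = |key: &str| {
        report
            .get(key)
            .and_then(|v| v.as_array())
            .map_or(false, |a| !a.is_empty())
    };

    if suspicious("possible_bit_flips") {
        println!("crash address/registers look like a flipped bit (bad RAM or CPU)");
    }
    if suspicious("crash_inconsistencies") {
        println!("crash state is self-contradictory: likely a hardware bug");
    }
    Ok(())
}
```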
-
@gabrielesvelto yes, I know the memory error thread, thank you. ECC absolutely is a must, and in this regard I am glad that my code (usually) does not run on consumer devices. FWIW, relying on every single bit in a multi-TB RAM system still feels scary at times, and it is amazing that these machines actually work.
Thank you for the links!
-
@gabrielesvelto I always thought that the entire Intel CPU architecture was doomed with its over-complicated instruction set. I preferred Motorola designs. Do you have insights on how PowerPC and ARM (Mac) CPUs fare in this regard? I suspect they use microcode as well, but it may be less complex? But then again, that may allow them to optimize in other areas more, which in turn raises the complexity just as well. (Oh, seeing you replied already, i.e. that your ARM base is much smaller)
-
@grumble209 @gabrielesvelto I remember that the UltraSPARC-II (Blackbird) CPU, over its lifetime (and to date, to boot), only had a single erratum, and that was an extremely unlikely timing issue, not a logic bug. Unfortunately, that overall goodness was offset by the widespread off-chip L2 cache issue circa Y2K.
@grumble209 @gabrielesvelto Sun cheaped out on the external cache pathway, using only parity protection rather than the ECC protection that direct competitors (HAL/Fujitsu) were using.
This made the US-II external cache vulnerable to environmental factors (alpha-particle emissions from common packaging materials).
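To make the parity-vs-ECC distinction concrete: even parity over a byte detects a single flipped bit but cannot say which bit it was, so the data is lost, while a SECDED-style Hamming code names the flipped position so it can be repaired. A toy textbook Hamming(12,8) sketch follows; this is illustrative only, not Sun's or Fujitsu's actual encoding:

```rust
// Parity detects a single-bit flip; a Hamming code locates and repairs it.
fn parity(byte: u8) -> u8 {
    (byte.count_ones() % 2) as u8 // even parity over 8 data bits
}

// Encode 8 data bits into 12 bits with check bits at positions 1, 2, 4, 8
// (1-based); the syndrome of a corrupted word is the flipped bit's position.
fn hamming_encode(data: u8) -> u16 {
    let d: Vec<u16> = (0..8).map(|i| ((data >> i) & 1) as u16).collect();
    let mut word = [0u16; 13]; // indices 1..=12 used
    let data_pos = [3, 5, 6, 7, 9, 10, 11, 12];
    for (i, &p) in data_pos.iter().enumerate() {
        word[p] = d[i];
    }
    for &c in &[1usize, 2, 4, 8] {
        let mut check = 0;
        for p in 1..=12 {
            if p & c != 0 && p != c {
                check ^= word[p];
            }
        }
        word[c] = check;
    }
    (1..=12).fold(0, |acc, p| acc | (word[p] << (p - 1)))
}

fn hamming_syndrome(codeword: u16) -> usize {
    let mut syndrome = 0;
    for &c in &[1usize, 2, 4, 8] {
        let mut check = 0;
        for p in 1..=12 {
            if p & c != 0 {
                check ^= (codeword >> (p - 1)) & 1;
            }
        }
        if check != 0 {
            syndrome |= c;
        }
    }
    syndrome
}

fn main() {
    let data: u8 = 0b1011_0010;
    let stored_parity = parity(data);
    let codeword = hamming_encode(data);

    // An alpha particle flips one bit in the stored byte/codeword.
    let corrupted_byte = data ^ (1 << 5);
    let corrupted_word = codeword ^ (1 << 6);

    // Parity: mismatch detected, but no way to know *which* bit flipped.
    assert_ne!(parity(corrupted_byte), stored_parity);

    // SECDED-style code: the syndrome names the flipped position exactly.
    let pos = hamming_syndrome(corrupted_word);
    let repaired = corrupted_word ^ (1 << (pos - 1));
    assert_eq!(repaired, codeword);
    println!("parity only detects; Hamming locates bit {} and repairs it", pos);
}
```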
-
@tempelorg every modern core is like what I described in this thread, regardless of the ISA. The ISA only contributes to the complexity of some parts of a specific design, but the bulk of it comes from these being very high performance cores. The machinery required to reach the current performance levels is what makes these designs very complex.
At lower performance levels an ARM core can be simpler than an x86 one, all else being equal, but not in the desktop/server space.
-
The root of all these issues is fundamentally the same: complexity. Modern cores have become so complex that it's impossible to demonstrate at design time that they will work reliably under all possible conditions, and thoroughly testing them is also infeasible. In addition to ever-increasing logic complexity, the conditions in which they operate have also changed: fixed voltages and frequencies are a thing of the past, complicating physical design. 2/31
@gabrielesvelto x86 instruction complexity alone is unmanageable. A single instruction can be up to 15 bytes long. That's 2^120 possible bit combinations for instructions. So it's already physically impossible to test every instruction individually, let alone test every *sequence* of instructions to find problematic execution sequences.
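The arithmetic behind that claim, with an assumed (and generous) test throughput, just to show the orders of magnitude:

```rust
// 15 bytes = 120 bits, so a 2^120 raw encoding space. Even at a made-up
// 10^12 decode tests per second, exhaustive coverage is out of reach.
fn main() {
    let space = 2f64.powi(120);      // ~1.33e36 encodings
    let tests_per_second = 1e12;     // assumed throughput
    let seconds_per_year = 3.156e7;
    let years = space / tests_per_second / seconds_per_year; // ~4e16 years
    println!("{:.2e} encodings -> {:.2e} years to enumerate", space, years);
}
```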
-
@gabrielesvelto Intel/AMD had an opportunity to create a clean, easy-to-decode instruction layout with the transition to 64-bit, but they failed. http://www.emulators.com/docs/nx05_vx64.htm
-
@hyc ISA complexity is just part of it. The issue stems from the combination of very large instruction sets and operation modes with very high performance implementations. If you look at something as old and simple as the Cortex A9, even that came with a pretty significant number of issues: https://documentation-service.arm.com/static/608118315e70d934bc69f13d
-
@gabrielesvelto yes, it's only a part, but it starts there. The irregular instruction sizes caused problems when instructions straddled cacheline boundaries, etc. Everything after that: superscalar execution, OOOE, all got harder because the simplest case, single instruction in-order, was already non-deterministic.
-
@hyc it's definitely an added source of complexity for x86 implementations. I remember reading this a few years ago: https://blog.trailofbits.com/2019/10/31/destroying-x86_64-instruction-decoders-with-differential-fuzzing/
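The linked post describes differential fuzzing of x86 decoders: feed the same random byte string to several decoders and flag any disagreement. A toy sketch of that harness shape follows; the two "decoders" here are deliberately divergent stand-ins that only model how many redundant 0x66 prefixes they accept, not real decoders like XED or Capstone:

```rust
// Each toy decoder returns Some(instruction_length) or None for "invalid".
fn decoder_a(bytes: &[u8]) -> Option<usize> {
    decode_with_prefix_limit(bytes, 4) // tolerates at most 4 prefixes
}
fn decoder_b(bytes: &[u8]) -> Option<usize> {
    decode_with_prefix_limit(bytes, 14) // tolerates up to 14 prefixes
}

fn decode_with_prefix_limit(bytes: &[u8], max_prefixes: usize) -> Option<usize> {
    let prefixes = bytes.iter().take_while(|&&b| b == 0x66).count();
    if prefixes > max_prefixes || prefixes >= bytes.len() {
        return None;
    }
    Some(prefixes + 1) // the "opcode" is one byte in this toy model
}

// Tiny xorshift PRNG so the sketch has no external dependencies.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut state = 0x12345678u64;
    let mut disagreements = 0;
    for _ in 0..100_000 {
        // Random candidate instruction, up to the 15-byte x86 limit.
        let len = (xorshift(&mut state) % 15 + 1) as usize;
        let bytes: Vec<u8> = (0..len)
            .map(|_| {
                let r = xorshift(&mut state);
                // Bias toward the prefix byte, the way a real fuzzer
                // biases toward "interesting" encodings.
                if r & 1 == 0 { 0x66 } else { (r >> 1) as u8 }
            })
            .collect();
        if decoder_a(&bytes) != decoder_b(&bytes) {
            disagreements += 1; // a real harness would log and minimize these
        }
    }
    println!("{} decoder disagreements in 100k inputs", disagreements);
}
```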
-
Bonus end-of-thread post: when you encounter these bugs try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines and under upper management who rarely understands what they're doing. Put the blame for these bugs where it's due: on executives that haven't allocated enough time, people and resources to make a quality product.
@gabrielesvelto seriously, GET A FUCKING BLOG.
-
@gabrielesvelto this was fascinating, thanks!
-
@gabrielesvelto Intel's officially stated reason is that (too) high voltage (and temperature) caused fast degradation of clock trees inside cores. This degradation resulted in a duty cycle shift (square wave no longer square?), which caused general instability. If they use both posedge and negedge as triggers, then change in duty cycle will definitely violate timing.
@krzysdz @gabrielesvelto Back in my day, at least, there was a lot of latch-based design. Time borrowing through the latches' transparency windows was used to make up for timing miscorrelation on the datapath. I remember timing limiters that could be 5+ cycles long.
However, that presumes you have tighter constraints on the clock path. Even a faster-than-model clock path could slow you down.
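A rough numerical illustration of the duty-cycle point (all numbers invented): a path launched on the rising edge and captured on the falling edge gets a budget of duty_cycle × period, so clock-tree aging that skews the duty cycle eats timing margin directly:

```rust
// Illustrative only: how a duty-cycle shift shrinks the rising-to-falling
// edge timing budget at an assumed 5 GHz clock.
fn main() {
    let f_ghz = 5.0;
    let period_ps = 1000.0 / f_ghz; // 200 ps at 5 GHz
    for duty in [0.50, 0.47, 0.45] {
        println!(
            "duty {:.0}% -> {:.0} ps available before the falling edge",
            duty * 100.0,
            duty * period_ps
        );
    }
}
```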
-
@gabrielesvelto great thread! Thanks!
-
@shelldozer @grumble209 @gabrielesvelto I've probably had more of those UltraSPARC-IIs pass through my hands than any other CPU. (I had four maxed-out E4000s at home at one point.)
I had a friend in the 90s who had a job at DEC one summer writing a program that output random but legal C, to stress-test their compiler.