In the early days of personal computing, CPU bugs were so rare as to be newsworthy.
-
I can't be sure that this is exactly what's happening on Raptor Lake CPUs; it's just a theory. But a modern CPU core has millions upon millions of these types of circuits, and a timing issue in any of them can lead to these kinds of problems. And that's to say nothing of voltage delivery across a core, which is an exquisitely analog problem, with fluctuations that can be caused by all sorts of events: the instructions being executed, temperature, etc. 27/31
@gabrielesvelto Intel's officially stated reason is that (too) high voltage (and temperature) caused fast degradation of the clock trees inside the cores. This degradation resulted in a duty-cycle shift (the square wave no longer being square?), which caused general instability. If they use both posedge and negedge as triggers, then a change in duty cycle will definitely violate timing.
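To make the duty-cycle point concrete, here is a back-of-the-envelope sketch in Python. The numbers (a 5 GHz clock and a 95 ps critical path) are illustrative assumptions, not Raptor Lake specifications; the point is just how little duty-cycle shift it takes to starve double-edge-triggered logic of settling time.

```python
# Back-of-the-envelope timing check for double-edge-triggered logic.
# All numbers below are illustrative assumptions, not Raptor Lake specs.

CLOCK_HZ = 5.0e9                 # assumed 5 GHz core clock
PERIOD_PS = 1e12 / CLOCK_HZ      # 200 ps full period
CRITICAL_PATH_PS = 95.0          # assumed worst-case path delay between edges

def half_periods(duty_cycle):
    """Return (high_ps, low_ps) for a given duty cycle, i.e. the
    fraction of the period the clock spends high."""
    high = PERIOD_PS * duty_cycle
    return high, PERIOD_PS - high

for duty in (0.50, 0.47, 0.45):
    high, low = half_periods(duty)
    # Logic triggered on both posedge and negedge only gets the
    # shorter half-period to settle before the next edge arrives.
    margin = min(high, low) - CRITICAL_PATH_PS
    status = "OK" if margin >= 0 else "TIMING VIOLATION"
    print(f"duty={duty:.0%}: short half={min(high, low):.0f} ps, "
          f"margin={margin:+.0f} ps -> {status}")
```

With these made-up numbers, a drift from 50% to 47% already flips a path that met timing into one that fails intermittently, which fits the "general instability" symptom.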
-
@gabrielesvelto Thank you for this detailed and specific explanation. Chris Hobbs discusses the relative unreliability of popular modern CPUs in "Embedded Systems Development for Safety-Critical Systems" but not to this depth.
I don't do embedded work, but I do safety-related software QA. Our process has three types of test: acceptance tests, which determine fitness for use; installation tests, which ensure the system is in proper working order; and in-service tests, which are sort of a mystery. There's no real guidance on what an in-service test is or how it differs from an installation test; they're typically run when the operating system is updated or there are similar changes to supporting software. Given the issue of CPU degradation, I wonder if it makes sense to periodically run in-service tests or somehow detect CPU degradation (that's probably something that should be owned by the infrastructure people rather than the application people).
I've mainly thought of CPU failures as design or manufacturing defects, not in terms of "wear", so this has me questioning the assumptions our testing is based on.
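A minimal sketch of what such a periodic in-service CPU check might look like, assuming a golden digest recorded on known-good hardware; the workload and scheme are hypothetical, not drawn from any safety standard:

```python
import hashlib
import struct

# Hypothetical in-service sanity check: run a deterministic integer and
# floating-point workload and hash every intermediate result. The golden
# digest is recorded once on known-good hardware; a mismatch in the
# field hints at CPU (or memory) trouble and warrants deeper diagnostics.

def compute_kernel(iterations=100_000):
    h = hashlib.sha256()
    x = 1.5   # IEEE 754 double; the sequence is bit-exact across runs
    acc = 0
    for i in range(iterations):
        x = x * 1.000001 + 0.5 / (i + 1)        # FP multiply/add/divide
        acc = (acc * 1103515245 + i) % (2**31)  # integer LCG step
        h.update(struct.pack("<dq", x, acc))
    return h.hexdigest()

def in_service_test(golden_digest):
    digest = compute_kernel()
    if digest != golden_digest:
        raise RuntimeError(f"CPU self-test mismatch: got {digest}, "
                           f"expected {golden_digest}")

# Recorded once during installation testing, then checked periodically:
# GOLDEN = compute_kernel()
```

A real degradation check would have to stress the part much harder (sustained load at peak clocks and temperature), but the compare-against-a-golden-value structure is the same, and the result is something the infrastructure people can own and monitor.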
@arclight timing degradation should not be visible outside of the highest-spec desktop CPUs, which are really pushing the envelope even when they're new. Embedded systems and even mid-range desktop CPUs will never fail because of it. What might become visible is increased power consumption over time, though.
-
@arclight timing degradation should not be visible outside of the highest-spec desktop CPUs, which are really pushing the envelope even when they're new. Embedded systems and even mid-range desktop CPUs will never fail because of it. What might become visible is increased power consumption over time, though.
@arclight on the other hand watch out for memory errors. Those can crop up much sooner than CPU problems due to circuit degradation: https://fosstodon.org/@gabrielesvelto/112407741329145666
-
In the early days of personal computing, CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
@gabrielesvelto there was also no meaningful computer security, nor much need for it, in the days of the 6502. It's much different now that most computers are connected to the internet and can be infected with malware within seconds of connecting.
-
@gabrielesvelto but UEFI is already quite complex: it has to find block devices, read their partition tables, read FAT file systems, read directories and files, load data into memory and transfer execution. Wouldn't a patch after all that be too late?
@mdione yes, it's very complex, but motherboard firmware has a mechanism to load the new microcode right as the CPU is bootstrapped, even before the CPU is capable of accessing DRAM. All the rest of the UEFI machinery runs after that. Note that this early bootstrap mechanism usually involves a separate bootstrap CPU, typically an embedded microcontroller whose task is to get the main x86 core up and running.
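As an aside, on x86 Linux you can see which microcode revision ended up loaded (whether by the firmware at boot or by a later OS update) via the `microcode` field in /proc/cpuinfo. A minimal sketch; the revision shown in the comment is only an example:

```python
# Print the microcode revision reported by the kernel on x86 Linux.
# The "microcode" field in /proc/cpuinfo reflects the revision currently
# loaded; it repeats once per logical core, so we return the first hit.

def microcode_revision(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("microcode"):
                return line.split(":", 1)[1].strip()
    return None

print(microcode_revision())  # prints something like "0x12b"
```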
-
All in all modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems keener on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31
@gabrielesvelto I wonder if they could use said statistical toys as part of a large-scale fuzzing process to detect such bugs?
-
In the early days of personal computing, CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
Fascinating thread. Do you know if the same issues exist on low-power embedded CPUs like the ESP32, or is this something that mostly affects high-end stuff?
-
@gabrielesvelto that's the deep nerdy stuff I love about IT! Thanks a ton for sharing this!
@perpetuum_mobile @gabrielesvelto I even used to code in assembler on 8-bit platforms; for years I could not quite get my head round how modern CPUs worked until this thread (and now I know a bit more)
-
Bonus end-of-thread post: when you encounter these bugs, try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines, under upper management who rarely understand what they're doing. Put the blame for these bugs where it's due: on the executives who haven't allocated enough time, people and resources to make a quality product.
I don’t cut any slack for Intel producing two whole generations of CPUs with manufacturing flaws, then trying to cover it up, and never really offering full restitution to any customers.
-
All in all modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems keener on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31
@gabrielesvelto It was a very rich, exciting, interesting, and useful post! Thank you very much!
-
@perpetuum_mobile @gabrielesvelto I even used to code in assembler on 8-bit platforms; for years I could not quite get my head round how modern CPUs worked until this thread (and now I know a bit more)
@vfrmedia @perpetuum_mobile if you have some free time this is a good deep dive: https://cseweb.ucsd.edu/classes/fa14/cse240A-a/pdf/04/Gonzalez_Processor_Microarchitecture_2010_Claypool.pdf
While it doesn't cover some of the most recent advancements, it captures 90% of what you need to know.
If you have a lot of free time and want to dive deeper there's this: https://www.agner.org/optimize/microarchitecture.pdf
-
In the early days of personal computing, CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31
@gabrielesvelto The book 'Silicon' by the Italian who designed the 4004, 8080 and Z80 (Federico Faggin) is a most splendid read. Fascinating that he had to add optical confusions against reverse engineering to minimise cloning by rivals.
-
@perpetuum_mobile @gabrielesvelto I even used to code in assembler on 8-bit platforms; for years I could not quite get my head round how modern CPUs worked until this thread (and now I know a bit more)
@vfrmedia @gabrielesvelto I did code a little bit in x86 asm when I was a teen. It was the only way to turn on SVGA modes in Turbo Pascal, and I wanted to make a game back then ;-) I wrote a program which simulated a flame in real time, doing a per-pixel average of the surrounding pixels and adding random full-intensity (255) sparks at the bottom to make the flame look real.
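That flame is the classic demoscene fire effect, and it is simple enough to sketch; here is a rough Python rendition of the algorithm described above (grid size, spark rate and decay are my guesses, not the original Turbo Pascal values):

```python
import random
import time

# Classic real-time fire effect, roughly as described above: seed the
# bottom row with random full-intensity (255) sparks, then let each
# pixel become an average of the pixels below it so the heat diffuses
# upward and cools off as it rises.

W, H = 80, 24
buf = [[0] * W for _ in range(H)]
PALETTE = " .:-=+*#%@"  # ASCII stand-in for a VGA fire palette

def step():
    # Random 255-valued sparks along the bottom row.
    for x in range(W):
        buf[H - 1][x] = 255 if random.random() < 0.4 else 0
    # Every other pixel averages its neighbours one and two rows below;
    # the small subtraction makes the flame fade as it climbs.
    for y in range(H - 1):
        below = buf[y + 1]
        two_below = buf[min(y + 2, H - 1)]
        for x in range(W):
            s = below[(x - 1) % W] + below[x] + below[(x + 1) % W] + two_below[x]
            buf[y][x] = max(0, s // 4 - 4)

for _ in range(200):
    step()
    frame = "\n".join("".join(PALETTE[v * len(PALETTE) // 256] for v in row)
                      for row in buf)
    print("\033[H\033[J" + frame)  # clear the terminal and redraw
    time.sleep(0.03)
```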