In the early days of personal computing CPU bugs were so rare as to be newsworthy.

jdm_@mastodon.social

@gabrielesvelto Super interesting; thanks for writing this up!

pinkforest@hachyderm.io

tehstu@hachyderm.io

@gabrielesvelto Fascinating thread, especially the degradation over time inherit to modern processors. That came up recently in an interesting viral video on a world where we forget how to make new CPUs.

Bit of an aside, but I assume this affects other architectures? The thread mentioned Intel and AMD, but I assume Arm and Risc-V are similarly prone to these sorts of problems?

perpetuum_mobile@mastodon.social

@gabrielesvelto that's the deep nerdy stuff I love about IT! Thanks a ton for sharing this!

gsuberland@chaos.social

@gabrielesvelto nitpick: the propagation velocity of a *signal* in a circuit is not affected by the voltage magnitude; that is a function of the (innate) dielectric constant of the material.

however, a higher core voltage does mean that a rising edge tends to reach the gate threshold voltage of a transistor more quickly, which reduces the time it takes for each asynchronous logic element's output to reach a well-defined state after a change in input, thus propagating logic *state* more quickly.

gsuberland@chaos.social

@gabrielesvelto (what you said is absolutely correct regarding "signals" in the HDL sense of the word, it just gets a bit muddled when we're simultaneously talking about the analogue behaviours of the actual electrical signals, hence the clarification ^^)

theorangetheme@en.osm.town

@gabrielesvelto This was a phenomenal write-up, thank you!

dubiousblur@social.treehouse.systems

@gabrielesvelto fantastic thread thank you :D

andresfreundtec@mastodon.social

@gabrielesvelto Nice thread!

You seem to imply that bugs have become considerably more frequent, largely due to the increased complexity. Right?

To me it's not obvious that the larger number of known issues isn't to a large degree due to much better visibility (we didn't have anywhere close to today's automatic crash collection systems in the past) and due to the vastly increased number of CPUs... Do you have any gut feeling about that?

gabrielesvelto@mas.to

@gsuberland thanks, I was playing a bit fast and loose with the terminology. As I was writing these toots I reminded myself that entire books have been written just to model transistor behavior and propagation delay, and my very crude wording would probably give their authors a heart attack.

gabrielesvelto@mas.to

@AndresFreundTec I've been in charge of Firefox stability for ten years now and some of my early work to detect hardware issues dates back then. In pre-2020 years we could get a 2-3 bugs per year, usually across different CPUs. Now we get dozens, it's really on another level.

gabrielesvelto@mas.to

@AndresFreundTec admittedly we get a lot more after a new microarchitecture launches, and then they go down as microcode updates get rolled out. If Microsoft hadn't started shipping microcode updates with their OS updates we'd be swamped.

kimsj@mastodon.social

@gabrielesvelto
There’s also meta-stability. If a value is snapshotted half way through it changing, it may occasionally result in the output not being one or zero, but some ‘half’ value. Depending on the circuits using that result, it may be interpreted as either 1 or 0 — and maybe different parts of the circuit will use different interpretations. Such intermediate states are only meta-stable, and will flip to a firm 1 or 0 at some indeterminate time later, possibly propagating the problem.

gabrielesvelto@mas.to

@KimSJ ah yes, very good point. It's been a while since my days in hardware land and I had forgotten about it.

gabrielesvelto@mas.to

@dubiousblur glad you liked it!

gabrielesvelto@mas.to

@tehstu yes, absolutely. I've encountered several bugs in AMD CPUs, not many on ARM just yet, but our ARM user-base is very small compared to x86, so it's just less likely for us to stumble upon them. Plus we have some machinery that can detect some hardware bugs automatically but it doesn't work on ARM just yet.

mdione@en.osm.town

@gabrielesvelto but UEFI is already quite complex, it has to find block devices, read their partition tables, read FAT file systems, read directories and files, load data in memory and transfer execution. Wouldn't a patch after all that not be too late?

krzysdz@mastodon.online

@gabrielesvelto Intel's officially stated reason is that (too) high voltage (and temperature) caused fast degradation of clock trees inside cores. This degradation resulted in a duty cycle shift (square wave no longer square?), which caused general instability. If they use both posedge and negedge as triggers, then change in duty cycle will definitely violate timing.

https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Intel-Core-13th-and-14th-Gen-Desktop-Instability-Root-Cause/post/1633239

gabrielesvelto@mas.to

@arclight timing degradation should not be visible outside of the highest-spec desktop CPUs which are really pushing the envelope even when they're new. Embedded systems and even mid-range desktop CPUs will never fail because of it. What might become visible is increased power consumption over time though.

gabrielesvelto@mas.to

@arclight on the other hand watch out for memory errors. Those can crop up much sooner than CPU problems due to circuit degradation: https://fosstodon.org/@gabrielesvelto/112407741329145666

Piero Bosio Social Web Site Personale

In the early days of personal computing CPU bugs were so rare as to be newsworthy.

Feed RSS

Gli ultimi otto messaggi ricevuti dalla Federazione

Post suggeriti

Wordle 1729 2/6

Cells for NetBSD... Lightning Talk without the Talk ;-)

Where Have You Been for the Last 20 Years?

:: Montaruli, la paladina del “carcere per uno spinello”, condannata dalla Cassazione.