Intel Chip Design Flaw Mitigation May Slow Many Enterprise Systems; Related Security Flaws Will Linger

Gonna be a bonanza for Microsoft and Windows 10, since they refuse to let Win7 install on newer processors. There's no technical reason for this, and they're arguably violating the terms of their retail license, but updates will not run on Win7 if you have a recent Intel or Ryzen chip.

Apple has announced a fix for Spectre by the end of next week for "90% of its processors", and already has a fix for Meltdown. I'm checking on other vendors.

Robear wrote:

Apple has announced a fix for Spectre by the end of next week for "90% of its processors", and already has a fix for Meltdown. I'm checking on other vendors.

Is Spectre something that can be fully fixed in software, though? I know there are mitigations like Firefox reducing the accuracy of JS timers to make detecting the cache hit/miss harder, but what else can they do?

It's not clear to me yet how thorough the software fixes for Meltdown will be. (Spectre I think is an easier fix because it's attacking software boundaries via software.) Obviously one goal of security is making attacks so expensive that they are essentially impractical. I suspect that will be the goal here, because it will be impossible to just replace every modern multi-core processor out there.

This Forbes column seems to be a good source for updates on what vendors are doing. In particular, Google Chrome users should enable Site Isolation immediately; it's easy to do and gives each site you visit its own separate memory space to prevent illicit reads from a compromised website which attacks you using Spectre.

Meltdown seems to require an app on the device or system, or physical access with admin privs, so it has a smaller (but still significant) attack surface. Spectre (of which there are two variants) can be done remotely, but is easier to fix.

Robear wrote:

Apple has announced a fix for Spectre by the end of next week for "90% of its processors", and already has a fix for Meltdown. I'm checking on other vendors.

Did you mean Intel here and not Apple? I know Intel has claimed that they have a fix for 90% of its processors, though they were light on details (https://newsroom.intel.com/news-rele...). Apple just seems to be saying they will have mitigations out for Safari (https://support.apple.com/en-us/HT20...).

Google is now also claiming to have a mitigation in use for Spectre (https://security.googleblog.com/2018...) and that what they have done for Meltdown hasn't caused much in the way of performance issues for them.

No, Apple. Intel has not released any fixes itself, and is referring queries to the vendors involved in selling the branded hardware. Apple's fixes for Spectre are of course in the browser, since the attacks are via remote applications (websites); their "mitigations" for Meltdown are already in iOS 11.2, macOS 10.13.2, and tvOS 11.2.

Obviously websites are not the only possible vector for app communications (or the port tables would be waaaay smaller lol), so Apple and presumably all OS vendors will be figuring out what other vulns they have and working to patch them over time. But web traffic is the big vulnerability here, and probably the easiest to test and address, so everyone seems to be doing browser fixes first.

Reducing the accuracy of timers is of limited use, because you can just construct accurate timers by other means:

https://gruss.cc/files/fantastictime...
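To make the "other means" concrete, here is a minimal Python sketch of one trick the timer papers discuss: a counting thread. One thread does nothing but increment a shared counter, and the attacker reads that counter instead of the deliberately coarsened clock. The class and method names are hypothetical, and Python's interpreter overhead makes the resolution far worse than native code would get. In browsers, the equivalent reportedly used a Web Worker incrementing a SharedArrayBuffer, which is one reason browsers disabled SharedArrayBuffer in response to Spectre.

```python
import threading
import time


class CountingTimer:
    """Hypothetical counting-thread timer (illustration only).

    A background thread does nothing but increment a counter. Reading the
    counter before and after an operation gives an elapsed "tick" count whose
    resolution depends on how fast the thread spins, not on how coarse the
    official clock API has been made.
    """

    def __init__(self) -> None:
        self.ticks = 0
        self._running = False
        self._thread = threading.Thread(target=self._spin, daemon=True)

    def _spin(self) -> None:
        # Single writer: only this thread ever modifies self.ticks.
        while self._running:
            self.ticks += 1

    def start(self) -> "CountingTimer":
        self._running = True
        self._thread.start()
        return self

    def stop(self) -> None:
        self._running = False
        self._thread.join()

    def now(self) -> int:
        return self.ticks


if __name__ == "__main__":
    timer = CountingTimer().start()
    before = timer.now()
    time.sleep(0.01)  # stand-in for the operation being timed
    after = timer.now()
    timer.stop()
    print("elapsed ticks:", after - before)
```

The tick count has no fixed unit; an attacker only needs "this took noticeably more ticks than that", which is exactly the cache hit/miss distinction.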

And if you fall down the hole in that paper you'll find the reference to this one, which is amazing though maybe not yet a practical concern:

https://cmaurice.fr/pdf/ndss17_mauri...

There they describe how to run ssh between EC2 instances using the processor cache as the communications channel. Zowie.

TheGameguru wrote:

I would guess in the end the performance hit won’t be that noticeable. Certainly not for gaming.

Certainly looks that way. For consumers this seems to be largely just a big nasty security vulnerability to patch up. Server side is a different story.

(I'm with Ed. Thread title reads more hyperbolic than it needs to be. If this were an enterprise IT forum I'd feel otherwise.)

Robear wrote:
Gremlin wrote:

Um, the patch breaks some antivirus software (to the point that your computer won't boot) so you might need to update your AV before you can patch:

I assume Avira is one of them, or perhaps Malwarebytes, because I have not to my knowledge been offered the security update. Anyone have an actual list?

My Avira picked up an update yesterday, and grabbed the Windows Update this morning.

ZDNet has a running list of the AV vendors and their status in terms of patching.

Gremlin wrote:

I've been looking but haven't found one yet. Microsoft describes a way in that article for you to check your registry to see if your AV vendor has flipped that setting on.

You can also run a Microsoft-provided PowerShell script to get a pretty comprehensive report on whether you're patched and whether you have hardware support for the fix as well.

(h/t to Ars, which has provided some of the better coverage of the issue)
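For the registry check Microsoft describes, here is a minimal Python sketch using the standard-library winreg module. The key path and value name below are our reading of Microsoft's January 2018 antivirus-compatibility guidance (the setting AV vendors flip once their product is compatible with the patch); double-check them against the current advisory. The function name is hypothetical, and Microsoft's PowerShell report remains the more thorough check.

```python
import sys

# Our reading of Microsoft's AV-compatibility guidance; verify before relying on it.
QUALITY_COMPAT_KEY = r"SOFTWARE\Microsoft\Windows\CurrentVersion\QualityCompat"
QUALITY_COMPAT_VALUE = "cadca5fe-87d3-4b96-b7fb-a231484277cc"


def av_compat_flag_present():
    """Return True if the AV-compatibility value exists, False if it does not,
    or None when not running on Windows (where the check is meaningless)."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard-library module

    # KEY_WOW64_64KEY avoids being redirected to WOW6432Node from 32-bit Python.
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, QUALITY_COMPAT_KEY, 0, access) as key:
            winreg.QueryValueEx(key, QUALITY_COMPAT_VALUE)
        return True
    except OSError:
        # A missing key or a missing value both surface as OSError (FileNotFoundError).
        return False


if __name__ == "__main__":
    print(av_compat_flag_present())
```

If this returns False on Windows, your AV vendor most likely hasn't set the flag yet, and Windows Update will hold back the patch.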

From Raspberry Pi, a very solid look at what Spectre and Meltdown are (and why Raspberry Pi isn't vulnerable). I liked how they explained what the attacks are in more layman's terms (skipping things like pipelining).

It's not clear to me yet how thorough the software fixes for Meltdown will be. (Spectre I think is an easier fix because it's attacking software boundaries via software.)

I'm still confused about the technical details, but Meltdown seems to be a method of reading other processes' memory state via a hardware exploit that reads kernel memory (which maps all processes into its absolute memory space somewhere, so it can read anything), and it can be conclusively defeated by doing a TLB flush and switch on a kernel call. This has a performance penalty, but entirely mitigates the problem. Run a kernel of any kind with the TLB flush enabled, and Meltdown is not a problem. Performance might be, but security is not at risk.
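The "TLB flush and switch on a kernel call" is what Linux calls kernel page-table isolation (KPTI) and Windows calls KVA Shadow. Below is a toy Python model of the idea, not real OS code: all class and method names are hypothetical, and details such as PCIDs and the small entry trampoline that stays mapped in user mode are left out. What it shows: with isolation on, kernel pages simply aren't in the page table while user code runs, so a speculative read has nothing to hit, and every kernel entry and exit now costs a table switch.

```python
from dataclasses import dataclass


@dataclass
class ToyMachine:
    """Toy model of kernel page-table isolation (hypothetical names throughout)."""

    user_pages: dict         # address -> contents, always mapped
    kernel_pages: dict       # address -> contents, mapped only in the kernel when isolated
    isolation: bool = True   # True = patched (KPTI-style), False = pre-patch
    table_switches: int = 0  # each one stands in for a CR3 write / TLB flush
    in_kernel: bool = False

    def _active_table(self) -> dict:
        if self.in_kernel or not self.isolation:
            # Pre-patch, kernel memory stays mapped (merely marked privileged)
            # even while user code runs; that mapping is what Meltdown abuses.
            return {**self.user_pages, **self.kernel_pages}
        return dict(self.user_pages)

    def speculative_read(self, addr):
        """What a Meltdown-style transient read from user code could observe:
        anything present in the currently active page table."""
        return self._active_table().get(addr)

    def syscall(self, handler):
        """Enter the kernel, run handler(kernel_pages), and return to user mode."""
        if self.isolation:
            self.table_switches += 1  # switch to the full kernel table on entry
        self.in_kernel = True
        try:
            return handler(self.kernel_pages)
        finally:
            self.in_kernel = False
            if self.isolation:
                self.table_switches += 1  # switch back to the user table on exit
```

The two `table_switches` per system call are the whole performance story: workloads that make few kernel calls barely notice them, while I/O-heavy ones pay them constantly.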

Spectre, on the other hand, which I really don't get yet, is apparently going to be vastly harder to mitigate, and will stick around for ages. I've seen it described both as only within-process (to break local sandboxes) and as cross-process (to break out of virtual machines), and I don't know which explanation is accurate. It apparently has to be mitigated directly by programs that want to protect some memory segments, by putting up a 'fence' instruction that prevents speculative execution past a certain point. But I'm entirely unclear on who does the fencing: the program being targeted (like an SSH agent holding your private keys) or the program being exploited (the web browser running JavaScript).

Maybe it's both?
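On who does the fencing: as far as we can tell, the mitigation goes into whichever code contains the bounds check that attacker-influenced input can steer (a JavaScript engine's array accesses, a kernel system-call handler), which may or may not be the same program that holds the secrets. Besides a fence instruction such as x86 `lfence` after the check, one published alternative is index masking, as in the Linux kernel's `array_index_nospec()`. The Python below only illustrates the masking arithmetic (Python doesn't speculate like a CPU, so it defends nothing); the function names are hypothetical and the index is treated as an unsigned 64-bit value.

```python
BITS = 64
FULL = (1 << BITS) - 1  # all ones in a 64-bit word


def index_mask_nospec(index: int, size: int) -> int:
    """Branch-free mask: all ones if 0 <= index < size, else 0.

    Modeled on the generic form of Linux's array_index_mask_nospec():
    ~(index | (size - 1 - index)) >> (BITS - 1), with an arithmetic shift.
    In range, both operands have a clear top bit, so the inverted value has
    its top bit set and the shift smears it into all ones.
    """
    v = (index | ((size - 1 - index) & FULL)) & FULL
    top = (~v & FULL) >> (BITS - 1)  # 1 if in range, 0 otherwise
    return (-top) & FULL


def read_element(array, index):
    # Classic bounds-check-bypass shape: the CPU may predict this branch as
    # taken even for an out-of-range index and run the body speculatively.
    if 0 <= index < len(array):
        # Clamping through a data dependency (not just the branch) means a
        # mispredicted path reads element 0 instead of attacker-chosen memory.
        safe = index & index_mask_nospec(index, len(array))
        return array[safe]
    return None
```

The masking variant is attractive because it costs a couple of arithmetic instructions, whereas a fence stalls the pipeline at that point.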

As I understand it today, Spectre (which probably affects most modern chips and has two variants) is related to the common practice of placing pieces of kernel memory adjacent to ("within") user memory spaces, so that relevant kernel memory is as close as possible. So for example, a virtual machine would have its user and kernel memory allocations intermixed, to reduce latency as much as possible. Of course, this requires various techniques to isolate the two, otherwise a linear dump of memory would reveal everything. Hence address randomization and other techniques. With all this, bear in mind, any arbitrary chunk of memory contains both user and kernel information.

Now that Spectre exists, it allows a program to look at *all* the memory within its allocated chunk boundaries, even those memory spaces that belong to other processes or the kernel. And since a browser is allocated a pretty big chunk of memory, AND is present on most systems, an attack on that will automatically yield lots of information. It could be done with other programs, of course, but the browser is the obvious target. That's the best understanding I have at the moment, without diving into the papers myself.

Spectre is the easiest to fix, and has little to no performance impact as currently understood.

Meltdown, which is more Intel specific and has one variant in the set, works on the principle that in many modern processor designs, the security fencing that protects different execution threads is not applied to speculative processing branches. This means that a program can trick the processor into giving it information from a look-ahead cache in the processor, and get it. This naturally applies to both kernel and user-land execution threads.

The mitigation for this is relatively simple: disable speculative processing. Intel recommends that processors be told to do operations in serial order, not parallel, by adding a function in the software (the OS) that says "don't do any more work on this thread until this section is actually finished." So instead of being able to do one task and at the same time process the likely next task, having the results ready when the first finishes, the processor will revert to the early '90s and do one thing at a time per core. This can knock out the performance advantage from multi-threading (Hyper-Threading in Intel-speak).

This has tremendous performance implications for heavily threaded workloads (think databases, virtualization, and multi-stream i/o workloads), as well as for systems that run under heavy load most of the time. (This is where my original assessment was focused, although at the time it was believed that home systems might also face this issue.) Systems that are used today at 80% or better utilization would be (will be) in danger of just keeling over if the code is patched without redistributing workloads. But if they don't, then they face problems like malicious hackers setting up instances in clouds just to read what their neighbors are doing - and you can spin up and down an awful lot of instances very quickly... Likewise, organizations that have efficiently consolidated workloads now face an acceleration of their tech update cycle, having to buy more systems to continue to meet their SLAs. (And the groups that did this are usually the ones where performance *really* matters...).
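The 80% point is easy to put numbers on. A minimal Python sketch, assuming the patch overhead scales CPU demand linearly (real workloads won't be that tidy), with hypothetical function names and illustrative inputs: a box at 80% utilization can absorb only 25% more CPU work before it saturates, and ten servers at 80% facing a hypothetical 30% overhead would need 13 servers to stay at 80%.

```python
import math


def patched_utilization(current_util: float, overhead: float) -> float:
    """Utilization after a patch adding `overhead` (0.25 = 25% more CPU work
    for the same load)."""
    return current_util * (1.0 + overhead)


def max_tolerable_overhead(current_util: float) -> float:
    """Largest overhead a box can absorb before hitting 100% utilization."""
    return 1.0 / current_util - 1.0


def servers_needed(servers: int, current_util: float, overhead: float,
                   target_util: float) -> int:
    """Servers required to carry the same load after patching while staying
    at or below target_util. Rounding to 9 places before the ceiling guards
    against float noise such as 13.000000000000002."""
    total_load = servers * patched_utilization(current_util, overhead)
    return math.ceil(round(total_load / target_util, 9))
```

With 50% utilization instead, the same 30% overhead fits into 9 servers at the 80% target rather than 10, which is why lightly loaded shops can shrug this off while consolidated ones can't.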

Ars Technica has a good breakdown, but I've drawn from other sources as well. It helps to have a basic understanding of how modern processors and memory allocation systems work, but that's beyond the scope of this post.

Edited because I had Meltdown and Spectre reversed.

BTW, here's a table that is being updated as anti-virus manufacturers update their code.

Meltdown is the easiest to fix, and has little to no performance impact as currently understood.

I'm fairly certain that this is not correct. This adds a great deal of overhead to kernel context switches. For gaming and standard user applications, this isn't that big a deal, because the great majority of the time is spent in userspace, with only occasional kernel calls. This means that the amount of time added to run most tasks is negligible.

However, and this is a big however, this is not true for server-oriented software, or any program that makes very heavy use of I/O in general. Every time a program hits the disk or the network card (like, say, a web server), it makes a kernel I/O call, and pays the TLB shootdown penalty. In microbenchmarks, where the program is doing nothing *but* tiny kernel calls, they're seeing performance impacts of about 50%. In other words, pre-patch, they could do about five million kernel calls a second. Post-patch, they can only do about 2.5 million.

So the performance impact of Meltdown will depend on workload. For servers, it'll probably be at least a little ouchy. For consumer computers... meh, no biggie. When we load our CPUs really heavily, it's almost always in userspace code, which isn't affected by Meltdown. As gamers, we won't really care.
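The microbenchmark numbers above translate into per-call cost like this. A minimal Python sketch of the arithmetic: 5 million calls a second is 200 ns per call, 2.5 million is 400 ns, so the patch adds roughly 200 ns to every kernel entry. The call rates for the "game" and "server" cases are made-up illustrative figures, not measurements, and real overhead varies by CPU (PCID support helps a lot) and kernel.

```python
def per_call_ns(calls_per_second: float) -> float:
    """Average cost of one call in nanoseconds, from a throughput figure."""
    return 1e9 / calls_per_second


def added_time_fraction(syscalls_per_second: float, added_ns_per_call: float) -> float:
    """Extra time per second of original work, as a fraction, assuming every
    kernel entry/exit simply becomes added_ns_per_call slower."""
    return syscalls_per_second * added_ns_per_call / 1e9


# Throughput figures quoted above for the syscall microbenchmark.
before_ns = per_call_ns(5_000_000)  # 200 ns per call
after_ns = per_call_ns(2_500_000)   # 400 ns per call
added_ns = after_ns - before_ns     # about 200 ns added per kernel call

# Hypothetical call rates, for illustration only:
game = added_time_fraction(10_000, added_ns)            # 0.002, i.e. 0.2% slower
busy_server = added_time_fraction(1_000_000, added_ns)  # 0.2, i.e. 20% slower
```

Same per-call penalty, hundredfold difference in impact: it's the call rate, not the CPU load, that decides whether you notice.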

I know that Spectre is based around speculative execution, and that the mitigation technique is 'fence' instructions, which can have a performance impact, perhaps severe. But I don't think anyone really knows what the performance impact will be, and one of the bigger problems there is that they don't even understand, yet, where the fence instructions need to go. At least so far, there's lots of assertions that the fencing can mitigate the problem, but almost no detail on exactly what needs to be done, by whom, and where.

This could, as you say, be a severe problem... but it's my understanding that this will be a severe problem too, above and beyond the overhead imposed by Meltdown. And this one probably will hurt at the consumer and gamer level, but there's nothing we can do to avoid it, because nearly all processors are affected: Intel, AMD, many ARM chips, Power, even mainframes are vulnerable.

Not everyone gets bitten, however: the Raspberry Pi series is unaffected. Go Team Tiny! (It's because their chips are slow and don't use speculative execution; in this case, their lack of speed kept them safe.)

Game servers might be affected more than games themselves:

https://www.epicgames.com/fortnite/f...

Doubling in CPU usage after applying the Meltdown patch on the server?

Yeah, I was just coming to post that.

IMAGE(https://lh6.googleusercontent.com/MwzsHRXQLVbmJ3pusNuGwn0ZQVjo9h8nRJHJhIo4d3XFqbvUYCj8EPq5jV7zeVEEcHAkraNBesbbNDW_UAlIjvw-hZBd80rKt7ZYl35nBIcfCCVyRvW5V7M7KVejv9tvVBHfgSKr)

So, yeah, the Meltdown patch can be a very big performance hit for some workloads. Epic may be able to work around the problem to some degree, perhaps by gathering I/O into bigger batches, or spreading onto more servers, but it's likely to take substantial engineering and be a real PITA.
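Batching helps because the Meltdown penalty is paid per kernel entry, not per byte. A minimal Python sketch, assuming a pipe as a stand-in for a socket; the function names are hypothetical and this is not a description of what Epic actually did. The unbatched version makes one `write()` system call per message; the batched one joins the messages in user space and makes one.

```python
import os


def _write_all(fd: int, data: bytes) -> int:
    """Write all of data to fd; return how many write() system calls it took."""
    view = memoryview(data)
    calls = 0
    while view:
        written = os.write(fd, view)
        view = view[written:]
        calls += 1
    return calls


def send_unbatched(fd: int, messages: list) -> int:
    # At least one kernel entry per message, each paying the page-table switch.
    return sum(_write_all(fd, msg) for msg in messages)


def send_batched(fd: int, messages: list) -> int:
    # Coalesce in user space first, then enter the kernel as few times as possible.
    return _write_all(fd, b"".join(messages))


if __name__ == "__main__":
    r, w = os.pipe()
    msgs = [b"player-update:%03d;" % i for i in range(100)]
    print("unbatched write() calls:", send_unbatched(w, msgs))
    print("batched write() calls:", send_batched(w, msgs))
    os.close(w)
    os.close(r)
```

The trade-off is latency: holding messages to batch them delays the first one, which is part of why this kind of rework is a real engineering job for a game server.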

Spectre I dunno about, performance-wise, but Meltdown definitely matters.

Most people aren't pushing their desktops to the degree where it matters. (A few games might, but only because they've historically been terrible about multithreading. I expect most games aren't CPU-bound, though.)

Servers, on the other hand, tend to get pushed to their limits, so Robear's focus makes sense.

I seem to be doing a very poor job of explaining that Meltdown can have a big impact on servers, but probably very little on clients. When Robear tells you that Meltdown doesn't matter, to the best of my knowledge he is just dead wrong about this. It does matter, quite a lot, at least with some workloads. (See picture above.)

Meanwhile, he's asserting big performance hits from Spectre, but everything I'm reading suggests that this is simply not known yet. Meltdown is a simple, thorough fix, with a known performance penalty. Fixing Spectre seems to be extremely ill-defined. Since they can't even precisely describe, yet, what software developers need to do to mitigate the problem, I don't think anyone can really characterize the performance impact. It's probably safe to say that programs will slow down, but will that be significant? It hasn't been so far with web browsers, but are they done mitigating yet?

From the sources I'm reading, I'm not sure anyone knows. This is a nasty problem. It may not even be reasonably fixable in software. This may be one of the 'there's always one more leak' things, like with address space randomization.

So I am the family IT person. I have already gotten calls on this. Right now I am just saying to update the antivirus, run Windows Update, and, if they use Chrome, turn on the Site Isolation option. I have not heard from anyone who does not use Chrome as a primary browser.

Does anyone have any other steps I should give, or a simple, not-super-scary explanation? I have been saying it is a big issue, but it is just the current issue and the companies are working on it.

Thanks.

Yeah, looks like I reversed the names in the article above. I've fixed it and appended a note to that effect.

Google and Amazon note that they have patched their Cloud and production systems for Spectre variant 2 without much effect on performance, as I predicted above while reversing the names.

I tried hard to get the info right, but it's confusing and complicated, so mea culpa. I didn't post it to score points, but rather hoping that as people learned more, they'd contribute what they know. I hope that can happen without things continuing to have a personal focus moving forward.

I think you're doing everything you can, Stealthpizza, at this time.

I still use Win 7 and I got a monthly security rollup patch yesterday, which was unusual since it dropped on Friday instead of Tuesday. Was that the fix?

If it contained KB4056897, then you have the latest patch for it.

The one I have is KB4056894, the monthly rollup for January. It doesn't specifically say that it contains KB4056897, but that was the only update released this month and both patches have the same description and antivirus warning. Looks like I'm set; no performance issues that I have noticed so far, but my CPU very rarely goes above 20% anyhow.

https://support.microsoft.com/en-us/...
https://support.microsoft.com/en-us/...

I would not expect perf issues on desktop systems; maybe if you're heavy into Photoshop or something else highly multithreaded.

Meltdown's performance impact is around kernel traps which force a page table flush; thread context switches are rare enough to be blips. Photoshop doesn't trap to do anything crunchy, so I don't see why it'd be materially impacted.
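If you want to see which side of that line your own workload falls on, a rough Python sketch is to time a loop that traps into the kernel against one that stays in user space, before and after patching. Assumptions: `os.getppid()` serves as the cheap system call (it goes to the kernel on Linux); the function names are hypothetical; and Python's own interpreter overhead dilutes the difference, so a C loop would show the gap much more sharply.

```python
import os
import time


def ns_per_iteration(fn, iterations: int = 200_000) -> float:
    """Average wall-clock nanoseconds per call of fn (includes Python overhead)."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def kernel_bound():
    # Asks the kernel for the parent PID each time: a cheap, side-effect-free trap.
    return os.getppid()


def user_bound():
    # Pure arithmetic: stays in user space and never traps into the kernel.
    return sum(range(20))


if __name__ == "__main__":
    print("kernel-bound ns/iter:", ns_per_iteration(kernel_bound))
    print("user-bound ns/iter:  ", ns_per_iteration(user_bound))
```

Only the kernel-bound figure should move noticeably after the patch; crunchy user-space work like image filters lives entirely on the other side.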

Ed Ropple wrote:

Meltdown's performance impact is around kernel traps which force a page table flush; thread context switches are rare enough to be blips. Photoshop doesn't trap to do anything crunchy, so I don't see why it'd be materially impacted.

Pretty much.

I didn't understand a g*D damn word of that. Except traps. Those are a thing in D&D.

I assumed Photoshop was multithreaded these days. Can't win for losing...

Robear wrote:

I assumed Photoshop was multithreaded these days. Can't win for losing...

Photoshop is unlikely to max out your CPU these days. It can use the GPU for some stuff, but I haven't observed it being CPU bound very often. Limited by memory, on the other hand...

There are some apps that I'm curious about, like 3D renderers, which certainly can max out the CPU. But I'm not sure even those would be much affected. I haven't got time to benchmark them at the moment, since the performance code for my day job is GPU-bound these days.

Robear wrote:
Cube wrote:

We supported SPARC and Solaris for years, and it was absolute hell. The GNU toolchain is a circle of hell unto itself, but Sun everything was somehow worse.

I worked with and for Sun Micro, and then the Oracle hardware side, since the '80s, and you're literally the first person I've heard say this. I've had customers upset with one thing or another, of course, but "absolute hell"? First for me. (Expressing surprise here, not disbelief; I'm sure you have good reasons for that.)

I can count the number of customers I had who are happier after leaving Sun behind on one hand with fingers left over. Most of them wish they didn't have to. But to each his own.

(I no longer work for Oracle.)

Well, I'll put up my hand in Cube's support. I've been paid to develop software in C++ using the Microsoft, Apple, GNU, and Sun toolchains, and that's my best-to-worst order. (Also SGI, but that was so long ago I don't remember the details.)

(...yeah, it's possible this may be ever so slightly off topic, sorry.)

clang's been better than MSVC for a while, IMO, unless you really need an IDE to drive for you. (And clang has great tools for that, it's just that Xcode feels like crap. I haven't found an excuse to try Rider.)