BSODs and reboots

In the last few days my system has developed some instability. I can't think what's causing it, and I'm looking for suggestions.

It seems to be fairly random whether it'll BSOD or just reboot when doing something demanding. I've had it point at ntoskrnl.exe, dxgmm1.sys and ntfs.sys, which seems reasonably inconclusive.

I did think it was just 3D games, until I managed to get Intel burn test (very high, not normal) to crash it (pointing at ntfs.sys). Planetside2, Path of Exile and Guild Wars 2 have all managed to crash it, seemingly after random amounts of time, but you've got to be running them for a little while (no instant crashes), and GW2 did crash the moment I headed through a gate to load another zone. As far as AMD drivers go, I've had it crash with 13.1 WHQL and 13.2 b7.

memtest didn't pick up an error after letting it run for 30m/2 cycles, although that's more a case where it's able to prove a fault rather than give something a clean bill of health (you could run it for 100 cycles and it'll fail on the 101st). Monitoring/logging with hwinfo hasn't shown anything suspicious, no high temperatures or wandering voltages before a crash.
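For anyone wanting to check a sensor log without opening the whole thing in a spreadsheet, here is a minimal sketch that pulls the peak value out of each temperature column. It assumes the log is a CSV with one header row and that temperature columns carry a "[°C]" marker in their names; the column names and sample rows below are made up for illustration, so adjust them to whatever your logger actually writes.

```python
import csv
import io

def peak_values(csv_file, unit_marker="[°C]"):
    """Return {column_name: highest_value} for columns whose header contains unit_marker."""
    reader = csv.DictReader(csv_file)
    peaks = {}
    for row in reader:
        for name, raw in row.items():
            if name is None or unit_marker not in name:
                continue  # not a temperature column (or a stray extra field)
            try:
                value = float(raw)
            except (TypeError, ValueError):
                continue  # skip blank or non-numeric cells
            if name not in peaks or value > peaks[name]:
                peaks[name] = value
    return peaks

# Hypothetical sample; real logs will have different column names.
sample = io.StringIO(
    "Date,Time,CPU Package [°C],GPU Temperature [°C],Vcore [V]\n"
    "1.3.2013,20:00:01,45,40,1.05\n"
    "1.3.2013,20:00:03,75,71,1.06\n"
    "1.3.2013,20:00:05,62,,1.05\n"
)
print(peak_values(sample))
```

Passing unit_marker="[V]" instead would give the highest reading per voltage rail, although spotting a sagging rail would need the minimum rather than the maximum.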

I'm probably going to crack the case next and try reseating some things.

For my own systems, random reboots have been tied to RAM not behaving itself, or, in one case, a RAM slot frying. You might want to try shuffling stick position/temporarily eliminating one stick from the equation if possible to see if anything changes. I'd also run the system with the case cover/panel off, and make sure all the fans are running. I had one random reboot cycle caused by a stray wire blocking a fan on a video card, rebooting a Windows 7 install randomly until the culprit was tracked down.

ntfs.sys usually points to hard drive errors/failure. Do you have all your stuff backed up...?

RAM or memory in general (hard drive access) was what I was leaning towards, but nothing conclusive yet. I'm not seeing any file system errors, the cables are new. That said, I'm not 100% convinced it's a hardware failure yet.

Cross referencing with the Windows reliability index, and what I installed on the day things started crashing, the only real changes were IE10 (plus platform update) and Java (only the x64 version and no x86 browser plugins). I can try rolling those back/uninstalling, and running a malware scan. Weirder things have happened.

I did think it was just 3D games, until I managed to get Intel burn test (very high, not normal) to crash it (pointing at ntfs.sys).

Okay, well, the Intel Burn Test, on that setting, allocates most of your memory, and then beats the crap out of it. This means that the machine will usually start swapping, at least a little. Because you crashed in ntfs.sys, rather than in IBT itself, I'm leaning in the direction that you have a faulty disk.

Disk errors can and do also show up when you have faulty memory, so memory is not ruled out. But, assuming that your temps otherwise seem normal, then I'd consider disk to be the primary potential culprit, with memory the second.

Testing a disk is really a pain; you usually have to reinstall your whole OS on a fresh drive. If this isn't the problem, this is super-frustrating to go through.

So, even though I think memory is less likely, it's so easy to test that I'd start there first. Keep trying with memtest86. If you have four RAM sticks, also try running on half, and then swapping sticks in and out, and see if you can make anything change. If you only have two sticks, swap them.

If it does turn out to be memory, testing it first saves you a ton of time. If it's not the culprit, it wasn't much wasted effort.

Once you're sure that temperatures are good, and you can't make memory fail, then check the SMART values on your drive... if you see anything wacky there, that's a near-definite diagnosis. But even if you don't, swapping it out would be the next logical step. It just really, really sucks.
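As a rough illustration of what "wacky" SMART values might mean in practice, here is a small sketch that flags non-zero raw values on a few attributes commonly treated as warning signs. The attribute IDs are the widely used ones for spinning disks, but which attributes a given drive reports (SSDs especially) varies by vendor, and the readings in the example are hypothetical. It only inspects numbers you have already read off with a SMART tool; it does not query the drive.

```python
# Attributes where any non-zero raw value is usually worth a closer look.
WATCHED = {
    5: "Reallocated Sectors Count",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable Sector Count",
    199: "UDMA CRC Error Count",  # often points at cabling rather than the disk
}

def suspicious_attributes(raw_values):
    """raw_values maps SMART attribute ID -> raw value. Returns flagged (id, name, value) tuples."""
    return [
        (attr_id, WATCHED[attr_id], raw_values[attr_id])
        for attr_id in sorted(WATCHED)
        if raw_values.get(attr_id, 0) > 0
    ]

# Hypothetical readings copied by hand from a SMART tool.
readings = {5: 0, 9: 12000, 197: 8, 198: 0, 199: 0}
for attr_id, name, value in suspicious_attributes(readings):
    print(f"attribute {attr_id} ({name}) raw value {value}")
```

A clean result here does not clear the drive, which matches the point above: SMART can confirm a failing disk but cannot rule one out.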

Just did 10 runs each on IBT on normal/high/very high. I don't think at any point I've pushed it past what I had as free memory (showing 6.5GB free at idle now). Temperatures peaked at 75C. Memory/hard-drive is one angle to pursue, but I don't feel like I want to get tunnel vision-ed on it yet unless I can clearly see it failing when it's specifically doing that (more testing...), which is hard as you can't really tell it to stop doing things in the background. If it was a hard drive error I can't yet pin it on operations on any drive. I've seen nothing in SMART that looks bad.

I suppose it's a good idea to give you my spec:
i5-3570k
Sapphire 7850 OC 2GB
Gigabyte GA-Z77-DS3H
2x4GB Corsair Vengeance CMZ8GX3M2A1600C9 9-9-9-24 1600MHz
3 drives: Crucial M4 256GB SSD (OS, stuff to go fast), Western Digital Black 640GB (main storage), Samsung 1.5TB 5400RPM something-or-another (bulk storage for things I don't care about access speed)

Everything at stock speeds, in fact I don't think I've overclocked anything so far.

edit: Just done a whole bunch of memory and hard drive benchmarks in Sisoft Sandra and nothing blew up. Moving onto other areas for the moment.
edit: Add furmark and 3dmark to ones that haven't crashed yet.

edit: After a day of doing a variety of stuff, I got a BSOD after a few minutes of Blacklight Retribution - page fault in non paged area with no file associated. That suggests memory to me. I'm going to try disabling the page file (waves at Malor) and reseating the RAM in the other slots.
I do seem to remember at some point while putting the PC together that there was a period where it was powering on and not doing anything (no action, no POST beeps), which turned out to be badly seated RAM. It could be that the sockets on the mobo are a bit weak.

edit: After a bit more Blacklight, another crash, except this time it was an application crash that didn't take the system down. In eventlog it's giving exception code 0xc0000005, which google tells me is a memory access violation. I think I might do more memtest if I can get a pattern going. Planetside2... for science!

Hmm, yeah, a page fault in a non-paged area almost has to be memory. Non-paged areas don't get swapped out to disk, that's why they're called non-paged. Drive failure is still possible, in that the machine might have loaded slightly corrupt code from disk, but I'd now consider that less likely, and RAM much more probable.

75C is fine for the CPU. What temps were you showing in Furmark?

I think 75C is the highest I've seen my GPU at as well.

Right now I'm just going to keep monitoring it, as I think I need more data (more crashes) to come to any conclusions, rather than the current suspicion that it's something to do with memory. "Something to do with memory" doesn't really help me a lot unless I'm eager to ship pretty much everything back as RMAs, as memory is quite a broad reaching thing.

Leaving it for hours on end with memtest is one option, and I assume the range of tests it does is meant to stress memory, but are there any other tests you'd like to suggest that can point at a particular component? Alternatively, checks to do to ensure it's not a bad driver.

Well, it's probably just one thing; two-item crashes are possible, but obviously less likely. (and damnably difficult to troubleshoot.) But, yes, more data always helps.

75C is fine for GPU temp, as well, so it's probably not temperature or airflow that's causing the problem. I think you can tentatively rule that out. The thing that will help you fix it fastest is figuring out something that will reliably make it crash. If you can trigger the crash at will, then it will be very easy to know when it's fixed. It sounds like it's mostly games, so far, with just the one IBT failure?

I'd probably try memtest, running all day, while you're off at work or something. That tests just memory. If that doesn't fail, the next day try IBT on max settings for the day. That's a good test of both the processor and the memory. If that comes back clean, then try Furmark for several hours. If THAT comes back clean, try both at once; this will stress your power supply and motherboard power delivery circuitry.

If all those things work, but games fail, then maybe it's time to start going through drivers.

Log update:
Had a crash last night after several hours of playing Tomb Raider, with another symptom this time: it froze and garbled the video output. After a power cycle it crashed/restarted when I tried to open the hardware log. It was a large log at that point, but LibreOffice shouldn't be able to take down a system. Opening up the second half of the file has the sensors showing nothing out of the ordinary, and chkdsk found no faults. Probably because I've disabled swap, there are no BSOD memory dumps and nothing in the event log to play around with to get more clues out of.

I uninstalled IE10 and the KB2670838 platform update, as there are a few reports of it causing instability.

I'm still thinking about memory; this seems to be a thing that only messes up after the computer has been in operation for a long time. I'm wondering if there's something particular about how some tasks/games chew up the memory that leads to some being more likely to fail than others, or, keeping an open mind, something in how Windows or a driver behaves long-term.

Progress?
I think I've found a repeatable way to get it to crash - Prime95.

It crashed earlier after about 10-15m of a RAR compression operation, though it had run memtest for a good while. I've managed to get Prime95 to crash it after 1-3m of the 'blend' torture test, which is described as "tests lots of everything, lots of RAM tested". Now to test it on the other settings, each with the bias towards either the CPU or RAM, to see which one causes more trouble.

edit: Apparently 'blend' is the most stressful on memory. I managed to run the Small FFTs test (FPU stress, not much RAM) for a while without crashing, but the In-place large FFTs test (max heat, power stress, some RAM) crashes. That seems to be the surest sign so far that memory is bad.

I bought a set of 2x4GB sticks. What's the chances that both sticks will be faulty? I'm thinking my next stage of testing should be one stick at a time, if one produces errors/crashes that would be reasonably definitive, but if both produce crashes individually would that shift the attention elsewhere like the memory controller (on CPU)?

edit: Yep, looking like one bad stick of RAM. One flakes out on Prime95 blend, another is still going. I'll keep running on this one stick with an anaemic 4GB to see if it fails otherwise.

Any other tests that are worth running? Ideally I'd be looking for ones that purely stress a component at a time, and I can't think of too many that don't involve memory in some way.

We posted at the same time, see my second edit.

What's the chances that both sticks will be faulty?

Very low. I'd just stick both in there and see if the crashes go away. If they do, then swap sticks around until the crashes come back -- you'll then know which stick to send off for replacement. Once it's replaced, install all four sticks. (use them in pairs; you can run memory in single sticks, but it's a lot slower that way.)

but if both produce crashes individually would that shift the attention elsewhere like the memory controller (on CPU)?

The chance of something like that being wrong is super low. Intel chip quality is legendary. Assuming you're not overclocking, overvolting, or failing to cool the chip properly, the chance of the CPU being at fault is... well, honestly, I have never seen a bad Intel chip that wasn't fried by user abuse. I assume there must be some out there, but I've never seen anyone talking about a bum chip that they didn't cook themselves.

There could be something wrong with the motherboard, or something squirrely with the RAM sockets. As long as you're treating it correctly, though, the Intel CPU itself should be at the absolute bottom of your list for potential culprits. And even if you abuse it a little, it should STILL be at the bottom of your list. Intel CPU quality is just unreal.

Well, now that you've found a reliable method of making it crash, and you've replaced a component and made the crash go away, I think you're probably fixed.

If I knew of any other good ways to test memory, I would already have told you.

edit: well, okay, not fixed, but you've at least proven that your problem is either the RAM stick or the socket. Try swapping the bad RAM into the first socket and see if you can make it fail again. If it breaks, then you've got a definite diagnosis. Replace the chip, and you should be golden.

I guess we can add Prime95 to the list of troubleshooting tools.

edit: And amazon being awesome, they're going to do the replacement, including sending them out before I send the dodgy ones back. This is good because it means I don't have to deal with Corsair's RMA which would involve me paying for postage. It was past the 30 day returns period too.

Seems like I didn't completely solve this, or at least something isn't quite right.

I'm getting occasional data corruption of files on disc, memtest throws up errors, and Prime95 will throw up errors in testing mode, but not all the time and without crashing itself or the system. Any system crashes are irregular and infrequent, and I haven't found anything that triggers them. Over the last two weeks I've noticed the odd problem where an archive extraction may (rarely) give a CRC error, or clonezilla will error out when making an image. I could happily use the system for demanding stuff, apart from the knowledge that stuff reading/writing to the drives is at risk of corruption.
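One cheap way to catch this kind of intermittent corruption is to hash the same file several times and see whether the digest ever changes. The sketch below is an illustration only: the function names are made up, and it uses a throwaway temp file where you would point it at a large archive on the drive you suspect. Bear in mind that repeat reads are often served from the OS file cache in RAM, so a mismatch tells you something is flipping bits but not whether it was the disk, the controller or the memory.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def distinct_digests(path, passes=5):
    """Read the file `passes` times; a healthy system yields exactly one digest."""
    return {sha256_of(path) for _ in range(passes)}

if __name__ == "__main__":
    # Throwaway 4 MiB test file; substitute a real file on the suspect drive.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    try:
        seen = distinct_digests(tmp.name, passes=5)
        if len(seen) == 1:
            print("stable across all passes")
        else:
            print("MISMATCH: %d different digests" % len(seen))
    finally:
        os.remove(tmp.name)
```

Saving the digest and re-checking it after a reboot or a long gaming session is a closer match for the failures described, since those only seem to show up after the machine has been running a while.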

I'm trying to narrow down what's up, but I suspect the motherboard. It could be two sets of bad RAM, but I'd put that lower on my list of suspects: I've moved it around and the problems persist. My SATA cables (which are all fairly young) seem fine.

Testing to do, lots of testing. Any suggestions before I pester Amazon, again, for an RMA?

Yeah, you may have a two-component failure, those always suck.

I'll go back to what I posted awhile back:

But, assuming that your temps otherwise seem normal, then I'd consider disk to be the primary potential culprit, with memory the second.

You found and fixed memory errors, and you're still having trouble, but the trouble is much reduced. If you can't find any more memory errors, then I suspect it's a bad disk (likely) or a bad motherboard chip (less likely, but quite possible). What kind of disk and motherboard are they? If your motherboard has more than one kind of SATA controller, what brand are you using? (eg: my motherboard has both Intel and Marvell ports, and I'm using all Intel.)

It's definitely weird, infrequent and inconsistent. If it's a disc thing, all 3 discs are affected, but a minority of the time. One thing I'll have to check out is whether Prime95 writes to disc for intermediate results in the torture-test/self-test mode, or if it's wholly contained within memory, and whether running it within a RAMdisk changes anything. (edit: P95 in a ramdisk changes nothing)

I think memory is a hard one to pin down, because it underpins everything. Was it a bad read from/write to disc, or was it writing correctly what the RAM was holding (RAM was holding errors)?

I'm on a Gigabyte GA-Z77-DS3H, which is all intel.

More memtest: Incomplete set of results so far, but it looks like I might have another dodgy stick of RAM (Corsair Vengeance for anyone interested). I was aiming to run each stick on each socket, and while I've yet to run a test to completion, one stick will last a long time with no error, and another will produce errors in the random number test (9). I'm now running on the 'good' stick to see if I can make that produce failures.
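The each-stick-in-each-socket plan boils down to a small table, and reading it can be sketched in a few lines. This is just one way to interpret the results, under the assumption that a bad stick fails wherever it goes and a bad socket fails whatever is in it; the stick labels and socket names are hypothetical, and an intermittent fault can easily produce a table that fits neither pattern.

```python
def diagnose(results):
    """results maps (stick, socket) -> True if that run showed errors."""
    if not results:
        return "no runs recorded"
    if not any(results.values()):
        return "no errors seen; keep testing"
    if all(results.values()):
        return "every run failed; suspect something shared (board, CPU, PSU)"

    sticks = sorted({stick for stick, _ in results})
    sockets = sorted({socket for _, socket in results})

    # A stick is suspect if it failed in every socket it was tried in.
    bad_sticks = [s for s in sticks
                  if all(f for (stick, _), f in results.items() if stick == s)]
    # A socket is suspect if every stick tried in it failed.
    bad_sockets = [k for k in sockets
                   if all(f for (_, socket), f in results.items() if socket == k)]

    findings = []
    if bad_sticks:
        findings.append("suspect stick(s): " + ", ".join(bad_sticks))
    if bad_sockets:
        findings.append("suspect socket(s): " + ", ".join(bad_sockets))
    return "; ".join(findings) or "mixed results; possibly intermittent, rerun longer"

# Hypothetical results: stick A errors in both sockets, stick B is clean.
runs = {
    ("A", "DIMM1"): True,
    ("A", "DIMM2"): True,
    ("B", "DIMM1"): False,
    ("B", "DIMM2"): False,
}
print(diagnose(runs))  # prints "suspect stick(s): A"
```

The table is only as good as the runs behind it, so a "clean" cell from a test that was stopped early should be treated as unknown rather than as a pass.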

edit: So far so good.

I did a look up, and my memory isn't on the supported list for my motherboard, although a very similar one is (4x4GB instead of 2x4GB, same timings/brand).

Just as something to check, is there even the remotest chance of anything on my motherboard causing damage to RAM over time that would explain it? I'd like to get a more complete set of memtest results first, though, before diagnosing RAM.

RMAing RAM now looks like the most obvious course of action, although I think I'd ask for a different set (not Corsair) of equivalent instead of a straight replacement this time. Prices have gone up too, which is annoying.

edit: This is using memtest 5 RC1 (needed for Ivy Bridge support), which, going by the forums, might have a few bugs.

New RAM ordered with another manufacturer, Amazon have said they're going to refund the faulty stuff, and they won't do a replacement because of the risk of the same thing happening again. I guess "replacement" to them only means like-with-like.

Gesundheit.