2021/09/22 23:34:30
alexey-b
Hi guys, help me out with some advice please.
 
TL;DR
 
Today my PC stopped booting and I have a suspicion that the cause is my GPU (2060 Super SC Ultra) somehow being almost dead. A few things can boot: bios setup utility, Linux in text mode, similar things, but the PC seems to force-restart itself when the normal desktop OS is trying to change the video mode. Does this look like a GPU failure?
 
Longer version:
 
Everything went from functional to the current state within several minutes:
- First, a game that I was playing crashed and refused to restart.
- Then, I tried to reboot the PC, and the screen went blank (grey to be precise).
- Then, the PC restarted, but Windows got into a boot loop and went into the recovery utility.
- After a few attempts at the usual software fixes in recovery mode (rollback to a restore point, etc), the recovery utility stopped booting at all.
 
Now, what happens is that the PC starts booting, shows the bios logo normally, then shows the blue Windows logo and the usual dotted circle. After a second, the PC restarts itself. The same thing happens with a Windows installer USB and with different Linux live USBs. However, there are still a few things that work:
 
- Bios setup works. I tried clearing the settings, tried updating the bios. Updating seems to work correctly, but has no impact.
- Some simple utilities that boot from USB work. I launched memtest86+ and it found no errors.
- Grub menu (when I try booting Linux from a USB) works.
- Now the most interesting part. I can boot a Linux distro from a USB as long as it stays in the text mode with big letters.
 
I tried Slitaz Linux (I guess the actual distribution does not matter, just something that can optionally boot without fiddling with graphics) and it works as long as it is in the "default" text mode with big letters (not sure what's the actual resolution). The moment it tries to go to text mode with higher resolution (smaller letters) or start the graphic desktop, the PC reboots. I would sometimes see a badly drawn cursor and maybe a piece of a graphic window for a couple of seconds before reboot.
 
Now, I'm wondering, does that look like a GPU failure? Are there things to check (in bios, visually on the card, with other hardware etc) before going further (trying to get another GPU, requesting warranty service for this one, etc)? Can it be something else, perhaps motherboard?
 
Some random hardware things I tried with no effect:
- Disconnected the ssd and some optional cables (extra fans, usb ports, etc).
- Put ram in different slots.
- Took out and put back in cpu and cpu power, gpu and gpu power, mobo power.
 
Thanks
2021/09/23 02:13:23
alexey-b
I ran a few more experiments and it now seems that the graphics card is probably not to blame. I have a Gigabyte Aorus X570 Elite motherboard, which has two full size PCIe ports, the default one at the top and another at the bottom. I moved the GPU from the top port to the bottom one and the system does not reboot when changing video modes any more. Linux booted from a USB seems to work smoothly as long as I don't plug in the NVMe SSD. When I plug in the SSD, strange things start to happen. Linux booted from a USB starts taking a while to load, and when I plug the USB drive into some specific ports, it freezes completely. The keyboard does not work if plugged into specific ports. I still cannot boot from the SSD, but at least Windows loads the recovery utility for me again, just that it cannot really do anything, even "reinstall Windows" errors out without much explanation.
 
Anyway, I guess, at this point, the GPU is not to blame. Perhaps, the motherboard is somehow borked? Or could it be the CPU as well?
 
Thanks for hearing me out.
2021/09/26 08:29:24
alexey-b
Just to give an update, it's probably the motherboard's fault, and other people have been having similar issues. I moved the GPU to the bottom slot, limited PCIe to Gen 3, backed up my files from the SSD, and reinstalled the OS. After that, the system has been stable, including in Blender and games, meaning the GPU is probably completely fine. Which is quite fortunate in the current year.
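
For anyone who wants to confirm that a "limit PCIe to Gen 3" setting actually took effect, here is a rough sketch of one way to check it from Linux. It assumes a kernel that exposes the `current_link_speed` and `max_link_speed` attributes under `/sys/bus/pci/devices` (the exact text in those files varies a bit between kernel versions). The function names and the speed-to-generation table are just illustrative, not something from the posts above.

```python
import glob
import os
import re

# Per-lane transfer rate in GT/s -> PCIe generation.
GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_generation(link_speed):
    """Map a sysfs speed string such as '8.0 GT/s PCIe' to a generation number.

    Returns None when the string has no recognizable GT/s value.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", link_speed)
    if not match:
        return None
    return GEN_BY_SPEED.get(float(match.group(1)))


def read_link_speeds(root="/sys/bus/pci/devices"):
    """Return {device address: (current, maximum)} for devices exposing both files."""
    speeds = {}
    for dev in sorted(glob.glob(os.path.join(root, "*"))):
        try:
            with open(os.path.join(dev, "current_link_speed")) as f:
                current = f.read().strip()
            with open(os.path.join(dev, "max_link_speed")) as f:
                maximum = f.read().strip()
        except OSError:
            continue  # not a PCIe device, or the attribute is missing
        speeds[os.path.basename(dev)] = (current, maximum)
    return speeds


if __name__ == "__main__":
    for address, (current, maximum) in read_link_speeds().items():
        print(address, "running at Gen", pcie_generation(current),
              "of max Gen", pcie_generation(maximum))
```

Keep in mind that some devices drop their link speed when idle to save power, so a low "current" value alone does not prove the limit is applied; it is more telling when the GPU is under load.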
2021/10/02 18:31:28
ZoranC
alexey-b
Just to give an update ...



I'm glad you figured it out, thank you for sharing knowledge with the rest of us :)
2022/01/29 16:12:00
alexey-b
Just wanted to post a final update in case someone else has the same issue. The problem was that my Ryzen 3600 CPU became faulty (after working fine for almost two years). While running Linux with the GPU in the top slot and without kernel mode setting, I saw some relevant error messages on the screen and in the kernel log:
 
[ 0.637325] mce: [Hardware Error]: Machine check events logged
[ 0.637326] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
[ 0.637334] fbcon: Taking over console
[ 0.637336] mce: [Hardware Error]: TSC 0 ADDR 1ffffa81ccbde MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 0.637341] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1633397933 SOCKET 0 APIC 4 microcode 8701021
[ 0.637345] mce: [Hardware Error]: Machine check events logged
[ 0.637345] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
[ 0.637347] mce: [Hardware Error]: TSC 0 ADDR 1ffffa845f714 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 0.637350] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1633397933 SOCKET 0 APIC a microcode 8701021

 
Searching the internet for the symptom "Machine Check: 0 Bank 5: bea0000000000108" shows a few false leads, but also a couple of cases where the issue was a CPU fault. I sent a warranty request to AMD and they immediately asked me to send my CPU to them and later sent me a new CPU, which works fine so far.
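
In case it helps someone reading a similar log, here is a small sketch that pulls the CPU, bank and status word out of lines like the ones above and decodes the top flag bits of the status. It only covers the architectural flags in bits 63 to 57 of the machine-check status register; the vendor-specific fields are left alone. The function names are made up for this example, and the sample text is copied from the log above.

```python
import re

# Architectural flag bits in the upper part of a machine-check status word.
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}

MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]{16})"
)


def decode_status(status):
    """Return the names of the flag bits set in a 64-bit status value."""
    return [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


def parse_mce(log_text):
    """Extract one record per 'Machine Check' line found in kernel log text."""
    events = []
    for match in MCE_LINE.finditer(log_text):
        status = int(match.group("status"), 16)
        events.append({
            "cpu": int(match.group("cpu")),
            "bank": int(match.group("bank")),
            "flags": decode_status(status),
            "error_code": status & 0xFFFF,  # low 16 bits: MCA error code
        })
    return events


SAMPLE = """\
[ 0.637326] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
[ 0.637345] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
"""

if __name__ == "__main__":
    for event in parse_mce(SAMPLE):
        print(event)
```

For the status in my log, the UC and PCC flags are both set, meaning an uncorrected error that may have corrupted processor state. That is consistent with a hardware fault, but on its own it does not say which component is to blame.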
 
Now, I don't have proof, but I believe it's a Windows update that borked my CPU, by doing a bad microcode update or something like that. When I was trying to reboot my Windows (for the last time as it would turn out) I'm pretty sure I saw the Windows update message. Quite a coincidence.
2022/01/29 16:37:17
ZoranC
alexey-b
Just wanted to post a final update ...



That was quite a troubleshooting effort, glad to hear things worked out for you.
