System turned off, then wouldn't power on



Ok this one was odd... relatively new system...

Specs

ASUS ROG Strix Z790-A
G.Skill Ripjaws S5 DDR5-6000, 4x16GB (64GB total)
ASUS GeForce RTX 3060
Corsair HX750 PSU
Corsair H115i CPU Cooler
4x NVMe SSDs, all Samsung 980 Pro

 

This system had zero issues for a couple of months, but today I set it to do a GPU-intensive task (CUDA only). It ran for hours, then it just blipped off.

No power. I pressed the power button: nothing. I turned the PSU power switch off, let it sit for a minute, turned it back on, and pressed power: nothing.

Unplugged and replugged it: still nothing, it wouldn't turn on.

So I pulled the ATX connector and jumpered the power-on pin (PS_ON) to ground. The PSU came on, and I checked all the voltages: +12, -12, 3.3, and 5 all came back correct. I plugged it back into the motherboard, and the system came back on.

Any idea what would cause this? Strangely, the LED on the power button is now off all the time; it was on before this happened. I haven't tested the LED independently yet, but it's odd that it's off while the system is on. It's plugged into the correct header on the motherboard with the correct polarity.
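For anyone repeating the jumper test, here is a minimal sketch of what "all came back correct" can mean in numbers. It assumes the usual ATX tolerances (about ±5% on the +12 V, +5 V and +3.3 V rails, ±10% on -12 V). The readings in the usage example are hypothetical, since the post doesn't list the measured values, and the function names are made up for illustration.

```python
# Nominal rail voltage -> allowed fractional deviation.
# Assumption: standard ATX tolerances (5% positive rails, 10% on -12 V).
ATX_TOLERANCE = {12.0: 0.05, 5.0: 0.05, 3.3: 0.05, -12.0: 0.10}


def rail_in_spec(nominal, measured):
    """Return True if a multimeter reading is within tolerance for that rail."""
    allowed = abs(nominal) * ATX_TOLERANCE[nominal]
    return abs(measured - nominal) <= allowed


def check_rails(readings):
    """Map each nominal rail in `readings` to a pass/fail flag."""
    return {nominal: rail_in_spec(nominal, measured)
            for nominal, measured in readings.items()}


# Hypothetical readings, NOT the values from this thread.
example = {12.0: 12.08, 5.0: 5.03, 3.3: 3.31, -12.0: -11.6}
print(check_rails(example))
```

Keep in mind that a jumper test only checks the rails with essentially no load on them, so a pass here doesn't say how the PSU behaves with the system pulling real current.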


Sounds like something shorted. Check your motherboard for any bloated/broken caps.

It could also be a memory error.

The board specs say: 7800+(OC)/7600(OC)/7400(OC)/7200(OC)/7000(OC)/6800(OC)/6600(OC)/6400(OC)/ 6200(OC)/6000(OC)/5800(OC)/5600/5400/5200/5000/4800

So if you're running at 6000, that's still an overclock. I suggest you run it at a lower speed and see how that works.
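As a quick illustration of how to read that spec line, the sketch below splits it into speeds and flags the ones marked (OC). The spec string is copied from the post; the function name is hypothetical.

```python
# Memory support string as quoted from the board specs in this thread.
SPEC = ("7800+(OC)/7600(OC)/7400(OC)/7200(OC)/7000(OC)/6800(OC)/6600(OC)/"
        "6400(OC)/ 6200(OC)/6000(OC)/5800(OC)/5600/5400/5200/5000/4800")


def parse_supported_speeds(spec):
    """Return {speed_in_MT_per_s: is_overclock} for each entry in the spec."""
    speeds = {}
    for token in spec.split("/"):
        token = token.strip()
        is_oc = token.endswith("(OC)")
        value = token.replace("(OC)", "").rstrip("+")
        speeds[int(value)] = is_oc
    return speeds


speeds = parse_supported_speeds(SPEC)
highest_native = max(s for s, oc in speeds.items() if not oc)
print("6000 is an overclock:", speeds[6000])
print("Highest non-OC speed:", highest_native)
```

By this listing, anything above 5600 counts as an overclock on the board, XMP profile or not.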

On 13/05/2023 at 13:51, Mindovermaster said:

Sounds like something shorted. Check your motherboard for any bloated/broken caps.

That also could be a memory error..

The board specs say: 7800+(OC)/7600(OC)/7400(OC)/7200(OC)/7000(OC)/6800(OC)/6600(OC)/6400(OC)/ 6200(OC)/6000(OC)/5800(OC)/5600/5400/5200/5000/4800

So if you said 6000, that's still an overclock. I suggest you run it at lower overclock and see how that works..

The weird thing is XMP had it at 6000 for months with no issues. It's been working fine since it strangely came back on after testing the voltages...

Does anyone know if this mobo has resettable fuses (fuses that don't physically blow but trip under fault conditions, then reset on their own once they cool down)? That's kind of what it felt like, because it was probably 10 minutes after it shut off that it just turned on with no issues.

On 13/05/2023 at 11:49, neufuse said:

this system had zero issues for a couple months now, but today I set it to do a GPU intensive task (CUDA only) it ran for hours, then it just blipped off

My system did the exact same thing as yours just a couple of days ago. I was encoding a full UHD Blu-ray to HEVC (H.265) with the 3060 Ti (NVENC) through HandBrake. It completed the task, but just as I would normally have seen HB hit 100% and then show "Encode Finished", she green-screened on me for a second and shut off. It wouldn't fire back up until I unplugged the PSU for about five minutes... and it apparently shut down faster than it could write a kernel crash dump to disk.

My setup is similar to yours after a fashion... I use the XMP profile to run the RAM at 3200MHz (AM4 here vs. your Intel Z790)... and I'm guessing the video card may have been running near peak capacity. I had also just updated the drivers, as Nvidia has released quite a few recently. I didn't really do any troubleshooting because, like yours, the system had been purring like a kitten since the first time it was turned on.

I would offer a guess that it could very well have been a heat issue. Voltages being what they are, perhaps it ran at peak longer than it had before?

I think AIDA still has a burn-in test... I almost never recommend it because it's there to test the ultimate boundaries of your silicon, and replacing parts isn't cheap. If you can monitor voltages and heat levels, and you're willing to, set up the same scenario and see if it completes the task without shutting down.

 

Edit: According to your mobo manual, she doesn't come with any fuses.

Edited by xMorpheousx416
added info

Been running Prime95 at max heat for a few hours now with no issues... Then I ran a very heavy computation on CUDA, and after 15 minutes the system started stuttering and then shut off again... Starting to think it's the PSU or GPU?

I'm currently running Nvidia driver 531.61 (Studio version).
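Since the machine dies without leaving a crash dump, one option is to log GPU telemetry to disk while the CUDA job runs, so the last samples before the shutdown survive. This is only a sketch: it assumes `nvidia-smi` is on the PATH, uses its `--query-gpu` CSV output, and the function names and log path are made up for illustration.

```python
import os
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["timestamp", "temperature.gpu", "power.draw",
                "utilization.gpu", "clocks.sm"]


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a {field: value} dict.

    Values stay as strings because nvidia-smi may report "[N/A]".
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError("unexpected sample: %r" % line)
    return dict(zip(QUERY_FIELDS, values))


def log_gpu(path, interval_s=1):
    """Append one sample per interval to `path`, syncing after each line
    so the data is on disk even if power drops a moment later."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits",
           "-l", str(interval_s)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(path, "a") as out:
        for line in proc.stdout:
            out.write(line)
            out.flush()
            os.fsync(out.fileno())


# Usage (hypothetical path): start this, then launch the CUDA job.
# log_gpu("gpu_log.csv")
```

After a shutdown, the tail of the log shows the temperature, power draw and clocks in the last second or so, which helps separate a thermal problem from a power one.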


Temps seem to be ok under load for a while, this is with prime95 running at full heat mode

image.thumb.png.263c5a74823a468c096be699b932f1ef.png

No thermal throttling on any of the cores

image.thumb.png.77eecc6074a086e331df0603fd14ce9a.png


GPU doesn't seem like anything wild when CUDA usage is at 100% for a while

image.thumb.png.28292c65e0146352a76870369be6b1d7.png


I am seeing this error a lot in the System log now:

Log Name:      System
Source:        nvlddmkm
Date:          5/13/2023 5:18:28 PM
Event ID:      0
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      DT-23-CUDANODE1
Description:
The description for Event ID 0 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

\Device\Video3
Error occurred on GPUID: 100

The message resource is present but the message was not found in the message table

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="nvlddmkm" />
    <EventID Qualifiers="0">0</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2023-05-13T21:18:28.4952725Z" />
    <EventRecordID>7365</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="776" />
    <Channel>System</Channel>
    <Computer>DT-23-CUDANODE1</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\Video3</Data>
    <Data>Error occurred on GPUID: 100</Data>
    <Binary>00000000020030000000000000000000000000000000000000000000000000000000000000000000</Binary>
  </EventData>
</Event>
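With this many of these events, pulling the useful fields out of the exported XML is easier than reading each one. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample is a trimmed copy of the event above, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace used by Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the nvlddmkm event posted above.
SAMPLE_EVENT = r"""
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="nvlddmkm" />
    <EventID Qualifiers="0">0</EventID>
    <Level>2</Level>
    <TimeCreated SystemTime="2023-05-13T21:18:28.4952725Z" />
  </System>
  <EventData>
    <Data>\Device\Video3</Data>
    <Data>Error occurred on GPUID: 100</Data>
  </EventData>
</Event>
"""


def parse_system_time(value):
    """Parse the UTC SystemTime; Windows writes 7 fractional digits,
    so anything past microseconds is dropped."""
    stamp, _, frac = value.rstrip("Z").partition(".")
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    micro = int(frac[:6].ljust(6, "0")) if frac else 0
    return parsed.replace(microsecond=micro, tzinfo=timezone.utc)


def summarize_event(xml_text):
    """Return provider, event ID, level, UTC time and data strings."""
    root = ET.fromstring(xml_text.strip())
    system = root.find("e:System", NS)
    time_attr = system.find("e:TimeCreated", NS).get("SystemTime")
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "level": int(system.find("e:Level", NS).text),
        "time_utc": parse_system_time(time_attr),
        "data": [d.text for d in root.findall("e:EventData/e:Data", NS)],
    }


print(summarize_event(SAMPLE_EVENT))
```

Note that SystemTime is UTC (21:18:28Z here), while the Date line at the top of the event shows local time (5:18:28 PM), so the two are the same moment four hours apart.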

 


 

image.thumb.png.4ade32903cff0b15a4e23fee2c236348.png

image.png


Looks like there were also about 40 of these warnings before this happened:

 

A corrected hardware error has occurred.

Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)

Primary Bus:Device:Function: 0x0:0x1:0x0
Secondary Bus:Device:Function: 0x0:0x0:0x0
Primary Device Name:PCI\VEN_8086&DEV_A70D&SUBSYS_88821043&REV_01
Secondary Device Name:

 

Obviously it's from an Intel device (VEN_8086). I'm not sure what device A70D is; nothing came back when I looked it up.

Only reference I found so far is Intel(R) PCIe RC 010 G5 - A70D

Gen 5 PCIe bus controller?

 

Edit: found it

PCI Express root port is A70D
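The device name in that warning can be picked apart mechanically. Here is a rough sketch that splits a PCI hardware ID into its vendor, device, subsystem and revision fields. The small vendor table is an assumption added for illustration (8086 is Intel, 1043 is ASUSTeK, 10DE is NVIDIA), and the function name is made up.

```python
import re

# Layout: PCI\VEN_vvvv&DEV_dddd&SUBSYS_ssssnnnn&REV_rr
# (ssss = subsystem ID, nnnn = subsystem vendor ID)
HWID = re.compile(
    r"PCI\\VEN_(?P<vendor>[0-9A-F]{4})"
    r"&DEV_(?P<device>[0-9A-F]{4})"
    r"&SUBSYS_(?P<subsys>[0-9A-F]{8})"
    r"&REV_(?P<rev>[0-9A-F]{2})",
    re.IGNORECASE,
)

# Tiny lookup table for illustration only, not a full PCI ID database.
KNOWN_VENDORS = {"8086": "Intel", "1043": "ASUSTeK", "10DE": "NVIDIA"}


def decode_hwid(hwid):
    """Split a PCI hardware ID string into its component fields."""
    match = HWID.fullmatch(hwid.strip())
    if match is None:
        raise ValueError("not a PCI hardware ID: %r" % hwid)
    subsys = match["subsys"].upper()
    vendor = match["vendor"].upper()
    return {
        "vendor": vendor,
        "vendor_name": KNOWN_VENDORS.get(vendor, "unknown"),
        "device": match["device"].upper(),
        "subsystem_id": subsys[:4],
        "subsystem_vendor": subsys[4:],
        "subsystem_vendor_name": KNOWN_VENDORS.get(subsys[4:], "unknown"),
        "revision": match["rev"].upper(),
    }


print(decode_hwid(r"PCI\VEN_8086&DEV_A70D&SUBSYS_88821043&REV_01"))
```

So this is an Intel device on an ASUS board. The Bus:Device:Function of 0:1:0 is the root port itself; whatever is plugged in downstream of it (quite possibly the GPU slot, though that's an assumption worth checking in Device Manager's "view by connection") is the link reporting the corrected errors.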


Not sure what is up with the forum here but every time I posted something it duplicated content instead of merging it (see the post above)


On 13/05/2023 at 21:07, neufuse said:

Not sure what is up with the forum here but every time I posted something it duplicated content instead of merging it (see the post above)

Might have been in the latest updates, idk..


Gut is telling me the chipset driver doesn't care for the graphics driver, or vice versa.

If you have the latest chipset, give the game driver 531.79 a run for its money. Just throwing this out there, cuz you're one of Neowin's elite users and hardly need my advice, lol... but I'd recommend using DDU to clean up the graphics, cut off your net connection during the process until you get the newest driver installed.. that way MS doesn't try to install the 400 series drivers in the background.

Hopefully, it's one of the two. If not.. I'd be leaning towards a bad chipset or GPU.

 

Edit: You mentioned you think it could be the PSU? Have we tried an online PSU calculator to see what your system is drawing/compared to PSU output?

Edited by xMorpheousx416

Odd update: the power LED works today out of nowhere; it would not come on at all yesterday.

On 13/05/2023 at 23:22, xMorpheousx416 said:

Gut is telling me the chipset driver doesn't care for the graphics driver, or vice versa.

If you have the latest chipset, give the game driver 531.79 a run for its money. Just throwing this out there, cuz you're one of Neowin's elite users and hardly need my advice, lol... but I'd recommend using DDU to clean up the graphics, cut off your net connection during the process until you get the newest driver installed.. that way MS doesn't try to install the 400 series drivers in the background.

Hopefully, it's one of the two. If not.. I'd be leaning towards a bad chipset or GPU.

 

Edit: You mentioned you think it could be the PSU? Have we tried an online PSU calculator to see what your system is drawing/compared to PSU output?

But would a chipset driver prevent a system from even powering on? I'd say no, since drivers load well after POST. The motherboard was 100% dead: not even the status or power LEDs on the board came on. It was just confusing. There are 4 or 5 POST LEDs on this board and none of them came on until I disconnected the PSU and did a manual voltage test, which was fine, and then it just worked after that... just odd.


On 14/05/2023 at 14:08, neufuse said:

but would a chipset driver prevent a system from even powering on,

Only if it were a physical problem with the chip(set).

However, if we believe neither to be the case... the last finger still points to your PSU. Or, the fact that you unplugged it, plugged it back into the board.. may have been the tight connection it was looking for.

Got another PSU to test against, using the same scenario?


On 15/05/2023 at 01:26, xMorpheousx416 said:

Only if it were a physical problem with the chip(set).

However, if we believe neither to be the case... the last finger still points to your PSU. Or, the fact that you unplugged it, plugged it back into the board.. may have been the tight connection it was looking for.

Got another PSU to test against, using the same scenario?

Got many PSUs; the problem is this happened once and hasn't again. The odd thing, which just adds to the confusion, is that the power LEDs (there's one on the motherboard plus the case LED on the LED header) didn't come on for a day after the system came back, then randomly started working again... I'm just stumped.


I've never believed in the "well, it's new, so it must work" idealism.

But... if it's not broken, why fix it? Easy... it drives us nuts when we see a glitch in the matrix and can't figure out what the heck is happening. :D

It could just be a faulty sensor*, LED, BIOS glitch.. or any of the hundreds of capacitors/resistors in between the PSU and that LED. Could have been a loose connection at first, and you fixed it when you plugged things back in... but without a signal tracer, it's tough to track down what's not up to par on the circuit board.

Too many metaphors for the day, but let sleeping dogs lie... if you can't replicate the problem now, she's probably stable enough to keep right on running. Unless they're using those LEDs as part of the circuitry, and not just an indicator light.. it should be okay.

 

* I had a thermistor go bad in one power supply, and the fan wouldn't kick on when under load.. you could smell it getting hot and that was the only indicator to let things cool down. I ended up putting a different fan in it, and plugged that fan directly into the motherboard headers and let it run full speed.


Just as a note, I've RMA'd at least 10x as much G.Skill memory as any other brand I've purchased. In the past I've had the RAM overheat and cause massive corruption and system shutdowns.


On 15/05/2023 at 15:13, Kelxin said:

Just as a note, I've RMAd at least 10x as much GSKILL memory than any other brand I've purchased.  In the past I've had the ram over heat and cause massive corruption and system shutdowns.

Funny, I've only had to RMA one stick. It was DOA. Since then I've had no problems. Since ~2005, I've always used their Ripjaws line.


On 15/05/2023 at 16:13, Kelxin said:

Just as a note, I've RMAd at least 10x as much GSKILL memory than any other brand I've purchased.  In the past I've had the ram over heat and cause massive corruption and system shutdowns.

I've never had to RMA RAM, video cards though yes, yes and yes again lol


It did it again tonight, randomly... power just blipped off and it would not turn on, with zero volts at the motherboard on +12, +5, and +3.3. Once again I unplugged the PSU, jumpered power-on to ground, and the PSU started right up with all correct voltages, and the system worked again when reconnected to the PSU. I don't get it: is it the PSU or the motherboard? It's an HX750 PSU and the system maxes out at 375 watts under load, so it shouldn't be overloading. Going to have to put another PSU on tomorrow and monitor.
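The headroom arithmetic behind "it shouldn't be overloading" is simple enough to write down. This sketch uses the 375 W figure from the post and the HX750's 750 W rating; it assumes 375 W is the sustained peak draw, and the function names are hypothetical.

```python
def load_fraction(draw_watts, psu_rated_watts):
    """Fraction of the PSU's rated output being used."""
    return draw_watts / psu_rated_watts


def headroom_watts(draw_watts, psu_rated_watts):
    """Remaining rated capacity in watts."""
    return psu_rated_watts - draw_watts


draw, rating = 375, 750  # figures from the post
print("Load: {:.0%}".format(load_fraction(draw, rating)))
print("Headroom: {} W".format(headroom_watts(draw, rating)))
```

A sustained load of about half the rating leaves plenty of margin, though an averaged reading like this won't show very brief spikes in draw or a fault on a single rail.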


Well, if your spare PSU does the same thing, you know it's your board..

On 15/05/2023 at 20:31, neufuse said:

it's an HX750 PSU, system is maxing out at 375 watts under load so it shouldn't be overloading...

If all its internal circuitry is working properly. Remember my thermistor story?

 

On 15/05/2023 at 20:59, Mindovermaster said:

Well, if your spare PSU does the same thing, you know it's your board..

Yup. 

 

On 16/05/2023 at 01:00, xMorpheousx416 said:

If all its internal circuitry is working properly. Remember my thermistor story?

 

Yup. 

 

Thermistor story? No, I don't remember hearing that one.


Well, it may be my graphics card...

I moved it to another system that has a 1000 watt PSU, ran the same CUDA computations, and boom, it went off too. Same issue: the computer would not turn back on. I tried something different this time and pulled the GPU while it wouldn't turn on, and boom, it came on. I let the GPU sit out for a while, put it back in, and it worked again.
