Fixing EVGA's 7 Figure Problem with FTW3 30 Series cards.

2022/09/20 09:13:11

HeavyHemi

JimBeamBlack
Sorry to barge into this thread, But I have a quick question.

I'm probably worrying for nothing, but I'm also trying to be as vigilant as I can about my investment in this 3080 so I keep an eye on the forums, and to me this is interesting about these things happening in older games, games which should not even be any sort of issue.

I decided to run the Precision X1 HWM to see what my voltages are, keep in mind I don't overclock past the factory settings of my FTW3 Ultra, I really haven't had the need to at all as this card is BLAZING fast for me.

This is what I got running P3D v5 (dx12) for about 2 hours, NOTE that my voltage while running the game was low, but while the game was paused is when the voltage spiked up to 1.081 along with the memory clock mhz? Then when I un-paused the game the voltage and memory mhz went back down again.

Is THIS unusual behavior or just normal operating procedure? You would think it would be just the opposite but I am no expert? OR am I not even looking at this correctly?

Anyways, should I worry or just go about my business? So far while gaming for hours in Cyberpunk 2077 I haven't broke 76C, and no black screens or high running fans, but I'm almost afraid to pause or alt/tab out of any of my games.

thanks

If you're not using vsync or some form of frame limiting, when paused, your FPS can reach into the hundreds or thousands as you're no longer CPU limited and your GPU will pump out as many frames as it can.

2022/09/20 11:37:33

Intoxicus

Rewire92
However, I was getting black screen high fan spinning usage when playing less demanding titles such as League or Halo MCC.

I've brought the issue up specifically with MCC and all I got was dismissive hubris as replies.

Thank you for the vindication.

2022/09/20 11:59:37

Intoxicus

Rewire92
Hey all,

I have had the displeasure of dealing with 2 broken 3090 FTW3 Ultras, and when the issue happened yet again with my 3rd card, I decided to do some deep troubleshooting. Let me lay out the experience for you.

I enjoy playing older titles, as well as the newer AAA titles from time to time. My first 3090 did great in demanding games, but was netting me a subpar overclock (+75 Core/+500 Mem). I played 100+ hours in Cyberpunk 2077 and I really had a great time. However, I was getting black screen high fan spinning usage when playing less demanding titles such as League or Halo MCC. I had to hard reset my computer to get the card to turn back on, and I did so over the next few weeks whenever I played those games, until one day, it wouldn't turn back on at all. I had no lights on the card, and a red light over one of the PCIE power pin slots.

I RMAed. The new 3090 came, and I was happy. 50 more hours of Cyberpunk, and no issues. But in League....more crashing. 6 more black screens later, the 2nd card was dead. Red light over a PCIE pin power slot, but the card still lit up. Not sure why about that.

My 3rd card came yesterday. And I knew right where to look. The reason that doctors press until it hurts is to find out where the problem is, because patients sometimes lie.

I launched League and within 3 minutes in game: black screen, high fans, no output, and the card was running at stock speeds. I called up EVGA, and I had a nice long chat with a rep while I tried to reproduce the issue. They told me the symptoms I was experiencing were "Over-Current Protections" kicking in, and also that my first RMA card had failed because of a power-related issue. They suggested I switch out my power supply (an EVGA 1200W P2) with the gold power supply I had before (EVGA 1300W G2). I played a full 16 minutes of League while on the phone with them and experienced no crashing, but a minute after we hung up, it black screened and crashed again.

Now, I dabble in overclocking quite a bit. I'm aware of how voltages can cause instability in cards, and how too much current breaks transistors and traces inside CPUs. I'd never had this issue with GPUs before, because I would always just do mild overclocks.

So I started using GPU-Z to watch my voltages while gaming on League. What did I see while playing League? Well, the card would *usually* be at 1800 MHz, using 0.8680 V while I was in game, but occasionally, the voltage would spike along with the clock speed, all the way up to 2025 MHz and 1.0810 V. Now, like I said, I don't do much "hardcore" overclocking for my GPUs, but I have used MSI Afterburner for literally 10 years. I've never seen a video card go over 1.050 V. I looked up the max safe voltage for the 3090, and wasn't able to find it using Google, but I had another solution that I knew would work.

So, I booted up Cyberpunk, since I knew I could game on that for hours on end without crashing. Max Voltage I saw in that game? 1.050V. Played fine for an hour. Then I thought to myself, let's try overclocking? So I set my power limit to 107%, with a mild OC of +75 Core and +750 Mem. 1.075 V when the game started, and 1.068 V while in game. Ok....Let's crank the overclock. +150 Core and +1500 Mem. Played fine for another hour, still max voltage in game? 1.068 V.

I then had a thought. What if the voltage curve in lower power states is messed up somehow? I set MSI Afterburner to "Force Constant Voltage" and booted up a League custom game. I was able to play the game for 35 minutes before the game crashed, so I knew I was on the right track. There were fewer voltage spikes, and fewer core MHz spikes as well. But it still crashed? Why? Well, when it finally did crash, it had gone up to 1.081 V again.

https://prnt.sc/yla8ib

The fix? Voltage curves.

https://prnt.sc/ylaaor

As you can see here, the normal voltage curve stops ramping only when the card gets up to a whopping 1.118 V on the core. Well, I'm crashing well below that, using only 150 watts with the core at 1.081 V, and I know the card is stable using 450 watts at 1.068 V, so what can I do to fix this?

https://prnt.sc/ylahyd

I set the core Mhz to plummet after 1.068 V, and since I'm not getting anywhere near those higher voltage numbers without a higher power limit BIOS, I don't need to worry about them.
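The curve edit described above can be modelled roughly like this. It is only a sketch of the idea, not an Afterburner API: the curve is assumed to be a plain list of (voltage, MHz) points, and `cap_curve`, `fallback_mhz`, and every point value except the voltages quoted in the post are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, int]  # (core voltage in V, core clock in MHz)


def cap_curve(curve: List[Point], max_volts: float, fallback_mhz: int) -> List[Point]:
    """Return a copy of the curve where every point above max_volts
    is pulled down to fallback_mhz, leaving the lower points untouched."""
    capped = []
    for volts, mhz in curve:
        if volts > max_volts:
            # Never raise a clock, only lower it.
            capped.append((volts, min(mhz, fallback_mhz)))
        else:
            capped.append((volts, mhz))
    return capped


# Hypothetical points, not read from a real card.
stock = [(0.868, 1800), (1.050, 1980), (1.068, 2010), (1.081, 2025), (1.100, 2055)]
fixed = cap_curve(stock, max_volts=1.068, fallback_mhz=1800)
for volts, mhz in fixed:
    print(f"{volts:.3f} V -> {mhz} MHz")
```

With these example numbers, the points up to 1.068 V keep their clocks and the 1.081 V and 1.100 V points fall back to 1800 MHz, so the high clocks are no longer tied to the high voltages.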

The result? I just streamed and watched a movie on my 5 monitors while playing a 2 hour custom game of League by myself. I'm going to need more testing, but I believe I've fixed the problem.

EVGA needs to adjust their voltage curves for the standard BIOS, because I believe it's breaking Voltage limits in lower power states while still attempting to go to higher core clocks. Also, while my experience here is only anecdotal, it *has* fixed my League crashing problem, so I can only assume that the voltage curve *is* the issue. The card attempts to go up to a voltage it shouldn't be at when the temperature is not low enough on the card to do so, breaking copper traces in the card with too much voltage at too high of a temperature.

While my experience here is a solitary thing, I would like to have some other people experiencing this issue chime in, and let me know if this fixes the issue. Maybe I've fixed EVGA's RMA problem with these cards, an experience that I can only assume has reached a large dollar cost figure, with how many people I've seen having the same issues.

Good luck!

EDIT: While I fix my images.

EDIT 2: For those that don't feel like reading through the entire thread, the problem was fixed by limiting my voltage for OC to 1.062V and below, and having the card run at stock speeds at any voltage above 1.062 V. Comfortably running at +120/+1250 for about 5 days with no crashes. While you obviously shouldn't have to do this sort of workaround to prevent your card from dying, I can say with confidence this solves whatever issue is causing game crashes and constant RMAs. Happy Gaming!

Edit: I had neglected to look at the images you posted at first. You have what we can maybe call a "death curve," where it clusters up to 5 voltage points for the same core clock. That's a problem. Because it always chooses the lowest voltage for a cluster like that, you'll always get frakked over when it clusters more than 2. I only got instability and crashes from it on my 3080 FTW3 Ultra Hybrid on the XOC bios. But if there are underlying issues on the 3090 design the curve clustering could be a trigger.

This is what my reliable Voltage curve I use for gaming looks like: https://imgur.com/a/MugXyT4

For the core clocks you know will be in use you want only one voltage point per core clock. Once you get below where it's likely to throttle to, it can have doubles, but not triples (sorry Bob Odenkirk, triples are not best or safe in this context).
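A quick way to check a curve against that rule, as a sketch only: it assumes you have copied the curve out by hand as (voltage, MHz) pairs, and the function name `find_clusters` and the sample points are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, int]  # (core voltage in V, core clock in MHz)


def find_clusters(curve: List[Point], max_per_clock: int = 1) -> Dict[int, List[float]]:
    """Map each core clock that appears at more than max_per_clock
    voltage points to the sorted list of those voltages."""
    by_clock: Dict[int, List[float]] = defaultdict(list)
    for volts, mhz in curve:
        by_clock[mhz].append(volts)
    return {
        mhz: sorted(volts_list)
        for mhz, volts_list in by_clock.items()
        if len(volts_list) > max_per_clock
    }


# Hypothetical curve with three voltage points stuck on 1995 MHz.
curve = [(1.000, 1965), (1.012, 1980), (1.025, 1995),
         (1.037, 1995), (1.050, 1995), (1.062, 2010)]
clusters = find_clusters(curve)
for mhz, volts_list in clusters.items():
    print(f"{mhz} MHz shared by {len(volts_list)} points, lowest {volts_list[0]:.3f} V")
```

For the part of the curve below where the card is likely to throttle to, calling `find_clusters(points, max_per_clock=2)` would tolerate doubles and flag only triples or worse.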

It takes some finagling to get those results. And quite often the curve does random stuff when I hit "apply." I have three presets that I use to get a curve like that. Each slightly different. The goal is to get it so that only one or two points move when you hit apply, that way you can correct them, and then they usually, but not always, stay put the second time you hit "apply." I should probably make a video using OBS to demonstrate this in action as my explanation is probably not very clear...

-----------------------

Those are some interesting results and data.

There are multiple known issues with the voltage curves. They'll do all sorts of weird things and can be difficult to "tame." I had a long support interaction with EVGA about this because it was causing instability due to multiple voltage points for the same core clock. Instead of each voltage increment only having a single core clock, you can get two to five different voltages for the same core clock. And it always defaults to the lowest point, which can turn an otherwise stable OC into constant crashes. And it also changes your curve in random ways when you hit apply.

I wouldn't be surprised if the Voltage Curve inconsistencies are a part of the problem.

EVGA did release a bios update that made a huge difference for me, but didn't completely fix it. And they said they can't do much more because they're limited by what Nvidia allows.

The max core voltage allowed is 1.1v btw.

Although I found that with MCC the fan ramping bug doesn't get stopped by this approach. It still happened to me, but only in specific menus. As long as I avoided those menus I would usually be fine. Whenever fan ramping started I always immediately shut the game down and rebooted. And if you watch the RPM numbers, they ramp into the millions quickly. The PWM pulses alternate between correct RPM and a progressive doubling. Every time it ramps up again the RPM doubles. And if it's actually trying to provide the power to achieve that insane RPM (into the millions very quickly!) that could be a large part of the issue. I can only say that it certainly sounds like it's trying to exceed its limits. The MCC fan ramping sounds insane; it's significantly louder and more intense than normally setting the fans to 100%.
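To put a number on how fast a progressive doubling gets into the millions, here is a small arithmetic sketch. The 3000 RPM starting value is an assumption picked for illustration, not a reading from the thread, and this only describes the reported number, not what the fan physically does.

```python
def doublings_to_exceed(start_rpm: float, limit_rpm: float) -> int:
    """Count how many doublings it takes for start_rpm to pass limit_rpm."""
    count = 0
    rpm = start_rpm
    while rpm <= limit_rpm:
        rpm *= 2
        count += 1
    return count


# 3000 RPM is an assumed starting point, not a figure from the thread.
steps = doublings_to_exceed(3000, 1_000_000)
print(steps, 3000 * 2 ** steps)
```

Under that assumption it takes only nine ramp-ups for the reading to pass one million.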

I don't understand how they thought the PSU being changed would make a difference. The PSU provides what is "requested." If the GPU is the problem, as had already been established, then changing the PSU won't help anything. Changing the PSU won't stop the GPU from demanding too much current.

Also try using MSI Afterburner to monitor stuff like fan RPM, core voltage, core clock, and power draw (both total and per connection). You can see if there's an imbalance between the power connectors and PCIe slot. And also see what's actually happening when the fan ramping strikes. MSI Afterburner's OSD is amazing when combined with HWiNFO. You can see almost everything important happening in real time and get a very precise picture of what's going on.
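If you log those readings instead of only watching the OSD, something like the following could pick out the pattern described in the first post: high core voltage while power draw is low. This is a sketch under assumptions. The field names (`core_v`, `core_mhz`, `power_w`), the 200 W cutoff, and the sample rows are made up; a real log export will have its own column names that you would need to map first.

```python
from typing import Dict, List

Sample = Dict[str, float]


def flag_low_power_spikes(samples: List[Sample], max_volts: float,
                          low_power_watts: float) -> List[Sample]:
    """Return the samples where core voltage is above max_volts
    while total board power is below low_power_watts."""
    return [
        s for s in samples
        if s["core_v"] > max_volts and s["power_w"] < low_power_watts
    ]


# Made-up log rows; only the 0.868 V, 1.068 V and 1.081 V figures echo the thread.
samples = [
    {"t": 0, "core_v": 0.868, "core_mhz": 1800, "power_w": 150},
    {"t": 1, "core_v": 1.081, "core_mhz": 2025, "power_w": 155},
    {"t": 2, "core_v": 1.068, "core_mhz": 1995, "power_w": 450},
    {"t": 3, "core_v": 1.075, "core_mhz": 2010, "power_w": 440},
]
spikes = flag_low_power_spikes(samples, max_volts=1.068, low_power_watts=200)
for s in spikes:
    print(f"t={s['t']}: {s['core_v']:.3f} V at {s['core_mhz']} MHz, {s['power_w']} W")
```

The same kind of loop could compare per-connector power columns against each other to look for an imbalance, if the log includes them.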

You'll probably be surprised to find out that you're not getting the core clocks and voltage that you set the curve to. The curve does its own thing and you rarely actually get what you set it to.

2022/09/20 12:44:02

Intoxicus

ty_ger07
It was the Rev 0.1's cheaper components, not a soldering issue. If you have a Rev 1.0, you have a lot less to worry about.
The reason EVGA can't tell you which cards were affected by the supposed soldering problem is because it was a myth from the start. 'Any of the early 3090 FTW3s (Rev 0.1) cards could have a soldering issue.' Lol. Get real!
Rev 0.1 3090 FTW3 had 50 amp power stages and a not-so-great analog voltage controller. The Rev 1.0 was upgraded to 70 amp power stages and a digital voltage controller.

edited for language….(moderator)

EVGA said it themselves.

"EVGA admits that the failure of some of its products was due to a "rare soldering issue," limited to the aforementioned early batch of RTX 3090s. More specifically, an X-ray PCB analysis of failed products revealed "poor workmanship" on soldering around the card’s MOSFET circuits."

2022/09/20 13:34:54

ty_ger07

Intoxicus
ty_ger07
It was the Rev 0.1's cheaper components, not a soldering issue. If you have a Rev 1.0, you have a lot less to worry about.
The reason EVGA can't tell you which cards were affected by the supposed soldering problem is because it was a myth from the start. 'Any of the early 3090 FTW3s (Rev 0.1) cards could have a soldering issue.' Lol. Get real!
Rev 0.1 3090 FTW3 had 50 amp power stages and a not-so-great analog voltage controller. The Rev 1.0 was upgraded to 70 amp power stages and a digital voltage controller.

TOS are you even on about.

EVGA said it themselves.

"EVGA admits that the failure of some of its products was due to a "rare soldering issue," limited to the aforementioned early batch of RTX 3090s. More specifically, an X-ray PCB analysis of failed products revealed "poor workmanship" on soldering around the card’s MOSFET circuits."

And all the other ones which failed with no soldering faults?

What I am on about is a bunch failed, and they were the inferior 0.1 ones. Not every one had soldering issues. Surely. Do you think EVGA designed a board revision and upgraded to more expensive components out of kindness? If fixing the soldering process was the solution, EVGA would still be making Rev 0.1 cards.

Think

2022/09/20 13:58:28

Intoxicus

edited by moderator

They said "EVGA admits that the failure of ***some*** of its products was due to a "rare soldering issue,""

You do understand that "some" means "not all"?

It's not the cause for *all failures*, only *some* of them.

Meaning it's not a myth.

You could paraphrase it as "Some, but not all failures are due to a rare soldering issue."

It's almost like you decided on something without paying attention to the actual facts...

2022/09/20 14:32:39

GTXJackBauer

HeavyHemi
JimBeamBlack
Sorry to barge into this thread, But I have a quick question.

I'm probably worrying for nothing, but I'm also trying to be as vigilant as I can about my investment in this 3080 so I keep an eye on the forums, and to me this is interesting about these things happening in older games, games which should not even be any sort of issue.

I decided to run the Precision X1 HWM to see what my voltages are, keep in mind I don't overclock past the factory settings of my FTW3 Ultra, I really haven't had the need to at all as this card is BLAZING fast for me.

This is what I got running P3D v5 (dx12) for about 2 hours, NOTE that my voltage while running the game was low, but while the game was paused is when the voltage spiked up to 1.081 along with the memory clock mhz? Then when I un-paused the game the voltage and memory mhz went back down again.

Is THIS unusual behavior or just normal operating procedure? You would think it would be just the opposite but I am no expert? OR am I not even looking at this correctly?

Anyways, should I worry or just go about my business? So far while gaming for hours in Cyberpunk 2077 I haven't broke 76C, and no black screens or high running fans, but I'm almost afraid to pause or alt/tab out of any of my games.

thanks

If you're not using vsync or some form of frame limiting, when paused, your FPS can reach into the hundreds or thousands as you're no longer CPU limited and your GPU will pump out as many frames as it can.

+1 Not to mention you're wasting power and dumping unnecessary heat load.

Always sync your GPU to your screen's Hz. Going above it makes no difference, since your screen can only display what it's capable of.
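For anyone wondering what a frame limiter actually does, here is a bare-bones sketch of the idea in software. It is an illustration only, not how vsync or any driver-level limiter is implemented; `run_capped` and the do-nothing `render_frame` stand-in are hypothetical.

```python
import time


def run_capped(render_frame, target_fps: float, frames: int) -> float:
    """Call render_frame `frames` times, sleeping after each call so the
    loop does not run faster than target_fps. Returns elapsed seconds."""
    frame_budget = 1.0 / target_fps
    start = time.perf_counter()
    for _ in range(frames):
        frame_start = time.perf_counter()
        render_frame()
        # Sleep away whatever is left of this frame's time budget.
        spare = frame_budget - (time.perf_counter() - frame_start)
        if spare > 0:
            time.sleep(spare)
    return time.perf_counter() - start


# A paused game is modelled here as a frame that costs almost nothing to draw.
elapsed = run_capped(lambda: None, target_fps=60, frames=30)
print(f"30 frames took {elapsed:.2f} s")
```

Without the sleep, the same loop would finish 30 trivial frames almost instantly, which is the "hundreds or thousands of FPS" situation described above.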

2022/09/20 18:45:51

ty_ger07

Intoxicus
Omg, how dense are you.

They said "EVGA admits that the failure of ***some*** of its products was due to a "rare soldering issue,""

You do understand that "some" means "not all"?

It's not the cause for *all failures*, only *some* of them.

Meaning it's not a myth.

You could paraphrase it as "Some, but not all failures are due to a rare soldering issue."

It's almost like you decided on something without paying attention to the actual facts...

Cool. And the others weren't soldering faults. So we agree.
When you ask EVGA why every card failed, they say it was a rare soldering fault. Therefore, they are telling you a myth. Either it is a rare soldering fault, or it is a common something else fault. If every one is a "rare" soldering fault, then it is obviously a myth.

2022/09/21 12:25:07

redteamgo

Intoxicus
Omg, how dense are you.

They said "EVGA admits that the failure of ***some*** of its products was due to a "rare soldering issue,""

You do understand that "some" means "not all"?

It's not the cause for *all failures*, only *some* of them.

Meaning it's not a myth.

You could paraphrase it as "Some, but not all failures are due to a rare soldering issue."

It's almost like you decided on something without paying attention to the actual facts...

do you understand that a company’s marketing statement on a product’s failure is not necessarily a fact?

2022/09/21 14:51:55

HeavyHemi

redteamgo
Intoxicus
Omg, how dense are you.

They said "EVGA admits that the failure of ***some*** of its products was due to a "rare soldering issue,""

You do understand that "some" means "not all"?

It's not the cause for *all failures*, only *some* of them.

Meaning it's not a myth.

You could paraphrase it as "Some, but not all failures are due to a rare soldering issue."

It's almost like you decided on something without paying attention to the actual facts...
do you understand that a company’s marketing statement on a product’s failure is not necessarily a fact?

The same follows for statements claimed by forum members as well... generally used in conjunction with the word 'every' meaning all was one thing or the other.
