A further update. I finally found a scenario that quickly and consistently triggers a crash, so I could do some testing. The Star Wars Battlefield beta (running on all high settings) was crashing my comp a few minutes in, every time. I set up GPU-Z to log the sensors on my card once a second so I could see what was happening. It wasn't obvious what exactly was going on right before the crash, but one pattern emerged. As others have said, the card was getting to around 80°C and the TDP % was hovering around 100, or even jumping up to 103% at times. After doing some research, I learned that TDP stands for "Thermal Design Power" and generally refers to the manufacturer's rating for how much power the card is designed to draw and dissipate safely. The fact that it was going to 103% was interesting to me. I was also shocked to see that even at 80°C, the fans were only running at 50%.
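For anyone who wants to run the same kind of check on their own sensor log, here is a rough Python sketch. It assumes the log is comma-separated text with a header row and the timestamp in the first column. The function name and the default thresholds (80°C, 100% TDP) are illustrative choices taken from the numbers above, and the column names are passed in rather than hard-coded, since the exact headers should be copied from your own log file.

```python
import csv
import io


def flag_hot_samples(log_text, temp_col, power_col, fan_col,
                     temp_limit=80.0, power_limit=100.0):
    """Return (timestamp, temp, power, fan) for samples at or over a limit.

    log_text is the full text of a comma-separated sensor log whose first
    row is a header. The column names are supplied by the caller; they are
    NOT assumed here and must match the header of your own log.
    """
    reader = csv.reader(io.StringIO(log_text))
    # Header cells are often padded with spaces, so strip them.
    header = [cell.strip() for cell in next(reader)]
    t_i = header.index(temp_col)
    p_i = header.index(power_col)
    f_i = header.index(fan_col)
    needed = max(t_i, p_i, f_i)

    flagged = []
    for row in reader:
        if len(row) <= needed:
            continue  # blank or truncated line
        try:
            temp = float(row[t_i])
            power = float(row[p_i])
            fan = float(row[f_i])
        except ValueError:
            continue  # non-numeric cell, e.g. a repeated header line
        if temp >= temp_limit or power >= power_limit:
            flagged.append((row[0].strip(), temp, power, fan))
    return flagged
```

Reading the file and passing its text in is left to the caller, so any encoding quirks in the log can be handled separately. Looking at the last few flagged samples before a crash timestamp is one way to spot a pattern like the one described above.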
So, I did a few things. First of all, I realized I had SLI disabled, so I enabled it. I also did one other thing that I think is key, and that others here have mentioned: I downloaded EVGA Precision and enabled the built-in fan curve, which is much more aggressive, going to 50% fan at around 50°C and upwards from there. I also set the max power to 100% for both cards. I then went back into the game and set everything to Ultra.
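To make the idea of a fan curve concrete, here is a minimal Python sketch of one as a piecewise-linear lookup from temperature to fan speed. Only the 50% at 50°C point comes from the description above; the other points in `EXAMPLE_CURVE` are made-up placeholders, not the actual values Precision uses, and the function name is hypothetical.

```python
# (temperature in °C, fan speed in %), sorted by temperature.
# Only the (50, 50) point is from the post; the rest are placeholders.
EXAMPLE_CURVE = [(30, 30), (50, 50), (70, 85), (80, 100)]


def fan_percent(temp_c, curve=EXAMPLE_CURVE):
    """Linearly interpolate the fan speed for a given temperature."""
    # Below the first point or above the last, clamp to the end values.
    if temp_c <= curve[0][0]:
        return float(curve[0][1])
    if temp_c >= curve[-1][0]:
        return float(curve[-1][1])
    # Find the segment containing temp_c and interpolate within it.
    for (t0, f0), (t1, f1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
```

With a curve shaped like this, the fans are already at half speed by 50°C and keep ramping as the card heats up, rather than still sitting at 50% when the card reaches 80°C.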
I played three full 10-minute matches of Battlefield with no crashes. GPU-Z showed the temperature never got above 69°C, and the TDP never hit 100%.
So here's my hypothesis. For these OC cards, the manufacturer sets the power and heat allowances higher and slaps a better heatsink on them. They do their own testing, but they can't test against our specific computer setups (how much heat we have in our systems, running in SLI, etc.). So even if a card can last at full load in their setups, it might reach an unstable level of heat/power draw in ours. I'm a bit shocked by how conservative the default fan curve is: the fans only run at 50% at 80°C, which seems a bit hot to me.
For me, so far, this combination of a more aggressive fan curve and a hard cap on the power draw through Precision seems to help greatly, even if it doesn't fully fix the issue. When I most recently complained to EVGA and asked for a non-OC card, they said they couldn't do that, but they could RMA me another one and note to the testers to be very thorough in their testing. I'm thinking this isn't so much a "manufacturing" flaw with these OC cards as it is that they are OC'ed and thus inherently less stable. It's up to the testing team to ensure that the level of OC, cooling, etc. is stable, and the cards we've been getting, combined with our comp setups (airflow, etc.), may be getting pushed over the edge.
I personally bought a factory OC card because I didn't want to deal with the trouble of making sure a manual OC was stable, but I guess that's what I'm doing anyway. Since I'm not going for bleeding-edge performance, just a little boost, increasing the fan speed and capping the power draw is a trade-off well worth making for stability.
I'll keep you all updated on whether this fixes it in other games, but since Battlefield was crashing so consistently and is now stable, I'm hoping the trend will hold.