EVGA

Low Memory Benchmark Woes

ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/05/03 21:07:34 (permalink)
TiN_EE
What I can add is that most of this memory access/routing/operation behavior is not specific to the X299 Dark, but applies to a given platform as a whole. If you look at the performance of one board with chipset X and CPU A, then you will see the majority of all other boards with chipset X and CPU A doing the same thing, given everything runs at the same clocks/settings.

 
Logic implies that is the pattern one should see, but I didn’t have data that would tell me whether that is what I should be seeing.
 
However, just because we have, I believe, established that there can be some drop in some benchmarks when going from two to four memory channels, that doesn’t yet mean everything is confirmed as kosher and performing to its full potential (in my setup at least).
 
The reason I say this is one of the results I have seen but haven’t reached the point of posting about.
 
To be specific: the Spreadsheet part of the PCMark10 test. It showed a 20% performance drop when going from 2 to 4 channels, while all other parts of PCMark10 showed practically no change (well, within a few %).
 
Observation of its execution indicates the spreadsheet part is yet another single-threaded “benchmark” whose result is linearly proportional to the speed of the “CPU” (when I increased frequency by X%, the result would increase by practically the same amount).
 
Because the clock speed was the same in the 2ch and 4ch tests, that, I believe, indicates the only reason for the 20% drop is something that happens when going from 2 to 4 channels.
 
As I experimented I tried hyperthreading off, and the figures for the spreadsheet part went UP, to the point where the 4-channel results with HT off were higher than the 2-channel ones (with HT either off or on)!!! All other PCMark10 results didn’t change, neither up nor down, when turning hyperthreading on/off.
 
I repeated a few other tests and naturally for some of them results dropped when turning off HT. Cinebench R20 MT was ~20% slower (which is actually not that much for cutting the number of “CPUs” in half).
 
Results of y-cruncher were a slightly mixed bag. Turning off HT actually improved the results of the 2-channel run. Not significantly, but considering it had half as many “CPUs” to work with, that is, I think, huge. The 4-channel run was slightly slower with HT off than on, but again, considering it had half as many “CPUs”, I believe that is still great.
 
What are your thoughts after all of this? Is it possible that hyperthreading is one of the contributors to the big drops I see in some tests, and if so, why? To me, as an uneducated layman, the first thing that comes to mind when I think “hyperthreading … (memory) performance improvement with HT off / drop with it on” is “security mitigations for side-channel attacks”, but I am now in way over my head and would love thoughts from someone who knows far better.
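 
To make the question concrete, here is a minimal sketch (in Python, not any of the benchmarks above) of the kind of probe I imagine could isolate sibling contention: time a single-threaded kernel alone on one logical CPU, then again while a spinner occupies what I assume is its HT sibling. It assumes psutil is installed and that logical CPUs 0 and 1 share a physical core, which is the common enumeration but should be verified against the actual topology:
 
import multiprocessing as mp
import time
import psutil

def kernel(n=5_000_000):
    # single-threaded integer workload standing in for the benchmark
    s = 0
    for i in range(n):
        s += i * i
    return s

def spin():
    while True:      # keep the sibling logical CPU fully busy
        pass

def timed(label):
    t0 = time.perf_counter()
    kernel()
    print(f"{label}: {time.perf_counter() - t0:.3f} s")

if __name__ == "__main__":
    psutil.Process().cpu_affinity([0])         # pin this process to logical CPU 0
    timed("core to itself")
    sib = mp.Process(target=spin, daemon=True)
    sib.start()
    psutil.Process(sib.pid).cpu_affinity([1])  # assumed HT sibling of CPU 0
    time.sleep(0.5)                            # let the spinner settle
    timed("sharing the physical core")
    sib.terminate()
 
If the second timing is much worse, that would show plain resource sharing on the core, before any mitigation effects even enter the picture.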
#91
xuqi99
New Member
  • Total Posts : 41
  • Reward points : 0
  • Joined: 2015/04/30 07:06:29
  • Location: Australia
  • Status: offline
  • Ribbons : 3
Re: Low Memory Benchmark Woes 2020/05/04 02:16:55 (permalink)
"I repeated few other tests and naturally for some of them results dropped when turning off HT. Cinebench R20 MT was ~ 20% slower (which is actually not that much for cutting number of CPUs in half).
 
Results of y-cruncher were slightly mixed bag. Turning off HT actually improved results of 2 channel run. Not significantly but considering it had two time less “CPUs” to work with that is, I think, huge. 4 channel run was slightly slower with HT off than on but again I believe, considering it had twice less CPUs, that is still great."
 
Your understanding of HT and its operation is flawed. Whilst you see 2x logical cores with HT enabled, it's still the same # of physical cores.
 
https://www.youtube.com/watch?v=wnS50lJicXc
 
post edited by xuqi99 - 2020/05/04 02:19:14
#92
TiN_EE
Yes, that TiN
  • Total Posts : 377
  • Reward points : 0
  • Joined: 2010/01/22 21:30:49
  • Location: xDevs.com
  • Status: offline
  • Ribbons : 14
Re: Low Memory Benchmark Woes 2020/05/04 08:58:46 (permalink)
I am not a software or driver engineer, so I don't know a lot about how threading, and HT in particular, works at the software level, but so far your findings are in line with what is known in the community: HT does not always improve performance; sometimes it works the opposite way. There are many applications, especially older ones, that scale only to a limited number of cores, and beyond that don't scale at all. And if that limit is below your actual number of real CPU cores, then you get into an interesting situation with HT.
 
For example, from 3D benchmarking: there is an app that scales to 32 cores only. Meaning if you have 64 cores, half of the cores would not be utilized at all.
Now if you have an 18-core CPU with HT OFF, all cores would be used, each giving the best it can for 18-thread operation. Nice predictable scores, no problems. Now take the same 18-core CPU with HT ON. Well, now each core is "interleaved" with an HT logical core. You would think the benchmark will use 18 real cores + 14 extra HT cores, but instead the Windows scheduler may load only 16 real cores and 16 HT cores, leaving 2 real cores + 2 HT cores unutilized. And because all of this is dynamic, you get big variations in the benchmark, and performance could be worse than an actual 16-core CPU with HT ON ;)
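 
To put rough numbers on that situation, a toy model (illustrative only: the 32-thread cap is from the example above, and the ~30% HT uplift per doubled-up core is an assumed figure, not a measurement):
 
def effective_cores(physical, ht_on, thread_cap, ht_gain=0.30):
    # crude model: a core running two threads does (1 + ht_gain) of one thread
    logical = physical * 2 if ht_on else physical
    threads = min(thread_cap, logical)
    if not ht_on:
        return min(threads, physical)
    # worst case from the example: the scheduler pairs threads onto cores
    # instead of spreading one thread per physical core first
    pairs = threads // 2
    return pairs * (1 + ht_gain)

print(effective_cores(18, ht_on=False, thread_cap=32))  # 18.0 -- all 18 cores busy
print(effective_cores(18, ht_on=True,  thread_cap=32))  # 20.8 -- only 16 physical
                                                        # cores used, no better than
                                                        # a 16-core CPU with HT on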
 
And not all loads scale the same way. Rendering scales great, because it's easy to split the work into independent pieces and have each CPU work on its bit. But it's memory intensive. Computation often scales well too, and may not be as memory hungry. But other apps that must use results from a previous computation, more like a serial chain of compute, cannot scale to many cores well, so you'd see a different picture and performance footprint.

If you have a question, please post in the public forum. I do not reply to PMs, so everyone in the community can benefit from the answer.
#93
ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/05/04 18:47:58 (permalink)
xuqi99
Your understanding of HT and its operation is flawed. Whilst you see 2x logical cores with HT enabled, it's still the same # of physical cores.

 
It seems you’ve read my post too quickly and jumped to the assumption that my understanding is flawed. I know that HT does not add actual cores, only logical CPUs 😊 That is why I put quote marks around the word “CPU”.
 
So I know that turning on HT doesn’t add real “CPUs” of the same power. One point I was trying to make is that I was both disappointed and happy to find their contribution to some of the results wasn’t higher.
 
Disappointed for obvious reasons. However, that might’ve been a bit unrealistic. I forgot to take into account which scenarios are the best fit for HT, which ones will give the highest gains and which ones won’t.
 
The biggest gains will be in what I call “random non-uniform workloads” that do not require syncing / are not competing for the same resource in the pipeline at the same time. Like lots of different tasks all doing different things and/or the same thing in a non-uniform manner (a server doing one type of operation for one user, a different one for another, and so on).
 
However, if the same shared resource is reached for over and over again in the same manner, then HT’s contribution can end up low(er), especially if the bandwidth of part of the pipeline is already (nearly) saturated and one takes into account the “scheduler’s” overhead.
 
That is where I felt the “happy” part might kick in: getting a CPU with a higher number of cores and turning HT off (or tuning CPU affinity for apps) might mean getting the same or better performance, with better predictability, at lower overall temperatures (which of course will require analysis of what works best for the apps -I- utilize, but that is a different story).
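 
As a concrete illustration of that affinity idea (an illustration only, not something I have deployed): a small psutil-based helper that restricts a process to one logical CPU per physical core. It assumes HT siblings are numbered adjacently (CPUs 0/1 on one core, 2/3 on the next), which is common but worth confirming per system:
 
import psutil

def pin_to_one_thread_per_core(pid):
    # build a mask with one logical CPU per physical core,
    # assuming adjacent sibling numbering (0/1, 2/3, ...)
    physical = psutil.cpu_count(logical=False)
    mask = [i * 2 for i in range(physical)]
    psutil.Process(pid).cpu_affinity(mask)
    return mask

# e.g. pin_to_one_thread_per_core(some_pid) before starting a benchmark run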
 
The more important point I was trying to make is that I was not expecting to see noticeable drops with HT on / gains with it off (and I am not talking about just the spreadsheet part of PCMark10).
 
The Prime95 bandwidth benchmark of the 8064K FFT dropped 27% and the FFT timings benchmark dropped 25% when using HT.
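 
Back-of-the-envelope on why that FFT size stresses memory, under my assumption that “8064K” means 8064×1024 double-precision elements:
 
fft_len = 8064 * 1024                  # "8064K" FFT, assumed one double per element
working_set_mib = fft_len * 8 / 2**20  # 8 bytes per double
print(f"{working_set_mib:.0f} MiB")    # ~63 MiB, several times the 19.25 MiB L3
                                       # of an i9-10900X, so every pass hits DRAM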
 
Can the performance of an app in theory go down when HT is used even though everything is kosher? Yes, but should it go down by this amount? Of that I am not sure.
 
What I do know is that whenever I would (in environments I worked in) see a noticeable performance drop that would go away after turning HT off, it was almost always after OS/hardware security patches.
 
More on this in my next post to TiN_EE …
#94
ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/05/04 18:54:41 (permalink)
TiN_EE
There are many applications, especially older ones, that scale only to a limited number of cores, and beyond that don't scale at all. And if that limit is below your actual number of real CPU cores, then you get into an interesting situation with HT.

And not all loads scale the same way.

 
That is true, so the PCMark10 spreadsheet part on its own wasn’t something I placed high value on.
 
The y-cruncher I tried is (so I hear) supposed to scale well with the number of threads (Anandtech articles show a high increase in performance when it multithreads, even with thread counts much higher than mine), but it didn’t scale. By then I started thinking, “Why didn’t it scale when I hear it should? What might be holding it back? Well, at least it didn’t drop, but …” It is possible I tested incorrectly, so I will have to RTFM better and repeat that test, and if it is still not scaling up with HT on, reach out to the author for comment.
 
It is the Prime95 benchmark drop, though (please see my previous post, the reply to xuqi99), that really got me thinking. Should it be scaling up instead of dropping? I don’t know. Is it “normal” that it drops? I don’t know, maybe not, maybe yes. But even if yes, is a 27% drop induced by HT a normal amount? I couldn’t find an answer to that, so I will have to reach out to the P95 community.
 
In the meantime, Anandtech articles indicate the DigiCortex benchmark is both memory intensive and scales really well with hyperthreading on, so I will have to look into setting it up. And if DigiCortex too shows no scaling, or a drop, then we have a definite “something here is off” …
#95
ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/05/05 18:01:13 (permalink)
… let me continue from the point we were at before we digressed.
 
After IMLC I started looking at “real world” benchmarks. Blender, Cinebench R20, Corona 1.3, POV-Ray (both single and multi CPU) … all showed no practical difference when going from 2ch to 4ch; results neither scaled up nor dropped.
 
Next was PCMark10, and that one has already been discussed. Only one of its tests (the spreadsheet part) showed a performance drop when going from 2 to 4 channels, and that one seems to be influenced by hyperthreading.
 
After that was y-cruncher, which was already discussed. It scaled up when going from 2 to 4 channels, but figures with HT on were not higher than with it off, even though the info I have says they should’ve been. So that one will need a revisit to double-check HT.
 
The Prime95 benchmarks will need a revisit too, and a check with the P95 community, as they had a 25+% drop with HT on.
 
That brings me to the last, but not least, SPECworkstation benchmark suite …
 
{… to be continued …}
 
#96
ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/05/11 19:45:40 (permalink)
I’ve run the complete SPECworkstation suite (except for, of course, the storage benchmark). The results are:
 
The majority of tests showed no practical difference in performance between 2 channels and 4 channels with hyperthreading on. When they did drop going from 2 to 4 channels, the drop was usually around 1%, a small number were in the 2% range, and only one (srmp) was 4%. We can say that the average of the drops bigger than the 2% measurement error was 4%, from a single data point.
 
On the other hand, the number of benchmarks that showed a performance increase when going from 2 to 4 channels was much bigger: 19 of them had a gain bigger than 2%, and their average was 29.4%.
 
I’ve also re-run the SPEC suite with hyperthreading off (4 channels). A good number of results showed neither scaling up nor down, and I didn’t have time to check whether those are the single-threaded ones or not. Several of them dropped more than 2%, with the worst one (Poisson) dropping 31% and the second worst (fsi Binomial) dropping 12%, even though both are multi-threaded. The remaining “drops” had an average drop of 4.5%.
 
On the other hand, the number of benchmarks that showed a performance increase with hyperthreading was much higher. Some scaled up really well, some not so well, and the average gain of those with gains bigger than 2% was 25.9%.
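 
For anyone wanting to redo such a tally, a minimal sketch of how it could be scripted (not my actual process; “results.csv” and its column names are hypothetical stand-ins, one row per sub-benchmark with its 2ch and 4ch scores):
 
import csv

deltas = []
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):   # columns: name, score_2ch, score_4ch
        pct = (float(row["score_4ch"]) / float(row["score_2ch"]) - 1) * 100
        deltas.append((row["name"], pct))

gains = [d for _, d in deltas if d > 2]    # beyond the ~2% run-to-run noise
drops = [d for _, d in deltas if d < -2]
if gains:
    print(f"{len(gains)} gains > 2%, average {sum(gains) / len(gains):.1f}%")
if drops:
    print(f"{len(drops)} drops > 2%, average {sum(drops) / len(drops):.1f}%")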
 
{… to be continued …}
#97
GenjoKoan
New Member
  • Total Posts : 85
  • Reward points : 0
  • Joined: 2017/10/13 15:24:11
  • Location: Suburban Atlanta
  • Status: offline
  • Ribbons : 0
Re: Low Memory Benchmark Woes 2020/05/26 22:01:47 (permalink)
From my reading of the EVGA website, they clearly state that the X299 Dark supports 32GB DDR4 memory modules *AND* up to 128GB. I have looked at the site what seems like countless times now, before I bought one myself along with a 10920X. I wish I could help you, but right now I only have a set of 8GB DIMMs, and my power supply of choice may or may not be on a very slow boat from China.
#98
GenjoKoan
New Member
  • Total Posts : 85
  • Reward points : 0
  • Joined: 2017/10/13 15:24:11
  • Location: Suburban Atlanta
  • Status: offline
  • Ribbons : 0
Re: Low Memory Benchmark Woes 2020/05/26 23:11:50 (permalink)
Had I read past the first page before I posted the above, I might not have posted it.
Okay, now I have read it all, and apologies if I missed it, but there is one test that you may not have run: real-world tests on real, large datasets. Sorry, I forget exactly what you said you do, but in doing what you do, I imagine you could come up with multiple examples of repeatable data runs. Yes? Big tests with enough data to remove the cache from the equation. If you did that with as much rigor as you have applied to canned benchmarks, I suspect this phenomenon will fade from your concern. I hope so.
#99
ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/05/27 21:59:00 (permalink)
GenjoKoan
I wish I could help you, but right now …

 
Thank you for wanting to help, it is appreciated :)
 
GenjoKoan
… there is one test that you may not have run: real-world tests on real, large datasets. Sorry, I forget exactly what you said you do, but in doing what you do, I imagine you could come up with multiple examples of repeatable data runs.

 
You are partially correct. I have tried to identify real-world tests that would work with large datasets. That was exactly the intention behind looking at SPECworkstation, y-cruncher, etc. y-cruncher, for example, works with a pretty large data set (set it to work on 25 billion digits and it will happily consume almost all of 128 GB).
 
However, when picking a benchmark to use, one should IMHO strive to pick one that will “illuminate” the benchmarked area well, with figures not skewed or limited by something else (that is why I ended up walking away from y-cruncher).
 
That is where your question of whether I could come up with my own comes into play. I did consider it, and could do it, but the question is how quickly. To make a long story short, coming up with a correct “just memory and nothing else” test of a modern database engine wouldn’t be very quick (one would have, for a start, to ensure no self-tuning/caching/statistics/etc. is inflating the numbers); it would take longer than these pre-canned benchmarks did, so I passed on it for now.
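 
For completeness, the quicker non-database alternative would be a plain streaming test over a buffer far larger than L3, something like the numpy sketch below (sizes and iteration counts are arbitrary illustrative choices of mine):
 
import time
import numpy as np

N = 512 * 2**20 // 8        # 512 MiB of float64, far beyond any L3
src = np.random.rand(N)
dst = np.empty_like(src)

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    np.copyto(dst, src)     # streams N*8 bytes read plus N*8 bytes written
    best = min(best, time.perf_counter() - t0)

gbps = 2 * N * 8 / best / 1e9   # count read + write traffic
print(f"~{gbps:.1f} GB/s effective copy bandwidth")
 
Run on 2ch and then 4ch and the buffer is big enough that caching mostly washes out of the comparison.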
 
GenjoKoan
… I suspect this phenomenon will fade from your concern.


As of now I am inclined to feel a big part of what I’ve seen so far is, for lack of a better description, the “benchmark’s” “coding” not being one that would perform best on the architecture it is running on. That doesn’t eliminate my curiosity to (within reason) understand more and assure myself the performance level is where it should be.
#100
ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/06/16 21:34:07 (permalink)
I’ve done further looking at DigiCortex and y-cruncher but haven’t had time until now to write it up. For those that are interested, the conclusion was:
 
a) I don’t feel DigiCortex should be used as a benchmark for anything. At least I won’t.
 
b) y-cruncher does show memory bandwidth going up from 2 to 4 channels, -but- I don’t feel it can be used as a benchmarking tool for memory or anything else. The developer himself says “Long memory latencies are hidden away fairly well by hyperthreading” and “On Skylake X processors, L3 cache bandwidth is also a bottleneck. So overclock the cache as much as possible.” The only other piece of information I was able to get by spending effort on y-cruncher was that my i9-10900X had the same performance in it as an i9-7900X running at the same speed.
 
That leaves checking why the Prime95 bandwidth benchmark returned 30-ish% better figures with hyperthreading turned off.
 
{… to be continued …}
 
#101
ZoranC
FTW Member
  • Total Posts : 1099
  • Reward points : 0
  • Joined: 2011/05/24 17:22:15
  • Status: offline
  • Ribbons : 16
Re: Low Memory Benchmark Woes 2020/06/30 19:36:45 (permalink)
I discussed my P95 results with the Mersenne community and found that P95 in general loads a single core and its memory cache so heavily with a single worker that hyperthreading / additional workers often won’t result in any higher memory bandwidth figures, and often can/will result in what I’ve seen (lower results when HT was used, which kept getting lower as workers were added).
 
Once I looked solely at single-worker non-HT numbers, it was clear memory bandwidth was scaling up as expected when going from 2 to 4 channels, without any question marks.
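 
The expected ceilings themselves are simple arithmetic; DDR4-3600 below is purely an illustrative speed, substitute whatever the sticks actually run at:
 
transfers_per_s = 3600e6    # DDR4-3600, illustrative only
bytes_per_transfer = 8      # one 64-bit channel
for channels in (2, 4):
    peak = transfers_per_s * bytes_per_transfer * channels / 1e9
    print(f"{channels}ch theoretical peak: {peak:.1f} GB/s")
    # 2ch: 57.6 GB/s, 4ch: 115.2 GB/s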
 
In other words: you can use P95 to gauge whether the memory bandwidth of system X is in general better than that of system Y, but:
 
a) It can’t be used to gauge anything more granular, especially if HT is in the equation, and
 
b) If you don’t have results from the systems you want to compare against, then the absence of a central repository with plenty of data points for comparison (yes, there is a central repository at the Mersenne forum, but I could find only one record for an i9 CPU and only a handful for i7s) makes your single data point unusable for anything.
 
So, yeah, it benchmarks memory, but its value as a general memory benchmark might be very low. In all fairness, that’s not its intended use to start with; its intended purpose is to tell how good a memory setup is for running Prime95. Still, considering how many enthusiasts regularly run P95 on every new system they build, I feel there would be value in having many data points.
 
At least it provided another confirmation that memory bandwidth scaled up as the number of channels went up.
 
{… conclusion coming very soon …}
 
#102