jaafaman
FTW Member
- Total Posts : 1133
- Reward points : 0
- Joined: 2/23/2008
- Status: offline
- Ribbons : 13
...Realistically, Nvidia could use packed, single precision SSE for PhysX, if they wanted to take advantage of the CPU. Each instruction would execute up to 4 SIMD operations per cycle, rather than just one scalar operation. In theory, this could quadruple the performance of PhysX on a CPU, but the reality is that the gains are probably in the neighborhood of 2X on the current Nehalem and Westmere generation of CPUs. That is still a hefty boost and could easily move some games from the unplayable <24 FPS zone to >30 FPS territory when using CPU based PhysX...

...While as a buyer it may be frustrating to see PhysX hobbled on the CPU, it should not be surprising. Nvidia has no obligation to optimize for their competitor's products. PhysX does not run on top of AMD GPUs, and nobody reasonably expects that it will. Not only because of the extra development and support costs, but also AMD would never want to give Nvidia early developer versions of their products. Nvidia wants PhysX to be an exclusive, and it will likely stay that way. In the case of PhysX on the CPU, there are no significant extra costs (and frankly supporting SSE is easier than x87 anyway). For Nvidia, decreasing the baseline CPU performance by using x87 instructions and a single thread makes GPUs look better. This tactic calls into question the CPU vs. GPU comparisons made using PhysX; but the name of the game at Nvidia is making the GPU look good, and PhysX certainly fits the bill in the current incarnation...

From the final Analysis page, PhysX87: Software Deficiency, which examines CPU PhysX as per the thread's title...
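For readers wondering what "packed, single precision SSE" looks like in practice, here is a minimal sketch (purely illustrative, not code from the article; the function and variable names are invented) contrasting a plain scalar loop with the same loop written with 4-wide SSE intrinsics:

    #include <xmmintrin.h>  // SSE intrinsics: packed single precision, 4 floats per 128-bit register

    // Scalar version: one floating-point multiply and one add per loop iteration.
    void scale_add_scalar(float* out, const float* a, const float* b, float s, int n)
    {
        for (int i = 0; i < n; ++i)
            out[i] = a[i] * s + b[i];
    }

    // Packed SSE version: each multiply/add instruction operates on 4 floats at once.
    // Assumes n is a multiple of 4 and the pointers are 16-byte aligned.
    void scale_add_sse(float* out, const float* a, const float* b, float s, int n)
    {
        __m128 vs = _mm_set1_ps(s);
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
        }
    }

In the ideal case the packed loop retires a quarter of the arithmetic instructions, which is where the theoretical 4x figure comes from; memory bandwidth and code that does not vectorize cleanly are what pull the realistic estimate down toward 2x.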
ASUS X79 Deluxe, Intel E5-1680v2, GTX 1080, Windows 7 Ultimate SP1 (Main WS) || ASUS Rampage IV Extreme, Intel E5-1650, GTX 970, Windows Server 2008 R2 (VS 2010 SP1 Server) || Huanan X79 Turbo, Intel E5-1650v2, RTX 2070, Windows 10 Professional 1903 (Gaming) || Super Micro X9DR3-LN4F+, 2x Intel E5-2687W, Quadro K5200, 2x Tesla K20, Windows Server 2012 R2 (VS 2015 WS) || 2x Dell Optiplex 7010, Intel I5-3470, iGPU, Windows Server 2012 R2 (Edge Servers) || Dell Optiplex 7010 SFF, Intel I7-3770, iGPU, Windows Server 2012 R2 (AD-DS-DC, VPN-RRAS, RDS License VMs) || HP p6320y, AMD Phenom II X4 820, iGPU, Windows Server 2012 R2 (Media Server) Working on an RDS Server + a subnet for Win98/XP workstations (through the WS 2008 R2 system)
|
xfinrodx
iCX Member
- Total Posts : 415
- Reward points : 0
- Joined: 8/7/2006
- Location: Washington
- Status: offline
- Ribbons : 1
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Wednesday, July 07, 2010 3:51 PM
(permalink)
I suspected as much for some time, but never cared to provide proof. Very interesting read, jaafaman.
The chiefest of human design flaws is sleep.
|
merc.man87
FTW Member
- Total Posts : 1289
- Reward points : 0
- Joined: 3/28/2009
- Status: offline
- Ribbons : 6
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Wednesday, July 07, 2010 5:11 PM
(permalink)
Just read this. Basically confirms my suspicions. But oh well, I'll look for an official answer from Nvidia that will refute such nonsense as this, lol.
|
chumbucket843
iCX Member
- Total Posts : 469
- Reward points : 0
- Joined: 4/16/2009
- Status: offline
- Ribbons : 0
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Wednesday, July 07, 2010 6:11 PM
(permalink)
Core i7 D0 EVGA X58 LE EVGA GTX260\\folding 3x2 GB DDR3 *10 real cores folding
|
chizow
CLASSIFIED Member
- Total Posts : 3768
- Reward points : 0
- Joined: 1/28/2007
- Status: offline
- Ribbons : 30
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Thursday, July 08, 2010 2:28 AM
(permalink)
I'm sure Nvidia will respond to this as they have been doing more of recently, but my take on it is that it just comes down to the fundamental difference in Nvidia's approach to parallel computing. We know from their GPU architecture and design that Nvidia prefers Thread Level Parallelism while their competition's designs prefer Instruction Level Parallelism. For a quick primer on the differences between TLP vs. ILP, here's a good read on it at AnandTech from their RV770 review: http://www.anandtech.com/show/2556/6

"NVIDIA relies entirely on TLP (thread level parallelism) while AMD exploits both TLP and ILP. Extracting TLP is much much easier than ILP, as the only time you need to worry about any inter-thread conflicts is when sharing data (which happens much less frequently than does dependent code within a single thread). In a graphics architecture, with the necessity of running millions of threads per frame, there are plenty of threads with which to fill the execution units of the hardware, and thus exploiting TLP to fill the width of the hardware is all NVIDIA needs to do to get good utilization."

It sounds to me as if x87 leverages TLP, with many simple instructions spawning multiple threads to achieve maximum parallel efficiency from their architectures, which really sounds perfect for physics calculations anyways. This is in contrast to ILP, SSE instructions in this case, which would require the instructions themselves to extract parallel efficiency and may not benefit at all from ILP depending on the code involved. Given the nature of physics calculations (simple math calculations) and the fact that the original PhysX library is so old (Novodex in the early 2000s), it's probably a case of there not being much benefit to either 1) recompiling the source code to support SSE or 2) rewriting the source code to actually benefit from SSE. I highly doubt the problem is as simple as the author describes, where Nvidia is negligently or purposefully refusing to compile PhysX for SSE in an effort to leave performance on the table; I think it's just a matter of there not being much benefit in doing so, or not enough benefit to make the expenditure worthwhile.
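To picture the TLP side of that distinction, here is a rough sketch (purely illustrative, not PhysX code; the Body struct and function names are invented) of spreading independent rigid-body updates across threads instead of vectorizing the inner math:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    struct Body { float px, py, pz, vx, vy, vz; };

    // Each worker integrates its own slice of bodies; the slices share no data.
    static void integrate_range(std::vector<Body>& bodies, std::size_t begin, std::size_t end, float dt)
    {
        for (std::size_t i = begin; i < end; ++i) {
            bodies[i].px += bodies[i].vx * dt;
            bodies[i].py += bodies[i].vy * dt;
            bodies[i].pz += bodies[i].vz * dt;
        }
    }

    // Thread-level parallelism: more threads, same scalar math inside each thread.
    void integrate_all(std::vector<Body>& bodies, float dt, unsigned num_threads)
    {
        std::vector<std::thread> workers;
        std::size_t chunk = bodies.size() / num_threads + 1;
        for (unsigned t = 0; t < num_threads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(bodies.size(), begin + chunk);
            if (begin >= end) break;
            workers.emplace_back(integrate_range, std::ref(bodies), begin, end, dt);
        }
        for (std::thread& w : workers)
            w.join();
    }

The ILP/SSE route would instead keep one thread and rewrite the inner update so that four bodies' components are packed into each register, which is the "rewriting the source code to actually benefit from SSE" case mentioned above.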
post edited by chizow - Thursday, July 08, 2010 2:42 AM
Intel Core i7 5930K @4.5GHz | Gigabyte X99 Gaming 5 | Win8.1 Pro x64 | Corsair H105 2x Nvidia GeForce Titan X SLI | Asus ROG Swift 144Hz 3D Vision G-Sync LCD | 2xDell U2410 | 32GB Acer XPG DDR4 2800 Samsung 850 Pro 256GB | Samsung 840EVO 4x1TB RAID 0 | Seagate 2TB SSHD Yamaha VSX-677 A/V Receiver | Polk Audio RM6880 7.1 | LG Super Multi Blu-Ray Auzen X-Fi HT HD | Logitech G710/G502/G27/G930 | Corsair Air 540 | EVGA SuperNOVA P2 1200W
|
chumbucket843
iCX Member
- Total Posts : 469
- Reward points : 0
- Joined: 4/16/2009
- Status: offline
- Ribbons : 0
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Thursday, July 08, 2010 3:14 AM
(permalink)
chizow wrote: "I'm sure Nvidia will respond to this as they have been doing more of recently, but my take on it is that it just comes down to the fundamental difference in Nvidia's approach to parallel computing. We know from their GPU architecture and design that Nvidia prefers Thread Level Parallelism while their competition's designs prefer Instruction Level Parallelism. For a quick primer on the differences between TLP vs. ILP, here's a good read on it at AnandTech from their RV770 review: http://www.anandtech.com/show/2556/6 'NVIDIA relies entirely on TLP (thread level parallelism) while AMD exploits both TLP and ILP. Extracting TLP is much much easier than ILP, as the only time you need to worry about any inter-thread conflicts is when sharing data (which happens much less frequently than does dependent code within a single thread). In a graphics architecture, with the necessity of running millions of threads per frame, there are plenty of threads with which to fill the execution units of the hardware, and thus exploiting TLP to fill the width of the hardware is all NVIDIA needs to do to get good utilization.'"

That article has some misinformation in it. I think they need to hire someone who really knows what they are talking about.

chizow wrote: "It sounds to me as if x87 leverages TLP, with many simple instructions spawning multiple threads to achieve maximum parallel efficiency from their architectures, which really sounds perfect for physics calculations anyways. This is in contrast to ILP, which would require the instructions themselves to extract parallel efficiency and may not benefit at all from ILP depending on the code involved."

I know you don't have any programming experience, but x87 is total legacy garbage by today's standards. It is extremely slow due to its stack-based registers, which save opcode space; it turns out that register organization prevents any form of parallelism from being exploited.

chizow wrote: "Given the nature of physics calculations (simple math calculations) and the fact that the original PhysX library is so old (Novodex in the early 2000s), it's probably a case of there not being much benefit to either 1) recompiling the source code to support SSE or 2) rewriting the source code to actually benefit from SSE."

Computers think in algorithms, not math. Computational physics uses matrices to represent objects, and performing arithmetic on them is well suited to SSE and is easily vectorized.

chizow wrote: "I highly doubt the problem is as simple as the author describes, where Nvidia is negligently or purposefully refusing to compile PhysX for SSE in an effort to leave performance on the table; I think it's just a matter of there not being much benefit in doing so, or not enough benefit to make the expenditure worthwhile."

David Kanter knows what he is talking about. Check out his other articles, I insist. It's pretty much a given that building with the /sse compiler option will bring performance improvements over plain x87. Typing four letters for a pretty nice gain in performance is a free lunch. Simple optimizations will probably gain a 3-4x improvement, and with some expertise, speed-ups of up to 20x are possible.
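For context on how small that "recompile" step is, here is a sketch of the kind of dependence-free loop a compiler can translate to SSE on its own; the exact flag names in the comments are assumptions about the toolchains of the day (GCC and 32-bit MSVC), not something taken from the thread:

    // Same source, two builds:
    //   g++ -O3 -mfpmath=387 scale.cpp           -> x87 floating-point code
    //   g++ -O3 -msse2 -mfpmath=sse scale.cpp    -> SSE floating-point code (GCC can also vectorize this loop at -O3)
    //   cl  /O2 scale.cpp                        -> x87 code on 32-bit compilers of that era
    //   cl  /O2 /arch:SSE2 scale.cpp             -> SSE code generation
    void scale(float* out, const float* a, float s, int n)
    {
        // Element-wise loop with no cross-iteration dependence: the easiest case for a vectorizer.
        for (int i = 0; i < n; ++i)
            out[i] = a[i] * s;
    }

Whether a whole engine sees the 2-4x end of the range depends on how much of its time is spent in loops shaped like this; a compiler switch alone mostly buys SSE scalar code, and the larger gains need data laid out for packed operations.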
Core i7 D0 EVGA X58 LE EVGA GTX260\\folding 3x2 GB DDR3 *10 real cores folding
|
chizow
CLASSIFIED Member
- Total Posts : 3768
- Reward points : 0
- Joined: 1/28/2007
- Status: offline
- Ribbons : 30
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Thursday, July 08, 2010 4:11 AM
(permalink)
chumbucket843 wrote: "That article has some misinformation in it. I think they need to hire someone who really knows what they are talking about."

Instead of just claiming such, why not point out what misinformation you're referring to? AnandTech has a relatively long history of being a credible and accurate tech site; you, not so much.

chumbucket843 wrote: "I know you don't have any programming experience, but x87 is total legacy garbage by today's standards. It is extremely slow due to its stack-based registers, which save opcode space; it turns out that register organization prevents any form of parallelism from being exploited."

And again, you assume simple floating point math would benefit significantly from today's standard instruction sets. It's funny, because Intel has been struggling with this very problem for years. Instead of trying to futilely increase parallelism by increasing the number of execution units and IPC with more advanced instruction sets and gimmicky features like HyperThreading, what have they done? They've most successfully increased parallelism by simply increasing the number of cores and relying on tried and true TLP. Intel has written tons of whitepapers about this very problem; you should read them, I insist. http://software.intel.com/en-us/articles/data-parallelism-whitepapers-and-tutorials/

chumbucket843 wrote: "Computers think in algorithms, not math. Computational physics uses matrices to represent objects, and performing arithmetic on them is well suited to SSE and is easily vectorized."

Uh, and what is an algorithm? A formula or instruction set used to solve a problem, often a simple algebraic equation. Computers think in the simplest form of math, binary: 1s and 0s added, subtracted and multiplied and represented in longer strings of numbers. Floating point math is the basis of all modern computing languages and what all graphics and physics calculations are reduced to. The closer those formulas (or algorithms, as you will) are to plain floating point math, the less need and benefit there is for instruction level parallelism.

chumbucket843 wrote: "David Kanter knows what he is talking about. Check out his other articles, I insist. It's pretty much a given that building with the /sse compiler option will bring performance improvements over plain x87. Typing four letters for a pretty nice gain in performance is a free lunch. Simple optimizations will probably gain a 3-4x improvement, and with some expertise, speed-ups of up to 20x are possible."

Again, I didn't say he didn't know what he is talking about; it's obvious his research and analysis is thorough. However, he hasn't 1) demonstrated a benefit of compiling the source code for SSE, or 2) addressed whether the types of instructions involved, simple floating point math, would even benefit from SSE without significant optimization for ILP. Also, I'd say your estimates are overly optimistic. If you had read his article as you insisted, you would've seen his own estimates claimed ~4x speedup max, but really closer to ~2x, assuming the source code benefited from SSE at all without significant code optimizations.
Intel Core i7 5930K @4.5GHz | Gigabyte X99 Gaming 5 | Win8.1 Pro x64 | Corsair H105 2x Nvidia GeForce Titan X SLI | Asus ROG Swift 144Hz 3D Vision G-Sync LCD | 2xDell U2410 | 32GB Acer XPG DDR4 2800 Samsung 850 Pro 256GB | Samsung 840EVO 4x1TB RAID 0 | Seagate 2TB SSHD Yamaha VSX-677 A/V Receiver | Polk Audio RM6880 7.1 | LG Super Multi Blu-Ray Auzen X-Fi HT HD | Logitech G710/G502/G27/G930 | Corsair Air 540 | EVGA SuperNOVA P2 1200W
|
jaafaman
FTW Member
- Total Posts : 1133
- Reward points : 0
- Joined: 2/23/2008
- Status: offline
- Ribbons : 13
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Thursday, July 08, 2010 10:41 AM
(permalink)
I'm thinking this would be an excellent question to pose in the CUDA Forums...
ASUS X79 Deluxe, Intel E5-1680v2, GTX 1080, Windows 7 Ultimate SP1 (Main WS) || ASUS Rampage IV Extreme, Intel E5-1650, GTX 970, Windows Server 2008 R2 (VS 2010 SP1 Server) || Huanan X79 Turbo, Intel E5-1650v2, RTX 2070, Windows 10 Professional 1903 (Gaming) || Super Micro X9DR3-LN4F+, 2x Intel E5-2687W, Quadro K5200, 2x Tesla K20, Windows Server 2012 R2 (VS 2015 WS) || 2x Dell Optiplex 7010, Intel I5-3470, iGPU, Windows Server 2012 R2 (Edge Servers) || Dell Optiplex 7010 SFF, Intel I7-3770, iGPU, Windows Server 2012 R2 (AD-DS-DC, VPN-RRAS, RDS License VMs) || HP p6320y, AMD Phenom II X4 820, iGPU, Windows Server 2012 R2 (Media Server) Working on an RDS Server + a subnet for Win98/XP workstations (through the WS 2008 R2 system)
|
chumbucket843
iCX Member
- Total Posts : 469
- Reward points : 0
- Joined: 4/16/2009
- Status: offline
- Ribbons : 0
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Thursday, July 08, 2010 11:15 PM
(permalink)
chizow wrote: "chumbucket843: 'That article has some misinformation in it. I think they need to hire someone who really knows what they are talking about.' Instead of just claiming such, why not point out what misinformation you're referring to? AnandTech has a relatively long history of being a credible and accurate tech site; you, not so much."

Their claim that ATi does not extract ILP well enough to thoroughly utilize the hardware is wrong. Exploiting ILP properly, as VLIW does, can be very efficient. It also simplifies physical design, because most of the hardware is datapaths and SRAM. Here are some shader benches: http://forum.beyond3d.com/showpost.php?p=1220350&postcount=27

chizow wrote: "chumbucket843: 'I know you don't have any programming experience, but x87 is total legacy garbage by today's standards. It is extremely slow due to its stack-based registers, which save opcode space; it turns out that register organization prevents any form of parallelism from being exploited.' And again, you assume simple floating point math would benefit significantly from today's standard instruction sets. It's funny, because Intel has been struggling with this very problem for years. Instead of trying to futilely increase parallelism by increasing the number of execution units and IPC with more advanced instruction sets and gimmicky features like HyperThreading, what have they done? They've most successfully increased parallelism by simply increasing the number of cores and relying on tried and true TLP. Intel has written tons of whitepapers about this very problem; you should read them, I insist. http://software.intel.com/en-us/articles/data-parallelism-whitepapers-and-tutorials/"

A warp in CUDA has a SIMD width of 32, which means that to get maximum performance you have to vectorize your code. An SSE register has a SIMD width of 4 for single precision. Obviously, if physics is sped up by being vectorized on a GPU, then it will be on a CPU too.

chizow wrote: "chumbucket843: 'Computers think in algorithms, not math. Computational physics uses matrices to represent objects, and performing arithmetic on them is well suited to SSE and is easily vectorized.' Uh, and what is an algorithm? A formula or instruction set used to solve a problem, often a simple algebraic equation. Computers think in the simplest form of math, binary: 1s and 0s added, subtracted and multiplied and represented in longer strings of numbers. Floating point math is the basis of all modern computing languages and what all graphics and physics calculations are reduced to. The closer those formulas (or algorithms, as you will) are to plain floating point math, the less need and benefit there is for instruction level parallelism."

An algorithm is hard to define exactly, but it is basically a finite series of steps or decisions that yields a result. Here is a matrix-matrix multiplication algorithm:

    void matrix_multiply_1(matrix_t A, matrix_t B, matrix_t C)
    {
        for (int i = 0; i < A.rows; ++i) {
            for (int j = 0; j < B.cols; ++j) {
                for (int k = 0; k < A.cols; ++k)
                    C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

It's easily parallelized in any way: it's data parallel, task parallel, and instruction parallel, and all control flow and memory access patterns are 100% predictable. Matrix multiplication is used heavily in physics, more precisely sparse matrices, which are used to solve partial differential equations.

chizow wrote: "chumbucket843: 'David Kanter knows what he is talking about. Check out his other articles, I insist. It's pretty much a given that building with the /sse compiler option will bring performance improvements over plain x87. Typing four letters for a pretty nice gain in performance is a free lunch. Simple optimizations will probably gain a 3-4x improvement, and with some expertise, speed-ups of up to 20x are possible.' Again, I didn't say he didn't know what he is talking about; it's obvious his research and analysis is thorough. However, he hasn't 1) demonstrated a benefit of compiling the source code for SSE, or 2) addressed whether the types of instructions involved, simple floating point math, would even benefit from SSE without significant optimization for ILP. Also, I'd say your estimates are overly optimistic. If you had read his article as you insisted, you would've seen his own estimates claimed ~4x speedup max, but really closer to ~2x, assuming the source code benefited from SSE at all without significant code optimizations."

Here is a ~4x gain from optimizing SpMV (used for physics): http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study_10.aspx
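Since SpMV keeps coming up, here is a bare-bones sketch (illustrative only, not taken from the AMD case study) of a sparse matrix-vector multiply in the common compressed-sparse-row layout, which is the kind of loop those optimization write-ups start from:

    // Compressed sparse row (CSR) storage:
    //   values[]  - the non-zero entries, stored row by row
    //   col[]     - the column index of each non-zero entry
    //   row_ptr[] - row_ptr[i] .. row_ptr[i+1] bound row i's entries (length rows + 1)
    void spmv_csr(int rows, const int* row_ptr, const int* col,
                  const float* values, const float* x, float* y)
    {
        for (int i = 0; i < rows; ++i) {
            float sum = 0.0f;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += values[k] * x[col[k]];   // gather from x: an irregular access pattern
            y[i] = sum;
        }
    }

The irregular gathers from x are why SpMV usually needs restructuring (blocking, reordering, padding) before SIMD or GPU hardware pays off, which is the sort of work an optimization case study like the one linked above typically focuses on.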
Core i7 D0 EVGA X58 LE EVGA GTX260\\folding 3x2 GB DDR3 *10 real cores folding
|
chizow
CLASSIFIED Member
- Total Posts : 3768
- Reward points : 0
- Joined: 1/28/2007
- Status: offline
- Ribbons : 30
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Friday, July 09, 2010 4:03 AM
(permalink)
chumbucket843 wrote: "Their claim that ATi does not extract ILP well enough to thoroughly utilize the hardware is wrong. Exploiting ILP properly, as VLIW does, can be very efficient. It also simplifies physical design, because most of the hardware is datapaths and SRAM. Here are some shader benches: http://forum.beyond3d.com/showpost.php?p=1220350&postcount=27"

How is it wrong when no real-world applications validate ATI's architectural approach as the better one? How long has ATI been touting ILP and their best-case VLIW/superscalar performance in TeraFLOPS, only to see their GPUs thoroughly defeated in all things GPGPU and GPU? We have GT200 at ~700 GigaFLOPS and RV770 at 1.2 TeraFLOPS; which is faster in everything? GT200. Now we have GF100 at ~1.35 TeraFLOPS and RV870 at ~2.72 TeraFLOPS. So Cypress must be 2x faster than Fermi, right? Of course not, it's slower in EVERYTHING, except for some simple shader program posted on an AMD shill site that demonstrates performance gains near peak throughput. Unfortunately ATI's 1x5 superscalar architecture performs more like 1 shader in the worst case rather than 5 in its best case, so those 1600 SPs are really more like 320 shaders, which makes a lot more sense when stacked up against GF100's 480 real SPs.

chumbucket843 wrote: "A warp in CUDA has a SIMD width of 32, which means that to get maximum performance you have to vectorize your code. An SSE register has a SIMD width of 4 for single precision. Obviously, if physics is sped up by being vectorized on a GPU, then it will be on a CPU too."

No, because each of those 32 SIMD lanes in a warp is executed as an individual thread, so you could run 32 single-issue instructions just as well as rewriting and vectorizing your code for SSE. Of course, you wouldn't have to do anything if your code was already well suited to multiple single-issue instructions, which brings us full circle to the fundamental differences in Nvidia's approach to parallelism favoring TLP over ILP.

chumbucket843 wrote: "An algorithm is hard to define exactly, but it is basically a finite series of steps or decisions that yields a result. void matrix_multiply_1(matrix_t A, matrix_t B, matrix_t C) { for (int i = 0; i < A.rows; ++i) { for (int j = 0; j < B.cols; ++j) { for (int k = 0; k < A.cols; ++k) C[i][j] += A[i][k] * B[k][j]; } } } Here is a matrix-matrix multiplication algorithm. It's easily parallelized in any way: it's data parallel, task parallel, and instruction parallel, and all control flow and memory access patterns are 100% predictable. Matrix multiplication is used heavily in physics, more precisely sparse matrices, which are used to solve partial differential equations."

Our personal definitions of algorithm aside, I'd say it's safe to say computers do "think in math." As for the example code, I suppose it would've been too much to ask to cite your source; seems you did a bit more than read those Intel whitepapers. http://software.intel.com/en-us/articles/a-tale-of-two-algorithms-multithreading-matrix-multiplication/

Funny how you omitted this portion of the author's analysis, however: "Okay, let's run it! I used these two algorithms to multiply a 687x837 matrix by a 837x1107 matrix on a four-core desktop PC using an alpha version of Cilk++. The second version is highly parallel, so we should get near-linear speedup. However, when I tried it, not only did I not get linear speedup, but the parallel version actually ran twice as slow on four cores as the serial version did on one core! So what's the problem? Has Cilk++ failed us? Not at all, in fact! It turns out that our matrix_multiply_2 code is rather problematic, and there are two big ways we could improve it."

Interesting: looks like oversimplifying the situation yields some unexpected results with code that doesn't lend itself to parallelization without significant optimization and code re-writes.

chumbucket843 wrote: "Here is a ~4x gain from optimizing SpMV (used for physics): http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study_10.aspx"

Awesome, and what did they do in order to get code that was originally written to run optimally on the GPU to run better on the CPU? They re-wrote it. You seem to be forgetting most of PhysX's source code was written almost a decade ago, and certainly not with only SSE in mind. While I'm sure PhysX source code could be rewritten and optimized for SSE, the question is whether the speed-up would be worthwhile or significant. PhysX can be compiled for all the consoles, all x86 CPUs, ARM, PowerPC and GPUs, so it needs to be flexible and portable, and it is actually so old that it pre-dates SSE2 support on any of AMD's processors. I'd be interested to see the author of that article put his money where his mouth is, however, as his analysis stopped short of comparing Bullet compiled for SSE vs. Bullet compiled for x87, even though he did mention it as a method of demonstrating any speed-up. Wonder why he didn't bother? Hmmm.
Intel Core i7 5930K @4.5GHz | Gigabyte X99 Gaming 5 | Win8.1 Pro x64 | Corsair H105 2x Nvidia GeForce Titan X SLI | Asus ROG Swift 144Hz 3D Vision G-Sync LCD | 2xDell U2410 | 32GB Acer XPG DDR4 2800 Samsung 850 Pro 256GB | Samsung 840EVO 4x1TB RAID 0 | Seagate 2TB SSHD Yamaha VSX-677 A/V Receiver | Polk Audio RM6880 7.1 | LG Super Multi Blu-Ray Auzen X-Fi HT HD | Logitech G710/G502/G27/G930 | Corsair Air 540 | EVGA SuperNOVA P2 1200W
|
jaafaman
FTW Member
- Total Posts : 1133
- Reward points : 0
- Joined: 2/23/2008
- Status: offline
- Ribbons : 13
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Friday, July 09, 2010 10:22 AM
(permalink)
ArsTechnica is carrying the story now, along with nVidia's response:

"...It's rare that you talk to people at a company and they spend as much time slagging their codebase as the NVIDIA guys did on the PC version of PhysX. It seemed pretty clear that PhysX 2.x has a ton of legacy issues, and that the big ground-up rewrite that's coming next year with 3.0 will make a big difference. The 3.0 release will use SSE scalar at the very least, and they may do some vectorization if they can devote the engineering resources to it. As for how big of a difference 3.0 would bring for PhysX on the PC, we and NVIDIA had divergent takes..."

http://arstechnica.com/gaming/news/2010/07/did-nvidia-cripple-its-cpu-gaming-physics-library-to-spite-intel.ars
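As a footnote on "SSE scalar at the very least": scalar SSE still works on one float at a time, it just uses the SSE registers instead of the x87 stack, while vectorization means using the packed forms. A tiny sketch (illustrative only, not PhysX code):

    #include <xmmintrin.h>

    // Scalar SSE: the _ss forms touch only the lowest float of the 128-bit register.
    float add_one_scalar_sse(float a, float b)
    {
        return _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(a), _mm_set_ss(b)));
    }

    // Packed SSE: the _ps forms operate on all four lanes at once.
    void add_four_packed_sse(float* out, const float* a, const float* b)
    {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

Compilers can emit the scalar forms automatically when told to target SSE, which is why "SSE scalar" is the cheap first step; the packed forms generally need vectorization-friendly data layouts, hence the hedging about engineering resources.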
ASUS X79 Deluxe, Intel E5-1680v2, GTX 1080, Windows 7 Ultimate SP1 (Main WS) || ASUS Rampage IV Extreme, Intel E5-1650, GTX 970, Windows Server 2008 R2 (VS 2010 SP1 Server) || Huanan X79 Turbo, Intel E5-1650v2, RTX 2070, Windows 10 Professional 1903 (Gaming) || Super Micro X9DR3-LN4F+, 2x Intel E5-2687W, Quadro K5200, 2x Tesla K20, Windows Server 2012 R2 (VS 2015 WS) || 2x Dell Optiplex 7010, Intel I5-3470, iGPU, Windows Server 2012 R2 (Edge Servers) || Dell Optiplex 7010 SFF, Intel I7-3770, iGPU, Windows Server 2012 R2 (AD-DS-DC, VPN-RRAS, RDS License VMs) || HP p6320y, AMD Phenom II X4 820, iGPU, Windows Server 2012 R2 (Media Server) Working on an RDS Server + a subnet for Win98/XP workstations (through the WS 2008 R2 system)
|
chizow
CLASSIFIED Member
- Total Posts : 3768
- Reward points : 0
- Joined: 1/28/2007
- Status: offline
- Ribbons : 30
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Friday, July 09, 2010 8:05 PM
(permalink)
Yeah, Nvidia replied to RWT's accusations and dismissed them as "factually incorrect" at various tech/blog sites:

http://www.tgdaily.com/hardware-features/50554-does-physx-diminish-cpu-performance
http://www.thinq.co.uk/2010/7/8/nvidia-were-not-hobbling-cpu-physx/

They basically address the same things I was questioning, namely whether there was:

1) A demonstrated speed-up from simply recompiling their source from x87 to SSE. Answer is NO. In fact, Nvidia claims in some cases their code compiled for SSE runs SLOWER than x87: "[And although] our SDK does [include] some SSE code, we found [that] non-SSE code can result in higher performance than SSE in many situations. [Nevertheless], we will continue to use SSE and plan to enable it by default in future releases. That being said, not all developers want SSE enabled by default, because they still want support for older CPUs for their SW versions."

2) Whether optimizing their code to benefit from SSE would take significant code re-writes. Answer is YES; however, they are still skeptical about how much performance will increase as a result, probably nowhere near the 2x or even 4x speed-up claims from RWT. They do plan a new SDK version, 3.x, that will use SSE by default and will not only have SSE optimizations but also automatically enable multi-threading by default for developers who are too lazy to do so themselves.
Intel Core i7 5930K @4.5GHz | Gigabyte X99 Gaming 5 | Win8.1 Pro x64 | Corsair H105 2x Nvidia GeForce Titan X SLI | Asus ROG Swift 144Hz 3D Vision G-Sync LCD | 2xDell U2410 | 32GB Acer XPG DDR4 2800 Samsung 850 Pro 256GB | Samsung 840EVO 4x1TB RAID 0 | Seagate 2TB SSHD Yamaha VSX-677 A/V Receiver | Polk Audio RM6880 7.1 | LG Super Multi Blu-Ray Auzen X-Fi HT HD | Logitech G710/G502/G27/G930 | Corsair Air 540 | EVGA SuperNOVA P2 1200W
|
jaafaman
FTW Member
- Total Posts : 1133
- Reward points : 0
- Joined: 2/23/2008
- Status: offline
- Ribbons : 13
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Friday, July 09, 2010 10:03 PM
(permalink)
Have you read anything of late on the 64-bit version?...
ASUS X79 Deluxe, Intel E5-1680v2, GTX 1080, Windows 7 Ultimate SP1 (Main WS) || ASUS Rampage IV Extreme, Intel E5-1650, GTX 970, Windows Server 2008 R2 (VS 2010 SP1 Server) || Huanan X79 Turbo, Intel E5-1650v2, RTX 2070, Windows 10 Professional 1903 (Gaming) || Super Micro X9DR3-LN4F+, 2x Intel E5-2687W, Quadro K5200, 2x Tesla K20, Windows Server 2012 R2 (VS 2015 WS) || 2x Dell Optiplex 7010, Intel I5-3470, iGPU, Windows Server 2012 R2 (Edge Servers) || Dell Optiplex 7010 SFF, Intel I7-3770, iGPU, Windows Server 2012 R2 (AD-DS-DC, VPN-RRAS, RDS License VMs) || HP p6320y, AMD Phenom II X4 820, iGPU, Windows Server 2012 R2 (Media Server) Working on an RDS Server + a subnet for Win98/XP workstations (through the WS 2008 R2 system)
|
chumbucket843
iCX Member
- Total Posts : 469
- Reward points : 0
- Joined: 4/16/2009
- Status: offline
- Ribbons : 0
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Sunday, July 11, 2010 4:55 PM
(permalink)
chizow wrote: "chumbucket843: 'Their claim that ATi does not extract ILP well enough to thoroughly utilize the hardware is wrong. Exploiting ILP properly, as VLIW does, can be very efficient. It also simplifies physical design, because most of the hardware is datapaths and SRAM. Here are some shader benches: http://forum.beyond3d.com/showpost.php?p=1220350&postcount=27' How is it wrong when no real-world applications validate ATI's architectural approach as the better one? How long has ATI been touting ILP and their best-case VLIW/superscalar performance in TeraFLOPS, only to see their GPUs thoroughly defeated in all things GPGPU and GPU? We have GT200 at ~700 GigaFLOPS and RV770 at 1.2 TeraFLOPS; which is faster in everything? GT200. Now we have GF100 at ~1.35 TeraFLOPS and RV870 at ~2.72 TeraFLOPS. So Cypress must be 2x faster than Fermi, right? Of course not, it's slower in EVERYTHING, except for some simple shader program posted on an AMD shill site that demonstrates performance gains near peak throughput. Unfortunately ATI's 1x5 superscalar architecture performs more like 1 shader in the worst case rather than 5 in its best case, so those 1600 SPs are really more like 320 shaders, which makes a lot more sense when stacked up against GF100's 480 real SPs."

When we are talking about games as a whole, shaders and GFLOPS are not faithful measurements of performance. You have a lot of other things going on, like the hierarchical Z-buffer, triangle setup, rasterization, texture filtering and sampling, real-time compression, etc. Those require different things like ROPs, memory bandwidth and texture samplers, but not much shader power. That's why you see RV770 faster at shading there but not at the game as a whole. I am going to paraphrase David Bowman here: he said ATi went with a lot more shaders because it gave them the best perf/mm2, rather than focusing on other parts of the architecture. As for GPGPU, ATi is not bad. In some areas they beat Fermi by a good bit, but for more general code Fermi is probably better. Here is an example of a HUGE advantage of VLIW: Cypress can operate on 192-bit integers natively, saving a lot of the computation needed for arbitrary precision. http://www.brightsideofnews.com/Data/2010_3_26/nVidia-GeForce-GTX-480-and-GTX-480-SLI-Review/NVDA_GTX480_Elcom2_675.jpg

chizow wrote: "No, because each of those 32 SIMD lanes in a warp is executed as an individual thread, so you could run 32 single-issue instructions just as well as rewriting and vectorizing your code for SSE. Of course, you wouldn't have to do anything if your code was already well suited to multiple single-issue instructions, which brings us full circle to the fundamental differences in Nvidia's approach to parallelism favoring TLP over ILP."

All 32 threads must be executing the same kernel, so they aren't really individual like they are on a CPU. Furthermore, there are very few synchronization primitives for GPUs. Fermi has improved that, but a little more programmability won't hurt.

chizow wrote: "Our personal definitions of algorithm aside, I'd say it's safe to say computers do 'think in math.' As for the example code, I suppose it would've been too much to ask to cite your source; seems you did a bit more than read those Intel whitepapers. http://software.intel.com/en-us/articles/a-tale-of-two-algorithms-multithreading-matrix-multiplication/ Funny how you omitted this portion of the author's analysis, however: 'Okay, let's run it! I used these two algorithms to multiply a 687x837 matrix by a 837x1107 matrix on a four-core desktop PC using an alpha version of Cilk++. The second version is highly parallel, so we should get near-linear speedup. However, when I tried it, not only did I not get linear speedup, but the parallel version actually ran twice as slow on four cores as the serial version did on one core! So what's the problem? Has Cilk++ failed us? Not at all, in fact! It turns out that our matrix_multiply_2 code is rather problematic, and there are two big ways we could improve it.' Interesting: looks like oversimplifying the situation yields some unexpected results with code that doesn't lend itself to parallelization without significant optimization and code re-writes."

Umm, that is a very naive matrix-matrix multiplication algorithm, which is why treating it as just math won't work. It is bound by memory bandwidth; adding more cores will not help. The only purpose of that code is to show how to write a MMM algorithm in C/C++. There are a shatload of high-performance libraries you can use, so a rewrite would not be needed.

chizow wrote: "Awesome, and what did they do in order to get code that was originally written to run optimally on the GPU to run better on the CPU? They re-wrote it. You seem to be forgetting most of PhysX's source code was written almost a decade ago, and certainly not with only SSE in mind. While I'm sure PhysX source code could be rewritten and optimized for SSE, the question is whether the speed-up would be worthwhile or significant. PhysX can be compiled for all the consoles, all x86 CPUs, ARM, PowerPC and GPUs, so it needs to be flexible and portable, and it is actually so old that it pre-dates SSE2 support on any of AMD's processors."

If most of PhysX's coding was done 10 years ago, then that's just sad. They really need a rewrite! This really puts them behind other physics middleware, especially since the others use SSE.

chizow wrote: "I'd be interested to see the author of that article put his money where his mouth is, however, as his analysis stopped short of comparing Bullet compiled for SSE vs. Bullet compiled for x87, even though he did mention it as a method of demonstrating any speed-up. Wonder why he didn't bother? Hmmm."

He doesn't have much time; that's why he doesn't post as many articles as he used to.
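On the memory-bandwidth point: the usual first fix for the naive triple loop is cache blocking, sketched below (illustrative; the tile size is an assumption and would be tuned to the cache), which is one standard way to address the problems the Cilk++ article identifies:

    // C += A * B for row-major n x n matrices, processed in BS x BS tiles so the
    // working set of each tile stays in cache while it is reused.
    const int BS = 64;   // tile size: illustrative, tuned per cache in practice

    void matmul_blocked(int n, const float* A, const float* B, float* C)
    {
        for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < ii + BS && i < n; ++i)
                for (int k = kk; k < kk + BS && k < n; ++k) {
                    float a = A[i * n + k];
                    for (int j = jj; j < jj + BS && j < n; ++j)
                        C[i * n + j] += a * B[k * n + j];   // unit-stride inner loop: SIMD-friendly
                }
    }

The point is that the loop only starts to benefit from more cores or wider SIMD once it stops being starved for memory, which is why "just parallelize it" gave the article's author a slowdown.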
Core i7 D0 EVGA X58 LE EVGA GTX260\\folding 3x2 GB DDR3 *10 real cores folding
|
chizow
CLASSIFIED Member
- Total Posts : 3768
- Reward points : 0
- Joined: 1/28/2007
- Status: offline
- Ribbons : 30
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Monday, July 12, 2010 0:39 PM
(permalink)
chumbucket843 wrote: "When we are talking about games as a whole, shaders and GFLOPS are not faithful measurements of performance. You have a lot of other things going on, like the hierarchical Z-buffer, triangle setup, rasterization, texture filtering and sampling, real-time compression, etc. Those require different things like ROPs, memory bandwidth and texture samplers, but not much shader power. That's why you see RV770 faster at shading there but not at the game as a whole. I am going to paraphrase David Bowman here: he said ATi went with a lot more shaders because it gave them the best perf/mm2, rather than focusing on other parts of the architecture. As for GPGPU, ATi is not bad. In some areas they beat Fermi by a good bit, but for more general code Fermi is probably better. Here is an example of a HUGE advantage of VLIW: Cypress can operate on 192-bit integers natively, saving a lot of the computation needed for arbitrary precision. http://www.brightsideofnews.com/Data/2010_3_26/nVidia-GeForce-GTX-480-and-GTX-480-SLI-Review/NVDA_GTX480_Elcom2_675.jpg"

Again, I quoted the numbers that both companies state as their single-precision floating point throughput, which directly correlates to how fast a GPU should be able to perform the floating point math ops required for GPGPU applications. Clearly ATI's architecture is less efficient if it can't come anywhere close to achieving that level of throughput in real-world apps that rely on floating point operations, and is routinely outperformed by Nvidia parts with half the theoretical throughput. If there are parts of their architecture that are holding back performance, it sounds like they should probably address them and try to make their shaders more efficient; but instead, they take a brute-force approach to the problem by simply doubling their number of shaders with each generation. The end result is that those shaders are even LESS efficient with each generation than the previous one.

chumbucket843 wrote: "All 32 threads must be executing the same kernel, so they aren't really individual like they are on a CPU. Furthermore, there are very few synchronization primitives for GPUs. Fermi has improved that, but a little more programmability won't hurt."

Yep, and the scheduler handles all of that to coordinate which warps are in flight executing the same instruction on all threads. Fortunately, floating point math can be reduced to 3 instructions, add, subtract and multiply, so the chance of concurrent execution of the same instruction is high.

chumbucket843 wrote: "Umm, that is a very naive matrix-matrix multiplication algorithm, which is why treating it as just math won't work. It is bound by memory bandwidth; adding more cores will not help. The only purpose of that code is to show how to write a MMM algorithm in C/C++. There are a shatload of high-performance libraries you can use, so a rewrite would not be needed."

It shows attempts to improve efficiency can easily lead to unexpected, undesired results. In this case, re-writing and optimizing code for parallel efficiency resulted in worse performance.

chumbucket843 wrote: "If most of PhysX's coding was done 10 years ago, then that's just sad. They really need a rewrite! This really puts them behind other physics middleware, especially since the others use SSE."

They need a rewrite according to whom? One segment of PhysX's target audience that runs adequately with x87?

It's funny that you bring up other physics middleware, because if the conclusions Kanter came to were true, that SSE could result in a 2-4x speed-up, why are CPU PhysX effects perfectly comparable to what's available with CPU Havok or CPU Bullet? If SSE were as beneficial as you claim, I'd expect to see GPU PhysX-like effects on the CPU with quad-core, SSE-optimized Havok, but instead we just get the same CPU-accelerated physics we've seen for years. Maybe Intel is purposely hobbling their Havok code?

chumbucket843 wrote: "He doesn't have much time; that's why he doesn't post as many articles as he used to."

lol, he had time to write a 6-page speculation piece on PhysX but didn't have the time to download and recompile Bullet and compare the results? Sounds like a bad oversight, or he's got his own agenda. I guess his research wasn't thorough enough; he probably knew the results might conflict with and contradict his conclusions, so he chose to leave them out.
Intel Core i7 5930K @4.5GHz | Gigabyte X99 Gaming 5 | Win8.1 Pro x64 | Corsair H105 2x Nvidia GeForce Titan X SLI | Asus ROG Swift 144Hz 3D Vision G-Sync LCD | 2xDell U2410 | 32GB Acer XPG DDR4 2800 Samsung 850 Pro 256GB | Samsung 840EVO 4x1TB RAID 0 | Seagate 2TB SSHD Yamaha VSX-677 A/V Receiver | Polk Audio RM6880 7.1 | LG Super Multi Blu-Ray Auzen X-Fi HT HD | Logitech G710/G502/G27/G930 | Corsair Air 540 | EVGA SuperNOVA P2 1200W
|
chizow
CLASSIFIED Member
- Total Posts : 3768
- Reward points : 0
- Joined: 1/28/2007
- Status: offline
- Ribbons : 30
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Monday, July 12, 2010 0:42 PM
(permalink)
jaafaman Have you read anything of late on the 64-bit version?... Sorry I haven't really kept up or heard anything about 64-bit PhysX SDK, although I'm not sure how much it would help since most PhysX calculations are truncated and limited to single precision anyways. I'm more interested in the game engines themselves going 64-bit so they can address more system memory to pre-load all textures and get away from the streaming and swapping required with the current /largeaddressaware 32-bit limitations.
Intel Core i7 5930K @4.5GHz | Gigabyte X99 Gaming 5 | Win8.1 Pro x64 | Corsair H105 2x Nvidia GeForce Titan X SLI | Asus ROG Swift 144Hz 3D Vision G-Sync LCD | 2xDell U2410 | 32GB Acer XPG DDR4 2800 Samsung 850 Pro 256GB | Samsung 840EVO 4x1TB RAID 0 | Seagate 2TB SSHD Yamaha VSX-677 A/V Receiver | Polk Audio RM6880 7.1 | LG Super Multi Blu-Ray Auzen X-Fi HT HD | Logitech G710/G502/G27/G930 | Corsair Air 540 | EVGA SuperNOVA P2 1200W
|
chumbucket843
iCX Member
- Total Posts : 469
- Reward points : 0
- Joined: 4/16/2009
- Status: offline
- Ribbons : 0
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Tuesday, July 13, 2010 7:43 PM
(permalink)
chizow wrote: "chumbucket843: 'When we are talking about games as a whole, shaders and GFLOPS are not faithful measurements of performance. You have a lot of other things going on, like the hierarchical Z-buffer, triangle setup, rasterization, texture filtering and sampling, real-time compression, etc. Those require different things like ROPs, memory bandwidth and texture samplers, but not much shader power. That's why you see RV770 faster at shading there but not at the game as a whole. I am going to paraphrase David Bowman here: he said ATi went with a lot more shaders because it gave them the best perf/mm2, rather than focusing on other parts of the architecture. As for GPGPU, ATi is not bad. In some areas they beat Fermi by a good bit, but for more general code Fermi is probably better. Here is an example of a HUGE advantage of VLIW: Cypress can operate on 192-bit integers natively, saving a lot of the computation needed for arbitrary precision. http://www.brightsideofnews.com/Data/2010_3_26/nVidia-GeForce-GTX-480-and-GTX-480-SLI-Review/NVDA_GTX480_Elcom2_675.jpg' Again, I quoted the numbers that both companies state as their single-precision floating point throughput, which directly correlates to how fast a GPU should be able to perform the floating point math ops required for GPGPU applications. Clearly ATI's architecture is less efficient if it can't come anywhere close to achieving that level of throughput in real-world apps that rely on floating point operations, and is routinely outperformed by Nvidia parts with half the theoretical throughput. If there are parts of their architecture that are holding back performance, it sounds like they should probably address them and try to make their shaders more efficient; but instead, they take a brute-force approach to the problem by simply doubling their number of shaders with each generation. The end result is that those shaders are even LESS efficient with each generation than the previous one."

Here is optimized SGEMM on GPUs; in the later part of the thread, after the 58xx launched, they achieved 2.2 TFLOPS on a 5870. Utilization is even more efficient than the 4000 series. http://forum.beyond3d.com/showthread.php?t=54842 Not bad for single precision, eh? I would not consider doubling shaders every generation brute force; it's more conservative.

chizow wrote: "chumbucket843: 'All 32 threads must be executing the same kernel, so they aren't really individual like they are on a CPU. Furthermore, there are very few synchronization primitives for GPUs. Fermi has improved that, but a little more programmability won't hurt.' Yep, and the scheduler handles all of that to coordinate which warps are in flight executing the same instruction on all threads. Fortunately, floating point math can be reduced to 3 instructions, add, subtract and multiply, so the chance of concurrent execution of the same instruction is high."

Just because the arithmetic is +, -, *, / does not mean those are the only instructions. Nvidia's arch is heavily focused on RISC; it's a load-store architecture to reduce complexity. There are plenty of other instructions for comparing values, bitwise operations, string arithmetic, conditional jumps and other control flow, loads/stores from memory to the register file, and NOPs. By the way, kernels are functions, not instructions, and they can execute the instructions out of order with scoreboarding.

chizow wrote: "chumbucket843: 'Umm, that is a very naive matrix-matrix multiplication algorithm, which is why treating it as just math won't work. It is bound by memory bandwidth; adding more cores will not help. The only purpose of that code is to show how to write a MMM algorithm in C/C++. There are a shatload of high-performance libraries you can use, so a rewrite would not be needed.' It shows attempts to improve efficiency can easily lead to unexpected, undesired results. In this case, re-writing and optimizing code for parallel efficiency resulted in worse performance."

It doesn't really show anything but how to parallelize code; matrix multiplication is the "hello world" of parallel programming. You might want to read the conclusion of the article: "The first lesson is that parallelism can't always make up for a poor algorithm. The second lesson is that running code in parallel always has overhead (no matter what framework you're using), and even though Cilk++ gets that overhead very low, it can still be hugely significant if the overhead is bigger than the actual work we're doing (like it was when we tried to use a hyperobject in a tight inner loop). The third lesson is that shared-memory-multiprocessor hardware can have problems with memory bandwidth and false sharing, and we need to be aware of that."

chizow wrote: "chumbucket843: 'If most of PhysX's coding was done 10 years ago, then that's just sad. They really need a rewrite! This really puts them behind other physics middleware, especially since the others use SSE.' They need a rewrite according to whom? One segment of PhysX's target audience that runs adequately with x87? It's funny that you bring up other physics middleware, because if the conclusions Kanter came to were true, that SSE could result in a 2-4x speed-up, why are CPU PhysX effects perfectly comparable to what's available with CPU Havok or CPU Bullet? If SSE were as beneficial as you claim, I'd expect to see GPU PhysX-like effects on the CPU with quad-core, SSE-optimized Havok, but instead we just get the same CPU-accelerated physics we've seen for years. Maybe Intel is purposely hobbling their Havok code?"

I think they need a rewrite. Games are constantly getting better looking and becoming more like the real world; game physics 10 years ago is totally different than it is now. It's kind of brute force to use the same functions that are 10 years old on modern hardware. It's really your opinion on what looks better or not; the technicals behind a physics engine might be better, but the game might not make effective use of the physics. SSE is very effective at speeding up code over x87; it's like a small GPU on the CPU. You just won't get the speed-up you will with a GPU.

chizow wrote: "chumbucket843: 'He doesn't have much time; that's why he doesn't post as many articles as he used to.' lol, he had time to write a 6-page speculation piece on PhysX but didn't have the time to download and recompile Bullet and compare the results? Sounds like a bad oversight, or he's got his own agenda. I guess his research wasn't thorough enough; he probably knew the results might conflict with and contradict his conclusions, so he chose to leave them out."

I remember him posting something about analyzing physx.dll back in 2008 or so. The article could have been 10 pages if he had spent the time to analyze Bullet like he did with PhysX.
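On the "false sharing" lesson quoted above, here is a quick sketch of what that looks like in practice (illustrative only; the 64-byte padding assumes a typical cache line size):

    #include <thread>

    struct CountersBad  { long a; long b; };                 // a and b likely share one cache line
    struct CountersGood { long a; char pad[64]; long b; };   // padding pushes b onto a different line

    // Two threads hammering fields on the same cache line keep bouncing that line
    // between cores; with padding, each thread effectively owns its own line.
    template <class Counters>
    void hammer(Counters& c, long iters)
    {
        std::thread t1([&c, iters] { for (long i = 0; i < iters; ++i) ++c.a; });
        std::thread t2([&c, iters] { for (long i = 0; i < iters; ++i) ++c.b; });
        t1.join();
        t2.join();
    }

The same effect is one reason a naively multi-threaded loop like the matrix example can end up slower than the serial version, independent of any x87 versus SSE question.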
Core i7 D0 EVGA X58 LE EVGA GTX260\\folding 3x2 GB DDR3 *10 real cores folding
|
chizow
CLASSIFIED Member
- Total Posts : 3768
- Reward points : 0
- Joined: 1/28/2007
- Status: offline
- Ribbons : 30
Re:PhysX87: PhysX CPU Software Deficiency Slow-down Identified Using Intel's VTune Profile
Tuesday, July 13, 2010 10:27 PM
(permalink)
chumbucket843 wrote: "Here is optimized SGEMM on GPUs; in the later part of the thread, after the 58xx launched, they achieved 2.2 TFLOPS on a 5870. Utilization is even more efficient than the 4000 series. http://forum.beyond3d.com/showthread.php?t=54842 Not bad for single precision, eh? I would not consider doubling shaders every generation brute force; it's more conservative."

Uh, ya, that's pretty bad considering he was only able to achieve ~80% efficiency in a purely synthetic benchmark that ONLY sits there and performs matrix calculations perfectly suited to ATI's VLIW superscalar architecture. Unfortunately there are no real-world apps that achieve anything close to that, which is why you've once again resorted to linking some obscure post on an ATI shill site. What's next? Synthetic tests from AMDZone that show Phenom is faster than i7?

As for brute force, it's clearly a brute-force approach when they didn't bother to refine their pipeline at all, leading to the inefficiency and poor scaling we see from RV7x0 to RV8x0. We've seen numerous cases where Evergreen-based GPUs with the same or a greater number of SPs perform significantly *WORSE* than their predecessor parts despite the benefit of more SPs and/or higher clocks (see: 5870 vs 4870X2, 5830/5770 vs. 4890/4870, and 5450 vs 4550). I guess it's no surprise in that sense that Southern Islands is expected to address these inefficiencies and nothing else with regard to theoretical compute performance, which moves away from their previous brute-force approach of simply doubling inefficient SPs and hoping to extract additional performance: http://www.pcper.com/article.php?aid=945

PCPer: "The second area which will probably get some attention is that of stream computing efficiency. While the actual stream units will be unchanged from the previous generation, how the data is delivered to said stream units can be improved upon. The raw floating point performance of the HD 5870 is impressive, and achieving single precision rates of 2.7 TFLOPS is amazing. Unfortunately, actual throughput is typically less than even a previous generation NVIDIA GTX 285 in most GPGPU workloads. AMD will not be able to work miracles by changing around the non-stream portions of the design, but they should be able to improve throughput and compatibility."

chumbucket843 wrote: "Just because the arithmetic is +, -, *, / does not mean those are the only instructions. Nvidia's arch is heavily focused on RISC; it's a load-store architecture to reduce complexity. There are plenty of other instructions for comparing values, bitwise operations, string arithmetic, conditional jumps and other control flow, loads/stores from memory to the register file, and NOPs. By the way, kernels are functions, not instructions, and they can execute the instructions out of order with scoreboarding."

Again, all the GPU sees is simple machine code; the rest is handled by the GPU driver's compiler, which reduces high-level instructions to the machine code the GPU can execute. You do understand this, right? That's why the compiler is almost as important as the hardware itself, and it further illustrates the state of Nvidia's compiler for GPGPU applications compared to ATI's. The Nvidia compiler does more work to reduce high-level APIs to machine code, which is probably why their driver has more CPU overhead, but it also does a better job of increasing efficiency and throughput to make the best use of their TLP-oriented hardware. This is, again, in contrast to ATI's dependence on ILP, which requires either compiler tricks (which they're unwilling or unable to produce) or optimized code that lends itself well to their architecture in order to fully extract efficiency from their design decisions.

chumbucket843 wrote: "It doesn't really show anything but how to parallelize code; matrix multiplication is the 'hello world' of parallel programming. You might want to read the conclusion of the article: 'The first lesson is that parallelism can't always make up for a poor algorithm. The second lesson is that running code in parallel always has overhead (no matter what framework you're using), and even though Cilk++ gets that overhead very low, it can still be hugely significant if the overhead is bigger than the actual work we're doing (like it was when we tried to use a hyperobject in a tight inner loop). The third lesson is that shared-memory-multiprocessor hardware can have problems with memory bandwidth and false sharing, and we need to be aware of that.'"

No, it does an excellent job of showing that his attempts to parallelize code to speed up a simple problem resulted in *WORSE* results; the quoted portion simply states possible limitations, like overhead and buffer over-runs, that can occur when you try to make very simple problems overly complicated. Hmmm, could've sworn Nvidia said something similar with regard to re-writing or re-compiling PhysX and x87 vs. SSE...

chumbucket843 wrote: "I think they need a rewrite. Games are constantly getting better looking and becoming more like the real world; game physics 10 years ago is totally different than it is now. It's kind of brute force to use the same functions that are 10 years old on modern hardware. It's really your opinion on what looks better or not; the technicals behind a physics engine might be better, but the game might not make effective use of the physics. SSE is very effective at speeding up code over x87; it's like a small GPU on the CPU. You just won't get the speed-up you will with a GPU."

No, the algorithms, formulae and math behind physics have not changed in the last 10 years; the onus is on someone to show there is a demonstrated speed-up, or no change is needed. Even on dated code, PhysX has demonstrated improvements and speed-ups consistent with the pace of innovation for CPUs. Unfortunately, CPU advances have slowed to a crawl and come nowhere close to maintaining Moore's Law, while GPUs are actually outpacing Moore's Law in terms of speed and transistor size. Again, you can claim it's my opinion all you like, but the fact of the matter is that SSE-optimized Havok doesn't demonstrate any additional benefits or performance over x87-based PhysX, even when Intel has all the incentive in the world to make Havok look better. If you look at the capabilities of each CPU SDK, you will see they mirror each other almost identically in a list of bullet points, which is further reflected in actual production titles and their respective physics effects. The best use of either SDK I've played to date is Prince of Persia: The Forgotten Sands for Havok and Transformers: War for Cybertron for PhysX (CPU). Look at the physics effects in either: they're both decent and on par with one another, as neither stands out, but both are far inferior to GPU PhysX.

chumbucket843 wrote: "I remember him posting something about analyzing physx.dll back in 2008 or so. The article could have been 10 pages if he had spent the time to analyze Bullet like he did with PhysX."

Nah, it looks like it would've reduced his 6-page presumptuous article to ~1 page of discredited FUD. Apparently someone did make the effort to perform the tests Kanter decided to omit: http://forums.anandtech.com/showpost.php?p=30102764&postcount=46

Scali wrote: "I took the liberty of doing the Bullet test myself. I've downloaded the latest Bullet SDK (version 2.76). I then compiled it with Visual Studio 2008, with the default Bullet project settings, which use SSE. Then I added a new configuration, where I disabled SSE, but left all other options untouched, so I'd get a 'vanilla' x87 version. I then ran the included benchmarks on my Core2 Duo 3 GHz machine: SSE results. x87 results. As you can see, the difference is marginal at best. Sometimes x87 comes out on top. I see no indication of 1.5-2x speedup with the SSE code anywhere. If anyone wants to try it on their PC, you can download my precompiled binaries here: http://bohemiq.scali.eu.org/bullet/bullet-2.76-x87-sse.zip And if you don't trust me... well, it's open source, you can build it yourself."

You'll find his simple re-compile from SSE to x87 brought nowhere near the 1.5-2x speed-up, virtually no difference at all, and actually showed x87 was indeed faster in some cases.
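For anyone who wants to run that sort of x87-vs-SSE comparison on their own code, a minimal timing harness looks something like the sketch below (not Scali's code; the routine under test, array sizes and repetition count are placeholders). Build it twice with the compiler flags discussed earlier in the thread and compare the two binaries:

    #include <cstdio>
    #include <ctime>

    // Placeholder for whatever floating-point routine is being compared across the two builds.
    static float work(const float* a, const float* b, int n)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)
            acc += a[i] * b[i] + a[i];
        return acc;
    }

    int main()
    {
        const int n = 1 << 16;
        static float a[1 << 16], b[1 << 16];
        for (int i = 0; i < n; ++i) {
            a[i] = i * 0.001f;
            b[i] = (n - i) * 0.002f;
        }

        volatile float sink = 0.0f;   // keeps the optimizer from removing the benchmark loop
        std::clock_t start = std::clock();
        for (int rep = 0; rep < 2000; ++rep)
            sink = sink + work(a, b, n);
        double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::printf("%.3f s (checksum %f)\n", seconds, double(sink));
        return 0;
    }

As with the Bullet numbers above, the interesting part is not the absolute time but whether it moves at all when only the instruction set changes and the data layout stays scalar.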
Intel Core i7 5930K @4.5GHz | Gigabyte X99 Gaming 5 | Win8.1 Pro x64 | Corsair H105 2x Nvidia GeForce Titan X SLI | Asus ROG Swift 144Hz 3D Vision G-Sync LCD | 2xDell U2410 | 32GB Acer XPG DDR4 2800 Samsung 850 Pro 256GB | Samsung 840EVO 4x1TB RAID 0 | Seagate 2TB SSHD Yamaha VSX-677 A/V Receiver | Polk Audio RM6880 7.1 | LG Super Multi Blu-Ray Auzen X-Fi HT HD | Logitech G710/G502/G27/G930 | Corsair Air 540 | EVGA SuperNOVA P2 1200W
|