EVGA

Troubleshooting missing memory on SR-2

Author
sfield
Superclocked Member
  • Total Posts : 143
  • Reward points : 0
  • Joined: 2009/07/19 17:57:10
  • Status: offline
  • Ribbons : 3
2010/10/24 20:42:14 (permalink)
I've been trying to figure out why my system doesn't consistently boot with NUMA turned off.  In the cases where the system does boot successfully, I only see 20 out of 24GB of memory -- one 4GB module is "taken out". 

I haven't yet figured out exactly why this is happening with NUMA off, but the SR-2 is known for removing memory when timings are too tight.  Having said that, a number of people have complained of memory not being visible in Windows, even though the DIMM and SPD information shown by tools such as CPU-Z clearly shows the presence of all DIMMs.
 
I've come up with a (manual) way to determine which DIMMs get "taken out" by the early BIOS startup code in the SR-2.  This should be better than guessing which module(s) are causing problems through the trial-and-error approach of swapping/removing modules.

So, before getting into how you can (manually) tell which modules have been "taken out", a bit of education is needed.

The SR-2 supports two physical processors.
  • Each physical processor has various functions that can be accessed through a standard PCI addressing mechanism.  The processor closest to the motherboard back-panel can be addressed on bus 255.  The other processor is addressed on bus 254.
  • Each physical processor has 3 memory channels.
  • Each memory channel can support 3 DIMMs in the memory controller, although the SR-2 only physically supports 2 DIMMs per memory channel.  (2 processors) * ((3 memory channels) * (2 modules per channel)) = 12 modules total.
In the cases where I observed the SR-2 "taking out" memory modules, the DIMMPRESENT bit (bit 9) was not set in the DIMM Organization Descriptor Register (e.g. MC_DOD_CH0_0) for a slot that actually contained a DIMM.  That DIMM had also been partially configured, so some forward progress had been made while the memory controller was being initialized.  A DIMMPRESENT bit of zero for a populated slot is therefore an indication of which DIMMs the startup code didn't like for some reason.

There are 3 PCI devices (one per memory channel) for each processor that we are interested in querying to determine the DIMM status.  We want to look at registers 0x48 and 0x4c on each of the devices below:
Location bus 254 (0xFE), device 4 (0x04), function 1 (0x01)
Location bus 254 (0xFE), device 5 (0x05), function 1 (0x01)
Location bus 254 (0xFE), device 6 (0x06), function 1 (0x01)
Location bus 255 (0xFF), device 4 (0x04), function 1 (0x01)
Location bus 255 (0xFF), device 5 (0x05), function 1 (0x01)
Location bus 255 (0xFF), device 6 (0x06), function 1 (0x01)
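
As a rough sketch of the list above, the snippet below enumerates every bus/device/function/register combination worth checking and builds the text to search for in the saved report. The function names (dimm_register_locations, cpuz_search_string) are made up for illustration, and the search string simply mirrors the "bus 254 (0xFE), device 4 (0x04), function 1 (0x01)" wording used in this post; it is an assumption that your report uses exactly the same wording.

```python
from itertools import product

BUSES = (0xFE, 0xFF)          # bus 254 and bus 255, one per processor
CHANNEL_DEVICES = (4, 5, 6)   # one PCI device per memory channel
FUNCTION = 1
DOD_REGISTERS = (0x48, 0x4C)  # the two registers of interest per channel


def dimm_register_locations():
    """Yield (bus, device, function, register) for every combination."""
    for bus, device, register in product(BUSES, CHANNEL_DEVICES, DOD_REGISTERS):
        yield (bus, device, FUNCTION, register)


def cpuz_search_string(bus, device, function):
    """Build the text to search for, in the wording used in this post."""
    return (f"bus {bus} (0x{bus:02X}), device {device} (0x{device:02X}), "
            f"function {function} (0x{function:02X})")


locations = list(dimm_register_locations())
for bus, device, function, register in locations:
    print(cpuz_search_string(bus, device, function), f"register 0x{register:02X}")
```

That gives 2 buses * 3 devices * 2 registers = 12 entries, one per module slot on the board.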

To start, fire up CPU-Z, and go to the About tab.  Save off a report to a .TXT file, and open that file:


Search for one of the devices above "bus 254 (0xFE), device 4 (0x04), function 1 (0x01)":


 
You can convert the contents of registers 0x48 and 0x4c to binary.  Note that the register dump is in byte order (lowest address first), so you'll want to reverse each group of four bytes to get the swapped DWORD value before converting.  Then check whether the DIMMPRESENT bit (bit 9) is 1 (present) or 0 (not present/"taken out").
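
If you'd rather not do the byte swap and binary conversion by hand, here is a minimal sketch of it. It assumes the four bytes are copied from the dump in the order they appear (lowest address first). The sample byte strings are invented purely for illustration and are not values from a real SR-2 report, and the helper names are hypothetical.

```python
DIMMPRESENT_BIT = 9  # bit position given in this post


def dword_from_dump(dump_bytes: str) -> int:
    """Turn four dump bytes (lowest address first) into a 32-bit value."""
    raw = bytes.fromhex(dump_bytes)
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    # Little-endian interpretation performs the byte swap described above.
    return int.from_bytes(raw, byteorder="little")


def dimm_present(dword: int) -> bool:
    """Return True when the DIMMPRESENT bit (bit 9) is set."""
    return bool((dword >> DIMMPRESENT_BIT) & 1)


# Invented sample values, not taken from a real report.
for sample in ("25 02 00 00", "25 00 00 00"):
    value = dword_from_dump(sample)
    print(f"{sample} -> 0x{value:08X} -> {value:032b} "
          f"-> DIMMPRESENT={int(dimm_present(value))}")
```

The first sample becomes 0x00000225, which has bit 9 set; the second becomes 0x00000025, which does not.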


 
You'd want to repeat the search above in the saved .TXT file to cover all 6 devices, checking both registers on each (12 device and register combinations in total).  You can pick which combinations to check by comparing the modules you have installed against the module overlay in the photo below, or you can just check all of the values.  With that, you can tell which module(s) associated with a particular CPU get "taken out".
Location bus 254 (0xFE), device 4 (0x04), function 1 (0x01)
Location bus 254 (0xFE), device 5 (0x05), function 1 (0x01)
Location bus 254 (0xFE), device 6 (0x06), function 1 (0x01)
Location bus 255 (0xFF), device 4 (0x04), function 1 (0x01)
Location bus 255 (0xFF), device 5 (0x05), function 1 (0x01)
Location bus 255 (0xFF), device 6 (0x06), function 1 (0x01)
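
Once you have copied the register bytes out of the report for each location, the whole check can be sketched as below. Everything here is hypothetical: the readings dictionary is invented sample data, the taken_out_slots helper is just a name chosen for illustration, and you have to fill in which slots you have physically populated yourself.

```python
DIMMPRESENT_MASK = 1 << 9  # bit 9, as described in this post


def taken_out_slots(readings, populated):
    """Return populated slots whose DIMMPRESENT bit is 0.

    readings:  {(bus, device, register): "b0 b1 b2 b3"} copied from the dump,
               bytes in the order they appear (lowest address first)
    populated: set of (bus, device, register) keys that physically hold a DIMM
    """
    missing = []
    for slot in sorted(populated):
        value = int.from_bytes(bytes.fromhex(readings[slot]), "little")
        if not value & DIMMPRESENT_MASK:
            missing.append(slot)
    return missing


# Invented sample data, not values from a real SR-2 report.
readings = {
    (0xFE, 4, 0x48): "25 02 00 00",
    (0xFE, 4, 0x4C): "25 00 00 00",
    (0xFF, 4, 0x48): "25 02 00 00",
}
populated = set(readings)

for bus, device, register in taken_out_slots(readings, populated):
    print(f"bus {bus}, device {device}, function 1, register 0x{register:02X}: "
          "populated but DIMMPRESENT is 0")
```

Only slots you know contain a DIMM are checked, since an empty slot would also have DIMMPRESENT set to 0.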
 
Here's a photo with the bus, device, function, and register mappings overlaid on top of each module slot:

 
Ideally, I'd like to see EVGA do a few things here:
  • Provide more control over what happens in the early memory controller initialization phase.  For instance, it should be possible to ignore certain errors to allow problems to be diagnosed later in the boot sequence.
  • Include more logging or output to the motherboard status device as to what problems are detected that cause memory to be "taken out".
  • Eleet could be modified to include more information on the status of installed memory modules, so we can tell if certain modules were "taken out".  Eleet already has access to sufficient low-level information to query the above and provide more context.
 
post edited by safield - 2010/10/24 23:58:29


#1


    ty_ger07
    Insert Custom Title Here
    • Total Posts : 19793
    • Reward points : 0
    • Joined: 2008/04/10 23:48:15
    • Location: traveler
    • Status: offline
    • Ribbons : 242
    Re:Troubleshooting missing memory on SR-2 2010/10/24 20:49:02 (permalink)

     
    BR!
     


     

    #2
    may i be worthy
    iCX Member
    • Total Posts : 263
    • Reward points : 0
    • Joined: 2010/07/27 00:20:41
    • Status: offline
    • Ribbons : 2
    Re:Troubleshooting missing memory on SR-2 2010/10/24 21:37:04 (permalink)
    That is awesome. Wish I had this a month ago.
     
    But seriously awesome research.


    SR-2 #3 -Folding/render: Dual Hexacore X5680: @4.301GHz, 12GB @ 2:10 DDR 1850.
    SR-2 #2 -Folding/render: Dual Hexacore X5660 @4.301GHz, 12GB @ 2:10 DDR 1850. | P2686 : 162,850 ppd  
    SR-2 #1 My main work rig. Dual Hexacore X5650 @4.202GHz, 24GB @ 2:8, All aircooled: Noctua DH-14 

    #3
    farthestkris
    CLASSIFIED Member
    • Total Posts : 4109
    • Reward points : 0
    • Joined: 2009/03/30 15:54:27
    • Location: Crunching Location .......... Location Not Found
    • Status: offline
    • Ribbons : 12
    Re:Troubleshooting missing memory on SR-2 2010/10/25 06:36:41 (permalink)

    Set your back-to-back CAS latency to 3 or 5.

     
     
     
    #4
    Dariusz1989
    Superclocked Member
    • Total Posts : 210
    • Reward points : 0
    • Joined: 2011/08/08 04:08:43
    • Location: UK
    • Status: offline
    • Ribbons : 0
    Re:Troubleshooting missing memory on SR-2 2011/08/26 08:09:38 (permalink)
    Heya
     
    OK, that tells me how to find which RAM module is missing... but how do I fix it afterwards? I've got exactly the same problem: I can't see my 24 GB of RAM...

    www.dariuszmakowski.com
    EVGA SR 2 
    2x Xeon 5670 @ 4.0
    48 gb 8x4 Tripple channel 1600  
    1x Ocz  Vertex 3 120 gb SSD OS
    4x WD RE 3 500 gb 2x2 Raid 0
    EVGA GTX 680 4GB Classified  
    Enermax Revolution 1050w 
    Xigmatec Elysium 3x 120 on top / 2x 120 on bottom / 140 back / 2x 120 front / 1x 240 side / 4x CPU 120mm - 2x noctua 50cf + 2x scythe 100 cf
     
    #5
    nikkocortez
    CLASSIFIED Member
    • Total Posts : 2886
    • Reward points : 0
    • Joined: 2010/02/01 10:04:03
    • Status: offline
    • Ribbons : 14
    Re:Troubleshooting missing memory on SR-2 2011/08/26 16:05:02 (permalink)
    I think I learned something... now if I only knew what it was... LoL  I'm not too familiar with hex or even binary coding, but what you dug up is neat.  I really want to learn this stuff now so I can understand what all of my money is doing in my computers.  Hats off to you, because it's kind of inspiring me to put my GI Bill to work!
    #6
    farthestkris
    CLASSIFIED Member
    • Total Posts : 4109
    • Reward points : 0
    • Joined: 2009/03/30 15:54:27
    • Location: Crunching Location .......... Location Not Found
    • Status: offline
    • Ribbons : 12
    Re:Troubleshooting missing memory on SR-2 2011/08/26 16:20:47 (permalink)
    Dariusz1989

    Heya

    OK, that tells me how to find which RAM module is missing... but how do I fix it afterwards? I've got exactly the same problem: I can't see my 24 GB of RAM...

    It's a settings issue. Simply set your DIMM voltage to 1.65 V and your VTT to 1.3 V. Your voltages may vary, as always.

     
     
     
    #7
    jabloomf1230
    Superclocked Member
    • Total Posts : 132
    • Reward points : 0
    • Joined: 2009/03/04 15:40:47
    • Status: offline
    • Ribbons : 0
    Re:Troubleshooting missing memory on SR-2 2011/11/01 15:10:17 (permalink)
    I know that this is an older thread, but sfield's methodical approach using the CPU-Z register dump helped me isolate a DIMM that was acting up. Try this approach first, if you are not seeing your expected complement of memory.
    #8