I've been trying to figure out why my system doesn't consistently boot with NUMA turned off. In the cases where the system does boot successfully, I only see 20 out of 24GB of memory -- one 4GB module is "taken out".
I haven't yet figured out why exactly this is happening with NUMA off, but the SR-2 is known for removing memory when timings are too tight. Having said that, a number of people have complained of memory not being visible in Windows, even though the DIMM and SPD information shown by tools such as CPU-Z clearly shows that all DIMMs are present.
I've come up with a (manual) way to determine which DIMMs get "taken out" by the early BIOS startup code in the SR-2. This should be better than guessing which module(s) are causing problems, or falling back on the trial-and-error approach of swapping/removing modules.
So, before getting into how you can (manually) tell which modules have been "taken out", a bit of education is needed.
The SR-2 supports two physical processors.
- Each physical processor has various functions that can be accessed through a standard PCI addressing mechanism. The processor closest to the motherboard back-panel can be addressed on bus 255. The other processor is addressed on bus 254.
- Each physical processor has 3 memory channels.
- The memory controller can support 3 DIMMs per memory channel, although the SR-2 only physically provides 2 DIMM slots per channel. (2 processors) * (3 memory channels) * (2 modules per channel) = 12 modules total.
In cases where I observed the SR-2 "taking out" memory modules, the DIMMPRESENT bit (bit 9) was not set in the DIMM Organization Descriptor Register (e.g. MC_DOD_CH0_0) for a slot that actually contained a DIMM. That DIMM had also been partially configured, so the memory controller initialization had made some forward progress before giving up on it. A DIMMPRESENT bit of zero for a populated slot is therefore an indication of which DIMMs the startup code didn't like for some reason.
To determine the DIMM status, we want to query 3 PCI devices (one per memory channel) for each processor. We want to look at registers 0x48 and 0x4c on each of the devices below:
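As a rough illustration of the bit test described above, here is a minimal Python sketch. The function name and the sample values are made up for illustration; only the bit position (bit 9) comes from the description above.

```python
DIMMPRESENT_BIT = 9  # bit position of DIMMPRESENT, as stated above


def dimm_present(dod_value: int) -> bool:
    """Return True if bit 9 (DIMMPRESENT) is set in a 32-bit MC_DOD value."""
    return bool((dod_value >> DIMMPRESENT_BIT) & 1)


# Illustrative values only, not read from a real board.
print(dimm_present(0x00000200))  # bit 9 set
print(dimm_present(0x000000A5))  # other bits set, bit 9 clear
```

The second value mimics the "partially configured" case: some fields are non-zero, but the DIMMPRESENT bit is clear.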
Location bus 254 (0xFE), device 4 (0x04), function 1 (0x01)
Location bus 254 (0xFE), device 5 (0x05), function 1 (0x01)
Location bus 254 (0xFE), device 6 (0x06), function 1 (0x01)
Location bus 255 (0xFF), device 4 (0x04), function 1 (0x01)
Location bus 255 (0xFF), device 5 (0x05), function 1 (0x01)
Location bus 255 (0xFF), device 6 (0x06), function 1 (0x01)
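If you'd rather not keep this list in your head, the full set of locations and registers can be generated. This is only a sketch: the constant and function names are my own, and it assumes each of the two registers maps to one DIMM slot on that channel.

```python
from itertools import product

BUSES = (0xFE, 0xFF)      # bus 254 and bus 255, one per processor
DEVICES = (4, 5, 6)       # one PCI device per memory channel
FUNCTION = 1
REGISTERS = (0x48, 0x4C)  # assumed: one register per DIMM slot on the channel


def dimm_register_targets():
    """List every (bus, device, function, register) combination to inspect."""
    return [(bus, dev, FUNCTION, reg)
            for bus, dev, reg in product(BUSES, DEVICES, REGISTERS)]


for bus, dev, fn, reg in dimm_register_targets():
    print(f"bus {bus} (0x{bus:02X}), device {dev} (0x{dev:02X}), "
          f"function {fn} (0x{fn:02X}), register 0x{reg:02X}")
```

That gives 12 entries, which matches the 12 module slots on the board.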
To start, fire up CPU-Z, and go to the About tab. Save a report to a .TXT file, and open that file.
Search for one of the devices above, e.g. "bus 254 (0xFE), device 4 (0x04), function 1 (0x01)".
Convert the contents of registers 0x48 and 0x4c to binary. Note the register dump is in byte order (least significant byte first), so you'll need to reverse the four bytes to get the DWORD value. Then check whether the DIMMPRESENT bit (bit 9) is 1 (present) or 0 (not present/"taken out").
Repeat the search above in the saved .TXT file to cover all 6 devices and both registers on each. You can pick which device and register combinations to check by comparing the modules you have installed against the module overlay in the photo below, or you can just check all of the values. With that, you can tell which module(s) associated with a particular CPU get "taken out".
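The byte swap and bit check can be done with a few lines of Python instead of by hand. This is a sketch under assumptions: the helper name is hypothetical, the sample bytes are invented, and it expects you to paste in the four bytes for one register exactly as they appear in the dump.

```python
def dword_from_dump(hex_bytes: str) -> int:
    """Turn four dump-order bytes (e.g. '25 02 00 00') into a DWORD value.

    The dump lists the least significant byte first, so the bytes are
    interpreted as little-endian.
    """
    raw = bytes.fromhex(hex_bytes)
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return int.from_bytes(raw, "little")


value = dword_from_dump("25 02 00 00")  # invented sample, not a real reading
print(f"DWORD:  0x{value:08X}")
print(f"binary: {value:032b}")
print("DIMMPRESENT:", "1 (present)" if value & (1 << 9) else "0 (not present)")
```

Here the dump-order bytes 25 02 00 00 become the DWORD 0x00000225, which has bit 9 set.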
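Checking all of the values by hand gets tedious, so here is a sketch of how the whole survey could be automated once you have each device's 256-byte configuration space as raw bytes. How you get those bytes out of the CPU-Z report is left out on purpose, since that depends on the report layout; the `survey` function and its input format are hypothetical, and the data below is synthetic.

```python
import struct

DOD_OFFSETS = (0x48, 0x4C)  # the two registers of interest on each device
DIMMPRESENT = 1 << 9        # bit 9


def survey(config_spaces):
    """Map (bus, device, function, register) -> DIMMPRESENT flag.

    config_spaces: dict of (bus, device, function) -> config space bytes.
    """
    results = {}
    for location, cfg in config_spaces.items():
        for offset in DOD_OFFSETS:
            # '<I' reads a little-endian unsigned 32-bit value.
            (value,) = struct.unpack_from("<I", cfg, offset)
            results[location + (offset,)] = bool(value & DIMMPRESENT)
    return results


# Synthetic config space for one device: slot at 0x48 present, 0x4C not.
cfg = bytearray(256)
struct.pack_into("<I", cfg, 0x48, 0x00000225)
struct.pack_into("<I", cfg, 0x4C, 0x00000025)

report = survey({(0xFE, 4, 1): bytes(cfg)})
for (bus, dev, fn, reg), present in sorted(report.items()):
    print(f"bus 0x{bus:02X} dev {dev} fn {fn} reg 0x{reg:02X}: "
          f"{'present' if present else 'not present'}")
```

A "not present" result only points at a "taken out" module if that slot physically holds a DIMM, so compare the output against what you actually have installed.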
Here's a photo with the bus, device, function, and register mappings overlaid on top of each module slot:
Ideally, I'd like to see EVGA do a few things here:
- Provide more control over what happens in the early memory controller initialization phase. For instance, it should be possible to ignore certain errors to allow problems to be diagnosed later in the boot sequence.
- Include more logging or output to the motherboard status device as to what problems are detected that cause memory to be "taken out".
- Modify Eleet to include more information on the status of installed memory modules, so we can tell if certain modules were "taken out". Eleet already has access to sufficient low-level information to query the above and provide more context.
post edited by safield - 2010/10/24 23:58:29