Supermicro X9SCM-F Issues

Post date: Oct 10, 2012 11:04:43 AM

My experience with the Supermicro X9SCM-F motherboard

This post details the issues I have had with this motherboard model. I have used other Supermicro motherboards (X7SPA-F and X6DTH-F) without any problems. This board, though, has a lot of issues, and I have sent out several e-mails about them over the past 2-6 months. I am summarizing them all here so they can be found in one place, and to ask Supermicro to take a look.

Key issues:

1. The board does not always address 32GB of RAM if you are using an Ivy Bridge chip on this motherboard

Summary: use BIOS 2.0a, not BIOS 2.0b (released last Friday, 10/05/2012). BIOS 2.0b shows 32GB, but the machine is unstable and cannot address more than 16GB.

With 2.0a (the BIOS before 2.0b, and also the first public BIOS to support Ivy Bridge CPUs), it only sees 16GB of the 32GB of RAM:

top - 17:35:59 up 3 days, 3:13, 33 users, load average: 0.14, 0.12, 0.07
Tasks: 269 total, 1 running, 268 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 99.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16418012 total, 11244420 used, 5173592 free, 3446988 buffers
KiB Swap: 31248972 total, 0 used, 31248972 free, 6152428 cached
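To catch this quickly after a BIOS change, the kernel's view of RAM can be compared against what is physically installed. This is only a minimal Python sketch, assuming the usual Linux /proc/meminfo layout. The function names and the 90% tolerance are my own illustrative choices, not anything from Supermicro:

```python
import re

def mem_total_kib(meminfo_text):
    """Return the MemTotal value (in KiB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))

def sees_installed_ram(total_kib, installed_gib, min_fraction=0.9):
    """True if the kernel sees at least min_fraction of the installed RAM.

    The kernel always reports somewhat less than what is installed, so the
    0.9 default is an assumed tolerance, not a documented value.
    """
    return total_kib >= installed_gib * 1024 * 1024 * min_fraction

# Sample value taken from the top output above (BIOS 2.0a, 32GB installed).
sample = "MemTotal:       16418012 kB\n"
print(sees_installed_ram(mem_total_kib(sample), 32))
```

On a live system, the text would come from reading /proc/meminfo instead of the sample string.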

With 2.0b, it sees 32GB all of the time, but is unstable:

This is the first time I've ever seen memtest86+ crash: "Unexpected interrupt - Halting"

I recorded a video of memtest86+ running and the box rebooting itself due to this bug in the 2.0b BIOS.

2. e1000e network problems

Summary: the first port of the on-board NIC is not reliable

a) During heavy network I/O (file copy) on eth0 the network latency jumps to 300-1000ms+ every 4-5 seconds (it does not do this on a separate card)

http://permalink.gmane.org/gmane.linux.kernel/1372917

This bug (random lag during high network I/O) has been bothering me for months on my X9SCM-F motherboard. There is a lot of discussion about this problem here:

http://sourceforge.net/p/e1000/bugs/27/?page=4

I tried the EEPROM fix, but it did not work:

http://sourceforge.net/projects/e1000/files/e1000e%20stable/eeprom_fix_82574_or_82583/
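According to the readme quoted just below, the EEPROM fix comes down to one bit in the byte at offset 0x001e. The following is a minimal Python sketch of that check, assuming the `ethtool -e` hex dump layout of an offset label followed by 16 bytes per line. The function names are mine and are not part of the e1000e tools:

```python
def eeprom_byte(dump_line, offset):
    """Return the byte at `offset` from one line of `ethtool -e` hex dump output.

    Assumes lines look like '0x0010: ff ff ... a5' with 16 bytes per line.
    """
    label, _, data = dump_line.partition(":")
    base = int(label.strip(), 16)
    values = [int(b, 16) for b in data.split()]
    index = offset - base
    if not 0 <= index < len(values):
        raise ValueError("offset 0x%04x is not on this line" % offset)
    return values[index]

def eeprom_fix_applied(value):
    """True if bit 1 (mask 0x02) is set, which is the state the readme asks for."""
    return bool(value & 0x02)

# The dump line from my board, after applying the fix.
line = "0x0010: ff ff ff ff 6b 02 00 00 d9 15 d3 10 ff ff 5a a5"
value = eeprom_byte(line, 0x001e)
print("0x%02x" % value, eeprom_fix_applied(value))
```

A value of 0x58 has bit 1 unset and would fail this check, while 0x5a passes it, which matches what the readme describes.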

The readme states:

"The value at offset 0x001e (58) has bit 1 unset. This enables the problematic power saving feature. In this case, the EEPROM needs to read "5a" at offset 0x001e."

# ethtool -e eth4 |grep 0x0010
0x0010: ff ff ff ff 6b 02 00 00 d9 15 d3 10 ff ff 5a a5

Yes I did reboot.

Here is what the problem looks like, during a SAMBA copy from A->B where B=X9SCM running Linux:

(ONBOARD Intel eth0 / 82574L)

$ ping windowspc
PING windowspc (192.168.0.1) 56(84) bytes of data.
64 bytes from windowspc (192.168.0.1): icmp_req=1 ttl=128 time=0.544 ms
64 bytes from windowspc (192.168.0.1): icmp_req=2 ttl=128 time=0.193 ms
64 bytes from windowspc (192.168.0.1): icmp_req=3 ttl=128 time=0.619 ms
64 bytes from windowspc (192.168.0.1): icmp_req=4 ttl=128 time=0.642 ms
64 bytes from windowspc (192.168.0.1): icmp_req=5 ttl=128 time=0.426 ms
64 bytes from windowspc (192.168.0.1): icmp_req=6 ttl=128 time=0.464 ms
64 bytes from windowspc (192.168.0.1): icmp_req=7 ttl=128 time=0.696 ms
64 bytes from windowspc (192.168.0.1): icmp_req=8 ttl=128 time=1353 ms
64 bytes from windowspc (192.168.0.1): icmp_req=9 ttl=128 time=353 ms
64 bytes from windowspc (192.168.0.1): icmp_req=10 ttl=128 time=0.492 ms
64 bytes from windowspc (192.168.0.1): icmp_req=11 ttl=128 time=0.618 ms
64 bytes from windowspc (192.168.0.1): icmp_req=12 ttl=128 time=0.474 ms
64 bytes from windowspc (192.168.0.1): icmp_req=13 ttl=128 time=0.542 ms
64 bytes from windowspc (192.168.0.1): icmp_req=14 ttl=128 time=0.471 ms
64 bytes from windowspc (192.168.0.1): icmp_req=15 ttl=128 time=0.645 ms
64 bytes from windowspc (192.168.0.1): icmp_req=16 ttl=128 time=0.394 ms
64 bytes from windowspc (192.168.0.1): icmp_req=17 ttl=128 time=0.537 ms
64 bytes from windowspc (192.168.0.1): icmp_req=18 ttl=128 time=0.706 ms
64 bytes from windowspc (192.168.0.1): icmp_req=19 ttl=128 time=0.465 ms
64 bytes from windowspc (192.168.0.1): icmp_req=20 ttl=128 time=0.707 ms
64 bytes from windowspc (192.168.0.1): icmp_req=21 ttl=128 time=348 ms
64 bytes from windowspc (192.168.0.1): icmp_req=22 ttl=128 time=0.703 ms
64 bytes from windowspc (192.168.0.1): icmp_req=23 ttl=128 time=0.560 ms
64 bytes from windowspc (192.168.0.1): icmp_req=24 ttl=128 time=0.554 ms
64 bytes from windowspc (192.168.0.1): icmp_req=25 ttl=128 time=0.585 ms
64 bytes from windowspc (192.168.0.1): icmp_req=26 ttl=128 time=0.508 ms
64 bytes from windowspc (192.168.0.1): icmp_req=27 ttl=128 time=345 ms
64 bytes from windowspc (192.168.0.1): icmp_req=28 ttl=128 time=0.374 ms
64 bytes from windowspc (192.168.0.1): icmp_req=29 ttl=128 time=0.728 ms
64 bytes from windowspc (192.168.0.1): icmp_req=30 ttl=128 time=0.537 ms
64 bytes from windowspc (192.168.0.1): icmp_req=31 ttl=128 time=0.190 ms
64 bytes from windowspc (192.168.0.1): icmp_req=32 ttl=128 time=0.204 ms
64 bytes from windowspc (192.168.0.1): icmp_req=33 ttl=128 time=0.239 ms

Same test (copy test) with samba as above but now with an Intel 4-port NIC:

$ ping windowspc
64 bytes from windowspc (192.168.0.1): icmp_req=1 ttl=128 time=0.175 ms
64 bytes from windowspc (192.168.0.1): icmp_req=2 ttl=128 time=0.332 ms
64 bytes from windowspc (192.168.0.1): icmp_req=3 ttl=128 time=0.276 ms
64 bytes from windowspc (192.168.0.1): icmp_req=4 ttl=128 time=0.221 ms
64 bytes from windowspc (192.168.0.1): icmp_req=5 ttl=128 time=0.518 ms
64 bytes from windowspc (192.168.0.1): icmp_req=6 ttl=128 time=0.157 ms
64 bytes from windowspc (192.168.0.1): icmp_req=7 ttl=128 time=0.222 ms
64 bytes from windowspc (192.168.0.1): icmp_req=8 ttl=128 time=0.605 ms
64 bytes from windowspc (192.168.0.1): icmp_req=9 ttl=128 time=0.335 ms
64 bytes from windowspc (192.168.0.1): icmp_req=10 ttl=128 time=0.679 ms
64 bytes from windowspc (192.168.0.1): icmp_req=11 ttl=128 time=0.223 ms
64 bytes from windowspc (192.168.0.1): icmp_req=12 ttl=128 time=0.189 ms
64 bytes from windowspc (192.168.0.1): icmp_req=13 ttl=128 time=0.432 ms
64 bytes from windowspc (192.168.0.1): icmp_req=14 ttl=128 time=0.235 ms
64 bytes from windowspc (192.168.0.1): icmp_req=15 ttl=128 time=0.386 ms
64 bytes from windowspc (192.168.0.1): icmp_req=16 ttl=128 time=0.658 ms
64 bytes from windowspc (192.168.0.1): icmp_req=17 ttl=128 time=0.430 ms
64 bytes from windowspc (192.168.0.1): icmp_req=18 ttl=128 time=0.494 ms
64 bytes from windowspc (192.168.0.1): icmp_req=19 ttl=128 time=0.411 ms
64 bytes from windowspc (192.168.0.1): icmp_req=20 ttl=128 time=0.737 ms
64 bytes from windowspc (192.168.0.1): icmp_req=21 ttl=128 time=0.543 ms
64 bytes from windowspc (192.168.0.1): icmp_req=22 ttl=128 time=0.564 ms
64 bytes from windowspc (192.168.0.1): icmp_req=23 ttl=128 time=0.571 ms
64 bytes from windowspc (192.168.0.1): icmp_req=24 ttl=128 time=0.407 ms
64 bytes from windowspc (192.168.0.1): icmp_req=25 ttl=128 time=0.518 ms
64 bytes from windowspc (192.168.0.1): icmp_req=26 ttl=128 time=0.482 ms
64 bytes from windowspc (192.168.0.1): icmp_req=27 ttl=128 time=0.904 ms
64 bytes from windowspc (192.168.0.1): icmp_req=28 ttl=128 time=0.478 ms
64 bytes from windowspc (192.168.0.1): icmp_req=29 ttl=128 time=1.16 ms
64 bytes from windowspc (192.168.0.1): icmp_req=30 ttl=128 time=0.656 ms
64 bytes from windowspc (192.168.0.1): icmp_req=31 ttl=128 time=0.613 ms
64 bytes from windowspc (192.168.0.1): icmp_req=32 ttl=128 time=0.475 ms
64 bytes from windowspc (192.168.0.1): icmp_req=33 ttl=128 time=0.562 ms

So it appears, at the moment, if you have problems with eth0, try using eth1 or buy another network card.
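The spikes in the on-board NIC run can also be picked out automatically rather than by eye. This is a minimal Python sketch that pulls the round-trip times out of ping output and flags anything above a threshold. The 100 ms threshold and the function names are my own assumptions for illustration:

```python
import re

def ping_times_ms(ping_output):
    """Extract (sequence number, round-trip time in ms) pairs from ping output."""
    # Older iputils prints icmp_req, newer versions print icmp_seq.
    pattern = re.compile(r"icmp_[rs]eq=(\d+) ttl=\d+ time=([\d.]+) ms")
    return [(int(seq), float(ms)) for seq, ms in pattern.findall(ping_output)]

def latency_spikes(samples, threshold_ms=100.0):
    """Return the samples whose round-trip time exceeds threshold_ms."""
    return [(seq, ms) for seq, ms in samples if ms > threshold_ms]

# A few lines from the on-board eth0 run above.
sample = """\
64 bytes from windowspc (192.168.0.1): icmp_req=7 ttl=128 time=0.696 ms
64 bytes from windowspc (192.168.0.1): icmp_req=8 ttl=128 time=1353 ms
64 bytes from windowspc (192.168.0.1): icmp_req=9 ttl=128 time=353 ms
64 bytes from windowspc (192.168.0.1): icmp_req=10 ttl=128 time=0.492 ms
"""
print(latency_spikes(ping_times_ms(sample)))
```

Run against the full on-board capture, this would flag requests 8, 9, 21 and 27. The 4-port NIC capture has nothing above 1.16 ms.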

b) During heavy network I/O (a file copy of over 600GB of files), the kernel disabled the network IRQ on eth0, which took the server off the network (it has not done this on a separate card, yet)

https://lkml.org/lkml/2012/10/8/374

Kernel: 3.6.0 (x86_64)

Distribution: Debian Testing

I was copying 600GB of files from Samba/Linux to a Windows host. It copied around 500GB, then this happened: it disabled the network interface on my Supermicro X9SCM-F board (on-board), the first interface. Any idea why this happened?

[93593.565667] irq 44: nobody cared (try booting with the "irqpoll" option)
[93593.565673] Pid: 0, comm: swapper/0 Not tainted 3.6.0 #4
[93593.565675] Call Trace:
[ 0.971861] ACPI: Invalid Power Resource to register!
[93593.565677] <IRQ> [<ffffffff810a6fe1>] __report_bad_irq+0x31/0xd0
[93593.565690] [<ffffffff810a72c3>] note_interrupt+0x1a3/0x1f0
[93593.565694] [<ffffffff810a4e59>] handle_irq_event_percpu+0x89/0x160
[93593.565697] [<ffffffff810a4f6c>] handle_irq_event+0x3c/0x60
[93593.565700] [<ffffffff810a7b7f>] handle_edge_irq+0x6f/0x110
[93593.565705] [<ffffffff81003c6d>] handle_irq+0x1d/0x30
[93593.565709] [<ffffffff81003b65>] do_IRQ+0x55/0xd0
[93593.565714] [<ffffffff815fdb27>] common_interrupt+0x67/0x67
[93593.565715] <EOI> [<ffffffff8107d1ed>] ? __hrtimer_start_range_ns+0x1bd/0x3b0
[93593.565728] [<ffffffff81383894>] ? acpi_idle_enter_c1+0xaa/0xcf
[93593.565731] [<ffffffff81383873>] ? acpi_idle_enter_c1+0x89/0xcf
[93593.565735] [<ffffffff814a1719>] cpuidle_enter+0x19/0x20
[93593.565738] [<ffffffff814a1a98>] cpuidle_idle_call+0x88/0x100
[93593.565750] [<ffffffff8100a39f>] cpu_idle+0x5f/0xd0
[93593.565752] [<ffffffff815e9a24>] rest_init+0x68/0x74
[93593.565755] [<ffffffff81a8ca86>] start_kernel+0x2a8/0x2b5
[93593.565756] [<ffffffff81a8c5dd>] ? repair_env_string+0x5e/0x5e
[93593.565758] [<ffffffff81a8c2fd>] x86_64_start_reservations+0x101/0x105
[93593.565759] [<ffffffff81a8c3d9>] x86_64_start_kernel+0xd8/0xdc
[93593.565760] handlers:
[93593.565762] [<ffffffff8140c330>] e1000_msix_other
[93593.565763] Disabling IRQ #44
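Since the interface simply goes quiet when this happens, it can help to scan the kernel log for these two messages. The following is a minimal Python sketch that matches only the message formats shown in the trace above; the function names are mine:

```python
import re

# Patterns taken from the two kernel messages in the trace above.
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
DISABLING = re.compile(r"Disabling IRQ #(\d+)")

def unhandled_irqs(kernel_log):
    """Return the sorted IRQ numbers reported as 'nobody cared'."""
    return sorted({int(n) for n in NOBODY_CARED.findall(kernel_log)})

def disabled_irqs(kernel_log):
    """Return the sorted IRQ numbers the kernel reported as disabled."""
    return sorted({int(n) for n in DISABLING.findall(kernel_log)})

sample = """\
[93593.565667] irq 44: nobody cared (try booting with the "irqpoll" option)
[93593.565760] handlers:
[93593.565762] [<ffffffff8140c330>] e1000_msix_other
[93593.565763] Disabling IRQ #44
"""
print(unhandled_irqs(sample), disabled_irqs(sample))
```

In practice the text would be the saved output of dmesg rather than the sample string.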

3. Problems with PCI-e cards

Summary: don't expect to use all four PCI-e slots if the cards in them draw a lot of power

I could take pictures or a video of the entire process, but it would be very long. In short, if you use too many boards that use a lot of power, some of them will not show up in Linux or the BIOS. For example, I had my 4-port NIC and a PCI-e x1 video card installed; when I installed a HighPoint USB 3.0 card (x4), the network card stopped working and none of its interfaces were shown. Even worse, when I had an eSATA PCI-e card, Linux would see it, but after I plugged in some docks and rebooted, the board was no longer shown in lspci. My final workaround, shown in the configurations below, is to only use two boards in these slots.

There are also a few other issues with this board: 1) If you plug in too many PCI-e devices, it will drop some of the boards, i.e. they will not show up if they use too much power (USB 3.0 boards, etc.).

BAD CONFIG:

Bottom slot => Video
Slot next one up => empty
Slot next one up => NIC
Slot next one up => empty

The NIC doesn't show up in Linux. I put it in the first slot (near the CPU) and it worked.

GOOD CONFIG:

Bottom slot => Video
Slot next one up => empty
Slot next one up => empty
Slot next one up => NIC
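One way to spot a dropped card is to save the lspci output before and after adding a board and compare the two. This is a minimal Python sketch, assuming the default lspci format of a bus address followed by a description. The device lines in the sample are made up for illustration and are not from my machine:

```python
def pci_devices(lspci_output):
    """Map PCI address -> description from plain `lspci` output."""
    devices = {}
    for line in lspci_output.splitlines():
        address, _, description = line.strip().partition(" ")
        if address:
            devices[address] = description
    return devices

def missing_devices(before, after):
    """Devices present in `before` but absent from `after` (both lspci output)."""
    old, new = pci_devices(before), pci_devices(after)
    return {addr: desc for addr, desc in old.items() if addr not in new}

# Hypothetical lspci lines, for illustration only.
before = (
    "02:00.0 Ethernet controller: Example NIC (hypothetical)\n"
    "03:00.0 VGA compatible controller: Example video card (hypothetical)\n"
)
after = "03:00.0 VGA compatible controller: Example video card (hypothetical)\n"
print(missing_devices(before, after))
```

Note that a card moved to a different slot may get a different bus address, so this comparison is only meaningful when the existing cards stay where they are.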

4. Clock drift issues

Summary: expect some strangeness if you use gpsd with a GPS to help sync your time, due to what SM (Supermicro) noted below

http://lists.ntp.org/pipermail/pool/2012-July/006019.html

X9SCM-F w/GPS:

$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.28.0    .GPS.            0 l   13   16  377    0.000  -76.927   2.471
-64.73.32.134    198.30.92.2      2 u   17   64  377   50.115  -16.339   2.352
+216.129.110.30  69.36.224.15     2 u   17   64  377   48.916  -31.149   3.928
+4.53.160.74     209.81.9.7       2 u   21   64  377   77.023  -20.115   4.624
-169.229.70.183  169.229.128.214  3 u   32   64  377   93.048  -15.348   3.198

X8DTH-6F (no GPS in this example, but when I used to run the GPS on the X8DTH-F I do not recall seeing an offset like the one shown above; maybe a USB/Linux chipset issue as well?):

$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+38.109.218.175  192.43.244.18    2 u   25   64    1   57.070   -2.652   2.037
-96.44.157.90    173.244.211.10   3 u   24   64    1   82.532   -9.315   0.635
*205.196.146.72  128.59.39.48     2 u   23   64    1   39.337   -5.478   0.736
+169.229.70.95   128.32.206.55    2 u   22   64    1   84.710   -7.677   0.703
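To compare the offsets in the two listings without reading them by eye, the ntpq output can be parsed and the currently selected peer (the row marked with `*`) pulled out. This is a minimal Python sketch, assuming numeric IPv4 output from `ntpq -pn` with the ten columns shown above; the function and key names are my own:

```python
def parse_ntpq_peers(ntpq_output):
    """Parse `ntpq -pn` output into a list of dicts, one per peer row."""
    peers = []
    for line in ntpq_output.splitlines():
        fields = line.split()
        # Peer rows have 10 columns; skip the header, the ==== rule and blanks.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        remote = fields[0]
        # With numeric IPv4 output, a leading tally character is never a digit.
        tally = remote[0] if remote[0] in "*+-#xo." else ""
        peers.append({
            "tally": tally,
            "remote": remote[len(tally):],
            "refid": fields[1],
            "offset_ms": float(fields[8]),
            "jitter_ms": float(fields[9]),
        })
    return peers

def selected_peer(peers):
    """Return the peer marked '*' (the current sync source), or None."""
    for peer in peers:
        if peer["tally"] == "*":
            return peer
    return None

# Two rows from the X9SCM-F listing above.
sample = """\
remote refid st t when poll reach delay offset jitter
==============================================================================
*127.127.28.0 .GPS. 0 l 13 16 377 0.000 -76.927 2.471
-64.73.32.134 198.30.92.2 2 u 17 64 377 50.115 -16.339 2.352
"""
peer = selected_peer(parse_ntpq_peers(sample))
print(peer["remote"], peer["offset_ms"])
```

For the X9SCM-F listing the selected peer is the GPS reference clock at -76.927 ms, while on the X8DTH-6F the selected peer sits at -5.478 ms.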

SM Explanation:

"Please note the X7 have power saving stage at C3 whereas for X9 it has c7. All these add-on feature we suspect may accounted for the drift/offset as well."

All Xeons are built with internal Power Saving feature. How and when this power saving mode in the CPU activate or kick in we have no control. Especially the more advanced CPU they have more features and powering mode added. Hence I would prefer that you keep to the advice provided by : https://lists.ntp.org/pipermail/pool/2012-July/006025.html in your previous email.