SolidRun HoneyComb

2022, Jan 30

While ARM-based single board computers (SBCs) have been around for many years and grew in popularity with the Raspberry Pi, ARM-based servers have had a mixed reception that has left their growth in the enterprise space significantly stunted. Amazon and Google now have their own ARM-based cloud offerings (no doubt to increase profit margins through lower CPU costs and power savings), yet on-premises uptake is still slow. While trying to reduce my own power usage (and as I like tinkering with technology) I decided to purchase an ARM server, well, technically two.


The Product

There are a few manufacturers that make ARM-based server equipment, and SolidRun (the one I chose to buy from) is among them. They have multiple servers available (with different specifications and at different price-points), however the one that interested me was the HoneyComb (aka the LX2160). Available as either a mini-ITX board or as a 1U rack-mount that can hold two systems, it definitely had appeal. Each board comes with 16 real CPU cores, up to 64GB of DDR4 ECC RAM, up to 100Gb of Ethernet (depending on how you split it), an M.2 NVMe slot (x4), a PCIe x16 slot (that supports x8 bifurcation), multiple SATA-3 ports, and two USB-3 ports. Combine this with being able to run Linux or the VMware ESXi fling and it looks good from many viewpoints.


Ordering

Getting hold of the equipment was more difficult than it needed to be, but not too bad overall. The rack-mount version had been advertised for some time, however it wasn't listed on their web store. A quick email confirmed that they do indeed sell the rack-mount version, you just have to order it via email. A few emails later (as I wanted to know more about the power supplies they would ship with) I had my order placed and was waiting in anticipation. What was somewhat disappointing was that the two server boards arrived but the chassis was nowhere to be seen. Another email later and it transpired they had forgotten to send the case (and the replacement heatsinks needed for the 1U height). It took a few more days, but finally I had all of the required parts and was ready to start putting them together.

Chassis
The elusive chassis (that wasn't sold in their online store)

Building

While waiting for the chassis to arrive I decided to test the systems, as they use a standard ATX PSU connector. One workbench later and the two boards were connected and ready to go, almost... Unlike a conventional x86 PC, which has its BIOS/UEFI ready to go, these systems ship as a blank slate. For most users these devices perform their initial bootstrap from an SD card, allowing you to rapidly switch between versions of u-boot or UEFI as you desire. Due to another quirk of ARM hardware you also have to use a build that matches the speed of the RAM you have fitted, the maximum frequency you want the CPU to run at, whether you want PCIe bifurcation, and how you want the potential 100Gb network links (4x SFP+) configured.

Worktop 1
A lack of case doesn't prevent them being tested

Worktop 2
PCIe clearance isn't an issue (yet)

As you can see, this isn't like booting up a PC (or an x86 server for that matter) and simply pressing the delete key. That said, SolidRun have precompiled versions ready for you to use: you simply write one to an SD card and you are ready to go. I did get tripped up by initially using the u-boot version (I prefer the UEFI purely as it's easier to configure), however it was an easy mistake to make. What I did spot (which unfortunately shows a rough edge) is that the UEFI showed only 32GB of RAM, which did leave me questioning whether I had a dead stick until I could get the system booted.
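For reference, getting to that first boot really is just a case of writing one of the precompiled images to an SD card. A minimal sketch is below; the image filename is purely illustrative, so substitute whichever build matches your RAM speed and SerDes configuration.

    # Identify the SD card first (device name below is an example only!)
    lsblk
    # Write the precompiled firmware image to the card
    sudo dd if=lx2160a-uefi-example.img of=/dev/sdX bs=4M conv=fsync status=progress
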
Sled 1
One of the chassis sleds, with unfortunately long power cables

I added NVMe storage to mine as I prefer this over SATA and wanted something that wouldn't be a bottleneck for the CPU/RAM. After reading some reviews I went for the WD Black SN750 2TB drives, complete with the additional heatsink to help keep them cool. There are plenty of different choices available and in theory all should be compatible (provided they are the correct length), but for me these worked out as a good choice. The RAM was 64GB of Samsung ECC DDR4 SODIMMs, something that turned out to be problematic to source. Another email was sent to SolidRun as all of the ECC RAM on their compatibility list was either out of stock everywhere or discontinued. Thankfully I managed to find a supplier that had stock of the compatible sets, so each server would be fully populated.

One thing I did switch out (and was glad to, given the location of the chassis) was the provided fans. I swapped the two 40mm fans for two Noctua equivalents, which while providing slightly less airflow are significantly quieter. After testing I also wired them directly to the 12v rail instead of the fan header, primarily because the fan curve on the system isn't great: the slightest system load sends the fans to 100% before they slow back down after around 10 seconds. Given how little noise they make it was easier to fix them at full speed, which should ensure the cooling is better as well.
Sled 2
Drives fitted and cables tidied (somewhat)


ESXi

As I am still a fan of virtualisation I decided to give the ESXi-on-ARM fling a try, with the hope that it would perform well and allow me to migrate my systems away from x86. To be clear I have no problem with using x86 systems (and have written at least one article about this), however having 16 real cores with similar power usage to that of 6 (and with the same amount of RAM) is a nice thought. Documentation exists for getting ESXi installed on the server, and for the most part it went really well.

The stumbling block (as most have found) is that the onboard adapters (both the 100Gb and the 1Gb) aren't supported by ESXi and are in fact unusable. While some have gone the route of USB3 Ethernet adapters, I personally dislike those as even the ones with Intel chips inside (rare, it seems) aren't always reliable, sometimes because USB is USB, and sometimes due to overheating. Through some magic with cables and an internal PCIe extension that I am not proud of, I fitted an Intel network adapter with the cable routed to the front. It did take two attempts, as I found that while an older quad-port Intel NIC would work, a single-port Intel I350 would not. If I had to guess it's something to do with it not being correctly initialised by the UEFI at power-on, but in any case I purchased an older single-port Intel NIC and it worked without issue.
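If you try something similar, it's worth confirming from the ESXi shell that the hypervisor has actually claimed the card before going any further. Something along these lines (a quick check, nothing clever) will do:

    # List the NICs ESXi has bound a driver to - the onboard ports won't appear here
    esxcli network nic list
    # Confirm the Intel adapter is at least visible on the PCI bus
    esxcli hardware pci list | grep -i ethernet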

Sled 3
Not pretty, but does provide an ESXi-compatible NIC

With ESXi installed and the network working I was finally able to upload a few ISOs for various Linux distributions and start building VMs. To my surprise this part worked without issue. After about 30 minutes I had different versions of Ubuntu booting happily and proving stable under load. I also tested Debian, which worked just as well, though that's no real surprise given the similarity between the two. After the system had been load-tested for a solid 24 hours I started building my core virtual machines and migrating my configurations across. Even at the time of writing these VMs haven't missed a beat, and the hypervisor hasn't run into any noticeable issues.


Linux

As ESXi was to be installed on one of the systems, I needed Linux for the other as it would be running a heavy workload that requires all of the available CPU cores (no room for virtualisation overhead). This is where the challenges of ARM hardware really started to surface, and the ongoing struggle manufacturers face in getting their hardware properly supported by Linux distributions showed all too painfully. My first few attempts at installing Linux failed due to a mix of hardware incompatibilities in the Ubuntu installer and general bugginess with cloud-init. After 20.04 had wasted more of my time than I care to admit, I switched to the latest 21.04 to see if that would behave any better. The installer progressed further, however the onboard 1Gb port wouldn't function and PCIe wasn't working, resulting in a deployment I couldn't access. A USB3 network adapter came to the rescue and allowed system updates to take place, however that still didn't solve the underlying issue.

After connecting with a member of the SolidRun team via Discord (jnettlet, if you ever read this, thank you for the support you give everyone, including myself!), I came to understand the issues better and what was needed to get things working. With a much newer kernel version and some kernel command-line tweaks my 1Gb port was now alive (as was my larger PCIe slot). I learned afterwards that you can use the pre-built kernel package from SolidRun, which contains all of the tweaks and makes things significantly easier, the more you know :-)
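For anyone following the same path, the mechanics are simple even if the exact values came from Discord. The sketch below shows the general shape of it; the package filename and the kernel parameter are illustrative assumptions rather than a verified list for this board, so take the real ones from SolidRun's documentation or Discord.

    # Install a newer/vendor kernel package (filename is an example only)
    sudo dpkg -i linux-image-solidrun-example_arm64.deb

    # Add the suggested options to the kernel command line via a GRUB drop-in
    # (the parameter shown is an example of the kind of tweak involved)
    echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT arm-smmu.disable_bypass=0"' | sudo tee /etc/default/grub.d/honeycomb.cfg
    sudo update-grub && sudo reboot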

With Linux now installed and network connectivity (at 1Gb/s speeds) working, the system was load-tested to ensure it wouldn't break under stress. Even with the reduced airflow from the Noctua fan swap the thermals were great (double-checked with a thermal camera) and the system was stable. I was happy that I could start experimenting with the system to see if it could cover some additional use-cases.
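My load testing was nothing exotic; something like the sketch below (stress-ng is simply my assumed tool of choice here, any burn-in tool will do) is enough to keep all 16 cores busy while you watch the thermals.

    # Burn all 16 cores for an extended period while keeping an eye on temperatures
    sudo apt install -y stress-ng
    stress-ng --cpu 16 --metrics-brief --timeout 24h &
    # Poll the SoC thermal zones every few seconds
    watch -n 5 'cat /sys/class/thermal/thermal_zone*/temp'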


Timekeeping

Before I could get started with more complex usage I encountered an odd issue with timekeeping under ESXi. Each time I rebooted the platform the system clock would reset, which would throw the guest systems into a bad state. ESXi doesn't like to synchronise against an NTP server when the reference clock is more than 45 seconds out, so being multiple years out doesn't help matters.
As it transpired, I wasn't the only person to encounter this, and it wasn't specific to ESXi. A quick check with the voltmeter showed that one of the CR2032 batteries was running low and needed to be replaced. Thankfully the issue went away after changing the battery, which was needed for ESXi to auto-start systems without them breaking due to major clock drift.
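On a Linux install the same symptom is easy to spot by comparing the battery-backed hardware clock against the (NTP-synced) system clock; this is just a quick diagnostic rather than anything HoneyComb-specific.

    # A large or growing gap between these after a cold boot points at a flat CR2032
    sudo hwclock --show
    date
    timedatectl status      # also shows whether NTP synchronisation is active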


SATA

Attempting to add SATA drives to the Linux system proved to be a painful experience, and sadly at the time of writing it's one that still isn't fully resolved. Upon connecting my Samsung 840 EVO SSDs I was greeted by many link errors that made the drives unreliable. I had seen comments on Discord regarding both cable length and cable quality, and was somewhat dismayed given I had tried three different lengths from three different vendors and none of them worked robustly. I also tried reseating the compute module on top of the mini-ITX motherboard, but that didn't help matters either.

After purchasing some high-quality German SATA cables I still had the same issue, so I decided to try other (non-SSD) drives. Even with older Samsung Spinpoint drives the issue persisted and they weren't reliable (too many link resets under minimal load). In the end it was a follow-on conversation over Discord that revealed NXP hadn't fixed the SATA calibration within their hardware SDK, resulting in issues with certain combinations of cables and drives.

The solution (at least for me) was to force SATA-2 speeds on the link using the 'libata.force=3.0Gbps' kernel boot option, which while resulting in lower throughput does provide a stable connection. This hardware issue is sadly a similar story to why PCIe Gen-4 support isn't present: the v1 silicon couldn't reach those speeds reliably, hence only Gen-3 is available.
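Applying and verifying the cap is quick. The snippet below assumes the same GRUB drop-in approach as earlier; the useful part is confirming from dmesg that the links really do negotiate at 3.0 Gbps.

    # Cap all SATA links at SATA-2 speeds via the kernel command line
    echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT libata.force=3.0Gbps"' | sudo tee /etc/default/grub.d/sata-cap.cfg
    sudo update-grub && sudo reboot

    # After the reboot, confirm the negotiated link speed
    sudo dmesg | grep -i 'SATA link up'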


Custom Firmware / UEFI

With the SATA issues out of the way it was time to take a look at the firmware build process to see how it all pieces together. Another reason for doing this, admittedly, was overclocking. While I am typically against the latter (I did it many times in my youth and destroyed CPUs as a result), the CPU used in the LX2160 was originally clocked at 2GHz rather than the 1.8GHz used by the SolidRun-provided UEFI.

The build process for the UEFI is handled by a single script that, after installing a few dependencies, built a new version of the firmware complete with the 2GHz CPU speed. I've been running it (at the time of writing) for over a month now and haven't had any issues with system stability. It is worth noting that you need an x86 system to build the custom firmware (at least at the moment), so keep this in mind if you try to build it on the device itself.
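For the curious, the flow is roughly as follows. Treat it as a from-memory sketch: the repository name and the environment variables are my assumptions about the knobs involved, so check SolidRun's README for the authoritative list before building anything.

    # Rough shape of the firmware build (variable names are illustrative)
    git clone https://github.com/SolidRun/lx2160a_build.git
    cd lx2160a_build
    # RAM speed, SerDes layout and CPU clock are selected via environment
    # variables passed to the build script (exact names per the README)
    DDR_SPEED=3200 CPU_SPEED=2000 ./runme.sh
    # The resulting image is then written to an SD card as before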


10Gb Ethernet

Getting 10Gb Ethernet to work did prove somewhat of a challenge, and even now it's still somewhat quirky for me. First, you need additional tooling installed so you can 'create' a network interface (think of it as virtual ports on a switch). Thankfully this is now packaged in deb format so you can install it easily (see the Discord channel). With this in place, and a systemd service file that creates the interfaces for you on boot, you get consistency in which port to use when you reboot.
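As a rough illustration of what that boils down to (the dpmac index, paths and unit name are assumptions based on my SerDes layout, with ls-addni coming from the DPAA2 restool packaging), it is essentially attaching a network interface to one of the SoC's MACs:

    # One-off: attach a network interface to one of the 10Gb MACs
    # (dpmac.7 is an example index - it depends on your SerDes configuration)
    sudo ls-addni dpmac.7

The service file is then little more than a oneshot wrapper around the same command, something along these lines:

    # /etc/systemd/system/dpaa2-ni.service - recreate the interface on every boot
    [Unit]
    Description=Create DPAA2 network interface on dpmac.7

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/ls-addni dpmac.7
    RemainAfterExit=yes

    [Install]
    WantedBy=multi-user.target

Enable it with 'systemctl enable dpaa2-ni.service' and the interface should be created on every boot.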

Unfortunately for me that is where the consistency ends and a large quirk begins. While the SFP I use is compatible and works without issue, the virtual port becomes stuck after a reboot. Specifically, the virtual interface is created on the correct physical port, however no traffic makes its way through the SFP to the networking stack. This becomes frustrating, as ethtool shows the interface is definitely up and registered at 10Gb/s (and the switch confirms the same), yet tcpdump will show you nothing. In the end, a disconnect/reconnect of the SFP and a reboot of the OS is required for it to burst into life. Hopefully this gets fixed in a later OS update, as it is frustrating when you frequently patch/reboot your systems.
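For completeness, this is the kind of check I mean; the interface name is just an example, as it depends on the order in which the interfaces are created.

    # The link layer says everything is fine...
    ethtool eth1 | grep -E 'Speed|Link detected'
    # ...yet nothing ever arrives at the networking stack
    sudo tcpdump -ni eth1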


Power Usage

One of the goals of purchasing and implementing these servers was to reduce my overall power usage while providing more stability/performance than a collection of Raspberry Pi boards. With this goal in mind I am impressed: despite the power supplies not being the most efficient, the systems don't use much power considering the workload I run on both. Comparing them to my older Ivy Bridge systems seems unfair given the improvements in technology, however comparing them to my NUCs shows that they do use less power while providing more performance (with the same amount of RAM). Obviously this is workload-specific (and my NUCs do get worked hard), but my power monitoring has shown a nice drop overall as I have migrated workloads over.


Performance

No real surprise here in that 16 real cores provide great performance, whether running a mix of virtual machines or a workload that keeps all 16 cores at 100% around 80% of the time. Performance of the NVMe storage is about as fast as you would expect, and the RAM is reasonably fast given the SODIMM form factor and ECC support. Obviously the hit of running SATA devices at SATA-2 speeds is a disappointment, however for my use-case it's enough speed to not cause any impact. I've not pushed the networking to 100Gb (as I don't have the infrastructure to do so), but I've seen a few comments on Discord regarding it being worked hard, so I don't doubt it's capable. With the replacement fans the devices hum away in the background, barely noticeable to most.


Future

While I am glad that I purchased these and they have met my goal (and are now running my permanent workloads), I will say that they aren't for those who expect them to work like x86 systems, or for those who don't want to invest time in debugging and troubleshooting. My experience is likely indicative of why on-premises uptake of these systems is so low, as the technical challenges IT admins would face with them outweigh the potential benefits. I can also see why Amazon and Google have created their own variants that are under their own control, and why they support them with their own OS variants (Amazon Linux, for example). As someone who runs cloud instances on Graviton, I can see why companies choose them, given that the headaches I faced are removed.

I have no doubt desktop/workstation/server computing in the ARM space will grow over time, and I suspect my HoneyComb devices will also mature from a compatibility/support perspective. In truth I look forward to seeing what SolidRun release next, especially if it's something with ARMv9 ISA support :-)