LINPACK is Back
It’s been a while since I’ve done anything with Raiden. This is partly because I’ve been swamped with other projects and partly because it was summer and running the cluster made things uncomfortably (dangerously?) warm in the lab. Now that Makerfaire is past and winter is coming, I took some time to resume my pursuit of establishing the baseline performance of Raiden Mark I, which means round two of fighting with LINPACK.

For the uninitiated, LINPACK (specifically the High-Performance Linpack Benchmark, or HPL) is a standard method of measuring supercomputer performance. It’s what’s used by top500.org to rank the world’s fastest computers, and while it’s not an absolute measure of performance for all applications, it’s the go-to way to compare the performance of supercomputers. Since the goal of the Raiden project is to determine if traditional Intel server-based supercomputer clusters can be replaced by less resource-intensive ARM systems, the first step is to establish a baseline of performance to use for comparisons. Since HPL is the standard for measuring supercomputers, it makes sense to use it here.

The problem is that HPL isn’t the easiest thing to get running. This is partially due to the fact that most supercomputers are specialized, custom machines. It’s also a pretty specialized piece of software with a small user base, which means there just aren’t a lot of people out there sharing their experiences with it. When I first built out Raiden Mark I, I kind of assumed that HPL would be part of the Rocks cluster distribution, since benchmarking is a pretty common task when building out a supercomputer. If it was included, I wasn’t able to find it, and after spending a few hours trying to get HPL and its various dependencies to compile, all I had to show for it was a highly parallel segfault.

I’m not sure what’s changed since then, but with fresh eyes I put on the white belt and tried building HPL from scratch. After reading the included documentation and looking over my own (working) MPI programs, I was able to ask the right questions and found a tutorial that led me to successfully compiling the software. Not only that, but I was able to do a test run on a single node without errors!

[code lang="text"]
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       29184   192     2     2             692.41              2.393e+01
HPL_pdgesv() start time Fri Oct 6 12:58:50 2017
HPL_pdgesv() end time   Fri Oct 6 13:10:22 2017
[/code]
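In case it saves someone else a few hours, the general shape of getting HPL built and running looked something like the sketch below. Treat it as an outline rather than my exact commands: the HPL version, the "Linux_Cluster" arch name, and the paths are placeholders, and the Make file edits depend entirely on where your MPI and BLAS libraries live.

[code lang="text"]
# Grab and unpack HPL (version shown is just an example)
wget http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
tar xzf hpl-2.2.tar.gz
cd hpl-2.2

# Start from one of the bundled arch templates and point it at the
# local MPI and BLAS installs (MPdir, MPlib, LAdir, LAlib, CC, ...)
cp setup/Make.Linux_PII_CBLAS Make.Linux_Cluster
#  ...edit Make.Linux_Cluster to match the system...

# Build the xhpl binary; it lands in bin/Linux_Cluster along with HPL.dat
make arch=Linux_Cluster

# Single-node test run: 4 MPI processes to match the 2x2 (P x Q) grid above
cd bin/Linux_Cluster
mpirun -np 4 ./xhpl

# Multi-node runs add a machine file listing the hosts; -np must equal P x Q
# mpirun -np <PxQ> -machinefile machines ./xhpl
[/code]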
I quickly spun up the compute nodes of the cluster and modified the machines file to run the benchmark across four nodes. However, for some reason only two nodes joined the cluster, so I decided to run with only two and troubleshoot the missing nodes another time.

[code lang="text"]
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       29184   192     2     2             589.34              2.812e+01
HPL_pdgesv() start time Wed Jan 23 18:08:43 2008
HPL_pdgesv() end time   Wed Jan 23 18:18:32 2008
[/code]

The results were a bit disappointing (only about 4 Gflops faster). I would have expected something closer to twice the performance by adding two more nodes to the cluster (as well as off-loading the benchmark from the head node). Based on these results I decided to take a look at tuning the HPL.dat file and see if I could optimize the parameters for a two-node cluster instead of a single computer.

[code lang="text"]
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       41088   192     2     4            1333.90              3.467e+01
HPL_pdgesv() start time Wed Jan 23 18:25:19 2008
HPL_pdgesv() end time   Wed Jan 23 18:47:33 2008
[/code]

This made a significant difference. Not surprisingly, the benchmark responds strongly to being tuned for the hardware configuration it’s running on. I knew this mattered, but I didn’t realize how dramatic the difference would be.

I’m very excited to have reached this point in the project. There are a number of reasons I’m anxious to move on to the Mark II version of the hardware, and establishing a performance baseline for the Intel-based Mark I is a requirement for moving on to the next stage. There is still work to do: I need to get the other two nodes online, and I need to spend more time learning how to optimize the settings in HPL.dat. But these are much less mysterious problems than getting the benchmark to compile & run on the cluster.
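For my own future reference, the parameters I’ve been adjusting all live in HPL.dat, which sits next to the xhpl binary. A trimmed excerpt showing just the lines that matter for the tuned run above (the values here are the ones from that run; the rest of the file stays at the stock settings) looks roughly like this:

[code lang="text"]
1            # of problems sizes (N)
41088        Ns
1            # of NBs
192          NBs
1            # of process grids (P x Q)
2            Ps
4            Qs
[/code]

The usual rules of thumb I keep running into: size N so the problem fills most of the available RAM (something like 80% is the number that gets thrown around), pick NB based on what the BLAS library likes best, and make sure P x Q equals the number of MPI processes, with Q typically equal to or a bit larger than P. I’m treating those as starting points rather than gospel until I’ve done more runs.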