Networking¶
InfiniBand¶
InfiniBand is a high-performance, low-latency interconnect used in HPC systems. It is often beneficial to understand the InfiniBand topology before submitting openmpi jobs, as well as jobs that do a lot of I/O on Lustre. Our topology is shown in the InfiniBand topology section below.
Some nodes have QDR (40Gb/s) InfiniBand; most have FDR (56Gb/s). They are labeled with the ib_qdr and ib_fdr Slurm constraints, respectively, so you can request one or the other, as in the example below.
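For example, a constraint can be passed to sbatch or srun (the job script name here is just a placeholder):

```bash
# Request a node with FDR (56Gb/s) InfiniBand; job.sh is a placeholder
sbatch --constraint=ib_fdr job.sh

# Request a node with QDR (40Gb/s) InfiniBand for an interactive session
srun --constraint=ib_qdr --pty bash
```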
Libibverbs provides a Remote Direct Memory Access (RDMA) API that lets programs communicate over InfiniBand without directly involving the kernel on either end or the CPU on the remote end. Context switching between user space and kernel space creates a lot of overhead and unpredictable latency, which RDMA networks like InfiniBand and RDMA over Converged Ethernet (RoCE) avoid.
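To see which RDMA devices libibverbs exposes on a node, you can run the ibv_devinfo utility that ships with libibverbs (assuming the userspace tools are installed on that node):

```bash
# List the host channel adapters (HCAs) and ports that libibverbs can see,
# including each port's state (PORT_ACTIVE means the link is up)
ibv_devinfo
```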
Our openmpi module is linked against UCX, which dlopen()s a transport library linked against libibverbs when InfiniBand is present, or falls back to other communication methods such as TCP over Ethernet. This means openmpi programs will automatically make use of the InfiniBand fabric when it is available.
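If you want to confirm this yourself, the sketch below is one way to check, assuming the module is named openmpi and the UCX tools are on your PATH:

```bash
# Load the site's MPI module (the module name is an assumption -- check `module avail`)
module load openmpi

# ompi_info lists the MCA components Open MPI was built with;
# a "pml: ucx" line means the UCX point-to-point layer is available
ompi_info | grep -i ucx

# ucx_info ships with UCX and lists the transports/devices it can use
# (verbs-based transports over InfiniBand, tcp as a fallback, etc.)
ucx_info -d | grep -i transport
```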
Additionally, on nodes with InfiniBand we mount our Lustre filesystem over the InfiniBand fabric. This gives those nodes low-latency access to Lustre, along with up to 4-5GB/s of throughput per node. The Lustre filesystem as a whole can achieve about 14GB/s of aggregate throughput, shared across all nodes. On nodes that only have Ethernet, we mount Lustre over TCP.
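One way to check which network Lustre is using on a given node is to list the node's LNet identifiers (this assumes the Lustre client utilities are installed; the exact NIDs will differ):

```bash
# Print this client's LNet NIDs: an address ending in @o2ib means Lustre
# traffic goes over InfiniBand, while @tcp means it goes over Ethernet
lctl list_nids
```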
Ethernet¶
Some nodes have 10GbE, others only 1GbE. They are labeled with the eth_10g and eth_1g Slurm constraints so you can request one or the other; the sinfo example below lists each node's features.
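To see which features (constraints) each node advertises, you can ask Slurm directly:

```bash
# List every node together with its feature tags
# (%N = node names, %f = available features such as ib_fdr or eth_10g)
sinfo -o "%N %f"
```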
InfiniBand topology¶
```mermaid
flowchart TD
    subgraph Rack4
        SX6036R41 ---> MTR4C3[R4C3 M1000E switch QDR]
        SX6036R41 ---> MTR4C1[R4C1 M1000E switch QDR]
        SX6036R41 --> MTR4C1
        SX6036R42 ---> MTR4C3
        SX6036R42 ---> MTR4C1
        SX6036R42 ---> MTR4C1
        MTR4C3 --> opensm[OpenSM node 1]
        MTR4C1 --> r4c1nodes[16 nodes with QDR]
    end
    subgraph Rack3
        MTR3C2 --> opensm2[OpenSM node 2]
    end
    subgraph Rack25
        SX6036R25 --> g031
    end
    SX6036R25 ----> SX6036R29
    SX6036R25 ----> SX6036R29
    subgraph Rack29
        SX6036R29 --> r29nodes[28 nodes with FDR]
    end
    SX6036R41 ----> SX6036R1
    SX6036R42 ----> SX6036R1
    SX6036R41 --> V4036R2
    SX6036R42 --> V4036R2
    SX6036R41 --> SX6036R25
    SX6036R41 --> SX6036R25
    SX6036R42 --> SX6036R25
    SX6036R42 --> SX6036R25
    subgraph Rack1
        SX6036R1 --> r1nodes[32 nodes with FDR]
    end
    subgraph Rack2
        V4036R2[R2 Voltaire 4036 QDR] --> lustre1[lustre1 mlx4_0 and mlx4_1]
        V4036R2 --> lustre2[lustre2 mlx4_0 and mlx4_1]
    end
    SX6036R1 <--> V4036R2
    SX6036R1 <--> V4036R2
    V4036R2 ---> MTR3C2[R3C2 M1000E switch DDR]
```
This is the topology of our InfiniBand fabric. Every line you see is a link; some are fiber, some are copper. Each link between FDR switches/nodes runs at 56Gb/s, and each link to a QDR switch/node is negotiated down to 40Gb/s.
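To check what rate a particular node actually negotiated, the ibstat utility from infiniband-diags reports each port's active rate (assuming the diagnostics tools are installed on the node):

```bash
# Show the local HCA's port state and negotiated rate:
# "Rate: 56" corresponds to FDR, "Rate: 40" to QDR
ibstat
```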