
Networking

InfiniBand

InfiniBand is a high-performance, low-latency interconnect used in HPC systems. It is often helpful to understand the InfiniBand topology before submitting OpenMPI jobs, as well as jobs that will do a lot of I/O on Lustre. Our topology is shown in the diagram below.

Some nodes have QDR (40Gb/s) InfiniBand; most have FDR (56Gb/s). They are labeled with the ib_qdr and ib_fdr Slurm constraints, respectively, so you can request one or the other (for example, sbatch --constraint=ib_fdr).

Libibverbs provides a Remote Direct Memory Access (RDMA) API which allows programs to communicate over InfiniBand without directly involving the kernel or CPU on either end. Context switching between user-space and kernel-space creates a lot of overhead and unpredictable latency, which RDMA networks like InfiniBand and RDMA over Converged Ethernet (RoCE) avoid.
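As an illustration only, here is a minimal sketch of what the verbs API looks like, assuming the libibverbs development headers are installed; the file name is made up, and it would be built with something like gcc -o ibdevices ibdevices.c -libverbs. It lists the RDMA devices visible on a node and reports the state of the first port on each:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;

    /* Enumerate all RDMA-capable devices (InfiniBand HCAs, RoCE NICs). */
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devices[i]);
        if (!ctx)
            continue;

        /* Port numbers are 1-based; port 1 is queried here as an example. */
        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: port 1 is %s\n",
                   ibv_get_device_name(devices[i]),
                   ibv_port_state_str(port.state));

        ibv_close_device(ctx);
    }

    ibv_free_device_list(devices);
    return 0;
}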

Our openmpi module is linked against UCX, which dlopens a transport library linked against libibverbs, or falls back to other communication methods such as TCP over Ethernet. This means OpenMPI programs automatically make use of the InfiniBand fabric when it is available.
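In other words, application code needs nothing InfiniBand-specific. The sketch below is a hypothetical two-rank ping written against plain MPI; the names (ping.c, payload) are examples, and it would be built with the mpicc wrapper from the openmpi module:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 sends one integer to rank 1. UCX decides at runtime whether
       the message travels over InfiniBand (verbs), shared memory, or TCP. */
    if (size >= 2) {
        int payload = 42;
        if (rank == 0) {
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", payload);
        }
    }

    MPI_Finalize();
    return 0;
}

Launched across two nodes (for example with mpirun -np 2 ./ping inside a Slurm allocation), nothing in the source or the launch line mentions the transport; UCX selects it on its own.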

Additionally, we mount our Lustre filesystem over InfiniBand on nodes that have it. This gives those nodes low-latency access to Lustre, along with up to 4-5GB/s of throughput per node. The Lustre filesystem can achieve roughly 14GB/s of aggregate throughput, shared across all nodes, so as few as three or four nodes doing full-rate I/O can saturate it. On nodes that only have Ethernet, we mount Lustre over TCP.

Ethernet

Some nodes have 10GbE, some only 1GbE. They are labeled with the eth_10g and eth_1g Slurm constraints to allow you to request one or the other.

InfiniBand topology

flowchart TD

    subgraph Rack4
        SX6036R41 ---> MTR4C3[R4C3 M1000E switch QDR]
        SX6036R41 ---> MTR4C1[R4C1 M1000E switch QDR]
        SX6036R41 --> MTR4C1

        SX6036R42 ---> MTR4C3[R4C3 M1000E switch QDR]
        SX6036R42 ---> MTR4C1[R4C1 M1000E switch QDR]
        SX6036R42 ---> MTR4C1
        MTR4C3 --> opensm[OpenSM node 1]
        MTR4C1 --> r4c1nodes[16 nodes with QDR]

    end

    subgraph Rack3 
        MTR3C2 --> opensm2[OpenSM node 2]
    end

    subgraph Rack25  
        SX6036R25 --> g031
    end

    SX6036R25 ----> SX6036R29
    SX6036R25 ----> SX6036R29

    subgraph Rack29 
        SX6036R29 --> r29nodes[28 nodes with FDR]
    end

    SX6036R41 ----> SX6036R1
    SX6036R42 ----> SX6036R1

    SX6036R41 --> V4036R2
    SX6036R42 --> V4036R2
    SX6036R41 --> SX6036R25
    SX6036R41 --> SX6036R25
    SX6036R42 --> SX6036R25
    SX6036R42 --> SX6036R25


    subgraph Rack1
        SX6036R1 --> r1nodes[32 nodes with FDR]
    end

    subgraph Rack2
        V4036R2[R2 Voltaire 4036 QDR] --> lustre1[lustre1 mlx4_0 and mlx4_1]
        V4036R2 --> lustre2[lustre2 mlx4_0 and mlx4_1]
    end
    SX6036R1 <--> V4036R2
    SX6036R1 <--> V4036R2

    V4036R2 ---> MTR3C2[R3C2 M1000E switch DDR]

This is the topology of our InfiniBand fabric. Every line you see is a link. Some are fiber, some are copper. Each link between FDR switches/nodes is 56Gb/s, and each link to a QDR switch/node is negotiated down to 40Gb/s.