Useful Tips

Networks for supercomputers


Distributed computing is one way to contribute to some interesting projects. When your computer is idle, share its power with the SETI project, which is looking for extraterrestrial civilizations. In this case, your computer will analyze satellite data and information received from telescopes.

This article will help you join projects (such as SETI) involving distributed computing. The article also introduces you to BOINC, a distributed computing software.

You need a computer. If you already have one, go to the Sources and Links section and install the BOINC software. If you are not interested in the SETI project, below you will find a list of other projects.

If a lot of money

Separately, we note the extremely expensive, but productive line of processors on the Intel Xeon LGA1567 socket.
The top processor in this series is the E7-8870 with ten 2.4 GHz cores. Its price is $ 4616. For such CPUs, HP and Supermicro are releasing! eight-processor! server chassis. Eight 10-core Xeon E7-8870 2.4 GHz processors with HyperThreading support 8 * 10 * 2 = 160 threads, which is displayed in Windows Task Manager as one hundred and sixty graphs of processor loading, matrix 10x16.

In order for eight processors to fit in the case, they are not immediately placed on the motherboard, but on separate boards that stick into the motherboard. The photo shows four motherboards with processors installed in the motherboard (two on each). This is a Supermicro solution. In the HP solution, each processor has its own board. The cost of an HP solution is two to three million, depending on the number of processors, memory, and more. The Supermicro chassis costs $ 10,000, which is more attractive. In addition, Supermicro can put four coprocessor expansion cards in the PCI-Express x16 ports (by the way, there will still be room for an Infiniband adapter to assemble a cluster of these), but only two in HP. Thus, to create a supercomputer, an eight-processor platform from Supermicro is more attractive. The following photo from the exhibition shows the complete supercomputer with four GPU boards.

However, it is very expensive.

Communication networks

The effectiveness of a supercomputer in many applications is largely determined by the profile of working with memory and network. The profile of working with memory is usually described by the spatio-temporal localization of calls - by the size of calls and the scatter of their addresses, and the profile of working with the network is described by the distribution of nodes with which messages are exchanged, the exchange rate and message sizes.

The performance of a supercomputer on tasks with intensive data exchange between nodes (modeling problems, problems on graphs and irregular grids, calculations using sparse matrices) is mainly determined by the network performance, so the use of conventional commercial solutions (for example, Gigabit Ethernet) is extremely inefficient. However, a real network is always a compromise solution, in the development of which priorities are set between price, performance, energy consumption and other requirements that are largely conflicting: attempts to improve one characteristic can lead to a deterioration of the other.

A communication network consists of nodes, each of which has a network adapter connected to one or more routers, which in turn are interconnected by high-speed communication channels (links).

Fig. 1. Topology 4D-torus (3x3x3x3)

The network structure, which determines how exactly the nodes of the system are interconnected, is determined by the network topology (usually a lattice, a torus or a thick tree) and a set of structural parameters: the number of measurements, the number of tree levels, the dimensions of the torus sides, the number of switches at the tree levels, the number of network nodes ports on routers, etc. Figure 1 shows an example of the topology of a four-dimensional torus 3x3x3x3.

The architecture of the router determines the structure and functionality of the blocks responsible for the transfer of data between network nodes, as well as the necessary properties of the protocols of the channel, network, and transport layers, including routing, arbitration, and data flow control algorithms. The architecture of the network adapter determines the structure and functionality of the blocks responsible for the interaction between the processor, memory and the network, in particular, MPI operations are supported at this level, RDMA (Remote Direct Memory Access - direct access to the memory of another node without the participation of its processor), confirmations of receipt by another node of the packet, handling of exceptional situations, aggregation of packets.

To assess the performance of a communication network, three characteristics are most often used: bandwidth (amount of data transferred per unit of time), communication delay (data transfer time over the network), message pace (usually, they separately consider the rate of delivery when sending, receiving and transmitting packets between the internal units of the router).

For completeness, these characteristics are measured on different types of traffic, for example, when one node sends data to all the others, or, conversely, all nodes send data to one, or when all nodes send data to random destinations. Functionality requirements are imposed on modern networks:

  • effective implementation of the Shmem library, as an option to support the one-way communication model, and GASNet, on which the implementation of many PGAS languages ​​is based,
  • efficient implementation of MPI (usually this requires effective support of the mechanism of ring buffers and acknowledgments for received packets),
  • effective support for collective operations: broadcasting (sending the same data simultaneously to many nodes), reduction (applying a binary operation, for example addition, to the set of values ​​received from different nodes), distributing array elements over the set of nodes (scatter), assembling an array of elements, located at different nodes (gather),
  • effective support for inter-node synchronization operations (at least barrier synchronization), effective interaction with a network of a large number of processes on a node, and ensuring reliable packet delivery.

Effective support of the adapter’s work with the host’s memory directly without processor involvement is also important.

Foreign high-speed networks

All communication networks can be divided into two classes: commercial and custom, developed as part of computer systems and available only with them. Among commercial networks, the market is divided between InfiniBand and Ethernet - in the Top500 list (June 2011), 42% of systems use InfiniBand and 45% use Gigabit Ethernet. At the same time, if InfiniBand is focused on the segment of high-performance systems designed for complex computing tasks with a large number of communications, then Ethernet traditionally occupies a niche where data exchange between nodes is uncritical. In supercomputers, the Ethernet network, due to its low cost and availability, is often used as an auxiliary service network in order to reduce the interference of control traffic and task traffic.

The Inifiniband network was initially focused on configurations with the Fat tree topology, but the latest versions of switches and routers (primarily manufactured by QLogic) support the multidimensional torus topology (using the Torus-2QoS Routing Engine), as well as a hybrid topology from the 3D torus and Fat tree. The Sandia RedSky supercomputer, assembled at the beginning of 2010 and now in 16th place in the Top500, is one of the first large-scale projects with the InfiniBand network and topology 3D torus (6x6x8). Also, much attention is now paid to the effective support of RDMA operations and the Shmem library (in particular, Qlogic Shmem).

The popularity of InfiniBand is due to its relatively low cost, developed ecosystem of software and effective support for MPI. However, InfiniBand has its drawbacks: a low rate of message delivery (40 million messages per second in the latest solutions from Mellanox), low transmission efficiency of short packets, a relatively large delay (more than 1.5 μs for transmissions node-to-node and an additional 0.1- 0.5 μs per transit node), weak support for the toroidal topology. In general, it can be argued that InfiniBand is a product for the mass user, and during its development a compromise was made between efficiency and versatility.

We can also note the network Extoll, which is being prepared for launching on the market - the development of the University of Heidelberg under the leadership of Professor Ulrich Bruening. The main emphasis in the development of this network is to minimize delays and increase the rate of delivery in one-way communications. It is planned that Extoll will have a 3D torus topology and use optical links with a bandwidth of 10 Gb / s per lane (serial data transmission channel within the link) and a width of 12 lanes per link. Now there are prototypes of the Extoll network on FPGA: R1 - based on Virtex4, R2 Ventoux - a two-node layout based on Virtex6. One-way bandwidth per link is 600 MB / s (for R1). Two interfaces (HyperTransport 3.0 and PCI Express gen3) with a processor will also be supported, which will allow integrating this network into Intel and AMD platforms. Extoll supports several ways of organizing one-way records, its own MMU (Memory Management Unit, a block of translation of virtual addresses into physical addresses) and atomic operations.

Unlike commercial networks, custom networks occupy a much smaller market share, however they are used in the most powerful supercomputers from Cray, IBM, SGI, Fujitsu, NEC and Bull. When designing custom networks, developers have more freedom and try to use more progressive approaches due to the lower importance of the market attractiveness of the final product, solving primarily the problem of obtaining maximum performance on a specific class of tasks.

The K Computer supercomputer uses a proprietary Tofu (TOrus FUsion) communication network, which is a scalable 3D torus whose nodes contain groups of 12 nodes (groups of nodes are connected by 12 networks with a 3D torus, and each node from this group has its own output 3D torus network). The nodes within each group are interconnected by a 3D torus with sides 2x3x4 without duplicate links, which is equivalent to a 2D torus with sides 3x4 (so we get a 5D torus with fixed two dimensions). Thus, the Tofu network node has 10 links with one-way throughput of 40 Gb / s each. Barrier synchronization of nodes and reduction (integer and floating point) are supported in hardware.

The main goals in the development of the Tianhe-1A supercomputer were to achieve high energy efficiency, to develop their own processor and network superior to InfiniBand QDR. The supercomputer consists of 7168 computing nodes connected by the Arch network of its own design with the thick tree topology. The network is built from 16-port routers, one-way link bandwidth - 8 GB / s, delay - 1.57 μs. RDMA operations supported and collective operations optimized.

Classical representatives of systems using a toroidal topology to combine computational nodes are systems for the IBM Blue Gene series, in the first two generations of which - Blue Gene / L (2004) and Blue Gene / P (2007) - used the 3D torus topology. The network in Blue Gene / P has relatively weak links with a single-sided bandwidth of 0.425 GB / s, which is an order of magnitude lower than the bandwidth of its contemporary InfiniBand QDR link, however, hardware-based support for barrier synchronization and collective operations (on separate tree-like networks) allows for good scalability on real applications. In addition, all interfaces and routing units are integrated into the BPC microprocessor (Blue Gene / P Chip), which significantly reduces message transmission delays. The next generation communication network Blue Gene / Q has a 5D-tor topology, and unlike its predecessors, it does not have separate networks for barrier synchronization and collective operations. The Blue Gene / Q chip for the first time became multi-core-multi-thread - four hardware threads per core with 16 cores, which allows weakening network requirements and ensuring delay tolerance. Link throughput has been increased to 2 GB / s, but still remains small compared to Cray Gemini or Extoll. The low throughput in these systems is leveled by the large dimension of the torus (a large number of links) and, as a result, by the small diameter of the network (significantly smaller than that of networks with a 3D torus topology with the same number of nodes). Available sources report the creation of two Blue Gene / Q transpetaflops supercomputers: Sequoia with a performance of 20 PFLOPS and Mira - 10 PFLOPS. We can conclude that Blue Gene / Q is focused on tasks that will use tens and hundreds of thousands of computing nodes with network traffic of the “all to all” type.

Another adherent of the approach to building communication networks with a toroidal topology is Cray, which continues to use the 3D tor topology, while increasing throughput and the number of links connecting neighboring nodes. The current generation of the Cray toroidal network is the Cray Gemini network. One Gemini router corresponds to two routers of the previous SeaStar2 + generation, that is, actually to two network nodes, therefore in Gemini instead of 6 links 10 are used to connect to neighboring nodes (2 serve to connect two adapters to each other).

The components (network adapters, switches, routers) of a network for a supercomputer, unlike processors, are often more expensive, and access to them is more limited. For example, now the switches for the InfiniBand network, which is the main commercial network for supercomputers, are produced by only two companies, both of which are controlled by the United States. This means that in the absence of their own developments in the field of high-speed networks, the creation of modern supercomputers in any country except the USA, China or Japan can be easily controlled.

Domestic networks

The development of communication networks for use in supercomputers is carried out by a number of domestic organizations: the RFNC VNIIEF (there is very little information about these developments in open sources), the Institute for Software Systems of the Russian Academy of Sciences and RSK SKIF, the IPM RAS and the Research Institute Kvant (MVS-Express network ").

The 3D tor communication network for the Russian-Italian supercomputer SKIF-Aurora is completely built using the Altera Stratix IV FPGA, which explains the rather small bandwidth per link - 1.25 GB / s (FPGA resources are very limited).

In the MVS-Express network, PCI Express 2.0 is used to integrate the computing nodes, and the nodes are connected through 24-port switches. The network has a topology close to Fat tree. The network adapter in the computing node has one port with a width of 4 lanes, as a result of which the one-way peak throughput per link is 20 Gbit / s without taking into account the encoding overhead. The advantage of using PCI Express in MVS-Express is the efficient support of shared memory with the possibility of one-way communications. As a result, the network is convenient for implementing the Shmem library and PGAS languages ​​(UPC, CAF).

With the support of the Ministry of Industry and Trade of the Russian Federation, NICEVT OJSC is working on the development of the Angara communication network with a 4D-tor topology, which can become the basis for creating domestic technologies for the development of supercomputers.

Network "Angara"

The main objectives of the development of the Angara network:

  • effective support for one-way communications (put / get) and PGAS languages ​​(as the main means of parallel programming),
  • Effective MPI support
  • release of own crystal (to achieve high data transfer rates and low delays),
  • adaptive fail-safe packet transmission,
  • effective work with modern processors and chipsets.

At the first stage of development of this network (2006), a simulation of various network options was carried out and the main decisions were made on the topology, router architecture, routing algorithms and arbitration. In addition to the toroidal topology, Cayley networks and the “thick tree” were considered. The four-dimensional torus was chosen due to its simpler routing, good scalability, and high connectivity compared to smaller tori. Network modeling made it possible to study in detail the effect of various parameters of the network architecture on the main performance characteristics, to understand the patterns for the traffic of tasks with intensive irregular access to memory. As a result, optimal buffer sizes, the number of virtual channels were selected, and potential bottlenecks were analyzed.

In 2008, the first prototype of a FPGA router appeared - a network layout of six nodes on Virtex4 connected to a 2x3 torus, on which the basic functionality of the router was debugged, fault-tolerant data transmission was worked out, the driver and low-level library were written and debugged, the Shmem libraries were ported and MPI Now launched a third-generation layout, consisting of nine nodes connected in a two-dimensional torus 3x3. A stand with two nodes was assembled for testing new connectors and data transmission channels, intended for use with future crystals of the VKS router. При разработке принципов работы сети ряд деталей был позаимствован из работ и , а также в том или ином виде из архитектур IBM Blue Gene и Cray SeaStar.

Сеть «Ангара» имеет топологию 4D-тор. Поддерживается детерминированная маршрутизация, сохраняющая порядок передачи пакетов и предотвращающая появление дедлоков (взаимных блокировок), а также адаптивная маршрутизация, позволяющая одновременно использовать множество путей между узлами и обходить перегруженные и вышедшие из строя участки сети. Particular attention was paid to supporting collective operations (broadcasting and reduction) implemented using a virtual subnet having the topology of a tree superimposed on a multidimensional torus. The network at the hardware level supports two types of remote writes, reads, and atomic operations (addition and exclusive OR). The remote reading execution scheme (sending a request and receiving a response) is shown in Fig. 2 (remote recording and atomic operations are performed similarly). In a separate block, the logic is implemented to aggregate messages received from the network in order to increase the share of useful data per transaction when transmitting through an interface with a host (a host is a processor-memory-bridge bridge).

Fig. 2. Scheme of remote reading in the Angara network

At the data link layer, fail-safe packet transmission is supported. There is also a mechanism for bypassing failed communication channels and nodes by rebuilding routing tables. To perform various service operations (in particular, configure / rebuild routing tables) and perform some calculations, a service processor is used. The host interface uses PCI Express.

Fig. 3. The structure of the computing node with a network adapter / router "Angara"

The main blocks of the router:

  • interface with the host system, responsible for receiving and sending packets on the host interface,
  • an injection and ejection unit that forms packets to be sent to the network and parses the headers of packets coming from the network,
  • a request processing unit that processes packets that require information from the memory of the host system (for example, reads or atomic operations),
  • a collective operations network unit that processes packets associated with collective operations, in particular, performing reduction operations, generating broadcast request packets,
  • a service operations unit that processes packets going to and from the service coprocessor,
  • a switch connecting inputs from various virtual channels and inputs from injectors with outputs to various directions and ejectors,
  • communication channels for transmitting and receiving data in a certain direction,
  • a data transmission unit for sending packets in a given direction; and a receiving and routing unit for receiving packets and deciding on their future fate.

The host interaction (the code executed on the central processor) with the router is carried out by writing to the memory addresses mapped to the addresses of the resource regions of the router (memory-mapped input / output). This allows the application to interact with the router without the participation of the kernel, which reduces the overhead of sending packets, since switching to the kernel context and back takes more than a hundred clock cycles. To send packets, one of the memory regions is used, which is considered as a ring buffer. There is also a separate region for performing operations without copying memory-memory (data is read from memory and written by the adapter of the communication network through DMA operations) and a region with control registers. Access to certain resources of the router is controlled by the nuclear module.

To achieve greater efficiency, it was decided that only one computational task should be performed on one node, this eliminated the overhead associated with the use of virtual memory, avoided task interference, simplified the architecture of the router due to the lack of a full MMU and avoided all his work of communication delays, as well as simplify the network security model, eliminating from it the security of the processes of various tasks on one node. This solution did not affect the functionality of the network as intended primarily for large-sized tasks (as opposed to InfiniBand, a universal network for tasks of various sizes). A similar decision was made at IBM Blue Gene, where a restriction on the uniqueness of the task is introduced for the section.

At the hardware level, simultaneous work with the router of many threads / processes of one task is supported - it is implemented in the form of several injection channels available for use by processes through several ring buffers for recording packets. The number and size of these buffers can change dynamically.

The main programming mode for the Angara network is the joint use of MPI, OpenMP and Shmem, as well as GASNet and UPC.

After the verification and prototyping of the network is completed, it is planned to release a VLSI chip. A prototype VLSI batch will be designed for debugging basic technological solutions, a technological process, and experimental verification of simulation results. The prototype will contain all the basic functionality, work with the PCI Express gen2 x16 interface and links with a throughput of 75 Gb / s.

It is planned to promote the Angara network to the market in two versions: as a separate commercial network in the form of PCI Express cards for cluster systems with standard processors and chipsets, and as part of a four-socket blade system based on AMD processors being developed at the NICEVT.