'Zhores' -- Petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology

The Petaflops supercomputer "Zhores", recently launched in the "Center for Computational and Data-Intensive Science and Engineering" (CDISE) of the Skolkovo Institute of Science and Technology (Skoltech), opens up new exciting opportunities for scientific discoveries in the institute, especially in the areas of data-driven modeling, machine learning and artificial intelligence. This supercomputer utilizes the latest generation of Intel and Nvidia processors to provide resources for the most compute-intensive tasks of the Skoltech scientists working in digital pharma, predictive analytics, photonics, material science, image processing, plasma physics and many more. It currently places 6th in the Russian and CIS TOP-50 (2018) supercomputer list. In this article we summarize the cluster properties and discuss the measured performance and usage modes of this scientific instrument in Skoltech.


Introduction
The Skoltech CDISE Petaflops supercomputer "Zhores", named after the Nobel Laureate Zhores Alferov, is intended for cutting-edge multidisciplinary research in data-driven simulations and modeling, machine learning, Big Data and artificial intelligence (AI). It enables research in such important fields as Bio-medicine [46,44], Computer Vision [20,21,11,42,45,8], Remote Sensing and Data Processing [32,34,13], Oil/Gas [33,19], Internet of Things [37,38], High Performance Computing (HPC) [29,14], Quantum Computing [10,9], Agro-informatics [32], Chemical-informatics [39,35,16,15,17] and many more. Its architecture reflects the modern trend of convergence of "traditional" HPC, Big Data and AI. Moreover, the heterogeneous demands of Skoltech projects, ranging from throughput computing to capability computing, and the need to apply modern concepts of workflow acceleration and in-situ data analysis impose corresponding solutions on the architecture. The design of the cluster is based on the latest generation of CPUs, GPUs, network and storage technologies, current as of 2017-2019. This paper describes the implementation of this machine and gives details of the initial benchmarks that validate its architectural concepts.
The article is organized as follows. In section 2 the details of the installation are discussed, with subsections dedicated to the basic technologies. Section 3 describes several applications run on the "Zhores" cluster and their scaling. The usage of the machine in the "Neurohackathon" held in November 2018 in Skoltech is described in section 4. Finally, section 5 provides conclusions.

Installation
"Zhores" is constructed from DELL PowerEdge C6400 and C4140 servers with Intel Xeon CPUs and Nvidia Volta GPUs, connected by Mellanox EDR InfiniBand (IB) SB7800/7890 switches. We decided to allocate 20 TB of the fastest storage system (based on NVMe over IB technology) for small user files and software (home directories), and a 0.6 PB GPFS file system for bulk data storage. The principal scheme with the majority of components is illustrated in fig. 1. The exact composition with the characteristics of the components is found in table 1. The names of the nodes are given according to their intended role:
• cn - compute nodes to handle CPU workload
• gn - compute nodes to handle GPU workload
• hd - Hadoop nodes with a set of disks for the classical Hadoop workload
• an - access nodes for cluster login, job submission and transfer of user data
• anlab - special nodes for user experiments
• vn - visualization nodes
• mn - main nodes for cluster management and monitoring
All users land on one of the access nodes (an) after login and can use them for interactive work, data transfer and job submission (dispatching tasks to compute nodes). Security requirements place the access nodes in the demilitarized zone. The queue structure is implemented using the SLURM workload manager and is discussed in section 2.5. Both shell scripts and Docker [4] images are accepted as valid work items by the queuing system. We made a decision in principle to use CentOS 7.5, the latest version officially available at the time of installation. The user environment is provided with the Environment Modules software system [5]. Several compilers (Intel and GNU) are available, as well as different versions of pre-compiled utilities and applications.
The cluster is managed with a fault-tolerant installation of the Luna management tool [12]. The two management nodes are mirrors of each other and provide the means of provisioning and administration of the cluster; they also provide the NFS export of user /home directories and all cluster configuration data. This is described in section 2.4.

Servers' Processor Characteristics
The servers have the latest generation of the Intel Xeon processors and Nvidia Volta GPUs. The basic characteristics of each type of server are captured in table 1. We have measured the salient features of these devices. The Intel Xeon 6136 and 6140 "Gold" CPUs of the Skylake generation differ by the total number of cores in the package and the working clock frequency (F). Each core features two floating point AVX512 units. This has been tested with a special benchmark to verify that the performance varies with the frequency as expected.
The CPU performance and memory bandwidth of a single core are shown in fig. 2. The benchmark program to test the floating point calculation performance is published elsewhere [3]. It is an unrolled vector loop with vector width 8, precisely tuned for the AVX512 instruction set. In this loop each of the two execution units of a core operates on exactly 8 double precision numbers in parallel. With two execution units and the fused multiply-add instruction (FMA), the theoretical Double Precision (DP) performance of a single physical core is 8×2×2×F [GHz], and for the maximum of F = 3.5 GHz it may reach 112 GFlop/s/core. The performance scales with the frequency up to the maximum determined by the processor's thermal and electrical limits. The total FMA performance of a node when running AVX512 code on all processors in parallel is about 2.0 TFlop/s for the C6400 machines (cn nodes, 24 cores) and 2.4 TFlop/s for the C4140 machines (gn nodes, 36 cores). Summing up all the cn and gn nodes gives the measured maximum CPU performance of the "Zhores" cluster of 150 TFlop/s. The latencies of the processor memory subsystem have been measured with the LMBench program [31] and are summarized in table 2. The main memory performance is measured with the STREAM program [30] and shown for a single core as a function of clock frequency in fig. 2. The theoretical memory bandwidth may be estimated with Little's Law [23] at 14 GB/s per channel, taking into account the memory latency of 27.4 ns given in table 2. The total memory bandwidth (STREAM Triad) for all cores reached 178.6 GB/s in our measurement using all 6 channels of 2666 MHz DIMMs.
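The arithmetic behind these figures can be checked with a short Python sketch. The function and constant names are ours and do not come from the benchmark program [3]. The node counts (44 cn and 26 gn nodes) are taken from the queue layout in section 2.5. The number of outstanding memory requests in the Little's Law estimate (six 64-byte cache lines) is an assumption, chosen only because it reproduces the quoted 14 GB/s; it is not a measured value.

```python
# Back-of-the-envelope check of the figures quoted in the text.
# All names are illustrative; they are not taken from the benchmark code [3].

AVX512_DP_LANES = 8      # double precision numbers per 512-bit vector
FMA_UNITS_PER_CORE = 2   # AVX512 execution units per core
FLOPS_PER_FMA = 2        # one multiply and one add per FMA


def peak_dp_gflops_per_core(freq_ghz: float) -> float:
    """Theoretical DP peak of one core: 8 x 2 x 2 x F [GHz]."""
    return AVX512_DP_LANES * FMA_UNITS_PER_CORE * FLOPS_PER_FMA * freq_ghz


def little_law_bandwidth_gbs(outstanding_lines: int, line_bytes: int,
                             latency_ns: float) -> float:
    """Little's Law: bandwidth = bytes in flight / latency (B/ns equals GB/s)."""
    return outstanding_lines * line_bytes / latency_ns


# Peak of a single core at the maximum clock of 3.5 GHz.
core_peak = peak_dp_gflops_per_core(3.5)

# Cluster total from the measured per-node figures: 2.0 TFlop/s per cn node
# and 2.4 TFlop/s per gn node, with 44 cn and 26 gn nodes.
cluster_tflops = 44 * 2.0 + 26 * 2.4

# ASSUMPTION: 6 outstanding 64-byte cache lines; latency 27.4 ns from table 2.
bandwidth = little_law_bandwidth_gbs(6, 64, 27.4)

print(f"peak per core: {core_peak:.0f} GFlop/s")
print(f"cluster CPU total: {cluster_tflops:.1f} TFlop/s")
print(f"Little's Law estimate: {bandwidth:.1f} GB/s")
```

Under these assumptions the sketch gives 112 GFlop/s per core, 150.4 TFlop/s for the cluster and about 14.0 GB/s for the bandwidth estimate, in line with the values stated above.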
The strong dependence of the FMA performance on the processor clock frequency, together with the weak dependence of the memory bandwidth on it, suggests a scheme for optimizing the power usage of applications with mixed instruction profiles.

Nvidia V100 GPU
A significant number of nodes (26) in the "Zhores" cluster are equipped with four Nvidia V100 GPUs each. The GPUs are connected pairwise with NVLink and individually with PCIe gen3 x16 to the CPU host. The principal scheme of the connections is shown in fig. 3. The basic measurements to label the links in the plot have been obtained with the Nvidia p2p bandwidth program from the "Samples" directory installed with the GPU drivers. This setup is optimized for parallel computation scaling within the node, while the connections to the cluster network pass through a single PCIe link. The maximum estimated performance of a single V100 GPU is shown in fig. 4. The graphics clock rate was set with the "nvidia-smi" command; the same command with different parameters lists the power draw of the device. The computational efficiency, measured in performance per Watt, is not evenly distributed as a function of frequency: the peak is 67.4 GFlop/s/W (single precision) at 1 GHz, and it drops to 47.7 GFlop/s/W at 1.5 GHz.

Mellanox IB EDR network
The high performance cluster network has the Fat Tree topology and is built from six Mellanox SB7890 (unmanaged) and two SB7800 (managed) switches that provide 100 Gbit/s (IB EDR) connections between the nodes. The performance of the interconnect has been measured with the "mpilink" program that times the ping-pong exchange between each pair of nodes [1]. To make the measurements we installed the Mellanox HPC package drivers and used openMPI version 3.1.2. The results are shown in fig. 5 for serial mode runs and in fig. 6 for parallel mode runs. The serial mode sends packets to each node only when the previous communication has finished, while in parallel mode all sends and receives are issued at the same time. The parallel mode probes packet contention, while the serial mode allows us to establish the absolute speed and discover any failing links. The communication speed in serial mode is centered around 10.2 ± 0.5 GB/s. The parallel mode reveals a certain over-subscription of the Fat Tree network: while the computational nodes are balanced, the additional traffic from the file services causes delays in the transmission. This problem will be addressed in future upgrades.
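To put the serial-mode result in context, the measured speed can be compared with the nominal EDR link rate. The Python sketch below is only an illustration with names of our choosing; it is not part of the "mpilink" program [1], and it assumes that 1 MB means 10^6 bytes.

```python
# Compare the measured ping-pong speed with the nominal EDR link rate.
# Names are illustrative; this is not part of the "mpilink" program [1].

EDR_LINK_GBIT_S = 100.0   # nominal IB EDR rate quoted in the text
MEASURED_GB_S = 10.2      # centre of the serial-mode distribution


def gbit_to_gbyte(rate_gbit_s: float) -> float:
    """Convert Gbit/s to GB/s (8 bits per byte)."""
    return rate_gbit_s / 8.0


def transfer_time_us(message_bytes: int, speed_gb_s: float) -> float:
    """Time in microseconds to move a message at the given speed."""
    return message_bytes / (speed_gb_s * 1e9) * 1e6


nominal_gb_s = gbit_to_gbyte(EDR_LINK_GBIT_S)
efficiency = MEASURED_GB_S / nominal_gb_s

# ASSUMPTION: 1 MB is taken as 10**6 bytes.
t_us = transfer_time_us(10**6, MEASURED_GB_S)

print(f"nominal: {nominal_gb_s:.1f} GB/s, efficiency: {efficiency:.0%}")
print(f"1 MB message: about {t_us:.0f} us one way")
```

With these numbers the serial-mode speed corresponds to roughly 82% of the nominal 12.5 GB/s link rate, and a 1 MB message takes about 98 microseconds in one direction.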

Operating System and cluster management
The "Zhores" cluster is managed by the "Luna" [12] provisioning tool, which can be installed in a fault-tolerant active-passive cluster setup with the TrinityX platform. The Luna management system was developed by ClusterVision BV. The system automates the creation of all the services and cluster configuration that turn a collection of servers into a unified computational machine.
The cluster management software supports the following essential features:
• All cluster configuration is kept in the Luna database, and all cluster nodes boot from this information, which is held in one place. This database is mirrored between the management nodes with DRBD, and the active management node provides access to the data for every node in the cluster through an NFS share, see fig. 7.
• Node provisioning from OS images is based on the BitTorrent protocol [2] for efficient simultaneous (disk-less or standard) boot. The image management allows one to grab an OS image from a running node to a file and to clone images for testing or backup purposes. A group of nodes can use the same image for provisioning, which fosters unification of the cluster configuration. Nodes use the PXE protocol to load a service image that implements the booting procedure.
• All the nodes (or groups of nodes) in the cluster can be switched on or off and reset with the IPMI protocol from the management nodes with a single command.
• The cluster services set up on the management nodes in a fault-tolerant way include the following: DHCP, DNS, OpenLDAP, Slurm, Zabbix, Docker-repository, etc.
The management nodes are based on CentOS 7.5 and enforce the same OS on the compute nodes; additional packages, specific drivers and different kernel versions can be included in the images for the cluster nodes. The installation requires each node to have at least two Ethernet network interfaces, one dedicated to the management traffic and the other used for administrative access. A single cluster node can be booted within 2.5 minutes (over 1 GbE), and a cold start of the whole "Zhores" cluster takes 5 minutes to reach the fully operational state.

The queueing system
Work queues have been organized with the Slurm workload manager to reflect the different application profiles of the users of the cluster. Several nodes have been given to dedicated projects (gn26, anlab), and one CPU-only node is set up for debugging work (cn44). The remaining nodes have been combined in queues for the GPU-nodes (gn01-gn25) and for the CPU-nodes (cn01-cn43).
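The node-to-queue layout described above can be summarized in a few lines of Python. This is only a sketch of the mapping, not the actual Slurm configuration: the queue labels ("cpu", "gpu", "debug", "dedicated") are hypothetical names introduced for illustration, and only the node ranges are taken from the text.

```python
import re

# Queue labels below ("cpu", "gpu", "debug", "dedicated") are hypothetical;
# only the node ranges come from the description in the text.
_NODE_RE = re.compile(r"(cn|gn)(\d{2})")


def queue_for(node: str) -> str:
    """Return the (hypothetical) queue label for a node name."""
    if node.startswith("anlab"):
        return "dedicated"
    match = _NODE_RE.fullmatch(node)
    if match is None:
        raise ValueError(f"not a compute node: {node!r}")
    kind, index = match.group(1), int(match.group(2))
    if kind == "gn":
        if 1 <= index <= 25:
            return "gpu"        # gn01-gn25
        if index == 26:
            return "dedicated"  # gn26 is given to a dedicated project
    else:  # kind == "cn"
        if 1 <= index <= 43:
            return "cpu"        # cn01-cn43
        if index == 44:
            return "debug"      # cn44 is the debugging node
    raise ValueError(f"node index out of range: {node!r}")


for name in ("cn01", "cn44", "gn25", "gn26"):
    print(name, "->", queue_for(name))
```

A lookup of this kind is convenient in job-submission wrappers or monitoring scripts that need to know which class of work a node serves.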

Linpack run
The Linpack benchmark was performed as a part of the cluster evaluation procedure and to rate the supercomputer for performance comparison. The results of the run are shown in table 3.
Table 3: Linpack performance of the "Zhores" cluster run separately on the GPU nodes and with all CPU resources. The power draw for the CPU Linpack run is estimated (*).
The "Zhores" supercomputer is significant for the Russian computational science community and has reached position 6 in the Russian and CIS TOP-50 list [6].

Algorithms for aggregation and fragmentation equations
In our benchmarks we used a parallel implementation of efficient numerical methods for the aggregation and fragmentation equations [26,22] and also a parallel implementation of the solver for the advection-driven coagulation process [24]. Its sequential version has already been utilized in a number of applications [25,27] and can be considered one of the most efficient algorithms for a class of Smoluchowski-type aggregation kinetic equations. It is worth stressing that the parallel algorithm for pure aggregation-fragmentation equations relies mostly on the performance of the ClusterFFT operation, which dominates in terms of algorithmic complexity; thus its scalability is extremely limited. Nevertheless, for 128 cores we obtain a speedup of calculations by more than 85 times, see table 4. In the case of the parallel solver for advection-driven coagulation [29], we obtain almost ideal acceleration when running the algorithm on almost the full CPU-based segment. In this case, the algorithm is based on the one-dimensional domain decomposition along the spatial coordinate and has very good scalability, see table 5 and fig. 8.
Alongside the well-known two-particle problem of aggregation, we have measured the performance of a parallel implementation of the more general three-particle (ternary) Smoluchowski-type kinetic aggregation equations [40]. In this case the algorithm is somewhat similar to the one for standard binary aggregation. However, the number of floating point operations and the size of the allocated memory increase as compared to the binary case, because the dimension of the low-rank Tensor Train (TT) decomposition [36] is naturally bigger in the ternary case. The most computationally expensive operation in the parallel implementation of the algorithm is again the ClusterFFT. The speedup of the parallel ternary aggregation algorithm applied to the empirically derived ballistic-like kinetic coefficients [28] is shown in table 6. In full accordance with the structure of ClusterFFT and the problem complexity, one needs to increase the number N of differential equations used in order to obtain scalability. Speedups for both implementations of binary and ternary aggregation are shown in fig. 9.
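The speedup of more than 85 on 128 cores quoted for the aggregation-fragmentation solver can be translated into a parallel efficiency and an effective serial fraction. The Python sketch below uses the Karp-Flatt metric for the latter; this metric is our choice for illustration and is not used in the benchmarks themselves. Since the text gives the speedup only as a lower bound, the results are bounds as well.

```python
# Parallel efficiency and Karp-Flatt serial fraction for a measured speedup.
# The Karp-Flatt metric is an illustrative addition, not part of the benchmarks.


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup / cores


def karp_flatt_serial_fraction(speedup: float, cores: int) -> float:
    """Experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


CORES = 128
SPEEDUP = 85.0  # "more than 85 times", so this is a lower bound

efficiency = parallel_efficiency(SPEEDUP, CORES)
serial_fraction = karp_flatt_serial_fraction(SPEEDUP, CORES)

print(f"efficiency: at least {efficiency:.0%}")
print(f"serial fraction: at most {serial_fraction:.2%}")
```

For these inputs the efficiency is at least about 66% and the effective serial fraction is at most about 0.4%, which illustrates how even a small non-scaling part of the computation limits the speedup at this core count.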

Gromacs
Classical molecular dynamics is an effective method with high predictive ability in a wide range of scientific fields [18,41]. Using the Gromacs 2018.3 software [7,43] we have performed molecular dynamics simulations in order to test the performance of the "Zhores" cluster. As a model system we chose 125 million Lennard-Jones spheres with a Van der Waals cut-off radius of 1.2 nm and the Berendsen thermostat. All tests were conducted with the single precision version of Gromacs. The results are presented in fig. 10. We measured the performance as a function of the number of nodes; we used up to 40 CPU nodes and up to 24 GPU nodes, with 4 OpenMP threads per MPI process. Each task was run 5 times, and the results were averaged to obtain the final performance. Grey and red solid lines show the linear acceleration of the program on CPU and GPU nodes, respectively. In the case of the CPU-nodes, one can see an almost ideal speedup. With a large number of CPU-nodes, the speedup deviates from linear and grows more slowly.
To test the performance on the GPU-nodes, we have performed simulations with 1, 2 and 4 graphics cards per node. The use of all 4 graphics cards demonstrates good scalability, while 2 GPUs per node show a slightly lower speedup. Runs with 1 GPU per node demonstrate worse performance, especially with a high number of nodes. To compare the efficiency for different numbers of GPUs per node, we show the performance for the four configurations (0, 1, 2 and 4 GPUs) using 24 GPU-nodes in fig. 11 as a bar chart. The configuration with 4 GPUs per node gives about 2.5 times higher performance than running the program only on the CPU cores. Even 1 GPU per node gives a significant performance increase compared to the CPU-only run.

Neurohackathon
To address this problem, a two-stage authentication system was chosen using several levels of virtualization. Access to the cluster was made through a VPN tunnel using Cisco ASA and Cisco AnyConnect; then the SSH (RFC 4251) protocol was used to access the consoles of the operating system (OS) of the participants.
The virtualization was provided at the level of the data network through the IEEE 802.1Q (VLAN) protocol and at the OS level through Docker [4] containerization with the ability to connect to GPU accelerators. Each container worked in its own address space and in a separate VLAN, so we achieved an additional level of isolation from the host machine. In addition, at the Linux kernel level the namespace feature was turned on, and the user and group IDs were remapped to obfuscate the superuser rights on the host machine.
As a result, each participant of the Neurohackathon had a Docker container with access to the console via the SSH protocol and access to the Jupyter application on their VM via the HTTPS protocol. The four Nvidia Tesla V100 accelerators on the GPU nodes were used for the computing.
The number of teams participating in the competition had rapidly increased from 6 to 11 one hour before the start of the event.The usage of virtualization technology and the flexible architecture of the cluster allowed us to provide all teams with the necessary resources and start the hackathon on time.

Conclusions
In conclusion, we have presented the Petaflops supercomputer "Zhores" installed in Skoltech CDISE that will be actively used for multidisciplinary research in data-driven simulations, machine learning, Big Data and artificial intelligence. The Linpack benchmark placed this cluster at position 6 of the Russian and CIS TOP-50 Supercomputer list. Initial tests show good scalability of the modeling applications and prove that the new computing instrument can be used to support advanced research at Skoltech and for all its research and industrial partners.

Acknowledgements
We are thankful to the Nobel Prize laureate Prof. Zhores Alferov for his agreement to lend his name to this project. The authors acknowledge the valuable contributions of Dmitry Sivkov (Intel) and Sergey Kovylov (Nvidia), who helped in running the Linpack tests during the deployment of the "Zhores" cluster, and of Dmitry Nikitenko (MSU), who helped in filling in the forms for the Russia and CIS Top-50 submission. We are indebted to Dr. Sergey Matveev for valuable consultations and to Denis Shageev for indispensable help with the cluster installation. We would also like to thank Prof. Dmitry Dylov, Prof. Andrey Somov, Prof. Dmitry Lakontsev, Seraphim Novichkov and Eugene Bykov for their active role in the organization of the Neurohackathon. We are also thankful to Prof. Ivan Oseledets and Prof. Eugene Burnaev and their team members for testing the cluster.

Figure 1: Principal connection scheme. The an and mn nodes are marked explicitly; the cn, gn and other nodes are lumped together.

Figure 2: Floating point performance (FMA instructions) of a 6136 CPU core and memory bandwidth (STREAM Triad) as a function of clock frequency. The left ordinate shows the FMA performance; the right ordinate represents the memory bandwidth.

Figure 3: Principal connections between host and graphics subsystem on graphics nodes.

Figure 4: Nvidia V100 GPU floating point performance as a function of graphics clock rate. The electrical power draw corresponding to the set frequency is indicated on the upper axis.

Figure 5: Histogram of the ping-pong times/speeds between all nodes using 1 MB packets in serial mode

Figure 7: Organization of the "Zhores" cluster management with Luna system.

Figure 9: Binary and ternary aggregation solvers on CPU, ballistic-like kernels, 16 and 10 time-integration steps for N = 2^22 and N = 2^19 nonlinear ODEs, respectively. The parameter R denotes the rank of the matrix and tensor decompositions used.

Figure 10: Performance of the molecular dynamics simulations of 125 million Lennard-Jones spheres using Gromacs 2018.3 as a function of the number of nodes. Note that there are only 26 GPU nodes on the cluster.

Table 2: Memory properties of the Xeon 6136/6140 processors visible from a single core, separately for the GPU and for all nodes using only CPU computation.

Table 4: Computational times for 16 time-integration steps for the parallel implementation of the algorithm for the aggregation and fragmentation equations with N = 2^22 strongly coupled nonlinear ODEs. In this benchmark we utilized the nodes from the CPU segment of the cluster. The experiments have been performed using Intel compilers and the Intel MKL library. See also fig. 8.

Table 5: Parallel advection-coagulation solver on CPUs, ballistic kernel, domain size N × M = 12288, 16 time-integration steps. This benchmark utilized up to 32 nodes from the CPU segment of the cluster.

Table 6: Computational times for 10 time-integration steps for the parallel implementation of the algorithm for the ternary aggregation equations with N = 2^19 nonlinear ODEs. The experiments have been performed using Intel compilers and the Intel MKL library.