Challenges in Supercomputer Operations
JAXA is tackling the technical challenges of supercomputer operation through multiple approaches, such as technological improvements and open information sharing.
High level of awareness of the technical challenges with supercomputers
To solve technical issues in the operation of supercomputers, JAXA has cooperated with the manufacturers to apply technical improvements and has provided users with the information they need to run their programs efficiently.
Pursuit of Efficiency
Power consumption per FLOPS has decreased drastically with the progress of semiconductor technology, but not by enough to keep up with the ever-increasing demand for processing power as calculations grow in scale.
We have been making every effort to reduce the power consumption of the whole supercomputer system.
SORA-MA, the main system of JSS2, delivers 180 to 200 times the performance per rack of the M-System, the main system of the previous-generation JSS1.
The performance improvement of JSS2 comes from its processors and from its sophisticated parts and design. JSS2 adopted processors that aim to achieve both low power consumption and high throughput, and its performance per processor is 25 times that of JSS1.
The circuit boards of SORA-MA are carefully designed to improve power efficiency, using high-performance, efficient parts.
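The per-rack and per-processor ratios above can be related with a rough calculation. The sketch below assumes that per-rack performance is approximately per-processor performance multiplied by the number of processors in a rack. That model, and the function name `implied_density_ratio`, are illustrative assumptions rather than figures or methods stated by JAXA.

```python
# Back-of-the-envelope check. ASSUMPTION (not from the source text):
# per-rack performance ~ (per-processor performance) x (processors per rack).

PER_PROCESSOR_RATIO = 25            # JSS2 vs. JSS1, from the text
PER_RACK_RATIO_RANGE = (180, 200)   # SORA-MA vs. M-System, from the text


def implied_density_ratio(per_rack_ratio: float, per_processor_ratio: float) -> float:
    """Return the packaging-density factor implied by the two ratios."""
    return per_rack_ratio / per_processor_ratio


low, high = (
    implied_density_ratio(r, PER_PROCESSOR_RATIO) for r in PER_RACK_RATIO_RANGE
)
print(f"Implied density gain per rack: {low:.1f}x to {high:.1f}x")
```

Under this assumption, roughly 7 to 8 times as much processing hardware fits in a rack, which is consistent with the emphasis on dense circuit boards and the need for stronger cooling.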
Cooling System and Environmental Improvements
Circuit boards in a supercomputer generate more heat as their density increases, and a conventional cooling system cannot handle that much heat. For JSS2 we adopted a water-cooling system with cold plates, which removes heat more efficiently.
The layout of the computer room changes according to the systems installed. Because JSS2 has a smaller installation footprint, the room was halved, which in turn halved the area that needs air conditioning.
The “hot-aisle/cold-aisle” layout, adopted in JSS1, is inherited in JSS2.
Moreover, we regularly review the temperature setting in the computer room. It is now set at 20 degrees Celsius, higher than when JSS1 was running.
With the adoption of the water-cooling system and the improved computer room environment, the energy needed for climate control for JSS2 (as of phase 2) was reduced to 1/6 of that for JSS1.
Job Scheduling
When multiple users share a parallel supercomputer, many independent jobs with varying resource requirements are submitted at the same time. The job scheduler handles these jobs efficiently and fairly.
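To illustrate why scheduling policy affects how well a machine is used, here is a minimal sketch of a toy scheduler. It is not JARMan and not the scheduler used on JSS2. The `Job` class, the `simulate` function, the 8-node machine and the job sizes are all hypothetical. It compares strict first-come-first-served ordering with a simple first-fit policy that lets smaller jobs run past a blocked one.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int     # nodes requested
    runtime: int   # run time in arbitrary time units


def simulate(jobs, total_nodes, backfill):
    """Run jobs (all submitted at t=0); return (makespan, utilization)."""
    if any(j.nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the machine has")
    if not jobs:
        return 0, 0.0
    queue = list(jobs)
    running = []               # min-heap of (finish_time, nodes)
    free, now = total_nodes, 0
    while queue or running:
        # Try to start queued jobs in submission order.
        i = 0
        while i < len(queue):
            job = queue[i]
            if job.nodes <= free:
                free -= job.nodes
                heapq.heappush(running, (now + job.runtime, job.nodes))
                queue.pop(i)
            elif backfill:
                i += 1         # skip the blocked job, try later ones
            else:
                break          # strict FCFS: nothing overtakes the head job
        # Advance to the next completion; release all jobs ending then.
        now, nodes = heapq.heappop(running)
        free += nodes
        while running and running[0][0] == now:
            free += heapq.heappop(running)[1]
    used = sum(j.nodes * j.runtime for j in jobs)
    return now, used / (total_nodes * now)


# Hypothetical workload on a hypothetical 8-node machine.
jobs = [Job("A", 6, 4), Job("B", 6, 4), Job("C", 2, 4), Job("D", 2, 4)]
fcfs_makespan, fcfs_util = simulate(jobs, 8, backfill=False)
fit_makespan, fit_util = simulate(jobs, 8, backfill=True)
print(f"FCFS:      makespan={fcfs_makespan}, utilization={fcfs_util:.0%}")
print(f"First-fit: makespan={fit_makespan}, utilization={fit_util:.0%}")
```

In this toy workload, letting small jobs fill idle nodes shortens the total run time and raises utilization. The first-fit policy here makes no reservation for the blocked job, so it can delay large jobs indefinitely. A fair scheduler has to balance this kind of packing against waiting time.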
Up through JSS1, we developed and used JAXA’s original scheduler, “JARMan”. It maintained a utilization rate as high as 98%.
Unfortunately, “JARMan” could not run on JSS2 because the interconnect differs from that of JSS1. As a result, the utilization rate dropped to 80% when JSS2 started operation.
Since then, improvements to operation and settings have raised the rate to 85%.
We are researching new technologies for better scheduling of jobs.