CC* Integration-Small: Unifying and Accelerating Campus Computational Science with Ray

Jack Brassil

Summary

A principal challenge in advancing scientific computing today is efficiently running a mix of conventional High Performance Computing (HPC) workloads and fast-growing Machine Learning (ML) tasks. Both kinds of workload benefit from hardware acceleration, including co-processors such as Graphics Processing Units (GPUs) and high-performance communications interconnects. There is also a specific need to serve rapidly emerging Reinforcement Learning (RL) workloads, which have demanding latency requirements because policies are adapted continuously in response to newly arriving data. RL is well suited to many emerging Internet of Things (IoT) and sensing-based applications, ranging from self-driving car navigation to ensuring safe human-robot interactions.

Ray is an open-source distributed software framework designed to improve the efficiency of both scientific computing and ML workloads by distributing and parallelizing them. To accelerate computational science on campus, this project proposes to deploy Ray software across four different computing environments: 1) a large campus HPC compute cluster, 2) the NSF-supported FABRIC computing and networking research infrastructure, 3) hybrid campus and commercial compute clouds, and 4) an edge network of embedded computers. This single, unifying framework promises to allow investigators to code once and easily deploy to each of the four computing environments from their desktops. Ray handles the work of managing a cluster, scheduling tasks, transferring data over the network, checking system health, and tolerating faults, allowing computational scientists and engineers to focus on the underlying data science problem rather than on the complexities of distributed systems. This project will show that Ray can achieve this goal across an even wider variety of compute settings than envisioned.
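
As a rough illustration of the "code once" idea, the following minimal Python sketch uses Ray's core task API (ray.init, ray.remote, ray.get). It is not code from this project: the simulate and run_all functions are hypothetical stand-ins for a unit of scientific work and its driver, and the serial fallback when Ray is not installed is an assumption added only so the sketch runs anywhere. The cluster to use is read from the RAY_ADDRESS environment variable, so the same script could in principle target a laptop or a remote cluster without source changes.

```python
import os

try:
    import ray  # real library; typically installed with `pip install ray`
except ImportError:
    ray = None  # assumption: fall back to serial execution if Ray is absent


def simulate(n: int) -> int:
    """Hypothetical stand-in for one unit of scientific work:
    the sum of squares 0^2 + 1^2 + ... + n^2."""
    return sum(i * i for i in range(n + 1))


def run_all(inputs):
    """Evaluate simulate() on every input, as parallel Ray tasks when possible."""
    inputs = list(inputs)
    if ray is None:
        # Serial fallback, for illustration only.
        return [simulate(n) for n in inputs]

    # With no address, Ray starts a local instance on this machine;
    # with RAY_ADDRESS set, the same code connects to an existing cluster.
    ray.init(address=os.environ.get("RAY_ADDRESS"), ignore_reinit_error=True)

    # Wrap the plain function as a Ray remote function.
    remote_simulate = ray.remote(simulate)

    # Each .remote() call schedules a task and returns a future immediately.
    futures = [remote_simulate.remote(n) for n in inputs]

    # ray.get blocks until all results are ready and preserves input order.
    return ray.get(futures)


if __name__ == "__main__":
    print(run_all(range(4)))
```

Scheduling, data transfer, and fault handling do not appear in the sketch because Ray performs them behind the .remote() and ray.get calls; the scientific function itself stays ordinary Python.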

This project is supported in part by the National Science Foundation under grant OAC-2429485.