Summary
Advances in large-scale data center computing and networking have accelerated innovation across computer science during the past decade. Large-scale, data-intensive systems experimentation is increasingly important to accelerating science, as is developing a workforce capable of solving big data problems on the large cloud-based application platforms that are increasingly central to the US economy. To stay relevant in an age of at-scale computer systems research, the systems research community has pressed operators of Computing Research Infrastructures (CRIs) such as CloudLab and Chameleon Cloud to support larger experiments. But while commercial compute cloud platforms have grown exponentially, research infrastructures have grown only modestly. Unfortunately, this `scale' gap seems likely to widen over time, threatening to inhibit academic researchers from innovating directly in large-scale computing systems research.
The Cloudjoin project seeks to develop a long-term, sustainable approach to shrinking the scale gap by creating hybrid CRIs with elastic computing. The project will develop methods, tools, and best practices for integrating cloud CRIs with commercial compute clouds. Cloudjoin promises to let experiments scale at low cost while maintaining experiment reproducibility and preserving an intellectual `home' for computing systems research experimenters. The project explores three cloud integration research thrusts. The first thrust asks: "How should a CRI operator architect and implement her infrastructure's connectivity to a public cloud?" The second thrust asks: "How should an experimenter partition compute and storage resources between the CRI and the public cloud?" The final thrust addresses the complexities of scaling experiments by exploring and embracing professional-quality multi-cloud management platforms equipped with monitoring, logging, alerting, error reporting, and profiling capabilities.