Catch M(oor)e If You Can: Agile Hardware/Software Co-Design for Hyperscale Cloud Systems
Global reliance on cloud services, powered by transformative technologies like generative AI, machine learning, and big-data analytics, is driving exponential growth in demand for hyperscale cloud compute infrastructure. Meanwhile, the breakdown of classical hardware scaling (e.g., Moore's Law) is hampering growth in compute supply. Building domain-specific hardware can address this supply-demand gap, but catching up with exponential demand requires developing new hardware rapidly and with confidence that performance/efficiency gains will compound in the context of a complete system. These are challenging tasks given the status quo in hardware design, even before accounting for the immense scale of cloud systems.
This talk will focus on two themes of my work: (1) Developing radical new agile, end-to-end hardware/software co-design tools that challenge the status quo in hardware design for systems of all scales and unlock the ability to innovate on new hardware at datacenter scale. (2) Leveraging these tools and insights from hyperscale datacenter fleet profiling to architect and implement state-of-the-art domain-specific hardware that addresses key efficiency challenges in hyperscale cloud systems.
I will first cover my work creating the award-winning and widely used FireSim FPGA-accelerated hardware simulation platform, which provides unprecedented hardware/software co-design capabilities. FireSim automatically constructs high-performance, cycle-exact, scale-out simulations of novel hardware designs derived from the tapeout-friendly RTL code that describes them, empowering hardware designers and domain experts alike to directly iterate on new hardware designs in hours rather than years. FireSim also unlocks innovation in datacenter hardware with the unparalleled ability to scale to massive, distributed simulations of thousand-node networked datacenter clusters with specialized server designs and complete control over the datacenter architecture. I will then briefly cover my work co-creating the also widely used Chipyard platform for agile construction, simulation (including FireSim), and tape-out of specialized RISC-V System-on-Chip (SoC) designs using a novel, RTL-generator-driven approach.
Next, I will discuss my work in collaboration with Google on Hyperscale SoC, a cloud-optimized server chip built, evaluated, and taped out with FireSim and Chipyard. Hyperscale SoC includes my work on several novel domain-specific accelerators (DSAs) for expensive but foundational operations in hyperscale servers, including (de)serialization, (de)compression, and more. Hyperscale SoC demonstrates a new paradigm of data-driven, end-to-end hardware/software co-design, combining key insights from profiling Google's worldwide datacenter fleet with the ability to rapidly build and evaluate novel hardware designs in FireSim/Chipyard. This instance of Hyperscale SoC is just the beginning; I will conclude by covering the wide-ranging opportunities that can now be explored for radically redesigning next-generation hyperscale cloud datacenters.
Bio: Sagar Karandikar is a Ph.D. candidate at UC Berkeley and a Student Researcher at Google. His work broadly focuses on co-designing hardware and software to build next-generation hyperscale cloud systems. He is also interested in agile, open-source hardware development methodologies.
His first-author publications have received several honors, including being selected for the ISCA@50 25-year Retrospective, as an IEEE Micro Top Pick, as an IEEE Micro Top Pick Honorable Mention, and as the MICRO '21 Distinguished Artifact Award winner.
He created and leads the FireSim project, which has been used as a foundational research platform in over 50 peer-reviewed publications from first authors at over 20 institutions. FireSim has also been used in the development of commercially available chips and as a standard host platform for DARPA and IARPA programs. He is a co-creator and co-lead of the also widely used Chipyard RISC-V System-on-Chip (SoC) development platform. His work on Hyperscale SoC has been influential at Google and more broadly across other silicon vendors. He was selected as a 2022 DARPA Riser and received the UC Berkeley Outstanding Graduate Student Instructor (TA) Award. He received his M.S. and B.S. from UC Berkeley.
To request accommodations for a disability, please contact Emily Lawrence, emilyl@cs.princeton.edu, at least one week prior to the event.