Systems and Tools for Reliable Software: Replication, Reproducibility, and Security
Date and Time
Tuesday, March 28, 2017 - 12:30pm to 1:30pm
Location
Computer Science Small Auditorium (Room 105)
Type
CS Department Colloquium Series
Speaker
Host
Prof. Michael Freedman
The past decade has seen a rapid acceleration in the development of new and transformative applications in many areas including transportation, medicine, finance, and communication. Most of these applications are made possible by the increasing diversity and scale of hardware and software systems.
While this brings unprecedented opportunity, it also increases the probability of failures and the difficulty of diagnosing them. Increased scale and transience have also made management increasingly challenging. Devices can come and go for a variety of reasons including mobility, failure and recovery, and scaling capacity to meet demand.
In this talk, I will present several systems that I built to address the resulting challenges to reliability, management, and security.
Ori is a reliable distributed file system for devices at the network edge. Ori automates many of the tasks of storage reliability and recovery through replication, taking advantage of fast LANs and low-cost local storage in edge networks.
Castor is a record/replay system for multi-core applications with predictable and consistently low overheads. This makes it practical to leave record/replay on in production systems, to reproduce difficult bugs when they occur, and to support recovering from hardware failures through fault tolerance.
Cryptographic CFI (CCFI) is a dynamic approach to control flow integrity. Unlike previous CFI systems that rely purely on static analysis, CCFI can classify pointers based on dynamic and runtime characteristics. This limits attacks to actively used code paths, resulting in a substantially smaller attack surface.
Ali is currently completing his PhD at Stanford University, where he is advised by Prof. David Mazières. His work focuses on improving reliability, ease of management, and security in operating systems and distributed systems. Previously, he was a Staff Engineer at VMware, Inc., where he was the technical lead for the live migration products. Ali received an M.Eng. in electrical engineering and computer science and a B.S. in electrical engineering from the Massachusetts Institute of Technology.