Configuration errors are among the dominant causes of service-wide, catastrophic failures in today's cloud and datacenter systems. Despite the wide adoption of fault-tolerance and recovery techniques, these large-scale software systems still fail to effectively deal with configuration errors. In fact, even tolerance/recovery mechanisms are often misconfigured and thus crippled in reality.
In this talk, I will present our research efforts towards hardening cloud and datacenter systems against configuration errors. I will start with work that seeks to understand the fundamental causes of misconfigurations. I will then focus on two of my approaches, PCheck and Spex, that enable software systems to anticipate and defend against configuration errors. PCheck generates checking code to help systems detect configuration errors early, and Spex exposes bad system reactions to configuration errors based on constraints inferred from source code.
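To make the idea of early detection concrete, here is a minimal hand-written sketch of a startup-time configuration check. It is not PCheck's generated code or its actual method. The function name `check_config` and the parameters `backup_dir` and `heartbeat_timeout_s` are hypothetical, chosen only to illustrate values that a system might otherwise not touch until a failure occurs.

```python
import os


def check_config(config):
    """Exercise each value the way the system would later use it.

    Returns a list of human-readable problems; an empty list means
    every check passed. All parameter names here are hypothetical.
    """
    problems = []

    # Hypothetical parameter: a directory used only when recovery kicks in.
    backup_dir = config.get("backup_dir")
    if backup_dir is None:
        problems.append("backup_dir: missing")
    elif not os.path.isdir(backup_dir):
        problems.append(f"backup_dir: {backup_dir!r} is not a directory")
    elif not os.access(backup_dir, os.W_OK):
        problems.append(f"backup_dir: {backup_dir!r} is not writable")

    # Hypothetical parameter: a timeout read only by the failure detector.
    timeout = config.get("heartbeat_timeout_s")
    try:
        if float(timeout) <= 0:
            problems.append("heartbeat_timeout_s: must be positive")
    except (TypeError, ValueError):
        problems.append(f"heartbeat_timeout_s: {timeout!r} is not a number")

    return problems
```

Calling such a function once at startup and refusing to start when the list is non-empty surfaces a bad value immediately, rather than at the moment the recovery path first needs it.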
Tianyin Xu is a Ph.D. candidate in Computer Science and Engineering at the University of California, San Diego. His research interests lie at the intersection of systems, software engineering, and HCI, with the overarching goal of building reliable and secure systems. His dissertation work has impacted the configuration design and implementation of real-world commercial and open-source systems, and has received a Best Paper Award at OSDI 2016.