Probabilistic Failure Detection for Efficient Distributed Storage Maintenance

Jing Tian
Zhi Yang
Wei Chen
Ben Y. Zhao
Yafei Dai

IEEE Symposium on Reliable Distributed Systems (SRDS 2008)

[Full Text in PDF Format, 762KB]

Paper Abstract

Distributed storage systems often use data replica-tion to mask failures and guarantee high data avail-ability. Node failures can be transient or permanent. While the system must generate new replicas to replace replica lost to permanent failures, it can save signifi-cant replication costs by not replicating following transient faults. Given the unpredictability of network dynamics, however, distinguishing permanent and transient failures is extremely difficult. Traditional timeout approaches are difficult to tune and can intro-duce unnecessary replication.

In this paper, we propose Protector, an algorithm that addresses this problem using network-wide statis-tical prediction. Our algorithm drastically improves prediction accuracy by making predictions across ag-gregate replica groups instead of single nodes. These estimates of the number of "live replicas" can guide efficient data replication policies. We prove that given data on node down times and the probability of per-manent failures, the estimate given by our algorithm is more accurate than all alternatives. We describe two ways to obtain the failure probability function driven by models or traces. We conduct extensive simulations based both on synthetic and real traces, and show that Protector closely approximates the performance of a perfect oracle failure detector, while significantly outperforming timeout-based detectors using a wide range of parameters.