Probabilistic Failure Detection for Efficient Distributed Storage Maintenance

Jing Tian

Zhi Yang

Wei Chen

Ben Y. Zhao

Yafei Dai

*IEEE Symposium on Reliable Distributed
Systems (SRDS 2008)*

[Full Text in PDF Format, 762KB]

Paper Abstract

Distributed storage systems often use data replica-tion to mask failures and guarantee high data avail-ability. Node failures can be transient or permanent. While the system must generate new replicas to replace replica lost to permanent failures, it can save signifi-cant replication costs by not replicating following transient faults. Given the unpredictability of network dynamics, however, distinguishing permanent and transient failures is extremely difficult. Traditional timeout approaches are difficult to tune and can intro-duce unnecessary replication.

In this paper, we propose Protector, an algorithm that addresses this
problem using network-wide statis-tical prediction. Our algorithm
drastically improves prediction accuracy by making predictions across
ag-gregate replica groups instead of single nodes. These estimates of
the number of "live replicas" can guide efficient data replication
policies. We prove that given data on node down times and the
probability of per-manent failures, the estimate given by our algorithm
is more accurate than all alternatives. We describe two ways to obtain
the failure probability function driven by models or traces. We conduct
extensive simulations based both on synthetic and real traces, and show
that Protector closely approximates the performance of a perfect oracle
failure detector, while significantly outperforming
timeout-based detectors using a wide range of parameters.