TY - GEN
T1 - Quantifying and improving the availability of high-performance cluster-based Internet services
AU - Nagaraja, Kiran
AU - Krishnan, Neeraj
AU - Bianchini, Ricardo
AU - Martin, Richard P.
AU - Nguyen, Thu D.
PY - 2003
Y1 - 2003
N2 - Cluster-based servers can substantially increase performance when nodes cooperate to globally manage resources. However, in this paper we show that cooperation results in a substantial availability loss, in the absence of high-availability mechanisms. Specifically, we show that a sophisticated cluster-based Web server, which gains a factor of 3 in performance through cooperation, increases service unavailability by a factor of 10 over a non-cooperative version. We then show how to augment this Web server with software components embodying a small set of high-availability techniques to regain the lost availability. Among other interesting observations, we show that the application of multiple high-availability techniques, each implemented independently in its own subsystem, can lead to inconsistent recovery actions. We also show that a novel technique called Fault Model Enforcement can be used to resolve such inconsistencies. Augmenting the server with these techniques led to a final expected availability of close to 99.99%.
UR - http://www.scopus.com/inward/record.url?scp=84877070043&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84877070043&partnerID=8YFLogxK
U2 - https://doi.org/10.1145/1048935.1050178
DO - 10.1145/1048935.1050178
M3 - Conference contribution
SN - 1581136951
SN - 9781581136951
T3 - Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC 2003
BT - Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC 2003
T2 - 2003 ACM/IEEE Conference on Supercomputing, SC 2003
Y2 - 15 November 2003 through 21 November 2003
ER -