Improving cluster availability using workstation validation

Research output: Contribution to journalConference articlepeer-review

49 Scopus citations

Abstract

We demonstrate a framework for improving the availability of cluster based Internet services. Our approach models Internet services as a collection of interconnected components, each possessing well defined interfaces and failure semantics. Such a decomposition allows designers to engineer high availability based on an understanding of the interconnections and isolated fault behavior of each component, as opposed to ad-hoc methods. In this work, we focus on using the entire commodity workstation as a component because it possesses natural, fault-isolated interfaces. We define a failure event as a reboot because not only is a workstation unavailable during a reboot, but also because reboots are symptomatic of a larger class of failures, such as configuration and operator errors. Our observations of 3 distinct clusters show that the time between reboots is best modeled by a Weibull distribution with shape parameters of less than 1, implying that a workstation becomes more reliable the longer it has been operating. Leveraging this observed property, we design an allocation strategy which withholds recently rebooted workstations from active service, validating their stability before allowing them to return to service. We show via simulation that this policy leads to a 70-30 rule-of-thumb: For a constant utilization, approximately 70% of the workstation failures can be masked from end clients with 30% extra capacity added to the cluster, provided reboots are not strongly correlated. We also found our technique is most sensitive to the burstiness of reboots as opposed to absolute lengths of workstation uptimes.

Original languageAmerican English
Pages (from-to)217-227
Number of pages11
JournalPerformance Evaluation Review
Volume30
Issue number1
DOIs
StatePublished - 2002
EventACM SIGMETRICS 2002 International Conference on Measurement and Modeling of Computer Systems - Marina Del Rey, CA, United States
Duration: Jun 15 2002Jun 19 2002

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'Improving cluster availability using workstation validation'. Together they form a unique fingerprint.

Cite this