TY - JOUR
T1 - Experience with multi-tier grid MySQL database service resiliency at BNL
AU - Wlodek, Tomasz
AU - Ernst, Michael
AU - Hover, John
AU - Katramatos, Dimitrios
AU - Packard, Jay
AU - Smirnov, Yuri
AU - Yu, Dantong
PY - 2011
Y1 - 2011
N2 - We describe the use of F5's BIG-IP smart switch technology (3600 Series and Local Traffic Manager v9.0) to provide load balancing and automatic fail-over to multiple Grid services (GUMS, VOMS) and their associated back-end MySQL databases. This resiliency is introduced in front of the external application servers and also for the back-end database systems, which is what makes it "multi-tier". The combination of solutions chosen to ensure high availability of the services, in particular the database replication and fail-over mechanism, is discussed in detail. The paper explains the design and configuration of the overall system, including virtual servers, machine pools, and health monitors (which govern routing), as well as the master-slave database scheme and fail-over policies and procedures. Pre-deployment planning and stress testing are outlined. Integration of the systems with our Nagios-based facility monitoring and alerting is also described, and the application characteristics of GUMS and VOMS that enable effective clustering are explained. We then summarize our practical experiences and real-world scenarios resulting from operating a major US Grid center, and assess the applicability of our approach to other Grid services in the future.
UR - http://www.scopus.com/inward/record.url?scp=84858136196&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84858136196&partnerID=8YFLogxK
U2 - 10.1088/1742-6596/331/4/042044
DO - 10.1088/1742-6596/331/4/042044
M3 - Conference article
VL - 331
JO - Journal of Physics: Conference Series
JF - Journal of Physics: Conference Series
SN - 1742-6588
IS - PART 4
M1 - 042044
T2 - International Conference on Computing in High Energy and Nuclear Physics, CHEP 2010
Y2 - 18 October 2010 through 22 October 2010
ER -