Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection

Zhihan Li, Youjian Zhao, Rong Liu, Dan Pei

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

For large Internet companies, monitoring a large number of KPIs (Key Performance Indicators) and detecting anomalies is essential to ensuring service quality and reliability. However, large-scale anomaly detection on millions of KPIs is very challenging because of the large overhead of model selection, parameter tuning, model training, or labeling. In this paper we argue that KPI clustering can help: we can cluster millions of KPIs into a small number of clusters and then select and train a model on a per-cluster basis. KPI clustering, however, faces challenges that are not present in classic time series clustering: KPIs are typically much longer than other time series, and noise, anomalies, phase shifts, and amplitude differences often change the shape of KPIs and mislead the clustering algorithm. To tackle these challenges, we propose ROCKA, a robust and rapid KPI clustering algorithm. It consists of four steps: preprocessing, baseline extraction, clustering, and assignment. These techniques help group KPIs according to their underlying shapes with high accuracy and efficiency. Our evaluation using real-world KPIs shows that ROCKA achieves an F-score higher than 0.85 and reduces the model training time of a state-of-the-art anomaly detection algorithm by 90%, with only a 15% performance loss.
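The four steps named in the abstract can be sketched roughly as follows. The abstract does not say how each step is implemented, so every concrete choice here is an illustrative assumption rather than the authors' method: linear interpolation plus z-score standardization for preprocessing, a moving average for baseline extraction, a cross-correlation-based shape distance, a simple greedy threshold grouping for clustering, and nearest-representative assignment. All function names and the `window` and `radius` parameters are hypothetical.

```python
import numpy as np

def standardize(x):
    """Z-score a series so amplitude differences do not dominate the shape."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def preprocess(kpi):
    """Step 1 (assumed): fill missing points linearly, then standardize."""
    x = np.array(kpi, dtype=float)  # copy, so the caller's data is untouched
    idx = np.arange(len(x))
    missing = np.isnan(x)
    if missing.any():
        x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    return standardize(x)

def extract_baseline(x, window=5):
    """Step 2 (assumed): moving average to damp noise and short anomalies."""
    kernel = np.ones(window) / window
    return standardize(np.convolve(x, kernel, mode="valid"))

def shape_distance(a, b):
    """1 - max normalized cross-correlation over all shifts.

    Taking the maximum over shifts makes the distance tolerant of phase shifts.
    """
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 1.0
    return 1.0 - (np.correlate(a, b, mode="full") / denom).max()

def cluster_sample(baselines, radius=0.3):
    """Step 3 (assumed): greedy grouping of a sample of baselines.

    A series joins the nearest existing representative within `radius`,
    otherwise it becomes the representative of a new cluster.
    """
    reps, labels = [], []
    for b in baselines:
        dists = [shape_distance(b, r) for r in reps]
        if dists and min(dists) <= radius:
            labels.append(int(np.argmin(dists)))
        else:
            reps.append(b)
            labels.append(len(reps) - 1)
    return reps, labels

def assign(baseline, reps, radius=0.3):
    """Step 4 (assumed): nearest representative, or -1 if none is close."""
    dists = [shape_distance(baseline, r) for r in reps]
    best = int(np.argmin(dists))
    return best if dists[best] <= radius else -1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(500)
    kpis = [
        np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.1, 500),             # noisy
        5 * np.sin(2 * np.pi * (t + 12) / 50) + rng.normal(0, 0.1, 500),  # shifted, scaled
        np.sin(2 * np.pi * t / 125),                                      # different shape
    ]
    reps, labels = cluster_sample([extract_baseline(preprocess(k)) for k in kpis])
    print(labels)
```

Clustering only a sample and assigning the remaining KPIs to existing representatives is one way to keep the cost manageable at scale: the pairwise comparisons run on the sample only, while each additional KPI needs just one comparison per cluster.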

Original language: English (US)
Title of host publication: 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538625422
DOIs: https://doi.org/10.1109/IWQoS.2018.8624168
State: Published - Jan 22 2019
Event: 26th IEEE/ACM International Symposium on Quality of Service, IWQoS 2018 - Banff, Canada
Duration: Jun 4 2018 - Jun 6 2018

Publication series

Name: 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018

Conference

Conference: 26th IEEE/ACM International Symposium on Quality of Service, IWQoS 2018
Country: Canada
City: Banff
Period: 6/4/18 - 6/6/18

Fingerprint

Clustering algorithms
Time series
Phase shift
Labeling
Tuning
Internet
Clustering
Anomaly detection
Key performance indicators
Industry
Anomaly
Clustering algorithm

All Science Journal Classification (ASJC) codes

  • Safety, Risk, Reliability and Quality
  • Management of Technology and Innovation
  • Computer Networks and Communications
  • Media Technology

Cite this

Li, Z., Zhao, Y., Liu, R., & Pei, D. (2019). Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection. In 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018 [8624168] (2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IWQoS.2018.8624168
Li, Zhihan ; Zhao, Youjian ; Liu, Rong ; Pei, Dan. / Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection. 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018. Institute of Electrical and Electronics Engineers Inc., 2019. (2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018).
@inproceedings{6a471b1501c3407686d4b07ba32dfa73,
title = "Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection",
abstract = "For large Internet companies, it is very important to monitor a large number of KPIs (Key Performance Indicators) and detect anomalies to ensure the service quality and reliability. However, large-scale anomaly detection on millions of KPIs is very challenging due to the large overhead of model selection, parameter tuning, model training, or labeling. In this paper we argue that KPI clustering can help: we can cluster millions of KPIs into a small number of clusters and then select and train model on a per-cluster basis. However, KPI clustering faces new challenges that are not present in classic time series clustering: KPIs are typically much longer than other time series, and noises, anomalies, phase shifts and amplitude differences often change the shape of KPIs and mislead the clustering algorithm. To tackle the above challenges, in this paper we propose a robust and rapid KPI clustering algorithm, ROCKA. It consists of four steps: preprocessing, baseline extraction, clustering and assignment. These techniques help group KPIs according to their underlying shapes with high accuracy and efficiency. Our evaluation using real-world KPIs shows that ROCKA gets F-score higher than 0.85, and reduces model training time of a state-of-the-art anomaly detection algorithm by 90{\%}, with only 15{\%} performance loss.",
author = "Zhihan Li and Youjian Zhao and Rong Liu and Dan Pei",
year = "2019",
month = "1",
day = "22",
doi = "10.1109/IWQoS.2018.8624168",
language = "English (US)",
series = "2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018",
address = "United States",

}

Li, Z, Zhao, Y, Liu, R & Pei, D 2019, Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection. in 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018., 8624168, 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018, Institute of Electrical and Electronics Engineers Inc., 26th IEEE/ACM International Symposium on Quality of Service, IWQoS 2018, Banff, Canada, 6/4/18. https://doi.org/10.1109/IWQoS.2018.8624168

Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection. / Li, Zhihan; Zhao, Youjian; Liu, Rong; Pei, Dan.

2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018. Institute of Electrical and Electronics Engineers Inc., 2019. 8624168 (2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018).

TY - GEN

T1 - Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection

AU - Li, Zhihan

AU - Zhao, Youjian

AU - Liu, Rong

AU - Pei, Dan

PY - 2019/1/22

Y1 - 2019/1/22

N2 - For large Internet companies, it is very important to monitor a large number of KPIs (Key Performance Indicators) and detect anomalies to ensure the service quality and reliability. However, large-scale anomaly detection on millions of KPIs is very challenging due to the large overhead of model selection, parameter tuning, model training, or labeling. In this paper we argue that KPI clustering can help: we can cluster millions of KPIs into a small number of clusters and then select and train model on a per-cluster basis. However, KPI clustering faces new challenges that are not present in classic time series clustering: KPIs are typically much longer than other time series, and noises, anomalies, phase shifts and amplitude differences often change the shape of KPIs and mislead the clustering algorithm. To tackle the above challenges, in this paper we propose a robust and rapid KPI clustering algorithm, ROCKA. It consists of four steps: preprocessing, baseline extraction, clustering and assignment. These techniques help group KPIs according to their underlying shapes with high accuracy and efficiency. Our evaluation using real-world KPIs shows that ROCKA gets F-score higher than 0.85, and reduces model training time of a state-of-the-art anomaly detection algorithm by 90%, with only 15% performance loss.

UR - http://www.scopus.com/inward/record.url?scp=85062631336&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062631336&partnerID=8YFLogxK

U2 - https://doi.org/10.1109/IWQoS.2018.8624168

DO - https://doi.org/10.1109/IWQoS.2018.8624168

M3 - Conference contribution

T3 - 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018

BT - 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Li Z, Zhao Y, Liu R, Pei D. Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection. In 2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018. Institute of Electrical and Electronics Engineers Inc. 2019. 8624168. (2018 IEEE/ACM 26th International Symposium on Quality of Service, IWQoS 2018). https://doi.org/10.1109/IWQoS.2018.8624168