Estimating statistical aggregates on probabilistic data streams

T. S. Jayram, Andrew McGregor, Erik Vee

Research output: Contribution to journal › Article

30 Scopus citations

Abstract

The probabilistic stream model was introduced by Jayram et al. [2007]. It generalizes the data stream model to handle probabilistic data: each item of the stream represents a probability distribution over a set of possible events. A probabilistic stream therefore determines a distribution over a potentially exponential number of classical deterministic streams, in each of which every item is deterministically one of the domain values. We present algorithms for computing commonly used aggregates on a probabilistic stream. We give the first one-pass streaming algorithms for estimating the expected mean of a probabilistic stream. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach for obtaining unbiased estimators over probabilistic data from unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We also present the first known streaming algorithms for estimating F0, the number of distinct items, on probabilistic streams. Our work further gives an efficient one-pass algorithm for estimating the median, and a two-pass algorithm for estimating the range.
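To make the model concrete, here is a small illustrative Python sketch (not from the paper, and deliberately not a streaming algorithm): it enumerates every deterministic stream a short probabilistic stream induces, weights each by its probability, and computes the expectation of an aggregate such as F0 by brute force. The item representation as a value-to-probability dict is an assumption made for this example.

```python
from itertools import product

def expected_aggregate(prob_stream, agg):
    """Brute-force expectation of an aggregate over a probabilistic stream.

    Each item is a dict mapping a domain value to its probability
    (the probabilities within an item sum to 1). The stream induces a
    distribution over deterministic streams; we enumerate all of them,
    which is exponential in the stream length -- this only illustrates
    the semantics, whereas the paper's algorithms avoid the enumeration.
    """
    total = 0.0
    for realization in product(*(item.items() for item in prob_stream)):
        prob = 1.0
        values = []
        for value, p in realization:
            prob *= p
            values.append(value)
        total += prob * agg(values)
    return total

# A two-item probabilistic stream: the first item is 'a' or 'b' with
# probability 1/2 each; the second item is always 'a'.
stream = [{"a": 0.5, "b": 0.5}, {"a": 1.0}]

# Expected F0 (number of distinct items): the realizations (a, a) and
# (b, a) each occur with probability 1/2 and have F0 = 1 and F0 = 2,
# so E[F0] = 1.5.
exp_f0 = expected_aggregate(stream, lambda vals: len(set(vals)))
```

The brute-force cost here is the product of the items' support sizes, which is exactly the exponential blow-up that makes one-pass estimators for such aggregates nontrivial.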

Original language: English (US)
Article number: 26
Journal: ACM Transactions on Database Systems
Volume: 33
Issue number: 4
DOIs
State: Published - Nov 1 2008

All Science Journal Classification (ASJC) codes

  • Information Systems

Keywords

  • Frequency moments
  • Mean
  • Median
  • OLAP
  • Probabilistic streams

