A tale of two data-intensive paradigms: Applications, abstractions, and architectures

Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha, Geoffrey C. Fox

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

34 Citations (Scopus)

Abstract

Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a basis, common terminology and functional factors upon which to analyze the two approaches of both paradigms. We discuss the concept of 'Big Data Ogres' and their facets as means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementation/approaches of these paradigms, shed light upon the reasons for their current 'architecture' and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms, to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering), characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.

Original language: English (US)
Title of host publication: Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014
Editors: Peter Chen, Hemant Jain
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 645-652
Number of pages: 8
ISBN (Electronic): 9781479950577
DOI: https://doi.org/10.1109/BigData.Congress.2014.137
State: Published - Sep 22 2014
Event: 3rd IEEE International Congress on Big Data, BigData Congress 2014 - Anchorage, United States
Duration: Jun 27 2014 to Jul 2 2014

Publication series

Name: Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014

Other

Other: 3rd IEEE International Congress on Big Data, BigData Congress 2014
Country: United States
City: Anchorage
Period: 6/27/14 to 7/2/14

Fingerprint

Ecosystems
Processing
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Science Applications

Cite this

Jha, S., Qiu, J., Luckow, A., Mantha, P., & Fox, G. C. (2014). A tale of two data-intensive paradigms: Applications, abstractions, and architectures. In P. Chen & H. Jain (Eds.), Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014 (pp. 645-652). [6906840] (Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.Congress.2014.137

Jha, Shantenu ; Qiu, Judy ; Luckow, Andre ; Mantha, Pradeep ; Fox, Geoffrey C. / A tale of two data-intensive paradigms : Applications, abstractions, and architectures. Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014. editor / Peter Chen ; Hemant Jain. Institute of Electrical and Electronics Engineers Inc., 2014. pp. 645-652 (Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014).

@inproceedings{1e5ff411cf2547cfad471ad3a15b274b,
title = "A tale of two data-intensive paradigms: Applications, abstractions, and architectures",
abstract = "Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a basis, common terminology and functional factors upon which to analyze the two approaches of both paradigms. We discuss the concept of 'Big Data Ogres' and their facets as means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementation/approaches of these paradigms, shed light upon the reasons for their current 'architecture' and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms, to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering), characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.",
author = "Shantenu Jha and Judy Qiu and Andre Luckow and Pradeep Mantha and Fox, {Geoffrey C.}",
year = "2014",
month = "9",
day = "22",
doi = "10.1109/BigData.Congress.2014.137",
language = "English (US)",
series = "Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "645--652",
editor = "Peter Chen and Hemant Jain",
booktitle = "Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014",
address = "United States",

}

Jha, S, Qiu, J, Luckow, A, Mantha, P & Fox, GC 2014, A tale of two data-intensive paradigms: Applications, abstractions, and architectures. in P Chen & H Jain (eds), Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014., 6906840, Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014, Institute of Electrical and Electronics Engineers Inc., pp. 645-652, 3rd IEEE International Congress on Big Data, BigData Congress 2014, Anchorage, United States, 6/27/14. https://doi.org/10.1109/BigData.Congress.2014.137

A tale of two data-intensive paradigms : Applications, abstractions, and architectures. / Jha, Shantenu; Qiu, Judy; Luckow, Andre; Mantha, Pradeep; Fox, Geoffrey C.

Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014. ed. / Peter Chen; Hemant Jain. Institute of Electrical and Electronics Engineers Inc., 2014. p. 645-652 6906840 (Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014).

TY - GEN

T1 - A tale of two data-intensive paradigms

T2 - Applications, abstractions, and architectures

AU - Jha, Shantenu

AU - Qiu, Judy

AU - Luckow, Andre

AU - Mantha, Pradeep

AU - Fox, Geoffrey C.

PY - 2014/9/22

Y1 - 2014/9/22

N2 - Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a basis, common terminology and functional factors upon which to analyze the two approaches of both paradigms. We discuss the concept of 'Big Data Ogres' and their facets as means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementation/approaches of these paradigms, shed light upon the reasons for their current 'architecture' and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms, to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering), characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.

AB - Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a basis, common terminology and functional factors upon which to analyze the two approaches of both paradigms. We discuss the concept of 'Big Data Ogres' and their facets as means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementation/approaches of these paradigms, shed light upon the reasons for their current 'architecture' and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms, to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering), characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.

UR - http://www.scopus.com/inward/record.url?scp=84923884968&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84923884968&partnerID=8YFLogxK

U2 - https://doi.org/10.1109/BigData.Congress.2014.137

DO - https://doi.org/10.1109/BigData.Congress.2014.137

M3 - Conference contribution

T3 - Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014

SP - 645

EP - 652

BT - Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014

A2 - Chen, Peter

A2 - Jain, Hemant

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Jha S, Qiu J, Luckow A, Mantha P, Fox GC. A tale of two data-intensive paradigms: Applications, abstractions, and architectures. In Chen P, Jain H, editors, Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014. Institute of Electrical and Electronics Engineers Inc. 2014. p. 645-652. 6906840. (Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014). https://doi.org/10.1109/BigData.Congress.2014.137