Abstract
Tabular data on the Web has become a rich source of structured data that is useful for ordinary users to explore. Due to its potential, tables on the Web have recently attracted a number of studies with the goals of understanding the semantics of those Web tables and providing effective search and exploration mechanisms over them. An important part of table understanding and search is column concept determination, i.e., identifying the most appropriate concepts associated with the columns of the tables. The problem becomes especially challenging with the availability of increasingly rich knowledge bases that contain hundreds of millions of entities. In this paper, we focus on an important instantiation of the column concept determination problem, namely, the concepts of a column are determined by fuzzy matching its cell values to the entities within a large knowledge base. We provide an efficient and scalable MapReduce-based solution that is scalable to both the number of tables and the size of the knowledge base and propose two novel techniques: knowledge concept aggregation and knowledge entity partition. We prove that both the problem of finding the optimal aggregation strategy and that of finding the optimal partition strategy are NP-hard, and propose efficient heuristic techniques by leveraging the hierarchy of the knowledge base. Experimental results on real-world datasets show that our method achieves high annotation quality and performance, and scales well.
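To make the problem statement concrete, the following is a minimal single-machine sketch of column concept determination by fuzzy matching. It is not the paper's MapReduce method, and it does not implement knowledge concept aggregation or knowledge entity partition. The toy knowledge base `KB`, the character-trigram Jaccard similarity, the `threshold` value, and the vote-counting rule are all assumptions chosen for illustration; the abstract does not specify the similarity function or the scoring.

```python
from collections import Counter


def trigrams(s):
    """Return the set of padded, lowercased character trigrams of s."""
    padded = f"  {s.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity between the trigram sets of two strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


# Hypothetical toy knowledge base: entity name -> set of concepts.
KB = {
    "Barack Obama": {"Politician", "Person"},
    "Angela Merkel": {"Politician", "Person"},
    "Lionel Messi": {"Athlete", "Person"},
}


def column_concepts(cells, kb, threshold=0.5):
    """Rank concepts by how many cells fuzzily match an entity of that concept.

    Naive all-pairs comparison: every cell is compared with every entity.
    Each cell contributes at most one vote per concept.
    """
    votes = Counter()
    for cell in cells:
        matched = set()
        for entity, concepts in kb.items():
            if jaccard(cell, entity) >= threshold:
                matched |= concepts
        votes.update(matched)
    return votes.most_common()


# A column with misspelled cell values still matches the intended entities.
ranking = column_concepts(["Barak Obama", "Angela Merkel", "Lionel Mesi"], KB)
print(ranking)
```

The all-pairs loop costs one comparison per cell and entity, which is exactly what becomes infeasible for a knowledge base with hundreds of millions of entities. Plain vote counting also tends to favor the most general concept, so a realistic scoring rule would need to account for concept specificity.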
Original language | American English |
---|---|
Pages (from-to) | 1606-1617 |
Number of pages | 12 |
Journal | Proceedings of the VLDB Endowment |
Volume | 6 |
Issue number | 13 |
State | Published - Aug 2013 |
Externally published | Yes |
Event | 39th International Conference on Very Large Data Bases, VLDB 2013 - Trento, Italy; Duration: Aug 26 2013 → Aug 30 2013 |
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- General Computer Science