TY - JOUR

T1 - Decision trees for entity identification

T2 - Approximation algorithms and hardness results

AU - Chakaravarthy, Venkatesan T.

AU - Pandit, Vinayaka

AU - Roy, Sambuddha

AU - Awasthi, Pranjal

AU - Mohania, Mukesh K.

PY - 2011/3

Y1 - 2011/3

N2 - We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes and a probability distribution over the set of entities that specifies the likelihood of the occurrence of each entity. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. This classical problem finds such diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in the field of medicine. Prior work mainly deals with the special case where the input table is binary and the probability distribution over the set of entities is uniform. We study the general problem involving arbitrary input tables and arbitrary probability distributions over the set of entities. We consider a natural greedy algorithm and prove an approximation guarantee of O(rK . log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value rK is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K = 2). Thus, for the case of binary tables, our approximation algorithm is optimal up to constant factors (since r2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture by Erdös.

AB - We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes and a probability distribution over the set of entities that specifies the likelihood of the occurrence of each entity. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. This classical problem finds such diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in the field of medicine. Prior work mainly deals with the special case where the input table is binary and the probability distribution over the set of entities is uniform. We study the general problem involving arbitrary input tables and arbitrary probability distributions over the set of entities. We consider a natural greedy algorithm and prove an approximation guarantee of O(rK . log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value rK is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K = 2). Thus, for the case of binary tables, our approximation algorithm is optimal up to constant factors (since r2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture by Erdös.

KW - Approximation algorithms

KW - Decision tree

KW - Ramsey numbers

UR - http://www.scopus.com/inward/record.url?scp=79953246501&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79953246501&partnerID=8YFLogxK

U2 - https://doi.org/10.1145/1921659.1921661

DO - https://doi.org/10.1145/1921659.1921661

M3 - Article

VL - 7

JO - ACM Transactions on Algorithms

JF - ACM Transactions on Algorithms

SN - 1549-6325

IS - 2

M1 - 15

ER -