Repairing dirty data is an important process since unclean data leads to erroneous decision making and economic loss. Most current repair techniques only focus on the repair of data inconsistencies (wrt a set of constraints). However, in many modern scenarios, the constraints need repair too (e.g., evolution of business rules, etc). In this paper, the authors present a novel cost model which accounts for both data and FD constraint repair. In addition, two algorithms were examined (individually and together)- one for data repair and one for constraint repair.
Data repair is based on the concept of tuple patterns. A core tuple pattern is a frequently occurring tuple in the relation, and a sequence of these form the model. A deviant tuple pattern is one which falls below the threshold but is very similar to at least one core pattern. The intuition is that deviants are candidates for data repair.
Constraint repair is based on the concept of VI (variance of information). Tuples are first partitioned into classes, where each class in a specific partition contains all tuples sharing the same X value wrt a FD, F: X -> Y. Classes can be further refined so that each class shares both the X and Y value. Constraint repair consists of enumerating candidate attribute partitions, and calculating VI scores (a technique used for evaluating different clusters) with the ground truth partition (the original partition based on F, since F is a validated application constraint). The attribute partition with the lowest VI with the ground truth is selected as the constraint repair.
Combining the data and constraint repair techniques results in the unified cost model. The model is guided by 3 properties (accuracy, redundancy and conciseness), which are encapsulated by a DL (distance length) score. The implementation of the cost model is the MDL repair algorithm. The repair algorithm first orders all constraints, and then loops over them. Within the loop, the cost for data repair is calculated first, and then the cost for constraint repair. The repair type which confers the lower cost is recommended. Extensive experiments indicate that the unified model scales well (given the parameters), and also produces high quality repairs (based on the Cora dataset).
Data repair is based on the concept of tuple patterns. A core tuple pattern is a frequently occurring tuple in the relation, and a sequence of these form the model. A deviant tuple pattern is one which falls below the threshold but is very similar to at least one core pattern. The intuition is that deviants are candidates for data repair.
Constraint repair is based on the concept of VI (variance of information). Tuples are first partitioned into classes, where each class in a specific partition contains all tuples sharing the same X value wrt a FD, F: X -> Y. Classes can be further refined so that each class shares both the X and Y value. Constraint repair consists of enumerating candidate attribute partitions, and calculating VI scores (a technique used for evaluating different clusters) with the ground truth partition (the original partition based on F, since F is a validated application constraint). The attribute partition with the lowest VI with the ground truth is selected as the constraint repair.
Combining the data and constraint repair techniques results in the unified cost model. The model is guided by 3 properties (accuracy, redundancy and conciseness), which are encapsulated by a DL (distance length) score. The implementation of the cost model is the MDL repair algorithm. The repair algorithm first orders all constraints, and then loops over them. Within the loop, the cost for data repair is calculated first, and then the cost for constraint repair. The repair type which confers the lower cost is recommended. Extensive experiments indicate that the unified model scales well (given the parameters), and also produces high quality repairs (based on the Cora dataset).
No comments:
Post a Comment