Repair of dirty data is a crucial process, since dirty data usually leads to various forms of economic loss. Most data repair efforts focus on tuple insertions and deletions, but these can result in loss of vital information e.g., w.r.t. INDs. The authors instead incorporate a mechanism of tuple modification and insertion (w.r.t. FDs and INDs) using a novel cost model. Since the cost model is NP-complete (in the size of the database), two greedy heuristics (which operate on equivalence classes) are proposed and optimized.

The proposed algorithms work on equivalence (EQ) classes. EQ classes consist of tuple, attribute pairs. These pairs are grouped together if every tuple in every pair has the same antecedent (w.r.t. a constraint) with every other pair in its class. Once we have these classes, the eventual target repair for the tuples in each class depends on the cost model. Specifically, the repair cost for an EQ class is the summation of repair cost of its element tuples (which is the accuracy value of the tuple times the distance between the original and the candidate repair value). EQ classes have to eventually be merged (since they are w.r.t. the same constraint).

The main repair driver is a procedure called GEN-REPAIR. Within GEN-REPAIR, an “unresolved” repair queue is initialized, maintained and looped over until it is empty, terminating GEN-REPAIR. Within this loop, a PICKNEXT function returns a tuple(s) to be resolved w.r.t. a constraint. PICKGREEDY is an implementation of PICKNEXT which either returns (i) a new or existing tuple w.r.t. an IND or (ii) a tuple and its EQ class w.r.t. an FD. The returned choice is the one with the least resolution cost, i.e., the “greedy choice” property. GEN-REPAIR is now a greedy algorithm. Another implementation of PICKNEXT gives unconditional preference to tuples w.r.t (the more rigid) FDs, resulting in the fd-first greedy algorithm. Optimizations to these two flavors are carried out, resulting in heuristics (e.g., avoid the greedy choice at times to escape local optima). Experiments suggest that the fd-first heuristic performs the better of the two.

Paper : A Cost-Based model and effective heuristic for repairing constraints by value modification; Bohannon et. Al., SIGMOD, 2005

The proposed algorithms work on equivalence (EQ) classes. EQ classes consist of tuple, attribute pairs. These pairs are grouped together if every tuple in every pair has the same antecedent (w.r.t. a constraint) with every other pair in its class. Once we have these classes, the eventual target repair for the tuples in each class depends on the cost model. Specifically, the repair cost for an EQ class is the summation of repair cost of its element tuples (which is the accuracy value of the tuple times the distance between the original and the candidate repair value). EQ classes have to eventually be merged (since they are w.r.t. the same constraint).

The main repair driver is a procedure called GEN-REPAIR. Within GEN-REPAIR, an “unresolved” repair queue is initialized, maintained and looped over until it is empty, terminating GEN-REPAIR. Within this loop, a PICKNEXT function returns a tuple(s) to be resolved w.r.t. a constraint. PICKGREEDY is an implementation of PICKNEXT which either returns (i) a new or existing tuple w.r.t. an IND or (ii) a tuple and its EQ class w.r.t. an FD. The returned choice is the one with the least resolution cost, i.e., the “greedy choice” property. GEN-REPAIR is now a greedy algorithm. Another implementation of PICKNEXT gives unconditional preference to tuples w.r.t (the more rigid) FDs, resulting in the fd-first greedy algorithm. Optimizations to these two flavors are carried out, resulting in heuristics (e.g., avoid the greedy choice at times to escape local optima). Experiments suggest that the fd-first heuristic performs the better of the two.

Paper : A Cost-Based model and effective heuristic for repairing constraints by value modification; Bohannon et. Al., SIGMOD, 2005

## No comments:

## Post a Comment