In the data cleaning process, it is often unclear whether the data is dirty or if data constraints are outdated (or both). [Chiang, ICDE ‘11] proposed a novel approach involving both data and FD repairs in a unified repair model. In this paper, the authors explored a similar idea by assigning relative trust values to data and FD repairs in a combined model. The proposed algorithms generate multiple data and FD repairs corresponding to specific relative trust values.

The key Repair_Data_FDs algorithm was developed to produce (Σ’, I’) repairs i.e., τ-constrained repairs (where τ is the relative trust value). These τ-constrained repairs are essentially minimal repairs. Two important steps of Repair_Data_FDs are (i) modifying Σ to get Σ’ (as close to Σ as possible), while guaranteeing that there exists an instance I’ s.t., I’ |= Σ’, with dist(I’, I) <= τ and (ii) modifying I wrt Σ’ to get a minimal I’.

The key Repair_Data_FDs algorithm was developed to produce (Σ’, I’) repairs i.e., τ-constrained repairs (where τ is the relative trust value). These τ-constrained repairs are essentially minimal repairs. Two important steps of Repair_Data_FDs are (i) modifying Σ to get Σ’ (as close to Σ as possible), while guaranteeing that there exists an instance I’ s.t., I’ |= Σ’, with dist(I’, I) <= τ and (ii) modifying I wrt Σ’ to get a minimal I’.

In step (i), a key requirement is to compute the minimum number of cell changes in I to satisfy Σ’, called δopt(Σ’, I). This is NP-hard. An approximation, called P-approximate τ-constrained repair, can be used to represent an upper bound on δopt(Σ’, I). This upper bound can also be represented as the 2-approximate vertex cover of the conflict graph (given I and Σ) below a threshold. Step (i) can now be formalized. One crucial procedure is to find goal states in a state search space of FD modifications. This can be done via an optimized BFS or A* search of the search space. The output of step (i), Σ’, can now be processed by step (ii), to find data repairs wrt Σ’. The output of step (ii) is a V-instance (which is used to represent multiple I’s), obtained by finding vertex covers of conflict graphs (superficially similar to finding the upper bound of δopt(Σ’, I) in step (i)).

Another algorithm, Find_Repairs_FDs (a modification of Repair_Data_FDs) performs efficiently on multiple τ values. However, unlike Repair_Data_FDs, Find_Repairs_FDs only generates FD repairs. Hence, another algorithm, Repair_Data, is used to find data modifications for each FD repair. Experiments indicate that the A* algorithm scales better than BFS for most conditions.

Paper : On the Relative Trust bet. Inconsistent Data and Inaccurate Constraints; Beskales et. Al., ICDE '13

Another algorithm, Find_Repairs_FDs (a modification of Repair_Data_FDs) performs efficiently on multiple τ values. However, unlike Repair_Data_FDs, Find_Repairs_FDs only generates FD repairs. Hence, another algorithm, Repair_Data, is used to find data modifications for each FD repair. Experiments indicate that the A* algorithm scales better than BFS for most conditions.

Paper : On the Relative Trust bet. Inconsistent Data and Inaccurate Constraints; Beskales et. Al., ICDE '13

Excellent post!!! Java is most popular and efficient programming language available in the market today. It helps developers to create stunning desktop/web applications loaded with stunning functionalities. J2EE Training in Chennai | JAVA Training in Chennai

ReplyDelete