Monday, 28 October 2013

Paper summary : On the Relative Trust between Inconsistent Data and Inaccurate Constraints

In the data cleaning process, it is often unclear whether the data is dirty, the constraints are outdated, or both. [Chiang, ICDE ‘11] proposed a novel approach involving both data and FD repairs in a unified repair model. In this paper, the authors explore a similar idea by assigning relative trust values to data and FD repairs in a combined model. The proposed algorithms generate multiple data and FD repairs, each corresponding to a specific relative trust value.

The key algorithm, Repair_Data_FDs, produces (Σ’, I’) repairs, i.e., τ-constrained repairs (where τ is the relative trust value); these are essentially minimal repairs. Repair_Data_FDs has two important steps: (i) modify Σ to obtain Σ’ (as close to Σ as possible) while guaranteeing that some instance I’ exists such that I’ |= Σ’ and dist(I’, I) <= τ, and (ii) modify I w.r.t. Σ’ to obtain a minimal I’.
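
The admissibility condition in step (i) can be illustrated with a toy sketch. The cell-count distance below is my reading of dist(I’, I); the instance encoding and names are illustrative, not from the paper:

```python
# A minimal sketch, assuming dist(I', I) counts the number of cells
# (tuple/attribute positions) where the two instances disagree. A
# candidate (Sigma', I') is admissible only if I' satisfies Sigma'
# and differs from I in at most tau cells.

def cell_distance(inst_a, inst_b):
    """Number of cells where the two (same-shape) instances disagree."""
    return sum(a != b
               for row_a, row_b in zip(inst_a, inst_b)
               for a, b in zip(row_a, row_b))

I  = [("Ann", "NYC"), ("Bob", "LA")]
I2 = [("Ann", "NYC"), ("Bob", "SF")]
assert cell_distance(I, I2) == 1   # one changed cell
assert cell_distance(I, I2) <= 2   # within a trust budget of tau = 2
```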

In step (i), a key requirement is computing δopt(Σ’, I), the minimum number of cell changes in I needed to satisfy Σ’; computing this exactly is NP-hard. Instead, a P-approximate τ-constrained repair provides an upper bound on δopt(Σ’, I), which can be obtained as a 2-approximate vertex cover of the conflict graph (given I and Σ) below a threshold. With this bound, step (i) can be formalized: the crucial procedure is finding goal states in a state-space search over FD modifications, via an optimized BFS or A* search. The output of step (i), Σ’, is then processed by step (ii) to find data repairs w.r.t. Σ’. The output of step (ii) is a V-instance (a compact representation of multiple I’s), obtained by finding vertex covers of conflict graphs (similar to the upper-bound computation in step (i)).
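
The standard matching-based 2-approximation gives a concrete picture of the upper bound on δopt(Σ’, I). The names below (conflict_edges, etc.) are mine; the paper only specifies the 2-approximate vertex cover itself:

```python
# Cells in conflict are graph nodes; each edge is a violating cell pair,
# and any repair must change at least one endpoint of every edge. Taking
# both endpoints of a maximal matching yields a cover at most twice the
# optimum, so its size upper-bounds the optimal number of cell changes.

def two_approx_vertex_cover(conflict_edges):
    cover = set()
    for u, v in conflict_edges:
        if u not in cover and v not in cover:
            cover.add(u)   # take both endpoints of an uncovered edge
            cover.add(v)
    return cover

# Toy conflict graph over cells c1..c4 forming a path:
edges = [("c1", "c2"), ("c2", "c3"), ("c3", "c4")]
cover = two_approx_vertex_cover(edges)
# The optimal cover here is {c2, c3} (size 2); the approximation
# returns 4 cells, within the guaranteed factor of 2.
```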

Another algorithm, Find_Repairs_FDs (a modification of Repair_Data_FDs), runs efficiently across multiple τ values. However, unlike Repair_Data_FDs, Find_Repairs_FDs generates only FD repairs; a companion algorithm, Repair_Data, then finds data modifications for each FD repair. Experiments indicate that the A* variant scales better than BFS under most conditions.

Paper : On the Relative Trust between Inconsistent Data and Inaccurate Constraints; Beskales et al., ICDE '13

Paper summary : A Unified Model for Data and Constraint Repair

Repairing dirty data is an important process, since unclean data leads to erroneous decision making and economic loss. Most current repair techniques focus only on repairing data inconsistencies (w.r.t. a set of constraints). However, in many modern scenarios the constraints need repair too (e.g., as business rules evolve). In this paper, the authors present a novel cost model that accounts for both data and FD constraint repair. In addition, two algorithms are examined (individually and together): one for data repair and one for constraint repair.

Data repair is based on the concept of tuple patterns. A core tuple pattern is a frequently occurring tuple in the relation, and a sequence of these forms the model. A deviant tuple pattern is one that falls below the frequency threshold but is very similar to at least one core pattern. The intuition is that deviants are candidates for data repair.
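
A small sketch of the core/deviant split (the threshold, the one-attribute similarity test, and all names are my illustrative choices, not the paper's definitions):

```python
from collections import Counter

def hamming_close(p, q, max_diff=1):
    """True if two tuple patterns differ in at most max_diff attributes."""
    return len(p) == len(q) and sum(a != b for a, b in zip(p, q)) <= max_diff

def classify_patterns(tuples, threshold):
    """Core = frequent patterns; deviant = infrequent but near a core."""
    freq = Counter(tuples)
    core = {p for p, c in freq.items() if c >= threshold}
    deviant = {p for p, c in freq.items()
               if c < threshold and any(hamming_close(p, q) for q in core)}
    return core, deviant

rows = [("NY", "10001")] * 5 + [("NY", "10002")] * 4 + [("NY", "1OOO1")]
core, deviant = classify_patterns(rows, threshold=3)
# ("NY", "1OOO1") occurs once and differs from the core pattern
# ("NY", "10001") in one attribute, so it becomes a repair candidate.
```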

Constraint repair is based on the concept of VI (variation of information). Tuples are first partitioned into classes, where each class in a given partition contains all tuples sharing the same X value w.r.t. an FD F: X -> Y. Classes can be further refined so that each class shares both the X and Y values. Constraint repair consists of enumerating candidate attribute partitions and computing their VI scores (a technique for comparing clusterings) against the ground-truth partition (the original partition based on F, since F is a validated application constraint). The attribute partition with the lowest VI against the ground truth is selected as the constraint repair.
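
The VI score can be computed directly from two partitions of the same tuple set; below is the standard formulation of the measure (VI = H(A|B) + H(B|A)), not necessarily the paper's exact implementation. Identical partitions score 0, which is why the lowest-VI candidate wins:

```python
from math import log

def vi(partition_a, partition_b):
    """Variation of information between two partitions (lists of classes)."""
    n = sum(len(c) for c in partition_a)
    score = 0.0
    for a in partition_a:
        for b in partition_b:
            overlap = len(set(a) & set(b))
            if overlap:
                p_ab = overlap / n
                # -p_ab * [log P(a|b-ish terms)] summed over class pairs
                score -= p_ab * (log(overlap / len(a)) + log(overlap / len(b)))
    return score

ground_truth = [[1, 2, 3], [4, 5]]       # partition induced by F: X -> Y
candidate    = [[1, 2], [3], [4, 5]]     # a candidate attribute partition
assert vi(ground_truth, ground_truth) == 0.0
assert vi(ground_truth, candidate) > 0.0
```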

Combining the data and constraint repair techniques results in the unified cost model. The model is guided by three properties (accuracy, redundancy and conciseness), which are encapsulated in a DL (description length) score. The implementation of the cost model is the MDL (minimum description length) repair algorithm. The repair algorithm first orders all constraints and then loops over them; within the loop, the cost of data repair is calculated first, then the cost of constraint repair, and the repair type with the lower cost is recommended. Extensive experiments indicate that the unified model scales well (for the tested parameters) and also produces high-quality repairs (based on the Cora dataset).
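
The decision loop above can be sketched schematically (function names and costs are placeholders, not the paper's API):

```python
# For each constraint, compare the description-length cost of repairing
# the data against the cost of repairing the constraint itself, and
# recommend whichever repair is cheaper.

def unified_repair(constraints, data_repair_cost, constraint_repair_cost):
    recommendations = []
    for fd in sorted(constraints):          # order the constraints first
        dc = data_repair_cost(fd)           # DL cost of fixing the tuples
        cc = constraint_repair_cost(fd)     # DL cost of modifying the FD
        recommendations.append((fd, "data" if dc <= cc else "constraint"))
    return recommendations

# Toy costs: "A->B" is cheap to satisfy by editing tuples, while
# "C->D" is cheaper to modify than to enforce.
recs = unified_repair(["A->B", "C->D"],
                      data_repair_cost={"A->B": 2, "C->D": 9}.get,
                      constraint_repair_cost={"A->B": 7, "C->D": 3}.get)
assert recs == [("A->B", "data"), ("C->D", "constraint")]
```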

Tuesday, 22 October 2013

Presentation : PRIMES in P (AKS Algorithm for primality testing)

Here is a short presentation I did for one of my courses, Computational Complexity.

Paper summary : Guided Data Repair

It is important to solicit user feedback in the data cleaning process, since fully automatic updates can be risky for critical data. However, automating the data repair process is also necessary to reduce the cost of cleaning data. The authors present GDR, a framework that provides the best of both worlds by automatically suggesting updates while also involving users in the cleaning process. The machine learning component of GDR can also take over the task of deciding the correctness of updates once users are confident in the returned suggestions.

The GDR repair process is roughly divided into a sequence of steps. In step 1, all dirty tuples (w.r.t. CFDs) are identified and a repair algorithm is used to generate candidate updates. In step 2, candidates are grouped; grouping provides users with contextual information when deciding on updates and also prevents the ranking algorithm from overfitting. In step 3, the groups are ranked, based on the concept of VOI (value of information) and an active learning approach. VOI (assuming updates within a group are independent) relies on a loss function to approximate DB quality if a particular candidate is picked. Active learning relies on a classification model (using random forests) to make predictions (confirm, reject or retain) about the correctness of a candidate update. Values selected by the user yield labeled data, which is used to retrain the classifier; users can choose to let the classifier make the candidate decisions if desired. In step 4, user- (or classifier-) selected updates are applied to the DB. In step 5, the consistency manager regenerates candidate updates for the next set of dirty tuples; regeneration can also be triggered when new tuples are added (e.g., via DB triggers). Steps 3-5 are repeated until the DB is clean w.r.t. the constraints.
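
The ranking idea in step 3 can be sketched as follows. The scores, group names, and the particular gain function are illustrative assumptions; the paper's actual loss function is more involved:

```python
# VOI-style ranking sketch: each group of candidate updates is scored by
# an expected quality gain, assuming updates within a group are
# independent, and groups are presented to the user highest-score first.

def voi_score(updates):
    """Expected gain = sum over updates of
    P(update is correct) * number of violations it would resolve."""
    return sum(p_correct * violations_fixed
               for p_correct, violations_fixed in updates)

def rank_groups(groups):
    return sorted(groups, key=lambda name: voi_score(groups[name]),
                  reverse=True)

groups = {
    "zip-code fixes":  [(0.9, 3), (0.8, 2)],   # score 0.9*3 + 0.8*2 = 4.3
    "city-name fixes": [(0.6, 1), (0.5, 1)],   # score 1.1
}
ordered = rank_groups(groups)
assert ordered[0] == "zip-code fixes"   # highest expected gain first
```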

Experiments suggest that the ranking mechanism is fast on the tested datasets. When tested in isolation, the performance of the various VOI ranking algorithms depends on the input, and the same holds when the active learning model is used alone. However, combining VOI with the active learning model (i.e., GDR ranking) is very effective.

Paper summary : A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification

Repairing dirty data is a crucial process, since dirty data usually leads to various forms of economic loss. Most data repair efforts focus on tuple insertions and deletions, but these can result in the loss of vital information, e.g., w.r.t. INDs. The authors instead incorporate a mechanism of tuple modification and insertion (w.r.t. FDs and INDs) using a novel cost model. Since finding a minimum-cost repair is NP-complete (in the size of the database), two greedy heuristics (which operate on equivalence classes) are proposed and optimized.

The proposed algorithms work on equivalence (EQ) classes. An EQ class consists of (tuple, attribute) pairs; pairs are grouped into the same class when their tuples agree on the antecedent attributes of a constraint. Given these classes, the eventual target repair for the tuples in each class depends on the cost model. Specifically, the repair cost of an EQ class is the sum of the repair costs of its member tuples (each being the accuracy value of the tuple times the distance between the original and the candidate repair value). EQ classes covered by the same constraint eventually have to be merged.
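
The class-level cost can be sketched directly from that definition (the 0/1 distance and the way candidate targets are enumerated are my simplifications):

```python
# Cost of repairing an EQ class to a target value v: the sum, over its
# (accuracy-weighted) cells, of weight * distance(current value, v).

def dist(a, b):
    return 0 if a == b else 1     # toy distance: exact match or not

def repair_cost(cells, target):
    """cells: list of (accuracy_weight, current_value) pairs."""
    return sum(w * dist(value, target) for w, value in cells)

def best_repair(cells):
    """Try each existing value in the class as the candidate target."""
    candidates = {value for _, value in cells}
    return min(candidates, key=lambda v: repair_cost(cells, v))

eq_class = [(1.0, "NYC"), (0.9, "NYC"), (0.3, "New York")]
# Repairing to "NYC" costs 0.3 (only the low-accuracy cell changes),
# versus 1.9 for "New York", so "NYC" is chosen.
assert best_repair(eq_class) == "NYC"
```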

The main repair driver is a procedure called GEN-REPAIR. Within GEN-REPAIR, an “unresolved” repair queue is initialized, maintained, and looped over until it is empty, at which point GEN-REPAIR terminates. Within this loop, a PICKNEXT function returns one or more tuples to be resolved w.r.t. a constraint. PICKGREEDY is an implementation of PICKNEXT that returns either (i) a new or existing tuple w.r.t. an IND or (ii) a tuple and its EQ class w.r.t. an FD, whichever has the lower resolution cost, i.e., the “greedy choice” property; GEN-REPAIR thereby becomes a greedy algorithm. Another implementation of PICKNEXT gives unconditional preference to tuples w.r.t. (the more rigid) FDs, resulting in the fd-first greedy algorithm. Both flavors are further optimized into heuristics (e.g., occasionally avoiding the greedy choice to escape local optima). Experiments suggest that the fd-first heuristic performs the better of the two.
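
The difference between the two PICKNEXT flavors is easy to see in a sketch (the queue encoding and cost numbers are illustrative):

```python
# Each unresolved item carries a constraint kind ("FD" or "IND") and an
# estimated resolution cost.

def pick_greedy(unresolved):
    """Plain greedy choice: least resolution cost wins, whatever the kind."""
    return min(unresolved, key=lambda item: item["cost"])

def pick_fd_first(unresolved):
    """fd-first: FD violations take unconditional precedence; cost only
    decides among items of the preferred kind."""
    return min(unresolved,
               key=lambda item: (item["kind"] != "FD", item["cost"]))

queue = [{"kind": "IND", "cost": 1.0}, {"kind": "FD", "cost": 2.5}]
assert pick_greedy(queue)["kind"] == "IND"    # cheapest overall
assert pick_fd_first(queue)["kind"] == "FD"   # FDs preferred regardless
```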

Paper : A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification; Bohannon et al., SIGMOD 2005

Monday, 7 October 2013

Paper summary : LLUNATIC

Data cleaning consists of removing inconsistencies from a database, given a set of constraints. It is a crucial process, since unclean databases result in economic loss and poor decision making. Currently, no general repair algorithms exist that handle a wide variety of constraints and different strategies for selecting preferred values. The authors present an implementation (called LLUNATIC) to address these problems. In addition, they propose a general framework for expressing constraints and repairs, and develop a minimum-cost model for computing scalable, high-quality repairs.

The presented framework is centered around the concept of a cleaning scenario (CS). Each CS encapsulates a source schema (a clean database), a target schema (a dirty database), a set of equality-generating dependencies (EGDs, used to specify types of constraints), and a set of partial orders (indicating preferred upgrades to the target database). “Lluns” are introduced: symbols used to express inconsistencies in the data that require human intervention to resolve. They are used along with cell groups to describe the semantics of repairs and lineage.

The crux of the data repair problem is finding solutions for a CS, which is done via a variant of the chase algorithm (one that always terminates). This algorithm scales well because it works on equivalence classes instead of individual tuples, uses scalable data structures (like delta databases), and relies on pruning (with the help of a cost manager). Experiments indicate a high potential for scalability, helped by the fact that LLUNATIC runs atop a DBMS. Moreover, certain cost managers are shown to produce high-quality repairs due to forward and backward chasing.
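
A toy sketch of a single chase step for an EGD derived from an FD X -> Y may help; this is my simplification, since LLUNATIC's chase actually operates on cell groups, partial orders, and lluns:

```python
# Whenever two tuples agree on X but disagree on Y, the EGD requires
# their Y-cells to be equated; here we simply collect the conflicting
# Y-cells into one equivalence class per shared X value.

def chase_step(tuples, x, y):
    """tuples: list of dicts. Returns conflicting Y-cell classes
    (as sets of (row index, value)) keyed by the shared X value."""
    classes = {}
    for i, t in enumerate(tuples):
        classes.setdefault(t[x], set()).add((i, t[y]))
    return {k: cells for k, cells in classes.items() if len(cells) > 1}

rows = [{"zip": "10001", "city": "NYC"},
        {"zip": "10001", "city": "New York"},
        {"zip": "90210", "city": "LA"}]
conflicts = chase_step(rows, "zip", "city")
# Rows 0 and 1 share zip "10001" but disagree on city, so their city
# cells form one equivalence class that must receive a single value
# (or a llun, if no preferred value exists).
```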