Tuesday, 22 October 2013

Paper summary: Guided Data Repair

It is important to solicit user feedback in the data cleaning process, since automatic updates can be risky for critical data. However, automating the data repair process is also necessary to reduce the cost of cleaning data. The authors present GDR, a framework that provides the best of both worlds by automatically suggesting updates while also involving users in the cleaning process. The machine learning algorithm used by GDR can also take over the task of deciding the correctness of updates once users are confident in the returned suggestions.

The GDR repair process is roughly divided into a sequence of steps. In step 1, all dirty tuples (with respect to a set of conditional functional dependencies, or CFDs) are identified and a repair algorithm is used to generate candidate updates. In step 2, the candidates are grouped. Grouping provides users with contextual information when deciding on updates and also prevents the ranking algorithm from overfitting. In step 3, the groups are ranked using the concept of value of information (VOI) together with an active learning approach. VOI, computed under the assumption that updates within a group are independent, relies on a loss function to approximate the DB quality that would result if a particular candidate were picked. Active learning relies on a classification model (using random forests) to make predictions (confirm, reject or retain) about the correctness of a candidate update. Updated values selected by the user result in labeled data, which is used to retrain the classifier. Users can choose to let the classifier make the candidate decisions if necessary. In step 4, the updates selected by the user (or the classifier) are applied to the DB. In step 5, the consistency manager regenerates the candidate updates for the next set of dirty tuples. Regeneration can also be carried out when new tuples are added (e.g., using DB triggers). Steps 3-5 are repeated until the DB is clean with respect to the constraints.
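To make the loop concrete, here is a minimal Python sketch of steps 1-5. It is not the authors' implementation. The rule format, the sample zip/city rules, and the function names (candidate_updates, group_updates, rank_groups, repair) are all my own assumptions for illustration. In particular, rank_groups simply puts the largest group first as a crude stand-in for the VOI and active learning ranking, and ask_user is a callback standing in for the user (or the trained classifier).

```python
from collections import defaultdict

# Hypothetical constant CFDs: (lhs_attr, lhs_value, rhs_attr, rhs_value),
# read as "if lhs_attr == lhs_value then rhs_attr must be rhs_value".
RULES = [
    ("zip", "46360", "city", "Michigan City"),
    ("zip", "46825", "city", "Fort Wayne"),
]


def candidate_updates(rows, rules):
    """Step 1: find tuples that violate a rule and suggest one repair each."""
    updates = []
    for row_id, row in enumerate(rows):
        for lhs_attr, lhs_val, rhs_attr, rhs_val in rules:
            if row[lhs_attr] == lhs_val and row[rhs_attr] != rhs_val:
                updates.append((row_id, rhs_attr, rhs_val))
    return updates


def group_updates(updates):
    """Step 2: group candidates that set the same attribute to the same value."""
    groups = defaultdict(list)
    for row_id, attr, value in updates:
        groups[(attr, value)].append(row_id)
    return groups


def rank_groups(groups):
    """Step 3 (stand-in, NOT the paper's VOI score): largest group first."""
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


def repair(rows, rules, ask_user):
    """Steps 3-5: rank, ask, apply, then regenerate candidates and repeat."""
    rejected = set()  # updates the user turned down are not suggested again
    while True:
        pending = [u for u in candidate_updates(rows, rules) if u not in rejected]
        if not pending:
            return rows
        (attr, value), row_ids = rank_groups(group_updates(pending))[0]
        for row_id in row_ids:
            if ask_user(rows[row_id], attr, value):
                rows[row_id][attr] = value  # step 4: apply confirmed update
            else:
                rejected.add((row_id, attr, value))


if __name__ == "__main__":
    rows = [
        {"zip": "46360", "city": "Westville"},
        {"zip": "46360", "city": "Michigan Cty"},
        {"zip": "46825", "city": "FT Wayne"},
        {"zip": "46391", "city": "Westville"},
    ]
    # A simulated user who confirms every suggestion.
    for row in repair(rows, RULES, lambda row, attr, value: True):
        print(row)
```

Each pass through the loop either applies or rejects every update in the top-ranked group, so with non-conflicting rules like the ones above the set of pending candidates shrinks until none remain. A real system would replace rank_groups with a benefit estimate and feed the user's answers back into a classifier, as the paper describes.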

Experiments suggest that the ranking mechanism is very quick on the dataset used. When tested in isolation, the performance of the various VOI ranking algorithms depends on the input. The same is true when only the active learning model is used. However, combining VOI with the active learning model (i.e., GDR ranking) is very effective.

