Tuesday 22 October 2013

Paper summary: Guided Data Repair

It is important to solicit user feedback in the data cleaning process, since automatic updates can be risky for critical data. However, automating data repair is also necessary to reduce the cost of cleaning. The authors present GDR, a framework that provides the best of both worlds by automatically suggesting updates while also involving users in the cleaning process. The machine learning algorithm used by GDR can also take over the task of deciding the correctness of updates once users are confident in its suggestions.

The GDR repair process is roughly divided into five steps. In step 1, all dirty tuples (those violating the conditional functional dependencies, or CFDs, defined on the data) are identified, and a repair algorithm generates candidate updates. In step 2, the candidates are grouped. Grouping gives users contextual information when deciding on updates and also prevents the ranking algorithm from overfitting. In step 3, the groups are ranked using the concept of value of information (VOI) together with an active learning approach. VOI (assuming updates within a group are independent) relies on a loss function to approximate DB quality if a particular candidate is picked. Active learning relies on a classification model (using random forests) to make predictions (confirm, reject or retain) about the correctness of a candidate update. Updates selected by the user yield labeled data, which is used to retrain the classifier. Users can choose to let the classifier make the decisions on candidates if necessary. In step 4, the updates selected by the user (or the classifier) are applied to the DB. In step 5, the consistency manager regenerates candidate updates for the next set of dirty tuples. Regeneration can also be carried out when new tuples are added (e.g., using DB triggers). Steps 3-5 are repeated until the DB is clean w.r.t. the constraints.
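To make steps 1-3 more concrete, here is a minimal Python sketch. It is not the paper's implementation: the CFD format, the function names, the sample rows and the `confidence` callback are all assumptions made for illustration. It only handles constant CFDs and only repairs the right-hand side, and the ranking score (sum of per-update confidence within a group) is a simple stand-in for the VOI loss function, not the real thing.

```python
from collections import defaultdict

# Hypothetical constant CFDs: "if zip has this value, city must have that value".
CFDS = [
    {"if": {"zip": "46360"}, "then": ("city", "Michigan City")},
    {"if": {"zip": "46825"}, "then": ("city", "Fort Wayne")},
]

def candidate_updates(rows, cfds):
    """Step 1 (simplified): find tuples violating a CFD and suggest a repair."""
    updates = []
    for row_id, row in enumerate(rows):
        for cfd in cfds:
            attr, expected = cfd["then"]
            matches = all(row.get(a) == v for a, v in cfd["if"].items())
            if matches and row.get(attr) != expected:
                updates.append({"row": row_id, "attr": attr,
                                "old": row.get(attr), "new": expected})
    return updates

def group_updates(updates):
    """Step 2 (simplified): group candidates that set the same attribute to the same value."""
    groups = defaultdict(list)
    for update in updates:
        groups[(update["attr"], update["new"])].append(update)
    return dict(groups)

def rank_groups(groups, confidence):
    """Step 3 (stand-in for VOI): order groups by summed confidence of their updates."""
    def benefit(item):
        _, members = item
        return sum(confidence(u) for u in members)
    return sorted(groups.items(), key=benefit, reverse=True)

# Made-up sample data: three of the four rows violate a CFD.
rows = [
    {"zip": "46360", "city": "Westville"},
    {"zip": "46360", "city": "Michigan Cty"},
    {"zip": "46825", "city": "FT Wayne"},
    {"zip": "46825", "city": "Fort Wayne"},
]

updates = candidate_updates(rows, CFDS)
groups = group_updates(updates)
# A real system would take confidence from the learned classifier;
# here every update gets the same assumed probability of being correct.
ranked = rank_groups(groups, confidence=lambda update: 0.5)

for (attr, value), members in ranked:
    print(f"set {attr} = {value!r}: {len(members)} candidate update(s)")
```

With a constant confidence, the larger group is shown to the user first. In GDR the confidence would come from the retrained classifier, so the order changes as the user gives feedback.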

Experiments suggest that the ranking mechanism is very quick, given the dataset. When tested in isolation, the performance of the various VOI ranking algorithms depends on the input. The same is true when the active learning model is used on its own. However, combining VOI with the active learning model (i.e., GDR ranking) is very effective.

