Tuesday, 22 October 2013

Paper summary: Guided Data Repair

It is important to solicit user feedback in the data cleaning process, since automatic updates can be risky for critical data. However, automating data repair is also necessary to reduce the cost of cleaning. The authors present GDR, a framework that aims for the best of both worlds by automatically suggesting updates while keeping users involved in the cleaning process. The machine learning model used by GDR can also take over the task of deciding whether updates are correct once users are confident in the suggestions it returns.

The GDR repair process is roughly divided into a sequence of steps. In step 1, all dirty tuples (w.r.t. the conditional functional dependencies, or CFDs) are identified and a repair algorithm is used to generate candidate updates. In step 2, the candidates are grouped. Grouping gives users contextual information when deciding on updates and also prevents the ranking algorithm from overfitting. In step 3, the groups are ranked using the concept of value of information (VOI) combined with an active learning approach. VOI (assuming updates within a group are independent) relies on a loss function to approximate the quality of the DB if a particular candidate is picked. Active learning relies on a classification model (random forests) to predict the right decision for a candidate update (confirm, reject or retain). The updates the user selects become labeled data, which is used to retrain the classifier. Users can choose to let the classifier make the decisions on candidates if necessary. In step 4, the updates selected by the user (or the classifier) are applied to the DB. In step 5, the consistency manager regenerates the candidate updates for the next set of dirty tuples. Regeneration can also be carried out when new tuples are added (e.g., using DB triggers). Steps 3-5 are repeated until the DB is clean w.r.t. the constraints.
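The loop described above can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions, not the authors' implementation: the relation, the `rules` mapping (standing in for constant CFDs of the form "zip determines city"), and the functions `candidate_updates`, `group_updates`, `rank_groups` and `apply_updates` are all hypothetical names. The ranking score is a simplified stand-in for VOI (the sum of estimated probabilities that each update in a group is correct), and `p_correct` stands in for the learned classifier.

```python
from collections import defaultdict

# Toy relation; the data is made up for illustration.
rows = [
    {"id": 1, "zip": "46360", "city": "Michigan City"},
    {"id": 2, "zip": "46360", "city": "Westville"},
    {"id": 3, "zip": "46391", "city": "Westvile"},
    {"id": 4, "zip": "46391", "city": "Westville"},
    {"id": 5, "zip": "46391", "city": "Michigan City"},
]

# Hypothetical constant rules: a given zip requires a given city.
rules = {"46360": "Michigan City", "46391": "Westville"}


def candidate_updates(rows, rules):
    """Step 1: find tuples that violate a rule and suggest a repair."""
    updates = []
    for row in rows:
        expected = rules.get(row["zip"])
        if expected is not None and row["city"] != expected:
            updates.append(
                {"id": row["id"], "attr": "city",
                 "old": row["city"], "new": expected}
            )
    return updates


def group_updates(updates):
    """Step 2: group candidates that suggest the same value for an attribute."""
    groups = defaultdict(list)
    for u in updates:
        groups[(u["attr"], u["new"])].append(u)
    return dict(groups)


def rank_groups(groups, p_correct):
    """Step 3: rank groups by a simplified benefit score (not the paper's VOI)."""
    def score(item):
        _, members = item
        return sum(p_correct(u) for u in members)
    return sorted(groups.items(), key=score, reverse=True)


def apply_updates(rows, updates):
    """Step 4: apply the confirmed updates in place."""
    by_id = {row["id"]: row for row in rows}
    for u in updates:
        by_id[u["id"]][u["attr"]] = u["new"]


updates = candidate_updates(rows, rules)
ranked = rank_groups(group_updates(updates), p_correct=lambda u: 0.5)
for (attr, value), members in ranked:
    print(attr, value, [u["id"] for u in members])
```

In a real system the user would confirm, reject or retain each suggestion in the top-ranked group, the classifier would be retrained on those answers, and `candidate_updates` would be rerun after each round of `apply_updates` (step 5) until no violations remain.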

Experiments suggest that the ranking mechanism is very quick on the dataset used. When tested in isolation, the performance of the various VOI ranking algorithms depends on the input. The same is true when the active learning model is used in isolation. However, combining VOI with the active learning model (i.e., GDR ranking) is very effective.

