Data cleaning is an important process since dirty data essentially reduces the ability to extract useful information from the sources. Typically, data cleaning involves data auditing (to discover discrepancies), and data transformations (which are chosen and applied to fix the discrepancies). This process is iterative, but usually non-interactive, and with long wait times. This paper proposes a solution- Potter’s Wheel (PW), an interactive tool that integrates discrepancy detection and offers interactive transformation tools to clean data.
The core of PW’s architecture consists of an online reorderer, a transformation engine, and an automatic discrepancy detector (ADD). The online reorderer is a spreadsheet-like interface which supports sorting, scrolling, and displays streams of data dynamically. The transformation engine applies specified transforms immediately on the interface and to discovered discrepancies. The ADD runs in the background and highlights discrepancies to users automatically.
Discrepancy detection is handled in the background (on the latest transformed view) using the Minimum Description Length (MDL) metric in order to infer suitable structures for values in a column. These structures are a sequence of domains (which can defined by users). Records which fit the inferred structure poorly are flagged by the discrepancy detector.
The structure inference algorithm works by enumerating a fixed number of structures, and picking the best one. Enumeration of candidates is done recursively, utilizing prefixes of the strings (values of the domain), making sure that redundant candidates are avoided. Next, each enumerated candidate is used to compute a distance length (DL) measure for all the values in a particular column. The mathematical formula used to compute DL encapsulates three favourable criteria for good structures- recall, precision and conciseness. The structure which results in the lowest DL is thus selected as the best approximate structure for a column, and is used to detect poorly matching values i.e., discrepancies. Structures can be parameterized, improving detection ability e.g., computing statistical measures on the parameters to detect anomalies.
Next, transforms are applied to discrepancies. Transforms can be applied to rows, to all values of the column, and to schemas. PW provides a number of transforms which are easy to specify graphically (e.g., add, drop, copy, fold, etc). These transforms can be undone too. Other transforms, like splitting on values, are difficult to manifest visually.
Splitting is handled by performing them on example values. The MDL metric is used to infer structures about those example values. Once structures have been inferred, splitting follows. A number of splitting algorithms can be applied. The paper proposes the Dec-Specificity algorithm as a suitable candidate.
Paper : Potter’s Wheel: An Interactive Data Cleaning System; Raman et. Al., Proc. VLDB, 2001