Data cleaning is a difficult problem. Existing ETL tools and data reengineering tools are not sophisticated enough to design data flow graphs (which specify data transformation steps) efficiently and effectively. These tools are usually closed or undocumented (with respect to the implementation of the transformation algorithms), are non-interactive, and have long wait times (hindering the stepwise refinement process crucial to data cleaning). This paper introduces the AJAX framework, which lets users specify and run data cleaning operations cleanly and efficiently.
The AJAX framework consists of a logical level and a physical level. The logical level covers the specification of data transformations in a declarative language, while the physical level covers the selection of specific algorithm implementations and their optimization. These implementations are written and registered within the AJAX library.
Five logical data cleaning operators are provided: mapping, view, matching, clustering and merging. The mapping operator takes a tuple and produces one or more tuples. The view operator behaves like a regular SQL query and can be used to represent many-to-one mappings. The matching operator computes a similarity distance between each pair of tuples in the Cartesian product of the two input relations. It is an important and sensitive operator, since its implementation greatly affects how duplicates are discovered (deduplication uses matching, clustering and merging). The clustering operator groups similar tuples together, depending on the clustering algorithm selected. The merging operator collapses each cluster using a defined aggregation function, e.g., selecting the longest value in the cluster as the collapsed value. Tuples that cannot be transformed by an operator raise exceptions, which can be corrected interactively and re-integrated into the data flow graph. Unlike the other operators, the clustering operator does not generate exceptions.
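The deduplication chain described above (matching, then clustering, then merging) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AJAX's declarative syntax or its library: the function names (match, cluster, merge) are hypothetical, plain Levenshtein distance stands in for the similarity measure, and transitive closure is assumed as the clustering algorithm. The merge step uses the "longest value" aggregator mentioned in the text.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match(left, right, max_dist):
    """Matching: score every pair of the Cartesian product, keep the close ones."""
    pairs = []
    for a, b in product(left, right):
        d = edit_distance(a, b)
        if d <= max_dist:
            pairs.append((a, b, d))
    return pairs


def cluster(values, pairs):
    """Clustering (assumed here): transitive closure of matches via union-find."""
    parent = {v: v for v in values}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b, _ in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for v in values:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def merge(clusters):
    """Merging: collapse each cluster to its longest value."""
    return [max(c, key=len) for c in clusters]


names = ["J. Smith", "John Smith", "Jon Smith", "A. Jones"]
pairs = match(names, names, max_dist=3)
clusters = cluster(names, pairs)
merged = merge(clusters)
print(merged)
```

With a threshold of 3, the three "Smith" variants end up in one cluster and collapse to the longest spelling, while "A. Jones" stays on its own. Changing the threshold or the clustering strategy changes which duplicates are found, which is why the matching operator is described as sensitive.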
Two optimized implementations of the matching operator are explored: the neighbourhood join algorithm (NJ) and the multi-pass neighbourhood algorithm (MPN). A naive matching algorithm computes the Cartesian product of large relations, which is expensive. NJ reduces this cost by filtering the inputs. Its distance filtering optimization relies on a function over the input tuples that allows a cheaper distance to be computed between them than the actual similarity distance (e.g., comparing only prefixes of the input values). The actual similarity is computed only for pairs that pass this filter, so NJ is efficient when few tuples pass it. The Damerau-Levenshtein edit distance is used as the distance function. MPN improves on the naive algorithm by limiting the number of tuple pairs compared and, unlike NJ, allows false dismissals. It consists of performing an outer union of the relations, selecting a key for each record, sorting the records by key, and comparing records that fall within a fixed-size window of each other. Multiple passes of the algorithm, each with a different key, can be carried out.
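The contrast between the two strategies can be illustrated with a small Python sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names are hypothetical, plain Levenshtein distance stands in for Damerau-Levenshtein, the cheap filter is the difference in string lengths (a lower bound on edit distance, so it never dismisses a true match), and the two sort keys (the string itself and its reverse) are arbitrary choices.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, used as a stand-in for Damerau-Levenshtein."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def as_pair(a, b):
    """Order-independent representation of a matched pair."""
    return tuple(sorted((a, b)))


def naive_pairs(records, max_dist):
    """Compare every pair of records."""
    return {as_pair(a, b) for a, b in combinations(records, 2)
            if edit_distance(a, b) <= max_dist}


def filtered_pairs(records, max_dist):
    """NJ-style distance filtering: cheap length test before the real distance."""
    matches, expensive_calls = set(), 0
    for a, b in combinations(records, 2):
        if abs(len(a) - len(b)) > max_dist:  # cannot be within max_dist
            continue
        expensive_calls += 1
        if edit_distance(a, b) <= max_dist:
            matches.add(as_pair(a, b))
    return matches, expensive_calls


def window_pairs(records, key, window, max_dist):
    """One MPN-style pass: sort by key, compare records inside a sliding window."""
    ordered = sorted(records, key=key)
    matches = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:i + window]:
            if edit_distance(a, b) <= max_dist:
                matches.add(as_pair(a, b))
    return matches


def multi_pass(records, keys, window, max_dist):
    """Union of several window passes, one per sort key."""
    matches = set()
    for key in keys:
        matches |= window_pairs(records, key, window, max_dist)
    return matches


records = ["John Smith", "Jon Smith", "Jones", "Karl Friedrich Marx", "Kohn Smith"]
exact = naive_pairs(records, max_dist=1)
filtered, calls = filtered_pairs(records, max_dist=1)
one_pass = window_pairs(records, key=lambda r: r, window=2, max_dist=1)
two_pass = multi_pass(records, [lambda r: r, lambda r: r[::-1]], window=2, max_dist=1)
print(calls, len(exact), len(one_pass), len(two_pass))
```

On this toy input the length filter returns the same matches as the naive comparison while running the expensive distance on 3 of the 10 pairs. A single window pass sorted on the string itself misses the pair ("John Smith", "Kohn Smith"), because a typo in the first character pushes the two records far apart in sort order; a second pass sorted on the reversed string recovers it. That is the false-dismissal trade-off the summary attributes to MPN.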
In the experiments reported, MPN runs faster than NJ but is less accurate. On the other hand, NJ reaches good recall faster in domains that are less structured.
Paper: Declarative Data Cleaning; Galhardas et al., Proc. VLDB, 2001