Data cleaning consists of removing inconsistencies from a database, given a set of constraints. It is a crucial process, since unclean databases result in economic loss and poor decision making. Currently, no general repair algorithms exist which handle a wide variety of constraints and different strategies to select preferred values. The authors present an implementation (called LLUNATIC) to solve these problems. In addition, they propose a general framework for expressing constraints and repairs, and also develop a minimum cost model for computing scalable, and high quality repairs.
The presented framework is essentially centered around the concept of a cleaning scenario (CS). Each CS encapsulates- a source schema (a clean database), a target schema (a dirty database), a set of equality generating dependencies (EGDs; used to specify types of constraints), and a set of partial orders (indicates preferred upgrades to the target database). “lluns” are introduced, which are sets of symbols used to express inconsistencies in the data that require human intervention to resolve.They are used along with cell groups to describe the semantics of repairs and lineage.
The crux of the data repair problem is to find solutions for CS’, which is done via a variant of the chase algorithm (which always terminates). This algorithm scales well because it works on equivalence classes instead of tuples, uses scalable datastructures (like delta databases) and relies on pruning (with the help of a cost manager). Experiments indicate a high potential for scalability, coupled with the fact that LLUNATIC runs atop a DBMS. Moreover, certain cost managers are shown to produce high quality repairs due to forward and backward chasing.
The presented framework is essentially centered around the concept of a cleaning scenario (CS). Each CS encapsulates- a source schema (a clean database), a target schema (a dirty database), a set of equality generating dependencies (EGDs; used to specify types of constraints), and a set of partial orders (indicates preferred upgrades to the target database). “lluns” are introduced, which are sets of symbols used to express inconsistencies in the data that require human intervention to resolve.They are used along with cell groups to describe the semantics of repairs and lineage.
The crux of the data repair problem is to find solutions for CS’, which is done via a variant of the chase algorithm (which always terminates). This algorithm scales well because it works on equivalence classes instead of tuples, uses scalable datastructures (like delta databases) and relies on pruning (with the help of a cost manager). Experiments indicate a high potential for scalability, coupled with the fact that LLUNATIC runs atop a DBMS. Moreover, certain cost managers are shown to produce high quality repairs due to forward and backward chasing.
No comments:
Post a Comment