There is a strong need for data cleaning tools that automatically detect and remove costly data inconsistencies. However, present tools, such as ETL tools, require significant manual effort. Other tools focus on constraint repairs, often relying on FDs (functional dependencies) to flag inconsistencies in the data as violations of constraints. FDs, however, capture schema design but not the semantic information present in the data. To this end, the paper outlines a framework for modelling semantic information using CFDs (conditional functional dependencies). In addition, the authors describe techniques for reasoning about CFDs, develop SQL techniques for detecting CFD violations, and perform an experimental study of their system's performance.
Given a framework for modelling CFDs, consistency and implication analysis are the two processes required to obtain a minimal cover of a set of CFDs. Since the cost of checking and repairing CFDs depends on the size of the CFD set, a minimal cover lowers validation and repair costs. Once the set of CFDs is found, it is used to detect inconsistencies in the database: the pattern tableaux belonging to the CFDs are merged, converted into a single pair of SQL queries, and the queries are run. Data repair follows, which is distinct from repairing the CFDs themselves (not described in the paper).
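The detection step can be illustrated with a minimal sketch. It assumes a hypothetical `cust` relation and two made-up CFDs, each with a one-row pattern tableau stored as an ordinary table in which `'_'` is the wildcard. One query looks for single tuples that contradict a constant in the pattern, and the other looks for groups of tuples that agree on the left-hand side but disagree on the right-hand side. The table names, column names and data are assumptions for illustration, and the sketch runs one query per CFD rather than reproducing the paper's merged tableaux.

```python
import sqlite3

# Hypothetical customer rows: country code, area code, city, zip, street.
ROWS = [
    ("01", "908", "MH", "07974", "Mtn Ave"),
    ("01", "908", "NYC", "07974", "Mtn Ave"),      # city contradicts the constant 'MH'
    ("44", "131", "EDI", "EH4 1DT", "High St"),
    ("44", "131", "EDI", "EH4 1DT", "Crichton"),   # same UK zip, different street
    ("01", "212", "NYC", "10012", "Broadway"),
    ("01", "212", "NYC", "10012", "Houston St"),   # allowed: the zip rule is conditioned on cc = 44
]


def find_violations(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cust (cc TEXT, ac TEXT, ct TEXT, zip TEXT, str TEXT)")
    conn.executemany("INSERT INTO cust VALUES (?, ?, ?, ?, ?)", rows)

    # Assumed CFD 1: [cc, ac] -> [ct] with pattern (01, 908 || MH).
    conn.execute("CREATE TABLE tp1 (cc TEXT, ac TEXT, ct TEXT)")
    conn.execute("INSERT INTO tp1 VALUES ('01', '908', 'MH')")

    # Assumed CFD 2: [cc, zip] -> [str] with pattern (44, _ || _).
    conn.execute("CREATE TABLE tp2 (cc TEXT, zip TEXT, str TEXT)")
    conn.execute("INSERT INTO tp2 VALUES ('44', '_', '_')")

    # Single-tuple violations: the tuple matches the pattern's left-hand side
    # but differs from a constant on the right-hand side.
    single = conn.execute(
        """
        SELECT t.cc, t.ac, t.ct
        FROM cust t, tp1 p
        WHERE (p.cc = '_' OR t.cc = p.cc)
          AND (p.ac = '_' OR t.ac = p.ac)
          AND p.ct <> '_' AND t.ct <> p.ct
        """
    ).fetchall()

    # Multi-tuple violations: tuples matching the pattern agree on the
    # left-hand side yet carry more than one right-hand-side value.
    multi = conn.execute(
        """
        SELECT t.cc, t.zip
        FROM cust t, tp2 p
        WHERE (p.cc = '_' OR t.cc = p.cc)
          AND (p.zip = '_' OR t.zip = p.zip)
          AND p.str = '_'
        GROUP BY t.cc, t.zip
        HAVING COUNT(DISTINCT t.str) > 1
        """
    ).fetchall()

    conn.close()
    return single, multi


if __name__ == "__main__":
    single, multi = find_violations(ROWS)
    print("single-tuple violations:", single)
    print("multi-tuple violations:", multi)
```

Because the patterns live in tables rather than in the query text, adding a pattern row to a tableau does not change the SQL, which is the property that makes a fixed pair of queries feasible.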
The authors performed experiments on real-world data. In terms of scalability, they found that performance depends more on the size of the relation and the number of attributes in the pattern tableau than on the number of tuples or the noise introduced. In terms of detecting inconsistencies, they found that the implementation of the merge step and of the SQL generation step (how the "where" clauses are evaluated) influences the performance of the system.
Paper: Conditional Functional Dependencies for Data Cleaning; Bohannon et al., ICDE 2007