Monday 30 September 2013

Paper summary : Conditional Functional Dependencies for Data Cleaning

There is a strong need for data cleaning tools to automatically detect and remove costly data inconsistencies. However, present tools, such as ETL tools, require significant manual effort. Other tools focus on constraint repair, often relying on FDs (functional dependencies) to flag inconsistencies in the data as violations of constraints. However, FDs were designed to capture schema design, not the semantic information present in the data. To this end, the paper outlines a framework for modelling semantic information using CFDs (Conditional Functional Dependencies). In addition, the authors describe techniques for reasoning about CFDs, develop SQL techniques for detecting CFD violations, and perform an experimental study of the performance of their system.

Given a framework for modelling CFDs, consistency and implication analysis are two important processes required to obtain a minimal cover of a set of CFDs. Since the cost of checking and repairing CFDs depends on the size of the CFD set, a minimal cover reduces validation and repair costs. Once the set of CFDs is found, it is used to detect inconsistencies in the database: the pattern tableaux of all the CFDs are merged, converted into a single pair of SQL queries, and the queries are run against the relation. Data repair follows, which is distinct from repairing the CFDs themselves (and is not described in the paper).
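As a rough sketch of what the detection queries amount to (this is not the paper's exact encoding, which merges all pattern tableaux and generates the query pair automatically), consider a single CFD ([cc, zip] → [str], Tp) over a hypothetical cust table, with the pattern tableau stored in a table tp and '_' used as the wildcard:

```sql
-- Single-tuple violations: the tuple matches a pattern's LHS,
-- but its RHS value disagrees with a constant in the pattern.
SELECT c.*
FROM   cust c, tp p
WHERE  (p.cc  = '_' OR c.cc  = p.cc)
  AND  (p.zip = '_' OR c.zip = p.zip)
  AND  p.str <> '_' AND c.str <> p.str;

-- Multi-tuple violations: tuples that match a wildcard RHS pattern,
-- agree on the LHS attributes, but disagree on the RHS attribute.
SELECT c.cc, c.zip
FROM   cust c, tp p
WHERE  (p.cc  = '_' OR c.cc  = p.cc)
  AND  (p.zip = '_' OR c.zip = p.zip)
  AND  p.str = '_'
GROUP  BY c.cc, c.zip
HAVING COUNT(DISTINCT c.str) > 1;
```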

The authors performed experiments on real-world data. In terms of scalability, they found that performance depends mainly on the size of the relation and the number of attributes in the pattern tableau, rather than on the number of tuples in the pattern tableau or the amount of noise introduced. In terms of detecting inconsistencies, they found that the implementation of the merge step and of the SQL generation step (how the "where" clauses are evaluated) influences the performance of the system.

Paper : Conditional Functional Dependencies for Data Cleaning; Bohannon et al., ICDE 2007

Paper summary : Mining Association Rules between Sets of Items in Large DBs

Supermarkets need to make important business decisions every day (e.g., what to put on sale, how to design coupons, etc.), but databases do not provide functionality to derive association rules from customer transaction information. This paper presents an algorithm to mine such rules from real-world datasets.

Given a set of antecedent items and one consequent item, the algorithm finds rules that meet a specified minimum transaction support (the percentage of transactions that contain the union of the antecedent and the consequent) and a specified minimum confidence (the percentage of transactions satisfying the antecedent that also satisfy the consequent). This is done by finding large itemsets (itemsets with at least the minimum transaction support) and generating rules from them.
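In symbols (my notation, not the paper's), for a rule X ⇒ Y over a set of transactions D:

```latex
\mathrm{support}(X \Rightarrow Y) = \frac{|\{\, T \in D : X \cup Y \subseteq T \,\}|}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
```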

Finding large itemsets is expensive; hence, estimation procedures, pruning, and memory management are essential considerations. The estimation procedure determines which itemsets should be measured in a given pass, and the algorithm seeks to balance the number of passes over the data against the number of itemsets measured per pass. The pruning techniques discard certain itemsets, as soon as they are generated, to avoid wasting computational effort on them. Memory management handles the situation in which the itemsets generated in a pass do not all fit in memory; by spooling data to disk, the memory management scheme guarantees that the algorithm terminates.
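This is not the paper's in-memory, multi-pass algorithm, but as a sketch of what "measuring the support of candidate itemsets" means, counting 2-itemsets with at least 1% support over a hypothetical transactions(tid, item) table could look like this:

```sql
-- Hypothetical schema: transactions(tid, item).
-- Find all 2-itemsets whose support count is at least 1% of all transactions.
SELECT t1.item AS item_a,
       t2.item AS item_b,
       COUNT(*) AS support_count
FROM   transactions t1
JOIN   transactions t2
  ON   t1.tid = t2.tid
 AND   t1.item < t2.item            -- avoid duplicate / reversed pairs
GROUP  BY t1.item, t2.item
HAVING COUNT(*) >= (SELECT 0.01 * COUNT(DISTINCT tid) FROM transactions);
```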

The algorithm produces excellent results on the provided dataset. The estimation procedures were accurate and the pruning procedures worked well.

Paper : Mining Association Rules between Sets of Items in Large DBs; Agrawal et al., ACM SIGMOD 1993

Thursday 26 September 2013

Paper Summary : AJAX

Data cleaning is a difficult problem. Existing ETL tools and data reengineering tools are not sophisticated enough to let users design data flow graphs (which specify data transformation steps) efficiently and effectively. These tools are usually closed or undocumented (with respect to the implementation of the transformation algorithms), non-interactive, and have long wait times (hindering the stepwise refinement process crucial to data cleaning). This paper introduces the AJAX framework, enabling users to perform data cleaning operations cleanly and efficiently.

The AJAX framework consists of a logical level and a physical level. The logical level covers the specification of data transformations in a declarative language, while the physical level covers the selection and optimization of the specific algorithm implementations. These implementations are written and registered in the AJAX library.

Five logical data cleaning operations are provided: mapping, view, matching, clustering and merging. The mapping operator takes in a tuple and produces one or more tuples. The view operator behaves like a regular SQL query and can be used to represent many-to-one mappings. The matching operator computes a similarity distance measure between each tuple pair (in the Cartesian product of the two input relations). This is quite an important and sensitive operator, as its implementation greatly affects how duplicates are discovered (deduplication uses matching, clustering and merging). The clustering operator groups similar tuples together, depending on the clustering algorithm selected. The merging operator collapses each cluster based on a defined aggregator function (e.g., selecting the longest value in the cluster as the collapsed value). Tuples which cannot be transformed by these operators produce exceptions, which can be corrected interactively and re-integrated into the data flow graph. Unlike the other operators, the clustering operator does not generate exceptions.
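To make the merging step concrete, here is a sketch in plain SQL rather than AJAX's own declarative language (the clustered table and the "keep the longest value" aggregation are made up for illustration):

```sql
-- Hypothetical: clustered(cluster_id, name) holds the output of the
-- clustering operator. Collapse each cluster to a single representative
-- by keeping the longest name in the cluster.
SELECT cluster_id,
       (array_agg(name ORDER BY length(name) DESC))[1] AS merged_name
FROM   clustered
GROUP  BY cluster_id;
```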

Two optimized implementations of the matching operator are explored: the neighbourhood join algorithm (NJ) and the multi-pass neighbourhood algorithm (MPN). A naive matching algorithm would compute the Cartesian product of large relations, which is expensive. NJ optimizes this by applying filters to the inputs. The distance-filtering optimization (implemented by NJ) devises a cheaper similarity function over the input tuples (e.g., one comparing only prefixes of the input values); the actual, more expensive similarity distance is computed only for pairs that pass this cheap filter. NJ is efficient when few tuple pairs pass the filter. The Damerau-Levenshtein algorithm is used as the distance function. MPN improves on the naive algorithm by limiting the number of comparisons and, unlike NJ, allows false dismissals. It consists of performing an outer join on the relations, computing a key for each record, sorting the records on the key, and comparing records that fall close to each other within a fixed-size window. Multiple passes of the algorithm can be carried out.
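To illustrate distance filtering (a sketch in plain SQL, not AJAX's implementation; the table names are hypothetical, and plain levenshtein from PostgreSQL's fuzzystrmatch extension stands in for the Damerau-Levenshtein distance used in the paper):

```sql
-- Requires: CREATE EXTENSION fuzzystrmatch;
-- Cheap filter first: two names within edit distance 3 can differ in length
-- by at most 3, so this filter causes no false dismissals. The expensive
-- edit distance is computed only for pairs that pass it.
SELECT a.id AS id_a, b.id AS id_b,
       levenshtein(a.name, b.name) AS dist
FROM   authors_a a
JOIN   authors_b b
  ON   abs(length(a.name) - length(b.name)) <= 3   -- cheap filter
WHERE  levenshtein(a.name, b.name) <= 3;           -- actual distance check
```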

In the experiments carried out, MPN is faster than NJ but less accurate. On the other hand, NJ reaches a good recall faster on domains that are more unstructured.

Paper : Declarative Data Cleaning; Galhardas et al., Proc. VLDB, 2001

Paper Summary : Potter’s Wheel

Data cleaning is an important process since dirty data reduces the ability to extract useful information from the sources. Typically, data cleaning involves data auditing (to discover discrepancies) and data transformations (chosen and applied to fix the discrepancies). The process is iterative, but existing tools make it non-interactive and impose long wait times. This paper proposes a solution: Potter’s Wheel (PW), an interactive tool that integrates discrepancy detection and offers interactive transformation facilities to clean data.

The core of PW’s architecture consists of an online reorderer, a transformation engine, and an automatic discrepancy detector (ADD). The online reorderer is a spreadsheet-like interface that supports sorting and scrolling and displays streaming data dynamically. The transformation engine applies specified transforms immediately, both on the interface and to discovered discrepancies. The ADD runs in the background and automatically highlights discrepancies to users.

Discrepancy detection is handled in the background (on the latest transformed view) using the Minimum Description Length (MDL) principle to infer suitable structures for the values in a column. These structures are sequences of domains (which can be defined by users). Records that fit the inferred structure poorly are flagged by the discrepancy detector.

The structure inference algorithm works by enumerating a fixed number of candidate structures and picking the best one. Enumeration is done recursively over prefixes of the strings (the values in the column), making sure that redundant candidates are avoided. Next, each enumerated candidate is used to compute a description length (DL) for all the values in the column. The formula used to compute DL encapsulates three desirable criteria for good structures: recall, precision and conciseness. The structure with the lowest DL is selected as the best approximate structure for the column and is used to detect poorly matching values, i.e., discrepancies. Structures can also be parameterized, which improves detection ability (e.g., statistical measures computed on the parameters can be used to detect anomalies).
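The general MDL idea behind this (stated generically here; the paper's actual formula is more detailed) is that the best structure S for a column minimizes the cost of describing the structure plus the cost of describing each value given the structure:

```latex
DL(S, v_1, \ldots, v_n) \;=\; DL(S) \;+\; \sum_{i=1}^{n} DL(v_i \mid S)
```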

Next, transforms are applied to fix discrepancies. Transforms can be applied to rows, to all values of a column, and to schemas. PW provides a number of transforms that are easy to specify graphically (e.g., add, drop, copy, fold, etc.), and these transforms can be undone. Other transforms, like splitting on values, are difficult to specify visually.

Splitting is handled by having the user perform splits on example values. The MDL metric is used to infer structures for those example values; once the structures have been inferred, the split follows. A number of splitting algorithms can be applied; the paper proposes the Dec-Specificity algorithm as a suitable candidate.

Paper : Potter’s Wheel: An Interactive Data Cleaning System; Raman et al., Proc. VLDB, 2001

Tuesday 10 September 2013

PostgreSQL 9.3 - A few interesting features

PostgreSQL 9.3 was released yesterday, and as a big fan of the DBMS, I thought it apt to highlight some of its more interesting new features. Here's a link to its manual.

My favourite feature is the better support for the JSON data type. With the rapid adoption of node.js and the newfound "MEAN" acronym (mongoDB, express, angular.js, node.js), it seems like mongoDB is at the forefront of the "hip" startup stack. They should probably remove the "M" though. Given postgres hstore combined with better JSON support, async drivers, PL/V8 (not part of core or the contrib modules; it needs to be installed separately and lets you write SQL functions in javascript, which you can later call from SQL), and node ORM libraries (e.g., Sequelize) and wrappers, I would much rather use postgres than mongo for most use cases, even with node.js (granted, I like the sharding capabilities of mongo). Here's one benchmark and a pretty informative presentation, both favouring postgres over mongo.
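A quick sketch of what this looks like (the table and data are made up): the json type plus the extraction operators let you store and query documents directly.

```sql
CREATE TABLE events (
    id      serial PRIMARY KEY,
    payload json NOT NULL
);

INSERT INTO events (payload)
VALUES ('{"user": "alice", "action": "login", "meta": {"ip": "10.0.0.1"}}');

-- ->> extracts a field as text, -> extracts it as json
SELECT payload ->> 'user'          AS username,
       payload -> 'meta' ->> 'ip'  AS ip
FROM   events
WHERE  payload ->> 'action' = 'login';
```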

"Prevent non-key-field row updates from blocking foreign key checks" is probably the next best feature, in terms of concurrent performance since it "reduces the probability of deadlocks when updating tables involved in a foreign-key constraint".

Event triggers look pretty cool, and are something I might think of using. "Unlike regular triggers, which are attached to a single table and capture only DML events, event triggers are global to a particular database and are capable of capturing DDL events." I do wonder whether it might be better to capture such logic in the application code.
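Roughly, here's what a minimal one looks like (the function and trigger names are made up):

```sql
-- Log every DDL command tag as a notice.
CREATE OR REPLACE FUNCTION log_ddl()
RETURNS event_trigger AS $$
BEGIN
    RAISE NOTICE 'DDL command executed: %', tg_tag;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER log_ddl_trigger
    ON ddl_command_end
    EXECUTE PROCEDURE log_ddl();
```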

SP-GiST indexes interest me, mainly for theoretical reasons. The fact that there is yet another implementation pathway to construct indexes is pretty cool, though I'll probably stick with the existing GIN and GiST indexes for most things.

Lateral joins (I doubt I'll use them; I'd rather handle such logic in the application) and updatable views are "meh". The updatable-view logic seems highly restrictive: if you were seriously using a view, it would likely be a fairly complex query spanning multiple tables. I would just create a new one and have the application switch over, or something.

Here's what the folks at hacker news have to say about postgres 9.3.