Monday, 11 November 2013

Paper summary : Privacy Preserving Schema and Data Matching

Record matching is the process of identifying identical entities across different data sources (or identifying duplicates in a single source). When record matching is performed across autonomous businesses, privacy preservation becomes important (e.g., businesses might not want to share schema or tuple information). The authors thus present an efficient privacy-preserving protocol for record matching across data sources, assuming semi-honest parties. The protocol provides both secure data matching and secure schema matching.

Secure data matching is carried out in 3 phases. In phase 1, two parties, P and Q decide on a set of random strings and a comparison function. The SparseMap algorithm is used with the set and the comparison function to build an embedding space. In phase 2, P and Q both apply SparseMap to embed Rp and Rq into the embedding space. Two structures, Pstr and Qstr are also created in this process (these keep track of vectors in the euclidean space; a vector is an attribute value of a tuple). In phase 3, a third party, W compares each record in Pstr and Qstr. If the comparison distance is below a threshold, then those records are the same (approximate matching c.f. [Agrawal, SIGMOD ‘03] which uses exact matching). Thus, all the same records can be collected and returned to P and Q.

In secure schema matching, P and Q map their local schemas to a global schema owned by W in a series of 4 phases. In phase 1, P and Q decide on a secret key k (not known to W). In phase 2, P and Q map their local schemas to a global schema using mapping expressions, generating matched schemas. In phase 3, these matched schemas are encrypted by P and Q respectively by using a hash function (provided by W) and the key k. In the final phase 4, W computes the intersection of these encrypted sets. The intersection can be sent to P and Q, which can then be decrypted using key k.

The overall protocol uses both the secure data matching step and (a slightly modified version of) the secure schema matching step. Proofs are provided to corroborate the security properties of the overall protocol. Experiments suggest that (i) high quality embedding is possible by tuning the embedded space, (ii) the protocol is effective, but there is a tradeoff with quality and efficiency wrt heuristics (iii) the protocol is doubly efficient compared to [Agrawal, SIGMOD ‘03] and comparably efficient compared to record matching in the original space.

Paper : Privacy Preserving Schema and Data Matching; Scannapieco et. Al., SIGMOD ‘07

No comments:

Post a Comment