Friday, 15 November 2013

Postgres 9.1 : Installing contrib modules on ec2

Having installed PostgreSQL 9.1 on my ec2 instance (Amazon Linux) following these instructions, I wanted to experiment with a few contrib modules. However, this process was not trivial, mainly because the installed postgres distribution did not have the contrib modules by default. Here's what I did :
  1. Checking if the modules exist :
    • cd /usr/pgsql-9.1/share/extension
    • Only plpgsql seems to be there.
  2. Extensions don't exist, preparing to install contrib. Check the specific postgres version first :
    • sudo su -
    • su - postgres
    • /usr/pgsql-9.1/bin/psql -p 5432
    • postgres=# select version();
       PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54), 64-bit
      (1 row)
  3. Installing contrib :
    • sudo yum install postgresql91-contrib
      Error: Package: postgresql91-contrib-9.1.10-1PGDG.rhel5.x86_64 (pgdg91)
  4. Whoops! Fixing errors and try installing again :
    • Downloaded the required rpm file from the following link.
    • wget
    • rpm -Uvh (instructions found here)
  5. Success! Checking "/usr/pgsql-9.1/share/extension" indeed shows that the modules were installed. What extensions are installed on my current database?
    • sudo su -
    • su - postgres
    • /usr/pgsql-9.1/bin/psql -p 5432
    • postgres=# select * from pg_extension;
       extname | extowner | extnamespace | extrelocatable | extversion | extconfig | e
       plpgsql |       10 |           11 | f              | 1.0        |           | 
      (1 row)
    • You can use "postgres=# \dx" instead of the select statement if you want. Anyway, it looks like only plpgsql is installed.
  6. Let us install some other extensions. These are installed on the current database by default.
    • postgres=# create extension pg_trgm;
    • postgres=# create extension hstore;
    • postgres=# \dx
                                          List of installed extensions
        Name   | Version |   Schema   |                            Description                            
       hstore  | 1.0     | public     | data type for storing sets of (key, value) pairs
       pg_trgm | 1.0     | public     | text similarity measurement and index searching based on trigrams
       plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
      (3 rows)
Extra information :
  1. How to change the current database :
    • postgres=# \connect foobar
      psql (9.1.10, server 9.1.9)
      You are now connected to database "foobar" as user "postgres".
  2. How to check the current database :
    • postgres=# select current_database();
      (1 row)
    • But I don't know why you'd need that in this case since you can check the database name in the prompt itself.
  3. Here's a list of modules to explore.
  4. What I ended up installing :
    • foobar=# \dx
                                               List of installed extensions
              Name        | Version |   Schema   |                            Description                            
       hstore             | 1.0     | public     | data type for storing sets of (key, value) pairs
       pg_stat_statements | 1.0     | public     | track execution statistics of all SQL statements executed
       pg_trgm            | 1.0     | public     | text similarity measurement and index searching based on trigrams
       pgstattuple        | 1.0     | public     | show tuple-level statistics
       plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
       uuid-ossp          | 1.0     | public     | generate universally unique identifiers (UUIDs)
      (6 rows)

Tuesday, 12 November 2013

Monday, 11 November 2013

Paper summary : K-Anonymous Data Mining, A Survey

Extracting useful information from large datasets has become an increasingly common endeavour. However, if datasets are not privacy protected, data mining can lead to a breach of individual privacy. One privacy preserving technique, k-anonymity, is examined. The authors describe k-anonymity, its enforcement, threats to k-anonymity and addressing these threats (in the context of data mining).

k-anonymity requires that the size of a QI-group (equivalence class) is at least k. QI attributes are those that can serve as an identifier for an individual. The aim is for tuples in a QI-group to be indistinguishable from each other. QI-groups can be built by generalization (replacing specific values with generalized versions from a hierarchy) and suppression (changing a specific attribute value to its most generalized form). These techniques result in information loss, which algorithms seek to minimize e.g., Samarati’s algorithms, etc.

k-anonymity can be violated by data mining techniques (e.g., association rule mining, classification, etc) since they can used to draw inferences and breach privacy. To address this, one could either anonymize-and-mine (AM) or mine-and-anonymize (MA). The AM approach decouples anonymization and mining but reduces the utility of the data while the MA approach extracts more utility from the data and is more efficient but restricts data mining to the owner.

To anonymize data in the AM approach, one could either apply top-down or bottom-up algorithms. Top-down algorithms initially generalize tables completely and subsequently refine them, increasing data utility but decreasing anonymity (vice versa for bottom-up).

To anonymize mined output in the MA approach, the algorithms depend on the specific scenario, but the main theme is to detect and close inference channels.

Paper summary : Privacy Preserving Schema and Data Matching

Record matching is the process of identifying identical entities across different data sources (or identifying duplicates in a single source). When record matching is performed across autonomous businesses, privacy preservation becomes important (e.g., businesses might not want to share schema or tuple information). The authors thus present an efficient privacy-preserving protocol for record matching across data sources, assuming semi-honest parties. The protocol provides both secure data matching and secure schema matching.

Secure data matching is carried out in 3 phases. In phase 1, two parties, P and Q decide on a set of random strings and a comparison function. The SparseMap algorithm is used with the set and the comparison function to build an embedding space. In phase 2, P and Q both apply SparseMap to embed Rp and Rq into the embedding space. Two structures, Pstr and Qstr are also created in this process (these keep track of vectors in the euclidean space; a vector is an attribute value of a tuple). In phase 3, a third party, W compares each record in Pstr and Qstr. If the comparison distance is below a threshold, then those records are the same (approximate matching c.f. [Agrawal, SIGMOD ‘03] which uses exact matching). Thus, all the same records can be collected and returned to P and Q.

In secure schema matching, P and Q map their local schemas to a global schema owned by W in a series of 4 phases. In phase 1, P and Q decide on a secret key k (not known to W). In phase 2, P and Q map their local schemas to a global schema using mapping expressions, generating matched schemas. In phase 3, these matched schemas are encrypted by P and Q respectively by using a hash function (provided by W) and the key k. In the final phase 4, W computes the intersection of these encrypted sets. The intersection can be sent to P and Q, which can then be decrypted using key k.

The overall protocol uses both the secure data matching step and (a slightly modified version of) the secure schema matching step. Proofs are provided to corroborate the security properties of the overall protocol. Experiments suggest that (i) high quality embedding is possible by tuning the embedded space, (ii) the protocol is effective, but there is a tradeoff with quality and efficiency wrt heuristics (iii) the protocol is doubly efficient compared to [Agrawal, SIGMOD ‘03] and comparably efficient compared to record matching in the original space.

Paper : Privacy Preserving Schema and Data Matching; Scannapieco et. Al., SIGMOD ‘07

Monday, 4 November 2013

Paper summary : Information Sharing Across Private Databases

Recent trends (e.g., outsourcing) in the field of information integration highlight the importance of security when sharing information across autonomous DBs. It is essential that queries (which span multiple DBs) be answered in a terse manner i.e., no additional information apart from query results is leaked. Present solutions are inadequate (e.g., using third party middlemen, which requires a high amount of trust on these middlemen). Instead, the authors propose certain cryptographic protocols which are used to simulate trusted third parties. Protocols for 4 operations are explored. These protocols only permit a minimal necessary amount information to be shared for queries, assuming semi-honest parties.

The protocols support 4 operations- (i) intersection, (ii) equijoin, (iii) intersection size and (iv) equijoin size. For (i), a naïve protocol is proposed, which has the disadvantage that R (the receiver) can guess all additional information sent by S (the sender) if the domain is small. This can be fixed by using a pair of commutative functions (assuming ideal hashing). For (ii), yet another naïve protocol is proposed, which incorrectly encrypts extra information. This has the same weakness as that in (i) wrt the naïve protocol. To overcome this, the extra information is encrypted with a key, which can obtained by inverting a secure function. For (iii), the intersection protocol in (i) is modified so that R learns only about the intersection size, and not the intersection set (which would also result in the intersection size). For (iv), the protocol is similar to that in (iii). All the 4 protocols are all proven correct and secure assuming semi-honest parties (i.e., they disclose minimal information apart from the query result).

The authors acknowledge certain limitations of their protocols. Their protocols do not make any guarantees for multiple queries (which can be handled in other ways e.g. maintaining audit trails). They also assume that database schemas are known, and don't account for schema heterogeneity.

Finally, the authors present the theoretical efficiency of the proposed protocols and also perform estimated cost analysis for certain scenarios.