Friday 15 August 2014

Paper summary: Injecting Utility into Anonymized Datasets by Kifer et al., SIGMOD '06


Summary
  • The paper introduces a formal approach to measuring utility. Using this measure, it shows how to inject utility into k-anonymous and l-diverse tables while maintaining the same level of privacy.
Introduction
  • k-anonymity and l-diversity rely on generalizations to preserve privacy.
  • In the real world, many attributes often need to be suppressed in order to guarantee privacy, which hurts utility no matter what operations are performed on the data.
  • One solution is to publish marginals (i.e., contingency tables for a subset of the attributes) along with the original anonymized data. This would require anonymizing the marginals too (also via generalizations) in order to preserve privacy.
  • However, there are many possible subsets of attributes (marginals) for which contingency tables can be built. How do we decide which particular collection of marginals to publish?
Preliminaries
  • (Defn 2.3) K-anonymity : Table $D$ satisfies k-anonymity if $\forall t \in D$, there are at least $k-1$ other tuples that have the same values as $t$ for every QI (quasi-identifier attribute). Note that we assume that the set of all non-sensitive attributes forms the QI.
  • (Defn 2.4) Anonymized group : An anonymized group is a setwise maximal set of tuples that have the same (generalized) value for each non-sensitive attribute.
  • (Defn 2.5) (c,l)-diversity : Let $c>0$ be a constant and $q$ be an anonymized group. Let $S$ be a sensitive attribute, $s_{1},..., s_{m}$ be the values of $S$ that appear in $q$, and $r_{1},..., r_{m}$ be their frequency counts. Let $r_{(1)},...,r_{(m)}$ be those counts sorted in descending order. We say $q$ satisfies (c,l)-diversity wrt $S$ if $r_{(1)} \leq c \sum_{i=l}^{m}r_{(i)}$. (A small check of Defns 2.3 and 2.5 is sketched below.)
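  • To make Defns 2.3 and 2.5 concrete, here is a minimal Python sketch (mine, not the paper's) checking both over a table stored as a list of dicts; the attribute names and toy values are illustrative assumptions.

    from collections import Counter

    def is_k_anonymous(table, quasi_ids, k):
        # Defn 2.3: every tuple shares its QI values with at least k-1 other tuples.
        counts = Counter(tuple(row[a] for a in quasi_ids) for row in table)
        return all(c >= k for c in counts.values())

    def group_is_cl_diverse(group, sensitive, c, l):
        # Defn 2.5: r_(1) <= c * (r_(l) + ... + r_(m)) for the descending frequency counts.
        freqs = sorted(Counter(row[sensitive] for row in group).values(), reverse=True)
        if len(freqs) < l:
            return False          # fewer than l distinct sensitive values in the group
        return freqs[0] <= c * sum(freqs[l - 1:])

    # Toy usage (attribute names and values are assumed, not from the paper):
    table = [
        {"age": "[46-50]", "zip": "130**", "disease": "flu"},
        {"age": "[46-50]", "zip": "130**", "disease": "cold"},
        {"age": "[46-50]", "zip": "130**", "disease": "flu"},
    ]
    print(is_k_anonymous(table, ["age", "zip"], k=3))       # True
    print(group_is_cl_diverse(table, "disease", c=2, l=2))  # True: 2 <= 2 * 1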
Existing utility measures
  • Generalization height is one utility measure.
  • Another measure is discernability, which assigns a cost to each tuple based on how many other tuples are indistinguishable from it. It is the sum of the squared sizes of the anonymized groups, plus $|D|$ times the number of suppressed tuples.
  • Neither of the above measures considers the distribution of the tuples.
  • A third measure is the classification metric, appropriate when one wants to train a classifier over the anonymized data. Thus, one attribute is treated as a class label. This metric assigns a penalty of 1 for every suppressed tuple. If a tuple $t$ is not suppressed, it looks at the majority class label of $t$'s anonymized group; if the class label of $t$ differs from the majority, it assigns a penalty of 1. The metric is the sum of all penalties. But it is not clear what happens if one wants to build classifiers for several different attributes. (A short sketch computing both discernability and this metric follows this list.)
  • A fourth measure is the information to privacy loss ratio, also designed for classification. However, it suffers from the same weakness as the classification metric.
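  • For concreteness, the discernability and classification metrics can be written down in a few lines; the sketch below is my own illustration, where anonymized groups are lists of dicts and the class attribute name is a parameter.

    from collections import Counter

    def discernability(groups, num_suppressed, table_size):
        # Sum of squared anonymized-group sizes, plus |D| per suppressed tuple.
        return sum(len(g) ** 2 for g in groups) + table_size * num_suppressed

    def classification_metric(groups, num_suppressed, class_attr):
        # Penalty 1 per suppressed tuple, plus 1 for every tuple whose class label
        # differs from the majority label of its anonymized group.
        penalty = num_suppressed
        for g in groups:
            label_counts = Counter(row[class_attr] for row in g)
            majority_size = label_counts.most_common(1)[0][1]
            penalty += len(g) - majority_size
        return penalty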
Proposed utility measure
  • We view the data as an iid sample generated from some distribution $F$.
  • Suppose tuples in our table have (discrete valued) attributes $U_{1},..., U_{n}$. Then we can estimate $F$ by the empirical distribution $\hat{F_{1}}$, where $\hat{F_{1}}(u_{1},...,u_{n})$ is the fraction of tuples $t$ with $t.U_{1}=u_{1},...,t.U_{n} = u_{n}$ (here $t.U_{1}$ refers to the value of attribute $U_{1}$ for tuple $t$).
  • Now suppose we are given anonymized marginals (e.g., 23% of tuples have an age in [46-50] while 77% have an age in [50-55]). We can view 23% and 77% as constraints, i.e., the marginals represent constraints. We can compute the maximum-entropy probability distribution that satisfies these constraints, $\hat{F_{2}}$.
  • (It turns out that the maximum entropy is also the maximum likelihood estimate associated with log linear models.)
  • We now have $\hat{F_{1}}$ associated with the original data and $\hat{F_{2}}$ associated with the anonymized marginals. We can compare them using the standard KL (Kullback-Leibler) divergence, which is minimized when $\hat{F_{1}} = \hat{F_{2}}$.
  • Since our goal is to determine which anonymized marginals to publish, we want to choose the collection whose $\hat{F_{2}}$ has the smallest KL divergence from the fixed $\hat{F_{1}}$. (An end-to-end sketch of this pipeline follows this list.)
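  • As an end-to-end illustration of this pipeline, here is a minimal Python sketch (my own, not the paper's code), assuming small discrete domains so the full joint table fits in memory. $\hat{F_{1}}$ is the empirical joint distribution; $\hat{F_{2}}$ is obtained by iterative proportional fitting (IPF), whose fixed point is the maximum-entropy / log-linear fit mentioned above; the candidate collection of marginals with the smallest KL divergence from $\hat{F_{1}}$ is preferred. All function and parameter names are illustrative.

    import numpy as np

    def empirical_dist(table, attrs, domains):
        # F1-hat: empirical joint distribution of the original table.
        shape = tuple(len(domains[a]) for a in attrs)
        p = np.zeros(shape)
        for row in table:
            p[tuple(domains[a].index(row[a]) for a in attrs)] += 1
        return p / p.sum()

    def maxent_from_marginals(marginals, attrs, domains, iters=200):
        # F2-hat: maximum-entropy joint distribution consistent with the marginals,
        # computed by iterative proportional fitting. Each marginal is a pair
        # (margin_attrs, target), where target is a probability table over
        # margin_attrs listed in the same order as they appear in attrs.
        shape = tuple(len(domains[a]) for a in attrs)
        q = np.full(shape, 1.0 / np.prod(shape))   # start from the uniform distribution
        for _ in range(iters):
            for margin_attrs, target in marginals:
                drop = tuple(i for i, a in enumerate(attrs) if a not in margin_attrs)
                current = q.sum(axis=drop, keepdims=True)
                target_full = np.asarray(target, dtype=float).reshape(current.shape)
                safe = np.where(current > 0, current, 1.0)
                q = q * np.where(current > 0, target_full / safe, 0.0)
        return q

    def kl_divergence(p, q, eps=1e-12):
        # KL(F1-hat || F2-hat); smaller means the marginals preserve more of the data.
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

    def best_marginal_set(candidates, p_hat, attrs, domains):
        # Among candidate collections of (already anonymized) marginals, pick the one
        # whose max-entropy estimate is closest to F1-hat. Generating candidates that
        # also satisfy the privacy definitions is outside this sketch.
        return min(candidates,
                   key=lambda m: kl_divergence(p_hat, maxent_from_marginals(m, attrs, domains)))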
Extending privacy definitions
  • We can extend k-anonymity and l-diversity to collections of anonymized marginals.
  • (Defn 4.1) k-link anonymity : A collection of anonymized marginals $M_{1},...,M_{r}$ satisfies k-link anonymity if for all $i = 1,...,r$ and for all tuples $t \in NonSensitive(M_{i})$, either $M_{i}(t) = 0$ or $M_{i}(t) \geq k$. Here $NonSensitive(M_{i})$ refers to the non-sensitive attributes over which $M_{i}$ is defined, while $M_{i}(t)$ refers to the number of tuples whose values on those attributes match $t$. (A small check of this definition is sketched after this list.)
  • We must also be sure that an adversary cannot use combinatorial techniques to determine that a tuple with a certain value for its quasi-identifiers exists in the original table and that the number of such tuples is less than $k$.
  • (Defn 4.2) k-combinatorial anonymity : Let $D$ be the domain of the nonsensitive attributes. A collection of anonymized marginals $M_{1},... ,M_{r}$ satisfies k-combinatorial anonymity if for all $t \in D$ one of the following holds:
    1. For all tables $T$ consistent with $M_{1},... ,M_{r}$, $T(t) = 0$
    2. There exists a table $T$ consistent with $M_{1},... ,M_{r}$ such that $T(t) \geq k$
  • (Defn 4.3) MaxEnt l-diversity : $M_{1},... ,M_{r}$ satisfy MaxEnt l-diversity if the maximum entropy distribution that is consistent with $M_{1},... ,M_{r}$ satisfies l-diversity.
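  • Of these, k-link anonymity is the easiest to check directly. Below is a small illustrative sketch (my own), assuming each marginal has already been projected onto its non-sensitive attributes and is stored as a dict from value-tuples to counts, with absent cells counting as 0. Checking k-combinatorial anonymity requires reasoning about all tables consistent with the marginals and is not shown.

    def satisfies_k_link_anonymity(marginals, k):
        # Defn 4.1: every non-zero cell count M_i(t) in every marginal is at least k.
        return all(count == 0 or count >= k
                   for marginal in marginals
                   for count in marginal.values())

    # Toy usage (attribute values are assumed):
    m1 = {("[46-50]",): 5, ("[51-55]",): 7}                  # marginal over age
    m2 = {("[46-50]", "130**"): 5, ("[51-55]", "130**"): 3}  # marginal over (age, zip)
    print(satisfies_k_link_anonymity([m1, m2], k=5))         # False: the count 3 < 5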
Experiments
  • Experiments showed that even a very simple search for collections of anonymized marginals can yield dramatic improvements in utility compared to publishing just a single anonymized table.
