Friday 15 August 2014

Paper summary: Injecting Utility into Anonymized Datasets by Kifer et al., SIGMOD '06


Summary
  • The paper introduces a formal approach to measuring utility. Using this measure, it shows how to inject utility into k-anonymous and l-diverse tables while maintaining the same level of privacy.
Introduction
  • k-anonymity and l-diversity rely on generalizations to preserve privacy.
  • In the real world, many attributes often need to be suppressed in order to guarantee privacy, which hurts utility no matter what operations are performed on the data.
  • One solution is to publish marginals (i.e., contingency tables for a subset of the attributes) along with the original anonymized data. This would require anonymizing the marginals too (also via generalizations) in order to preserve privacy.
  • However, there are many possible subsets of attributes (marginals) for which contingency tables can be built. How do we decide which particular collection of marginals to publish?
Preliminaries
  • (Defn 2.3) K-anonymity : Table $D$ satisfies k-anonymity if $\forall t \in D$, there are at least $k-1$ other tuples that have the same values as $t$ for every QI (quasi-identifier attribute). Note that we assume that the set of all non-sensitive attributes forms the QI.
  • (Defn 2.4) Anonymized group : An anonymized group is a setwise maximal set of tuples that have the same (generalized) value for each non-sensitive attribute.
  • (Defn 2.5) (c,l)-diversity : Let $c>0$ be a constant and $q$ be an anonymized group. Let $S$ be a sensitive attribute, $s_{1},..., s_{m}$ be the values of $S$ that appear in $q$, and $r_{1},..., r_{m}$ be their frequency counts. Let $r_{(1)},...,r_{(m)}$ be those counts sorted in descending order. We say $q$ satisfies (c,l)-diversity wrt $S$ if $r_{(1)} \leq c \sum_{i=l}^{m}r_{(i)}$. (A small check of Defns 2.3 and 2.5 is sketched below.)
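  • To make Defns 2.3 and 2.5 concrete, here is a minimal Python sketch (mine, not the paper's) checking both over a table stored as a list of dicts; the attribute names and toy values are illustrative assumptions.

    from collections import Counter

    def is_k_anonymous(table, quasi_ids, k):
        # Defn 2.3: every tuple shares its QI values with at least k-1 other tuples.
        counts = Counter(tuple(row[a] for a in quasi_ids) for row in table)
        return all(c >= k for c in counts.values())

    def group_is_cl_diverse(group, sensitive, c, l):
        # Defn 2.5: r_(1) <= c * (r_(l) + ... + r_(m)) for the descending frequency counts.
        freqs = sorted(Counter(row[sensitive] for row in group).values(), reverse=True)
        if len(freqs) < l:
            return False          # fewer than l distinct sensitive values in the group
        return freqs[0] <= c * sum(freqs[l - 1:])

    # Toy usage (attribute names and values are assumed, not from the paper):
    table = [
        {"age": "[46-50]", "zip": "130**", "disease": "flu"},
        {"age": "[46-50]", "zip": "130**", "disease": "cold"},
        {"age": "[46-50]", "zip": "130**", "disease": "flu"},
    ]
    print(is_k_anonymous(table, ["age", "zip"], k=3))       # True
    print(group_is_cl_diverse(table, "disease", c=2, l=2))  # True: 2 <= 2 * 1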
Existing utility measures
  • Generalization height is one utility measure.
  • Another measure is discernability, which assigns a cost to each tuple based on how many other tuples are indistinguishable from it. It is the sum of the squared sizes of the anonymized groups, plus $|D|$ times the number of suppressed tuples.
  • Neither of the above measures considers the distribution of the tuples.
  • A third measure is the classification metric, appropriate when one wants to train a classifier over the anonymized data. Thus, one attribute is treated as a class label. This metric assigns a penalty of 1 for every suppressed tuple. If a tuple $t$ is not suppressed, it looks at the majority class label of $t$'s anonymized group; if the class label of $t$ differs from the majority, it assigns a penalty of 1. The metric is the sum of all penalties. But it is not clear what happens if one wants to build classifiers for several different attributes. (A short sketch computing both discernability and this metric follows this list.)
  • A fourth measure is the information to privacy loss ratio, also designed for classification. However, it suffers from the same weakness as the classification metric.
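  • For concreteness, the discernability and classification metrics can be written down in a few lines; the sketch below is my own illustration, where anonymized groups are lists of dicts and the class attribute name is a parameter.

    from collections import Counter

    def discernability(groups, num_suppressed, table_size):
        # Sum of squared anonymized-group sizes, plus |D| per suppressed tuple.
        return sum(len(g) ** 2 for g in groups) + table_size * num_suppressed

    def classification_metric(groups, num_suppressed, class_attr):
        # Penalty 1 per suppressed tuple, plus 1 for every tuple whose class label
        # differs from the majority label of its anonymized group.
        penalty = num_suppressed
        for g in groups:
            label_counts = Counter(row[class_attr] for row in g)
            majority_size = label_counts.most_common(1)[0][1]
            penalty += len(g) - majority_size
        return penalty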
Proposed utility measure
  • We view the data as an iid sample generated from some distribution $F$.
  • Suppose tuples in our table have (discrete valued) attributes $U_{1},..., U_{n}$. Then we can estimate $F$ by the empirical distribution $\hat{F_{1}}$, where $\hat{F_{1}}(u_{1},...,u_{n})$ is the fraction of tuples $t$ with $t.U_{1}=u_{1},...,t.U_{n} = u_{n}$ (here $t.U_{1}$ refers to the value of attribute $U_{1}$ for tuple $t$).
  • Now suppose we are given anonymized marginals (e.g., 23% of tuples have an age in [46-50] while 77% have an age in [50-55]). We can view 23% and 77% as constraints, i.e., the marginals represent constraints. We can compute the maximum-entropy probability distribution that satisfies these constraints, $\hat{F_{2}}$.
  • (It turns out that the maximum entropy is also the maximum likelihood estimate associated with log linear models.)
  • We now have $\hat{F_{1}}$ associated with the original data and $\hat{F_{2}}$ associated with the anonymized marginals. We can compare them using the standard KL (Kullback-Leibler) divergence, which is minimized when $\hat{F_{1}} = \hat{F_{2}}$.
  • Since our goal is to determine which anonymized marginals to publish, we want to choose the collection whose $\hat{F_{2}}$ has the smallest KL divergence from the fixed $\hat{F_{1}}$. (An end-to-end sketch of this pipeline follows this list.)
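  • As an end-to-end illustration of this pipeline, here is a minimal Python sketch (my own, not the paper's code), assuming small discrete domains so the full joint table fits in memory. $\hat{F_{1}}$ is the empirical joint distribution; $\hat{F_{2}}$ is obtained by iterative proportional fitting (IPF), whose fixed point is the maximum-entropy / log-linear fit mentioned above; the candidate collection of marginals with the smallest KL divergence from $\hat{F_{1}}$ is preferred. All function and parameter names are illustrative.

    import numpy as np

    def empirical_dist(table, attrs, domains):
        # F1-hat: empirical joint distribution of the original table.
        shape = tuple(len(domains[a]) for a in attrs)
        p = np.zeros(shape)
        for row in table:
            p[tuple(domains[a].index(row[a]) for a in attrs)] += 1
        return p / p.sum()

    def maxent_from_marginals(marginals, attrs, domains, iters=200):
        # F2-hat: maximum-entropy joint distribution consistent with the marginals,
        # computed by iterative proportional fitting. Each marginal is a pair
        # (margin_attrs, target), where target is a probability table over
        # margin_attrs listed in the same order as they appear in attrs.
        shape = tuple(len(domains[a]) for a in attrs)
        q = np.full(shape, 1.0 / np.prod(shape))   # start from the uniform distribution
        for _ in range(iters):
            for margin_attrs, target in marginals:
                drop = tuple(i for i, a in enumerate(attrs) if a not in margin_attrs)
                current = q.sum(axis=drop, keepdims=True)
                target_full = np.asarray(target, dtype=float).reshape(current.shape)
                safe = np.where(current > 0, current, 1.0)
                q = q * np.where(current > 0, target_full / safe, 0.0)
        return q

    def kl_divergence(p, q, eps=1e-12):
        # KL(F1-hat || F2-hat); smaller means the marginals preserve more of the data.
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

    def best_marginal_set(candidates, p_hat, attrs, domains):
        # Among candidate collections of (already anonymized) marginals, pick the one
        # whose max-entropy estimate is closest to F1-hat. Generating candidates that
        # also satisfy the privacy definitions is outside this sketch.
        return min(candidates,
                   key=lambda m: kl_divergence(p_hat, maxent_from_marginals(m, attrs, domains)))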
Extending privacy definitions
  • We can extend k-anonymity and l-diversity to collections of anonymized marginals.
  • (Defn 4.1) k-link anonymity : A collection of anonymized marginals $M_{1},...,M_{r}$ satisfies k-link anonymity if for all $i = 1,...,r$ and for all tuples $t \in NonSensitive(M_{i})$, either $M_{i}(t) = 0$ or $M_{i}(t) \geq k$. Here $NonSensitive(M_{i})$ refers to the non-sensitive attributes over which $M_{i}$ is defined, while $M_{i}(t)$ refers to the number of tuples whose values on those attributes match $t$. (A small check of this definition is sketched after this list.)
  • We must also be sure that an adversary cannot use combinatorial techniques to determine that a tuple with a certain value for its quasi-identifiers exists in the original table and that the number of such tuples is less than $k$.
  • (Defn 4.2) k-combinatorial anonymity : Let $D$ be the domain of the nonsensitive attributes. A collection of anonymized marginals $M_{1},... ,M_{r}$ satisfies k-combinatorial anonymity if for all $t \in D$ one of the following holds:
    1. For all tables $T$ consistent with $M_{1},... ,M_{r}$, $T(t) = 0$
    2. There exists a table $T$ consistent with $M_{1},... ,M_{r}$ such that $T(t) \geq k$
  • (Defn 4.3) MaxEnt l-diversity : $M_{1},... ,M_{r}$ satisfy MaxEnt l-diversity if the maximum entropy distribution that is consistent with $M_{1},... ,M_{r}$ satisfies l-diversity.
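  • Of these, k-link anonymity is the easiest to check directly. Below is a small illustrative sketch (my own), assuming each marginal has already been projected onto its non-sensitive attributes and is stored as a dict from value-tuples to counts, with absent cells counting as 0. Checking k-combinatorial anonymity requires reasoning about all tables consistent with the marginals and is not shown.

    def satisfies_k_link_anonymity(marginals, k):
        # Defn 4.1: every non-zero cell count M_i(t) in every marginal is at least k.
        return all(count == 0 or count >= k
                   for marginal in marginals
                   for count in marginal.values())

    # Toy usage (attribute values are assumed):
    m1 = {("[46-50]",): 5, ("[51-55]",): 7}                  # marginal over age
    m2 = {("[46-50]", "130**"): 5, ("[51-55]", "130**"): 3}  # marginal over (age, zip)
    print(satisfies_k_link_anonymity([m1, m2], k=5))         # False: the count 3 < 5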
Experiments
  • Experiments showed that even a very simple search for collections of anonymized marginals can yield dramatic improvements in utility compared to publishing just a single anonymized table.
