Friday 15 August 2014

Paper summary : Injecting Utility into Anonymized Datasets by Kifer et. Al., SIGMOD '06

  • In this paper, we will introduce a formal approach to measuring utility. Using this measure, we will show how to inject utility into k-anonymous and l-diverse tables, while maintaining the same level of privacy.
  • k-anonymity and l-diversity rely on generalizations to preserve privacy.
  • In the real-world, many attributes often need to be suppressed in order to guarantee privacy, which is bad for utility no matter what operations are being performed on the data.
  • One solution is to publish marginals (i.e., contingency tables for a subset of the attributes) along with the original anonymized data. This would require anonymizing the marginals too (also via generalizations) in order to preserve privacy.
  • However, there are many possible subsets of attributes (marginals) for which contingency tables can be built. How do we decide which particular collection of marginals to publish?
  • (Defn 2.3) K-anonymity : Table $D$ satisfies k-anonymity if $\forall t \in D$, there are at least $k-1$ other tuples that have the same values as $t$ for every QI (quasi identifier attribute). Note that we assume that the set of all non-sensitive attributes form the QI.
  • (Defn 2.4) Anonymized group : An anonymized group is a setwise maximal set of tuples that have the same (generalized) value for each non-sensitive attribute.
  • (Defn 2.5) (c,l)-diversity : Let $c>0$ be a constant and $q$ be an anonymized group. Let $S$ be a sensitive attribute and $s_{1},..., s_{m}$ be the values of $S$ that appear in $q$ and let $r_{1},..., r_{m}$ be their frequency counts. Let $r_{(1)},...,r_{(m)}$ be those counts sorted in a descending order. We say $q$ satisfies (c,l)-diversity wrt $S$ if $r_{(1)} \leq c \sum_{i=l}^{m}r_{(i)}$.
Existing utility measures
  • Generalization height is one utility measure.
  • Another measure is discernability, which assigns a cost to each tuple based on how many other tuples are indistinguishable from it. It is the sum of squares of anonymized groups, plus $|D|$ times the number of suppressed tuples.
  • Both of the above measures do not consider the distributions of the tuples.
  • A third measure is the classification metric, appropriate when one wants to train a classifier over the anonymized data. Thus, one attribute is treated as a class label. This metric assigns a penalty of 1 for every tuple suppressed. If a tuple $t$ is not suppressed, it looks at the majority class label of $t$'s anonymized group. If the class label of tuple $t$ is different from the majortiy, assign a penalty of 1. This metric the sum of all penalties. But it is not clear what happens if one wants to build classifiers for several different attributes.
  • A fourth measure is the information to privacy loss ratio, also designed for classification. However, it suffers from the same weakness as the classification metric.
Proposed utility measure
  • We view the data as an iid sample generated from some distribution $F$.
  • Suppose tuples in our table have (discrete valued) attributes $U_{1},..., U_{n}$. Then we can estimate $F$ using $\hat{F_{1}}$ where $\hat{F_{1}}$ corresponds to $P(t.U_{1}=u_{1},...,t.U_{n} = u_{n})$ where $t.U_{1}$ refers to the attribute value $U_{1}$ for tuple $t$.
  • Now suppose we are given anonymized marginals (e.g., 23% of tuples have the age attribute appearing between [46-50] years old while 77% appear between [50-55]). We can view 23% and 77% as constraints i.e., the marginals represent constraints. We can compute the (maximum entropy) probability distribution that corresponds to these constraints, $\hat{F_{2}}$.
  • (It turns out that the maximum entropy is also the maximum likelihood estimate associated with log linear models.)
  • We now have $\hat{F_{1}}$ associated with the original data and $\hat{F_{2}}$ associated with the anonymized marginals. We can compare them using the standard KL (kullback-liebler) divergence, which is minimized with $\hat{F_{1}} = \hat{F_{2}}$.
  • Since our goal is to determine which anonymized marginals to publish, we want to minimize KL-divergence between the various possible $\hat{F_{2}}$ and a fixed $\hat{F_{1}}$.
Extending privacy definitions
  • We can extend k-anonymity and l-diversity to collections of anonymized marginals.
  • (Defn 4.1) k-link anonymity : A collection of anonymized marginals $M_{1},...,M_{r}$ satisfies k-link anonymity if for all i = $1,...,r$ and for all tuples $t \in NonSensitive(M_{i})$, either $M_{i}(t) = 0$ or $M_{i}(t) \geq k$. $NonSensitive(M_{i})$ refers to the non sensitive attributes which $M_{i}$ is comprised of while $M_{i}(t)$ refers to the number of the tuples which have the same attribute values as $M_{i}$.
  • We must also be sure that an adversary cannot use combinatorial techniques to determine that a tuple with a certain value for its quasi-identifiers exists in the original table and that the number of such tuples is less than $k$.
  • (Defn 4.2) k-combinatorial anonymity : Let $D$ be the domain of the nonsensitive attributes. A collection of anonymized marginals $M_{1},... ,M_{r}$ satisfies k-combinatorial anonymity if for all $t \in D$ one of the following holds:
    1. For all tables $T$ consistent with $M_{1},... ,M_{r}$, $T(t) = 0$
    2. There exists a table $T$ consistent with $M_{1},... ,M_{r}$ st $T(t) \geq k$
  • (Defn 4.3) MaxEnt l-diversity : $M_{1},... ,M_{r}$ satisfy MaxEnt l-diversity if the maximum entropy distribution that is consistent with $M_{1},... ,M_{r}$ satisfies l-diversity.
  • Experiments showed that even a very simple search for anonymized marginals can yield dramatic results c.f. to just a single anonymized table.

Thursday 7 August 2014

Paper summary : The Cost of Privacy by Brickell, KDD '08

  • Association of QIs (quasi identifier attributes) with sensitive attributes is known as sensitive attribute disclosure (when breached) c.f., membership disclosure which is concerned with identifying whether an individual is in a table.
  • Goal of the paper is to evaluate the tradeoff bet. incremental gain (by generalization and suppression) in data mining utility and the degradation in privacy caused by publishing QI attributes with sensitive attributes (by the very same generalization and suppression).
  • Privacy loss is the increase in adversary's ability to learn sensitive attributes corresponding to a given identity.
  • Utility gain is the increase in accuracy of machine learning (ml) tasks evaluated on the sanitized dataset.
  • Tuples $t_{i}$ and $t_{j}$ are quasi-equivalent if $t_{i}[Q] = t_{j}[Q]$ for the QI attribute $Q$. We can thus partition the original table $T$ into QI classes (equivalence classes), denoted as $<t_{j}>$.
  • Let $\epsilon_{Q} \subseteq T$ be a set of representative records for each equivalence class in $T$.
  • Consider a subset of tuples $U = \{u_{1}, u_{2}, ..., u_{p}\} \subseteq T$. For any sensitive attribute value $s$, $U_{s} = \{u \in U | u[S] = s\}$. We define $p(U,s) = \frac{|U_{s}|}{|U|}$.
  • Semantic privacy is concerned with prior/posterior knowledge of adversary. Anonymization is usually performed by randomly perturbing attribute values.
  • In contrast, syntactic privacy (microdata privacy) is only concerned about the distribution of attribute values in the sanitized table without any care about the adversary's knowledge. Privacy here is performed by generalization and suppression.
  • This paper is only concerned with semantic privacy because of the numerous weaknesses of syntactic privacy (weaknesses of k-anonymity, l-diversity, t-closeness were presented by the authors).
Attack model
  • Adversary only sees $T'$ (sanitized table).
  • We assume that generalization and suppression is only carried out to QI attibutes, not to sensitive attributes. Why? To keep sanitized data ``truthful'' [31,34] (this reasoning doesn't make much sense to me).
  • We define adversary's baseline knowledge, $A_{base}$, which is the minimum information he/she can learn from trivial santization (publishing QIs and sensitive attributes in two separate tables). Formally, $A_{base}$ is a prob. density function of the sensitive attribute values in $T$ i.e., $A_{base} = <p(T, s_{1}), p(T,s_{2}), ... , p(T, s_{l})>$.
  • Adversary's posterior knowledge is denoted by $A_{san}$ i.e., what he/she learns from $T'$ about sensitive attributes of some target $t \in T$. The adversary can identify the equiv. class for $t$ (i.e., $<t>$) from $T'$ since the related generalization hierarchy is totally ordered. Thus, formally, $A_{san} = <p(<t>, s_{1}), p(<t>, s_{2}), ..., p(<t>, s_{l})>$.
  • We can now define sensitive attribute disclosure : $A_{diff}(<t>) = \frac{1}{2} \sum_{i=1}^{l}|p(T, s_{i}) - p(<t>, s_{i})|$ $A_{quot}(<t>) = |\frac{p(<t>, s)}{p(T, s)}|$ Basically, these values define how much more the adversary learns from observing santizied QI attributes that he would have learnt from trivial sanitization.
Semantic Privacy
  • (Defn 1)$\delta$-disclosure privacy : $<t>$ is $\delta$-disclosure private wrt $S$ (sensitive attribute) for all $s \in S$ if $A_{quot}(<t>) = |\log \frac{p(<t>, s)}{p(T,s)}| < \delta$. A table is $\delta$-disclosure private if $\forall t \in \epsilon_{Q}, <t>$ is $\delta$-disclosure private (intuitively, this means that the distribution of sensitive attribute values within each QI class is roughly the same as the others in $T$).
  • Defn 1 allows us to relate $\delta$ with the information gain parameter (defined as $GAIN(S,Q) = H(S) - H(S|Q)$ used by decision tree classifiers such as ID3 and C4.5). Refer to lemma 1 below.
  • (Lemma 1) If $T$ satisfies $\delta$-disclosure privacy, then $GAIN(S,Q) < \delta$ (proof is easy enough to understand).
Syntactic Privacy
  • These privacy measures refer to k-anonymity (i.e., $\forall t_{j} \in \epsilon_{Q}, |<t_{j}>| \geq k$), l-diversity, (c, l)-diversity and t-closeness. These measures have many weaknesses. Ultimately, the authors are only concerned with sematic privacy and not with syntactic privacy.
Measuring privacy
  • We have a bound on disclosure based on definition 1. However, defintion 1 is insufficient for expressing privacy because actual table instances may have less sensitive attribute disclosure (thus more privacy) than permitted by the definition.
  • Thus, we define sensitive attribute disclosure as : $A_{know} = \frac{1}{|T|} \sum_{t \in \epsilon_{Q}} |<t>| \cdot A_{diff}(<t>)$ where $know$ refers to knowledge gain.
  • Another metric quantifies the attacker's ability to guess $t$'s sensitive attribute using a greedy strategy (to guess the most common sensitive attribute value in $<t>$). For $<t>$, let $s_{max}(<t>)$ be the most common sensitive attribute value in $<t>$. Thus, $A_{acc} = \frac{1}{|T|} \sum_{t \in \epsilon_{Q}}(|<t>| \cdot p(<t>, s_{max}(<t>))) - p(T, s_{max}(T))$ where $acc$ stands for adversarial accuracy gain.
  • $A_{acc}$ underestimates the amount of info leaked by $T'$ because it doesn't consider the shifts in the probabilty of non-majority sensitive attribute values.
  • One approach to measuring utility is to minimize the amount of generalization and suppression to QI attributes [10].
  • Utility measurement is innately tied to the computations that are to be performed on $T'$. We are concerned with measuring the utility of classification tasks.
  • Intuitively, the utility of $T'$ should be measured by how well cross attribute correlations are preserved after sanitization.
  • In our context, we are using the semantic definition of privacy and semantic definition of utility and wish to evaluate the tradeoff between them.
  • For some workload $w$, we measure workload specific utility of trivially sanitized datasets (where privacy is maximum), denoted by $U_{base}^{w}$. One example for utility is the accuracy of the classifier.
  • We then compute $U_{san}^{w}$ for $T'$.
  • $U_{san}^{w} - U_{base}^{w}$ is the utility gain. If this value is close to 0, sanitization is pointless (i.e. trivial sanitization provides as much utility as any sophisticated sanitization algorithm while providing as much privacy as possible).
  • Another utility measure is $U_{max}^{w}$ on the original dataset $T$. If this value is low, the workload is inappropriate for the data regardless of sanitization because even if users were given $T$, utility would have been low.
  • Experiments suggest that trivial sanitization produces equivalent utility and better privacy than non-trivial generalization and suppression.
  • Authors used implementation of generalization and suppression from LeFerre[20]. Weka was used for classification.
  • Of course if sanitization has been correctly performed, it is impossible to build a good classifier for $S$ (sensitive attribute) based only on $Q$ (QI attribute) because good sanitization must destroy any correlation between $S$ and $Q$.
  • Perhaps it is not surprising that sanitization makes it difficult to build an accurate classifier for $S$. But experiments on the UCI dataset show that sanitization also destroys correlation between $S$ and neutral attributes.
  • Experiments indicate that it is difficult to find DBs on which sanitization permits both privacy and utility.
  • However, we can construct artifical DBs for specific scenarios that permits both privacy and accuracy (example provided is very contrived).

Wednesday 6 August 2014

Paper summary : The Boundary Between Privacy and Utility in Data Publishing by Rastogi et. Al., VLDB '07

  • Definition of privacy : an attackers prior belief of tuple $t$ belonging to private instance $I$ divided by his posterior belief of tuple $t$ belonging to $I$ given the published instance $V$.
  • Prior is given by $Pr[t] \leq k \frac{n}{m}$ where $n = |I|$, $m = |D|$ ($D$ refers to the domain of the attribute values of $I$), and $k$ is some constant. This is a reasonable definition because $I$ contains $n$ tuples out of $m$ possible tuples.
  • Posterior is given by $Pr[t | V] \leq \gamma$ for some $\gamma$.
  • Definition of utility : Utility is the ability to estimate counting queries (example of a counting query- SELECT count(*) WHERE country=``Canada''). Utility is conveyed by the formula : $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ where $n = |I|$, $Q(I)$ refers to a counting query over the instance $I$, $\widetilde{Q}(V)$ refers to a counting query estimate over the published instance $V$ and $\rho$ is some number.
  • One result of the paper is that if $k$ is $\Omega (\sqrt{m})$ (hence $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$, which represents a very powerful attacker with a large prior) then no algorithm can achieve both utility and privacy.
  • If $k$ is $O(1)$ (i.e., $Pr[t] = O(\frac{n}{m})$, which means that the attacker is weaker) then we describe an algorithm that achieves a privacy/utility tradeoff given by the formula $\rho = \sqrt{\frac{k}{\gamma}}$. The idea is that if $\rho$ is small ($\rho$ represents query estimation error, shown in the RHS of the utility formula), then utility is high because query estimation error is low. $k$ conveys prior while $\gamma$ conveys the posterior (definition of privacy).
  • The described anonymization algorithm is a randomized algorithm st for any $Q$ (query), the probability that the algorithm outputs $V$ st $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ is $1-\epsilon$ for some $\epsilon$.
Classification of adversaries
  • For every tuple, the attackers prior is either small or equal to 1.
  • (Defn 2.1) Let $d \in [0,1]$. A d-bounded adversary is one for which $\forall t \in D$ either $Pr[t] \leq d$ or $Pr[t] = 1$. If $Pr$ is tuple independent, it is called a d-independent adversary.
  • For a d-independent adversary, there is no correlation amongst tuples though there may be correlation among attributes.
  • (Defn 2.3) An algorithm is (d-$\gamma$)-private for all d-independent adversaries if (i) $\frac{d}{\gamma} \leq \frac{Pr[t|V]}{Pr[t]}$ and (ii) $Pr[t|V] \leq \gamma$
  • If $t$ fails condition (i), then there is a negative leakage (i.e., adversary learns more about the fact that tuple $t$ is not in $I$), while if it fails condition (ii), this is a positive leakage (i.e., adversary learns more about the fact that the tuple $t$ is in $I$).
  • (Defn 2.5) A randomized algorithm is called $(\rho, \epsilon)$-useful if it has an estimator (i.e., $\widetilde{Q}$) st for any $Q$ and $I$: $Pr[|Q(I) - \widetilde{Q}(V)| \geq \rho \sqrt{n}] \leq \epsilon$.
Impossibility result
  • Prove that if the attackers prior is large (i.e., $k=\Omega(\sqrt{m})$ thus, $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$ thus $d= \Omega(\frac{n}{\sqrt{m}})$ by defn 2.1), then no algorithm can achieve both privacy and utility.
  • We first show that if $d = \Omega(\frac{n}{\sqrt{m}})$, no $(d, \gamma)$-private algorithm can achieve even little utility.
  • (Defn 3.1) Statistical difference between 2 distributions over a domain $X$ is : $SD(Pr_{A}, Pr_{B}) = \sum_{x \in X}{|Pr_{A}(x) - Pr_{B}(x)|}$.
  • Connection between SD and utility :
    • Let $Q$ be a large query (i.e., returns a sizable fraction if executed on the domain).
    • When we execute $\widetilde{Q}$, we get errors which depend on $n=|I|$.
    • If there is utility in algorithm $A$, the users should be able to distinguish between when $Q(I) = n$ and when $Q(I) = 0$. This means we should be able to differentiate between $Pr_{A}(V) = Pr[V | E_{Q}]$ and $Pr_{B}(V) = Pr[V | E_{Q}']$, where $E_{Q}$ refers to the event where $Q(I) = n$ while $E_{Q}'$ refers to the event where $Q(I) = 0$.
    • Intuitively, $SD(Pr_{A}, Pr_{B})$ should be large in order to differentiate $Pr_{A}$ and $Pr_{B}$. This would mean that utility exists in algorithm A, hence we can get a reasonable estimate for $Q$ by running $\widetilde{Q}$ on $V$.
    • An algorithm is considered meaningless if $SD(Pr_{A}, Pr_{B})$ is smaller than 0.5 for 2/3 of the queries that a user executes.
  • (Theorem 3.3) For all $\gamma < 1$, there exists a constant $c$ (independent of $m,n, \gamma$) st algorithm $A$ is not $(d, \gamma)$-private for any $d \geq \frac{1}{c} \frac{\gamma}{1- \gamma} \frac{n}{\sqrt{m}}$ (i.e., no meaninigful algorithm can offer $(k \frac{n}{m}, \gamma)$-privacy for $k=\Omega(\sqrt{m})$).
  • But then absence of $(d, \gamma)$-privacy means that there exists some d-independent adversary for which the following happens for some tuple $t$ : either (i) positive leakage (i.e., $Pr[t] \leq d$ but $Pr[t|V] \geq \gamma$) or (ii) negative leakage (i.e., $Pr[t|V] << Pr[t]$). Hence, we have proven the impossibility result.
  • Let us assume that $k=O(1)$. Now we can develop an algorithm to publish $V$ from $I$ so that $V$ has some utility.
  • $\alpha \beta$ algorithm
    1. Let $D = D_{1} \times D_{2} \times ... \times D_{l}$
    2. $\forall t \in I$, insert into $V$ with probability $\alpha + \beta$
    3. $\forall t \in D \setminus I$, insert into $V$ with probability $\beta$
    4. Publish $V, \alpha, \beta, D$
  • Query estimation algorithm
    1. Let $D = D_{1} \times D_{2} \times ... \times D_{l}$
    2. Compute $Q(V) = n_{V}$
    3. Compute $Q(D) = n_{D}$
    4. $\widetilde{Q}(V) = \frac{n_{v} - \beta n_{D}}{\alpha}$
  • (Theorem 4.3) $\alpha \beta$ algorithm is $(d, \gamma)$-private where $d \leq \gamma$ if we choose $\alpha$ and $\beta$ st $\frac{\beta}{\alpha + \beta} \geq \frac{d(1-\gamma)}{\gamma (1-d)}$ and $\alpha + \beta \leq 1- \frac{d}{\gamma}$.
  • (Theorem 4.4) $\alpha \beta$ algorithm is $(\rho, \epsilon)$-useful.
  • One extension of the algorithm described in the paper is where $\alpha \beta$ algorithm is modified so that multiple views can be published over time.