## Thursday, 7 August 2014

### Paper summary : The Cost of Privacy by Brickell, KDD '08

Summary
• Association of QIs (quasi identifier attributes) with sensitive attributes is known as sensitive attribute disclosure (when breached) c.f., membership disclosure which is concerned with identifying whether an individual is in a table.
• Goal of the paper is to evaluate the tradeoff bet. incremental gain (by generalization and suppression) in data mining utility and the degradation in privacy caused by publishing QI attributes with sensitive attributes (by the very same generalization and suppression).
• Privacy loss is the increase in adversary's ability to learn sensitive attributes corresponding to a given identity.
• Utility gain is the increase in accuracy of machine learning (ml) tasks evaluated on the sanitized dataset.
Notations
• Tuples $t_{i}$ and $t_{j}$ are quasi-equivalent if $t_{i}[Q] = t_{j}[Q]$ for the QI attribute $Q$. We can thus partition the original table $T$ into QI classes (equivalence classes), denoted as $<t_{j}>$.
• Let $\epsilon_{Q} \subseteq T$ be a set of representative records for each equivalence class in $T$.
• Consider a subset of tuples $U = \{u_{1}, u_{2}, ..., u_{p}\} \subseteq T$. For any sensitive attribute value $s$, $U_{s} = \{u \in U | u[S] = s\}$. We define $p(U,s) = \frac{|U_{s}|}{|U|}$.
• Semantic privacy is concerned with prior/posterior knowledge of adversary. Anonymization is usually performed by randomly perturbing attribute values.
• In contrast, syntactic privacy (microdata privacy) is only concerned about the distribution of attribute values in the sanitized table without any care about the adversary's knowledge. Privacy here is performed by generalization and suppression.
• This paper is only concerned with semantic privacy because of the numerous weaknesses of syntactic privacy (weaknesses of k-anonymity, l-diversity, t-closeness were presented by the authors).
Attack model
• Adversary only sees $T'$ (sanitized table).
• We assume that generalization and suppression is only carried out to QI attibutes, not to sensitive attributes. Why? To keep sanitized data truthful'' [31,34] (this reasoning doesn't make much sense to me).
• We define adversary's baseline knowledge, $A_{base}$, which is the minimum information he/she can learn from trivial santization (publishing QIs and sensitive attributes in two separate tables). Formally, $A_{base}$ is a prob. density function of the sensitive attribute values in $T$ i.e., $A_{base} = <p(T, s_{1}), p(T,s_{2}), ... , p(T, s_{l})>$.
• Adversary's posterior knowledge is denoted by $A_{san}$ i.e., what he/she learns from $T'$ about sensitive attributes of some target $t \in T$. The adversary can identify the equiv. class for $t$ (i.e., $<t>$) from $T'$ since the related generalization hierarchy is totally ordered. Thus, formally, $A_{san} = <p(<t>, s_{1}), p(<t>, s_{2}), ..., p(<t>, s_{l})>$.
• We can now define sensitive attribute disclosure : $A_{diff}(<t>) = \frac{1}{2} \sum_{i=1}^{l}|p(T, s_{i}) - p(<t>, s_{i})|$ $A_{quot}(<t>) = |\frac{p(<t>, s)}{p(T, s)}|$ Basically, these values define how much more the adversary learns from observing santizied QI attributes that he would have learnt from trivial sanitization.
Semantic Privacy
• (Defn 1)$\delta$-disclosure privacy : $<t>$ is $\delta$-disclosure private wrt $S$ (sensitive attribute) for all $s \in S$ if $A_{quot}(<t>) = |\log \frac{p(<t>, s)}{p(T,s)}| < \delta$. A table is $\delta$-disclosure private if $\forall t \in \epsilon_{Q}, <t>$ is $\delta$-disclosure private (intuitively, this means that the distribution of sensitive attribute values within each QI class is roughly the same as the others in $T$).
• Defn 1 allows us to relate $\delta$ with the information gain parameter (defined as $GAIN(S,Q) = H(S) - H(S|Q)$ used by decision tree classifiers such as ID3 and C4.5). Refer to lemma 1 below.
• (Lemma 1) If $T$ satisfies $\delta$-disclosure privacy, then $GAIN(S,Q) < \delta$ (proof is easy enough to understand).
Syntactic Privacy
• These privacy measures refer to k-anonymity (i.e., $\forall t_{j} \in \epsilon_{Q}, |<t_{j}>| \geq k$), l-diversity, (c, l)-diversity and t-closeness. These measures have many weaknesses. Ultimately, the authors are only concerned with sematic privacy and not with syntactic privacy.
Measuring privacy
• We have a bound on disclosure based on definition 1. However, defintion 1 is insufficient for expressing privacy because actual table instances may have less sensitive attribute disclosure (thus more privacy) than permitted by the definition.
• Thus, we define sensitive attribute disclosure as : $A_{know} = \frac{1}{|T|} \sum_{t \in \epsilon_{Q}} |<t>| \cdot A_{diff}(<t>)$ where $know$ refers to knowledge gain.
• Another metric quantifies the attacker's ability to guess $t$'s sensitive attribute using a greedy strategy (to guess the most common sensitive attribute value in $<t>$). For $<t>$, let $s_{max}(<t>)$ be the most common sensitive attribute value in $<t>$. Thus, $A_{acc} = \frac{1}{|T|} \sum_{t \in \epsilon_{Q}}(|<t>| \cdot p(<t>, s_{max}(<t>))) - p(T, s_{max}(T))$ where $acc$ stands for adversarial accuracy gain.
• $A_{acc}$ underestimates the amount of info leaked by $T'$ because it doesn't consider the shifts in the probabilty of non-majority sensitive attribute values.
Utility
• One approach to measuring utility is to minimize the amount of generalization and suppression to QI attributes [10].
• Utility measurement is innately tied to the computations that are to be performed on $T'$. We are concerned with measuring the utility of classification tasks.
• Intuitively, the utility of $T'$ should be measured by how well cross attribute correlations are preserved after sanitization.
• In our context, we are using the semantic definition of privacy and semantic definition of utility and wish to evaluate the tradeoff between them.
• For some workload $w$, we measure workload specific utility of trivially sanitized datasets (where privacy is maximum), denoted by $U_{base}^{w}$. One example for utility is the accuracy of the classifier.
• We then compute $U_{san}^{w}$ for $T'$.
• $U_{san}^{w} - U_{base}^{w}$ is the utility gain. If this value is close to 0, sanitization is pointless (i.e. trivial sanitization provides as much utility as any sophisticated sanitization algorithm while providing as much privacy as possible).
• Another utility measure is $U_{max}^{w}$ on the original dataset $T$. If this value is low, the workload is inappropriate for the data regardless of sanitization because even if users were given $T$, utility would have been low.
Experiments
• Experiments suggest that trivial sanitization produces equivalent utility and better privacy than non-trivial generalization and suppression.
• Authors used implementation of generalization and suppression from LeFerre[20]. Weka was used for classification.
• Of course if sanitization has been correctly performed, it is impossible to build a good classifier for $S$ (sensitive attribute) based only on $Q$ (QI attribute) because good sanitization must destroy any correlation between $S$ and $Q$.
• Perhaps it is not surprising that sanitization makes it difficult to build an accurate classifier for $S$. But experiments on the UCI dataset show that sanitization also destroys correlation between $S$ and neutral attributes.
• Experiments indicate that it is difficult to find DBs on which sanitization permits both privacy and utility.
• However, we can construct artifical DBs for specific scenarios that permits both privacy and accuracy (example provided is very contrived).