Wednesday 6 August 2014

Paper summary : The Boundary Between Privacy and Utility in Data Publishing by Rastogi et. Al., VLDB '07

  • Definition of privacy : an attackers prior belief of tuple $t$ belonging to private instance $I$ divided by his posterior belief of tuple $t$ belonging to $I$ given the published instance $V$.
  • Prior is given by $Pr[t] \leq k \frac{n}{m}$ where $n = |I|$, $m = |D|$ ($D$ refers to the domain of the attribute values of $I$), and $k$ is some constant. This is a reasonable definition because $I$ contains $n$ tuples out of $m$ possible tuples.
  • Posterior is given by $Pr[t | V] \leq \gamma$ for some $\gamma$.
  • Definition of utility : Utility is the ability to estimate counting queries (example of a counting query- SELECT count(*) WHERE country=``Canada''). Utility is conveyed by the formula : $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ where $n = |I|$, $Q(I)$ refers to a counting query over the instance $I$, $\widetilde{Q}(V)$ refers to a counting query estimate over the published instance $V$ and $\rho$ is some number.
  • One result of the paper is that if $k$ is $\Omega (\sqrt{m})$ (hence $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$, which represents a very powerful attacker with a large prior) then no algorithm can achieve both utility and privacy.
  • If $k$ is $O(1)$ (i.e., $Pr[t] = O(\frac{n}{m})$, which means that the attacker is weaker) then we describe an algorithm that achieves a privacy/utility tradeoff given by the formula $\rho = \sqrt{\frac{k}{\gamma}}$. The idea is that if $\rho$ is small ($\rho$ represents query estimation error, shown in the RHS of the utility formula), then utility is high because query estimation error is low. $k$ conveys prior while $\gamma$ conveys the posterior (definition of privacy).
  • The described anonymization algorithm is a randomized algorithm st for any $Q$ (query), the probability that the algorithm outputs $V$ st $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ is $1-\epsilon$ for some $\epsilon$.
Classification of adversaries
  • For every tuple, the attackers prior is either small or equal to 1.
  • (Defn 2.1) Let $d \in [0,1]$. A d-bounded adversary is one for which $\forall t \in D$ either $Pr[t] \leq d$ or $Pr[t] = 1$. If $Pr$ is tuple independent, it is called a d-independent adversary.
  • For a d-independent adversary, there is no correlation amongst tuples though there may be correlation among attributes.
  • (Defn 2.3) An algorithm is (d-$\gamma$)-private for all d-independent adversaries if (i) $\frac{d}{\gamma} \leq \frac{Pr[t|V]}{Pr[t]}$ and (ii) $Pr[t|V] \leq \gamma$
  • If $t$ fails condition (i), then there is a negative leakage (i.e., adversary learns more about the fact that tuple $t$ is not in $I$), while if it fails condition (ii), this is a positive leakage (i.e., adversary learns more about the fact that the tuple $t$ is in $I$).
  • (Defn 2.5) A randomized algorithm is called $(\rho, \epsilon)$-useful if it has an estimator (i.e., $\widetilde{Q}$) st for any $Q$ and $I$: $Pr[|Q(I) - \widetilde{Q}(V)| \geq \rho \sqrt{n}] \leq \epsilon$.
Impossibility result
  • Prove that if the attackers prior is large (i.e., $k=\Omega(\sqrt{m})$ thus, $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$ thus $d= \Omega(\frac{n}{\sqrt{m}})$ by defn 2.1), then no algorithm can achieve both privacy and utility.
  • We first show that if $d = \Omega(\frac{n}{\sqrt{m}})$, no $(d, \gamma)$-private algorithm can achieve even little utility.
  • (Defn 3.1) Statistical difference between 2 distributions over a domain $X$ is : $SD(Pr_{A}, Pr_{B}) = \sum_{x \in X}{|Pr_{A}(x) - Pr_{B}(x)|}$.
  • Connection between SD and utility :
    • Let $Q$ be a large query (i.e., returns a sizable fraction if executed on the domain).
    • When we execute $\widetilde{Q}$, we get errors which depend on $n=|I|$.
    • If there is utility in algorithm $A$, the users should be able to distinguish between when $Q(I) = n$ and when $Q(I) = 0$. This means we should be able to differentiate between $Pr_{A}(V) = Pr[V | E_{Q}]$ and $Pr_{B}(V) = Pr[V | E_{Q}']$, where $E_{Q}$ refers to the event where $Q(I) = n$ while $E_{Q}'$ refers to the event where $Q(I) = 0$.
    • Intuitively, $SD(Pr_{A}, Pr_{B})$ should be large in order to differentiate $Pr_{A}$ and $Pr_{B}$. This would mean that utility exists in algorithm A, hence we can get a reasonable estimate for $Q$ by running $\widetilde{Q}$ on $V$.
    • An algorithm is considered meaningless if $SD(Pr_{A}, Pr_{B})$ is smaller than 0.5 for 2/3 of the queries that a user executes.
  • (Theorem 3.3) For all $\gamma < 1$, there exists a constant $c$ (independent of $m,n, \gamma$) st algorithm $A$ is not $(d, \gamma)$-private for any $d \geq \frac{1}{c} \frac{\gamma}{1- \gamma} \frac{n}{\sqrt{m}}$ (i.e., no meaninigful algorithm can offer $(k \frac{n}{m}, \gamma)$-privacy for $k=\Omega(\sqrt{m})$).
  • But then absence of $(d, \gamma)$-privacy means that there exists some d-independent adversary for which the following happens for some tuple $t$ : either (i) positive leakage (i.e., $Pr[t] \leq d$ but $Pr[t|V] \geq \gamma$) or (ii) negative leakage (i.e., $Pr[t|V] << Pr[t]$). Hence, we have proven the impossibility result.
  • Let us assume that $k=O(1)$. Now we can develop an algorithm to publish $V$ from $I$ so that $V$ has some utility.
  • $\alpha \beta$ algorithm
    1. Let $D = D_{1} \times D_{2} \times ... \times D_{l}$
    2. $\forall t \in I$, insert into $V$ with probability $\alpha + \beta$
    3. $\forall t \in D \setminus I$, insert into $V$ with probability $\beta$
    4. Publish $V, \alpha, \beta, D$
  • Query estimation algorithm
    1. Let $D = D_{1} \times D_{2} \times ... \times D_{l}$
    2. Compute $Q(V) = n_{V}$
    3. Compute $Q(D) = n_{D}$
    4. $\widetilde{Q}(V) = \frac{n_{v} - \beta n_{D}}{\alpha}$
  • (Theorem 4.3) $\alpha \beta$ algorithm is $(d, \gamma)$-private where $d \leq \gamma$ if we choose $\alpha$ and $\beta$ st $\frac{\beta}{\alpha + \beta} \geq \frac{d(1-\gamma)}{\gamma (1-d)}$ and $\alpha + \beta \leq 1- \frac{d}{\gamma}$.
  • (Theorem 4.4) $\alpha \beta$ algorithm is $(\rho, \epsilon)$-useful.
  • One extension of the algorithm described in the paper is where $\alpha \beta$ algorithm is modified so that multiple views can be published over time.

No comments:

Post a Comment