Friday 7 November 2014

New blog

Please visit dhruvgairola.com to view my latest blog posts. I am no longer using Blogger as my blog hosting platform.

Sunday 2 November 2014

Google removes my app from the Play Store.

My free app for Scrabble players, which had about 35,000 downloads, was recently removed from the Google Play Store. Initially, Google sent me the following warning email :
This is a notification that your application, Scrabble Game, with package ID com.dhruvg.apps.bingledemo, is currently in violation of our developer terms.
REASON FOR WARNING: Violation of the spam provisions of the Content Policy. Please refer to the spam policy help article for more information.
  • Your title and/or description attempts to impersonate or leverage another popular product without permission. Please remove all such references. Do not use irrelevant, misleading, or excessive keywords in apps descriptions, titles, or metadata.
Please refer to the keyword spam policy help article for more information.
Your application will be removed if you do not make modifications to your application’s description to bring it into compliance within 7 days of the issuance of this notification. If you have additional applications in your catalog, please also review them for compliance. Note that any remaining applications found to be in violation will be removed from the Google Play Store.
Please also consult the Policy and Best Practices and the Developer Distribution Agreement as you bring your applications into compliance. You can also review this Google Play Help Center article for more information on this warning.
All violations are tracked. Serious or repeated violations of any nature will result in the termination of your developer account, and investigation and possible termination of related Google accounts.
I asked them for an explanation of the violating keywords. I suspected it might be the "Scrabble" keyword. However, my app is built only for tournament-level Scrabble players. It is very specialized, and removing that keyword would probably result in very few downloads. Regardless, I removed all but one reference to the "Scrabble" keyword. However, this wasn't enough :
This is a notification that your application, Scrabble Bingo Game, with package ID com.dhruvg.apps.bingledemo, has been removed from the Google Play Store.
REASON FOR REMOVAL: Violation of the spam provisions of the Content Policy. Please refer to the keyword spam policy help article for more information.
  • Your title and/or description attempts to impersonate or leverage another popular product without permission. Please remove all such references. Do not use irrelevant, misleading, or excessive keywords in apps descriptions, titles, or metadata.
This particular app has been disabled as a policy strike. If your developer account is still in good standing, you may revise and upload a policy compliant version of this application as a new package name.
This notification also serves as notice for remaining, unsuspended violations in your catalog, and you may avoid further app suspensions by immediately unpublishing any apps in violation of (but not limited to) the above policy. Once you have resolved any existing violations, you may republish the app(s) at will. Before publishing applications, please ensure your apps’ compliance with the Developer Distribution Agreement and Content Policy.
All violations are tracked. Serious or repeated violations of any nature will result in the termination of your developer account, and investigation and possible termination of related Google accounts. If your account is terminated, payments will cease and Google may recover the proceeds of any past sales and the cost of any associated fees (such as chargebacks and payment transaction fees) from you.
If you feel we have made this determination in error, you can visit the Google Play Help Center article for additional information regarding this removal.
My appeal :
My app has been in the store for 2 years now, and I have so far complied
with all requirements. It is a free app (I earn no money from ads) and the
app was built by me alone. I only made it to help competitive
tournament-level Scrabble players improve their game. The app is innately
tied to the game of Scrabble, hence the "Scrabble" keyword is necessary to
promote the game, because it is only directed at improving a player's
Scrabble vocabulary via 7-letter word scores.
I implore you to reconsider and reinstate my app on the Play Store. At least
advise me on the specific keyword violation so I can remove it. I have no
intention of spamming users. My game is only built for Scrabble players, so
how can I remove the Scrabble keyword from the game title? I would
be grateful for any help and guidance and am willing to cooperate fully.
And the automated response :
Hi,
We have reviewed your appeal and will not be reinstating your app. This decision is final and we will not be responding to any additional emails regarding this removal.
If your account is still in good standing and the nature of your app allows for republishing you may consider releasing a new, policy compliant version of your app to Google Play under a new package name. We are unable to comment further on the specific policy basis for this removal or provide guidance on bringing future versions of your app into policy compliance. Instead, please reference the REASON FOR REMOVAL in the initial notification email from Google Play.
Please note that additional violations may result in a suspension of your Google Play Developer account.

Lessons learnt :
- Google has terrible customer service. There is no way for me to talk to a human being about my problem.
- All information about downloads, ratings, etc. is lost and, apparently, this decision is irreversible. I will avoid developing apps for Android (though I highly doubt iOS is any better in these types of situations).
- This is my last blog post on Google's blogger platform. I think it's time for me to avoid relying on just a single platform/provider/company. Mostly because of this line -
"Serious or repeated violations of any nature will result in the termination of your developer account, and investigation and possible termination of related Google accounts."
- All future blog posts will be hosted on Github pages instead.

Friday 15 August 2014

Paper summary : Injecting Utility into Anonymized Datasets by Kifer et al., SIGMOD '06


Summary
  • In this paper, we will introduce a formal approach to measuring utility. Using this measure, we will show how to inject utility into k-anonymous and l-diverse tables, while maintaining the same level of privacy.
Introduction
  • k-anonymity and l-diversity rely on generalizations to preserve privacy.
  • In the real world, many attributes often need to be suppressed in order to guarantee privacy, which is bad for utility no matter what operations are being performed on the data.
  • One solution is to publish marginals (i.e., contingency tables for a subset of the attributes) along with the original anonymized data. This would require anonymizing the marginals too (also via generalizations) in order to preserve privacy.
  • However, there are many possible subsets of attributes (marginals) for which contingency tables can be built. How do we decide which particular collection of marginals to publish?
Preliminaries
  • (Defn 2.3) K-anonymity : Table $D$ satisfies k-anonymity if $\forall t \in D$, there are at least $k-1$ other tuples that have the same values as $t$ for every QI (quasi-identifier attribute). Note that we assume that the set of all non-sensitive attributes forms the QI (see the sketch after this list).
  • (Defn 2.4) Anonymized group : An anonymized group is a setwise maximal set of tuples that have the same (generalized) value for each non-sensitive attribute.
  • (Defn 2.5) (c,l)-diversity : Let $c>0$ be a constant and $q$ be an anonymized group. Let $S$ be a sensitive attribute and $s_{1},..., s_{m}$ be the values of $S$ that appear in $q$ and let $r_{1},..., r_{m}$ be their frequency counts. Let $r_{(1)},...,r_{(m)}$ be those counts sorted in a descending order. We say $q$ satisfies (c,l)-diversity wrt $S$ if $r_{(1)} \leq c \sum_{i=l}^{m}r_{(i)}$.
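As a concrete illustration of Defn 2.3 (my own sketch, not from the paper), here is a minimal k-anonymity check in Python, assuming each tuple is represented as a dict and the QI is supplied explicitly as a list of attribute names :

```python
from collections import Counter

def is_k_anonymous(table, qi_attrs, k):
    """Defn 2.3: every tuple shares its QI values with at least k-1 other tuples."""
    # Count the tuples in each QI equivalence class.
    class_sizes = Counter(tuple(t[a] for a in qi_attrs) for t in table)
    return all(size >= k for size in class_sizes.values())

# Toy example with QI = (age, zip) and sensitive attribute "disease".
table = [
    {"age": "[46-50]", "zip": "480**", "disease": "flu"},
    {"age": "[46-50]", "zip": "480**", "disease": "cold"},
    {"age": "[50-55]", "zip": "481**", "disease": "flu"},
    {"age": "[50-55]", "zip": "481**", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], k=2))  # True
```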
Existing utility measures
  • Generalization height is one utility measure.
  • Another measure is discernability, which assigns a cost to each tuple based on how many other tuples are indistinguishable from it. It is the sum of the squared sizes of the anonymized groups, plus $|D|$ times the number of suppressed tuples.
  • Both of the above measures do not consider the distributions of the tuples.
  • A third measure is the classification metric, appropriate when one wants to train a classifier over the anonymized data. Thus, one attribute is treated as a class label. This metric assigns a penalty of 1 for every suppressed tuple. If a tuple $t$ is not suppressed, it looks at the majority class label of $t$'s anonymized group; if the class label of $t$ differs from the majority, a penalty of 1 is assigned. The metric is the sum of all penalties (see the sketch after this list). But it is not clear what happens if one wants to build classifiers for several different attributes.
  • A fourth measure is the information to privacy loss ratio, also designed for classification. However, it suffers from the same weakness as the classification metric.
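For reference, here is a small sketch (mine, not the paper's) of the discernability and classification metrics described above, assuming the anonymized groups are given as lists of tuples (dicts) and the number of suppressed tuples is tracked separately :

```python
from collections import Counter

def discernability(groups, num_suppressed, table_size):
    """Sum of squared anonymized-group sizes, plus |D| for each suppressed tuple."""
    return sum(len(g) ** 2 for g in groups) + table_size * num_suppressed

def classification_metric(groups, num_suppressed, label_attr):
    """Penalty of 1 per suppressed tuple, plus 1 per tuple whose class label
    differs from the majority label of its anonymized group."""
    penalty = num_suppressed
    for g in groups:
        majority, _ = Counter(t[label_attr] for t in g).most_common(1)[0]
        penalty += sum(1 for t in g if t[label_attr] != majority)
    return penalty
```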
Proposed utility measure
  • We view the data as an iid sample generated from some distribution $F$.
  • Suppose tuples in our table have (discrete valued) attributes $U_{1},..., U_{n}$. Then we can estimate $F$ using $\hat{F_{1}}$ where $\hat{F_{1}}$ corresponds to $P(t.U_{1}=u_{1},...,t.U_{n} = u_{n})$ where $t.U_{1}$ refers to the attribute value $U_{1}$ for tuple $t$.
  • Now suppose we are given anonymized marginals (e.g., 23% of tuples have the age attribute appearing between [46-50] years old while 77% appear between [50-55]). We can view 23% and 77% as constraints i.e., the marginals represent constraints. We can compute the (maximum entropy) probability distribution that corresponds to these constraints, $\hat{F_{2}}$.
  • (It turns out that the maximum entropy is also the maximum likelihood estimate associated with log linear models.)
  • We now have $\hat{F_{1}}$ associated with the original data and $\hat{F_{2}}$ associated with the anonymized marginals. We can compare them using the standard KL (Kullback-Leibler) divergence, which is minimized when $\hat{F_{1}} = \hat{F_{2}}$.
  • Since our goal is to determine which anonymized marginals to publish, we want to minimize KL-divergence between the various possible $\hat{F_{2}}$ and a fixed $\hat{F_{1}}$.
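Here is a minimal sketch (mine, not the paper's) of the KL-divergence comparison, assuming $\hat{F_{1}}$ and $\hat{F_{2}}$ are represented as dicts mapping attribute-value combinations to probabilities; the maximum-entropy fitting step itself is omitted :

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared discrete domain; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# F1_hat: empirical joint distribution estimated from the original table.
# F2_hat: maximum-entropy distribution consistent with a candidate set of marginals.
F1_hat = {("[46-50]", "flu"): 0.23, ("[50-55]", "flu"): 0.30, ("[50-55]", "cold"): 0.47}
F2_hat = {("[46-50]", "flu"): 0.23, ("[50-55]", "flu"): 0.385, ("[50-55]", "cold"): 0.385}
print(kl_divergence(F1_hat, F2_hat))  # lower is better; publish the marginals minimizing this
```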
Extending privacy definitions
  • We can extend k-anonymity and l-diversity to collections of anonymized marginals.
  • (Defn 4.1) k-link anonymity : A collection of anonymized marginals $M_{1},...,M_{r}$ satisfies k-link anonymity if for all $i = 1,...,r$ and for all tuples $t \in NonSensitive(M_{i})$, either $M_{i}(t) = 0$ or $M_{i}(t) \geq k$. $NonSensitive(M_{i})$ refers to the non-sensitive attributes that $M_{i}$ comprises, while $M_{i}(t)$ refers to the number of tuples sharing $t$'s values on those attributes (see the sketch after this list).
  • We must also be sure that an adversary cannot use combinatorial techniques to determine that a tuple with a certain value for its quasi-identifiers exists in the original table and that the number of such tuples is less than $k$.
  • (Defn 4.2) k-combinatorial anonymity : Let $D$ be the domain of the nonsensitive attributes. A collection of anonymized marginals $M_{1},... ,M_{r}$ satisfies k-combinatorial anonymity if for all $t \in D$ one of the following holds:
    1. For all tables $T$ consistent with $M_{1},... ,M_{r}$, $T(t) = 0$
    2. There exists a table $T$ consistent with $M_{1},... ,M_{r}$ st $T(t) \geq k$
  • (Defn 4.3) MaxEnt l-diversity : $M_{1},... ,M_{r}$ satisfy MaxEnt l-diversity if the maximum entropy distribution that is consistent with $M_{1},... ,M_{r}$ satisfies l-diversity.
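Here is a minimal sketch (mine, not the paper's) of the k-link anonymity check from Defn 4.1, assuming each marginal is specified by the list of non-sensitive attributes it covers and is computed directly from the table :

```python
from collections import Counter

def marginal(table, attrs):
    """Contingency table (marginal) over a subset of attributes: value combo -> count."""
    return Counter(tuple(t[a] for a in attrs) for t in table)

def is_k_link_anonymous(table, marginal_attr_sets, k):
    """Defn 4.1 check. Since the marginals here are computed from the table itself,
    zero-count combinations never appear, so the condition reduces to: every value
    combination that occurs in a published marginal occurs at least k times."""
    return all(
        count >= k
        for attrs in marginal_attr_sets
        for count in marginal(table, attrs).values()
    )
```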
Experiments
  • Experiments showed that even a very simple search for anonymized marginals can yield dramatically better utility compared to publishing just a single anonymized table.

Thursday 7 August 2014

Paper summary : The Cost of Privacy by Brickell, KDD '08

Summary
  • Disclosure of the association between QIs (quasi-identifier attributes) and sensitive attributes is known as sensitive attribute disclosure, in contrast to membership disclosure, which is concerned with identifying whether an individual is in a table at all.
  • Goal of the paper is to evaluate the tradeoff between the incremental gain (by generalization and suppression) in data mining utility and the degradation in privacy caused by publishing QI attributes with sensitive attributes (by the very same generalization and suppression).
  • Privacy loss is the increase in adversary's ability to learn sensitive attributes corresponding to a given identity.
  • Utility gain is the increase in accuracy of machine learning (ml) tasks evaluated on the sanitized dataset.
Notations
  • Tuples $t_{i}$ and $t_{j}$ are quasi-equivalent if $t_{i}[Q] = t_{j}[Q]$ for the QI attribute $Q$. We can thus partition the original table $T$ into QI classes (equivalence classes), denoted as $<t_{j}>$.
  • Let $\epsilon_{Q} \subseteq T$ be a set of representative records for each equivalence class in $T$.
  • Consider a subset of tuples $U = \{u_{1}, u_{2}, ..., u_{p}\} \subseteq T$. For any sensitive attribute value $s$, $U_{s} = \{u \in U | u[S] = s\}$. We define $p(U,s) = \frac{|U_{s}|}{|U|}$.
  • Semantic privacy is concerned with the adversary's prior/posterior knowledge. Anonymization is usually performed by randomly perturbing attribute values.
  • In contrast, syntactic privacy (microdata privacy) is only concerned about the distribution of attribute values in the sanitized table without any care about the adversary's knowledge. Privacy here is performed by generalization and suppression.
  • This paper is only concerned with semantic privacy because of the numerous weaknesses of syntactic privacy (weaknesses of k-anonymity, l-diversity, t-closeness were presented by the authors).
Attack model
  • Adversary only sees $T'$ (sanitized table).
  • We assume that generalization and suppression are only applied to QI attributes, not to sensitive attributes. Why? To keep sanitized data ``truthful'' [31,34] (this reasoning doesn't make much sense to me).
  • We define adversary's baseline knowledge, $A_{base}$, which is the minimum information he/she can learn from trivial santization (publishing QIs and sensitive attributes in two separate tables). Formally, $A_{base}$ is a prob. density function of the sensitive attribute values in $T$ i.e., $A_{base} = <p(T, s_{1}), p(T,s_{2}), ... , p(T, s_{l})>$.
  • Adversary's posterior knowledge is denoted by $A_{san}$ i.e., what he/she learns from $T'$ about sensitive attributes of some target $t \in T$. The adversary can identify the equiv. class for $t$ (i.e., $<t>$) from $T'$ since the related generalization hierarchy is totally ordered. Thus, formally, $A_{san} = <p(<t>, s_{1}), p(<t>, s_{2}), ..., p(<t>, s_{l})>$.
  • We can now define sensitive attribute disclosure : $A_{diff}(<t>) = \frac{1}{2} \sum_{i=1}^{l}|p(T, s_{i}) - p(<t>, s_{i})|$ and $A_{quot}(<t>) = |\frac{p(<t>, s)}{p(T, s)}|$. Basically, these values quantify how much more the adversary learns from observing sanitized QI attributes than he would have learnt from trivial sanitization.
Semantic Privacy
  • (Defn 1) $\delta$-disclosure privacy : $<t>$ is $\delta$-disclosure private wrt sensitive attribute $S$ if, for all $s \in S$, $A_{quot}(<t>) = |\log \frac{p(<t>, s)}{p(T,s)}| < \delta$. A table is $\delta$-disclosure private if $\forall t \in \epsilon_{Q}$, $<t>$ is $\delta$-disclosure private (intuitively, this means that the distribution of sensitive attribute values within each QI class is roughly the same as in the table $T$ as a whole).
  • Defn 1 allows us to relate $\delta$ with the information gain parameter (defined as $GAIN(S,Q) = H(S) - H(S|Q)$, as used by decision tree classifiers such as ID3 and C4.5). Refer to lemma 1 below.
  • (Lemma 1) If $T$ satisfies $\delta$-disclosure privacy, then $GAIN(S,Q) < \delta$ (proof is easy enough to understand).
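For illustration, here is a small sketch (not from the paper) that estimates $GAIN(S,Q)$ empirically from a table of dicts; I use natural logarithms here, though the base is only a convention :

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Empirical Shannon entropy of a list of attribute values (natural log)."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def gain(table, sensitive, qi_attrs):
    """GAIN(S, Q) = H(S) - H(S | Q), estimated from the table itself."""
    n = len(table)
    groups = defaultdict(list)
    for t in table:
        groups[tuple(t[a] for a in qi_attrs)].append(t[sensitive])
    h_s = entropy([t[sensitive] for t in table])
    h_s_given_q = sum(len(vals) / n * entropy(vals) for vals in groups.values())
    return h_s - h_s_given_q
```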
Syntactic Privacy
  • These privacy measures refer to k-anonymity (i.e., $\forall t_{j} \in \epsilon_{Q}, |<t_{j}>| \geq k$), l-diversity, (c, l)-diversity and t-closeness. These measures have many weaknesses. Ultimately, the authors are only concerned with semantic privacy and not with syntactic privacy.
Measuring privacy
  • We have a bound on disclosure based on definition 1. However, definition 1 is insufficient for expressing privacy because actual table instances may have less sensitive attribute disclosure (thus more privacy) than permitted by the definition.
  • Thus, we define sensitive attribute disclosure as : $A_{know} = \frac{1}{|T|} \sum_{t \in \epsilon_{Q}} |<t>| \cdot A_{diff}(<t>)$ where $know$ refers to knowledge gain.
  • Another metric quantifies the attacker's ability to guess $t$'s sensitive attribute using a greedy strategy (to guess the most common sensitive attribute value in $<t>$). For $<t>$, let $s_{max}(<t>)$ be the most common sensitive attribute value in $<t>$. Thus, $A_{acc} = \frac{1}{|T|} \sum_{t \in \epsilon_{Q}}(|<t>| \cdot p(<t>, s_{max}(<t>))) - p(T, s_{max}(T))$ where $acc$ stands for adversarial accuracy gain.
  • $A_{acc}$ underestimates the amount of info leaked by $T'$ because it doesn't consider the shifts in the probability of non-majority sensitive attribute values.
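The following sketch (mine, not the paper's) computes $A_{diff}$ and $A_{know}$ given a partition of the table into QI classes, again with tuples represented as dicts :

```python
def p(tuples, sensitive, s):
    """p(U, s): fraction of tuples in U whose sensitive attribute equals s."""
    return sum(1 for t in tuples if t[sensitive] == s) / len(tuples)

def a_diff(group, table, sensitive, values):
    """Half the L1 distance between the group's and the table's sensitive distributions."""
    return 0.5 * sum(abs(p(table, sensitive, s) - p(group, sensitive, s)) for s in values)

def a_know(groups, table, sensitive, values):
    """A_know: group-size-weighted average of A_diff over all QI classes."""
    return sum(len(g) * a_diff(g, table, sensitive, values) for g in groups) / len(table)
```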
Utility
  • One approach to measuring utility is to minimize the amount of generalization and suppression to QI attributes [10].
  • Utility measurement is innately tied to the computations that are to be performed on $T'$. We are concerned with measuring the utility of classification tasks.
  • Intuitively, the utility of $T'$ should be measured by how well cross attribute correlations are preserved after sanitization.
  • In our context, we are using the semantic definition of privacy and semantic definition of utility and wish to evaluate the tradeoff between them.
  • For some workload $w$, we measure workload specific utility of trivially sanitized datasets (where privacy is maximum), denoted by $U_{base}^{w}$. One example for utility is the accuracy of the classifier.
  • We then compute $U_{san}^{w}$ for $T'$.
  • $U_{san}^{w} - U_{base}^{w}$ is the utility gain. If this value is close to 0, sanitization is pointless (i.e. trivial sanitization provides as much utility as any sophisticated sanitization algorithm while providing as much privacy as possible).
  • Another utility measure is $U_{max}^{w}$ on the original dataset $T$. If this value is low, the workload is inappropriate for the data regardless of sanitization because even if users were given $T$, utility would have been low.
Experiments
  • Experiments suggest that trivial sanitization produces equivalent utility and better privacy than non-trivial generalization and suppression.
  • The authors used the generalization and suppression implementation from LeFevre [20]. Weka was used for classification.
  • Of course if sanitization has been correctly performed, it is impossible to build a good classifier for $S$ (sensitive attribute) based only on $Q$ (QI attribute) because good sanitization must destroy any correlation between $S$ and $Q$.
  • Perhaps it is not surprising that sanitization makes it difficult to build an accurate classifier for $S$. But experiments on the UCI dataset show that sanitization also destroys correlation between $S$ and neutral attributes.
  • Experiments indicate that it is difficult to find DBs on which sanitization permits both privacy and utility.
  • However, we can construct artificial DBs for specific scenarios that permit both privacy and accuracy (the example provided is very contrived).

Wednesday 6 August 2014

Paper summary : The Boundary Between Privacy and Utility in Data Publishing by Rastogi et al., VLDB '07


Summary
  • Definition of privacy : an attacker's prior belief of tuple $t$ belonging to private instance $I$ divided by his posterior belief of tuple $t$ belonging to $I$ given the published instance $V$.
  • Prior is given by $Pr[t] \leq k \frac{n}{m}$ where $n = |I|$, $m = |D|$ ($D$ refers to the domain of the attribute values of $I$), and $k$ is some constant. This is a reasonable definition because $I$ contains $n$ tuples out of $m$ possible tuples.
  • Posterior is given by $Pr[t | V] \leq \gamma$ for some $\gamma$.
  • Definition of utility : Utility is the ability to estimate counting queries (example of a counting query- SELECT count(*) WHERE country=``Canada''). Utility is conveyed by the formula : $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ where $n = |I|$, $Q(I)$ refers to a counting query over the instance $I$, $\widetilde{Q}(V)$ refers to a counting query estimate over the published instance $V$ and $\rho$ is some number.
  • One result of the paper is that if $k$ is $\Omega (\sqrt{m})$ (hence $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$, which represents a very powerful attacker with a large prior) then no algorithm can achieve both utility and privacy.
  • If $k$ is $O(1)$ (i.e., $Pr[t] = O(\frac{n}{m})$, which means that the attacker is weaker) then we describe an algorithm that achieves a privacy/utility tradeoff given by the formula $\rho = \sqrt{\frac{k}{\gamma}}$. The idea is that if $\rho$ is small ($\rho$ represents query estimation error, shown in the RHS of the utility formula), then utility is high because query estimation error is low. $k$ conveys prior while $\gamma$ conveys the posterior (definition of privacy).
  • The described anonymization algorithm is a randomized algorithm st for any $Q$ (query), the probability that the algorithm outputs $V$ st $Q(I) - \widetilde{Q}(V) \leq \rho \sqrt{n}$ is $1-\epsilon$ for some $\epsilon$.
Classification of adversaries
  • For every tuple, the attacker's prior is either small or equal to 1.
  • (Defn 2.1) Let $d \in [0,1]$. A d-bounded adversary is one for which $\forall t \in D$ either $Pr[t] \leq d$ or $Pr[t] = 1$. If $Pr$ is tuple independent, it is called a d-independent adversary.
  • For a d-independent adversary, there is no correlation amongst tuples though there may be correlation among attributes.
  • (Defn 2.3) An algorithm is (d-$\gamma$)-private for all d-independent adversaries if (i) $\frac{d}{\gamma} \leq \frac{Pr[t|V]}{Pr[t]}$ and (ii) $Pr[t|V] \leq \gamma$
  • If $t$ fails condition (i), then there is a negative leakage (i.e., adversary learns more about the fact that tuple $t$ is not in $I$), while if it fails condition (ii), this is a positive leakage (i.e., adversary learns more about the fact that the tuple $t$ is in $I$).
  • (Defn 2.5) A randomized algorithm is called $(\rho, \epsilon)$-useful if it has an estimator (i.e., $\widetilde{Q}$) st for any $Q$ and $I$: $Pr[|Q(I) - \widetilde{Q}(V)| \geq \rho \sqrt{n}] \leq \epsilon$.
Impossibility result
  • We prove that if the attacker's prior is large (i.e., $k=\Omega(\sqrt{m})$, thus $Pr[t] = \Omega(\frac{n}{\sqrt{m}})$ and thus $d= \Omega(\frac{n}{\sqrt{m}})$ by defn 2.1), then no algorithm can achieve both privacy and utility.
  • We first show that if $d = \Omega(\frac{n}{\sqrt{m}})$, no $(d, \gamma)$-private algorithm can achieve even little utility.
  • (Defn 3.1) Statistical difference between 2 distributions over a domain $X$ is : $SD(Pr_{A}, Pr_{B}) = \sum_{x \in X}{|Pr_{A}(x) - Pr_{B}(x)|}$.
  • Connection between SD and utility :
    • Let $Q$ be a large query (i.e., returns a sizable fraction if executed on the domain).
    • When we execute $\widetilde{Q}$, we get errors which depend on $n=|I|$.
    • If there is utility in algorithm $A$, the users should be able to distinguish between when $Q(I) = n$ and when $Q(I) = 0$. This means we should be able to differentiate between $Pr_{A}(V) = Pr[V | E_{Q}]$ and $Pr_{B}(V) = Pr[V | E_{Q}']$, where $E_{Q}$ refers to the event where $Q(I) = n$ while $E_{Q}'$ refers to the event where $Q(I) = 0$.
    • Intuitively, $SD(Pr_{A}, Pr_{B})$ should be large in order to differentiate $Pr_{A}$ and $Pr_{B}$. This would mean that utility exists in algorithm A, hence we can get a reasonable estimate for $Q$ by running $\widetilde{Q}$ on $V$.
    • An algorithm is considered meaningless if $SD(Pr_{A}, Pr_{B})$ is smaller than 0.5 for 2/3 of the queries that a user executes.
  • (Theorem 3.3) For all $\gamma < 1$, there exists a constant $c$ (independent of $m, n, \gamma$) st algorithm $A$ is not $(d, \gamma)$-private for any $d \geq \frac{1}{c} \frac{\gamma}{1- \gamma} \frac{n}{\sqrt{m}}$ (i.e., no meaningful algorithm can offer $(k \frac{n}{m}, \gamma)$-privacy for $k=\Omega(\sqrt{m})$).
  • But then absence of $(d, \gamma)$-privacy means that there exists some d-independent adversary for which the following happens for some tuple $t$ : either (i) positive leakage (i.e., $Pr[t] \leq d$ but $Pr[t|V] \geq \gamma$) or (ii) negative leakage (i.e., $Pr[t|V] << Pr[t]$). Hence, we have proven the impossibility result.
Algorithm
  • Let us assume that $k=O(1)$. Now we can develop an algorithm to publish $V$ from $I$ so that $V$ has some utility.
  • $\alpha \beta$ algorithm (see the sketch after this list)
    1. Let $D = D_{1} \times D_{2} \times ... \times D_{l}$
    2. $\forall t \in I$, insert into $V$ with probability $\alpha + \beta$
    3. $\forall t \in D \setminus I$, insert into $V$ with probability $\beta$
    4. Publish $V, \alpha, \beta, D$
  • Query estimation algorithm
    1. Let $D = D_{1} \times D_{2} \times ... \times D_{l}$
    2. Compute $Q(V) = n_{V}$
    3. Compute $Q(D) = n_{D}$
    4. $\widetilde{Q}(V) = \frac{n_{V} - \beta n_{D}}{\alpha}$
  • (Theorem 4.3) $\alpha \beta$ algorithm is $(d, \gamma)$-private where $d \leq \gamma$ if we choose $\alpha$ and $\beta$ st $\frac{\beta}{\alpha + \beta} \geq \frac{d(1-\gamma)}{\gamma (1-d)}$ and $\alpha + \beta \leq 1- \frac{d}{\gamma}$.
  • (Theorem 4.4) $\alpha \beta$ algorithm is $(\rho, \epsilon)$-useful.
  • One extension of the algorithm described in the paper is where $\alpha \beta$ algorithm is modified so that multiple views can be published over time.
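Here is a toy sketch (mine, not the paper's pseudocode) of the $\alpha \beta$ algorithm and its counting-query estimator, assuming the domain $D$ is small enough to enumerate, tuples are hashable, and a query is given as a boolean predicate :

```python
import random

def alpha_beta_publish(I, D, alpha, beta):
    """Keep each tuple of I with probability alpha+beta; add each tuple of
    D \\ I with probability beta. Publish V (alpha, beta and D are public)."""
    I_set = set(I)
    V = [t for t in I if random.random() < alpha + beta]
    V += [t for t in D if t not in I_set and random.random() < beta]
    return V

def estimate_count(q, V, D, alpha, beta):
    """Estimate Q(I) from the published view: (Q(V) - beta * Q(D)) / alpha."""
    n_V = sum(1 for t in V if q(t))
    n_D = sum(1 for t in D if q(t))
    return (n_V - beta * n_D) / alpha

# Example: D = all (country, age-group) pairs, I = the private instance.
D = [(c, a) for c in ["Canada", "USA", "India"] for a in ["18-30", "31-50", "51+"]]
I = [("Canada", "18-30"), ("Canada", "31-50"), ("India", "51+")]
V = alpha_beta_publish(I, D, alpha=0.5, beta=0.2)
print(estimate_count(lambda t: t[0] == "Canada", V, D, alpha=0.5, beta=0.2))
```

Intuitively, $E[n_{V}] = \alpha Q(I) + \beta Q(D)$ for any counting query, which is why the estimator subtracts $\beta n_{D}$ and divides by $\alpha$.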



Thursday 24 July 2014

Paper summary : Authorization-Transparent Access Control for XML under the Non-Truman Model by Kanza et al., EDBT '06


  • Truman model - queries are silently modified so that they operate only over authorized data.
  • Non-Truman model - queries that touch unauthorized data are rejected outright.
  • Security policies specified using XPath. e.g., for /Dept[Name=CS]/Course exclude //Grade
  • Not only XML elements but edges and paths can also be concealed.
  • Query is locally valid for dataset $D$ and set of rules $R$ if it conceals all relationships of $R$. Query is globally valid for any $D$ which conforms to some schema $S$ if $R$ is concealed for all $D$.
  • Document model $D = (X, E_{D}, root_{D}, labelOf_{D}, valueOf_{D})$. $X$ refers to nodes (nodes contain metadata like ids), $E_{D}$ refers to elements of the XML (i.e., attribute values; in reality this is metadata of a node). $root_{D}$ is the root node, $labelOf_{D}$ is a function $f: X \to E_{D}$ and $valueOf_{D}$ is a function $f: X \to A$ where $A$ is a set of element values of atomic nodes (nodes with no outgoing edges).
  • k-concealment of relationship $(A,B)$ for some elements $A, B$ in $D$ : given $b \in B$, we want a subset $A_{k} \subseteq A$ of $k$ elements st (i) no element in $A_{k}$ is an ancestor of another and (ii) the user cannot infer which of the $k$ elements is an ancestor of $b$.
  • Expansion of $D$ is $D''$, created by replacing $E_{D}$ with $E'$ (children) and $E''_{D}$ (descendants).
  • A transitive closure $\bar{D}$ is an expansion of $D$ such that there is an edge between every two nodes connected by a directed path in $D$, and an edge from every node to itself (see the sketch after this list).
  • An XPath query is equivalent to evaluating the query over the transitive closure of the document.
  • To prune a document expansion is to remove from $D''$ all edges that connect restricted nodes.
  • Universe of expansions of $D$ is the set of all $D''$ st $prune(\bar{D}$) = prune($D'')$. Denote this by $U_{R}(D)$.
  • (Local validity) Query $Q$ is locally valid if $Q(D) = Q(D''), \forall D'' \in U_{R}(D)$.
  • (Pseudo validity) Query $Q$ is pseudo valid if $Q(D) = Q(prune(\bar{D}))$. You can leak relationships here e.g., singleton source disclosure (example 11).
  • (Global validity) $Q$ is globally valid for $R$ given schema $S$ if $Q$ is locally valid $\forall D$ that conform to $S$. This is more restrictive than local validity but offers benefits (no need to check $Q$ over all $D$ if $Q$ is globally valid, saving computation).
  • $R$ is coherent if it is impossible to infer any hidden relationships from the relationships not pruned by the rules.
  • $R$ has incomplete concealment in $D$ if either (i) $D$ has 3 elements $e1,e2,e3$ st $prune(\bar{D})$ has an edge from $e1$ to $e2$ and $e2$ to $e3$ but none from $e1$ to $e3$ or (ii) $D$ has 3 elements $e1,e2,e3$ st $prune(\bar{D})$ has an edge from $e1$ to $e3$ and $e2$ to $e3$ but none from $e1$ to $e2$.
  • (Coherent) R is coherent if incomplete concealment does not occur in $D$.
  • Encapsulating edge in $\bar{D}$ : $(e1, e2)$ encapsulates $(e1', e2')$ if there is a path running through all of these edges, $e1$ appears before (or at the same spot as) $e1'$, and $e2$ appears after (or at the same spot as) $e2'$.
  • If $R$ is coherent, $\forall (e1,e2)$ in $\bar{D}$ which are removed by pruning, if there is an edge $(e1',e2')$ which is encapsulated by $(e1,e2)$, then $(e1',e2')$ should be removed too.
  • Weakness of the approach so far is singleton source disclosure i.e., $\exists \rho=(x1,x2) \in R$ st $(x1(D), x1x2(D))$ is not 2-concealed ($x1(D)$ means that $x1$ was the query whose answer exists in $D$). Solution : algorithm $computeK$ is provided.
  • (Theorem 1) Algo $computeK$ returns $k$ st (i) all the relationships in $R$ are k-concealed (ii) $\exists r \in R$ which is not $k+1$ concealed.
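To make the expansion idea above concrete, here is a small sketch (not from the paper) that computes the transitive closure $\bar{D}$ of a tree-shaped document given as a parent-to-children adjacency dict; it adds an edge for every ancestor-descendant pair plus a self-loop on every node :

```python
def transitive_closure(children, root):
    """Return the edge set of D-bar: (u, v) for every node v reachable from u
    via a directed path, plus a self-loop (u, u) on every node."""
    def descendants(node):
        desc = set()
        for child in children.get(node, []):
            desc.add(child)
            desc |= descendants(child)
        return desc

    nodes = {root} | {c for cs in children.values() for c in cs}
    closure = set()
    for u in nodes:
        closure.add((u, u))
        for v in descendants(u):
            closure.add((u, v))
    return closure

# Toy document: dept -> {course, staff}, course -> {grade}
doc = {"dept": ["course", "staff"], "course": ["grade"]}
print(sorted(transitive_closure(doc, "dept")))
```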

Crucial 16GB RAM upgrade for my Macbook Pro (Late 2011)

It had become really taxing for me to work with several applications simultaneously on my 4GB Macbook Pro. Eclipse especially consumes an inordinate amount of memory. Hence I decided to purchase a 16GB RAM card on Amazon.

The innards of my Macbook Pro (late 2011)!
The difference between 4GB and 16GB is truly astounding. Now I can actually open up RStudio and Eclipse simultaneously. What a luxury :|

As an aside, it is ridiculous how shipping a RAM card from the US to Canada, even with the shipping charges, saves you a significant amount compared to buying the RAM card here. This is true for almost all products. Why Canada, WHY?!

Thursday 5 June 2014

Reflections on my first year at McMaster

My first year at McMaster University has been relatively quiet. The vast majority of my time was spent working in my modest room near the campus. Engulfed with coursework and research, I wish I had spent more time roaming around Hamilton. Perhaps visiting Hamilton's purportedly "world-famous" waterfalls would have created more meaningful memories.

Still, a great GPA is some consolation. As per my course requirements, I took the following courses : (i) graduate computational complexity, (ii) advanced topics in data management, (iii) algebraic methods in computer science and (iv) software design. The course on graduate computational complexity was mind-bending and really exposed me to new ideas and interesting research directions. Advanced topics in data management focused on data cleaning, data mining and data privacy, a significant overlap with my research. Algebraic methods was assignment-focused and completely math-y, while we had a lot more freedom with the project-based software design course. Overall, I think my course choices were great.

I also audited the graduate algorithms and data structures class and partially audited an introduction to graduate statistics course with the Math department. The algorithms course was fantastic and really a step up from the typical undergrad-level algorithms courses that you find in most universities. Unfortunately, I started auditing the statistics course too late in the term and could not make too much sense of it halfway through. I'll audit it from the beginning in my second year.

It was also my first time TA-ing a course, web systems and web computing, which is offered to 3rd/4th year undergrads and grad students. I actually enjoyed giving the 45 min lectures, covering overviews of topics such as Java web frameworks, front-end frameworks, databases, etc. However, marking student assignments is a real chore.

In terms of hobbies, I started teaching myself how to play the guitar, probably one of the best decisions I've ever made. During the frigid winter months, nothing beats strumming a few tunes. Occasionally, I would go to the MacBeat club meetings and jam with a few guitarists on Thursdays.

Yet, most of my Thursdays were actually channelled towards chess. As a beginner, all those losses to various members of the chess club are hard to bear. I'm trying to learn from them and get better. But chess can get frustrating at times (I hate losing!). It's not just chess; combined with coursework and research, one really needs some sort of an outlet to combat stress. Luckily, I'm swimming regularly...

I've made a few good friends this year, so I'm quite looking forward to spending more time with them next year, and not staying holed up in my room all the time. Still, I wish the grad CS cohort was larger. As it is, it is difficult to find sociable CS students, haha!

Grad life is vastly different from my undergrad days. There is a lot more studying and a lot less socializing. It's also strangely difficult to identify with the undergrads and their lifestyle. I guess I have changed a bit since my undergrad days.

Targets for next year
  • Conduct meaningful research and publish a paper in a top conference or journal.
  • Audit the graduate statistics course with the Math department.
  • Complete reading the books: PRML by Bishop and PGMs by Koller.
  • Complete the MOOCs: Machine Learning by Ng and Learning from Data by Yaser.
  • Focus on my own side-project.
  • Participate in at least 1 Kaggle competition.
  • Explore Hamilton's waterfalls.

Paper summaries : Statecharts and ATL

Here are two paper summaries I had submitted for one of my university courses, Software Design.

"Statecharts : A visual formalism for complex systems." by Harel

The author proposes a modeling formalism called Statecharts to address some of the inherent weaknesses of conventional state diagrams in realistically modeling complex reactive systems (one typical weakness is state space explosion even for relatively simple reactive systems). Statecharts extend conventional state diagrams (thus leveraging the visual power of these conventional diagrams) by incorporating features such as clustering of states, concurrency, fluid movement between different levels of abstraction, modeling of hierarchical information, modeling of activities (i.e., something that takes a nonzero amount of time, like beeping), etc. The Statecharts formalism was exemplified using the Timex stopwatch system. In the industrial setting, Statecharts were successfully used in a large-scale avionics project. The author also proposes further extensions (e.g., probabilistic Statecharts).

The author presented valuable and practical extensions to conventional state diagrams. Though verbose, the paper was accessible. The Timex stopwatch examples were welcome motifs, vastly preferable to a purely theoretical approach in presenting the ideas. Overall, the research is highly original.

"ATL : A model transformation tool." by Jouault et. Al.

ATL is a rule-based domain-specific language (built upon the OCL specification) with declarative and imperative constructs for specifying model-to-model transformations. The transformations are unidirectional, i.e., the source models are read-only while the target models are write-only. The declarative features of ATL (e.g., matched rules, lazy rules, etc.) encapsulate complex algorithms and are thus simple and intuitive to use. However, imperative statements are often essential for complex transformational problems. A number of tools (e.g., editor, compiler, virtual machine, debugger) were developed for ATL on the Eclipse platform. These tools support many of the features highlighted in [1], except for (i) static type checking, (ii) helpers in the context of collection types, and (iii) more strongly typed resolution. Feedback on the use of ATL has been positive, and ATL has been evaluated on more than 100 sites.

The paper itself is quite concise and not deeply technical, or even expansive. Moreover, some sections (e.g., related work) are perfunctory. However, the presented contributions are concrete, the idea is novel and the implementation (of the ATL tools) is praiseworthy (given the popularity of Eclipse). The successful evaluations and positive responses (mentioned in section 6) to the developed ATL tools are indicative of the high quality of the presented work.

Sunday 30 March 2014

Presentation : A Generic Algebraic Model for the Analysis of Cryptographic-Key Assignment Schemes

Here is a short presentation I did for one of my courses, Algebraic Methods in Computer Science and Software Engineering.