Sunday 27 January 2013

Data science updates

I have successfully completed the "Computing for Data Analysis" course on Coursera! Analyzing data sets has never been this fun! The R plotting libraries are pretty cool. I especially like the different data visualization functions. I also have a soft spot for the lapply, sapply and tapply functions; they remind me of the foldl and foldr functions in Haskell.

It is time I got hold of some huge datasets (ones that can't be loaded into memory) and tried working with them in R. To this end, I recently requested the newly released "Click dataset" from Indiana University, which is about 2.5 TB of data when compressed. Unfortunately, they denied my request because I would have to be associated with a research institution, due to the "sensitive nature" of the requested data. I do empathize with them. This is not a problem though; there are plenty of large datasets out there.

I have also enrolled in the "Data analysis" course on Coursera as a follow-up. Here are some of my short-term goals with respect to data analysis in R:
  1. Experiment with the machine learning libraries in R
  2. Participate in a Kaggle competition
  3. Perform object oriented programming in R
  4. Visualize huge datasets
  5. Take more data analysis courses on Coursera

Friday 25 January 2013

Java : Covariant, invariant, reified, erased (Bloch Item 25)

I've compiled a summary of Bloch's item 25 below:

Arrays are covariant, i.e., if Sub is a subtype of Super, then the array type Sub[] is a subtype of Super[]. Generics are invariant: for any two distinct types Type1 and Type2, List<Type1> is neither a subtype nor a supertype of List<Type2> [JLS, 4.10; Naftalin07, 2.5].
Object[] objectArray = new Long[1]; 
objectArray[0] = "I don't fit in"; // Throws ArrayStoreException 
List<Object> ol = new ArrayList<Long>(); // Incompatible types 
ol.add("I don't fit in");// Won't compile!
You can’t put a String into a Long container. With an array you find out that you’ve made a mistake at runtime; with a list, you find out at compile time.
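
The snippet above is deliberately broken, so it cannot be run as-is. Here is a minimal runnable sketch of the same point; the class and method names (CovarianceDemo, arrayStoreFails, listSize) are my own and not from the book. The array version compiles and fails only when the store is attempted, while the list version is rejected by the compiler, so that line is left commented out.

```java
import java.util.ArrayList;
import java.util.List;

class CovarianceDemo {
    // Returns true if storing a String into a Long[] (viewed as Object[]) fails at runtime.
    static boolean arrayStoreFails() {
        Object[] objectArray = new Long[1]; // legal: Long[] is a subtype of Object[]
        try {
            objectArray[0] = "I don't fit in"; // compiles, but the array checks the element type
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    // The generic counterpart only compiles when the type arguments match exactly.
    static int listSize() {
        List<Long> longs = new ArrayList<Long>();
        longs.add(42L);
        // List<Object> objects = longs; // would not compile: incompatible types
        return longs.size();
    }

    public static void main(String[] args) {
        System.out.println("array store failed at runtime: " + arrayStoreFails());
        System.out.println("list size: " + listSize());
    }
}
```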

Arrays are reified [JLS, 4.7]. This means that arrays know and enforce their element types at runtime. Generics, by contrast, are implemented by erasure [JLS, 4.6]. This means that they enforce their type constraints only at compile time and discard (or erase) their element type information at runtime. Erasure is what allows generic types to interoperate freely with legacy code that does not use generics.
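
One way to see the difference for yourself is to ask the objects about their types at runtime. This is only a small sketch with names I made up (ErasureDemo and its two methods): the array still knows it holds Long values, while two lists with different type arguments turn out to share exactly the same runtime class.

```java
import java.util.ArrayList;
import java.util.List;

class ErasureDemo {
    // Arrays are reified: the element type is still available at runtime.
    static String arrayComponentType() {
        Object[] array = new Long[1];
        return array.getClass().getComponentType().getName();
    }

    // Generics are erased: both lists are plain ArrayList objects at runtime.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<String>();
        List<Long> longs = new ArrayList<Long>();
        return strings.getClass() == longs.getClass();
    }

    public static void main(String[] args) {
        System.out.println("array component type: " + arrayComponentType());
        System.out.println("lists share a runtime class: " + sameRuntimeClass());
    }
}
```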

Because of these fundamental differences, arrays and generics do not mix well. For example, it is illegal to create an array of a generic type, a parameterized type, or a type parameter. None of these array creation expressions are legal: new List<E>[], new List<String>[], new E[]. All will result in generic array creation errors at compile time.

Why is it illegal to create a generic array? Because it isn’t typesafe. If it were legal, casts generated by the compiler in an otherwise correct program could fail at runtime with a ClassCastException. This would violate the fundamental guarantee provided by the generic type system.
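
To make this concrete, here is a sketch of what would go wrong. Since new List<String>[1] does not compile, I use an unchecked cast as a stand-in for it (that substitution is my assumption for the sake of a runnable demo, and the class name GenericArrayDemo is made up). Everything else follows the reasoning above: covariance lets the array be viewed as Object[], erasure means the array cannot tell a List<Integer> from a List<String>, and the failure only shows up at the cast the compiler inserts when reading a String.

```java
import java.util.Arrays;
import java.util.List;

class GenericArrayDemo {
    @SuppressWarnings("unchecked")
    static boolean castFailsAtRuntime() {
        // Stand-in for the illegal expression: new List<String>[1]
        List<String>[] stringLists = (List<String>[]) new List[1];
        List<Integer> intList = Arrays.asList(42);

        Object[] objects = stringLists; // legal because arrays are covariant
        objects[0] = intList;           // no ArrayStoreException: at runtime both are just List

        try {
            String s = stringLists[0].get(0); // compiler-inserted cast to String
            System.out.println(s);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("cast failed at runtime: " + castFailsAtRuntime());
    }
}
```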

In summary, arrays and generics have very different type rules. Arrays are covariant and reified; generics are invariant and erased. As a consequence, arrays provide runtime type safety but not compile-time type safety and vice versa for generics. Generally speaking, arrays and generics don’t mix well. If you find yourself mixing them and getting compile-time errors or warnings, your first impulse should be to replace the arrays with lists.

Wednesday 16 January 2013

Computing for data analysis

I have enrolled in the course "Computing for data analysis" on Coursera, taught by Roger Peng from Johns Hopkins. Essentially you get to learn a new language, R, and play around with large data sets. What could go wrong, right?

The last 2 weeks have been pretty stressful at work. Even weekends were not spared (it's a new project). Cue Coursera's weekly quizzes and assignments. I barely have time to study after work. Having just completed the second programming assignment (submitted a day late, I wish I had more time), I'm pretty spent. I hope I'll be able to complete the course successfully, but it is proving rather difficult at the moment.

Early thoughts on R - the syntax is fairly straightforward, coming from a Java background. Everything is an object, and the value of the last expression in a function is returned automatically (kinda like Ruby), which I don't particularly like. It seems to have some really powerful statistical functions, and I'm hoping to unleash its power on some of my own datasets.

Tuesday 8 January 2013

How online recommendations work

I recently wrote a short article for PCQuest on collaborative filtering. It has since been published both in print (Jan 2013) and online. In case you're not able to access the online version, I've created a backup here, or you can read my article below:

It's late at night and you're bored. The television is devoid of entertainment, which is fairly typical. You're in the mood for a movie anyway. The latest one has great reviews but you're still not sure if it lives up to your high standards, so you call a friend who watched it recently. Once it passes the litmus test, you head online and purchase the movie. The movie is engaging and you have a wonderful time.

How is this relevant to your online experience? Online services like Amazon and Netflix make a living acting as your friends, ostensibly helping you out by recommending things to purchase along the way. When you purchase the movie, your information is stored and processed so that it can be served as recommendations to you and to others. The better their recommendations, the more likely you are to follow them and purchase the product (at least in theory). In any case, your overall online experience is enhanced and you're pleased with their astute inferences.

This innocuous recommendation feature is in reality powered by sophisticated algorithms and data crunching machines which reside in Amazon's data centers. Companies spend a large amount of time constantly refining these algorithms.

There are various ways one might implement this feature. Companies might examine users who are similar to you and use this information to serve you recommendations. They might instead decide to identify similar or correlated items. One popular algorithm for matching similar items (very basic and naive) is outlined below:
for each item I1
    for each customer C who bought I1
        for each item I2 bought by customer C
            record that a customer bought both I1 and I2
    for each item I2
        calculate similarity(I1, I2)
return table
Basically, items that a particular customer bought together are recorded in a table. This is done for all items, and the table is then used to calculate a similarity rating for each pair of items. Each item is represented as a vector (I1 and I2, for example), and a measure such as cosine similarity takes two of these vectors as input and produces a similarity rating. Billions of records are processed this way. All the complicated and heavy processing is done in data centers. When you click on an item, Amazon refers to these tables (this is a relatively fast operation; building the tables is the slow part) to determine which items to recommend to you.
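
For the curious, here is a toy Java sketch of the table-building step. It is not Amazon's implementation; it is just the pseudocode above under a few assumptions of mine: each item vector is simply the set of customers who bought the item, so cosine similarity reduces to (shared buyers) / sqrt(buyers of I1 * buyers of I2), and the class name, method names and sample purchases are all made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class ItemSimilarity {
    // Made-up purchase history: customer -> items bought.
    static Map<String, List<String>> samplePurchases() {
        Map<String, List<String>> purchases = new HashMap<>();
        purchases.put("alice", Arrays.asList("movie", "popcorn"));
        purchases.put("bob", Arrays.asList("movie", "popcorn", "soda"));
        purchases.put("carol", Arrays.asList("movie"));
        return purchases;
    }

    // Inverts customer -> items into item -> customers (the binary item vector).
    static Map<String, Set<String>> buyersByItem(Map<String, List<String>> purchases) {
        Map<String, Set<String>> buyers = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : purchases.entrySet()) {
            for (String item : entry.getValue()) {
                buyers.computeIfAbsent(item, k -> new HashSet<>()).add(entry.getKey());
            }
        }
        return buyers;
    }

    // Cosine similarity of two binary vectors, each stored as a set of buyers.
    static double cosine(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        int shared = 0;
        for (String customer : a) {
            if (b.contains(customer)) {
                shared++;
            }
        }
        return shared / Math.sqrt((double) a.size() * b.size());
    }

    // The slow part: builds the item -> (similar item -> score) table.
    static Map<String, Map<String, Double>> buildTable(Map<String, List<String>> purchases) {
        Map<String, Set<String>> buyers = buyersByItem(purchases);
        Map<String, Map<String, Double>> table = new TreeMap<>();
        for (String i1 : buyers.keySet()) {
            // Only items bought by someone who also bought i1 can score above zero.
            Set<String> candidates = new HashSet<>();
            for (String customer : buyers.get(i1)) {
                candidates.addAll(purchases.get(customer));
            }
            candidates.remove(i1);

            Map<String, Double> row = new TreeMap<>();
            for (String i2 : candidates) {
                row.put(i2, cosine(buyers.get(i1), buyers.get(i2)));
            }
            table.put(i1, row);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> table = buildTable(samplePurchases());
        // The fast part: recommending is just a lookup in the precomputed table.
        System.out.println("similar to movie: " + table.get("movie"));
    }
}
```

Looking up table.get("movie") is the cheap operation that happens when you click on an item; everything in buildTable is the expensive part that would be done ahead of time.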

It's interesting how such a seemingly simple "customers who bought this also bought this" feature is backed by so much research and complexity. In a world where customer attention is king, every competitive advantage counts.

So the next time you get a recommendation online, think of the lengths that such companies go to in order to get you this information. Don't feel guilty though: this service is not free, because you give them your information to work with too.