MMDS 2008 and CIM
Last week the second Workshop on Algorithms for Modern Massive Data Sets was held at Stanford University. This workshop had an incredible density of prestigious speakers from the field of machine learning; I guess spending a weekend in California during the summer is an easy sell.
Each of the four days of the conference had a theme. The official themes were:
- Data Analysis and Data Applications
- Networked Data and Algorithmic Tools
- Statistical, Geometric, and Topological Methods
- Machine Learning and Dimensionality Reduction
After attending several talks by these machine learning luminaries over the course of four days, I tried to pull together a few common themes of my own for further exploration:
- Incomplete Dyadic Data
- Distributing Data and Computation
- Manifolds with Noise
I'll try to put a detailed post up on each of these three topics this week.
Over the past few years I've become a regular conference attendee. It seems plausible that I will continue to attend KDD, VLDB, SIGMOD, and MMDS for years to come, leaving me with a major question: where can I find a longitudinal analysis of the content of these conferences?
It seems that these conferences provide an excellent yearly cross-section of the state of their respective fields, but I'm much more interested in the deltas. What are the new topics, which topics are making forward progress, and which topics are losing steam? Taking time to collect this information would provide some fascinating insight into the progress of science.
The KDD community has done some analysis of the DBLP data set, but not with the aims proposed above. From the VLDB community has come some work that is a bit closer in intent: AnHai Doan's Cimple project. So far, they've produced the moderately useful DBLife, and they've outlined a promising research direction.
I've recently started a conference type on Freebase and have a simple tool to monitor conference progress using their API. I'd like to make this tool more general and robust. Any CIM students or conference geeks looking for a summer project?
My good friend, musical mastermind Matt O'Malley, cooked up a theme song for the Facebook Data Team. You can also check it out on my Muxtape. It's just so, so good. If you would like to create an animation to accompany the music, please let me know.
REPL
Python is an amazing language for many reasons, but my habit of prototyping code in its interactive interpreter has crippled my development speed in languages that lack a REPL.
For this reason, I was ecstatic when I joined Facebook and learned they had developed their own interactive shell for PHP, phpsh. My JavaScript development was hastened when I found Mozilla's SpiderMonkey shell.
Learning Ruby and Erlang was also a pleasure: Ruby has irb, and Erlang ships with its own shell, Eshell.
I recently came across CERN's CINT, a REPL for C/C++. Finally! I have come to dread coding in these languages, especially C++, but I'm now looking forward to my next big C/C++ project. If you've used CINT, drop me a line.
I've also started looking for a nice Java REPL. There's DynamicJava and the Groovy interactive shell. Anything else I should try?
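Part of what makes Python's story so good is that a REPL isn't just bolted on: the standard library's `code` module lets you embed one inside any program. A minimal sketch of driving an embedded console one line at a time (the variable and function names are just for illustration):

```python
import code

# an embedded interpreter that accepts source one line at a time,
# just like the interactive prompt
console = code.InteractiveConsole()

console.push("x = 40")            # a complete statement runs immediately
more = console.push("def f(y):")  # returns True: the block needs more input
console.push("    return x + y")
console.push("")                  # a blank line closes the block
console.push("print(f(2))")       # prints 42
```

This is the same machinery the `python` prompt itself is built on, which is why third-party shells like phpsh had to be written from scratch for languages without an equivalent.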
Wired Magazine: (Brief) Dispatches from the "Petabyte Age"
I tend to geek hard when a publication drops a series of articles about petabyte-scale data analysis. Frequent culprits include the SIGMOD Record or Teradata Magazine, but this week, Wired Magazine splashed into the pool.
Chris Anderson (the long tail one) starts things off by declaring that the data deluge spells trouble for the scientific method. Clearly this claim is false: Google's success is almost entirely due to the scientific method, applied as rapid product iteration via thousands of hypothesis tests every month. To give Chris the benefit of the doubt, he seems to be asserting instead that the practice of model development should be done in close concert with the collection of empirical data. I couldn't agree more.
Near the end of the article, Chris makes a common (if frustrating) mistake: he claims that the cluster used for the NSF's CluE program will be running the Google File System, and he credits Google and IBM with building the software that will power this cluster. In actuality, Yahoo deserves almost all of the credit, as they have done an incredible job scaling the Hadoop project to thousands of nodes and making the NSF program possible. Just another testament to the effectiveness of Google's PR machine.
The rest of the articles cover many application areas of large scale data analysis in brief, including agriculture, astronomy, high-energy physics, politics, epidemiology, and insurance. There's also a startlingly incoherent attempt to describe how MapReduce works that probably should have been left out.
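For what it's worth, the core of MapReduce fits in a few lines: a map function turns each record into key/value pairs, the framework groups the pairs by key, and a reduce function collapses each group. A toy word count, with the distributed machinery stood in for by local Python:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: sort the intermediate pairs and group them by key."""
    return groupby(sorted(pairs), key=itemgetter(0))

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(n for _, n in group) for word, group in groups}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["the"] == 2
```

The real systems spread the map and reduce calls across thousands of machines and do the shuffle over the network, but the programming model is no more complicated than this.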
Wired's intentions were noble but their execution was not up to expectations. I'm hoping that tomorrow's MMDS Workshop will have a bit more substance.
Bloomberg for the Web? We Need Real-Time News, Data, and Analytics.
The NYT had an article today about the battle for efficient financial services information delivery currently heating up between Bloomberg and Thomson Reuters. A few weeks ago, there were articles about the NASDAQ and NYSE making their stock quotes available in real time to multiple information outlets. And just last week, Tibco, once a wholly owned subsidiary of Reuters, purchased Insightful. Insightful makes S-Plus, a commercial implementation of the S language that also underlies the R project. Tibco already owns Spotfire, a firm that makes excellent software for data exploration.
These developments made me think back to my time in financial services. Trading floors have the highest concentration of numerate people I've ever been around. They also have ready access to superb information manipulation software.
Now consider the current state of analytics software for the web. I won't bother to list the major competitors, as they are all pretty mediocre. As the online advertising space grows in sophistication and mechanisms to promote a more efficient market are introduced (see, for example, Right Media and ContextWeb's ADSDAQ), a workbench for the real-time exploration of news and data related to the web will be a necessary tool for many quantitative marketers.
As Hal Varian points out, marketing is the next field to be overrun with quants, and I expect that the tools most useful in finance will be brought along for the invasion.
My girlfriend Halle works hard at her nonprofit, Yoga Bear. I help out by maintaining the website. Today I ported the website from Django (hosted by WebFaction) to Google's App Engine; I also ported the CSS to use YUI's reset, fonts, and grids.
As a result, we no longer have to pay hosting fees and the site renders properly in IE 6. Furthermore, the transition will allow me to explore the YUI AJAX libraries and App Engine data store to add some new features to the Yoga Bear website. The web is beginning to feel less hostile to rapid development.
I made a Wordle from my del.icio.us tags. I found this site in the App Engine gallery.
I really dislike web programming due to its many accidental complexities, but the simplicity of developing a web app in python via App Engine has me experimenting with the web once again.
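Part of that simplicity is that an App Engine handler is, at bottom, just a Python callable behind WSGI. A stdlib-only sketch of the shape (the greeting and routing logic are made up for illustration):

```python
def application(environ, start_response):
    """A minimal WSGI app: Python web frameworks, App Engine's
    webapp included, ultimately reduce to this interface."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    # greet whoever is named in the request path, e.g. /halle
    name = environ.get("PATH_INFO", "/").strip("/") or "world"
    return [f"Hello, {name}!".encode("utf-8")]
```

Served locally, this is just `wsgiref.simple_server.make_server("", 8080, application).serve_forever()`; App Engine replaces that last line with its own infrastructure and adds the data store.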
A late-night browsing session landed me on this gem from ten years ago. Some highlights:
- It's been said before, but reading the description of GeoCities from 1998 really drives home the fact that online social networking dates back to the beginning of the web.
- DoubleClick is really old school. It's crazy to think of them integrating with Google. Also, there's a reminder of how controversial cookies were on the early web: "Using controversial software called cookies, DoubleClick sites can snoop on users' browsing habits, sometimes picking up such critical information as zip codes".
- Rupert Murdoch's internet ambitions were already apparent.
- Joe Nacchio was not on trial.
- Even before the iPod, Jonathan Ive was recognized for his visionary designs.
TIME also showed an early awareness of the web's global reach, highlighting Japanese, Chinese, and Brazilian leaders.
Historical artifacts like these have helped bring perspective to my first two and a half years in the Valley. For another historical reference, check out an old Forbes article about the aftermath of Yahoo's first major acquisition spree.
Creative Commons Technology Summit 2008
On Wednesday, Creative Commons hosted their first ever technology summit at Google. I've been into CC for a while now and I was relatively pleased by their progress since I last checked in.
Joi Ito started the day off with a lucid delineation of CC's major components. He pointed out that the technical side of CC is hoping to create a set of standards for digital media exchange in a similar spirit to what the IETF does for the internets as a whole. The political side of CC is more akin to the Open Internet Coalition that is fighting to keep these standards in neutral hands.
Ben then gave a great outline of the technical components of ccREL. I am impressed with the refinements and flexibility introduced by the full-scale adoption of RDFa for semantic markup.
The later sessions started to drift away from my core interests, but I was intrigued by the proliferation of digital copyright registries: Registered Commons, SafeCreative, and Noank Media, for example. It was great to see Attributor have a presence at the summit. They're also heavy users of Hadoop; I am once again impressed by what Jim has built.
I've enjoyed watching Creative Commons evolve over the past several years and I'm still holding out some hope that I'll be able to have a material impact on their success some time in the future. For now, it's great to keep up with the team; I trust they're in good hands with a fellow Cavalier leading the way.
Repeatability, again
In a recent post I mentioned the idea of reproducible research. It turns out that this year's SIGMOD conference, where we'll be presenting a new approach to structured storage, has conducted an experiment in reproducible research. You can read a fascinating account of the experiment in this month's SIGMOD Record. While you're there, be sure to check out the "Data Management Projects at Google" article as well!
Hadoop at Facebook
There's a post on the Facebook Engineering blog today from one of the Data team engineers, Joydeep Sen Sarma, discussing how we use Hadoop here at Facebook. Check it out.
For more on Hadoop at Facebook, you can check out the slides from a set of lectures I gave at IBM's Cloud Computing Center in Dublin.
For a deeper look at the architecture of HDFS, check out the presentation [PDF] that dhruba, a recent addition to the Data team, gave at the IBM storage team's recent offsite.
Participatory Sensing, a project from the Center for Embedded Networked Sensing at UCLA, reconceptualizes your mobile phone as a diverse collection of sensors that can be used to record data throughout your day. This data is then shipped to a centralized aggregation service that provides different views and analyses of your data. The Reality Mining project from MIT takes a similar perspective but has more targeted goals.
Both of these research projects were conducted in conjunction with Nokia. I gave a talk over at Nokia Research a few months back and was fascinated by their take on the future of the internet. They're positioning Ovi as your portal into the data collected via all of your Nokia sensors. I'm quite interested to see how their offerings evolve, particularly in parallel to the iPhone and Android.
The Peach open movie project, initiated by the Blender Foundation and hosted at the Blender Institute in Amsterdam, is an innovative project to produce high-quality digital media with open source tools.
Another interesting project in this space is Justin Frankel's REAPER project. These tools, in addition to the venerable GIMP, are rapidly commoditizing the software needed to produce digital media of professional quality.
Unfortunately, I lack the artistic talent required to make nontrivial projects with these tools, but I'm looking forward to consuming the creations of my more talented friends. In addition, a lower barrier to the creation of digital media means a rapid increase in the amount of multimedia data to be stored and analyzed. Doing machine learning over audio, image, and video data is something that Hadoop should handle well. If anyone has a project using Hadoop to do data mining over a multimedia data set, let me know!
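Hadoop Streaming makes this kind of job approachable from Python: the mapper and reducer are just programs that read and write tab-separated lines on stdin and stdout. A hypothetical sketch that tallies audio clips by sample rate (the `path<TAB>sample_rate` input format is invented for illustration):

```python
import sys
from collections import defaultdict

def mapper(lines):
    """Emit (sample_rate, 1) for each clip; Hadoop Streaming
    shuffles these lines so equal keys reach the same reducer."""
    for line in lines:
        path, rate = line.rstrip("\n").split("\t")
        yield f"{rate}\t1"

def reducer(lines):
    """Sum the counts for each sample rate."""
    counts = defaultdict(int)
    for line in lines:
        rate, n = line.rstrip("\n").split("\t")
        counts[rate] += int(n)
    for rate, n in sorted(counts.items()):
        yield f"{rate}\t{n}"

if __name__ == "__main__" and sys.argv[1:]:
    # select a phase with `python clips.py map` or `python clips.py reduce`
    phase = mapper if sys.argv[1] == "map" else reducer
    sys.stdout.writelines(line + "\n" for line in phase(sys.stdin))
```

On a cluster this would be wired up with the streaming jar that ships in Hadoop's contrib directory, passing `-mapper` and `-reducer` flags pointing at the two commands; the heavy feature extraction on the raw audio would happen inside the mapper.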