Rewriting Excel for the era of big(ger) data

23-2-2015

The spreadsheet may very well be the biggest innovation after the personal computer itself. Spreadsheets are used by professionals in virtually every sector where information is processed, and they are entrusted with anything from shopping lists to billion-dollar, life-or-death decision-support systems. They are used for all sorts of tasks that spreadsheets were and weren’t designed to do: modeling, simulation, information storage, extract-transform-load, to name a few. Spreadsheets are the quantitative lingua franca of the business world.

Spreadsheets are so ubiquitous because the mental model of a spreadsheet is so easy to grasp, even for non-programmers. A spreadsheet is, after all, nothing more than a grid that contains numbers, and formulas that use those numbers. Spreadsheets were modelled after blackboard calculations. The power of spreadsheets comes from the fact that, using this simple grid, it is possible to compute virtually anything (flight simulators, a processor emulator and K-means clustering are just a few examples of such Excel abuse). The spreadsheet is one of the few modes of computation that is both easy to use and understand, as well as extremely powerful. And, most importantly, it doesn’t require you to think like a computer, unlike virtually all programming and query languages (despite several attempts). The key invention that made spreadsheets possible is the process by which the web of formulas in a spreadsheet is turned into a computer program.
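To make that last point concrete, here is a minimal sketch (in Python, and emphatically not Excel’s actual recalculation engine) of how a web of formulas can become a program: treat each formula cell as a node in a dependency graph, resolve each cell’s inputs before the cell itself, and evaluate everything in that order. The cell names and formulas below are made up for illustration.

```python
# Toy sheet: each cell is either a constant or a (formula, inputs) pair.
# Cell names and formulas are invented for this example.
cells = {
    "A1": 10,
    "A2": 32,
    "B1": (lambda a1, a2: a1 + a2, ["A1", "A2"]),   # B1 = A1 + A2
    "B2": (lambda b1: b1 * 2,      ["B1"]),         # B2 = B1 * 2
}

def evaluate(cells):
    values, visiting = {}, set()

    def value_of(name):
        if name in values:                          # already computed
            return values[name]
        if name in visiting:                        # cycle detection
            raise ValueError(f"circular reference involving {name}")
        visiting.add(name)
        cell = cells[name]
        if isinstance(cell, tuple):                 # formula: resolve inputs first
            formula, inputs = cell
            values[name] = formula(*(value_of(i) for i in inputs))
        else:                                       # constant
            values[name] = cell
        visiting.discard(name)
        return values[name]

    return {name: value_of(name) for name in cells}

print(evaluate(cells))  # {'A1': 10, 'A2': 32, 'B1': 42, 'B2': 84}
```

Evaluating cells in dependency order (and catching circular references along the way) is essentially the service a spreadsheet’s recalculation engine provides, without ever asking you to think about evaluation order yourself.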

The spreadsheet is one of the few modes of computation that is both easy to use and understand, as well as extremely powerful.

Spreadsheets have some downsides as well. For one, logic and data are not clearly separated. This becomes a problem when you want to generalize your spreadsheet, for example to bigger data sets. Because spreadsheet formulas always take a fixed number of cells as their inputs, it is not easy to (automatically) accommodate increases in data size (Excel actually does a quite decent job here: it assumes that data added to a range should be treated like the other items in that range and rewrites references to the range accordingly, and it also provides tables functionality to make this explicit).
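A toy illustration of what that means in practice (plain Python standing in for spreadsheet formulas; the numbers are invented): a formula wired to a fixed range silently ignores rows added later, whereas an operation defined over a whole table column adapts automatically.

```python
# Why fixed-range formulas generalize poorly when the data grows.
sales = [120, 90, 75]        # think of these as cells A1:A3

sales.append(200)            # a new row of data arrives (A4)

print(sum(sales[0:3]))       # like =SUM(A1:A3): still 285, misses the new row
print(sum(sales))            # like summing a whole table column: 485, adapts
```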

Increasingly, people who analyze data, like myself, find themselves needing to work with data sets that Excel just cannot stomach anymore. The concept of a spreadsheet simply doesn’t scale (try analyzing files with more than a few million cells in a recent version of Excel and you’ll agree). Because any cell could (in theory) influence the value in any other cell, Excel needs to work out the dependencies between cells before it can compute results. Beyond a certain size, big data can only be analyzed using a divide-and-conquer approach. Because spreadsheets contain so many interdependencies, they are not so easy to divide (much less to conquer!).
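A rough sketch of that difference, again in Python with made-up data: an aggregation whose pieces are independent can be split into chunks that are computed separately (possibly in parallel) and then combined, but a chain of interdependent cells, a running balance say, has to be computed in order, which is exactly what makes a spreadsheet hard to divide.

```python
from functools import reduce

data = list(range(1_000_000))

# Independent: split the data, reduce each chunk, then combine the results.
chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
partial_sums = [sum(chunk) for chunk in chunks]     # each chunk could run elsewhere
print(reduce(lambda a, b: a + b, partial_sums))     # same result as sum(data)

# Interdependent: each "cell" depends on the previous one, so no chunk
# can be computed without finishing the one before it.
balance = 0
for amount in data:
    balance = balance + amount                      # must run strictly in order
print(balance)
```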

For many organizations, big data is “anything that doesn’t fit in Excel anymore”.

There are numerous tools and platforms available that can grok data at this scale. Unfortunately, none of these tools come close to the ease of use of a spreadsheet: they either require you to think like a computer, or do not provide a simple alternative mental model.

What the world needs is not another Hadoop, but a new Excel for bigger data: an intuitive analysis tool for big data, with a simple mental model but the same powerful capabilities.