Alluxio: In-memory SQL compute layer kicking Spark's ass?

Commenting on Matt Asay's article about Alluxio on ReadWrite here

There's a constant drive to increase speed, reduce latency, and grow data scale in the compute layer of our modern distributed stacks.

Traditionally there's a trade-off here: ceteris paribus, you can have scale or speed, but not both.

To increase both scale and speed, you need to change something, like throwing more hardware at the problem, or constraining the problem:

Latency vs Data Scale

In the real world, latency grows far faster than data scale. Put another way, performance falls off a cliff at each storage-tier threshold: CPU cache to memory, memory to disk, disk to network (a remote node), remote node to a remote data center. Each step down can cost orders of magnitude in latency. It takes real smarts to avoid these cliffs, or to work around them intelligently; this is a genuinely hard data-locality problem.
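To make the cliff concrete, here's a tiny Python sketch using ballpark figures in the spirit of the well-known "latency numbers every programmer should know" lists. The numbers are illustrative assumptions, not measurements; real values vary by hardware and network, but the ratios tell the story.

```python
# Ballpark access latencies per storage tier, in nanoseconds.
# Illustrative order-of-magnitude figures only -- real numbers vary
# by hardware, but the cliffs between tiers are the point.
LATENCY_NS = [
    ("L1 cache reference",                        1),
    ("Main memory reference",                   100),
    ("Local disk seek",                  10_000_000),
    ("Remote node (round trip + disk)",  10_500_000),
    ("Remote data center round trip",   150_000_000),
]

base = LATENCY_NS[0][1]
for name, ns in LATENCY_NS:
    print(f"{name:34s} ~{ns:>12,} ns  ({ns // base:>12,}x an L1 hit)")
```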

Hadoop brought scale, but we quickly realized it was shockingly slow on the compute side: really only good for batch workloads. It was also tough to use (hand-written MapReduce jobs, not SQL).

Along came Hive, bolting SQL onto Hadoop, but if anything it was slower.

Finally, along came a little company called Databricks with Spark, promising to speed up computation over these large distributed datasets, with machine learning specifically in mind.

Well, given that Spark was still a "generic" platform, they couldn't optimize it for the specific types of data or workloads they knew to expect. Hence they couldn't squeeze every last bit of performance out of it.

In answer to this, a bunch of purpose-built systems have managed to get "real-time" performance out of distributed systems with large datasets, even when pulling off spinning disk. One of the best known (if not actually available to the public) is Scuba at Facebook. Originally developed for their user analytics, it constrains the types of queries (and hence computations) that can be applied to the data, and assumes everything is a time series (as it is with user click-streams). Using these constraints, they were able to get extremely impressive performance across vast datasets. See Lior Abraham's fantastic original blog post here:

Data Diving with Scuba
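To illustrate the kind of constraint Scuba exploits, here's a toy Python sketch of the only query shape such an engine needs to support: scan, filter, bucket by time, aggregate. The schema and function names are invented for illustration; this is not Scuba's actual API.

```python
# A toy sketch of a constrained, Scuba-style query: no joins, no
# arbitrary SQL -- just filter + group-by-time-bucket aggregation
# over an append-only event stream.
from collections import defaultdict

def time_bucketed_count(events, predicate, bucket_seconds=60):
    """Count events matching `predicate`, grouped into fixed time buckets.

    Because every record carries a timestamp and the only supported
    shape is scan -> filter -> bucket -> aggregate, one sequential pass
    per shard suffices, and shards can be scanned in parallel and merged.
    """
    buckets = defaultdict(int)
    for event in events:                      # one sequential scan
        if predicate(event):
            bucket = event["ts"] - event["ts"] % bucket_seconds
            buckets[bucket] += 1
    return dict(buckets)

# Example: clicks on the checkout button, per minute.
events = [
    {"ts": 1000, "action": "click", "target": "checkout"},
    {"ts": 1010, "action": "click", "target": "home"},
    {"ts": 1075, "action": "click", "target": "checkout"},
]
print(time_bucketed_count(events, lambda e: e["target"] == "checkout"))
# {960: 1, 1020: 1}
```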

As an aside: the Scuba team went on to found Interana, where they're trying (and succeeding) to build a version of this analytics engine for companies other than Facebook.

Now onto Alluxio. Doing a little more research, it doesn't look like they're actually breaking the laws of physics. That is, they aren't fundamentally moving that real-world latency line down. What they are in fact doing is putting a memory-centric storage layer under Spark, heavily optimized for in-memory access, and doing much more intelligent data-locality work to keep caches and memory filled.
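Concretely, the pattern the Alluxio docs describe is to point Spark at an alluxio:// path instead of hdfs://, so repeated reads come from Alluxio's memory tier on nearby nodes. A minimal PySpark sketch (the hostname and file path are placeholders, and the Alluxio client jar must be on Spark's classpath):

```python
# Minimal PySpark sketch: read through Alluxio so hot data is served
# from memory instead of being re-read from disk. Hostname and path
# below are placeholders, not real endpoints.
from pyspark import SparkContext

sc = SparkContext(appName="alluxio-locality-demo")

# First read pulls from the under-store (e.g. HDFS), and Alluxio caches
# the blocks in memory on the nodes that read them.
lines = sc.textFile("alluxio://alluxio-master:19998/logs/events.log")
print(lines.count())

# A second pass over the same path is a memory-speed, node-local read,
# not another trip across the disk/network latency cliff.
print(lines.filter(lambda line: "ERROR" in line).count())
```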

Is it a step in the right direction? Yes. Is it 100x faster than Spark, as the clickbait headline would suggest? No. It's actually only about 30x faster, and only on certain workloads; the author apparently can't do math (the 100x figure is versus Hive, which isn't a hard bar to clear).

It's great to see things moving in the right direction here, but this isn't the panacea breakthrough the article would herald.