Hey! Presto - Facebook's latest open source code

Posted by Codepope's Development Hell on Wednesday, November 13, 2013
Last Modified on Saturday, August 31, 2024

Facebook, in their now traditional goal of taking on big data problems, solving them and then open sourcing the result, have open-sourced Presto, a distributed SQL query engine “optimized for ad-hoc analysis at interactive speed”. This type of app is designed for the folks who need to work out what people who like chips and cheese and rock but dont like bagels or opera also have, statistically, in common. Its a simple enough question, but when you get up to Facebook scale, its a hard question to answer. This is the land of Hadoop and Hadoop has its own SQL-like query engine, Hive. 

But unlike Hive which converts queries into MapReduce tasks saving intermediate results to disk, Presto has a query and execution engine which runs in memory and is pipelined through the network. Presto is implemented in Java for easy integration with other parts of Facebook that are also built in Java and compiles parts of queries down to bytecode, letting the JVM JIT compile to machine code to get the best out of the Java environment. Although it doesn’t need Hive, Presto does need a datasource for its queries and it includes a plugin for Hive, though it only uses the Hive metastore service, presumably to obtain structural information, and then accesses the data over HDFS.

The Facebook announcement says “Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook” and has been in use internally since Spring of this year with multiple deployments and one cluster scaled to a thousand nodes. A thousand users actively use it with 30000 queries and processing a petabyte a day. Thats a good work out for any big data offering.

There’s plenty missing from Presto; various joins and aggregations are restricted and there’s no way to write results back into tables - they go straight to the client. Those issues, plus improved performance, query accelerators, hot cached data subsets and a high performance HBase connector are all on the roadmap for Presto.

Presto is licensed under the Apache License 2.0 but does not appear to be heading to the foundation with active development taking place around Facebook’s GitHub repository.

This article was imported from the original CodeScaling blog