How Nostradamus Predicted Presto: An Engine for Querying 250 PB of Data


Who is Nostradamus?

Nostradamus was a French seer, famous for his predictions of the future.

As per the Bible, God created man (here, Mark Zuckerberg).



Man (Mark Zuckerberg) created Facebook.

When he started, the database and the amount of information were small. What follows is a summary of the cloud-related technology that made Facebook's scale possible.

MapReduce 

MapReduce is a programming model for distributed computing. A map phase transforms input records into intermediate key/value pairs across many machines; a reduce phase then aggregates the values for each key, after which the final results are produced.
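The two phases can be sketched in miniature with plain Python (the real thing runs across a cluster; this single-process word count is only an illustration of the model):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(map_phase(docs))
print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real Hadoop job, the pairs emitted by the mappers are shuffled across the network so that all pairs for a given key land on the same reducer.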

 

HBase (database)


Apache HBase™ is the Hadoop database: a distributed, scalable, big-data store.
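HBase's data model is essentially a sparse, sorted map keyed by row key and column family:qualifier (real HBase also versions each cell by timestamp). A toy Python sketch of that layout, purely for illustration:

```python
from collections import defaultdict

class TinyHBase:
    """Toy model of HBase's layout: row key -> {'family:qualifier': value}.
    Real HBase additionally versions each cell by timestamp and keeps
    rows sorted on disk."""
    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key, family, qualifier):
        return self.rows.get(row_key, {}).get(f"{family}:{qualifier}")

    def scan(self, start, stop):
        # Row keys are visited in sorted order, like an HBase range scan.
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

table = TinyHBase()
table.put("user#001", "info", "name", "alice")
table.put("user#002", "info", "name", "bob")
print(table.get("user#001", "info", "name"))  # alice
```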


 
Hadoop and HDFS
 
Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at Yahoo at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.
[Figure: Hadoop cluster architecture]

Hadoop and its "Hadoop Distributed File System" (HDFS) are open-source Java products.
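The core idea of HDFS is simple: files are split into fixed-size blocks (128 MB by default in recent Hadoop versions) and each block is replicated across DataNodes. The splitting step can be sketched like this (the tiny block size is only for demonstration):

```python
def split_into_blocks(data: bytes, block_size: int = 128 * 1024 * 1024):
    """Split a byte stream into fixed-size blocks, the way HDFS splits
    files. In a real cluster, each block would then be replicated
    across several DataNodes for fault tolerance."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Tiny block size so the example is readable:
blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(blocks)  # [b'abcd', b'efgh', b'ij']
```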

Problems with Hive and others

"Historically, our data scientists and analysts have relied on Hive for data analysis," Traverso said. "The problem with Hive is it's designed for batch processing. We have other tools that are faster than Hive, but they're either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we've been working on Presto to basically fill this gap."



Why PRESTO?

Presto solves the problems with Hive and the other tools mentioned above.
Presto is an open-source project from Facebook.
 
According to the Netflix team:
We had been in search of an interactive querying engine that could work well for us. Ideally, we wanted an open source project that could handle our scale of data & processing needs, had great momentum, was well integrated with the Hive metastore, and was easy for us to integrate with our DW on S3. We were delighted when Facebook open sourced Presto.
 
In terms of scale, we have a 10 petabyte data warehouse on S3. Our users from different organizations query diverse data sets across expansive date ranges. For this use case, caching a specific dataset in memory would not work because cache hit rate would be extremely low unless we have an unreasonably large cache. The streaming DAG execution architecture of Presto is well-suited for this sporadic data exploration usage pattern.
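The "streaming DAG execution" mentioned above means rows flow continuously through a pipeline of operators instead of each stage being materialized to disk, as happens between Hive's MapReduce jobs. Python generators give a loose single-machine analogy of that idea (this is an illustration of the concept, not Presto's actual implementation):

```python
# Analogy only: rows stream through pipelined operators one at a time,
# so no intermediate stage is ever fully materialized.

def scan(rows):
    for row in rows:            # source operator: emit rows lazily
        yield row

def filter_op(rows, predicate):
    for row in rows:            # rows pass through without buffering
        if predicate(row):
            yield row

def project(rows, column):
    for row in rows:            # keep only the requested column
        yield row[column]

table = [{"country": "US", "views": 10},
         {"country": "BR", "views": 7},
         {"country": "US", "views": 3}]

pipeline = project(filter_op(scan(table), lambda r: r["country"] == "US"),
                   "views")
total = sum(pipeline)
print(total)  # 13
```

Because each operator pulls one row at a time from the one before it, this pattern suits sporadic, exploratory queries over huge data where caching whole datasets in memory is impractical.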
 
In terms of integrating with our big data platform, Presto has a connector architecture that is Hadoop friendly. It allows us to easily plug in an S3 file system. We were up and running in test mode after only a month of work on the S3 file system connector in collaboration with Facebook.
 
Hope this helps

 
