There is a growing demand for running Hadoop in the cloud. In response, various solutions have emerged in both the public and private domains. Being early solutions, they provide basic features but only scratch the surface, and we have yet to see the real benefits of taking Hadoop to the cloud. In this presentation we discuss our experiences in building an orchestrator called MultiStack for deploying, managing and monitoring Hadoop jobs across clouds. It exploits the cloud to offer features such as auto-scaling, aggregated monitoring, job-specific optimizations, isolation and smarter job scheduling within and across clouds.

Traditionally, Hadoop has run outside the cloud and has been restricted to a single cluster. MultiStack allows enterprises to run jobs seamlessly across multiple clouds and clusters. It reduces job completion time and improves resource utilization through machine-learning-based job scheduling. The MultiStack scheduler uses fine-grained monitoring data to scale Hadoop clusters automatically and can use spot instances to lower cost. We plan to integrate Hive, Mahout and Giraph to drive data analytics in the cloud.


