
A Three-Step Plan to Innovate Hadoop for the Cloud

Madhup Mishra
Director, DataOps Product Marketing, Hitachi Vantara

January 06, 2021


How large is your Hadoop data lake? 500 terabytes? A petabyte? Even more? And it is certainly growing, bit by bit, day after day.

What began as inexpensive big data infrastructure now demands ever more spending on storage and servers while becoming increasingly unwieldy and expensive to manage. That appetite makes it ever harder to realize a proper return on investment from your Hadoop infrastructure.

Moving the data lake to the cloud is the obvious next step, one that may even be mandated by your cloud migration priorities. Options like Amazon EMR, Google Dataproc, and Azure Data Lake offer enticing capabilities. But there are hazards on that road.

Most companies can’t simply abandon their investment in on-premise systems and start from scratch in the cloud. Every day you wait to make a move, however, the available Hadoop storage shrinks, and the next million-dollar invoice is just around the corner.

There is a way forward. Hitachi Vantara has developed a three-step playbook that enables clients to reduce costs, make the most of existing Hadoop infrastructure, and move incrementally to a cloud-oriented data architecture.

Step One: Drive Out Cost

Hadoop comes with an inherent catch-22. Its architecture tightly couples storage and compute, so as storage needs grow, you must add matching, expensive, and likely under-utilized compute capacity. What if we could break that cycle, gaining storage capacity without adding servers?

How far back does your organization typically look for day-to-day operating analytics? Most likely it is on the order of days, weeks, or maybe a couple of months at most. Deeper dives may be conducted periodically, but on average 60 to 80 percent of all data stored in a data lake is what can be considered “cold” data: dormant, unused, and simply taking up space.

The solution is to push that idle data out of your Hadoop data lake and into an object store: lower-cost, back-end local storage that may be 80% cheaper than HDFS. Properly tiering data in this way could save an organization a million dollars per petabyte compared to expanding a Hadoop cluster, and that is only a fraction of the total benefit.
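What counts as “idle” is usually defined by last access time. As a rough, generic illustration of how cold-data candidates might be identified (this is not how Lumada Data Optimizer works internally), the sketch below walks an HDFS directory tree over the WebHDFS REST API and totals up files untouched for a configurable number of days; the namenode address, port, directory path, and 90-day threshold are all placeholder assumptions.

```python
# Minimal sketch: flag HDFS files not accessed in N days as candidates for
# tiering to an object store. Assumes WebHDFS is enabled; the host, port,
# path, and threshold below are placeholder values.
import time
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed WebHDFS endpoint
COLD_AFTER_DAYS = 90                            # assumed "cold" threshold

def find_cold_files(path, cold_after_days=COLD_AFTER_DAYS):
    """Yield (path, size_bytes) for files whose last access is older than the threshold."""
    cutoff_ms = (time.time() - cold_after_days * 86400) * 1000
    url = f"{NAMENODE}/webhdfs/v1{path}?op=LISTSTATUS"
    statuses = requests.get(url).json()["FileStatuses"]["FileStatus"]
    for entry in statuses:
        child = f"{path.rstrip('/')}/{entry['pathSuffix']}"
        if entry["type"] == "DIRECTORY":
            yield from find_cold_files(child, cold_after_days)
        # accessTime is epoch milliseconds; it may be 0 if access-time
        # tracking has been disabled on the cluster.
        elif entry["accessTime"] < cutoff_ms:
            yield child, entry["length"]

cold_bytes = sum(size for _, size in find_cold_files("/data/lake"))
print(f"Cold data candidates: {cold_bytes / 1e12:.1f} TB")
```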

Shifting that cold data simultaneously creates headroom in the existing Hadoop data lake, in effect providing storage and compute capacity without adding any more costly nodes to the cluster. Analyzing the heat profile of the data in the lake this way is not only a best practice; it is also a vital first step toward avoiding the same over-provisioned, under-utilized scenario when migrating to the cloud. This is the role of Lumada Data Optimizer for Hadoop: it reduces Hadoop infrastructure costs through intelligent data tiering, backed by Hitachi Content Platform for seamless application access. To help you begin to evaluate your savings potential, we put together a simple, free savings calculator.
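For a rough sense of the arithmetic behind that calculator, here is a back-of-the-envelope sketch; the cluster size, cold-data fraction, and per-terabyte cost figures are illustrative assumptions drawn from the ranges mentioned above, not actual pricing.

```python
# Back-of-the-envelope tiering savings estimate. All figures are illustrative
# assumptions based on the ranges discussed above, not actual pricing.
LAKE_TB = 1000                                  # a 1 PB data lake
COLD_FRACTION = 0.70                            # 60-80% of data is typically cold
HDFS_COST_PER_TB = 1500.0                       # assumed all-in cost to expand the cluster, per TB
OBJECT_COST_PER_TB = HDFS_COST_PER_TB * 0.20    # object store ~80% cheaper

cold_tb = LAKE_TB * COLD_FRACTION
savings = cold_tb * (HDFS_COST_PER_TB - OBJECT_COST_PER_TB)
print(f"Tiering {cold_tb:.0f} TB of cold data saves roughly ${savings:,.0f}")
# -> Tiering 700 TB of cold data saves roughly $840,000
```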

Step Two: Data Governance

With the data now properly tiered, leaving hot data on Hadoop clusters and moving cold data to backend object stores, the next objective is to make it possible to manage the data so it can be used productively and securely.

Of course, building data pipelines today is infamously laborious. A lot of time and effort goes into tracking down the requisite data and then understanding it in detail. Typically this involves manually correlating extensive, granular detail about the data and applying business meaning to the names of the specific databases, tables, and schemas associated with it.

This is the role of the Lumada Data Catalog, which automatically builds a highly comprehensible catalog from the metadata describing everything in the data lake. It pulls back the shroud on “dark” data, revealing its meaning and importance; that not only makes self-service discovery possible for users but also provides the awareness an organization needs to make informed choices about data governance policies.

Lumada Data Catalog does the hard work of combing through data resources, profiling and analyzing what is available, and then abstracting the minutiae so that it is possible to make sense of the resources. It lets you build policies for enforcing compliance, e.g. by identifying and protecting sensitive data. That visibility also makes it far easier to make informed, confident choices about what data can be safely migrated and accessed from the cloud and what data should remain on-premises, perhaps for security reasons. Those choices form the foundation of enforceable data policies.
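To make the idea of identifying sensitive data concrete, here is a minimal, generic sketch of the kind of rule a profiling-driven policy might codify. It is not the Lumada Data Catalog API; the column names, sample values, patterns, and threshold are illustrative assumptions.

```python
# Minimal, generic sketch of rule-based sensitive-data detection during
# profiling. Not a Lumada Data Catalog API; patterns and data are illustrative.
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def profile_column(name, sample_values, match_threshold=0.8):
    """Tag a column as sensitive if most sampled values match a known pattern."""
    tags = []
    non_empty = [v for v in sample_values if v]
    for tag, pattern in SENSITIVE_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if non_empty and hits / len(non_empty) >= match_threshold:
            tags.append(tag)
    return tags

# Example: a sampled column from a customer table (made-up values)
print(profile_column("contact", ["ana@example.com", "li@example.org", ""]))
# -> ['email']  ... a governance policy could then mask or restrict this column
```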

In addition, abstracting information about data location provides a unified way of finding data across cloud and on-premise resources. Now, when an analytics job requires different data assets about a customer, for example, the catalog can point to the data across multiple stores. And if data is moved, perhaps migrated to the cloud, the pipelines that rely on it remain intact.
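That location abstraction can be pictured as a lookup from logical dataset names to physical URIs. The sketch below illustrates only the idea, not any specific catalog product; the dataset names and storage URIs are hypothetical.

```python
# Minimal sketch of location abstraction: pipelines reference logical dataset
# names, and only the catalog mapping changes when data moves.
# The names and URIs below are hypothetical.
CATALOG = {
    "customers.orders":  "hdfs://prod-nn/data/lake/orders",
    "customers.profile": "s3://acme-analytics/customer/profile",   # already migrated
}

def resolve(logical_name: str) -> str:
    """Return the current physical URI for a logical dataset name."""
    return CATALOG[logical_name]

# A pipeline keeps using the logical name before and after migration:
print(f"Reading orders from {resolve('customers.orders')}")

# Later, the orders dataset is migrated to the cloud; only the catalog changes.
CATALOG["customers.orders"] = "s3://acme-analytics/customer/orders"
print(f"Reading orders from {resolve('customers.orders')}")
```

Because pipelines resolve data through logical names, migrating a dataset only requires updating its catalog entry.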

Step Three: Deliver a Rich Data Fabric

With a now-optimized data lake and a full understanding of its contents, it is time to execute on the CIO’s vision of a data lake in the cloud. Or, far more likely, it is time to deliver a coherent, hybrid solution. There are still numerous choices to be made about the data.

Sensitive data, for example, can be kept on-premise in the now-optimized Hadoop cluster. Other data can move to the cloud to benefit from rapid execution, increased analytics capability, and simplified management. This may also be a good opportunity to push the processing of IoT sensor data out to the edge, reducing its impact on the data lake.

Hitachi Vantara and its nearly one thousand data lake experts can help smooth that transition, delivering the data infrastructure needed today while anticipating future evolution. They help develop a unified, long-term data strategy, then set up the pipelines and governance needed to move data from on-premise systems to the cloud. They’ll preserve the functionality of valuable analytics and connect applications so that critical operations keep delivering.

These teams fully understand Hadoop methodologies, EMR, Dataproc, and Azure. More importantly, they are partners who help ensure your organization transitions to a sophisticated, optimized data fabric: a tightly woven tapestry of edge, on-premise, and cloud capabilities that delivers the agility your business needs and the cost management that makes good business sense.

To get started, contact us for an assessment of your organization’s data and AI maturity.


Madhup Mishra

Madhup drives product marketing for Hitachi Vantara's DataOps portfolio business. He has more than 20 years of enterprise software experience and covers a range of topics including data operations, big data analytics, data governance, and the Internet of Things (IoT).