The arrival of more and more data in all segments of the enterprise started out as an embarrassment of riches, but quickly transformed into something close to a nightmare of dark data. However, a raft of new technologies and the processes embodied in DataOps are charting a path forward in which a much higher percentage of data becomes useful.
The challenge most companies face is how to manage and gain access to all the data flooding in from all directions. An apt name for this data that cannot be seen, used or managed, or any combination of the three, is dark data.
The Dark Data Problem Is Growing
A variety of forces are making this problem more acute:
- Data volumes are exploding, expanding the amount of data that must either be understood or remain in the dark. According to IDC, only about 2.5% of all data is analyzed; the rest goes unused and its potential value is squandered.
- Public cloud adoption has vastly expanded the locations in which data resides and loosened control over data creation and management. Gartner® predicts that, “By 2022, public cloud services will be essential for 90% of data and analytics innovation.”[1] According to ESG, 81% of companies use more than one cloud infrastructure provider, whether for IaaS or PaaS. And 51% use three or more.
- Self-service must be supported to expand the pool of people in a company who can make data useful. From the perspective of self-service users, data in the dark might as well not exist. If it is hard to find, productivity plummets. A report from 451 Research on DataOps revealed that 25% of chief data officers (CDOs) with more than 2PB of data say staff spend over 50% of their time finding and gaining access to data.
The remedy for dark data is not a single data lake to rule them all, or any other single act or method. The goal must be people, processes and technology that expand visibility; automatically understand, clean and integrate data; enable access to data for both analytics and operational workloads; and catalog the data so it can be easily found.
Given the vast number of people, skills, tools and technology that are involved in such an effort, the practices defined as DataOps provide a useful framework through which a systematic attack on dark data can be mounted.
How DataOps Reduces Friction
Friction is pervasive across the data stack. There is friction in integration due to the lack of automation and the inability to combine IT and operational technology (OT) data. Friction exists in the governance processes, too: it is difficult to understand the meaning behind data, and data cataloging is manual and error-prone. Finally, there is friction when it comes to using the data for analytics due to the lack of self-service; for example, a business analyst must wait for IT to provision the data. Removing this friction makes data far more visible and useful.
Here are several ways that DataOps culture, processes and technology enable enterprises to address the challenges surrounding dark data friction.
Automating Data Discovery
Business and data analysts spend 80% of their time finding and preparing data and only 20% obtaining insights. The lack of visibility into datasets, their contents, and the quality and usefulness of each is a major problem.
Although the amount of data in different repositories has grown rapidly in most organizations, much of it is not useful because it has not been automatically tagged for discovery and use. If an analyst is working with inadequate and poorly understood datasets, the quality of their analysis will suffer because the data they rely on is incomplete or inaccurate.
Data cataloging gives data analysts the ability to find and prepare data quickly and efficiently. Hitachi’s Lumada Data Catalog accelerates data discovery and metadata tagging to secure sensitive data, infer hidden relationships, and accelerate data self-service and insights.
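As a minimal sketch of what automated tagging involves (illustrative only, not the Lumada Data Catalog API), the code below infers coarse column types from sample records and flags likely sensitive fields while building a catalog entry:

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def infer_type(values):
    """Infer a coarse column type from sample values."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    if all(isinstance(v, str) and EMAIL_RE.match(v) for v in values):
        return "email"  # likely sensitive: tag for governance review
    return "text"

def catalog_entry(name, rows):
    """Build a catalog record: per-column type tags plus a sensitivity flag."""
    columns = {}
    for col in rows[0]:
        sample = [r[col] for r in rows]
        col_type = infer_type(sample)
        columns[col] = {"type": col_type, "sensitive": col_type == "email"}
    return {"dataset": name, "columns": columns}

# Hypothetical dataset scanned during discovery.
customers = [
    {"id": 1, "contact": "ana@example.com", "balance": 120.5},
    {"id": 2, "contact": "raj@example.com", "balance": 87.0},
]
entry = catalog_entry("customers", customers)
```

A real catalog would persist these entries and index them for search, which is what lets an analyst find a dataset in minutes rather than days.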
A 360 View of Data
The contextualization of data is one aspect of DataOps that can help resolve dark data issues. Think of all the nuanced ways a consumer engages with a bank: through personal and business accounts, investments, loans, mortgages and even loyalty programs. All these touchpoints are managed through different departments or divisions across a bank's value chain. But how can a bank achieve a 360-degree customer view across multiple lines of business?
In the industrial internet of things or IIoT arena, manufacturers have sensor data coming in from physical assets to indicate how an asset performs or the temperature in a building. Data also flows in from a supply chain, revealing what’s available, what’s on its way and what’s being ordered. Bringing these disparate data sets together creates the context that makes the data relevant to the business through an asset 360 view.
Creating a complete context gives organizations the insight they need to understand what is going on in their business, on their shop floors and with their customers. Context helps answer important questions about whether there is enough output on a factory line to support the current order commitment. Context supports better predictions, based on existing commitments and availability, and the best time to take a machine offline for routine maintenance. Context can even help model the impact of downtime on fulfilling the order.
On the logistics side, creating a complete context through a supply-chain 360 view can reduce supply chain volatility by showing where the right parts are in the supply chain. This approach can show whether they will arrive on time to complete an order, and what effect a delay would have on fulfilling an order.
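To make the supply-chain 360 view concrete, here is a minimal sketch, with hypothetical order and shipment records, that joins two silos on part number to flag orders at risk of delay:

```python
from datetime import date

# Hypothetical records from two silos: order commitments (IT) and
# shipment tracking (logistics). Joining them on part number creates
# the supply-chain context described above.
orders = [
    {"order": "A-100", "part": "P-7", "needed_by": date(2024, 6, 10)},
    {"order": "A-101", "part": "P-9", "needed_by": date(2024, 6, 3)},
]
shipments = {
    "P-7": {"eta": date(2024, 6, 8)},   # arrives before it is needed
    "P-9": {"eta": date(2024, 6, 5)},   # arrives after the commitment
}

def delay_risks(orders, shipments):
    """Return orders whose required part arrives after the needed-by date."""
    at_risk = []
    for o in orders:
        eta = shipments[o["part"]]["eta"]
        if eta > o["needed_by"]:
            at_risk.append(
                {"order": o["order"], "days_late": (eta - o["needed_by"]).days}
            )
    return at_risk

risks = delay_risks(orders, shipments)
```

Neither silo alone can answer "will this order ship on time"; only the joined, contextualized view can.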
Data Correlation
Correlation refers to how data is interrelated across different parts of the business. When data, people and technology come together, organizations can use the insight generated both operationally and strategically. For example, in a manufacturing company, IoT data is collected from different sensors. These sensors generate data about what is happening in discrete areas. When that information is joined with sensor data from somewhere else in the journey, a story unfolds: it provides visibility along a continuum, a set of processes, or the steps within that domain. Connecting each of these single points of view further enhances the full context, providing a more comprehensive view of the broader operation.
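As a toy illustration of correlating sensor streams, the sketch below computes the Pearson correlation between two hypothetical readings taken at different points on the same line; a high coefficient suggests the two streams are telling one story:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical hourly readings from two points on the same line:
# motor temperature upstream, vibration downstream.
temperature = [61.0, 63.5, 66.0, 70.5, 74.0]
vibration = [0.20, 0.24, 0.27, 0.33, 0.38]

r = pearson(temperature, vibration)
```

In practice this runs across many streams at scale, but the principle is the same: numeric evidence that two discrete areas move together.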
Integrating IT and OT
The pace of change is accelerating, and industrial customers are finding it harder to keep up. The move to Industry 4.0 is more important than ever to drive industrial digitization. Data, analytics and artificial intelligence (AI) present new opportunities for IT and OT convergence.
Leveraging DataOps enables organizations to ingest OT data from the edge of the network, manage devices and drive local analytics right from the factory floor. Some of this data is brought to the core of the network and blended with IT sources: customer relationship management (CRM) data, supplier data and so forth.
Once the data streams from multiple processes and parts of the business are discovered, they can be correlated by applying AI and machine learning (ML) models, making it easier to optimize and act throughout the business and discrete areas. Furthermore, as correlated and contextualized data bubbles up over the edges of its silos of people, places and things, it gains more visibility from other users, who can look at it through a different lens. As a result, more value is derived.
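The blending step can be sketched as follows; a simple z-score check stands in for the AI/ML models mentioned above, and the asset names and records are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical OT readings aggregated at the edge, keyed by asset ID,
# blended with an IT maintenance record for the same asset.
edge_readings = {"press-01": [4.1, 4.0, 4.2, 4.1, 6.9]}  # last value spikes
maintenance = {"press-01": {"last_service": "2024-03-01"}}

def flag_anomalies(readings, threshold=1.5):
    """Flag readings more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(readings), stdev(readings)
    return [x for x in readings if abs(x - mu) / sigma > threshold]

# Blend the two silos into one asset-centric view: OT anomalies
# alongside the IT maintenance history that explains or contextualizes them.
report = {
    asset: {
        "anomalies": flag_anomalies(values),
        "last_service": maintenance[asset]["last_service"],
    }
    for asset, values in edge_readings.items()
}
```

The value here comes from the join: an anomaly is far more actionable when it sits next to the asset's service history rather than in an OT silo.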
DataOps Creates a New Data Culture
While DataOps is a powerful way to address the tactical problems mentioned so far, it also has a profound cultural impact that operationalizes data management with automation and collaboration by:
- Applying DevOps principles broadly to all use of data, bringing a continuous integration and continuous delivery (CI/CD) style to bear.
- Enabling collaboration between data stakeholders: data engineers, data stewards, data consumers and IT operations staff.
- Leveraging tools and technologies that automate manual processes and foster collaboration between teams.
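The CI/CD analogy can be made concrete with a data quality gate that runs on every pipeline execution, just as unit tests run on every commit. The checks and dataset below are illustrative, not any specific product's API:

```python
# A CI-style quality gate for a dataset: the checks run on every pipeline
# execution, and a failing check blocks publication, mirroring how a
# failing unit test blocks a code deployment.

def check_no_nulls(rows, column):
    """Every row must have a non-null value in `column`."""
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    """Values in `column` must be unique across the batch."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def run_quality_gate(rows):
    """Run all checks; return (passed, list of failed check names)."""
    checks = {
        "customer_id is never null": check_no_nulls(rows, "customer_id"),
        "customer_id is unique": check_unique(rows, "customer_id"),
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)

batch = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
]
passed, failures = run_quality_gate(batch)
```

Automating checks like these is what lets downstream self-service users trust data they did not prepare themselves.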
DataOps culture revolves around simplifying and shifting the power of data integration and management toward self-service. As a result, business and end-users can execute tasks without relying on IT intervention, some of which is done through automation.
Various analysts have pointed to a generational change in digital transformation and the use of data. Adoption is driven by tech-ability, not age: the ability to do a little coding, not full-on programming. In a tech-friendly world, data accessibility is not limited to data scientists. DataOps creates a culture of self-service, not requests.
Data Agility Leads to Business Agility
DataOps cuts out friction by empowering more people throughout the organization to tap into the data pool. They can access, inspect and contextualize the data available without relying on IT. As a result, organizations gain greater visibility into the insight they need to solve tactical and strategic problems that span different functions and departments across an enterprise.
Build a Modern DataOps Practice with Lumada
Manish Jain is Vice President, Product Management, Lumada Software at Hitachi Vantara.
[1] Smarter with Gartner, "Gartner Top 10 Trends in Data and Analytics for 2020," Laurence Goasduff, October 19, 2020. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Manish leads product management for the data operations and IIoT product suites, including data integration, governance, quality and compliance. He has more than 20 years' experience building and driving global go-to-market (GTM) strategies for the adoption of products and platforms.