Frequently Asked Questions
Data discovery refers to the use of advanced algorithms to analyze data and detect patterns that would otherwise go unnoticed. It is about seeing the larger picture across multiple data sources, sometimes hundreds of in-house and third-party sources. The resulting insights are then translated into better decision-making and business strategy.
At its best, data discovery automatically discovers data sources in an organization’s data environment, methodically and algorithmically sifting through databases and files to uncover specific predefined patterns and keywords laid out by classification and identification rules. This method is increasingly important in the face of the massive volumes of structured and unstructured data generated in many business cases.
Data discovery is the process of identifying data sources within an organization's changing data environment. By leveraging automation, and often cloud-based systems that support data environments, data discovery becomes a foundational aspect of agile businesses. A data discovery platform makes business operations transparent and sets up future success by creating a hub for other data innovations.
Data classification is a step that follows data discovery: data is identified and categorized by type using pattern and keyword rules that apply labels to the identified data. For example, in the health industry, medical ID patterns are used to identify and categorize data.
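As a rough illustration of rule-based classification, the sketch below applies pattern and keyword rules to text records and returns the labels that match. The rule names, the `MRN-` medical ID format, and the sample records are all hypothetical and chosen only for illustration; real classification rules would be defined by the organization or its discovery platform.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    label: str           # label applied when the pattern matches
    pattern: re.Pattern  # compiled pattern or keyword expression


# Hypothetical rules: the formats below are illustrative, not real standards.
RULES = [
    Rule("medical_id", re.compile(r"\bMRN-\d{6}\b")),
    Rule("email", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    Rule("diagnosis_keyword", re.compile(r"\b(diagnosis|prescription)\b", re.IGNORECASE)),
]


def classify(text: str) -> set[str]:
    """Return the set of labels whose rule matches the text."""
    return {rule.label for rule in RULES if rule.pattern.search(text)}


records = [
    "Patient MRN-004213 seen on 2021-03-02",
    "Contact: jane.doe@example.com",
    "Quarterly revenue grew in the retail segment",
]

for record in records:
    print(sorted(classify(record)), "<-", record)
```

A record can carry more than one label, which is why `classify` returns a set instead of a single category.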
Data discovery has multiple purposes and serves data stakeholders in different ways. In every case, the goal is a more accurate and complete picture of the organization as a whole, along with insight into the operational side of the business.
For businesses, the iterative data discovery process helps extract valuable insights from several data streams and centralize them so that top leadership can make better strategic decisions.
For data users, data discovery and data sharing give multiple user tiers access to the insights relevant to their operations, out of all the insights produced. This means each department can view and analyze data specific to its needs without being bogged down in searching, cleaning, and preparing data.
Technically, data discovery is the process of consolidating raw data from multiple sources, each of which may be fundamentally different, for example when structured and unstructured data are combined. Because of the significant volumes of data involved, organizations rely on smart data discovery tools to digest operational information and visualize it. Popular data visualizations include graphs, charts, tables, maps, infographics, and dashboards.
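To make the idea of consolidation concrete, here is a minimal sketch that folds one structured source (CSV rows) and one unstructured source (free-text notes) into a single list of records with a common shape. The file names, the sample data, and the `source`/`kind`/`fields` record layout are assumptions made for this example; an actual platform would define its own schema.

```python
import csv
import io


def from_csv(name, text):
    """Turn each CSV row into a record with the common shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"source": name, "kind": "structured", "fields": dict(row)} for row in reader]


def from_text(name, text):
    """Treat each non-empty line of free text as one record."""
    return [
        {"source": name, "kind": "unstructured", "fields": {"text": line.strip()}}
        for line in text.splitlines()
        if line.strip()
    ]


# Hypothetical sample sources, inlined so the sketch is self-contained.
orders_csv = "order_id,amount\n1001,25.50\n1002,14.00\n"
support_notes = "Customer asked about order 1001\n\nRefund issued for a damaged item\n"

catalog = from_csv("orders.csv", orders_csv) + from_text("support_notes.txt", support_notes)

# Count records by kind to confirm both sources landed in one collection.
counts = {}
for rec in catalog:
    counts[rec["kind"]] = counts.get(rec["kind"], 0) + 1
print(counts)
```

Keeping the original source name on every record means later analysis can still trace an insight back to where the data came from.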
Generally, data discovery follows a five-step process that takes in raw data and produces valuable insight.
- Connect And Blend Data Sources — Enumerating all data sources is the starting point. This includes listing and understanding the measurements and metrics to be collected. Typically, the data is ingested and stored in a data warehouse, where disparate data types can be merged. This stage is where discovery occurs.
- Clean And Prepare Raw Data — Raw data is typically not readable as collected, and processing prepares it for analytical use. In this stage, raw data is cleaned and standardized or normalized in order to detect and remove errors, distortions, and corruptions. The data is also brought into alignment, for example by converting values to consistent units of measure.
- Data Sharing — Data sharing among data stakeholders is a key benefit of data discovery. At this level, analysis has not yet occurred, and data can be categorized into informational domains that benefit different levels of the organization. Sometimes this takes the form of a data mart, which is like a data warehouse but dedicated to a single data domain. Teams can then access the data most relevant to their day-to-day work and planning.
- Analyze And Develop Business Insights — At this point, management teams and data scientists can access and analyze the data sets relevant to their needs. To do so, teams deploy data discovery tools that can perform distributional analysis, predictive analytics, or market basket analysis. Depending on the need, custom analysis is often performed as well.
- Visualize Insights — Data visualization typically goes hand in hand with data analysis and supports the main aim of data discovery: to deliver insights as clearly and quickly as possible. From this point, teams can develop action plans and react to these insights with confidence.
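Two of the steps above, cleaning raw data and market basket analysis, can be sketched in a few lines. This is a minimal example under stated assumptions: the sample baskets are made up, cleaning is limited to normalizing item names and dropping empty entries, and the analysis computes only pair support (the share of baskets containing both items), which is one simple market basket measure and not a full association-rule method.

```python
from collections import Counter
from itertools import combinations


def clean(raw_baskets):
    """Normalize item names and drop empty entries and empty baskets."""
    cleaned = []
    for basket in raw_baskets:
        items = {item.strip().lower() for item in basket if item and item.strip()}
        if items:
            cleaned.append(items)
    return cleaned


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for items in baskets:
        counts.update(combinations(sorted(items), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}


# Hypothetical raw purchase data with inconsistent casing and stray blanks.
raw = [
    ["Bread", " milk "],
    ["bread", "Milk", "eggs"],
    ["eggs", ""],
    [],
    ["BREAD", "milk"],
]

baskets = clean(raw)
support = pair_support(baskets)

# List pairs from most to least frequent, breaking ties alphabetically.
for pair, share in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, round(share, 2))
```

Without the cleaning step, "Bread", "bread", and "BREAD" would be counted as three different items and the pair counts would be wrong, which is why preparation comes before analysis.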
There are many tools and vendors to assist in data discovery and analysis, but the data discovery process began as a manual one. Today there are two types of data discovery processes: manual data discovery and smart data discovery. Going forward, more and more businesses will likely use smart data discovery.
Manual data discovery — As the name suggests, this is the manual, tedious process in which people discover patterns within data sets. It typically requires a highly qualified and trained data technician, commonly given the title “data steward”. These caretakers of the data would have to manually map, prioritize, and prepare data for analysis, including creating and categorizing metadata, documenting rules and standards, and ultimately conceptualizing the entire data strategy and the company's data models.
Smart data discovery — The idea of a data steward has evolved with the advent of modern automated data processing. Today, AI and machine learning have augmented data discovery, making data more robust, accurate, and usable while removing many human-produced errors. With smart data discovery, the data steward's role has shifted toward ensuring the fitness of data and data governance.
Smart data discovery is a popular term for AI and machine learning advancements in data discovery. Before machines could perform data discovery, these tasks were conducted manually by data stewards. Once AI capabilities in the data discovery domain reached a tipping point, Gartner identified a new category of business intelligence software capable of dramatically improving how company data is organized and of discovering sensitive data, which can then be secured and brought into compliance with regulations.
Gartner defines smart data discovery: “Automatically finds, visualizes and narrates important findings such as correlations, exceptions, clusters, links and predictions in data that are relevant to users without requiring them to build models or write algorithms. Users explore data via visualizations, natural-language-generated narration, search and natural-language query technologies.”
A data discovery platform (sometimes called a sensitive data discovery platform), such as Hitachi Content Intelligence, provides a complete set of data tools that use advanced analytics to detect deep patterns within disparate data sources. These patterns are put into context using other relevant systems and then visualized for data users, or otherwise presented through clear delivery methods such as dashboards, charts, and tables, to clarify the underlying business insights.
These platforms include the following features:
- Automated data discovery tools
- Real-time data monitoring and sensitive data discovery algorithms
- Contextual search functions and other metadata search functions
- Compliance functions that enable organizations to adhere to industry regulatory standards (GDPR, CCPA, HIPAA, etc.)