Frequently Asked Questions
Data lakes are larger repositories than data warehouses, offering the greatest ease and capacity for storing nearly any data format. A data lake is often the first repository in a data stack, receiving the influx of raw, semi-structured, and structured data produced by a company's applications and infrastructure and acting as the organization's central data repository.
Because the speed and volume of data growth keep accelerating in our digital world, and are expected to accelerate further as IoT connects more devices than ever to the Internet, data lakes were created to handle the massive job of rapidly ingesting and storing diverse collections of Big Data sets.
Functionally, data lakes store data differently than other repositories, foregoing the data preparation and analysis steps that data warehouses perform. Because data lakes do not analyze data before storage (with emerging exceptions, as newer technology adds advanced analytics features to data lakes), they do not bother with structuring data either; they simply store the data in its native format, which speeds up ingestion.
Data lakes use flat architectures and Object Storage rather than the hierarchical file systems found in data warehouses. Object Storage tags each piece of data with metadata and a unique identifier, making it easy to retrieve later. This follows the schema-on-read principle: data is stored with no predefined schema, and structure is applied only when the data is read. Data warehouse analytics systems can then dip into the lake and pull out the desired data, which is parsed, fitted to a schema, and moved to the data warehouse to be analyzed, refined, and combined with other data sets.
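As a rough illustration of how Object Storage pairs data with metadata tags and a unique identifier, here is a minimal Python sketch using the AWS boto3 client; the bucket name, key layout, and metadata fields are hypothetical.

```python
import uuid
import boto3

s3 = boto3.client("s3")

# Store a raw file in its native format, tagged with user-defined
# metadata and a unique identifier. No schema is imposed at write
# time; structure is applied later, when the data is read.
object_key = f"raw/events/{uuid.uuid4()}.json"
with open("clickstream_batch.json", "rb") as raw_file:
    s3.put_object(
        Bucket="example-data-lake",   # hypothetical bucket
        Key=object_key,
        Body=raw_file,
        Metadata={                    # tags used to find the object later
            "source": "web-app",
            "ingested-at": "2021-06-01T12:00:00Z",
            "format": "json",
        },
    )
```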
Many enterprises gain significant business insights from their data, which they can leverage to gain an edge over competitors. Faced with the rising costs of collecting and processing Big Data sets, and needing to stay ahead, they turn to the advantages of data lakes: open formats, low-cost scaling, and advanced machine learning analytics.
Open formats allow the storage of any type of unstructured, semi-structured, or structured data, so enterprises that struggle to uncover data insights while maintaining operations can simply dump all their data into the lake and sort through it later, since everything is stored in its original form. Likewise, data scientists can return to the data lake at any time and, like an archeological dig, unearth undiscovered insights.
While data lakes can be on-premises, providing centralization and control, many enterprises are moving their data lakes to the cloud for superior flexibility and scalability. And because data is stored in raw formats, enterprises can avoid vendor lock-in, though switching vendors entails moving vast sets of data (petabytes and more) which can be time-consuming.
The raw data in a data lake can be held indefinitely, allowing data scientists to continuously transform it into actionable analytics. To help them sift through the waters, data lakes can be integrated with AI and machine learning solutions that apply analytics across both unstructured and structured data sets. The ability of AI to analyze any and all types of data has become a forward-looking focus for enterprises.
Data lake benefits:
- Open format: store data of any type and format
- Flexible and scalable data storage to grow with consumption
- Perform analytics at any time on stored data, continuously discovering valuable insights
- AI and machine learning integrations
- Eliminate data silos
- Democratize data access through data management platforms
Data warehouses, unlike data lakes, are considered schema-on-write systems, meaning that when data is stored in a data warehouse, it is fitted into a predefined schema, which helps with cataloging and organization. This reflects the fact that data warehouses are designed to carefully prepare data before storage so that analysis can quickly follow.
Though data warehouses cannot store the same volume as data lakes (to try would be exceptionally cost-prohibitive), they are helpful for processing the immediate, critical data metrics that real-time business operations depend on. Oftentimes, enterprises use a data lake as the base of their data stack, connecting it to data warehouses or to other AI and machine learning analytics through their data pipeline.
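To make the schema-on-read versus schema-on-write distinction concrete, here is a minimal PySpark sketch of one pipeline step that reads raw JSON from the lake, applies an explicit schema at read time, and loads the result into a warehouse-style table; the paths, table name, and fields are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("lake-to-warehouse").getOrCreate()

# Schema-on-read: the lake stored raw JSON as-is; a schema is
# applied only now, at read time.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ordered_at", TimestampType()),
])

raw_orders = (
    spark.read
    .schema(order_schema)                         # parse raw files into a known shape
    .json("s3://example-data-lake/raw/orders/")   # hypothetical lake path
)

# Schema-on-write: the warehouse table enforces this structure at load time.
raw_orders.write.mode("append").saveAsTable("warehouse.orders")
```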
Data lakes are broader data repository systems, with data ingestion a higher priority than data analysis. Though analytics around data lakes is still developing, data lakes are highly inclusive: they accept all data types, support all users, and are easy to adapt. Because of these characteristics, data lakes potentially hold the deepest business insights. The challenge in drawing out those insights stems from the very characteristics that enable them: so much data, in such diversity, takes time to process and analyze.
In contrast, data warehouses standardize data formats at ingestion so that insights about domain-specific channels, such as marketing performance or account billing, can be delivered on time. Conceptually, data warehouses trade data scope for data refinement relative to data lakes.
Many of the top cloud vendors also offer leading data lake solutions. When choosing a data lake, ask:
- What are your use cases for the data lake? It’s important to know how the data lake will be used before deploying one; understanding your use cases makes it obvious which features to include.
- Cloud or on-premises platform? Many data lakes are deployed in the cloud because of its scalability. If your use cases include sensitive data, on-premises may suit your security processes better.
- Open-source or proprietary? Open-source solutions are normally less expensive but require a greater depth of technical knowledge. Proprietary systems may better fit your use cases but be more expensive to maintain and develop.
- Self-managed or third-party managed? As with proprietary systems, self-managed systems, even in the cloud, require expertise in the vendor’s systems and the time to manage them. A managed data lake, on the other hand, reduces those time costs to a line-item cost; the challenge then becomes finding the right partner.
The top cloud data lake solutions in 2021 are:
- Amazon Web Services — AWS makes it easy to securely set up a data lake on top of its core services to meet client data lake needs.
- Microsoft Azure Data Lake — Azure data lake supports big data sets and can work with existing IT investments.
- Databricks Unified Analytics Platform — Named the Lakehouse platform, Databricks combines its expertise in data warehouse management with the flexibility and low costs of data lakes.
- Google Cloud Data Lake — Google brings its suite of tools to data lakes, like Google BigQuery, which is designed for data warehouse performance but is equally applicable to data lakes.
- Cloudera Data Platform — Cloudera is a hybrid cloud platform capable of working with existing IT infrastructure and vendors to seamlessly connect multiple data stores.
Data lakes have the potential to become a fundamentally critical piece of many enterprises’ IT makeup. Despite the advantages that draw companies to data lakes, the technology is still emerging and has challenges yet to be overcome. Most of these stem from the fact that data resides in a single morass of types and sets that muddies reliability, performance, security, and data governance. The result is referred to as a data swamp, and it develops from:
- Reliability and Visibility Issues — Because data lakes are first repositories with little or no content oversight, the data must be made usable. Data lakes are heterogeneous and, without proper tools, difficult to categorize. Depending on the setup, syncing the main repository to local data sources can result in inconsistent data.
- Slow Performance — Data lakes are meant to grow, and fast. As they do, system performance decreases. Data duplication also mounts as more analysis is applied to the lake, replicating data during the search for insights. Partitions and metadata management establish “road signs” that help keep the lake navigable (see the partitioning sketch after this list).
- Data Security — The data swamp is difficult terrain to secure. Data lake architecture inherently lacks the fine-grained access controls found in other enterprise systems, because Object Storage does not allow clear data segmentation. A single file object may contain huge amounts of unstructured or raw data, so granting access to one object can expose sensitive data to unsanctioned users. Approaches to securing data lakes include built-in IAM controls, which are difficult to implement; partitioning certain data into staging areas within the lake and granting access only to those areas (see the access-control sketch after this list); and using high-level data tools with access controls and analytics.
- Governance Compliance — The latest challenge to implementing data lakes is data governance, as laid out in the EU’s General Data Protection Regulation (GDPR). While there is no equally broad law in the United States, several regulate data within specific industries, like healthcare, and at the state level; see the California Consumer Privacy Act (CCPA). The trend points toward greater data regulation, and in short, data lakes are poor repositories for sensitive data like personally identifiable information (PII) or protected health information (PHI). These laws may instead require sensitive data to be stored on-premises, separate from the data lake.
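As an illustration of those “road signs,” here is a minimal PySpark sketch that writes lake data partitioned by ingest date and source, so later queries can skip irrelevant files entirely; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-partitioning").getOrCreate()

# Read a slice of raw data from the lake (hypothetical path).
events = spark.read.json("s3://example-data-lake/raw/events/")

# Partitioning lays down "road signs": files land under
# .../ingest_date=YYYY-MM-DD/source=web-app/ prefixes, so queries
# filtering on these columns touch only the matching partitions.
(events.write
    .partitionBy("ingest_date", "source")
    .mode("append")
    .parquet("s3://example-data-lake/curated/events/"))
```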
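And as a rough sketch of the staging-area approach to access control, this boto3 example attaches a bucket policy granting an analyst role read access only to a staging/ prefix rather than the whole lake; the bucket name, account ID, role, and prefix are assumptions for illustration.

```python
import json
import boto3

s3 = boto3.client("s3")

# Allow the analyst role to read objects in the staging area only,
# leaving the rest of the lake off-limits.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/analyst"},  # hypothetical
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-data-lake/staging/*",
    }],
}
s3.put_bucket_policy(Bucket="example-data-lake", Policy=json.dumps(policy))
```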
The journey to build your data lake could take anywhere from three months to implement basic functionality to a year to add advanced analytics and machine learning, even with a leading cloud provider like AWS. The following best practices, applied during all phases of data lake design and operations, can help prevent future challenges.
- Duplicate Data, but Smartly — Data lakes are designed to store unheard-of volumes of data. While duplicate data does slow performance, the trade-off is ease versus cost, which is counterintuitive for database users trained on systems where storage is precious. In a data lake, historic data can be processed and the results stored alongside the originals, offering analysts both views at any time. Data lake storage is inexpensive, so don’t be scared to duplicate if it suits your needs.
- Establish Retention Policies — Data lakes store data cheaply, which makes this another counterintuitive best practice: set limits for retaining specific data. While data storage is cheap, it is not free, and regulations that protect sensitive information, in a way, target that information for deletion. PII and associated personal data may remain relevant to a company for many years, but at some point it may not, at which time deleting those archives can prove beneficial both for security and for cost savings (a lifecycle-rule sketch follows this list).
- Know Your Tributaries — Data swamps form when organizations treat their data lakes as if they will remain pristine no matter what is dumped inside. This is not a sustainable practice. Data lakes ingest flows of unstructured data, but unstructured data does not need to be disorganized. Understanding the data flowing into your lake can save on both processing and security. Some tools buck the schema-on-read idea and discover schema on ingestion, helping companies organize and keep their data lakes clean (a validation sketch also follows this list).
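As a minimal sketch of a retention policy, this boto3 example sets an S3 lifecycle rule that expires objects under a pii/ prefix after roughly seven years; the bucket name, prefix, and retention window are illustrative assumptions, not recommendations.

```python
import boto3

s3 = boto3.client("s3")

# Expire sensitive archives automatically once their retention window ends.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-pii-after-7-years",
            "Filter": {"Prefix": "pii/"},     # hypothetical prefix for sensitive data
            "Status": "Enabled",
            "Expiration": {"Days": 7 * 365},  # delete ~7 years after creation
        }],
    },
)
```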
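And as a rough sketch of discovering structure at ingestion, this Python snippet checks each incoming JSON record against an expected shape before it lands in the lake, so malformed data can be quarantined early; the field names and types are hypothetical.

```python
import json

# Hypothetical expected shape for one incoming feed.
EXPECTED_FIELDS = {"order_id": str, "amount": (int, float), "ordered_at": str}

def check_record(line: str) -> bool:
    """Return True if the record parses and matches the expected shape."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return all(
        name in record and isinstance(record[name], ftype)
        for name, ftype in EXPECTED_FIELDS.items()
    )

with open("incoming_batch.jsonl") as batch:
    good = [line for line in batch if check_record(line)]
# Route `good` records into the lake; quarantine the rest for review.
```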
Data is ubiquitous, and how we choose to use it makes it valuable or simply clutter. The main use case of data lakes is to rapidly ingest and store real-time streaming data and batch-processed data, in any format, and secondarily to perform analytics on diverse data sets (a small ingestion sketch follows the list below). To that end, large multinationals, manufacturers, municipalities, and other organizations have leveraged data lakes for many business uses:
- Advanced Analytics Support
- Application Support
- Archival and Historical Data Storage
- Augment Data Warehouse Storage
- Business Analysis
- Distributed Processing
- Experimental Analysis
- Lambda Architecture Support
- Preparation for Data Warehousing
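To illustrate that rapid-ingestion use case, here is a minimal sketch that lands a micro-batch of streaming events in the lake under a time-based prefix, in their native JSON form; the bucket name and event source are hypothetical.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def land_events(events: list) -> None:
    """Write a micro-batch of events to the lake in native JSON form."""
    now = datetime.now(timezone.utc)
    # Time-based prefixes keep arrivals organized without imposing a schema.
    key = f"raw/clickstream/{now:%Y/%m/%d}/{now:%H%M%S%f}.jsonl"
    body = "\n".join(json.dumps(event) for event in events)
    s3.put_object(Bucket="example-data-lake", Key=key, Body=body.encode())

land_events([{"user": "u1", "page": "/home"}, {"user": "u2", "page": "/cart"}])
```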