Data Lake: Buzzword or Useful Concept?

Data Quality Data Management
Posted November 27, 2017
8 min read

The data lake – it’s a phrase that’s thrown around a lot right now, but is it just an empty buzzword, or does it actually bring real value?

Well, there are certainly some misconceptions around the concept of data lakes. The biggest of these is "we can keep everything because storage is cheap" – which also happens to be the overarching idea behind data lakes.

However, if you don’t manage your data lake or have the right governance, skills and processes to get value out of it, it can be more bad news than benefit. But how can you avoid turning your data lake into a hindrance?

Let’s take a look at some points to be aware of when managing your data lake.

A note on the difference between data and information

In the following paragraphs I’ll be referring to data and information. There is a seemingly subtle but extremely important difference between the two. To save you some internet searching: data is an information carrier, and a piece of data usually carries more than one piece of information. Keep in mind that people are ultimately interested in the information data carries, not the data itself. Imagine a sensor emitting its current state every second. There are potentially two pieces of information attached to this datum – its state, obviously (which gains significance only when it changes), and the sheer existence of the datum, which suggests the sensor is working.


The similarities between a data lake and an actual lake

If you think of a data lake as an actual lake, it'll help you understand some of the ideas around it.


Let’s lay out the landscape first. We have a nice pure lake (our data lake), with several creeks/streams (our data sources) feeding it with water (data). At this very moment, our lake is filled with pure mountain water so we can even see the bottom of it, leaving people (data scientists) observing its banks and tributaries (metadata and data sources) or diving directly into it (making analysis).

This parallel continues further. As with this beautiful mountain lake, only good swimmers and experienced divers (analysts and data professionals), should be allowed in. Their safety, and the preservation of the lake’s ecosystem, water quality and long-term sustainability are all very real concerns.

The same goes for a data lake. As beautiful and tempting as it is to dive into all that data, you need to understand how to interpret what you’re swimming in and refrain from polluting the data lake in any way to keep it sustainable. On top of that, there are regions of the lake where only authorized personnel are allowed access, due to further safety (security access) and privacy (anonymization, data encryption, etc.) regulations.

No swimming in the data lake

Take GDPR — a stringent set of regulations impacting any organization doing business in the EU.

With sky-high sanctions, GDPR imposes strict requirements for both governance of personal data and communicating transparency around storage and processes that control the data. Pouring such data into a lake means it can’t be a free-for-all, and there’s a need to restrict swimmers (users) from accessing whatever they want at any time. There may be restrictions around what data can be stored, in what format, and in what combinations. Ultimately, data owners need to be accountable for what they’re putting into the data lake, who is swimming in it, and for what purpose.
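To make this concrete, restricting swimmers can start as something as simple as a default-deny policy table mapping lake zones to the roles allowed to read them. This is only a sketch; the zone and role names below are hypothetical:

```python
# Hypothetical zone and role names, purely illustrative.
POLICIES = {
    "raw/personal": {"data-steward", "compliance"},     # GDPR-restricted zone
    "curated/anonymized": {"data-steward", "analyst"},  # safe for general analysis
}

def can_read(role: str, zone: str) -> bool:
    """Default-deny: a role may read a zone only if explicitly granted."""
    return role in POLICIES.get(zone, set())

print(can_read("analyst", "raw/personal"))        # False
print(can_read("analyst", "curated/anonymized"))  # True
```

Real deployments would push this into the platform's access-control layer, but the default-deny principle is the part that matters.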

Where did your water come from?

Data lineage defines the source of your data, where it’s been and what’s happened to it on the way. As much as this could be a topic for a separate article, it’s absolutely important to know your creek’s origin and how exactly it got to the lake.

Does it flow past a chemical plant (has it been polluted on its journey)? How much mud (data carrying no or insignificant information) your creek brings into the lake is also a valuable indicator. This will determine how much and what kind of maintenance your data lake might need, and whether the creek is even trustworthy. (You’d probably drink from a mountain spring, but not so much from water at a freight port.) If you allow too much garbage into your data lake, people will eventually stop using it or, even worse, ignore the muddy waters but fail to notice those seemingly crystal-clear creeks that are polluted with invisible toxic chemicals.
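One lightweight way to keep track of a creek’s origin is to stamp every record with its source and the processing steps it passed through on the way to the lake. A minimal sketch – the field and step names are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageStamp:
    """Provenance attached to every record entering the lake."""
    source: str  # the "creek" the data came from
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    steps: list = field(default_factory=list)  # transformations applied en route

    def add_step(self, step: str) -> "LineageStamp":
        self.steps.append(step)
        return self

stamp = LineageStamp(source="crm-export")
stamp.add_step("deduplicated").add_step("pii-masked")
print(stamp.source, stamp.steps)  # crm-export ['deduplicated', 'pii-masked']
```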


Monitoring data quality

This leads on to another important point — monitoring the data flowing into the lake. This obviously helps as an early warning to protect the quality of your lake, but there’s a broader and deeper benefit.

Read more: Guide to Data Quality

Receiving corrupted or incomplete data could suggest something is wrong with an upstream system. In a sense, the lake – as a central hub – should act as a health-monitoring facility or internal standards authority for all connected systems.

We’ve seen instances where analysis of data loading process logs has helped to identify low-performing branches of a large organization, just by revealing the above-average occurrence of errors in the data they were providing to the central hub. This “meta information” can be just as important and transformative as the data itself.
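An early-warning check along these lines can be as simple as comparing each source’s error rate in the load logs against the overall average. A rough sketch, with a hypothetical log format and an arbitrary threshold:

```python
from collections import Counter

def flag_noisy_sources(load_log: list[tuple[str, bool]],
                       threshold: float = 1.5) -> list[str]:
    """Flag sources whose error rate exceeds `threshold` times the
    overall average. `load_log` holds (source, had_error) entries
    emitted by the ingestion process."""
    totals, errors = Counter(), Counter()
    for source, had_error in load_log:
        totals[source] += 1
        if had_error:
            errors[source] += 1
    overall = sum(errors.values()) / sum(totals.values())
    return [s for s in totals if errors[s] / totals[s] > threshold * overall]

log = ([("branch-a", False)] * 98 + [("branch-a", True)] * 2 +
       [("branch-b", False)] * 80 + [("branch-b", True)] * 20)
print(flag_noisy_sources(log))  # ['branch-b']
```

Here branch-b’s 20% error rate stands out against the 11% overall average, while branch-a’s 2% passes unremarked – exactly the kind of signal that pointed to the low-performing branches mentioned above.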

The data swamp

In the worst case, you could end up with a data swamp (or shall we call it a “data dump”?). What is that?

Well, if your data lake is unmaintained, with no trace of where its contents have come from and with unreliable or absent controlling mechanisms, your lake will suffer pollution and you will be unable to navigate it. There’s a much higher chance of someone getting hurt or even drowning if the banks of your lake are uncultivated and left to the wilderness. Thinking of a data lake as just a place to put data, without plans and processes for proper treatment, can quickly waste your investment.


What’s in your data lake?

Assuming you’ve monitored the quality of your incoming water (or data), and you’ve fixed any problems, how do you go about retrieving the parts you want out again?

The big advantage of a data lake (as opposed to a data warehouse, for example) is that you don’t have to spend too much time upfront organizing and structuring your data. However, you do need to have something to organize it with, otherwise it’ll become a mess.

Even simple tagging of the data’s arrival day, time and source can be an enormous help. But imagine what you could do if there were also more general information about the data’s content on top of this. The benefit of a data lake – being able to sift through what you have and find what you need – becomes much more achievable when some basic measures are put in place to track what’s in your lake.
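Such tagging can be as simple as a storage-path convention that partitions incoming files by source and arrival date, so later searches can prune by origin and day without opening the data itself. A sketch, assuming a hypothetical layout:

```python
from datetime import date
from pathlib import PurePosixPath

def lake_path(source: str, arrival: date, filename: str) -> PurePosixPath:
    """Place each file under source/date partitions so queries can
    narrow down by origin and arrival day from the path alone."""
    return (PurePosixPath("lake")
            / f"source={source}"
            / f"date={arrival.isoformat()}"
            / filename)

print(lake_path("billing", date(2017, 11, 27), "invoices.csv"))
# lake/source=billing/date=2017-11-27/invoices.csv
```

The `source=…/date=…` scheme mirrors the partition-style layouts common in lake storage; richer content metadata would typically live in a separate catalog.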

Lifeguards on duty

Since a data lake contains just about all the data you have, it is also a great potential risk, and requires carefully defined rules to manage that risk.

Remember when we talked about GDPR? The level of data governance you need can vary depending on your circumstances. Going back to our actual lake: sometimes we’d be fine with just a lifeguard ensuring the swimmers and visitors aren’t doing anything they shouldn’t be. Other times, a perimeter fence and guard dogs are required. The point is to fulfil legal and regulatory restrictions as well as internal guidelines.

What’s more, security also means making sure people are considerate and respectful of each other. The lifeguard’s duty is to make sure one person does not bully others and hog the whole lake for themselves. Sometimes, provisioning a smaller pool with samples of the water – a sandbox – can be a way to test complex hypotheses requiring multiple iterations without affecting the work of other people.


Are data lakes for everybody?

In a word – no. The ‘data lake’ is a great buzzword, but as we’ve seen, it’s not the magic bullet to solve all your data problems. You need to put the right processes, tools and governance in place to make the data lake work for you. And, crucially, you need to have the right skills to be an effective user of the data lake.


We’re going to leave the lake analogy for a minute and focus on a cooking analogy instead (apologies for this – you’ve probably already guessed this is a very analogy-heavy blog post). You might have all the ingredients and a book of recipes, but if you lack the skill or drive to be a good cook, you probably won’t make a great meal.

Hoarding data is like stockpiling ingredients in your pantry. You’re not going to get a single meal if you don’t step up and, well, start cooking.

Can the ‘data lake’ be a useful concept?

Yes, the data lake can be a powerful concept – but only if the right effort is put in to get the results out. To continue our cooking analogy: if you really want to cook, and especially if you want to cook exotic, never-tried-before meals, a data lake is the way to go. It’s a giant pantry that lets you store all sorts of ingredients for your experimental cooking. But if you’re NOT into cooking, hoarding ingredients won’t fill your belly with delicious meals.

Articles that oversimplify the situation with phrases like “keep data and figure it out later” or “self-service and on-demand access” only deepen the problem; they are likeable but unrealistic. I’m quite a fan of the article “Are Data Lakes Fake News” by Uli Bethke, which challenges these and some other claims.

Still thinking about employing a data lake? Good. Just remember to keep track of your data, mind your data governance, catalogue data from day one, and keep people without enough skills away.

Above all, remember that a data lake is not a magical place where all data problems are going to be solved in the blink of an eye.

Read more about Data Architecture

Editor's note: This blog has since been updated in 2019 to help streamline the content and offer more value. Enjoy!

