Understand the differences between data lakes, data warehouses, and data marts, and how they can meet your cloud data storage and analysis needs.

This is the fourth article of our Big Data in the Cloud series.

To get notified of our future Big Data in the Cloud posts, click here to sign up for our email list.

In our last post, we discussed which factors to consider when choosing between SQL and NoSQL for your primary database. In this post, we’ll continue with the cloud storage theme and discuss additional infrastructure: data lakes, data warehouses, and data marts. Oh my!

Keep in mind that as we discuss all of these cloud storage options, none of these are mutually exclusive. This means that databases, data lakes, data warehouses, and data marts can all co-exist in your cloud architecture, and many times it’s best to design it this way.

Let’s go!

Data lakes

A data lake is a repository of structured data (relational data), semi-structured data (CSV or JSON files), and unstructured data (machine and sensor data), all stored in raw, as-is form until it is needed. The term “Data Lake” was coined by James Dixon, the founder of Pentaho, a data integration and analytics platform.

Data lakes have been growing in popularity, frankly, because companies just need a place to quickly and easily store their massive amounts of data until they figure out what to do with it. And they need to store it at a low cost.

Data in a data lake can come from pretty much anywhere, inside and outside of your company. A data lake can store data from your transactional systems, social networks, sensors, devices, and much more. And you can give your customers access to your data lake so they can share files and data with you.

An example of a data lake – image courtesy of AWS

You can scale data lakes easily and at a reasonable cost. You don’t need to structure the data in any way until you need to use it (aka schema-on-read), which makes it very efficient to store data.
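To make schema-on-read concrete, here is a minimal Python sketch. The sample records, the field names (`device`, `temp_c`), and the `read_temperatures` helper are all made up for illustration; a real data lake would hold files in cloud storage rather than a Python list. The point is only that nothing is validated when the data lands, and structure is applied when someone reads it.

```python
import json

# Raw records exactly as they might land in the lake: one JSON string each,
# with no agreed-upon structure (hypothetical sample data).
RAW_LINES = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2020-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": 19, "extra": "ignored"}',
    '{"user": "alice", "action": "login"}',
]


def read_temperatures(lines):
    """Apply a schema at read time: keep records that have device and temp_c."""
    readings = []
    for line in lines:
        record = json.loads(line)
        if "device" in record and "temp_c" in record:
            readings.append(
                {"device": str(record["device"]), "temp_c": float(record["temp_c"])}
            )
    return readings


readings = read_temperatures(RAW_LINES)
print(readings)
```

A different analysis could read the same raw records with a completely different schema, for example pulling out only the login events, without anything in storage having to change.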

A data lake can be a conglomeration of different technologies, including but not limited to:

  • SQL and NoSQL databases
  • Cloud storage such as Amazon S3, Azure Storage, and Google Cloud Storage
  • Hadoop Distributed File System (HDFS)

A data lake is like the pantry where you keep all of your food until you need to cook or eat it. You might have some fruits, vegetables, pasta, cookies, cereal, sugar, spices, and much more in there. And they’re stored on shelves, in baskets, or other containers and will sit there until you’re hungry.

Because the data is uncurated and may come from data sources that are outside of your company’s operational systems, your typical business analyst likely won’t be able to make use of the data in a data lake. Rather, you may need a data scientist or engineer to curate and transform the data before it can be analyzed.

You do need to be careful about how you use a data lake.

While data lakes provide an unprecedented amount of flexibility in the types of data you can store and how to store it, the data that is stored can quickly become disorganized. This can lead to a couple of issues.

First, if you have a lot of disorganized data, it can be tough to find the data that you need to perform analyses. This can render all of this valuable data useless. While you don’t need to formally structure the data, you will need to create a cataloging system that your team can refer to when tracking down the data that’s needed.
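As a rough idea of what a cataloging system records, here is a small Python sketch. The `CatalogEntry` fields, the dataset names, and the storage paths are hypothetical; in practice you would more likely use a managed catalog service than roll your own, but the information tracked is similar.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Describes one dataset in the lake without changing the data itself."""
    name: str
    location: str      # where the raw files live (hypothetical paths below)
    data_format: str   # e.g. "json" or "csv"
    owner: str         # team to ask about the data
    tags: set = field(default_factory=set)


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def find_by_tag(self, tag):
        """Return the names of all datasets carrying the given tag, sorted."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)


catalog = DataCatalog()
catalog.register(CatalogEntry("web_clicks", "lake/raw/web_clicks/", "json",
                              "marketing", {"web", "events"}))
catalog.register(CatalogEntry("sensor_readings", "lake/raw/sensors/", "csv",
                              "operations", {"iot", "events"}))
catalog.register(CatalogEntry("orders", "lake/raw/orders/", "csv",
                              "finance", {"sales"}))

print(catalog.find_by_tag("events"))
```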

Also, a disorganized data lake can be a security issue. With data coming from internal and external sources, data lakes are often at the center of many different data technologies, with many users having access to them. Thus, the attack surface of a data lake is very large, which makes it susceptible to attack. So you’ll need to understand how your data lake will be used, outline which applications and users will have access to it, and create policies to maintain the security of your data lake.
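One simple way to think about such a policy is as an explicit list of who may do what, with everything else denied. The sketch below is illustrative only: the role names, the zone names (`raw`, `curated`, `inbound`), and the `is_allowed` helper are assumptions, and a real deployment would rely on your cloud provider’s access controls rather than application code like this.

```python
# Hypothetical policy: which roles may perform which actions in which lake zones.
POLICY = {
    "data_engineer": {"raw": {"read", "write"}, "curated": {"read", "write"}},
    "data_scientist": {"raw": {"read"}, "curated": {"read"}},
    # External parties may drop files into an inbound area but not read anything.
    "external_partner": {"inbound": {"write"}},
}


def is_allowed(role, zone, action):
    """Deny by default; allow only what the policy explicitly grants."""
    return action in POLICY.get(role, {}).get(zone, set())


print(is_allowed("data_scientist", "raw", "read"))
print(is_allowed("external_partner", "raw", "read"))
```

Writing the policy down in one place, in whatever form, also makes it easier to review who has access as the lake grows.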

If you don’t address the issues above, your data lake can quickly turn into a data swamp – a gross, useless mess. Sorry, Florida, swamps are not cool.

Yucky swamp – image courtesy of KatVitulano Photos on Flickr

Data warehouses and data marts

Data warehouses are similar to data lakes in that they aggregate data from multiple sources. But the big difference is that this data is organized and structured before being stored (schema-on-write), and thus is readily available for analysis by business analysts and other analytics professionals.
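Here is a minimal sketch of schema-on-write, using Python’s built-in `sqlite3` module purely as a stand-in for a warehouse (SQLite is not a data warehouse). The `sales` table, its columns, and the `load_row` helper are hypothetical. The structure is declared up front, and rows that don’t fit it are rejected at load time instead of being stored as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is defined before any data is stored.
conn.execute("""
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
""")


def load_row(row):
    """Return True if the row fits the schema and was stored, else False."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


incoming = [(1, "east", 120.0), (2, None, 75.0), (3, "west", -5.0)]
results = [load_row(row) for row in incoming]
stored = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(results, stored)
```

Only the first row is stored here; the second is missing a region and the third has a negative amount, so both are refused when they are written.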

Because the stored data is more structured, data warehouses are more rigid and less agile than data lakes.

Storage of data in a data warehouse can be costly, especially if the amount of data is very large. This is because 1) data warehouses are optimized for fast query performance, and 2) free or low-cost open source technologies are often used for data lakes, which is not the case for data warehouses.

Data warehouses are also generally more secure than data lakes. As mentioned before, data lakes aggregate data from multiple internal and external sources and allow access to many different users, which makes them susceptible to security breaches. The more rigid structure and internal-facing nature of data warehouses make them easier to secure.

Furthermore, the more structured, schema-on-write nature of data warehouses makes the data easier for less technical staff to analyze. Business analysts, marketers, and the finance team can more easily work with data in a data warehouse, while data lakes typically require data engineers and scientists for analysis.

Data warehouse vendors include:

  • Amazon Redshift
  • Google BigQuery
  • Azure Synapse Analytics (formerly Azure SQL Data Warehouse)
  • Cloudera
  • Oracle Autonomous Data Warehouse
  • Teradata
  • Snowflake
  • Many others

A data mart is simply a subset of a data warehouse that is highly curated for a specific end user. It might hold customer purchase data for the marketing team to analyze, inventory data for a particular product line, or sales data for the finance team to assess.

Data marts can be built from an existing data warehouse or from other sources of operational data.
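One lightweight way to picture a data mart built from a warehouse is as a curated view over a warehouse table. The sketch below again uses Python’s built-in `sqlite3` module only as a stand-in; the `warehouse_sales` table, the `finance_mart` view, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A (tiny) warehouse table holding sales across all product lines.
conn.execute(
    "CREATE TABLE warehouse_sales "
    "(order_id INTEGER, product_line TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [
        (1, "bikes", "east", 500.0),
        (2, "bikes", "west", 300.0),
        (3, "helmets", "east", 40.0),
        (4, "helmets", "east", 60.0),
    ],
)

# The mart exposes only what the finance team needs: total sales per region.
conn.execute("""
    CREATE VIEW finance_mart AS
    SELECT region, SUM(amount) AS total_sales
    FROM warehouse_sales
    GROUP BY region
""")

mart_rows = conn.execute(
    "SELECT region, total_sales FROM finance_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

In practice a mart is often a separate set of tables or even a separate database, but the idea is the same: a smaller, purpose-built slice of the warehouse.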

Possible data warehouse and mart architecture – image courtesy of TechDifferences

How databases, data lakes, and data warehouses can all work together

We’ve covered SQL and NoSQL databases, data lakes, and data warehouses. As mentioned earlier, these technologies don’t have to be mutually exclusive in your cloud data storage architecture. Rather, they can and often will be integrated to provide everything you need to store your big data in the cloud and use it to improve your business.

For instance, a NoSQL database may be used as part of your data lake. And data in your data lake may be fed into a data warehouse to power frequently-run reports. Then specific data marts can be created from your data warehouse.

In our next post, we’ll review cloud data storage and analytics architectures that we’ve used for our clients. This will show different ways that databases, data lakes, and data warehouses can all work together.

Conclusion

There’s a lot that can go into a cloud data storage architecture.

SQL and NoSQL databases, data lakes, data warehouses, and data marts all play an important role in storing and analyzing your company’s data. Knowing the strengths and weaknesses of each technology will help you design an architecture that turns your data into insights you can use to better serve your customers.

How are you using these cloud data storage technologies? We would love to hear more about your architecture and the tools you’re using. Feel free to post a comment below.

To be notified of future Big Data in the Cloud posts, make sure you sign up for our mailing list below.