Data Lake vs Database vs Data Warehouse

12 May 2019

This has been floating around in my head for a while, so just penning it down for clarity.

Disclaimer: Most of the text below is cut and pasted verbatim from the links provided - full credit to the authors! This is merely a summary page for ease of understanding.


What is a data lake?

A data lake is a storage repository that holds a huge amount of raw data in its original format until it is needed. It can store relational and non-relational data side by side. The structure of the data, or schema, is not defined when the data is captured. This means you can store all of your data without careful upfront design and without needing to know what questions you might want answered in the future.
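
To make "no schema at write time" concrete, here is a minimal local sketch in Python. The "lake" is just a directory, and the paths and field names are invented for illustration - in practice the files would land in HDFS, S3, or similar storage:

```python
import json
from pathlib import Path

# The "data lake" here is just a directory; in practice it would be
# HDFS, S3, or similar object storage.
lake = Path("lake/raw/events")
lake.mkdir(parents=True, exist_ok=True)

# Store raw records as-is, with no upfront schema: different records
# can carry completely different fields.
records = [
    {"user": "alice", "action": "click", "target": "/home"},
    {"user": "bob", "temperature_c": 21.5, "sensor": "kitchen"},
]
for i, record in enumerate(records):
    (lake / f"{i}.json").write_text(json.dumps(record))

# Schema-on-read: structure is imposed only at query time.
raw = [json.loads(p.read_text()) for p in lake.glob("*.json")]
print([r for r in raw if r.get("action") == "click"])
```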


What is a database?

A relational database contains information organized in columns, rows, and tables that is periodically indexed to make relevant information easier to access. A database is optimized to maximize the speed and efficiency with which data is updated (added, modified, or deleted) and to enable fast analysis and data access. Response times from databases need to be extremely quick for efficient transaction processing.

Most databases use a normalized data structure. The more normalized your data is, the more complex the queries needed to read the data because a single query combines data from many tables. This puts a huge strain on computing resources.
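
As a toy illustration of why reads get complex, here is a sketch using Python's built-in sqlite3 module (the schema and names are invented): even a simple question has to join three tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A normalized schema splits data across tables to avoid redundancy.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE items     (order_id INTEGER, product TEXT, price REAL);
    INSERT INTO customers VALUES (1, 'alice');
    INSERT INTO orders    VALUES (10, 1);
    INSERT INTO items     VALUES (10, 'widget', 9.99);
""")
# Even "what did each customer spend?" needs a three-way join.
for row in conn.execute("""
    SELECT c.name, SUM(i.price)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    JOIN items  i ON i.order_id    = o.id
    GROUP BY c.name
"""):
    print(row)
```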


A non-relational database is a database that does not use the tabular schema of rows and columns found in most traditional database systems.


What is a data warehouse?

Data warehouses use Online Analytical Processing (OLAP) that is optimized to handle a low number of complex queries on aggregated large historical data sets. Tables are denormalized and transformed to yield summarized data, multidimensional views, and faster query response times.

The data in a data warehouse does not need to be organized for quick transactions. Therefore, data warehouses normally use a denormalized data structure. A denormalized data structure uses fewer tables because it groups data together and tolerates data redundancies. Denormalization offers better performance when reading data for analytical purposes.
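
Continuing the toy sqlite3 sketch from above, the denormalized version answers the same kind of question with no joins at all, at the cost of repeating the customer name on every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A denormalized "fact" table repeats data (customer_name on every
# row) so that analytical reads need no joins.
conn.executescript("""
    CREATE TABLE sales_fact (customer_name TEXT, product TEXT, price REAL);
    INSERT INTO sales_fact VALUES ('alice', 'widget', 9.99);
    INSERT INTO sales_fact VALUES ('alice', 'gadget', 4.50);
""")
for row in conn.execute(
    "SELECT customer_name, SUM(price) FROM sales_fact GROUP BY customer_name"
):
    print(row)
```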


Common data warehouses:

https://www.softwaretestinghelp.com/data-warehouse-tools/

Cloud alternatives:

https://www.alooma.com/blog/data-warehousing-tools


What is Hadoop?

Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.

Hadoop is made up of 4 modules:

  1. Distributed File System: allows data to be stored in an easily accessible format, across a large number of linked storage devices.

  2. MapReduce: the combination of two operations - reading data and putting it into a format suitable for analysis (map), and performing mathematical operations on it (reduce). See the word-count sketch after this list.

  3. Hadoop Common: provides the common Java libraries and utilities needed by the other modules to read the data stored in HDFS (Hadoop Distributed File System).

  4. YARN: YARN manages resources of the systems storing the data and running the analysis.
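
The canonical MapReduce example is a word count. Here is a minimal sketch using the third-party mrjob Python package (a convenience wrapper, not part of Hadoop itself), which can run the same job locally or on a Hadoop cluster:

```python
from mrjob.job import MRJob  # third-party: pip install mrjob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Map: emit a (word, 1) pair for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce: sum up the 1s emitted for each distinct word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run it locally with "python word_count.py input.txt", or against a cluster with "-r hadoop".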


It seems that most of the confusion (at least for me) comes when people refer to Hadoop but mean only a specific part of it, not all 4 modules.


Imagine that we have got all the data into HDFS - now how do we get it out?

Note that I have used HDFS below as an example of a data lake. Theoretically (to me at least), it should be possible to replace HDFS with any other data lake.

MapReduce

HDFS -> MapReduce

Writing raw MapReduce jobs makes it difficult and time-consuming to implement complex business logic.

Read more here: https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163

Pig

HDFS -> MapReduce -> Pig

Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL. Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.
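
As a sketch of what Pig Latin looks like, here is the word count again, written to a script file and submitted from Python via the pig command-line tool (assuming Pig is installed; -x local runs it on the local filesystem without a cluster, and input.txt is an invented example file):

```python
import subprocess
from pathlib import Path

# A Pig Latin word count: each statement is a data transformation
# that Pig compiles down to MapReduce jobs.
script = """
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;
"""
Path("word_count.pig").write_text(script)

# -x local runs Pig locally, without a Hadoop cluster.
subprocess.run(["pig", "-x", "local", "word_count.pig"], check=True)
```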

Read more here: https://hortonworks.com/tutorial/beginners-guide-to-apache-pig/#what-is-pig

Hive

HDFS -> MapReduce -> Hive

Hive is a query engine. It was started by Facebook to give Hadoop developers a more traditional data-warehouse interface for MapReduce programming. Apache Hive is similar to a SQL engine: it has its own metastore, its table data lives on HDFS, and the tables can be queried through an SQL-like query language known as HQL (Hive Query Language). Hive queries are converted to MapReduce programs in the background by the Hive compiler and the jobs are executed in parallel across the Hadoop cluster.
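
A minimal sketch of querying Hive from Python with the third-party PyHive package (the host, port, and table names are invented for illustration):

```python
from pyhive import hive  # third-party: pip install pyhive

# Connect to HiveServer2 (host/port/table names are examples).
cursor = hive.connect(host="hive-server.example.com", port=10000).cursor()

# HQL looks like SQL; behind the scenes the Hive compiler turns this
# into distributed jobs that run across the cluster.
cursor.execute("""
    SELECT user_id, COUNT(*) AS visits
    FROM web_logs
    GROUP BY user_id
""")
for user_id, visits in cursor.fetchall():
    print(user_id, visits)
```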

Read more here: https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163

Hbase

HDFS -> HBase

HBase is a data store, particularly suited to unstructured data. Unlike Hive, operations in HBase run in real time on the database instead of being transformed into MapReduce jobs.
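
A minimal sketch of those real-time reads and writes using the third-party happybase package, which talks to HBase through its Thrift gateway (the host, table, and column names are invented, and the table is assumed to already exist):

```python
import happybase  # third-party: pip install happybase

# Connect via the HBase Thrift gateway (host/table are examples).
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("events")

# Writes and reads are immediate key/value operations, not MapReduce
# jobs: put a row, then fetch it straight back.
table.put(b"user1|2019-05-12", {b"cf:action": b"click", b"cf:target": b"/home"})
row = table.row(b"user1|2019-05-12")
print(row[b"cf:action"])
```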

Read more here: https://www.dezyre.com/article/hive-vs-hbase-different-technologies-that-work-better-together/322

HDFS -> HBase -> Phoenix

Phoenix is an SQL query engine. It transforms SQL queries into native HBase API calls.
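
A minimal sketch using the phoenixdb Python driver, which talks to the Phoenix Query Server (the URL is an example); note Phoenix's UPSERT in place of INSERT:

```python
import phoenixdb  # third-party: pip install phoenixdb

# Connect through the Phoenix Query Server (URL is an example).
conn = phoenixdb.connect("http://phoenix-qs.example.com:8765/", autocommit=True)
cursor = conn.cursor()

# Plain SQL; Phoenix translates it into native HBase scans and gets.
cursor.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
cursor.execute("UPSERT INTO users VALUES (1, 'alice')")
cursor.execute("SELECT * FROM users WHERE id = 1")
print(cursor.fetchone())
```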

Read more here: http://phoenix.apache.org/presentations/HBaseCon2015-16x9.pdf

Presto

HDFS -> Presto (BUT it uses Hive metastore as a catalogue)

Presto is an SQL query engine, best used for quickly exploring the data.

Hive is optimized for query throughput, while Presto is optimized for latency. Presto has a limitation on the maximum amount of memory that each task in a query can store, so if a query requires a large amount of memory, the query simply fails.

A key advantage of Hive over newer SQL-on-Hadoop engines is robustness: other engines like Cloudera's Impala and Presto require careful optimizations when two large tables (100M rows and above) are joined. Hive can join tables with billions of rows with ease, and should a job fail, it retries automatically.


To use Presto, the data structure of the data lake needs to be defined using Hive tables.
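
Putting those two facts together, a sketch (again with the third-party PyHive package, and invented hosts, paths, and names): first describe the raw files as an external Hive table so the metastore knows about them, then query them through Presto's faster engine.

```python
from pyhive import hive, presto  # third-party: pip install pyhive

# 1) Describe the raw files in the lake as a Hive table, so the Hive
#    metastore records the schema and file location.
hive_cursor = hive.connect(host="hive-server.example.com", port=10000).cursor()
hive_cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        user_id STRING,
        action  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///lake/raw/web_logs'
""")

# 2) Presto reads the same table definition from the Hive metastore
#    but executes the query with its own lower-latency engine.
presto_cursor = presto.connect(host="presto.example.com", port=8080).cursor()
presto_cursor.execute("SELECT action, COUNT(*) FROM web_logs GROUP BY action")
print(presto_cursor.fetchall())
```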

Beginning in Hive 3.0, the Metastore is released as a separate package and can be run without the rest of Hive. This is referred to as standalone mode.


Kylin

HDFS -> Hive -> Kylin -> HBase (to store cubes)

Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
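
Kylin answers SQL over a REST API (as well as JDBC/ODBC). A sketch with the Python requests package, using the names from Kylin's bundled sample cube and its well-known default login; treat the endpoint and response shape as assumptions to verify against the Kylin docs:

```python
import requests  # third-party: pip install requests

# Query Kylin over its REST API (host, project, and credentials are
# examples; ADMIN/KYLIN is Kylin's default login).
resp = requests.post(
    "http://kylin.example.com:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
    },
)
resp.raise_for_status()
# The answer comes from precomputed cubes stored in HBase, so even
# aggregations over very large datasets return quickly.
print(resp.json()["results"][:5])
```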


Spark

HDFS -> Spark

Spark is a general-purpose distributed data processing engine. The goal of the Spark project was to keep the benefits of MapReduce’s scalable, distributed, fault-tolerant processing framework, while making it more efficient and easier to use.

Note that Spark can be run in various ways: standalone, as an alternative to MapReduce; on YARN; or even inside MapReduce (SIMR).
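
A minimal PySpark sketch (the HDFS path and column name are invented): the same engine covers both SQL-style analysis and general-purpose distributed code.

```python
from pyspark.sql import SparkSession  # ships with pyspark: pip install pyspark

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Read raw JSON straight out of the lake; Spark infers the schema.
df = spark.read.json("hdfs:///lake/raw/events")

# SQL-style analysis on a DataFrame...
df.groupBy("action").count().show()

# ...and arbitrary distributed code on the underlying RDD.
print(df.rdd.map(lambda row: row["action"]).countByValue())

spark.stop()
```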

HDFS -> Spark -> Shark

Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries. It avoids the high task-launching overhead of Hadoop MapReduce and does not require materializing intermediate data between stages on disk.


Elasticsearch

HDFS -> Elasticsearch

Elasticsearch is a search engine. It is a tool for indexing data - it can integrate with a data lake by indexing the data stored there. Use Elasticsearch as a data catalog to quickly find data.
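
A minimal sketch with the official Python client (7.x-style API; the index and documents are invented), indexing a description of a dataset and then searching for it:

```python
from elasticsearch import Elasticsearch  # third-party: pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a document: Elasticsearch analyzes the text fields so they
# become full-text searchable.
es.index(index="datasets", id=1, body={
    "name": "web_logs",
    "description": "raw clickstream events landed in the lake",
})

# Search the "catalog" to find which dataset to look at.
hits = es.search(index="datasets", body={
    "query": {"match": {"description": "clickstream"}},
})
print(hits["hits"]["hits"])
```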


HDFS -> HBase -> Elasticsearch

There is some overlap between Elasticsearch and HBase as Elasticsearch also functions as a database, which stores search indexes and archives text data. There has been a good deal of discussion lately on reasons to NOT use Elasticsearch as a primary datastore, among them weak consistency guarantees that can result in loss of data.

Connecting HBase to Elasticsearch will enable: 1) Elasticsearch users to store their data in HBase; 2) HBase users to get full-text search on their existing tables via a REST API.
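
The crudest way to wire the two together is a batch job that scans an HBase table and indexes every row into Elasticsearch - a naive sketch (reusing the happybase and elasticsearch packages from above, with invented names), rather than the live replication-based connector the linked article describes:

```python
import happybase
from elasticsearch import Elasticsearch

# Scan the HBase table and index each row into Elasticsearch so the
# rows become full-text searchable (hosts and names are examples).
hbase = happybase.Connection("hbase-thrift.example.com")
es = Elasticsearch(["http://localhost:9200"])

for row_key, data in hbase.table("events").scan():
    es.index(
        index="events",
        id=row_key.decode(),
        body={col.decode(): val.decode() for col, val in data.items()},
    )
```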

Read more here: https://lessc0de.github.io/connecting_hbase_to_elasticsearch.html

Bonus: Logstash

Logstash -> Elasticsearch

Logstash is an open source data collection engine with real-time pipelining capabilities.

Read more here: https://www.elastic.co/products/logstash

Bonus 2: Cassandra

Cassandra is a ‘self-sufficient’ technology for data storage and management, while HBase is not. The latter was intended as a tool for random data input/output for HDFS, which is why all its data is stored there.
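
"Self-sufficient" means Cassandra manages its own storage, with no HDFS underneath; from a client's point of view it looks like this minimal sketch with the DataStax cassandra-driver package (the host and schema are invented):

```python
from cassandra.cluster import Cluster  # third-party: pip install cassandra-driver

# Cassandra nodes manage their own storage; there is no HDFS layer.
session = Cluster(["cassandra.example.com"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.events (id int PRIMARY KEY, action text)")
session.execute("INSERT INTO demo.events (id, action) VALUES (1, 'click')")
print(session.execute("SELECT * FROM demo.events").one())
```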

Read more here: https://www.scnsoft.com/blog/cassandra-vs-hbase