Openlake

Welcome to the Openlake repository! In this repository, we will guide you through the steps to build a Data Lake using open source tools like Spark, Kafka, Trino, Apache Iceberg, and Airflow, deployed on Kubernetes with MinIO as the object store.

What is a Data Lake?

A Data Lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. It enables you to break down data silos and create a single source of truth for all your data, which can then be used for various analytical purposes.

Prerequisites

Before you get started, you will need the following:

  • A Kubernetes cluster: You will need a Kubernetes cluster to deploy the various tools required for building a Data Lake. If you don't have a Kubernetes cluster, you can set one up using tools like kubeadm/kind/minikube or a managed Kubernetes service like Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS)
  • kubectl: The command line tool for communicating with your Kubernetes cluster
  • A MinIO instance: You will need a MinIO instance to use as the object store for your Data Lake (a quick connectivity check is sketched after this list)
  • MinIO Client (mc): You will need mc to run commands to perform actions in MinIO
  • A working knowledge of Kubernetes: You should have a basic understanding of Kubernetes concepts and how to interact with a Kubernetes cluster using kubectl
  • Familiarity with the tools used in this repo: You should have a basic understanding of the tools used in this repo, including Spark, Kafka, Trino, and Apache Iceberg
  • JupyterHub/Notebook (optional): If you are planning to walk through the instructions using notebooks
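
If you prefer to sanity-check the MinIO prerequisite from Python rather than mc, the sketch below uses the MinIO Python SDK. The endpoint, credentials, and bucket name are placeholders for your own deployment.

```python
# Minimal connectivity check for your MinIO instance using the MinIO Python SDK
# (pip install minio). Endpoint, credentials, and bucket name are placeholders.
from minio import Minio

client = Minio(
    "minio.example.com:9000",   # hypothetical endpoint; use your MinIO service address
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=False,               # set True if your MinIO endpoint uses TLS
)

# Create a bucket for the walkthroughs if it does not already exist.
if not client.bucket_exists("openlake"):
    client.make_bucket("openlake")

print([b.name for b in client.list_buckets()])
```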

Table of Contents

Apache Spark

In this section we will cover:

  • Setup Spark on Kubernetes using spark-operator
  • Run Spark jobs with MinIO as object storage
  • Use different types of S3A committers and checkpoints, and see why running Spark jobs on object storage (MinIO) is a much better approach than HDFS
  • Perform CRUD operations on Apache Iceberg tables using Spark
  • Spark Streaming

Setup Spark on K8s

To run Spark jobs on Kubernetes we will use spark-operator. You can follow the complete walkthrough here or use the notebook.
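
Once spark-operator is installed (typically via its Helm chart, as the walkthrough shows), Spark jobs are submitted as SparkApplication custom resources. Below is a minimal sketch that creates one from Python with the official kubernetes client; the namespace, image, application file, service account, and resource sizes are placeholders, so adjust them to your cluster.

```python
# Sketch: submit a SparkApplication custom resource through the Kubernetes API.
# Assumes spark-operator is already installed; all names, images, and paths
# below are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-minio-demo", "namespace": "spark-operator"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "openlake/spark-py:3.3.1",                     # placeholder image
        "mainApplicationFile": "s3a://openlake/app/main.py",    # placeholder path
        "sparkVersion": "3.3.1",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark-operator-spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "1g"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-operator",
    plural="sparkapplications",
    body=spark_app,
)
```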

Run Spark Jobs with MinIO as Object Storage

Reading data from and writing data to MinIO with Spark is straightforward. You can follow the complete walkthrough here or use the notebook.
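
As a quick illustration, the PySpark sketch below points the S3A connector at MinIO and reads/writes a dataset. The endpoint, credentials, and bucket paths are placeholders, and it assumes the hadoop-aws connector and its AWS SDK dependency are on the Spark classpath (the walkthrough covers the exact setup).

```python
# Sketch: read from and write to MinIO over the S3A connector.
# Assumes hadoop-aws is on the Spark classpath; endpoint, credentials,
# and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("minio-demo")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.com:9000")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")   # MinIO uses path-style URLs
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.read.option("header", "true").csv("s3a://openlake/raw/taxi.csv")  # placeholder path
df.write.mode("overwrite").parquet("s3a://openlake/curated/taxi/")           # placeholder path
```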

Maintain Iceberg Table using Spark

Apache Iceberg is an open table format for huge analytic datasets. It supports ACID transactions, scalable metadata handling, and snapshot isolation. You can follow the complete walkthrough using the notebook.
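
To make the CRUD idea concrete, here is a hedged PySpark sketch that registers an Iceberg catalog whose warehouse lives in MinIO and runs a few SQL statements against a table. The catalog name, bucket, and table are placeholders, and it assumes the iceberg-spark-runtime package is on the classpath plus the S3A settings from the previous section.

```python
# Sketch: basic Iceberg CRUD from Spark SQL, with the warehouse stored in MinIO.
# Assumes iceberg-spark-runtime is on the classpath and S3A is configured as above;
# the catalog name "demo" and bucket "openlake" are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://openlake/warehouse/")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, msg STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'hello'), (2, 'world')")
spark.sql("UPDATE demo.db.events SET msg = 'updated' WHERE id = 1")   # needs Iceberg extensions
spark.sql("DELETE FROM demo.db.events WHERE id = 2")
spark.sql("SELECT * FROM demo.db.events").show()
```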

Dremio

Dremio is a general-purpose query engine that enables you to query data from multiple sources, including object stores, relational databases, and data lakes. In this section we will cover:

  • Setup Dremio on Kubernetes using Helm
  • Run Dremio queries against data and Iceberg tables stored in MinIO

Setup Dremio on K8s

To set up Dremio on Kubernetes we will use Helm. You can follow the complete walkthrough using the notebook.

Access MinIO using Dremio

You can access datasets or Iceberg tables stored in MinIO from Dremio by adding it as a new source. You can follow the complete walkthrough using the notebook.
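
Once the MinIO source is added in Dremio (the notebook shows this step), you can also query it programmatically. The sketch below uses Dremio's REST API from Python; treat the endpoint paths, the Authorization header format, and the source/dataset names as assumptions to verify against the REST API docs for your Dremio version.

```python
# Sketch: run a query against a MinIO-backed source through Dremio's REST API.
# Endpoint paths, header format, and the dataset path are assumptions to
# double-check against the Dremio REST API docs for your version.
import time
import requests

DREMIO = "http://dremio-client.dremio.svc:9047"   # placeholder service URL

# Log in and build the auth header Dremio expects.
token = requests.post(
    f"{DREMIO}/apiv2/login",
    json={"userName": "admin", "password": "CHANGE_ME"},
).json()["token"]
headers = {"Authorization": f"_dremio{token}", "Content-Type": "application/json"}

# Submit a SQL job against a source named "minio" (placeholder).
job_id = requests.post(
    f"{DREMIO}/api/v3/sql",
    headers=headers,
    json={"sql": 'SELECT * FROM minio.openlake."taxi.parquet" LIMIT 10'},
).json()["id"]

# Poll until the job finishes, then fetch results.
while requests.get(f"{DREMIO}/api/v3/job/{job_id}", headers=headers).json()["jobState"] not in (
    "COMPLETED", "FAILED", "CANCELED"
):
    time.sleep(1)

print(requests.get(f"{DREMIO}/api/v3/job/{job_id}/results", headers=headers).json())
```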

Apache Kafka

Apache Kafka is a distributed streaming platform. It is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, fast, and runs in production in thousands of companies. In this section we will cover how to set up Kafka on Kubernetes and store Kafka topics in MinIO.

Setup Kafka on K8s

To set up Kafka on Kubernetes we will use Strimzi. You can follow the complete walkthrough using the notebook.

Store Kafka Topics in MinIO

You can store Kafka topics in MinIO using Kafka Connect sink connectors. You can follow the complete walkthrough using the notebook.
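
As an illustration of the sink-connector approach, the sketch below registers an S3-compatible sink with the Kafka Connect REST API from Python and points it at MinIO. The Connect service URL, topic, bucket, and option names are placeholders modeled on the Confluent S3 sink, so check them against the connector actually deployed in the walkthrough.

```python
# Sketch: register an S3-compatible sink connector via the Kafka Connect REST API,
# pointing it at MinIO instead of AWS S3. Connector class and option names follow
# the Confluent S3 sink and are assumptions -- verify against the connector you deploy.
import requests

CONNECT = "http://connect.kafka.svc:8083"   # placeholder Kafka Connect REST endpoint

connector = {
    "name": "minio-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "nyc-taxi",                                  # placeholder topic
        "s3.bucket.name": "openlake",                          # placeholder bucket
        "store.url": "http://minio.example.com:9000",          # MinIO endpoint override
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

resp = requests.post(f"{CONNECT}/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```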

Kafka Schema Registry and Iceberg Table (experimental)

You can use Kafka Schema Registry to manage schemas for Kafka topics, and you can also use those schemas to create Iceberg tables (experimental). You can follow the complete walkthrough using the notebook.
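
For the Schema Registry side, the sketch below registers an Avro schema for a topic's values using the confluent-kafka Python client. The registry URL, subject name, and schema are placeholders, and deriving an Iceberg table from the registered schema is the experimental part covered in the notebook.

```python
# Sketch: register an Avro schema for a topic's values in Kafka Schema Registry.
# The registry URL, subject name, and schema are placeholders.
# Requires: pip install confluent-kafka
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

registry = SchemaRegistryClient({"url": "http://schema-registry.kafka.svc:8081"})

avro_schema = """
{
  "type": "record",
  "name": "Trip",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "fare", "type": "double"}
  ]
}
"""

schema_id = registry.register_schema("nyc-taxi-value", Schema(avro_schema, "AVRO"))
print("registered schema id:", schema_id)
```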

Kafka Spark Structured Streaming

In this section we will cover how to use Spark Structured Streaming to read data from Kafka topics and write to MinIO.

Spark Structured Streaming

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can follow the complete walkthrough using the notebook.
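
As a minimal illustration, the sketch below streams from Spark's built-in rate source into Parquet files on MinIO, with the checkpoint also stored in MinIO. Bucket paths are placeholders and the S3A settings from the earlier Spark section are assumed.

```python
# Sketch: a simple structured streaming query writing to MinIO, with its
# checkpoint also stored in MinIO. Paths are placeholders; S3A must already
# be configured to point at your MinIO endpoint (see the Spark section above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream.format("parquet")
    .option("path", "s3a://openlake/streaming/rate/")             # placeholder output path
    .option("checkpointLocation", "s3a://openlake/checkpoints/rate/")
    .trigger(processingTime="30 seconds")
    .start()
)

query.awaitTermination()
```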

End-to-End Spark Structured Streaming for Kafka

You can use Spark Structured Streaming to read data from Kafka topics and write to MinIO. You can follow the complete walkthrough using the notebook.
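
The end-to-end flow looks roughly like the sketch below: read a Kafka topic as a stream and continuously land it in MinIO as Parquet. The topic, bootstrap servers, and bucket paths are placeholders, and the spark-sql-kafka package plus the S3A configuration are assumed to be in place.

```python
# Sketch: Kafka topic -> MinIO with Spark Structured Streaming.
# Assumes the spark-sql-kafka-0-10 package is on the classpath and S3A is
# configured for MinIO; topic, servers, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-minio").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "my-cluster-kafka-bootstrap.kafka.svc:9092")
    .option("subscribe", "nyc-taxi")
    .option("startingOffsets", "earliest")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://openlake/streaming/nyc-taxi/")
    .option("checkpointLocation", "s3a://openlake/checkpoints/nyc-taxi/")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```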

Join Community

Openlake is a MinIO project. You can contact the authors on the MinIO Slack channel.

License

Openlake is released under the GNU AGPLv3 license. Please refer to the LICENSE document for a complete copy of the license.