The Delta Spark Handbook

This repo has all the resources you need to become an amazing data engineer focused on Delta Spark!

Architecture

Data Pipeline Architecture

This data pipeline architecture implements a modern data processing and storage solution using a stack of open-source technologies. The system is containerized using Docker and Docker Compose for easy deployment and scalability.

Components Overview

Input Layer

  • DBeaver: A universal database tool for handling database connections and queries
  • Pandas: Python data analysis library for data manipulation and preprocessing
  • Beeline: Command-line interface for connecting to Hive-compatible endpoints and executing queries (a connection sketch follows this list)
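
All three clients ultimately reach the stack through Kyuubi's HiveServer2-compatible Thrift endpoint. Here is a minimal sketch, assuming Kyuubi's default port 10009 on localhost and the PyHive client; the host, port, and username are placeholder assumptions:

```python
# Minimal sketch: querying the stack through Kyuubi's HiveServer2-
# compatible Thrift endpoint. Host, port (Kyuubi's default 10009),
# and username are placeholder assumptions.
import pandas as pd
from pyhive import hive  # pip install "pyhive[hive]"

conn = hive.connect(host="localhost", port=10009, username="anonymous")

cursor = conn.cursor()
cursor.execute("SHOW TABLES")  # routed by Kyuubi to a Spark SQL engine
print(cursor.fetchall())

# The same connection can also feed Pandas for lightweight analysis.
df = pd.read_sql("SELECT 1 AS ok", conn)
print(df)

conn.close()
```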

Processing Layer

  • Kyuubi SQL Gateway: Serves as the SQL gateway, providing multi-tenant Spark SQL engines behind a single endpoint
  • Apache Spark: Core processing engine for large-scale data processing and analytics
  • Metastore & Metastore DB:
    • Hive Metastore service for managing metadata
    • Stores metadata about Spark SQL tables, partitions, and schemas (a session wiring sketch follows this list)
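
As an illustration, here is a sketch of a Spark session wired for Delta Lake with an external Hive Metastore; the `thrift://metastore:9083` URI and the app name are assumptions about this stack's service names:

```python
# Sketch: a Spark session configured for Delta Lake plus an external
# Hive Metastore. The metastore URI is an assumed container hostname.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-spark-handbook")
    # Enable Delta Lake's SQL extensions and catalog
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point Spark at the Hive Metastore service for table metadata
    .config("spark.hadoop.hive.metastore.uris", "thrift://metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()
```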

Storage Layer

  • Delta Lake: Storage layer that brings ACID transactions and versioning to data lakes
  • Hadoop AWS (hadoop-aws): Provides the S3A connector for reading and writing S3-compatible object storage
  • MinIO: Object storage system compatible with the Amazon S3 API (an S3A configuration sketch follows this list)
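
A sketch of the S3A settings that point Spark at a local MinIO instance; the endpoint, credentials, and hostname are placeholder assumptions, and these options compose with the Delta/metastore settings above:

```python
# Sketch: S3A (hadoop-aws) settings for a local MinIO endpoint.
# Endpoint, credentials, and hostname are placeholder assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("minio-storage")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    # MinIO needs path-style access instead of virtual-hosted buckets
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)
```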

Data Flow

  1. Data enters the system through various input sources (DBeaver, Pandas, or Beeline)
  2. Kyuubi SQL Gateway manages the routing of queries and data to Spark engines
  3. Apache Spark processes the data, interacting with the Metastore for metadata management
  4. Processed data is stored in Delta Lake format
  5. Final data is persisted in the storage layer through the Hadoop AWS (S3A) connector to MinIO (an end-to-end sketch follows)
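
The five steps above can be exercised end to end from Python. A sketch, assuming a session configured as in the earlier snippets; the `s3a://lake/events` bucket and path are placeholders:

```python
# End-to-end sketch: a Pandas frame enters Spark, is written as
# Delta, and lands in MinIO over S3A. Assumes a session configured
# as in the earlier sketches; the bucket/path is a placeholder.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df = spark.createDataFrame(pdf)

# Spark writes the Parquet data files plus the Delta transaction log.
df.write.format("delta").mode("overwrite").save("s3a://lake/events")

# Read the table back through the same path.
spark.read.format("delta").load("s3a://lake/events").show()
```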

Containerization

The entire architecture is containerized using Docker, with Docker Compose orchestrating the multi-container application. This ensures:

  • Consistent development and production environments
  • Easy deployment and scaling
  • Isolated services with defined networking
  • Simplified dependency management

Key Features

  • Multi-tenant Support: Through Kyuubi SQL Gateway
  • ACID Compliance: Provided by Delta Lake
  • Metadata Management: Via Hive Metastore
  • Scalable Processing: Using Apache Spark
  • S3-Compatible Storage: Through MinIO
  • Multiple Access Methods: Supporting various data tools and interfaces

Use Cases

This architecture is suitable for:

  • Data warehousing
  • Real-time analytics
  • Large-scale ETL operations
  • Data lake implementations
  • Business intelligence applications

Architecture Diagram

```mermaid
flowchart LR
    dbeaver[DBeaver]
    pandas[Pandas]
    beeline[Beeline]
    kyuubi[Kyuubi SQL Gateway]
    spark[Apache Spark]
    metastore[Metastore]
    metastoredb[Metastore DB]
    deltalake[Delta Lake]
    hadoop[Hadoop AWS]
    storage[Storage/MinIO]

    subgraph input[Input Sources]
        dbeaver
        pandas
        beeline
    end

    subgraph processing[Processing Layer]
        kyuubi
        spark
        metastore
        metastoredb
    end

    subgraph storagelayer[Storage Layer]
        deltalake
        hadoop
        storage
    end

    dbeaver --> kyuubi
    pandas --> kyuubi
    beeline --> kyuubi
    kyuubi --> spark
    spark --> metastore
    spark --> deltalake
    metastore --> metastoredb
    deltalake --> hadoop
    hadoop --> storage
```

Comparison of Spark Storage Formats

Storage Format Comparison

| Operation | Parquet | ORC | Delta Lake | Apache Iceberg |
|---|---|---|---|---|
| Basic Create | `CREATE TABLE t1 (id INT) USING PARQUET` | `CREATE TABLE t1 (id INT) USING ORC` | `CREATE TABLE t1 (id INT) USING DELTA` | `CREATE TABLE t1 (id INT) USING ICEBERG` |
| Partitioned Table | `CREATE TABLE t1 (id INT, dt DATE) USING PARQUET PARTITIONED BY (dt)` | `CREATE TABLE t1 (id INT, dt DATE) USING ORC PARTITIONED BY (dt)` | `CREATE TABLE t1 (id INT, dt DATE) USING DELTA PARTITIONED BY (dt)` | `CREATE TABLE t1 (id INT, dt DATE) USING ICEBERG PARTITIONED BY (dt)` |
| Insert Data | `INSERT INTO t1 VALUES (1)` | `INSERT INTO t1 VALUES (1)` | `INSERT INTO t1 VALUES (1)` | `INSERT INTO t1 VALUES (1)` |
| Merge/Upsert | Not supported | Not supported | `MERGE INTO t1 USING t2 ON t1.id = t2.id WHEN MATCHED THEN UPDATE ...` | `MERGE INTO t1 USING t2 ON t1.id = t2.id WHEN MATCHED THEN UPDATE ...` |
| Time Travel | Not supported | Not supported | `SELECT * FROM t1 VERSION AS OF 1` | `SELECT * FROM t1 FOR VERSION AS OF 1234567` |
| Schema Evolution | Manual `ALTER TABLE` | Manual `ALTER TABLE` | `ALTER TABLE t1 ADD COLUMN name STRING` | `ALTER TABLE t1 ADD COLUMN name STRING` |
| Streaming Write | `writeStream.format("parquet")` | `writeStream.format("orc")` | `writeStream.format("delta")` | `writeStream.format("iceberg")` |
| Delete Records | Not supported | Not supported | `DELETE FROM t1 WHERE id = 1` | `DELETE FROM t1 WHERE id = 1` |
| Update Records | Not supported | Not supported | `UPDATE t1 SET col = 'value' WHERE id = 1` | `UPDATE t1 SET col = 'value' WHERE id = 1` |
| Vacuum/Clean | Manual file deletion | Manual file deletion | `VACUUM t1 RETAIN 168 HOURS` | `CALL catalog.system.expire_snapshots('db.t1')` |
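
Delta's Merge/Upsert row can also be driven through the delta-spark Python API rather than raw SQL. A sketch, assuming a Delta-enabled session and existing Delta tables `t1` (target) and `t2` (updates), both with an `id` column:

```python
# Sketch of the Merge/Upsert row via the delta-spark Python API.
# Assumes Delta tables `t1` and `t2` already exist.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forName(spark, "t1")
updates = spark.table("t2")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()      # overwrite matching rows
    .whenNotMatchedInsertAll()   # insert new rows
    .execute()
)
```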

Key Features Comparison

| Feature | Parquet | ORC | Delta Lake | Apache Iceberg |
|---|---|---|---|---|
| ACID Transactions | ✗ | ✗ | ✓ | ✓ |
| Schema Evolution | Limited | Limited | ✓ | ✓ |
| Time Travel | ✗ | ✗ | ✓ | ✓ |
| CDC Support | ✗ | ✗ | ✓ | ✓ |
| Partition Evolution | ✗ | ✗ | ✗ | ✓ |
| Row-level Operations | ✗ | ✗ | ✓ | ✓ |
| Streaming Support | Limited | Limited | ✓ | ✓ |
| Cloud Storage Support | ✓ | ✓ | ✓ | ✓ |
| Metadata Management | File-based | File-based | Transaction log | Table format |
| Unity Catalog Support | ✗ | ✗ | ✓ | Limited |
| Delta Sharing Support | ✗ | ✗ | ✓ | Limited |
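
Delta's CDC Support row refers to the Change Data Feed. A sketch of reading it from Python, assuming `t1` is a Delta table created with `'delta.enableChangeDataFeed' = 'true'` (see the properties section below) and a Delta-enabled session:

```python
# Sketch: reading Delta's Change Data Feed for table `t1`.
# Assumes CDF was enabled via table properties before the changes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("t1")
)

# Each row carries _change_type (insert/update/delete) and
# _commit_version metadata columns alongside the data columns.
changes.show()
```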

Storage Properties

| Property | Parquet | ORC | Delta Lake | Apache Iceberg |
|---|---|---|---|---|
| File Format | Columnar | Columnar | Parquet-based | Parquet-based |
| Compression | Snappy/GZIP | ZLIB/Snappy | Inherits Parquet | Inherits Parquet |
| Statistics | Basic | Advanced | Advanced | Advanced |
| Metadata Storage | File level | File level | Transaction log | Metadata tables |

Example Use Cases

| Format | Best Used For |
|---|---|
| Parquet | Static analytics data; simple data lake storage; when only columnar storage is needed |
| ORC | Hive workloads; Hadoop ecosystem; legacy systems integration |
| Delta Lake | Data lakehouse architecture; real-time + batch processing; when ACID is required; Databricks ecosystem |
| Iceberg | Cloud-native data lakes; multi-table transactions; complex schema evolution; vendor-neutral solutions |

Common Configuration Properties

```sql
-- Delta Lake properties
CREATE TABLE delta_table (id INT)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);

-- Iceberg properties
CREATE TABLE iceberg_table (id INT)
USING ICEBERG
TBLPROPERTIES (
  'format-version' = '2',
  'write.merge.mode' = 'merge-on-read'
);

-- Parquet properties
CREATE TABLE parquet_table (id INT)
USING PARQUET
TBLPROPERTIES (
  'parquet.compression' = 'SNAPPY'
);

-- ORC properties
CREATE TABLE orc_table (id INT)
USING ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY'
);
```
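
With `delta_table` defined as above, time travel and vacuum can be exercised from Python. A sketch, assuming a Delta-enabled session, that version 0 exists, and using the 168-hour retention from the comparison table:

```python
# Sketch: time travel and vacuum against delta_table from the SQL
# above, via the delta-spark Python API.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as of an earlier version (version 0 assumed).
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .table("delta_table")
)
v0.show()

# Delete files no longer referenced by any version inside retention.
DeltaTable.forName(spark, "delta_table").vacuum(retentionHours=168)
```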

Resources

Data Engineering Communities

Sample Delta Lake Configuration Projects