Hudi
Hudi - Hadoop Upserts, Deletes, and Incrementals
Apache Hudi ingests & manages storage of large analytical datasets over DFS (HDFS or cloud stores).
Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low-latency, minute-level analytics.
Hudi Features
- Mutability support for all data lake workloads - Quickly update & delete data with Hudi’s fast, pluggable indexing (see the upsert/delete sketch after this list). This includes streaming workloads, with full support for out-of-order data, bursty traffic & data deduplication.
- Improved efficiency by incrementally processing new data - Replace old-school batch pipelines with incremental streaming on your data lake (incremental-query sketch below). Experience faster ingestion and lower processing times for analytical workloads.
- ACID transactional guarantees - Bring transactional guarantees to your data lake, with consistent, atomic writes and concurrency controls tailored for longer-running lake transactions.
- Unlock historical data with time travel - Query historical data with the ability to roll back to a table version, debug data versions to understand what changed over time, and audit data changes by viewing the commit history (time-travel sketch below).
- Interoperable multi-cloud ecosystem support - Extensive ecosystem support with plug-and-play options for popular data sources & query engines. Build future-proof architectures interoperable with your vendor of choice.
- Comprehensive table services for high-performance analytics - Fully automated table services that continuously schedule & orchestrate clustering, compaction, cleaning, file sizing & indexing to ensure tables are always ready.
- A rich platform to build your lakehouse faster - Effortlessly build your lakehouse with built-in tools for auto ingestion from services like Debezium and Kafka and auto catalog sync for easy discoverability & more.
- Query acceleration through multi-modal indexes - Experience faster write transactions on huge/wide tables & faster query performance with a first-of-its-kind multi-modal indexing subsystem.
- Resilient pipelines with schema evolution & enforcement - Evolve a Hudi table's schema as your data changes over time, and keep pipelines resilient through enforcement that fails fast on incompatible writes instead of corrupting data.
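As a concrete illustration of the mutability support above: a minimal PySpark sketch of insert, upsert, and delete against a Hudi table. The table name, base path, schema, and bundle version are illustrative assumptions; the hoodie.* keys are the standard Spark datasource write options.

```python
from pyspark.sql import SparkSession

# Spark session with the Hudi bundle on the classpath.
# Assumption: Spark 3.4 / Hudi 0.14 -- match the bundle to your versions.
spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

base_path = "file:///tmp/trips"  # assumed base path
hudi_options = {
    "hoodie.table.name": "trips",                           # assumed table name
    "hoodie.datasource.write.recordkey.field": "uuid",      # record key
    "hoodie.datasource.write.partitionpath.field": "city",  # partition column
    "hoodie.datasource.write.precombine.field": "ts",       # newest ts wins on key collisions
}

# Initial insert creates the table.
df = spark.createDataFrame(
    [("id1", "sf", 1695159649, 33.9), ("id2", "nyc", 1695159650, 27.7)],
    ["uuid", "city", "ts", "fare"],
)
(df.write.format("hudi").options(**hudi_options)
   .option("hoodie.datasource.write.operation", "insert")
   .mode("overwrite").save(base_path))

# Upsert: same record key, newer precombine value replaces the old row.
updates = spark.createDataFrame(
    [("id1", "sf", 1695159700, 41.5)], ["uuid", "city", "ts", "fare"])
(updates.write.format("hudi").options(**hudi_options)
   .option("hoodie.datasource.write.operation", "upsert")
   .mode("append").save(base_path))

# Delete: pass the keys of the rows to remove with operation=delete.
deletes = spark.createDataFrame(
    [("id2", "nyc", 1695159800, 0.0)], ["uuid", "city", "ts", "fare"])
(deletes.write.format("hudi").options(**hudi_options)
   .option("hoodie.datasource.write.operation", "delete")
   .mode("append").save(base_path))
```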
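Continuing the same sketch, an incremental read pulls only records committed after a chosen instant on the table's timeline, which is what replaces full-table rescans in batch pipelines:

```python
# Discover commit instants via the _hoodie_commit_time metadata column.
commits = (spark.read.format("hudi").load(base_path)
           .select("_hoodie_commit_time").distinct()
           .orderBy("_hoodie_commit_time").collect())
begin_time = commits[0][0]  # pull everything committed after the first instant

incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load(base_path))
incremental_df.show()
```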
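Time travel uses the same timeline: point a read at a past instant (available since Hudi 0.9). Here the instant is reused from the incremental sketch above; any commit timestamp from the timeline works.

```python
# Read the table as it existed at a past commit instant.
historical_df = (spark.read.format("hudi")
    .option("as.of.instant", begin_time)  # also accepts e.g. "2024-01-01 12:30:45"
    .load(base_path))
historical_df.show()
```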
Original Motivation
- Batch ingestion is too slow
- Rewrite entire table/partition several times a day
- ETLs built on raw data have no way to recompute only what changed
- Late arriving data is a nightmare
Architecture
Storage Type
Copy On Write (COW)
Updates rewrite the affected columnar base files at write time, so reads never have to merge anything.
Queries: Snapshot, Incremental
Merge On Read (MOR)
Updates are appended to row-based log files and merged with base files at query time or by background compaction.
Queries: Snapshot, Incremental, Read Optimized
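The storage type is fixed when a table is created, via the hoodie.datasource.write.table.type option; COPY_ON_WRITE is the default. A sketch, reusing the DataFrame from the upsert example earlier:

```python
mor_options = {
    "hoodie.table.name": "trips_mor",                       # assumed table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # or "COPY_ON_WRITE" (default)
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "ts",
}
(df.write.format("hudi").options(**mor_options)
   .mode("overwrite").save("file:///tmp/trips_mor"))
```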
Choosing Between COW and MOR
The choice between COW and MOR in Apache Hudi depends on your workload's read/write mix, freshness requirements, and query patterns; a query-type sketch follows the list below.
- Read vs. Write Frequency: If your workload is read-heavy, COW may be the better choice due to its optimized read performance. Conversely, for write-heavy applications where data is ingested frequently, MOR can handle the load more efficiently.
- Data Freshness vs. Read Cost: Both table types provide atomic, ACID commits, so consistency is not the differentiator. The trade-off is on reads: MOR's read-optimized queries see only compacted data and can lag the latest writes, while its snapshot queries merge log files for fresh results at a higher cost. COW always serves fully merged base files.
- Use Case: For analytical workloads and batch processing, COW shines. For real-time data processing and streaming applications, MOR is often the way to go.
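On a MOR table, the freshness/cost trade-off above surfaces in the query type chosen at read time; a sketch against the MOR table created in the Storage Type section:

```python
# snapshot: merges base files with pending log files -> freshest data, higher read cost.
snapshot_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("file:///tmp/trips_mor"))

# read_optimized: reads only compacted base files -> cheaper reads that may lag recent writes.
ro_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load("file:///tmp/trips_mor"))
```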
Links
- https://hudi.apache.org
- Apache Hudi: A Deep Dive with Python Code Examples
- Apache Hudi vs. Delta Lake: Choosing the Right Tool for Your Data Lake on AWS (Siladitya Ghosh, Medium)
- Exploring Time Travel Queries in Apache Hudi (DevOps Done Right)
- Understanding COW and MOR in Apache Hudi: Choosing the Right Storage Strategy (DevOps Done Right)