BigLake
BigLake is a storage engine that provides a unified interface for analytics and AI engines to query multiformat, multicloud, and multimodal data in a secure, governed, and performant manner. It lets you build a single-copy AI lakehouse designed to reduce the need to manage custom data infrastructure.
BigLake tables let you query structured data in external data stores with access delegation. Access delegation decouples access to the BigLake table from access to the underlying data store. An external connection associated with a service account is used to connect to the data store. Because the service account handles retrieving data from the data store, you only have to grant users access to the BigLake table. This lets you enforce fine-grained security at the table level, including row-level and column-level security. For BigLake tables based on Cloud Storage, you can also use dynamic data masking.
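As a concrete sketch of access delegation, the DDL below defines a BigLake table over Parquet files in Cloud Storage, using a pre-created external connection whose service account reads the bucket. The connection name, dataset, table, and bucket path are all hypothetical placeholders:

```sql
-- Assumes a Cloud resource connection `us.my-connection` already exists and its
-- service account has been granted read access to the bucket (names hypothetical).
CREATE EXTERNAL TABLE mydataset.biglake_sales
WITH CONNECTION `us.my-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-example-bucket/sales/*.parquet']
);
```

Users querying `mydataset.biglake_sales` need access only to the table, not to the underlying bucket; the connection's service account retrieves the data on their behalf.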
Supported data stores
You can use BigLake tables with the following data stores:
- Amazon S3 by using BigQuery Omni
- Azure Blob Storage by using BigQuery Omni
- Cloud Storage
Comparison
| Item | BigQuery Native Table | External Table | BigLake Table | BigLake Iceberg Tables via BigLake Metastore | BigLake Managed Tables |
|---|---|---|---|---|---|
| Storage Format | Capacitor | CSV, ORC, Parquet, etc. | CSV, Iceberg, Parquet, etc. | Iceberg | Iceberg |
| Storage Location | Google internal | Customer GCS | Customer GCS | Customer GCS | Customer GCS |
| Read/Write | CRUD | Read only | Read only from BQ; updates via Spark (manual BQ metadata updates) | Read only from BQ; updates via Spark | CRUD |
| RLS / CLS / Data Masking | Yes | No | Yes | Yes | Yes |
| Fully Managed | Yes (recluster, optimize, etc.) | No | No | No | Yes (recluster, optimize, etc.) |
| Partitioning | Partitioning/Clustering | Partitioning | Partitioning | Partitioning | Clustering |
| Streaming (native) | Yes | No | No | No | Yes |
| Time Travel | Yes | No | Manual | No | Yes |
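The rightmost column, BigLake Managed Tables, corresponds to Iceberg tables that BigQuery itself manages (full CRUD, clustering, streaming, time travel). A minimal creation sketch, assuming the same kind of pre-existing connection and bucket as above (all names hypothetical):

```sql
-- Sketch of a BigLake managed (Iceberg) table; connection, dataset,
-- and storage_uri are hypothetical placeholders.
CREATE TABLE mydataset.managed_events (
  event_id STRING,
  event_ts TIMESTAMP,
  payload  JSON
)
WITH CONNECTION `us.my-connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-example-bucket/managed_events'
);
```

Unlike the read-only external variants, such a table accepts `INSERT`, `UPDATE`, and `DELETE` directly from BigQuery while the data stays in customer-owned GCS in Iceberg format.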
BigLake is a storage engine that unifies data stored in GCS (or other object stores) with BigQuery. It gives users a uniform BQ experience whether their data lives in native BQ storage or in an object store.
For example, if you want to keep all of your data in an open-source format like Parquet or Iceberg rather than ingesting it into BQ, you can define a BigLake table instead, and still apply fine-grained access control (e.g., row- and column-level security) on top, including in other public clouds. As with BQ native tables, you can also run BQML models on BigLake tables, or access them from other analytics engines like Spark or Presto.
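The row-level security mentioned above uses the same DDL as on native tables. A minimal sketch of a row access policy on a BigLake table, with all names (table, group, column) hypothetical:

```sql
-- Only members of the analysts group see rows where region = 'US'
-- (policy name, table, principal, and filter column are hypothetical).
CREATE ROW ACCESS POLICY us_only
ON mydataset.biglake_sales
GRANT TO ('group:analysts@example.com')
FILTER USING (region = 'US');
```

Because access delegation already hides the underlying bucket, the policy is enforced for every engine that reads the table through BigQuery's APIs.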
In a multicloud setup, BigLake is the storage component in other public clouds (e.g., data in S3), and BigQuery Omni is the compute component that runs in the other cloud (on a fleet of EC2 machines). Today you can see BQ native tables, GCS-backed BigLake tables, and S3- or Azure Blob-backed BigLake tables side by side in the familiar BQ console.
Unfortunately, multicloud tables cannot yet be joined, much as BQ native tables cannot be joined across regions; this is reportedly on the BQ team's roadmap.
Adapted from: What's the point of BigLake? : r/bigquery
Links
- Introduction to BigLake external tables | BigQuery | Google Cloud
- GCP BigLake introduction. BigLake is the name given by Google to… | by Neil Kolban | Google Cloud - Community | Medium
- Data Analytics Deep Dives - BigLake Managed Tables - YouTube
- BigLake: Build an Apache Iceberg lakehouse | Google Cloud
- BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse