
BigLake

BigLake is a storage engine that provides a unified interface for analytics and AI engines to query multiformat, multicloud, and multimodal data in a secure, governed, and performant manner. It lets you build a single-copy AI lakehouse that reduces the need to build and manage custom data infrastructure.

BigLake tables let you query structured data in external data stores with access delegation. Access delegation decouples access to the BigLake table from access to the underlying data store. An external connection associated with a service account is used to connect to the data store. Because the service account handles retrieving data from the data store, you only have to grant users access to the BigLake table. This lets you enforce fine-grained security at the table level, including row-level and column-level security. For BigLake tables based on Cloud Storage, you can also use dynamic data masking.
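The access-delegation and row-level-security points above can be sketched as DDL. This is a minimal, hypothetical example: the project, dataset, connection, bucket, policy, and user names are placeholders, not real resources, and the helper functions only assemble the SQL strings (which you would then run with the `bq` CLI or a BigQuery client).

```python
# Sketch: define a Cloud Storage-backed BigLake table with access delegation,
# then restrict it with a row-level security policy. All resource names below
# are placeholders for illustration only.

def biglake_table_ddl(table, connection, uris, fmt="PARQUET"):
    """Build a CREATE EXTERNAL TABLE ... WITH CONNECTION statement.

    The WITH CONNECTION clause is what enables access delegation: BigQuery
    reads the underlying files with the connection's service account, so end
    users only need access to the table itself, not the bucket.
    """
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{fmt}', uris = [{uri_list}]);"
    )

def row_access_policy_ddl(policy, table, grantee, predicate):
    """Build a CREATE ROW ACCESS POLICY statement for table-level RLS:
    the grantee only sees rows matching the filter predicate."""
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f"GRANT TO ('{grantee}')\n"
        f"FILTER USING ({predicate});"
    )

# Placeholder table, connection, and bucket:
print(biglake_table_ddl(
    "my-project.sales.orders",
    "my-project.us.gcs-conn",
    ["gs://my-bucket/orders/*.parquet"],
))
print(row_access_policy_ddl(
    "apac_only",
    "my-project.sales.orders",
    "user:analyst@example.com",
    "region = 'APAC'",
))
```

Because the table's security sits in BigQuery, the same row filter applies regardless of which engine queries the table through the BigQuery API.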

Supported data stores

You can use BigLake tables with the following data stores:

- Cloud Storage
- Amazon S3
- Azure Blob Storage

Comparison

| Item | Native Table (BigQuery) | External Table (BigQuery) | BigLake Table | BigLake Iceberg Table via BigLake Metastore | BigLake Managed Table |
| --- | --- | --- | --- | --- | --- |
| Storage format | Capacitor | CSV, ORC, Parquet, etc. | CSV, Iceberg, Parquet, etc. | Iceberg | Iceberg |
| Storage location | Google internal | Customer GCS | Customer GCS | Customer GCS | Customer GCS |
| Read/write | CRUD | Read only | Read only from BQ; updates via Spark (manual BQ metadata updates) | Read only from BQ; updates via Spark | CRUD |
| RLS / CLS / data masking | Yes | No | Yes | Yes | Yes |
| Fully managed | Yes (recluster, optimize, etc.) | No | No | No | Yes (recluster, optimize, etc.) |
| Partitioning | Partitioning/clustering | Partitioning | Partitioning | Partitioning | Clustering |
| Streaming (native) | Yes | No | No | No | Yes |
| Time travel | Yes | No | Manual | No | Yes |
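For the time-travel row above (available on native and BigLake managed tables, manual for plain BigLake tables), BigQuery expresses it with a `FOR SYSTEM_TIME AS OF` clause. A minimal sketch, using a placeholder table name and only assembling the query string:

```python
# Sketch: read a table as it existed some hours ago using BigQuery's
# FOR SYSTEM_TIME AS OF time-travel clause. Table name is a placeholder.

def time_travel_query(table, hours_ago):
    """Build a query against a historical snapshot of `table`."""
    return (
        f"SELECT * FROM `{table}`\n"
        f"FOR SYSTEM_TIME AS OF "
        f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours_ago} HOUR);"
    )

print(time_travel_query("my-project.sales.orders", 6))
```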

BigLake is a storage engine that unifies data stored in GCS (or other object stores) with BigQuery. It gives users a uniform BigQuery experience whether their data lives in native BigQuery storage or in an object store.

For example, if you want to keep all of your data in an open-source format like Parquet or Iceberg rather than ingesting it into BigQuery, you can define a BigLake table instead, and still layer fine-grained access control (e.g., row- and column-level security) on top, including in other public clouds. As with BigQuery native tables, you can also build BQML models on BigLake tables, or access BigLake tables from other analytics engines such as Spark or Presto.
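Training a BQML model on a BigLake table uses the same `CREATE MODEL` DDL as on a native table, since the model reads through the table rather than the files. A minimal sketch, with hypothetical model, table, and column names, that only builds the statement:

```python
# Sketch: BigQuery ML CREATE MODEL over a (BigLake or native) table.
# Model, table, and column names are placeholders for illustration.

def create_model_ddl(model, source_table, label_col, feature_cols,
                     model_type="LOGISTIC_REG"):
    """Build a CREATE MODEL statement that trains directly on `source_table`,
    so data in open formats never needs a separate ingestion step."""
    cols = ", ".join(feature_cols + [f"{label_col} AS label"])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}') AS\n"
        f"SELECT {cols}\n"
        f"FROM `{source_table}`;"
    )

print(create_model_ddl(
    "my-project.ml.churn_model",      # placeholder model
    "my-project.sales.orders",        # placeholder BigLake table
    "churned",                        # placeholder label column
    ["amount", "region"],             # placeholder feature columns
))
```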

To your question: BigLake is the storage component in other public clouds (e.g., data in S3), and BigQuery Omni is the compute component that runs on the other cloud (sitting on a fleet of EC2 machines). Right now, you can see BigQuery native tables, GCS-backed BigLake tables, and S3- or Azure Blob-backed BigLake tables all in the familiar BigQuery console.

Unfortunately, multicloud tables cannot be joined yet, much like how you can't join BigQuery native tables across regions, but I believe that's on the BigQuery team's roadmap.
