Architecture
System design and data flow for Datalake
Datalake is a Python 3.13 service that consumes Spanner change stream events from Pub/Sub and writes changelog rows into BigQuery.
System context
Spanner Change Streams -> Dataflow -> Pub/Sub -> Datalake (subscriber)
                                                   |-> BigQuery (changelog tables)
                                                   |-> Pub/Sub topics (per table)
                                                   |-> Sentry
Processing flow
- The service subscribes to the configured Pub/Sub subscription at startup.
- Each message is parsed into a ChangeStreamEvent.
- The processor computes record and data hash keys and filters no-op updates.
- Records are written to BigQuery tables with a _changelog suffix.
- A per-table Pub/Sub topic message is published for downstream pipelines.
- Messages are acked or nacked based on processing success.
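The flow above can be sketched in plain Python. This is a simplified, hypothetical model, not the actual implementation: the ChangeStreamEvent fields, the hashing scheme, and the in-memory seen_data_hashes map are all assumptions; the real BigQuery write and Pub/Sub publish are elided.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical, simplified stand-in for the service's ChangeStreamEvent model.
@dataclass
class ChangeStreamEvent:
    table_name: str
    keys: dict        # primary-key columns
    new_values: dict  # column values after the change

def record_hash_key(event: ChangeStreamEvent) -> str:
    """Stable hash identifying the entity (built from its primary key)."""
    payload = json.dumps(event.keys, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def data_hash_key(event: ChangeStreamEvent) -> str:
    """Stable hash of the payload, used to detect no-op updates."""
    payload = json.dumps(event.new_values, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def process(event: ChangeStreamEvent, seen_data_hashes: dict) -> bool:
    """Return True if the event was written, False if filtered as a no-op."""
    rec_key, data_key = record_hash_key(event), data_hash_key(event)
    if seen_data_hashes.get(rec_key) == data_key:
        return False  # same payload as last time for this entity: skip
    seen_data_hashes[rec_key] = data_key
    # ... write the row to the <table_name>_changelog table in BigQuery,
    # then publish to the per-table Pub/Sub topic (omitted here) ...
    return True
```

On success the message would be acked; an exception anywhere in `process` would lead to a nack so Pub/Sub redelivers.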
Components
- services/datalake/main.py: app setup, Pub/Sub subscription, lifecycle.
- services/datalake/core/change_stream_processor.py: event processing pipeline.
- services/datalake/core/table_config.py: source system and hash exclusions.
- services/datalake/utils/framework_utils.py: hashing and change detection.
- services/datalake/utils/pub_sub_utils.py: topic creation helpers.
- services/datalake/models/*: change stream event schemas.
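The hash exclusions mentioned for table_config.py might look like the following sketch. The TableConfig shape and field names here are illustrative assumptions, not the actual table_config.py API; the point is that excluded columns (e.g. bookkeeping timestamps) are dropped before hashing so they cannot defeat no-op detection.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableConfig:
    # Hypothetical config entry shape; field names are assumptions.
    source_system: str
    hash_excluded_columns: frozenset = frozenset()

def hashable_values(row: dict, config: TableConfig) -> dict:
    """Drop columns that should not affect the data hash, so updates that
    only touch excluded columns are detected as no-ops."""
    return {k: v for k, v in row.items()
            if k not in config.hash_excluded_columns}
```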
Reliability and scaling
- Ack deadlines are extended in a background thread for large messages.
- Flow control limits are configurable via MAX_MESSAGES and MAX_BYTES.
- BigQuery inserts are retried up to three times per batch.
- Pub/Sub ordering uses record_hash_key to keep entity updates ordered.
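Two of these behaviors can be sketched without the Google Cloud clients. The default limit values and the helper names below are assumptions for illustration; the real service wires MAX_MESSAGES / MAX_BYTES into the Pub/Sub subscriber's flow-control settings, and the retry wraps the actual BigQuery batch insert.

```python
import os

def flow_control_limits(env=os.environ) -> tuple[int, int]:
    """Read subscriber flow-control limits from the environment.
    The fallback defaults here are illustrative, not the real ones."""
    max_messages = int(env.get("MAX_MESSAGES", "100"))
    max_bytes = int(env.get("MAX_BYTES", str(100 * 1024 * 1024)))
    return max_messages, max_bytes

def insert_with_retry(insert_rows, rows, attempts: int = 3) -> bool:
    """Attempt a batch insert up to `attempts` times; return success.
    `insert_rows` stands in for the BigQuery client call."""
    for attempt in range(attempts):
        try:
            insert_rows(rows)
            return True
        except Exception:
            if attempt == attempts - 1:
                return False  # exhausted retries; caller nacks the message
    return False
```

For ordering, the per-table publish would pass record_hash_key as the Pub/Sub ordering key, so all updates for one entity are delivered in order while different entities fan out in parallel.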