Architecture
System design and data flow for Datalake
Datalake is a Python 3.13 service that consumes Spanner change stream events from Pub/Sub and writes changelog rows into BigQuery.
System context
Spanner Change Streams -> Dataflow -> Pub/Sub -> Datalake (subscriber)
                                                   |-> BigQuery (changelog tables)
                                                   |-> Pub/Sub topics (per table)
                                                   |-> Sentry
Processing flow
- The service subscribes to the configured Pub/Sub subscription at startup.
- Each message is parsed into a ChangeStreamEvent.
- The processor computes record and data hash keys and filters no-op updates.
- Records are written to BigQuery tables with a _changelog suffix.
- A per-table Pub/Sub topic message is published for downstream pipelines.
- Messages are acked or nacked based on processing success.
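The flow above can be sketched in plain Python. This is a simplified, hypothetical model, not the actual implementation: the ChangeStreamEvent fields, the hashing scheme, and the in-memory seen_data_hashes map are all assumptions; the real BigQuery write and Pub/Sub publish are elided.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical, simplified stand-in for the service's ChangeStreamEvent model.
@dataclass
class ChangeStreamEvent:
    table_name: str
    keys: dict        # primary-key columns
    new_values: dict  # column values after the change

def record_hash_key(event: ChangeStreamEvent) -> str:
    """Stable hash identifying the entity (built from its primary key)."""
    payload = json.dumps(event.keys, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def data_hash_key(event: ChangeStreamEvent) -> str:
    """Stable hash of the payload, used to detect no-op updates."""
    payload = json.dumps(event.new_values, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def process(event: ChangeStreamEvent, seen_data_hashes: dict) -> bool:
    """Return True if the event was written, False if filtered as a no-op."""
    rec_key, data_key = record_hash_key(event), data_hash_key(event)
    if seen_data_hashes.get(rec_key) == data_key:
        return False  # same payload as last time for this entity: skip
    seen_data_hashes[rec_key] = data_key
    # ... write the row to the <table_name>_changelog table in BigQuery,
    # then publish to the per-table Pub/Sub topic (omitted here) ...
    return True
```

On success the message would be acked; an exception anywhere in `process` would lead to a nack so Pub/Sub redelivers.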
Components
- services/datalake/main.py: app setup, Pub/Sub subscription, lifecycle.
- services/datalake/core/change_stream_processor.py: event processing pipeline.
- services/datalake/core/table_config.py: source system and hash exclusions.
- services/datalake/utils/framework_utils.py: hashing and change detection.
- services/datalake/utils/pub_sub_utils.py: topic creation helpers.
- services/datalake/models/*: change stream event schemas.
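The hash exclusions mentioned for table_config.py might look like the following sketch. The TableConfig shape and field names here are illustrative assumptions, not the actual table_config.py API; the point is that excluded columns (e.g. bookkeeping timestamps) are dropped before hashing so they cannot defeat no-op detection.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableConfig:
    # Hypothetical config entry shape; field names are assumptions.
    source_system: str
    hash_excluded_columns: frozenset = frozenset()

def hashable_values(row: dict, config: TableConfig) -> dict:
    """Drop columns that should not affect the data hash, so updates that
    only touch excluded columns are detected as no-ops."""
    return {k: v for k, v in row.items()
            if k not in config.hash_excluded_columns}
```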
Reliability and scaling
- Ack deadlines are extended in a background thread for large messages.
- Flow control limits are configurable via MAX_MESSAGES and MAX_BYTES.
- BigQuery inserts are retried up to three times per batch.
- Pub/Sub ordering uses record_hash_key to keep entity updates ordered.
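Two of these behaviors can be sketched without the Google Cloud clients. The default limit values and the helper names below are assumptions for illustration; the real service wires MAX_MESSAGES / MAX_BYTES into the Pub/Sub subscriber's flow-control settings, and the retry wraps the actual BigQuery batch insert.

```python
import os

def flow_control_limits(env=os.environ) -> tuple[int, int]:
    """Read subscriber flow-control limits from the environment.
    The fallback defaults here are illustrative, not the real ones."""
    max_messages = int(env.get("MAX_MESSAGES", "100"))
    max_bytes = int(env.get("MAX_BYTES", str(100 * 1024 * 1024)))
    return max_messages, max_bytes

def insert_with_retry(insert_rows, rows, attempts: int = 3) -> bool:
    """Attempt a batch insert up to `attempts` times; return success.
    `insert_rows` stands in for the BigQuery client call."""
    for attempt in range(attempts):
        try:
            insert_rows(rows)
            return True
        except Exception:
            if attempt == attempts - 1:
                return False  # exhausted retries; caller nacks the message
    return False
```

For ordering, the per-table publish would pass record_hash_key as the Pub/Sub ordering key, so all updates for one entity are delivered in order while different entities fan out in parallel.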