# Datalake

## Overview

Python 3.13 change stream ingestion into a BigQuery datalake.
Datalake is an internal Python 3.13 FastAPI service that subscribes to Spanner change stream events from Pub/Sub, writes them into BigQuery changelog tables, and publishes per-table messages to Pub/Sub topics for downstream consumers.
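The core flow is: decode a change-stream message from Pub/Sub, normalize it into a changelog record, write it to BigQuery, and publish onward. A minimal sketch of the decode/normalize step is below; the field names (`table_name`, `mod_type`, `keys`, `new_values`) are illustrative assumptions, not the actual event schema, and stdlib `json` stands in for the `orjson` parser the service uses.

```python
import json


def parse_change_event(payload: bytes) -> dict:
    """Decode a Spanner change stream message into a flat changelog record.

    Field names here are illustrative; the real schema is defined by the
    upstream Dataflow job. The service parses with orjson; stdlib json is
    used here as a drop-in stand-in.
    """
    event = json.loads(payload)
    return {
        "table_name": event["table_name"],
        "mod_type": event["mod_type"],  # e.g. INSERT / UPDATE / DELETE
        "commit_timestamp": event["commit_timestamp"],
        # Merge primary-key columns and changed columns into one row image.
        "record": {**event.get("keys", {}), **event.get("new_values", {})},
    }
```

Keeping this step a pure function of the message bytes makes it easy to unit-test without a live Pub/Sub subscription.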
## Service profile
| Field | Value |
|---|---|
| Code | `services/datalake/` |
| Package | `datalake` |
| Runtime | Python 3.13 (FastAPI) |
| Status | Active |
| Primary owner | Joe Pardi |
| Secondary owner | None |
| Ingress | Internal only (Pub/Sub subscription) |
| Data sinks | BigQuery dataset tables |
| Outputs | Pub/Sub topics per table |
## Responsibilities
- Subscribe to Spanner change stream events from Pub/Sub.
- Normalize events and write changelog records to BigQuery.
- Publish per-table messages to Pub/Sub for data mart pipelines.
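For the per-table publishing responsibility, each source table maps to its own downstream topic. A sketch of such a mapping is below; the `datalake-<table>` naming convention is an assumption for illustration only (the real helpers live in `pub_sub_utils.py`).

```python
def topic_for_table(project_id: str, table_name: str) -> str:
    """Derive the fully qualified Pub/Sub topic path for a source table.

    The "datalake-<table>" convention is hypothetical; the actual topic
    names are defined by the service's topic creation helpers.
    """
    return f"projects/{project_id}/topics/datalake-{table_name.lower()}"
```

Deriving topic names deterministically from table names lets data mart pipelines subscribe to exactly the tables they need.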
## Non-goals
- Public API surface beyond health checks.
- Heavy transformation or business logic (keeps records close to source).
## Tech stack
- Python 3.13 + FastAPI (health endpoint only).
- Pub/Sub subscriber + publisher.
- BigQuery client for datalake writes.
- `orjson` for payload parsing.
- Sentry for error reporting.
## Code entrypoints
- `services/datalake/main.py`: app setup, Pub/Sub subscription, lifecycle.
- `services/datalake/core/change_stream_processor.py`: message processing.
- `services/datalake/core/table_config.py`: source system + hash exclusions.
- `services/datalake/utils/framework_utils.py`: hashing and change detection.
- `services/datalake/utils/pub_sub_utils.py`: topic creation helpers.
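The hashing and change detection in `framework_utils.py` can be sketched as a content hash over a record with certain columns excluded (the exclusions configured in `table_config.py`). The column names and SHA-256 choice below are assumptions for illustration, not the service's actual implementation.

```python
import hashlib
import json

# Columns skipped when hashing, e.g. volatile timestamps. The real
# exclusions are configured per table in table_config.py; these names
# are illustrative.
HASH_EXCLUSIONS = {"commit_timestamp", "updated_at"}


def row_hash(record: dict) -> str:
    """Stable content hash over a record, ignoring excluded columns."""
    material = {k: v for k, v in sorted(record.items()) if k not in HASH_EXCLUSIONS}
    encoded = json.dumps(material, sort_keys=True, default=str).encode()
    return hashlib.sha256(encoded).hexdigest()


def has_changed(old: dict, new: dict) -> bool:
    """True when any non-excluded column differs between two row versions."""
    return row_hash(old) != row_hash(new)
```

Excluding volatile columns keeps no-op updates (e.g. a touched timestamp with no data change) from generating spurious changelog records downstream.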
## Dependencies
- Upstream: Pub/Sub (Spanner change stream events relayed via Dataflow).
- Downstream: BigQuery datalake dataset; per-table Pub/Sub topics for downstream consumers.
- External: Secret Manager; Sentry.