Nest Engineering Docs
Datalake

Overview

Python 3.13 change-stream ingestion into the BigQuery datalake

Datalake is an internal Python 3.13 FastAPI service that subscribes to Spanner change stream events from Pub/Sub, writes them into BigQuery changelog tables, and publishes per-table messages to Pub/Sub topics for downstream consumers.
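The ingest path (subscribe, normalize, write, publish) can be sketched as below. The event field names (`table_name`, `mod_type`, `commit_timestamp`, `mods`) and the changelog-row layout are illustrative assumptions, not the service's actual schema, which is defined by the change stream connector and the BigQuery tables.

```python
import json
from datetime import datetime, timezone


def normalize_event(raw: bytes) -> list[dict]:
    """Turn one Spanner change stream event (as delivered via Pub/Sub)
    into flat changelog rows ready for a BigQuery streaming insert.

    Field names here are hypothetical; see the real processor in
    services/datalake/core/change_stream_processor.py.
    """
    event = json.loads(raw)
    ingested_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for mod in event.get("mods", []):
        rows.append({
            "table_name": event["table_name"],
            "mod_type": event["mod_type"],  # e.g. INSERT / UPDATE / DELETE
            "commit_timestamp": event["commit_timestamp"],
            "keys": json.dumps(mod.get("keys", {}), sort_keys=True),
            "new_values": json.dumps(mod.get("new_values", {}), sort_keys=True),
            "ingested_at": ingested_at,
        })
    return rows
```

Keeping the normalization this thin is what makes the non-goal below ("keeps records close to source") cheap to honor: each mod becomes one changelog row with the payload stored as-is.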

Service profile

  • Code: services/datalake/
  • Package: datalake
  • Runtime: Python 3.13 (FastAPI)
  • Status: Active
  • Primary owner: Joe Pardi
  • Secondary owner: None
  • Ingress: Internal only (Pub/Sub subscription)
  • Data sinks: BigQuery dataset tables
  • Outputs: Pub/Sub topics per table

Responsibilities

  • Subscribe to Spanner change stream events from Pub/Sub.
  • Normalize events and write changelog records to BigQuery.
  • Publish per-table messages to Pub/Sub for data mart pipelines.

Non-goals

  • Public API surface beyond health checks.
  • Heavy transformation or business logic (keeps records close to source).

Tech stack

  • Python 3.13 + FastAPI (health endpoint only).
  • Pub/Sub subscriber + publisher.
  • BigQuery client for datalake writes.
  • orjson for payload parsing.
  • Sentry for error reporting.
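On the parsing point: orjson accepts message bytes directly and returns plain Python objects, so it drops in for the stdlib decoder on the hot path. A minimal sketch (the `try`/`except` fallback is only so this snippet runs where orjson isn't installed; the service itself uses orjson):

```python
try:
    import orjson as _json  # fast path: parses Pub/Sub message bytes directly

    def parse_payload(data: bytes) -> dict:
        return _json.loads(data)
except ImportError:  # sketch-only fallback for environments without orjson
    import json as _json

    def parse_payload(data: bytes) -> dict:
        return _json.loads(data.decode("utf-8"))


payload = parse_payload(b'{"table_name": "users", "mod_type": "UPDATE"}')
```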

Code entrypoints

  • services/datalake/main.py: app setup, Pub/Sub subscription, lifecycle.
  • services/datalake/core/change_stream_processor.py: message processing.
  • services/datalake/core/table_config.py: source system + hash exclusions.
  • services/datalake/utils/framework_utils.py: hashing and change detection.
  • services/datalake/utils/pub_sub_utils.py: topic creation helpers.
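The interplay between table_config.py (hash exclusions) and framework_utils.py (hashing and change detection) can be sketched as follows. The exclusion map, table name, and column names are hypothetical examples, not the real configuration:

```python
import hashlib
import json

# Hypothetical per-table exclusions: columns that churn on every write
# (e.g. audit timestamps) and should not count as a real change.
HASH_EXCLUSIONS: dict[str, set[str]] = {
    "users": {"updated_at"},
}


def row_hash(table: str, row: dict) -> str:
    """Stable SHA-256 over the row with excluded columns removed."""
    excluded = HASH_EXCLUSIONS.get(table, set())
    filtered = {k: v for k, v in row.items() if k not in excluded}
    canonical = json.dumps(filtered, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(table: str, old: dict, new: dict) -> bool:
    """True only when a non-excluded column actually differs."""
    return row_hash(table, old) != row_hash(table, new)
```

The point of the exclusion set is to suppress no-op records: a write that only touches an excluded column hashes identically, so downstream topics don't see spurious change messages.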

Dependencies

  • Upstream: Pub/Sub (Spanner change stream events relayed via Dataflow).
  • Downstream: BigQuery datalake dataset; per-table Pub/Sub topics consumed by data mart pipelines.
  • External: Secret Manager; Sentry.
