
Architecture

System design and data flow for Datalake

Datalake is a Python 3.13 service that consumes Spanner change stream events from Pub/Sub and writes changelog rows into BigQuery.

System context

Spanner Change Streams -> Dataflow -> Pub/Sub -> Datalake (subscriber)
                                                 |-> BigQuery (changelog tables)
                                                 |-> Pub/Sub topics (per table)
                                                 |-> Sentry

Processing flow

  1. The service subscribes to the configured Pub/Sub subscription at startup.
  2. Each message is parsed into a ChangeStreamEvent.
  3. The processor computes record and data hash keys and filters no-op updates.
  4. Records are written to BigQuery tables with a _changelog suffix.
  5. A per-table Pub/Sub topic message is published for downstream pipelines.
  6. Messages are acked or nacked based on processing success.
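The steps above can be sketched as pure functions. This is a minimal illustration, not the service's actual code: `ChangeStreamEvent`, `record_hash_key`, and the `_changelog` suffix come from this doc, but the field names, JSON payload shape, and SHA-256 hashing scheme are assumptions.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class ChangeStreamEvent:
    table: str
    mod_type: str    # INSERT, UPDATE, or DELETE
    keys: dict       # primary-key columns (field names assumed)
    new_values: dict # column values after the change


def parse_event(payload: bytes) -> ChangeStreamEvent:
    """Step 2: parse a raw Pub/Sub payload into a ChangeStreamEvent."""
    raw = json.loads(payload)
    return ChangeStreamEvent(
        table=raw["table"],
        mod_type=raw["mod_type"],
        keys=raw["keys"],
        new_values=raw.get("new_values", {}),
    )


def hash_key(values: dict) -> str:
    """Step 3: stable hash over a key-sorted dict (scheme assumed)."""
    blob = json.dumps(values, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


# Last data hash seen per record hash, used to filter no-op updates.
_last_seen: dict[str, str] = {}


def is_noop(record_hash: str, data_hash: str) -> bool:
    """Step 3: an update whose data hash matches the last one seen is a no-op."""
    if _last_seen.get(record_hash) == data_hash:
        return True
    _last_seen[record_hash] = data_hash
    return False


def to_changelog_row(event: ChangeStreamEvent) -> dict:
    """Step 4: build the row destined for the <table>_changelog BigQuery table."""
    return {
        "table": event.table + "_changelog",
        "record_hash_key": hash_key(event.keys),
        "data_hash_key": hash_key(event.new_values),
        "mod_type": event.mod_type,
        **event.new_values,
    }
```

In the real pipeline the row would then be written to BigQuery, republished to the per-table topic, and the source message acked on success (steps 4-6).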

Components

  • services/datalake/main.py: app setup, Pub/Sub subscription, lifecycle.
  • services/datalake/core/change_stream_processor.py: event processing pipeline.
  • services/datalake/core/table_config.py: source system + hash exclusions.
  • services/datalake/utils/framework_utils.py: hashing and change detection.
  • services/datalake/utils/pub_sub_utils.py: topic creation helpers.
  • services/datalake/models/*: change stream event schemas.
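table_config.py pairs each table with its source system and the columns excluded from hashing. A plausible shape, assuming exclusions exist so metadata-only updates (for example a touched timestamp) hash identically and are filtered as no-ops; the table name, column names, and `TableConfig` fields here are illustrative, not the real config:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableConfig:
    source_system: str
    # Columns left out of the data hash so that metadata-only updates
    # are detected as no-ops. (Field and column names are assumptions.)
    hash_exclusions: frozenset = frozenset()


# Hypothetical example entry.
TABLE_CONFIGS = {
    "users": TableConfig(source_system="spanner",
                         hash_exclusions=frozenset({"updated_at"})),
}


def hashable_columns(table: str, row: dict) -> dict:
    """Drop excluded columns before computing the data hash key."""
    config = TABLE_CONFIGS.get(table, TableConfig(source_system="spanner"))
    return {k: v for k, v in row.items() if k not in config.hash_exclusions}
```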

Reliability and scaling

  • Ack deadlines are extended in a background thread for large messages.
  • Flow control limits are configurable via MAX_MESSAGES and MAX_BYTES.
  • BigQuery inserts are retried up to three times per batch.
  • Pub/Sub ordering uses record_hash_key to keep entity updates ordered.
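The flow-control and retry bullets can be sketched as follows. `MAX_MESSAGES` and `MAX_BYTES` are the variable names from this doc; the defaults, the generic exception handling, and the backoff shape are assumptions rather than the service's actual implementation:

```python
import os
import time

# Flow-control limits, configurable via environment (defaults assumed).
MAX_MESSAGES = int(os.getenv("MAX_MESSAGES", "100"))
MAX_BYTES = int(os.getenv("MAX_BYTES", str(100 * 1024 * 1024)))


def insert_with_retry(insert_fn, rows, attempts: int = 3, delay: float = 0.5):
    """Retry a BigQuery batch insert up to three times per batch.

    insert_fn is the batch-insert callable; the real service would likely
    retry only transient errors rather than every Exception.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return insert_fn(rows)
        except Exception as exc:
            last_error = exc
            time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    raise last_error
```

When republishing to the per-table topics, passing `record_hash_key` as the Pub/Sub ordering key keeps all updates for one entity in order, at the cost of serializing throughput per key.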
