Scrape
Architecture
Pipeline design and dependencies for Scrape
Pipeline stages
- Load clinic and organization metadata from Spanner.
- Skip clinics where Clinics.scrape is false.
- Launch Playwright and log into EzyVet per clinic.
- Download CSV reports (agenda, animal, contact, financial, receipts).
- Transform CSVs into normalized DataFrames with Polars.
- Upsert records into Spanner using batch mutations.
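The stages above amount to a per-clinic loop. A minimal sketch follows; the Clinic dataclass and the download_reports, normalize, and upsert callables are hypothetical placeholders standing in for the real Playwright, Polars, and Spanner code:

```python
from dataclasses import dataclass


@dataclass
class Clinic:
    id: str
    scrape: bool  # mirrors the Clinics.scrape flag in Spanner


def run_pipeline(clinics, download_reports, normalize, upsert):
    """Process each clinic end to end; skip those with scrape disabled."""
    processed = []
    for clinic in clinics:
        if not clinic.scrape:
            continue  # Clinics.scrape is false: skip this clinic
        raw_csvs = download_reports(clinic)  # Playwright login + CSV downloads
        frames = normalize(raw_csvs)         # CSVs -> normalized DataFrames
        upsert(clinic, frames)               # batch mutations into Spanner
        processed.append(clinic.id)
    return processed
```

Keeping the stages as injected callables makes each one independently testable and keeps the skip logic in one place.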
System context
Spanner (Clinics/Organizations) -> Scrape Job
|-> EzyVet UI (CSV reports)
|-> Secret Manager (credentials + proxy)
|-> Cloud Spanner (upserts)
|-> Sentry
Dependencies
- Upstream: EzyVet UI reports; Spanner Clinics/Organizations.
- Downstream: Cloud Spanner normalized tables.
- External: Secret Manager; Sentry; proxy provider.
Failure handling
- Playwright login and report downloads retry on transient failures.
- Spanner writes use insert_or_update mutations (idempotent).
- The job attempts every clinic before surfacing failures, so one failing clinic does not block the rest.
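The failure handling described above can be sketched as a retry helper for transient errors plus deferred error reporting. This is an illustrative sketch, not the job's actual code; the attempt count, backoff schedule, and function names are assumptions:

```python
import time


def retry(fn, attempts=3, base_delay=1.0):
    """Retry a transient operation (e.g. a Playwright login or a report
    download) with exponential backoff before giving up."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def process_all(clinics, process_one, base_delay=1.0):
    """Attempt every clinic, collect errors, and surface them only at the
    end so one failing clinic does not block the rest."""
    failures = {}
    for clinic in clinics:
        try:
            retry(lambda: process_one(clinic), base_delay=base_delay)
        except Exception as exc:
            failures[clinic] = exc
    if failures:
        raise RuntimeError(f"{len(failures)} clinic(s) failed: {sorted(failures)}")
```

Because Spanner writes use insert_or_update mutations, a retried attempt that partially wrote data is safe to repeat: re-running the upsert produces the same rows.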