
T-053a — GCP Billing Statements scrape (NAS)

Companion to T-053 (spec) + PR #757 (web-app side). This runbook covers the NAS-side execution: scrape pre-Oct-2025 billing statements from GCP Console, post them to the ingest route under the new gcp_statement doc type, and schedule a monthly run.

Why

GCP started issuing formal HK tax invoices ~Oct 2025. For billing periods 2024-10 → 2025-09 (13 months) the auditor-side artefact is a statement, not a tax invoice. The existing run-gcp-invoices.sh only scrapes the tax-invoice preset; statements were never fetched. This runbook adds a second pass.

Pre-reqs (one-time)

  • The web-app side (PR #757) is merged — the ingest route accepts vendor: 'gcp_statement' and the feed reads the new doc type.
  • The scraper code (PR #758, this branch) is deployed to the NAS. Specifically services/workspace-csv-scraper/gcp-invoices.mjs knows about INGEST_VENDOR + STATEMENT_MODE (gated by INGEST_VENDOR === 'gcp_statement').
  • Container rebuilt + redeployed from the new services/workspace-csv-scraper/ (the NAS uses the Docker image built from the Dockerfile there).

1. Add run-gcp-statements.sh

On the NAS:

#!/usr/bin/env bash
# /volume1/docker/workspace-billing/run-gcp-statements.sh
# Mirrors run-gcp-invoices.sh but scrapes the STATEMENTS preset and
# tells the scraper to use the statement parser + post under the
# gcp_statement vendor discriminator.
set -euo pipefail

cd /volume1/docker/workspace-billing

# Same Docker invocation as run-gcp-invoices.sh; only the env vars
# differ. Adjust IMAGE / volume mounts to match your existing
# run-gcp-invoices.sh exactly; the deltas below are the ONLY change.
docker run --rm \
  --name workspace-csv-scraper-gcp-statements \
  -v /volume1/docker/workspace-billing/profile:/profile \
  -v /volume1/docker/workspace-billing/downloads:/downloads \
  -e WORKSPACE_INGEST_SECRET="${WORKSPACE_INGEST_SECRET}" \
  -e VERCEL_AUTOMATION_BYPASS_SECRET="${VERCEL_AUTOMATION_BYPASS_SECRET}" \
  -e GCP_DOCS_URLS="https://console.cloud.google.com/billing/0102F7-3FFF33-823945/invoices" \
  -e GCP_PRESET_INITIAL="ALL_STATEMENTS" \
  -e GCP_PRESET_TARGET="ALL_STATEMENTS"  `# do NOT switch to ALL_STATUTORY_DOCUMENTS` \
  -e GCP_DOWNLOAD=true \
  -e RECON=false \
  -e GCP_SELECT_ALL=true \
  -e INGEST_VENDOR=gcp_statement       `# T-053: new vendor discriminator` \
  workspace-csv-scraper:latest \
  gcp-invoices

Make it executable:

chmod +x /volume1/docker/workspace-billing/run-gcp-statements.sh
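
Because the script runs under set -u, an unset WORKSPACE_INGEST_SECRET or VERCEL_AUTOMATION_BYPASS_SECRET aborts it with a bare "unbound variable" error, which is easy to hit under cron's minimal environment. The sketch below is one possible pre-flight check that reports every missing variable by name. The helper name require_env is hypothetical, and how the existing run-gcp-invoices.sh loads its secrets is not covered here.

```shell
# Hypothetical helper, not part of the existing scripts.
# Reports each unset or empty variable by name and returns non-zero
# if any are missing.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion; the ":-" default keeps
    # `set -u` from aborting before we can print a useful message.
    if [[ -z "${!name:-}" ]]; then
      echo "missing required env var: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

If adopted, it could be called near the top of run-gcp-statements.sh, for example: require_env WORKSPACE_INGEST_SECRET VERCEL_AUTOMATION_BYPASS_SECRET || exit 1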

2. One-time backfill (13 historical months)

/volume1/docker/workspace-billing/run-gcp-statements.sh 2>&1 | tee /var/log/gcp-statements-backfill.log

Expected output (from the scraper):

  • download: parsed N/N GCP statement(s), where N should be ~13.
  • For each statement: GCS-YYYYMM01 | YYYY-MM-01 | HKD <amount> | bill null | "Google Cloud billing statement".
  • POSTs return HTTP 200 with { ok: true, documentId: 'GCS-YYYYMM01', alreadyExisted: false, ... }.
  • Drive: 13 new PDFs under 14b. Vendor Invoices/ named ERL_GcpStatement_YYYYMM01_GCS-YYYYMM01.pdf.
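
To check the backfill log against the expected count, one option is to count the distinct statement IDs it mentions. This is a minimal sketch that assumes the log contains IDs in the GCS-YYYYMM01 form shown above. The function name count_statement_ids is hypothetical.

```shell
# Hypothetical helper: counts distinct statement IDs (GCS-YYYYMM01)
# that appear anywhere in a scraper log file.
count_statement_ids() {
  local log_file="$1"
  # grep -o prints only the matched ID; "|| true" keeps an empty log
  # from failing the pipeline under `set -o pipefail`.
  # sort -u de-duplicates IDs that appear on several lines
  # (e.g. the parse line and the POST response).
  { grep -oE 'GCS-[0-9]{6}01' "${log_file}" || true; } \
    | sort -u | wc -l | tr -d ' '
}
```

Running count_statement_ids /var/log/gcp-statements-backfill.log after the backfill should print a number close to the expected statement count; a much lower number suggests some periods were not parsed or posted.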

3. Verify in the web app

  • Open https://p-eop.theestablishers.com/records?tab=vendor-invoices (post-#753, this deep link lands on the Service Inv sub-tab).
  • Expect: all 13 historical GCP months expose a PDF icon + an orange "Statement" tag below it.
  • Expect: the 8 Oct 2025+ tax-invoice months still show the PDF icon with no tag (tax invoice wins the join).

4. Schedule monthly

Cron on the NAS — run the day after the typical statement release (GCP usually finalises ~5 days after period end):

# /etc/cron.d/gcp-statements
# cron does not support backslash line continuation; keep the entry on one line.
0 3 6 * * root /volume1/docker/workspace-billing/run-gcp-statements.sh >>/var/log/gcp-statements.log 2>&1

(Runs 03:00 on the 6th of every month, server local time.)

Idempotency

Safe to re-run: the ingest route's id == invoiceNumber (here GCS-YYYYMM01) makes re-ingest a no-op (alreadyExisted: true). The existing run-gcp-invoices.sh is untouched — they run in parallel without interfering, since the doc type + reference prefix differ.

Caveats / known gaps

  1. Statement filename → YYYYMM regex. The scraper extracts the billing period from the downloaded filename via /(\d{6})/, i.e. the first run of six consecutive digits. If GCP ever ships statements with a different name format (e.g. 2024_10_statement.pdf, which has no such run), the regex needs widening. Check the first backfill log carefully and adjust if invoiceDate lands as null.

  2. Amount parsing best-effort. Statement PDFs may not carry a clean HK$<amount> line; when the regex misses, the scraper posts invoiceAmount: 0. The cost-table CSV remains the authoritative amount source — the statement PDF is just documentary support, so the 0 fallback doesn't affect bookkeeping.

  3. Login session. Same Chrome profile as run-gcp-invoices.sh (mounted under /profile). If the existing tax-invoice scraper is working, statements will too.
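
The filename caveat (item 1) can be reproduced from the NAS shell before touching the scraper. The sketch below is a shell approximation, not the scraper's actual JavaScript: extract_period_strict mimics the documented /(\d{6})/ behaviour, and extract_period_wide shows one possible widening that also accepts a _ or - between year and month. Both function names and the sample filenames are hypothetical, and the widened pattern additionally restricts the year and month ranges, which the current regex does not.

```shell
# Approximates the documented /(\d{6})/: first run of six consecutive digits.
extract_period_strict() {
  local hit
  hit="$(printf '%s\n' "$1" | { grep -oE '[0-9]{6}' || true; } | head -n 1)"
  [ -n "${hit}" ] || return 1
  printf '%s\n' "${hit}"
}

# One possible widening (an assumption, not the scraper's code):
# YYYY, an optional "_" or "-", then MM in 01..12.
extract_period_wide() {
  local hit
  hit="$(printf '%s\n' "$1" \
    | { grep -oE '(19|20)[0-9]{2}[_-]?(0[1-9]|1[0-2])' || true; } \
    | head -n 1)"
  [ -n "${hit}" ] || return 1
  # Strip the separator so both filename styles normalise to YYYYMM.
  printf '%s\n' "${hit}" | tr -d '_-'
}
```

With these definitions, extract_period_strict 2024_10_statement.pdf returns non-zero and prints nothing, while extract_period_wide 2024_10_statement.pdf prints 202410; both print 202410 for statement_202410.pdf.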

Rollback

If something goes wrong during the backfill:

  • The gcp_statement File-Archive docs are deletable from Firestore (aote-system/file-archive/documents/entries/{GCS-YYYYMM01}) and the PDFs from Drive 14b. Vendor Invoices/. Re-run after.
  • The web app remains functional — null fileUrl + null pdfKind falls back to the disabled PDF button, unchanged from before T-053.

Log

  • 2026-06-19 created. Pending NAS-side deployment of PR #758 + one-time backfill.