T-053a: GCP Billing Statements scrape (NAS)
Companion to T-053 (spec) + PR #757 (web-app side). This runbook covers the NAS-side execution: scrape pre-Oct-2025 billing statements from GCP Console, post them to the ingest route under the new gcp_statement doc type, and schedule a monthly run.
Why
GCP started issuing formal HK tax invoices ~Oct 2025. For billing
periods 2024-10 → 2025-09 (13 months) the auditor-side artefact
is a statement, not a tax invoice. The existing run-gcp-invoices.sh
only scrapes the tax-invoice preset; statements were never fetched.
This runbook adds a second pass.
Pre-reqs (one-time)
- The web-app side (PR #757) is merged: the ingest route accepts vendor: 'gcp_statement' and the feed reads the new doc type.
- The scraper code (PR #758, this branch) is deployed to the NAS. Specifically, services/workspace-csv-scraper/gcp-invoices.mjs knows about INGEST_VENDOR + STATEMENT_MODE (gated by INGEST_VENDOR === 'gcp_statement').
- Container rebuilt + redeployed from the new services/workspace-csv-scraper/ (the NAS uses the Docker image from Dockerfile there).
1. Add run-gcp-statements.sh
On the NAS:
#!/usr/bin/env bash
# /volume1/docker/workspace-billing/run-gcp-statements.sh
# (The shebang must be the first line of the file, so the path comment
# sits below it.)
# Mirrors run-gcp-invoices.sh but scrapes the STATEMENTS preset and
# tells the scraper to use the statement parser + post under the
# gcp_statement vendor discriminator.
set -euo pipefail
cd /volume1/docker/workspace-billing
# Same Docker invocation as run-gcp-invoices.sh; only the env vars
# differ. Adjust IMAGE / volume mounts to match your existing
# run-gcp-invoices.sh exactly; the deltas below are the ONLY change.
docker run --rm \
  --name workspace-csv-scraper-gcp-statements \
  -v /volume1/docker/workspace-billing/profile:/profile \
  -v /volume1/docker/workspace-billing/downloads:/downloads \
  -e WORKSPACE_INGEST_SECRET="${WORKSPACE_INGEST_SECRET}" \
  -e VERCEL_AUTOMATION_BYPASS_SECRET="${VERCEL_AUTOMATION_BYPASS_SECRET}" \
  -e GCP_DOCS_URLS="https://console.cloud.google.com/billing/0102F7-3FFF33-823945/invoices" \
  -e GCP_PRESET_INITIAL="ALL_STATEMENTS" \
  -e GCP_PRESET_TARGET="ALL_STATEMENTS" `# do NOT switch to ALL_STATUTORY_DOCUMENTS` \
  -e GCP_DOWNLOAD=true \
  -e RECON=false \
  -e GCP_SELECT_ALL=true \
  -e INGEST_VENDOR=gcp_statement `# T-053: new vendor discriminator` \
  workspace-csv-scraper:latest \
  gcp-invoices
Make it executable:
chmod +x /volume1/docker/workspace-billing/run-gcp-statements.sh
2. One-time backfill (13 historical months)
/volume1/docker/workspace-billing/run-gcp-statements.sh 2>&1 | tee /var/log/gcp-statements-backfill.log
Expected output (from the scraper):
- download: parsed N/N GCP statement(s) — should be ~13.
- For each: GCS-YYYYMM01 | YYYY-MM-01 | HKD <amount> | bill null | "Google Cloud billing statement".
- POSTs return HTTP 200 with { ok: true, documentId: 'GCS-YYYYMM01', alreadyExisted: false, ... }.
- Drive: 13 new PDFs under 14b. Vendor Invoices/ named ERL_GcpStatement_YYYYMM01_GCS-YYYYMM01.pdf.
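One way to check the count is to pull the distinct statement IDs out of the backfill log. This is a minimal sketch that assumes the scraper writes the GCS-YYYYMM01 identifiers to the log in the form listed above; statement_ids is a hypothetical helper name, not part of the scraper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each distinct statement ID (GCS-YYYYMM01)
# found in a log file, one per line, sorted.
# Assumption: the scraper logs IDs in the "GCS-YYYYMM01 | ..." form.
statement_ids() {
  grep -oE 'GCS-[0-9]{6}01' "$1" | sort -u
}

# Usage (should print roughly 13 after a successful backfill):
#   statement_ids /var/log/gcp-statements-backfill.log | wc -l
```

Piping the same output through a plain listing instead of wc -l makes it easy to spot a missing month by eye.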
3. Verify in the web app
- Open https://p-eop.theestablishers.com/records?tab=vendor-invoices (post-#753, the deep link /records?tab=vendor-invoices lands on the Service Inv sub-tab).
- Expect: all 13 historical GCP months expose a PDF icon + an orange "Statement" tag below it.
- Expect: the 8 Oct 2025+ tax-invoice months still show the PDF icon with no tag (tax invoice wins the join).
4. Schedule monthly
Cron on the NAS — run the day after the typical statement release (GCP usually finalises ~5 days after period end):
# /etc/cron.d/gcp-statements
# Keep the entry on a single line: cron does not support backslash line continuation.
0 3 6 * * root /volume1/docker/workspace-billing/run-gcp-statements.sh >>/var/log/gcp-statements.log 2>&1
(Runs 03:00 on the 6th of every month, server local time.)
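Cron starts jobs with a minimal environment, so the two secrets that run-gcp-statements.sh forwards to Docker must be defined somewhere the job can see them. How the existing run-gcp-invoices.sh obtains them is not covered in this runbook; the sketch below only shows a guard that reports a missing variable by name. require_env is a hypothetical helper, not something the scraper ships.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only if every named variable is set and
# non-empty; otherwise report the first missing name on stderr.
require_env() {
  local name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required env var: $name" >&2
      return 1
    fi
  done
}

# Usage, near the top of run-gcp-statements.sh:
#   require_env WORKSPACE_INGEST_SECRET VERCEL_AUTOMATION_BYPASS_SECRET || exit 1
```

With set -u already active in the script, an unset secret aborts the run anyway; the guard mainly makes the log message explicit and also catches variables that are set but empty.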
Idempotency
Safe to re-run: the ingest route's id == invoiceNumber (here
GCS-YYYYMM01) makes re-ingest a no-op (alreadyExisted: true). The
existing run-gcp-invoices.sh is untouched — they run in
parallel without interfering, since the doc type + reference prefix
differ.
Caveats / known gaps
- Statement filename → YYYYMM regex. The scraper extracts the billing period from the downloaded filename via /(\d{6})/. If GCP ever ships statements with a different name format (e.g. 2024_10_statement.pdf), the regex needs widening. Log the first backfill carefully and adjust if invoiceDate lands as null.
- Amount parsing best-effort. Statement PDFs may not carry a clean HK$<amount> line; when the regex misses, the scraper posts invoiceAmount: 0. The cost-table CSV remains the authoritative amount source; the statement PDF is just documentary support, so the 0 fallback doesn't affect bookkeeping.
- Login session. Same Chrome profile as run-gcp-invoices.sh (mounted under /profile). If the existing tax-invoice scraper is working, statements will too.
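To make the filename caveat concrete, here is a sketch of a widened period match, written in bash so it can be tried against downloaded filenames on the NAS. It is an illustration only: the scraper itself is JavaScript (gcp-invoices.mjs), so the pattern would need porting, and extract_period is a hypothetical name. It accepts YYYYMM, YYYY_MM and YYYY-MM, and assumes years in the 2000s.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the billing period as YYYYMM, or return 1
# if no period can be found in the filename.
extract_period() {
  local name="${1##*/}"   # strip any directory prefix
  # Year 20xx, an optional "_" or "-" separator, then month 01-12.
  # The leading group stops the match from starting mid-number.
  local re='(^|[^0-9])(20[0-9]{2})[-_]?(0[1-9]|1[0-2])'
  if [[ "$name" =~ $re ]]; then
    printf '%s%s\n' "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
  else
    return 1
  fi
}

# Usage:
#   extract_period /downloads/2024_10_statement.pdf
```

Unlike the bare six-digit match, this rejects digit runs that do not form a plausible year and month, which may matter if other numeric IDs ever appear in the filename.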
Rollback
If something goes wrong during the backfill:
- The gcp_statement File-Archive docs are deletable from Firestore (aote-system/file-archive/documents/entries/{GCS-YYYYMM01}) and the PDFs from Drive 14b. Vendor Invoices/. Re-run after.
- The web app remains functional: null fileUrl + null pdfKind falls back to the disabled PDF button, unchanged from before T-053.
Log
- 2026-06-19 created. Pending NAS-side deployment of PR #758 + one-time backfill.