The challenge

A digital media publisher with 15 million monthly unique visitors had enabled the GA4 BigQuery export expecting a clean, reliable data feed. What they got was a raw event stream with significant gaps — streaming export data that did not match daily table totals, events missing required parameters, and a table schema that changed without warning when GA4 shipped updates. Their data engineering team had spent three months trying to build stable Looker Studio and Tableau reports on top of the raw export, but kept running into inconsistencies they could not explain.

Diagnosing the data quality issues

A systematic audit of the BigQuery dataset against GA4 reports identified three distinct data quality problems. First, the streaming export was missing approximately 3% of events due to a known GA4 behaviour where high-traffic properties experience event throttling in the streaming table. Second, twelve custom event parameters had been registered after the initial export configuration and were appearing inconsistently in the schema. Third, the data engineering team was querying the streaming table directly rather than using the daily tables — which are deduplicated and more reliable.

We also identified that internal traffic from the publisher's editorial team was reaching GA4 despite an IP filter — the filter had been configured incorrectly, applying to the wrong data stream.
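The audit itself can be reduced to a per-day count comparison between the consolidated daily table and the streaming (intraday) table. A minimal sketch of the query builder is below; the dataset name is hypothetical (GA4 exports land in a dataset named `analytics_<property_id>`), and note that GA4 normally drops the intraday table once the daily table is written, so the comparison is only meaningful during the window in which both exist.

```python
from datetime import date

# Hypothetical dataset name -- GA4 exports land in analytics_<property_id>.
DATASET = "analytics_123456789"

def audit_query(day: date) -> str:
    """Build a query comparing the consolidated daily table against the
    streaming table for one export day. Table suffixes follow the GA4
    export convention: events_YYYYMMDD and events_intraday_YYYYMMDD."""
    suffix = day.strftime("%Y%m%d")
    return f"""
SELECT
  (SELECT COUNT(*) FROM `{DATASET}.events_{suffix}`)          AS daily_events,
  (SELECT COUNT(*) FROM `{DATASET}.events_intraday_{suffix}`) AS streaming_events
""".strip()
```

Running the generated SQL for each day in the audit window is what surfaced the roughly 3% gap between the two tables.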

Pipeline architecture redesign

We redesigned the pipeline to use the daily export tables as the primary source, with the streaming table used only for same-day data that had not yet been consolidated. A daily reconciliation job compared event counts between the GA4 Reporting API and BigQuery to flag any significant gaps before downstream reports consumed the data.
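The routing rule is simple enough to state as a function: any day strictly before today reads from the consolidated daily table; the current day falls back to the provisional streaming table. A sketch, using the GA4 export's table-suffix convention:

```python
from datetime import date

def source_table(requested: date, today: date) -> str:
    """Pick the export table for one day: the consolidated daily table for
    T+1 and older, the provisional intraday (streaming) table for today."""
    suffix = requested.strftime("%Y%m%d")
    if requested < today:
        return f"events_{suffix}"        # deduplicated, consolidated
    return f"events_intraday_{suffix}"   # provisional, subject to revision
```

Downstream models tagged any rows sourced from the intraday table as provisional, so same-day numbers were never mistaken for final figures.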

A transformation layer was built on top of the raw export using dbt — normalising event parameters into consistent columns, sessionising events using the ga_session_id parameter, and joining user properties to event records for easier querying.
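GA4 stores event parameters as a repeated struct of a key plus a typed value (`string_value`, `int_value`, `double_value`), which is what makes the raw export awkward to query. The dbt layer does this flattening with `UNNEST` in SQL; the Python sketch below shows the equivalent logic on one event record, purely to illustrate the shape of the transformation.

```python
def normalise_params(event: dict) -> dict:
    """Flatten GA4's event_params array (key + typed value struct) into
    flat columns -- the same reshaping the dbt layer does with UNNEST."""
    flat = {"event_name": event["event_name"]}
    for param in event.get("event_params", []):
        value = param["value"]
        # Exactly one of the typed fields is populated per parameter.
        for typed in ("string_value", "int_value", "double_value"):
            if value.get(typed) is not None:
                flat[param["key"]] = value[typed]
                break
    return flat
```

With parameters flattened, sessionisation reduces to grouping events on the `ga_session_id` column the function exposes.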

  • Daily table as primary source for T+1 data
  • Streaming table for same-day data only, clearly labelled as provisional
  • Reconciliation job comparing GA4 API vs BigQuery event counts
  • dbt transformation layer for sessionisation and parameter normalisation
  • Alert when daily completeness drops below 98%
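The reconciliation and alerting steps above come down to one ratio check. A minimal sketch, assuming the API and BigQuery counts for a day have already been fetched:

```python
def completeness(api_count: int, bq_count: int) -> float:
    """Share of API-reported events present in BigQuery for one day."""
    if api_count == 0:
        return 1.0  # nothing expected, nothing missing
    return bq_count / api_count

def should_alert(api_count: int, bq_count: int, threshold: float = 0.98) -> bool:
    """Fire the completeness alert when the daily ratio drops below 98%."""
    return completeness(api_count, bq_count) < threshold
```

The 98% threshold is the one from the bullet list; in practice it was tuned against the known ~3% streaming-export gap so that the daily table, not the streaming table, had to satisfy it.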

Content and revenue analytics models

With a reliable data foundation, we built three analytics models specifically for the publisher's use cases. A content performance model aggregating engagement metrics — scroll depth, time on page, return visits — by article, author, and category. An audience cohort model tracking reader retention by acquisition channel and first content type consumed. A revenue attribution model connecting ad impression data from the publisher's ad server with reader sessions in GA4.
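To make the cohort model concrete, the core computation is week-over-week return rate grouped by acquisition channel. The sketch below is a toy version on in-memory tuples; the production model is a dbt SQL model over the sessionised tables, and the input shape here is an assumption for illustration.

```python
from collections import defaultdict

def retention_by_channel(sessions):
    """sessions: iterable of (user_id, channel, week_index) tuples, where
    week 0 is the user's acquisition week. Returns, per channel, the share
    of week-0 users who returned in week 1."""
    cohort = defaultdict(set)    # channel -> users acquired in week 0
    returned = defaultdict(set)  # channel -> cohort users seen in week 1
    for user, channel, week in sessions:
        if week == 0:
            cohort[channel].add(user)
    for user, channel, week in sessions:
        if week == 1 and user in cohort[channel]:
            returned[channel].add(user)
    return {ch: len(returned[ch]) / len(cohort[ch]) for ch in cohort}
```

The content performance and revenue attribution models follow the same pattern: aggregate the sessionised event stream by a business dimension (article, author, ad placement) rather than by cohort week.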

All three models were built in dbt and exposed via Looker Studio and Tableau — giving editorial, product, and commercial teams a shared data foundation rather than three separate teams running their own queries against the raw export.

Outcome

The pipeline now achieves 99.8% data completeness measured daily against the GA4 Reporting API. The 4-hour data freshness SLA — daily data available in BigQuery by 10am — has been met consistently since implementation. Three BI tools now connect to the same dbt-modelled layer, eliminating the metric discrepancies that had eroded trust in analytics across editorial, product, and commercial teams.