Building a Cross-Cloud Data Pipeline

Title: Building a Cross-Cloud Data Pipeline
Date: 2024-10-27
Tags: Data Engineering, Google Cloud Platform, Amazon Web Services, BigQuery, S3

Pasted image 20260104202514.png

Modern data stacks often span multiple clouds due to acquisitions, best-of-breed services, or cost optimization.
This exercise explores querying data stored in AWS S3 directly from Google BigQuery using BigQuery's external table functionality, eliminating the need for complex ETL and data duplication.

GCP Project: analytics-prod-001
AWS Account: data-lake-account
Key Services: GCP BigQuery & IAM, AWS S3 & IAM.
Data: Sample CSV files (~2 GB) in an S3 bucket: s3://company-raw-data/logs/.

GCP Side: Created a service account (bq-s3-reader@) and generated a JSON key.
AWS Side: Created an IAM user, attached a custom policy granting s3:GetObject and s3:ListBucket on the specific bucket. Associated the GCP service account's key with this user.
BigQuery: Used the Cloud Console to create an external data source, providing the AWS user's access key and secret.
SQL Execution: Wrote a CREATE EXTERNAL TABLE DDL statement pointing to the S3 URI and defining the CSV schema.

Challenge: Initial "Access Denied" errors from BigQuery.
Fix: Realized the AWS IAM policy needed an explicit "s3:ListBucket" action on the bucket resource (not just the objects). Updated the policy, and the table creation succeeded.
Challenge: CSV file had a malformed row.
Fix: Used the MAX_BAD_RECORDS option in the external table definition to skip errors, but noted this for data quality review.

The setup is surprisingly straightforward for read-only, ad-hoc querying. The data never moves to GCP; BigQuery orchestrates the compute and pulls only the necessary data from S3.
Performance: Querying is noticeably slower than native BigQuery tables, especially for complex JOIN operations. This is expected as it's scanning data over the internet.
Cost Model: You pay for BigQuery compute (bytes processed) + AWS S3 GET requests and data transfer out. For infrequent queries on cold data, this can be cheaper than replicating the entire dataset.

For production, consider using AWS Role-based access (instead of user keys) for better security, though it requires a more complex "GCP Workload Identity Federation" setup.
For frequently queried data, set up a scheduled materialized view inside BigQuery to cache results and improve performance.
Test with columnar formats like Parquet or ORC for significantly better query performance and cost efficiency.

BigQuery's federated query capability is a powerful tool for a multi-cloud strategy, enabling a single point of analysis.
The key takeaway is the trade-off: unparalleled flexibility vs. query performance. It's perfect for exploratory analysis on raw data in S3 before committing to a full migration or ETL process.