Building a Cross-Cloud Data Pipeline
Title: Building a Cross-Cloud Data Pipeline
Date: 2024-10-27
Tags: Data Engineering, Google Cloud Platform, Amazon Web Services, BigQuery, S3
I. Introduction
1.1 Context & Purpose
- Modern data stacks often span multiple clouds due to acquisitions, best-of-breed services, or cost optimization.
- This exercise explores querying data stored in AWS S3 directly from Google BigQuery using BigQuery's external table functionality, eliminating the need for complex ETL and data duplication.
1.2 What This Covers
- Setting up a secure, read-only connection from GCP to AWS.
- Configuring IAM roles and permissions on both platforms.
- Creating and querying a BigQuery external table over S3 data.
- Observations on performance and cost implications.
II. Setup & Environment
2.1 Network & Tools Overview
- GCP Project:
analytics-prod-001 - AWS Account:
data-lake-account - Key Services: GCP BigQuery & IAM, AWS S3 & IAM.
- Data: Sample CSV files (~2 GB) in an S3 bucket:
s3://company-raw-data/logs/.
2.2 Prerequisites / Preparations
- Admin access to both GCP and AWS accounts.
- A GCP service account created specifically for this integration.
- S3 bucket with sample data already in place.
III. Execution & Findings
3.1 Steps Taken
- GCP Side: Created a service account (
bq-s3-reader@) and generated a JSON key. - AWS Side: Created an IAM user, attached a custom policy granting
s3:GetObjectands3:ListBucketon the specific bucket. Associated the GCP service account's key with this user. - BigQuery: Used the Cloud Console to create an external data source, providing the AWS user's access key and secret.
- SQL Execution: Wrote a
CREATE EXTERNAL TABLEDDL statement pointing to the S3 URI and defining the CSV schema.
3.2 Challenges & Fixes
- Challenge: Initial "Access Denied" errors from BigQuery.
- Fix: Realized the AWS IAM policy needed an explicit
"s3:ListBucket"action on the bucket resource (not just the objects). Updated the policy, and the table creation succeeded. - Challenge: CSV file had a malformed row.
- Fix: Used the
MAX_BAD_RECORDSoption in the external table definition to skip errors, but noted this for data quality review.
IV. Observations & Insights
- The setup is surprisingly straightforward for read-only, ad-hoc querying. The data never moves to GCP; BigQuery orchestrates the compute and pulls only the necessary data from S3.
- Performance: Querying is noticeably slower than native BigQuery tables, especially for complex
JOINoperations. This is expected as it's scanning data over the internet. - Cost Model: You pay for BigQuery compute (bytes processed) + AWS S3 GET requests and data transfer out. For infrequent queries on cold data, this can be cheaper than replicating the entire dataset.
V. Considerations & Next Steps
- For production, consider using AWS Role-based access (instead of user keys) for better security, though it requires a more complex "GCP Workload Identity Federation" setup.
- For frequently queried data, set up a scheduled materialized view inside BigQuery to cache results and improve performance.
- Test with columnar formats like Parquet or ORC for significantly better query performance and cost efficiency.
VI. Conclusion
- BigQuery's federated query capability is a powerful tool for a multi-cloud strategy, enabling a single point of analysis.
- The key takeaway is the trade-off: unparalleled flexibility vs. query performance. It's perfect for exploratory analysis on raw data in S3 before committing to a full migration or ETL process.
