I recently got a contract job from an adtech startup.
They were using Kafka and Kafka consumers to insert event data into BigQuery.
That setup was running into scaling issues, so they wanted me to replace it with Pub/Sub and Dataflow.
As I understood, they wanted it to be:
a) Cheap (the cheaper the pipeline, the more margin for the adtech startup)
b) Reliable
c) Elastic (handle between 1 billion and 50 billion events per day)
d) Not latency-sensitive (a delay of up to 24 hours in ingestion into BigQuery is fine)
BigQuery
Streaming inserts cost $0.05 per GB
When this startup does not need real-time ingestion, why pay for it? I want to stretch their runway further by making the pipeline as cheap as possible within their requirements. Load jobs are free, so they seem like the right fit.
Load jobs per table per day — 1,000 (including failures). The limit of 1,000 load jobs per table per day cannot be raised.
Assuming load jobs fail at a rate of 50%, we could push one load job every 3 minutes and still stay within the limit: 480 successful loads plus 480 failed attempts is 960 jobs, just under 1,000. To keep a comfortable margin and avoid getting throttled, we settled on one load every 15 minutes per table.
Load jobs per project per day — 50,000 (including failures)
Currently, the startup has 150 customers.
Total jobs per day = 150 customers × 4 loads per hour × 24 hours = 14,400
Assuming the same 50% failure rate, we need headroom for 28,800 jobs per day.
So we don't risk running over the 50,000 limit, and it can likely be raised later by contacting Google Cloud Platform support, since the documentation does not caution against it the way it does for the per-table limit.
A delay of up to 24 hours is fine in our case, so we can simply drop the load frequency if we run into issues.
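For later reference, that 15-minute cadence ends up being a single setting on the BigQuery sink in the Beam Java SDK used in the Dataflow section below. A minimal fragment, with the shard count being an arbitrary placeholder:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

// Fragment: how "one load job per table every 15 minutes" is expressed on the sink.
class LoadCadence {
  static BigQueryIO.Write<TableRow> batchedWrite() {
    return BigQueryIO.writeTableRows()
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)          // load jobs instead of streaming inserts
        .withTriggeringFrequency(Duration.standardMinutes(15))   // roughly one load per destination table per trigger
        .withNumFileShards(100);                                 // some shard count is required with FILE_LOADS in streaming
  }
}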
Cloud Pub/Sub
A message that cannot be delivered within the maximum retention time of 7 days is deleted and is no longer accessible. This typically happens when subscribers do not keep up with the flow of messages.
The default retention time is good enough for our purposes.
Cloud Pub/Sub does not currently support replays
So, it’s not a true Kafka replacement.
I tried looking for hosted Kafka replacements but did not find any fully managed, price-competitive, MongoDB Atlas-like Kafka offering.
This is a startup and they barely have enough people as it is, let alone dedicated people to manage a Kafka cluster.
But Pub/Sub is more elastic and lower maintenance, and this job specifically demands both (daily events vary from 1B to 50B), so it's a good tradeoff.
Dataflow
First, I started with the Python SDK but ran into many limitations, so I dropped it and switched to the Java SDK even though I have zero experience working with Java. How hard can it be?
I found both the language and the SDK quite confusing.
Code
Now, I just needed code which:
Reads JSON events from Cloud Pub/Sub
Loads the events into BigQuery every 15 minutes using file loads, to save the cost of streaming inserts.
Routes each event based on the user_id and campaign_id fields in the JSON event: user_id is the dataset name and campaign_id is the table name. The partition name comes from the created_at timestamp field. The schema for all tables stays the same.
The solution should be flexible enough to use across multiple pipelines. Ideally, just one file that can easily be changed and revision-controlled.
And here is what I ended up using (if you see any issues, leave a comment on the gist on GitHub):
You won't see the embedded gist if you have JavaScript disabled; here is the gist URL: https://gist.github.com/zero-master/88c9f3698c91f64c158da6716ed24f58
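In case the embed doesn't load, here is a rough sketch of the shape of the pipeline. To be clear, this is not the exact code from the gist: the project name, subscription path, trimmed-down schema and Gson-based JSON parsing are stand-ins, and the partition routing assumes created_at is an ISO-8601 timestamp and that tables are day-partitioned by ingestion time.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.api.services.bigquery.model.TimePartitioning;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.util.Arrays;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.Duration;

public class PubsubToBigQueryLoads {

  // Shared schema used for every destination table (trimmed to the routing fields here).
  private static final TableSchema SCHEMA = new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("user_id").setType("STRING"),
      new TableFieldSchema().setName("campaign_id").setType("STRING"),
      new TableFieldSchema().setName("created_at").setType("TIMESTAMP")));

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("ReadFromPubsub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/events"))
        .apply("ParseJson", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via(PubsubToBigQueryLoads::parseJson))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteWithFileLoads", BigQueryIO.writeTableRows()
            // Route each row: user_id -> dataset, campaign_id -> table,
            // created_at date -> partition decorator ($YYYYMMDD).
            .to((ValueInSingleWindow<TableRow> value) -> {
              TableRow row = value.getValue();
              String day = ((String) row.get("created_at"))
                  .substring(0, 10).replace("-", "");  // "2019-01-31T..." -> "20190131"
              return new TableDestination(String.format("my-project:%s.%s$%s",
                  row.get("user_id"), row.get("campaign_id"), day), null);
            })
            .withSchema(SCHEMA)
            .withTimePartitioning(new TimePartitioning().setType("DAY"))  // day-partition tables created on the fly
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)               // batch loads, not streaming inserts
            .withTriggeringFrequency(Duration.standardMinutes(15))        // one load cycle per 15 minutes
            .withNumFileShards(100)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }

  // Turn a JSON event string into a TableRow; Gson is just one way to do this.
  private static TableRow parseJson(String json) {
    Map<String, Object> fields =
        new Gson().fromJson(json, new TypeToken<Map<String, Object>>() {}.getType());
    TableRow row = new TableRow();
    fields.forEach(row::set);
    return row;
  }
}

The two settings doing the cost work are FILE_LOADS and the 15-minute triggering frequency; the rest is just routing each row to its user_id dataset, campaign_id table and created_at partition.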
Cost
It's a cash-strapped startup and they said they were short on money. Been there, done that, so I understand their difficulty. I ended up charging 100 euros for this solution.