I recently got a contract job from an adtech startup.
They were using Kafka and Kafka consumers to insert event data into BigQuery.
That setup was running into scaling issues, so they wanted me to replace it with Pub/Sub and Dataflow.
As I understood, they wanted it to be:
a) Cheap (the cheaper the pipeline, the more margin for the adtech startup)
b) Reliable
c) Elastic (handle between 1 billion and 50 billion events per day)
d) Not latency-sensitive (a delay of up to 24 hours in ingestion into BigQuery is fine)
BigQuery
Streaming inserts cost $0.05 per GB
When this startup does not need real-time ingestion, why pay for it? I want to stretch their runway further by making the pipeline as cheap as possible within their requirements. Load jobs are free, so they seem like the right fit.
Load jobs per table per day — 1,000 (including failures). The limit of 1,000 load jobs per table per day cannot be raised.
Assuming load jobs fail at a rate of 50%, we could push one load job every 3 minutes and still stay within the limit: 480 successful loads plus 480 failed attempts is 960 jobs, just under 1,000. To keep a comfortable margin and avoid getting throttled, we settled on one load every 15 minutes per table.
Load jobs per project per day — 50,000 (including failures)
Currently, the startup has 150 customers.
Total jobs per day = 150 customers × 4 loads per hour × 24 hours = 14,400
Assuming the same 50% failure rate, we need headroom for 28,800 jobs per day.
So we don't risk running over the 50,000 limit, and it can likely be raised later by contacting Google Cloud Platform support, since the documentation does not caution against it the way it does for the per-table limit.
A delay of up to 24 hours is fine in our case, so we can simply drop the load frequency if we run into issues.
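For later reference, that 15-minute cadence ends up being a single setting on the BigQuery sink in the Beam Java SDK used in the Dataflow section below. A minimal fragment, with the shard count being an arbitrary placeholder:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

// Fragment: how "one load job per table every 15 minutes" is expressed on the sink.
class LoadCadence {
  static BigQueryIO.Write<TableRow> batchedWrite() {
    return BigQueryIO.writeTableRows()
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)          // load jobs instead of streaming inserts
        .withTriggeringFrequency(Duration.standardMinutes(15))   // roughly one load per destination table per trigger
        .withNumFileShards(100);                                 // some shard count is required with FILE_LOADS in streaming
  }
}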
Cloud Pub/Sub
A message that cannot be delivered within the maximum retention time of 7 days is deleted and is no longer accessible. This typically happens when subscribers do not keep up with the flow of messages.
The default retention time is good enough for our purposes.
Cloud Pub/Sub does not currently support replays
So, it’s not a true Kafka replacement.
I tried looking for hosted Kafka replacements but did not find any fully managed, price-competitive, MongoDB Atlas-like Kafka offering.
This is a startup and they barely have enough people as it is, let alone dedicated people to manage a Kafka cluster.
But Pub/Sub is more elastic and lower maintenance, and this job specifically demands both (daily events vary from 1B to 50B), so it's a good tradeoff.
Dataflow
First, I started with the Python SDK but ran into many limitations, so I dropped it and switched to the Java SDK even though I have zero experience working with Java. How hard can it be?
I found both the language and the SDK quite confusing.
Code
Now, I just needed code which:
Reads JSON events from Cloud Pub/Sub
Loads the events into BigQuery every 15 minutes using file loads, to save the cost of streaming inserts.
Routes each event based on the user_id and campaign_id fields in the JSON event: user_id is the dataset name and campaign_id is the table name. The partition name comes from the created_at timestamp field. The schema for all tables stays the same.
The solution should be flexible enough to use across multiple pipelines. Ideally, just one file that can easily be changed and revision-controlled.
And here is what I ended up using (if you see any issues, leave a comment on the gist on GitHub):
You won't see the embedded gist if you have JavaScript disabled; here is the gist URL: https://gist.github.com/zero-master/88c9f3698c91f64c158da6716ed24f58
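In case the embed doesn't load, here is a rough sketch of the shape of the pipeline. To be clear, this is not the exact code from the gist: the project name, subscription path, trimmed-down schema and Gson-based JSON parsing are stand-ins, and the partition routing assumes created_at is an ISO-8601 timestamp and that tables are day-partitioned by ingestion time.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.api.services.bigquery.model.TimePartitioning;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.util.Arrays;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.Duration;

public class PubsubToBigQueryLoads {

  // Shared schema used for every destination table (trimmed to the routing fields here).
  private static final TableSchema SCHEMA = new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("user_id").setType("STRING"),
      new TableFieldSchema().setName("campaign_id").setType("STRING"),
      new TableFieldSchema().setName("created_at").setType("TIMESTAMP")));

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("ReadFromPubsub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/events"))
        .apply("ParseJson", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via(PubsubToBigQueryLoads::parseJson))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteWithFileLoads", BigQueryIO.writeTableRows()
            // Route each row: user_id -> dataset, campaign_id -> table,
            // created_at date -> partition decorator ($YYYYMMDD).
            .to((ValueInSingleWindow<TableRow> value) -> {
              TableRow row = value.getValue();
              String day = ((String) row.get("created_at"))
                  .substring(0, 10).replace("-", "");  // "2019-01-31T..." -> "20190131"
              return new TableDestination(String.format("my-project:%s.%s$%s",
                  row.get("user_id"), row.get("campaign_id"), day), null);
            })
            .withSchema(SCHEMA)
            .withTimePartitioning(new TimePartitioning().setType("DAY"))  // day-partition tables created on the fly
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)               // batch loads, not streaming inserts
            .withTriggeringFrequency(Duration.standardMinutes(15))        // one load cycle per 15 minutes
            .withNumFileShards(100)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }

  // Turn a JSON event string into a TableRow; Gson is just one way to do this.
  private static TableRow parseJson(String json) {
    Map<String, Object> fields =
        new Gson().fromJson(json, new TypeToken<Map<String, Object>>() {}.getType());
    TableRow row = new TableRow();
    fields.forEach(row::set);
    return row;
  }
}

The two settings doing the cost work are FILE_LOADS and the 15-minute triggering frequency; the rest is just routing each row to its user_id dataset, campaign_id table and created_at partition.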
Cost
It's a cash-strapped startup and they said they were short on money. Been there, done that, so I understand their difficulty. I ended up charging 100 euros for this solution.