It was April 2016. I received a call from my partner, Ankit. He had an idea for a Shopify app that would allow merchants to group multiple products together and sell them as a kit.
On the surface, it looked pretty simple. So my friend Pradeep and I started working on it. Little did we know that we were embarking on a journey that would demand an exceptional level of creativity and problem-solving on our part. Our problems were compounded by the fact that we had to use PHP.
In this post, I will try to sum up the approaches and techniques we used to scale this app from processing 100 webhooks per minute to 100,000 webhooks per minute.
What is Product Kits?
Product Kits is a Shopify app that is really simple in terms of functionality – it helps stores sell a combo of two or more products. The idea is: you create a bundle and you sell it; we set the inventory of the children when a bundle is sold, and calculate the inventory of the bundle when a child is sold. For example, a Shaving Kit is a bundle to which you linked two razors and a foam. If you have 200 razors and 50 foams, we will set the Shaving Kit's inventory to 50. If a Shaving Kit is sold, we will set razors to 198 and foam to 49 – and vice versa. Got it?
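The inventory rule above can be sketched in a few lines. The app itself is PHP; this is an illustrative Python sketch with hypothetical names, not our actual code:

```python
# A bundle's stock is limited by its scarcest child:
# floor(child_stock / qty_needed_per_kit), minimized over all children.

def bundle_inventory(children):
    """children is a list of (stock, qty_per_kit) pairs."""
    return min(stock // qty for stock, qty in children)

def sell_bundle(children, kits_sold=1):
    """Selling a kit decrements each child by its per-kit quantity."""
    return [(stock - qty * kits_sold, qty) for stock, qty in children]

# Shaving Kit: two razors and one foam per kit.
children = [(200, 2), (50, 1)]          # (stock, qty per kit)
print(bundle_inventory(children))       # 50 kits available
print(sell_bundle(children))            # [(198, 2), (49, 1)]
```

The reverse direction (a child sold directly) is the same computation: recompute `bundle_inventory` from the children's new stock.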
A Good Start
We started with a “multiple-database” approach, meaning each Store is provided with a separate database. We had been burnt before by tables with more than 100 million rows, and we wanted to avoid sharding. One central database maps each Shop to its own database, so a single query is enough to route any request.
To avoid confusion, I will use “Webhook” whenever I am talking about traffic from either Shopify or a Store.
The Scary Launch
Immediately after the launch, we found the app was bigger than we had anticipated. After a week we were getting about one webhook per second, and we had our first ultimate-plan Store using our app – they had probably hired someone, as there were 400-plus bundles. Speaking of webhooks: a webhook is the information Shopify sends us when something has been updated, either an order or a product. There were two major kinds of webhooks we needed. A hot webhook causes four outgoing API hits and multiple updates to the database, which in turn triggers five further cold webhooks on average, regardless of the kind of Shop. A cold webhook requires only a simple database value verification – three queries. Then there is the naked webhook, which is unnecessary, but there was no way of disabling it: if there is no bundle associated with the webhook, we don't process it – two queries.
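The triage above can be sketched as a tiny dispatcher. This is a hypothetical Python illustration (the real handler is PHP); the topic names are real Shopify webhook topics, but the exact hot/cold mapping here is illustrative, not our production rules:

```python
# Triage incoming webhooks before doing any expensive work.
HOT_TOPICS = {"orders/create", "products/update"}

def classify(topic, has_bundle):
    if not has_bundle:
        return "naked"  # nothing of ours involved: ~2 queries, then drop
    if topic in HOT_TOPICS:
        return "hot"    # ~4 outgoing API hits + DB updates; spawns ~5 cold hooks
    return "cold"       # simple DB value verification: ~3 queries
```

Checking `has_bundle` first is the whole point: it lets the cheapest case exit before any processing happens.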
First Big Change after Three Months
We were handling about 10 webhooks per second with a peak of 40, and we desperately needed to split the PHP and MySQL servers as RAM was getting full. We launched a master–slave setup and gave the entire RAM to MySQL, using the slave for backups. There was no downtime because we left the webhook database out of the migration, but stored webhooks got delayed by 25 minutes: the queue accumulated 17K pending webhooks during the ~25 minutes of the export, and it took us 6 hours to drain it to zero – which we could only tell by querying the database. Lesson learned: have some link to check the queue status. Another lesson learned: delete old databases as soon as a Store uninstalls the app. We were growing every day. Pradeep wrote on Skype, “Our death is inevitable” (“hum to marenge he”).
Optimizing the Sync
At installation time we do a product sync; the biggest store we had had over 1 million products. Shopify imposes a limit of at most 250 products per API call, so a store with 100,000 products takes 400 calls to sync. Keeping this data in memory requires more RAM; alternatively, we could reduce the page size to 50 and increase the requests to 2,000. Each page also takes more than 2 seconds to fetch and a second to save, as the data needs some “preprocessing”. We were growing by 3 installations per day net, but actually getting 6 to 8 installs. Every installation triggered a sync that would cause 10K QPS (queries per second) and tax our CPU and RAM. Inventory got mismatched for the first time; our client yelled at us and uninstalled the app. We moved the sync queue onto a separate server. After running four workers on that single server, we got our first support ticket complaining that sync was taking too long (still!). We launched another server with the same configuration, so we could now handle 8 installations simultaneously. It didn't help retain customers: the same ticket came when the 9th installation was waiting for its sync. Now, it would be madness to add yet another server, right?
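The paging arithmetic above can be made concrete. A minimal Python sketch, where `fetch_page` and `save_page` are hypothetical stand-ins for the API client and the "preprocess and store" step:

```python
PAGE_SIZE = 250  # Shopify's per-call maximum at the time

def pages_needed(total_products, page_size=PAGE_SIZE):
    """Number of API calls to pull a full catalog (ceiling division)."""
    return -(-total_products // page_size)

def sync_products(total_products, fetch_page, save_page):
    """Pull the catalog page by page instead of holding it all in RAM."""
    for page in range(1, pages_needed(total_products) + 1):
        save_page(fetch_page(page, PAGE_SIZE))

print(pages_needed(100_000))                 # 400 requests at 250/page
print(pages_needed(100_000, page_size=50))   # 2000 requests at 50/page
```

The page-size trade-off is visible directly: smaller pages mean less memory per request but many more round trips, each with its fetch-and-save latency.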
We split one big sync into sub-syncs. Previously the workers table contained store and status; now it had store, status, page, and action.
Now these workers could pick the top rows of each store, so no one would complain – previously, one big store had been blocking the sync of every other store (why FIFO is a pain – more on that later). This was resolved, and we never faced the same problem again. On our best (or worst) day, we had over 20 installations taking 6 hours in total at most, but every Store had its first few bundles ready within a matter of minutes. Later, we used the Shopify frontend API to eliminate the need for the sync, but unpublished products were not available through it, so we kept both options. We could have moved to GraphQL (newly launched in beta at the time of writing this post) but we didn't – more on this in the next post.
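The fairness trick above is simple: pick at most one pending job per store on each pass, instead of draining jobs strictly FIFO. A hypothetical Python sketch using the table columns from the post (store, status, page, action):

```python
# Round-robin over stores so one huge catalog can't starve the rest.

def next_jobs(rows):
    """Pick at most one pending row per store from pre-ordered rows."""
    picked, seen = [], set()
    for row in rows:  # rows pre-ordered, e.g. by page within each store
        if row["status"] == "pending" and row["store"] not in seen:
            picked.append(row)
            seen.add(row["store"])
    return picked

rows = [
    {"store": "big",   "status": "pending", "page": 7, "action": "products"},
    {"store": "big",   "status": "pending", "page": 8, "action": "products"},
    {"store": "small", "status": "pending", "page": 1, "action": "products"},
]
print([r["store"] for r in next_jobs(rows)])   # ['big', 'small']
```

The big store still finishes eventually, but the small store's page 1 no longer waits behind the big store's hundreds of pages.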
First downtime, and a few more in one day – the Fallout!
Remember the Webhooks? Linode restarted our server for some reason – emergency maintenance, blah blah blah. It took 2 minutes to reboot, and then we had 70K webhooks coming from Shopify within minutes, if not seconds. That's when we realized that if a webhook doesn't get a 200 status code, Shopify resends it – only quicker! We were not ready for such huge traffic. The server choked and PHP-FPM crashed for the very first time. We thought the server wasn't starting, but then it happened again; after finding the cause in the logs, we spun up two more servers. This is something Amazon Kinesis would have solved for us – but we were constrained. With three servers dedicated to webhooks, we finally survived the “fallout” of the downtime. In front of them, we had an Nginx load balancer. On our very worst day, this setup handled a peak of 4,000 hits per second with no problems. We did, however, tune the open-files settings and tcp_mem size.
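The load-balancing layer can be sketched as a minimal Nginx config. This is an illustration of the shape of the setup, not our actual file – the addresses and path are hypothetical:

```nginx
# Round-robin across the three dedicated webhook servers.
upstream webhook_backends {
    server 10.0.0.11;
    server 10.0.0.12;
    server 10.0.0.13;
}

server {
    listen 80;
    location /webhooks/ {
        proxy_pass http://webhook_backends;
    }
}
```

The OS-level tuning mentioned above (open-files limits, tcp_mem) lives outside Nginx, in `limits.conf` and `sysctl`, and matters just as much at 4,000 hits per second.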
Delaying the Scalability
By now we were pretty much stable – or so we thought. When we reached 80 webhooks per second, we simply spun up a new CRUD server and a parallel webhook server. However, the main optimization was not far off! This time we didn't die; we were only curious. Our MySQL queries were hovering around 5K QPS on an 8GB Linode machine, which isn't much considering we had a 4-core CPU at our disposal. We looked at MySQL and tuned the following variables to our needs (nothing more):
binlog_row_image, binlog_rows_query_log_events, binlog_cache_size, innodb_log_file_size, innodb_read_io_threads, innodb_write_io_threads, innodb_stats_persistent_sample_pages, innodb_stats_transient_sample_pages, innodb_rollback_on_timeout, thread_cache_size, max_allowed_packet
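As a rough illustration, such tuning lives in my.cnf. The values below are placeholders to show the shape of the change, not our production numbers – the right values depend entirely on your workload and hardware:

```ini
[mysqld]
# Larger redo log: fewer checkpoint flushes under write-heavy load.
innodb_log_file_size    = 1G
# Match background I/O threads to what the disks can actually serve.
innodb_read_io_threads  = 8
innodb_write_io_threads = 8
# Reuse connection threads instead of re-spawning them.
thread_cache_size       = 64
# Room for large statements/rows without "packet too big" errors.
max_allowed_packet      = 64M
binlog_cache_size       = 1M
```

Note that some of these (like `innodb_log_file_size` on older MySQL versions) require a clean restart to change, so they are worth getting right before the server is under fire.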
Imagine our excitement when merely changing a few variables took us to 30K QPS. It is worth mentioning that we hired multiple service providers, but the company that actually came up with a proper fix was MyDBops. They spent significant time explaining what they did and why they did it. They went the extra mile by explaining each parameter to us – later, after expanding the servers further, we were comfortable enough to make such changes ourselves.
The pain of FIFO and Leaky Bucket
At this point, each worker took data FIFO (first in, first out). The problem with this approach: if one store sends you 10K webhooks over a span of two minutes (think bulk product import or update apps – our sworn enemies), combined with Shopify's rate limit of 2 QPS with a leaky bucket of 40, we are done! So we coded a custom algorithm we called Thor. Thor looks at the distribution of webhooks and collects more (but at most 20) for more active stores, and fewer (but at least 5) for less active stores. It can be configured to work on created_at, with the distribution spread over the queue (Head, Middle, Top and Tail). If at any point we get stuck by hitting the Shopify limit, we set a “wait” flag with a timestamp and move on to the next Shop; we revisit that Store after 20 seconds, by which time its leaky bucket will have drained.
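The core of that idea fits in a few functions. A hypothetical Python sketch (the real Thor is PHP, and its thresholds were configurable):

```python
import time

MAX_BATCH, MIN_BATCH = 20, 5
COOLDOWN = 20  # seconds until Shopify's leaky bucket has drained enough

def batch_size(pending_count, busy_threshold=100):
    """Busy stores get up to 20 webhooks per pass, quiet ones at least 5."""
    return MAX_BATCH if pending_count >= busy_threshold else MIN_BATCH

def pick_store(stores, now=None):
    """Return the next store whose 'wait' flag has expired, else None."""
    now = time.time() if now is None else now
    for store in stores:
        if store.get("wait_until", 0) <= now:
            return store
    return None

def on_rate_limited(store, now=None):
    """Hit the 2 QPS / 40-burst limit: flag the store and move on."""
    now = time.time() if now is None else now
    store["wait_until"] = now + COOLDOWN
```

The “wait” flag is what breaks FIFO's head-of-line blocking: a rate-limited store steps aside for 20 seconds instead of stalling every other store behind it.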
In Elixir, we might have had a separate process per store. Meanwhile, we created a naked-webhook removal query, which deletes webhooks for which no bundle exists. It worked for products, but for orders it was too tricky and needed a stored procedure running every 2 minutes. This one was for “emergencies”, when we had a million webhooks pending – it would buy us time, and it did in the coming months.
Failed to Rewrite
We hired two devs to rewrite our PHP webhook-handling code in Node; both of them failed. The first dev gave up when he couldn't solve a leaking-connection problem. The second gave up on the first dev's code, rewrote his own, and got trapped in an infinite loop and many more bugs. After a year-long round trip, we decided to stick with PHP.
Meltdown – Migration – Importance of CRON!
When Linode had to fix the “Meltdown” vulnerability, it needed downtime to migrate servers. We had to launch a duplicate of each server and gun down the old ones. However, we didn't have anything in place for the master server, which led us to an active-master approach: both servers are master–master, but only one receives all the writes. MaxScale was deployed; switching between the servers takes time – 2 seconds by default. We finally moved the MySQL servers off the vulnerable hosts. Then we realized we had forgotten to add the workers to cron. This was a ticking time bomb waiting to explode – but another migration was already waiting. Guess what happened? Linode gave us a really tempting offer: more RAM and bandwidth at the same price. We took the bait and moved our Linode servers, starting with CRUD. Everything went smoothly, and we found we were 1.2x faster. Then we detonated the bomb: we moved our workers (all 7 of them).
After 10 hours, we had 7.5M webhooks pending, along with our first support ticket: inventory isn't working. We checked the stats page; it wouldn't load, but a count query worked fine and showed a bit over 7.5M webhooks. We had started the migration at 13:00 IST; at 00:25 IST, the three of us were on Skype. While my partner was doing the maths to estimate webhooks per second from 1.5K stores in a possible “attack”, we created three “new” webhook tables for the future – all incoming webhooks from the three webhook servers would go there. MySQL table size matters a lot, especially when you are grouping results. We then created 7 views, each covering one million rows with the remainder in the last one, and spun up three more workers for the new tables; each server's workers would handle one million rows. The next day (after 12 hours) we checked: only 2M webhooks remained, all in the new tables. We knew most of them were cold webhooks and would go away.
The difficult decision of firing a customer!
We found one Store that was blasting our servers with over 1M webhooks per day. We uninstalled our app from that store and explained the problem we were facing. One sad customer made the rest of our customers very happy.
So not all scalability problems are solved by servers or configuration. Some scalability problems are solved by saying NO.
Dropping this customer without getting a one-star review was a win for us, but it did make us sad.