Scale or Die: How We Made Gruveo Scalable

May 1, 2015
/ Case Studies

Quick fact: This April, Gruveo monthly call minutes grew 644% year over year. In fact, we have seen this 7X to 8X YoY growth every month since last October, when the iOS app was launched. (December 2014 was an exception: call minutes jumped a staggering 13X vs. December 2013.)

Gruveo monthly call minutes since Aug. 2013

With this kind of growth rate, we began to ask ourselves whether our infrastructure could keep up. An internal audit conducted in February showed that scaling Gruveo in its then-current state was problematic, so we rolled up our sleeves and got to work fixing that.

There are three main components to Gruveo (not counting the iOS app), each of which would have come under severe strain if another TechCrunch article about Gruveo came out:

The web server serving the Gruveo web app, as well as the blog and static pages such as the FAQ
The signaling server used for matching WebRTC endpoints and forwarding messages between them
The STUN/TURN servers that help in NAT traversal and relay encrypted media traffic when a P2P connection between clients cannot be established

The first thing we set out to address was the website. Instead of hosting it on a VPS we had set up ourselves, we refactored it into an Elastic Beanstalk app that now runs on AWS. With goodies such as Elastic Load Balancers and auto-scaling in place, we no longer worry about a sudden spike in traffic.

With our STUN/TURN servers, the problem that prevented us from scaling efficiently was the TURN credentials mechanism we had been using. Under the old scheme, referred to as the “Typical TURN Auth” in this presentation, the signaling server would generate TURN credentials for each connecting client and store those in a database accessible by the TURN servers. That worked, but increasing the number of TURN servers also meant having to scale the database while maintaining the signaling server – database – TURN server link.

To solve this issue, we ditched the old approach in favor of the TURN REST API (described as “Stateless TURN Auth” at the link above). Under this scheme, credentials are derived from a secret shared between the signaling server and the TURN servers, so nothing has to be stored or looked up. Frankly, that was like a breath of fresh air to us. No longer did we need to haul around a MySQL credentials database and worry about scaling it along; bumping up our TURN server capacity became as easy as spinning up new instances from a virtual machine image.
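As a rough illustration of stateless TURN auth, here is a minimal Python sketch of how time-limited credentials are commonly derived: the username carries an expiry timestamp, and the password is a Base64-encoded HMAC-SHA1 of that username keyed with the shared secret. The function names, the `user_id` format and the one-hour default lifetime are assumptions made for this example; the post does not describe Gruveo's actual parameters, and in production the check is done by the TURN server itself rather than by application code.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _sign(shared_secret: str, username: str) -> bytes:
    """Base64-encoded HMAC-SHA1 of the username, keyed with the shared secret."""
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest)


def make_turn_credentials(shared_secret: str, user_id: str,
                          ttl_seconds: int = 3600,
                          now: Optional[float] = None) -> dict:
    """Signaling-server side: derive credentials without storing anything.

    All names here are hypothetical; only the derivation scheme matters.
    """
    issued_at = time.time() if now is None else now
    expiry = int(issued_at) + ttl_seconds
    username = f"{expiry}:{user_id}"  # the expiry timestamp travels in the username
    password = _sign(shared_secret, username).decode()
    return {"username": username, "password": password, "ttl": ttl_seconds}


def verify_turn_credentials(shared_secret: str, username: str, password: str,
                            now: Optional[float] = None) -> bool:
    """TURN-server side: recompute the password from the same secret."""
    current = time.time() if now is None else now
    expiry_text, _, _ = username.partition(":")
    if not expiry_text.isdigit() or int(expiry_text) < current:
        return False  # malformed or expired username
    expected = _sign(shared_secret, username)
    return hmac.compare_digest(expected, password.encode())
```

Because both sides only need the shared secret, adding a TURN server means handing it that secret, not connecting it to a credentials database.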

Having put the TURN scalability behind us, we were left with addressing one last component: the signaling server. The custom Node.js server we had in place could scale up pretty decently, but scaling out was an issue. How do we efficiently pass messages between distinct server instances?

Enter Redis and its Pub/Sub functionality, where clients can subscribe to “channels” and publish messages to them. By having all our Node.js server instances communicate through a Redis cluster, we solved the problem of scaling out the signaling. We also automated that scaling by turning the signaling server into just another Elastic Beanstalk app.
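To make the idea concrete, here is a small Python sketch of the routing pattern. It is not our Node.js code: `InMemoryBroker` is a stand-in for Redis so the example runs without a server, and the one-channel-per-client layout (`client:<id>`) is an assumption chosen for illustration. Each signaling instance subscribes to the channels of the clients connected to it, so a message published by any instance reaches whichever instance holds the recipient.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBroker:
    """Stand-in for Redis Pub/Sub: fans a published message out to subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        """Deliver to every subscriber and return how many received the message."""
        handlers = list(self._subscribers.get(channel, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


class SignalingInstance:
    """One signaling server process; it only knows its own connected clients."""

    def __init__(self, name: str, broker: InMemoryBroker) -> None:
        self.name = name
        self.broker = broker
        self.inboxes: Dict[str, List[str]] = {}  # client id -> delivered messages

    def connect(self, client_id: str) -> None:
        """Register a client locally and listen on its (hypothetical) channel."""
        self.inboxes[client_id] = []
        self.broker.subscribe(f"client:{client_id}", self.inboxes[client_id].append)

    def send(self, to_client_id: str, message: str) -> bool:
        """Publish to the recipient's channel, wherever that client is connected."""
        return self.broker.publish(f"client:{to_client_id}", message) > 0
```

With two instances sharing one broker, a client connected to the first can call `send("bob", "offer")` and the message lands in the inbox held by the second. No instance needs to know where any other client lives, which is what lets new instances be added behind a load balancer.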

The result: a scalable, powerful Gruveo able to withstand whatever growth the future may bring. (And if past experience is any guide, there will be a lot.)