FEBRUARY 17 2023
Dynamic AutoScaling of GitHub Runners

In this article we explain our transition to GitHub Actions for our CI/CD needs at Dgraph Labs Inc. As part of this effort, we built & implemented a new in-house architecture for "Dynamic AutoScaling of GitHub Runners" to power this setup.
In the past, our CI/CD was powered by a self-hosted on-prem TeamCity setup, which turned out to be a little difficult to operate & manage in a startup setting like ours. Transitioning to GitHub Actions & implementing our new in-house "Dynamic AutoScaling of GitHub Runners" has helped us reduce our Compute Costs, Maintenance Efforts & Configuration Time across our repositories for our CI/CD efforts (with improved security).
Background
Before we begin we would like to give you an overview of CI/CD & explain our needs for it at Dgraph Labs Inc.
What is CI/CD?
source: https://www.synopsys.com/glossary/what-is-cicd.html
CI/CD is a two-step process that dramatically streamlines code development and delivery using the power of automation. CI (Continuous Integration) makes developer tasks around source code integration, testing and version control more efficient - so the software can get built with higher quality. CD (Continuous Deployment) automates software testing, release & deployment. CI/CD is often referred to as the DevOps Infinity Loop (as illustrated in the image above).
Why is CI/CD important to us?
At Dgraph Labs Inc, we use CI/CD to facilitate our SDLC (Software Development Life Cycle) for our Dgraph Database and our Dgraph DBaaS (Cloud Offering) components. Like any tech company, we want to minimize our bugs & deliver high-quality products. The testing standards become even stricter for a Database company like ours, as the database is the most critical component of a software stack. To facilitate this, we follow the Practical Test Pyramid model, along with other kinds of measurements instrumented into our CI/CD. To summarize, CI/CD helps us with:
- Ensuring higher code quality
- Obtaining continuous feedback
- Shipping efficiently (with higher confidence)
How do we use CI/CD?
CI/CD at Dgraph
Our CI/CD use-cases mostly revolve around the following areas: Building for Multiple Architectures (amd64/arm64), Testing (Unit/Integration & Load), Deployments, Security Audits, Code Linting, Benchmarking & Code Coverage. We will cover some of these topics in future blog posts.
Old Setup (TeamCity)
As described above, Dgraph Labs Inc ran a self-managed on-prem TeamCity setup for CI/CD in the past. The setup looked similar to the image below.
source: https://confluence.jetbrains.com/display/tcd65/getting+started
This setup was quite difficult for a small team like ours to manage, monitor & keep highly available. The work involved infrastructure setup, ensuring the right security posture, and instrumenting observability on these systems. There were 3 issues here for us: Compute Costs, Maintenance Efforts & Configuration Time.
Firstly, the Compute Costs were additive because we needed not only the Server & Agent compute machines but also an Observability Stack (& instrumentation) for these critical systems. Secondly, the Maintenance Efforts on the issues we encountered (like Security Patching, Upgrades, Disk Issues, Inconsistent Test Results Reporting, etc.) were taxing the team and taking time away from our development cycles. Lastly, the Configuration Time was also a problem for us because the job configurations lived outside our codebase (in the Server), VCS configurations for new repos needed instrumentation, & we had to write custom install/cleanup steps for basic setup tasks in the job definitions.
As a result, Compute Costs ⬆⬆, Maintenance Efforts ⬆⬆ & Configuration Time ⬆⬆ were all high. This led us to re-think how we could transition to a new system that solved these issues and offered support for both Public & Private repositories.
NOTE: TeamCity is a great product. As explained above, time & ease-of-use were the driving factors that made us transition out.
New Setup (GitHub Actions)
Our research led us to GitHub Actions. Given that we were already on GitHub for our VCS, it made sense to explore this further. We were quite content with what it had to offer, as it came with immediate benefits. The notable architecture differences were that the Server was fully managed (unlike the previous setup) & that GitHub semi-managed the Runners (a.k.a. Agents). Below we show an example CD run for a multi-architecture release at Dgraph Labs Inc on GitHub Actions.
GitHub Actions CD steps for Dgraph (everything well integrated into the GitHub ecosystem)
Firstly, the Compute Costs were lower because we only manage the Self-Hosted Runners. We made use of the free GitHub-hosted runners wherever possible. For jobs that required higher resource specs, we had the option to run them on higher-spec Self-Hosted Runners. Secondly, the Maintenance Efforts reduced because we had fewer components to manage. Although GitHub had done a great job of simplifying the Runner setup steps, the Runners still needed manual management, and this remained a concern. Lastly, the Configuration Time reduced drastically because of the Action Marketplace, which provided pre-templated tasks to perform pre-setup and post-cleanup on the Runners. And with this transition, the code & job definitions lived together (unlike our old setup).
As a result, our Compute Costs ⬆, Maintenance Efforts ⬆ & Configuration Time ⬇. This was a great win. There was still some room to improve here, because we were manually attaching & detaching Runners on an as-needed basis. There was another problem too: "Idle Runners" were leading to wasted resource spend.
Dynamic AutoScaling Of GitHub Runners
As described in the previous section, our remaining problems were limited to Compute Costs (Idle Runners) & Maintenance Efforts (manual Runner attach/detach). We started exploring potential solutions for "Dynamic AutoScaling of GitHub Runners". We found 2 existing solutions: ARC & Philips-Scalable-Runner. ARC was specific to the container ecosystem and did not apply to us. The Philips-Scalable-Runner had too many components to facilitate this. So this led us to building our own in-house solution.
Our design needed to solve these:
- minimal AWS service use
- support different labels (i.e. Runner types like arm64 / amd64)
- support different repositories

This led us to the architecture below.
Dynamic AutoScaling of GitHub Runners
There are 3 logical pieces in this architecture:
- VM Images: We bake custom AMIs with a specialized startup script in them. The startup script has logic to connect to the SSM Parameter Store & read its configuration at startup. When the AMI comes up as an EC2 instance, it reads its config from the SSM Parameter Store & self-configures against GitHub to service our Jobs. (A minimal sketch of this startup logic appears after this list.)
- SSM: We use the SSM Parameter Store as a way to store Runner configurations. This is essentially a KV store for configs. We store the Runner configurations as Values, keyed by the EC2 instance.
- Orchestrator: This is essentially our Controller (written in Python). The Orchestrator monitors GitHub events and has logic to dynamically Scale Up or Scale Down based on the Job Queue & the available Runner count. It has hooks into GitHub & AWS (SSM Parameter Store & EC2) to facilitate this process. In the Scale Up phase, the Orchestrator creates an SSM Parameter Store entry and follows it up by creating an EC2 instance from the AMI (through Launch Templates). In the Scale Down phase, the Orchestrator deletes the SSM Parameter Store entry and follows it up by terminating the corresponding EC2 instance. (A sketch of these scale-up/scale-down calls appears after this list.)
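To make the VM Images piece more concrete, here is a minimal sketch (in Python, using boto3) of the kind of self-configuration logic such a startup script could implement. It is illustrative only: the parameter prefix, the Name-tag lookup, the config fields & the runner install directory are assumptions for this example, not our exact implementation.

```python
# startup_configure.py: illustrative sketch of the self-configuration logic
# baked into our AMIs. Parameter prefix, tag scheme, config fields and the
# runner directory are hypothetical stand-ins, not the exact implementation.
import json
import subprocess

import boto3
import requests

REGION = "us-east-1"
PARAM_PREFIX = "/github-runners"     # hypothetical SSM key prefix
RUNNER_DIR = "/opt/actions-runner"   # where the GitHub runner binaries live


def my_runner_name() -> str:
    # Look up this instance's "Name" tag, which the Orchestrator sets to the
    # same value it used as the SSM Parameter Store key.
    iid = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text
    ec2 = boto3.client("ec2", region_name=REGION)
    instance = ec2.describe_instances(InstanceIds=[iid])["Reservations"][0]["Instances"][0]
    return next(t["Value"] for t in instance["Tags"] if t["Key"] == "Name")


def main() -> None:
    # Read this instance's Runner configuration written by the Orchestrator.
    ssm = boto3.client("ssm", region_name=REGION)
    param = ssm.get_parameter(
        Name=f"{PARAM_PREFIX}/{my_runner_name()}", WithDecryption=True
    )
    cfg = json.loads(param["Parameter"]["Value"])  # repo URL, token, labels, ...

    # Register this machine as a self-hosted runner, then start servicing jobs.
    subprocess.run(
        [
            "./config.sh", "--unattended",
            "--url", cfg["repo_url"],
            "--token", cfg["registration_token"],
            "--labels", ",".join(cfg["labels"]),  # e.g. ["self-hosted", "arm64"]
        ],
        cwd=RUNNER_DIR,
        check=True,
    )
    subprocess.run(["./run.sh"], cwd=RUNNER_DIR, check=True)


if __name__ == "__main__":
    main()
```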
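Similarly, here is a minimal sketch of the Orchestrator's scale-up & scale-down hooks into SSM Parameter Store & EC2, assuming a pre-created Launch Template. The Launch Template name, parameter path & tag scheme are again hypothetical; the real Orchestrator also watches GitHub's job queue & Runner counts to decide when to invoke these.

```python
# orchestrator_scaling.py: illustrative sketch of the Orchestrator's
# scale-up / scale-down hooks into SSM Parameter Store and EC2.
# Launch Template name, parameter path and tag scheme are hypothetical.
import json
import uuid

import boto3

REGION = "us-east-1"
PARAM_PREFIX = "/github-runners"
LAUNCH_TEMPLATE = "github-runner-arm64"  # one Launch Template per Runner type/label

ssm = boto3.client("ssm", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)


def scale_up(repo_url: str, registration_token: str, labels: list[str]) -> str:
    """Write the Runner config to SSM, then launch an EC2 instance from the AMI."""
    runner_name = f"gh-runner-{uuid.uuid4().hex[:8]}"

    # 1. Create the SSM Parameter Store entry the new instance reads at boot.
    ssm.put_parameter(
        Name=f"{PARAM_PREFIX}/{runner_name}",
        Value=json.dumps({
            "repo_url": repo_url,
            "registration_token": registration_token,
            "labels": labels,
        }),
        Type="SecureString",
        Overwrite=True,
    )

    # 2. Launch the instance from the baked AMI via its Launch Template,
    #    tagging it with the same name so it can find its config.
    instance = ec2.run_instances(
        LaunchTemplate={"LaunchTemplateName": LAUNCH_TEMPLATE, "Version": "$Latest"},
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": runner_name}],
        }],
    )["Instances"][0]
    return instance["InstanceId"]


def scale_down(runner_name: str, instance_id: str) -> None:
    """Delete the SSM entry, then terminate the corresponding EC2 instance."""
    ssm.delete_parameter(Name=f"{PARAM_PREFIX}/{runner_name}")
    ec2.terminate_instances(InstanceIds=[instance_id])
```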
Note: We are considering open-sourcing this project. For that reason, we have only given an overview and skipped the full implementation details. If you are interested in discussing further, do hit us up. We would love to partner.
Financial Analysis
CI/CD Cost Graph
We enabled "Dynamic AutoScaling of GitHub Runners" on 2023-Jan-06
. And we have
seen drastic reduction in our spends since then. Not only has it saved us money,
it has also saved us Engineering time by eliminating Maintenance Efforts
. The
"Idle Runner" problem was real, and it would have affected us in different ways
had we not addressed.
High Level Analysis
- Before AutoScaling
  - our costs increased as we got closer to our release cycles because we attached more Runners
  - the increase in costs was primarily because of beefy Idle Runners
- After enabling AutoScaling
  - costs shrank drastically
  - our weekend costs touched ~$0 (as it's a break day)
  - we serviced more runs (almost triple what we did in our previous release)
  - no manual attach/detach of Runners required by Engineers
Average Daily Runner Cost (dropped by ~87% ⬇)
- Before AutoScaling: ~$63.36/day (or $1,900.8/month) ⬆⬆
- After enabling AutoScaling: ~$8.12/day (or ~$243.6/month) ⬇⬇
Average Per-PR Cost (after AutoScaling)
- We spend ~$1.2/PR for the dgraph-io/dgraph repository
NOTE: We will continue to see more savings over time (as the savings compound here).
Conclusion
To conclude, the table below shows our OKRs and how we went about solving for them. The last column is where we are today in our journey.
| OKRs | Old Setup (TeamCity) | New Setup (GitHub Actions) | New Setup (GitHub Actions w/ AutoScaling) |
| -------------------------- | ---------------------------- | ---------------------------------- | -------------------------------------------------- |
| Compute Costs | ⬆⬆ | ⬆ | ⬇ |
| Maintenance Costs | ⬆⬆ | ⬆ | ⬇ |
| Configuration Time Costs | ⬆⬆ | ⬇ | ⬇ |
Acknowledgements
On behalf of the project team, I would like to express our gratitude to all the internal contributors who have played a crucial role in bringing this project to fruition. Their dedication and hard work have been instrumental in making this idea a reality.
A special thanks goes to Kevin for his efforts in implementing and testing the initial concept. His efforts ensured that our vision could be transformed into a tangible execution.
We would also like to acknowledge the contributions of Aditya (co-author), Anurag, Dilip & Joshua. Their valuable insights, expertise, and collaboration have greatly enriched the project.