MAY 15 2025

How we built Model Router

We built Model Router to simplify switching between model providers. Here's the behind-the-scenes story of how it works, why we wrote it in Go, and what’s coming next.

Kevin Mingtarja
Senior Engineer, Hypermode

We recently launched Model Router, a unified API that lets you swap between commercial and open-source models without changing your app logic. In this post, we dive into the behind-the-scenes technical story of why we built Model Router from scratch in Go, how we designed it for flexibility, and what’s next on the roadmap.

The problem with too many APIs

Commercial inference providers typically expose different APIs: different request and response body formats, different ways to pass in API keys, and so on. That makes it hard to try out, switch between, or compare models across providers, because every change means modifying your client code.

It’s a different story again for open source models. Unlike commercial models, they aren’t immediately consumable through an API. You need to choose an inference serving engine (vLLM, SGLang, TensorRT-LLM, etc.). Some have their own native APIs and some implement an existing one.

One trend that has held, though, is that inference providers big and small are choosing to offer OpenAI-compatible APIs on top of their native APIs, or even as their only API. DeepSeek, for example, only implements the OpenAI API.

To move quickly, we scoped Model Router to OpenAI-compatible APIs. We started with two endpoints:

  • /chat/completions
  • /embeddings

That gave us broad coverage across reasoning and embedding models, commercial and open source. It also made it easy for developers to switch providers without rewriting app logic.
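
To make that concrete, here's a minimal sketch of a client call in Go. The base URL and endpoint match what's described below; the model name, message payload, and API key placeholder are illustrative assumptions, not exact supported values.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Switching providers is just a matter of changing the "model" field;
	// the endpoint, auth header, and request shape stay the same.
	// The model name here is an example, not a specific supported value.
	body := []byte(`{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}`)

	req, err := http.NewRequest(http.MethodPost,
		"https://models.hypermode.host/chat/completions",
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <YOUR_HYPERMODE_API_KEY>")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}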

Of course, they call it “OpenAI-compatible” for a reason. There are some gotchas here and there: every provider has edge cases, especially around more advanced functionality such as tool calling. We'll continue to normalize those over time.

How Model Router works

Model Router works a lot like your usual HTTP reverse proxy. Here's what happens when a request comes in:

  1. Geographic routing: a request hits https://models.hypermode.host/ and is routed to the cluster nearest to your geographic location
  2. Authentication and rate limiting: your Hypermode API key is authenticated and checked against rate limits
  3. Request inspection and rewriting: we parse the request body and determine:
    • Which upstream provider to route to
    • Which API key to use
    • How to modify the request URL, headers, and body
  4. Proxying the request: we forward the transformed request and stream the response back (a simplified sketch of this pipeline follows)
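
Conceptually, the pipeline is a chain of ordinary HTTP middleware. Here's a simplified sketch of that shape; every name in it (withAuth, withRateLimit, the stub helpers) is ours for illustration, not Model Router's actual code.

package main

import "net/http"

func withAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Reject requests whose API key doesn't check out.
		if !validAPIKey(r.Header.Get("Authorization")) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func withRateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Enforce per-key rate limits before doing any real work.
		if !allow(r.Header.Get("Authorization")) {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// Trivial stubs standing in for real key validation and rate limiting.
func validAPIKey(auth string) bool { return auth != "" }
func allow(key string) bool        { return true }

func main() {
	proxy := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// In Model Router, this is the rewriting reverse proxy
		// described in the next section.
	})
	http.ListenAndServe(":8080", withAuth(withRateLimit(proxy)))
}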

Determining where to route a request is where things get interesting, and it can get arbitrarily complex. Today, we make the decision based solely on the model name in the request body. But we’re exploring more dynamic approaches, such as splitting traffic between multiple models to route simpler prompts to cheaper ones, or A/B testing new models.
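
As a sketch, that decision can be as simple as a lookup keyed by model name. The hosts and paths below match the examples later in this post; the model names and the table itself are illustrative assumptions.

package main

import "fmt"

// providerConfig describes where a model's requests should go.
type providerConfig struct {
	Host string // e.g. api.openai.com
	Path string // e.g. /v1/chat/completions
}

// routes is an illustrative routing table keyed by model name.
var routes = map[string]providerConfig{
	"gpt-4o":           {Host: "api.openai.com", Path: "/v1/chat/completions"},
	"gemini-2.0-flash": {Host: "generativelanguage.googleapis.com", Path: "/v1beta/openai/chat/completions"},
}

func main() {
	if cfg, ok := routes["gpt-4o"]; ok {
		fmt.Println("route to", cfg.Host+cfg.Path)
	}
}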

Why we didn't use Envoy

Initially, we looked at Envoy AI Gateway, since we already use Envoy with Istio. It has some very compelling features like AI/LLM specific metrics and token-based rate limiting.

But it’s early days, and the more complex functionality isn’t available out of the box. Extending it with custom logic would mean writing Envoy plugins, something we hadn’t done before. That would’ve introduced more complexity and unknown unknowns.

So we wrote it ourselves in Go.

An HTTP reverse proxy in Go

At its core, Model Router is a custom HTTP reverse proxy written in Go. It listens for incoming requests, rewrites them based on the target model, and forwards them to the appropriate provider.

Here's what happens when a request comes in:

  1. Parse the request body: we decode the body to read the specified model, which determines the provider to route to
  2. Transform the request: based on the target model, we rewrite:
    • The URL host to forward the request to, e.g. api.openai.com or generativelanguage.googleapis.com
    • The URL path, e.g. /v1/chat/completions or /v1beta/openai/chat/completions
    • The Host header (or :authority for HTTP/2), to match the destination
    • The Authorization header, using the correct API key for the respective provider
  3. Proxy the request: once transformed, the request is forwarded and the response is streamed back

Go’s net/http package provides a lot of useful building blocks for networking (one of the things we love most about Go). httputil.ReverseProxy is particularly helpful: it lets you plug in a Director or Rewrite function to modify requests before they’re sent upstream. This is where we put the logic to transform incoming requests.

// ReverseProxy is an HTTP Handler that takes an incoming request and
// sends it to another server, proxying the response back to the
// client.
// ...
type ReverseProxy struct {
	// Rewrite must be a function which modifies
	// the request into a new request to be sent
	// using Transport.
	// ...
	// At most one of Rewrite or Director may be set.
	Rewrite func(*ProxyRequest)

	// Director is a function which modifies
	// the request into a new request to be sent
	// using Transport.
	// ...
	// At most one of Rewrite or Director may be set.
	Director func(*http.Request)

	// The transport used to perform proxy requests.
	// If nil, http.DefaultTransport is used.
	Transport http.RoundTripper

	// ...
}

func (p *ReverseProxy) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
	// ...
}
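
Putting the three steps together, here's a rough sketch of what a Rewrite function along these lines could look like. The routing table, API key handling, and error paths are simplified assumptions for illustration, not our production code.

package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// upstream pairs a destination with its API key. Values are illustrative.
type upstream struct {
	host, path, apiKey string
}

// A simplified routing table, like the one sketched earlier.
var modelToUpstream = map[string]upstream{
	"gpt-4o": {host: "api.openai.com", path: "/v1/chat/completions", apiKey: "sk-..."},
}

func rewrite(pr *httputil.ProxyRequest) {
	// 1. Parse the body to find the requested model, then restore it
	// so it can still be forwarded upstream.
	body, _ := io.ReadAll(pr.Out.Body)
	pr.Out.Body = io.NopCloser(bytes.NewReader(body))

	var payload struct {
		Model string `json:"model"`
	}
	_ = json.Unmarshal(body, &payload)

	up, ok := modelToUpstream[payload.Model]
	if !ok {
		return // in practice, reject unknown models before proxying
	}

	// 2. Rewrite the URL host and path, the Host header, and Authorization.
	pr.SetURL(&url.URL{Scheme: "https", Host: up.host})
	pr.Out.URL.Path = up.path
	pr.Out.Host = up.host
	pr.Out.Header.Set("Authorization", "Bearer "+up.apiKey)
}

func main() {
	proxy := &httputil.ReverseProxy{Rewrite: rewrite}
	// 3. ServeHTTP forwards the transformed request and streams
	// the response back to the client.
	http.ListenAndServe(":8080", proxy)
}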

You can create your own Transport if you need fine-grained control over connection behavior, like configuring HTTP/1 or HTTP/2 parameters such as MaxIdleConns, ReadIdleTimeout, and PingTimeout.
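
For instance, here's roughly how those knobs can be set. Note that ReadIdleTimeout and PingTimeout live on the HTTP/2 transport in golang.org/x/net/http2, not on net/http's Transport; the specific values below are arbitrary examples.

package main

import (
	"net/http"
	"net/http/httputil"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t := &http.Transport{
		MaxIdleConns:        100, // illustrative values throughout
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}

	// Enable HTTP/2 on the transport and tune its health-check pings.
	h2, err := http2.ConfigureTransports(t)
	if err != nil {
		panic(err)
	}
	h2.ReadIdleTimeout = 30 * time.Second // ping if the conn is idle this long
	h2.PingTimeout = 15 * time.Second     // close the conn if the ping isn't acked

	proxy := &httputil.ReverseProxy{
		Rewrite:   func(pr *httputil.ProxyRequest) { /* rewrite logic as above */ },
		Transport: t,
	}
	http.ListenAndServe(":8080", proxy)
}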

After that, simply call ServeHTTP and the ReverseProxy takes care of the rest!

What's next

Today, routing is based on the model field in the request body. But the architecture is built to evolve.

Coming soon:

  • Observability: analyze logs, latency, and usage metrics directly in the Hypermode console
  • Uptime optimization with multi-provider routing: route requests for the same model to different providers (where available) to improve availability and keep performance consistent
  • Prompt caching: reduce cost and latency for repetitive tasks. Most providers handle caching automatically, but some, like Anthropic, require manual injection, which Model Router handles for you

Try it out

Model Router is live and free to use (within fair use limits) through May 31.

Get started