Planet London Python

May 15, 2019

Peter Bengtsson

How I simulate a CDN with Nginx

Usually, a CDN is just a cache you put in front of a dynamic website. You set up the CDN to be the first server your clients get data from, the CDN quickly decides if it was a copy cached or otherwise it asks the origin server for a fresh copy. So far so good, but if you really care about squeezing that extra performance out you need to worry about having a decent TTL and as soon as you make the TTL more than a couple of minutes you need to think about cache invalidation. You also need to worry about preventing certain endpoints from ever getting caught in the CDN which could be very bad.

For this site, www.peterbe.com, I'm using KeyCDN which I've blogged out here: "I think I might put my whole site behind a CDN" and here: "KeyCDN vs. DigitalOcean Nginx". KeyCDN has an API and a python client which I've contributed to.

The next problem is; how do you test all this stuff on your laptop? Unfortunately, you can't deploy a KeyCDN docker image or something like that, that attempts to mimic how it works for reals. So, to simulate a CDN locally on my laptop, I'm using Nginx. It's definitely pretty different but it's not the point. The point is that you want something that acts as a reverse proxy. You want to make sure that stuff that's supposed to be cached gets cached, stuff that's supposed to be purged gets purged and that things that are always supposed to be dynamic is always dynamic.

The Configuration

First I add peterbecom.local into /etc/hosts like this:

▶ cat /etc/hosts | grep peterbecom.local
127.0.0.1       peterbecom.local origin.peterbecom.local
::1             peterbecom.local origin.peterbecom.local

Next, I set up the Nginx config (running on port 80) and the configuration looks like this:

proxy_cache_path /tmp/nginxcache  levels=1:2    keys_zone=STATIC:10m
    inactive=24h  max_size=1g;

server {
    server_name peterbecom.local;
    location / {
        proxy_cache_bypass $http_secret_header;
        add_header X-Cache $upstream_cache_status;
        proxy_set_header x-forwarded-host $host;
        proxy_cache STATIC;
        # proxy_cache_key $uri;
        proxy_cache_valid 200  1h;
        proxy_pass http://origin.peterbecom.local;
    }
    access_log /tmp/peterbecom.access.log combined;
    error_log /tmp/peterbecom.error.log info;
}

By the way, I've also set up origin.peterbecom.local to be run in Nginx too but it could just be proxy_pass http://localhost:8000; to go straight to Django. Not relevant for this context.

The Purge

Without the commercial version of Nginx (Plus) you can't do easy purging just for purging sake. But with proxy_cache_bypass $http_secret_header; it's very similar to purging except that it immediately makes a request to the origin.

First, to test that it works, I start up Nginx and Django and now I can run:

▶ curl -v http://peterbecom.local/about > /dev/null
< HTTP/1.1 200 OK
< Server: nginx/1.15.10
< Cache-Control: public, max-age=3672
< X-Cache: MISS
...

(Note the X-Cache: MISS which comes from add_header X-Cache $upstream_cache_status;)

This should trigger a log line in /tmp/peterbecom.access.log and in the Django runserver foreground logs.

At this point, I can kill the Django server and run it again:

▶ curl -v http://peterbecom.local/about > /dev/null
< Server: nginx/1.15.10
< HTTP/1.1 200 OK
< Cache-Control: max-age=86400
< Cache-Control: public
< X-Cache: HIT
...

Cool! It's working without Django running. As expected. This is how to send a "purge request"

▶ curl -v -H "secret-header:true" http://peterbecom.local/about > /dev/null
> GET /about HTTP/1.1
> secret-header:true
>
< HTTP/1.1 502 Bad Gateway
...

Clearly, it's trying to go to the origin, which was killed, so you start that up again and you get back to:

▶ curl -v http://peterbecom.local/about > /dev/null
< HTTP/1.1 200 OK
< Server: nginx/1.15.10
< Cache-Control: public, max-age=3672
< X-Cache: MISS
...

In Python

In my site, there are Django signals that are triggered when a piece of content changes and I'm using python-keycdn-api in production but obviously, that won't work with Nginx. So I have a local setting and my Python code looks like this:

# This function gets called by a Django `post_save` signal
# among other things such as cron jobs and management commands.

def purge_cdn_urls(urls):
    if settings.USE_NGINX_BYPASS:
        # Note! This Nginx trick will not just purge the proxy_cache, it will
        # immediately trigger a refetch.
        x_cache_headers = []
        for url in urls:
            if "://" not in url:
                url = settings.NGINX_BYPASS_BASEURL + url
            r = requests.get(url, headers={"secret-header": "true"})
            r.raise_for_status()
            x_cache_headers.append({"url": url, "x-cache": r.headers.get("x-cache")})
        print("PURGED:", x_cache_headers)
        return 

    ...the stuff that uses keycdn...

Notes and Conclusion

One important feature is that my CDN is a CNAME for www.peterbe.com but it reaches the origin server on a different URL. When my Django code needs to know the outside facing domain, I need to respect that. The communication between by the CDN and my origin is a domain I don't want to expose. What KeyCDN does is that they send an x-forwarded-host header which I need to take into account when understanding what outward facing absolute URL was used. Here's how I do that:

def get_base_url(request):
    base_url = ["http"]
    if request.is_secure():
        base_url.append("s")
    base_url.append("://")
    x_forwarded_host = request.headers.get("X-Forwarded-Host")
    if x_forwarded_host and x_forwarded_host in settings.ALLOWED_HOSTS:
        base_url.append(x_forwarded_host)
    else:
        base_url.append(request.get_host())
    return "".join(base_url)

That's about it. There are lots of other details I glossed over but the point is that this works good enough to test that the cache invalidation works as expected.

May 15, 2019 06:08 PM