Planet London Python

September 30, 2019

Peter Bengtsson

Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON

Last week, I blogged about "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Judging from a lot of comments, people misinterpreted this. (By the way, Redis is persistent). It's no surprise that Redis is faster.

However, it's a fact that I have do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.

But is it entirely wrong to look at WHICH database to get the best speed?

Reviewing this corner of Song Search helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? Nginx

Nginx??

The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:

const response = await fetch('https://songsear.ch/api/song/5246889/amazon');

Internally, in the app, what it does is that it looks this up, by ID, on the AmazonAffiliateLookup ORM model. Suppose it wasn't there in the PostgreSQL, it uses the Amazon Affiliate Product Details API, to look it up and when the results come in it stores a copy of this in PostgreSQL so we can re-use this URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code is shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller. Partly because it limits the results to the first/best match and it deletes a bunch of things that are never needed such as ProductTypeName, Studio, TrackSequence etc. The proportion is roughly 23x. I.e. of the 3GB of JSON blobs stored in PostgreSQL only 130MB is ever transported from the server to the users.

Again, Nginx?

Nginx has a built in reverse HTTP proxy cache which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle of how much RAM this it's eating up. Well, if the total possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable to let Nginx handle cache in RAM.

Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:

▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes

Summary:
  Total:    0.9882 secs
  Slowest:  0.0279 secs
  Fastest:  0.0001 secs
  Average:  0.0010 secs
  Requests/sec: 10119.8265


Response time histogram:
  0.000 [1] |
  0.003 [9752]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [108]   |
  0.008 [70]    |
  0.011 [32]    |
  0.014 [8] |
  0.017 [12]    |
  0.020 [11]    |
  0.022 [1] |
  0.025 [4] |
  0.028 [1] |


Latency distribution:
  10% in 0.0003 secs
  25% in 0.0006 secs
  50% in 0.0008 secs
  75% in 0.0010 secs
  90% in 0.0013 secs
  95% in 0.0016 secs
  99% in 0.0068 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0001 secs, 0.0279 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0026 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0011 secs
  resp wait:    0.0008 secs, 0.0001 secs, 0.0206 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0013 secs

Status code distribution:
  [200] 10000 responses

10,000 requests across 10 clients at rougly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization etc. The cache TTL is controlled by simply setting a Cache-Control HTTP header with something like max-age=86400.

Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long Cache-Control, Nginx can hold on to this in RAM for days or until the whole server is restarted (which is rare).

Conclusion

If you the total amount of data that can and will be cached is controlled, putting it in a HTTP reverse proxy cache is probably order of magnitude faster than messing with chosing which database to use.

September 30, 2019 03:06 AM

September 28, 2019

Peter Bengtsson

How much faster is Redis at storing a blob of JSON compared to PostgreSQL?

tl;dr; Redis is 16 times faster and reading these JSON blobs.*

In Song Search when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious it's earning me lower double-digit dollars per month). To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store that result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?

Example view of affiliate links

The other caveat is that you can't store these lookups locally too long since prices change and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:

class AmazonAffiliateLookup(models.Model, TotalCountMixin):
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    matches = JSONField(null=True)
    search_index = models.CharField(max_length=100, null=True)
    lookup_seconds = models.FloatField(null=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    modified = models.DateTimeField(auto_now=True)

At the moment this database table is 3GB on disk.

Then, I thought, why not use Redis for this. Then I can use Redis's "natural" expiration by simply setting as expiry time when I store it and then I don't have to worry about cleaning up old stuff at all.

The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "TIMEOUT": config("CACHE_TIMEOUT", 500),
        "KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
        "OPTIONS": {
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        },
    }
}

The speed difference

Perhaps unrealistic but I'm doing all this testing here on my MacBook Pro. The connection to Postgres (version 11.4) and Redis (3.2.1) are both on localhost.

Reads

The reads are the most important because hopefully, they happen 10x more than writes as several people can benefit from previous saves.

I changed my code so that it would do a read from both databases and if it was found in both, write down their time in a log file which I'll later summarize. Results are as follows:

PG:
median: 8.66ms
mean  : 11.18ms
stdev : 19.48ms

Redis:
median: 0.53ms
mean  : 0.84ms
stdev : 2.26ms

(310 measurements)

It means, when focussing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.

Writes

The writes are less important but due to the synchronous nature of my Django, the unlucky user who triggers a look up that I didn't have, will have to wait for the write before the XHR request can be completed. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:

PG:
median: 8.59ms
mean  : 8.58ms
stdev : 6.78ms

Redis:
median: 0.44ms
mean  : 0.49ms
stdev : 0.27ms

(137 measurements)

It means, when focussing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.

Conclusion and discussion

First of all, I'm still a PostgreSQL fan-boy and have no intention of ceasing that. These times are made up of much more than just the individual databases. For example, the PostgreSQL speeds depend on the Django ORM code that makes the SQL and sends the query and then turns it into the model instance. I don't know what the proportions are between that and the actual bytes-from-PG's-disk times. But I'm not sure I care either. The tooling around the database is inevitable mostly and it's what matters to users.

Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.

I'm currently using Django's cache framework, with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that if I really have to I can just flush the cache and although detrimental to performance (temporarily) it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums of this whole blog post; of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how RAM I can afford myself for this piece and whether it's worth it.

UPDATE September 29, 2019

I experimented with an optimization of NOT turning the Django ORM query into a model instance for each record. Instead, I did this:

+from dataclasses import dataclass


+@dataclass
+class _Lookup:
+    modified: datetime.datetime
+    matches: list

...

+base_qs = base_qs.values_list("modified", "matches")
-lookup = base_qs.get(song__id=song_id)
+lookup_tuple = base_qs.get(song__id=song_id)
+lookup = _Lookup(*lookup_tuple)

print(lookup.modified)

Basically, let the SQL driver's "raw Python" content come through the Django ORM. The old difference between PostgreSQL and Redis was 16x. The new difference was 14x instead.

September 28, 2019 03:50 PM