Planet London Python

September 12, 2019

Peter Bengtsson

Fastest Python function to slugify a string

In MDN I noticed a function that turns a piece of text (Python 2 unicode) into a slug. It looks like this:

    non_url_safe = ['"', '#', '$', '%', '&', '+',
                    ',', '/', ':', ';', '=', '?',
                    '@', '[', '\\', ']', '^', '`',
                    '{', '|', '}', '~', "'"]

    def slugify(self, text):
        """
        Turn the text content of a header into a slug for use in an ID
        """
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        # Strip leading, trailing and multiple whitespace, convert remaining whitespace to _
        text = u'_'.join(text.split())
        return text

The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.

I couldn't help but to react to the fact that it's a list and it's looped over every single time. Twice, in a sense. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.

The candidates

translate_table = {ord(char): u'' for char in non_url_safe}
non_url_safe_regex = re.compile(
    r'[{}]'.format(''.join(re.escape(x) for x in non_url_safe)))


def _slugify1(self, text):
    non_safe = [c for c in text if c in self.non_url_safe]
    if non_safe:
        for c in non_safe:
            text = text.replace(c, '')
    text = u'_'.join(text.split())
    return text

def _slugify2(self, text):
    text = text.translate(self.translate_table)
    text = u'_'.join(text.split())
    return text

def _slugify3(self, text):
    text = self.non_url_safe_regex.sub('', text).strip()
    text = u'_'.join(re.split(r'\s+', text))
    return text

I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.

The results

The slowest is fast enough. But if you're still reading, here are the results:

_slugify1 0.101ms
_slugify2 0.019ms
_slugify3 0.033ms

So using a translate table is 5 times faster. And a regex 3 times faster. But they're all sufficiently fast.

Conclusion

This is the least of your problems in a world of real I/O such as databases and other genuinely CPU intense stuff. Well, it was fun little side-trip.

Also, aren't there better solutions that just blacklist all control characters?

September 12, 2019 08:20 PM