Say what you mean in a regex
This is part of a series of posts I’m doing as a sort of Python/Django Advent calendar, offering a small tip or piece of information each day from the first Sunday of Advent through Christmas Eve. See the first post for an introduction.
An URL-y warning
Suppose you’re writing a blog in Django, and you get to the point where you’re setting up the URLs for the entries. Django has two ways to write URLs, depending on your preferred style:
- The path() function, which uses path-converter syntax to let you declare the types you expect things to be and derives the matching rules from that.
- The re_path() function, which uses regex syntax to describe the URL.
Here’s an example of each:
from django.urls import path, re_path

from blog import views

urlpatterns = [
    re_path(
        r"^(?P<year>\d{4})/$",
        views.EntryArchiveYear.as_view(),
        name="entries_by_year",
    ),
    path(
        "<int:year>/<int:month>/",
        views.EntryArchiveMonth.as_view(),
        name="entries_by_month",
    ),
]
But there’s a bug here. Can you spot it?
A digital extravaganza
Ever since the Python 3 transition, Python’s regex implementation, in the re module, has been Unicode-aware, and uses Unicode properties when determining whether a character fits in a particular character class. So this works:
>>> import re
>>> year_pattern = re.compile(r"^(?P<year>\d{4})/$")
>>> year_pattern.match('2020/')
<re.Match object; span=(0, 5), match='2020/'>
But, crucially, so does this:
>>> year_pattern.match('۵७੪୭/')
<re.Match object; span=(0, 5), match='۵७੪୭/'>
That sequence is U+06F5 EXTENDED ARABIC-INDIC DIGIT FIVE, U+096D DEVANAGARI DIGIT SEVEN, U+0A6A GURMUKHI DIGIT FOUR, U+0B6D ORIYA DIGIT SEVEN, in case you’re interested.
And that behavior probably isn’t what was wanted, but it is what the regex asked for: when matching against a str, the \d regex metacharacter matches any character in Unicode’s decimal digit category (Nd), which is a much larger set of things than just the ten ASCII digits. Many languages around the world have their own digit characters, after all, and Unicode recognizes them.
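To see this for yourself, here is a small sketch that inspects the four characters from the match above using the standard library's unicodedata module. The variable names are my own, and the characters are written as escapes so the snippet survives copy-and-paste.

```python
import re
import unicodedata

# The four non-ASCII digits from the example, written as escapes.
sample = "\u06f5\u096d\u0a6a\u0b6d"

for ch in sample:
    # Print the code point, general category, decimal value, and name.
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.decimal(ch),
        unicodedata.name(ch),
    )

# Each character individually satisfies \d in a str pattern...
all_match = all(re.fullmatch(r"\d", ch) for ch in sample)
# ...even though none of them is an ASCII character.
none_ascii = not any(ch.isascii() for ch in sample)
```

Every line printed should show the category Nd, which is the property \d checks for in a str pattern.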
So the correct pattern is not \d{4}, but [0-9]{4}, matching only the ten ASCII digits. This is a bug I’ve seen multiple times now in real-world codebases, sometimes lurking for years after a Python 3 migration. It can pop up anywhere you use regex, so it’s worth keeping an eye out for, and probably even actively auditing your code for if you’re feeling ambitious.
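As a quick check, the following sketch compares the original pattern with the explicit [0-9] version. It also shows one alternative that is not discussed above: the re module's re.ASCII flag, which restricts \d to ASCII digits (and also changes \w, \s and \b, so it is a blunter tool). The constant names are just for illustration.

```python
import re

# Original pattern: \d accepts any Unicode decimal digit.
UNICODE_YEAR = re.compile(r"^(?P<year>\d{4})/$")

# Explicit character class: only the ten ASCII digits.
ASCII_YEAR = re.compile(r"^(?P<year>[0-9]{4})/$")

# Alternative: keep \d but compile with re.ASCII.
# Note that this flag also narrows \w, \s and \b to ASCII.
FLAGGED_YEAR = re.compile(r"^(?P<year>\d{4})/$", re.ASCII)

ascii_url = "2020/"
non_ascii_url = "\u06f5\u096d\u0a6a\u0b6d/"

for name, pattern in [
    ("unicode", UNICODE_YEAR),
    ("explicit", ASCII_YEAR),
    ("flagged", FLAGGED_YEAR),
]:
    print(
        name,
        bool(pattern.match(ascii_url)),
        bool(pattern.match(non_ascii_url)),
    )
```

Only the first pattern should report a match for the non-ASCII input. Spelling out [0-9] has the advantage of saying exactly what you mean right where the pattern is written, without relying on a flag set somewhere else.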