Fuzzy text matching in Python
Imagine that you’ve been given a spreadsheet recording which of your students will attend which of 9 computer classes. You just have the names and classes, but you need to know the email addresses of the students in each class so that you can contact the students in a specific class (e.g. to tell them that it is cancelled because you are going on strike). Asking for a spreadsheet containing the emails as well as the names doesn’t help.
Imagine further that you have another spreadsheet which lets you match names to emails, but that the names on the second spreadsheet aren’t exact matches with the names on the first. For example, they might omit parts of the names (John Fitzgerald Kennedy -> John Kennedy), or use variations of first names (Joseph Stalin -> Joe Stalin). If you’ve got 300+ students, you don’t want to do the name-email matching by hand.
You can do this in Python using the csv module for accessing spreadsheets and the difflib module for fuzzy matching. Both of these are in the standard library. The function you need is difflib.get_close_matches(name, names), where you want to find the best match for name in the list names.
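As a minimal sketch of that call, here is how it behaves on the two example names from above. The roster list and the best_match helper are made up for illustration; n and cutoff are real keyword arguments of get_close_matches.

```python
import difflib

# Hypothetical names as they might appear on the second (email) spreadsheet
roster = ["John Kennedy", "Joe Stalin"]


def best_match(name, names, cutoff=0.6):
    """Return the single closest name in `names`, or None if nothing is close enough."""
    # n=1 asks for at most one result; the returned list is sorted best-first
    matches = difflib.get_close_matches(name, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(best_match("John Fitzgerald Kennedy", roster))  # John Kennedy
print(best_match("Joseph Stalin", roster))            # Joe Stalin
print(best_match("Zzz Qqq", roster))                  # None
```

Returning None for the no-match case makes it easy to collect the leftovers and deal with them by hand.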
get_close_matches returns a list of the closest matches above a threshold (0.6 by default; the score is the similarity ratio computed by difflib's SequenceMatcher), so you can deal separately with the case of no close matches. In my case there were only a handful of names out of over 300 that wouldn’t match, and so far as I could see there were no false positives.
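Putting the two modules together, the whole job might look roughly like the sketch below. The column headers (name, class, email), the function names and the sample rows are all assumptions for the sake of the example, and the two spreadsheets are stood in for by small in-memory CSV strings so the snippet runs on its own. With real files you would pass open("classes.csv", newline="") and so on instead.

```python
import csv
import difflib
import io


def load_emails(csv_file):
    """Read rows with assumed 'name' and 'email' columns into a name -> email dict."""
    return {row["name"]: row["email"] for row in csv.DictReader(csv_file)}


def match_emails(class_file, emails, cutoff=0.6):
    """Attach an email to each (name, class) row via the closest roster name.

    Returns (matched, unmatched): matched holds tuples of
    (original name, class, matched roster name, email); unmatched holds
    the names that had no close match and need checking by hand.
    """
    roster_names = list(emails)
    matched, unmatched = [], []
    for row in csv.DictReader(class_file):
        close = difflib.get_close_matches(row["name"], roster_names, n=1, cutoff=cutoff)
        if close:
            matched.append((row["name"], row["class"], close[0], emails[close[0]]))
        else:
            unmatched.append(row["name"])
    return matched, unmatched


# Tiny in-memory stand-ins for the two spreadsheets (made-up data)
classes_csv = io.StringIO(
    "name,class\n"
    "John Fitzgerald Kennedy,3\n"
    "Joseph Stalin,7\n"
    "Zzz Qqq,1\n"
)
emails_csv = io.StringIO(
    "name,email\n"
    "John Kennedy,jfk@example.com\n"
    "Joe Stalin,joe@example.com\n"
)

emails = load_emails(emails_csv)
matched, unmatched = match_emails(classes_csv, emails)

for original, class_id, roster_name, email in matched:
    print(f"class {class_id}: {original} -> {roster_name} <{email}>")
print("no close match:", unmatched)
```

Keeping the matched roster name next to the original makes it quick to eyeball the output for false positives, and raising cutoff makes the matching stricter if any turn up.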