Fuzzy text matching in Python

Imagine that you’ve been given a spreadsheet recording which of your students will attend which of 9 computer classes. You just have the names and classes, but you need to know the email addresses of the students in each class so that you can contact the students in a specific class (e.g. to tell them that it is cancelled because you are going on strike). Asking for a spreadsheet containing the emails as well as the names doesn’t help.

Imagine further that you have another spreadsheet which lets you match names to emails, but that the names on the second spreadsheet aren’t exact matches with the names on the first. For example, they might omit parts of the names (John Fitzgerald Kennedy -> John Kennedy), or use variations of first names (Joseph Stalin -> Joe Stalin). If you’ve got 300+ students, you don’t want to do the name-email matching by hand.

You can do this in Python using the csv module for accessing spreadsheets and the difflib module for fuzzy matching. Both of these are in the standard library. The function you need is

difflib.get_close_matches(name, names)

where you want to find the best match for name in the list names. get_close_matches returns a list of the closest matches, best first, whose similarity score is at least a cutoff (0.6 by default; the score is the one computed by difflib's SequenceMatcher.ratio()). It returns at most 3 matches unless you pass a different n, and an empty list if nothing reaches the cutoff, so you can deal separately with the case of no close matches. In my case there were only a handful of names out of over 300 that wouldn’t match, and so far as I could see there were no false positives.
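To make this concrete, here is a minimal sketch of the whole name-to-email matching step. It is an illustration, not the exact script I used: the column headings (name, class, email), the sample rows, the example.com addresses and the helper match_emails are all made up, and the two spreadsheets are assumed to have been saved as CSV (inlined here as strings so the example runs on its own).

```python
import csv
import difflib
import io

# Hypothetical stand-ins for the two spreadsheets, exported as CSV.
# Column names and rows are invented for illustration.
CLASSES_CSV = """name,class
John Fitzgerald Kennedy,3
Joseph Stalin,7
Ada Lovelace,3
"""

EMAILS_CSV = """name,email
John Kennedy,jfk@example.com
Joe Stalin,joe@example.com
Grace Hopper,grace@example.com
"""


def match_emails(class_rows, email_rows, cutoff=0.6):
    """Pair each class-list name with the email of its closest known name.

    Returns (matched, unmatched): matched is a list of
    (name, class, email) tuples, unmatched is a list of names for
    which no candidate reached the cutoff.
    """
    email_by_name = {row["name"]: row["email"] for row in email_rows}
    known_names = list(email_by_name)

    matched, unmatched = [], []
    for row in class_rows:
        # n=1 asks for only the single best match at or above the cutoff.
        close = difflib.get_close_matches(
            row["name"], known_names, n=1, cutoff=cutoff
        )
        if close:
            matched.append((row["name"], row["class"], email_by_name[close[0]]))
        else:
            # No close match: collect these to sort out by hand.
            unmatched.append(row["name"])
    return matched, unmatched


# With real files you would pass open("classes.csv", newline="") instead
# of io.StringIO(...); the file name is hypothetical.
class_rows = list(csv.DictReader(io.StringIO(CLASSES_CSV)))
email_rows = list(csv.DictReader(io.StringIO(EMAILS_CSV)))

matched, unmatched = match_emails(class_rows, email_rows)
for name, class_id, email in matched:
    print(f"class {class_id}: {name} -> {email}")
print("no close match:", unmatched)
```

With these sample rows, the two shortened names from the example above are matched to their emails, and the one student with no counterpart in the email list ends up in unmatched. Raising cutoff makes the matching stricter (fewer false positives, more names to check by hand); lowering it does the opposite.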