I constantly find myself in situations, like developing the Music Suggestions Ninja, where I need the user to enter a string and then find the best match for that string in a list of strings. To compensate for misspellings I've found the fuzzywuzzy package very helpful. But what about speed? It's slow to do that over 100,000+ words.
So given that I have a list of band names, how on earth do I search for one? I want to allow for misspellings because I want to always at least try to find something. This turns out to be pretty easy. There is a reliable and fast algorithm for computing the Levenshtein distance, which essentially tells you the edit distance between two words.
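As a quick illustration (using the python-Levenshtein package, which gets installed as a dependency later in this post), the edit distance is just the number of single-character insertions, deletions, or substitutions needed to turn one word into the other:

```python
# Quick illustration of edit distance with the python-Levenshtein package
import Levenshtein

# "pellow" -> "pillow" needs a single substitution (e -> i), so the distance is 1
print(Levenshtein.distance("pellow", "pillow"))            # 1

# More distant strings need more edits
print(Levenshtein.distance("Van Morrison", "Van Halen"))   # 7
```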
There are problems with just using edit distance, though, as the edit distance can be quite misleading if the words are merely rearranged. However, someone smart has already thought of this and fixed it! The solution is to use the Python package
fuzzywuzzy, which can compute Levenshtein distances and account for all sorts of stuff, like rearrangements.
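As a small sketch of what that buys you (the strings here are just my own examples), fuzzywuzzy's token_sort_ratio ignores word order, whereas the plain edit-distance ratio gets badly thrown off by it:

```python
from fuzzywuzzy import fuzz

# A plain edit-distance ratio penalizes rearranged words heavily...
print(fuzz.ratio("the beatles", "beatles, the"))             # noticeably lower score

# ...while the token-sorted version sorts the words first and scores them as a match
print(fuzz.token_sort_ratio("the beatles", "beatles, the"))  # 100
```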
I like to use two metrics, as I found they work better than one, so my string matching code is simply:
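The snippet boils down to something like the sketch below. To be upfront about assumptions: the exact pair of metrics isn't spelled out above, so here I take one score from fuzzywuzzy (0-100) and one from Python's built-in difflib.SequenceMatcher (0-1), which lines up with the shape of the result tuples shown later; match_score is just my name for it.

```python
from difflib import SequenceMatcher

from fuzzywuzzy import fuzz


def match_score(query, candidate):
    """Return (candidate, fuzzywuzzy score 0-100, SequenceMatcher ratio 0-1)."""
    fuzzy_score = fuzz.token_sort_ratio(query, candidate)
    seq_ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    return (candidate, fuzzy_score, seq_ratio)
```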
Now, how to get the best match? Here's another problem: there can be a lot of band names, so it's very slow to use a single loop. Here's the solution: use more computer power! We can use the multiprocessing library to efficiently go through the band names, such as:
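Something along these lines (a sketch that reuses the match_score function above; best_matches and band_names are my own names):

```python
import multiprocessing
from functools import partial


def best_matches(query, band_names, top_n=3):
    """Score every band name against the query in parallel and return the top few."""
    with multiprocessing.Pool() as pool:
        scored = pool.map(partial(match_score, query), band_names)
    # Sort by the fuzzywuzzy score first, then break ties with the SequenceMatcher ratio
    scored.sort(key=lambda t: (t[1], t[2]), reverse=True)
    return scored[:top_n]
```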
Try it out! Combine the two and test it with a tiny list of band names:
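Putting the two sketches together (the if __name__ guard matters on platforms that spawn worker processes rather than fork them; your exact scores will also depend on which fuzzywuzzy scorer you picked, so don't expect the numbers below to match digit for digit):

```python
band_names = ["Van Morrison", "Jim Morrison", "Van Halen", "The Doors", "Them"]

if __name__ == "__main__":
    for result in best_matches("Van Morrison", band_names):
        print(result)
```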
So if you search for “Van Morrison” you’ll get a top result of
('Van Morrison', 100, 1.0) followed by
('Jim Morrison', 80, 0.75), and
('Van Halen', 50, 0.47619047619047616) as the respective scores (the second and third items in each tuple) decrease. It works pretty well! I'm happy with it at least - it's fast, efficient, precise, and scalable.
Faster string matching
To solve this, one can use a dictionary of sliced strings - basically a hash table of string doublets. For example, a word like "pillow" is composed of the doublets
['pi','il','ll','lo','ow']. To make the hash table, each doublet is then put into a dictionary where each entry maps back to the word, "pillow".
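A rough sketch of building that hash table (my reconstruction, not necessarily the exact code in the repo):

```python
from collections import defaultdict


def doublets(word):
    """Adjacent character pairs, e.g. 'pillow' -> ['pi', 'il', 'll', 'lo', 'ow']."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_doublet_index(words):
    """Map each doublet to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        for pair in doublets(word.lower()):
            index[pair].add(word)
    return index
```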
To use this hash table, you take an input string, find its string doublets, and then gather a new list of "most likely" strings from the values stored under those string doublet keys. Searching through those is very likely only a small fraction of the entire list, so it takes a small fraction of the time! I also use a multiprocessing pool to speed up this part of the search.
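The lookup side then looks roughly like this, reusing doublets and match_score from the sketches above - gather only the words that share at least one doublet with the query, and score just those:

```python
def candidate_words(query, index):
    """Union of all words that share at least one doublet with the query."""
    candidates = set()
    for pair in doublets(query.lower()):
        candidates.update(index.get(pair, ()))
    return list(candidates)


def best_hashed_matches(query, index, top_n=3):
    """Score only the doublet-sharing candidates instead of the whole wordlist."""
    candidates = candidate_words(query, index)
    with multiprocessing.Pool() as pool:
        scored = pool.map(partial(match_score, query), candidates)
    scored.sort(key=lambda t: (t[1], t[2]), reverse=True)
    return scored[:top_n]
```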
My code for trying this out is on GitHub.
Try it out!
Download the code here and try it out for yourself.
First, download a wordlist. The example I use below uses this one, which you can download from the terminal with:
You can use whatever wordlist you want; just make sure you change the
WORDLIST_FILE variable in
setup.py (and if your file has a different format, change how it loads in the words).
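If your wordlist is just one word per line, the loading step can be as simple as the following (WORDLIST_FILE is the variable from the repo's setup.py; the rest is a sketch):

```python
# One word per line; adjust this if your wordlist uses a different format
with open(WORDLIST_FILE) as f:
    words = [line.strip() for line in f if line.strip()]
```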
Then install the dependencies using
pip install fuzzywuzzy python-Levenshtein
After you've finished that setup, run
setup.py to generate the pickle files.
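I won't reproduce the repo's setup.py here, but the pickling step is presumably along these lines - dump the wordlist and the doublet index to disk so later runs can load them instead of rebuilding (the file names here are placeholders, not necessarily the ones the repo uses):

```python
import pickle

# Build the doublet index once (see the sketch above) and cache both pieces to disk
index = build_doublet_index(words)

with open("doublet_index.pickle", "wb") as f:
    pickle.dump(index, f)
with open("words.pickle", "wb") as f:
    pickle.dump(words, f)
```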
Then you can test the speed with
python test_string_matching WORD
For example, searching for “pellow”:
Of course the fastest is a direct lookup, but in this case "pillow" wasn't found in the wordlist (it is actually "Pillow" in the list). Matching through the entire list takes 1.5 seconds, while the hashed version only takes 111 milliseconds! Way better. The results from both are similar as well, with the runner-up results making sense (billow, willow, pill).
What happens if you misspell a word?
Both the full-list method and the hash method got the right result, but again the hashed version was 14x faster!
To use it yourself, simply make your pickle files as done in the
setup.py file and then import
string_matching.py into your project. Eventually I could make this an actual Python package, but I don't have the expertise for that yet.