There are situations where you want to take a user’s input and match it against a primary key in a database. But this immediately introduces a problem: what happens if the user spells the key incorrectly? This fuzzy string matching program solves that problem - it takes any string, misspelled or not, and matches it to one of the keys in a specified list.
Previously I wrote about a method for fast fuzzy string matching using Python. It worked well, but it had one problem: it didn’t scale as well as I’d like, and while it was reasonably fast, it wasn’t very fast. I found existing solutions, including GNU tools, that were better than my program, but I wanted something even faster - and that’s why I built goagrep.
goagrep requires building a precomputed database from the file containing the target strings. Then, when querying, it splits the search string into smaller subsets and looks up the known target strings that contain each subset. It then runs the Levenshtein algorithm on that reduced list of target strings to find the best match to the search string. This greatly decreases the search space and thus increases the matching speed.
The subset length dictates the size of the pieces a word is cut into, for the purpose of finding partial matches for misspelled words. For example: a subset length of 3 for the word “olive” would index “oli”, “liv”, and “ive”. This way, if one searched for “oliv” you could still return “olive”, since the subsets “oli” and “liv” can still grab the whole word and check its Levenshtein distance (which should be very close, as it is only missing one letter).
A smaller subset length is more forgiving (it allows more misspellings) and thus more accurate, but it requires more disk space and more processing time, since there are more words for each subset. A bigger subset length saves hard drive space and decreases the runtime, since fewer words share the same, longer subset. You can get much faster speeds with longer subset lengths, but keep in mind that this will not be able to match strings that have an error in the middle and a length < 2*subset length - 1.
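To make the idea concrete, here is a minimal, self-contained Go sketch of the subset-indexing approach described above. It is an illustration of the technique, not goagrep’s actual source: index every length-n substring of each target word, then run Levenshtein only against the candidates that share a subset with the query.

```go
package main

import "fmt"

// subsets returns every substring of length n, e.g. "olive", 3 -> ["oli" "liv" "ive"].
func subsets(word string, n int) []string {
	if len(word) <= n {
		return []string{word}
	}
	out := make([]string, 0, len(word)-n+1)
	for i := 0; i+n <= len(word); i++ {
		out = append(out, word[i:i+n])
	}
	return out
}

// levenshtein computes the edit distance between two strings using two rolling rows.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min(prev[j]+1, min(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func min(x, y int) int {
	if x < y {
		return x
	}
	return y
}

func main() {
	targets := []string{"olive", "oliver", "liver", "banana"}
	n := 3 // subset length

	// "Build" step: map each subset to the target words that contain it.
	index := map[string][]string{}
	for _, t := range targets {
		for _, s := range subsets(t, n) {
			index[s] = append(index[s], t)
		}
	}

	// "Match" step: collect candidates that share a subset with the query,
	// then run Levenshtein only on that reduced list.
	query := "oliv"
	best, bestScore := "", 1<<31-1
	seen := map[string]bool{}
	for _, s := range subsets(query, n) {
		for _, cand := range index[s] {
			if seen[cand] {
				continue
			}
			seen[cand] = true
			if d := levenshtein(query, cand); d < bestScore {
				best, bestScore = cand, d
			}
		}
	}
	fmt.Println(best, bestScore) // olive 1
}
```

Note how “banana” is never even considered: it shares no subset with “oliv”, which is exactly where the speedup comes from.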
Why use goagrep?
For most applications, agrep is really a comparable choice: it does not require building a database beforehand, and its speed is comparable to goagrep’s. However, there are situations where goagrep is more useful:
- goagrep can search much longer strings: agrep is limited to 32 characters, while goagrep is only limited to 500. tre-agrep is not limited, but it is much slower.
- goagrep can handle many mistakes in a string: agrep is limited to edit distances of 8, while goagrep has no limit.
- goagrep is fast (see the benchmarking below), and the speed can be tuned: you can set higher subset lengths to get faster speeds at the cost of accuracy - leaving the tradeoff up to you.
Benchmarking was done using a big list of words (3.5 MB), run with `perf stat -r 50 -d <CMD>` on an Intel(R) Core(TM) i5-4310U CPU @ 2.00GHz. These benchmarks were run with a single word, and can fluctuate ~50% depending on the word.
| Program | Speed | Disk usage |
|---|---|---|
| goagrep | 3 ms | 69 MB (subset size = 5) |
| goagrep | 7 ms | 66 MB (subset size = 4) |
| goagrep | 84 ms | 58 MB (subset size = 3) |
| agrep | 53 ms | 3.5 MB (original file) |
| tre-agrep | 1,178 ms | 3.5 MB (original file) |
Installation is really easy if you have Golang. If you don’t have Golang, you can easily install it. Or, if you just want to download the goagrep binaries, they are available for download.
First compile a list of the phrases or words that you want to match (see testlist). Then you can build a database using:
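(A sketch of the build step; the flag names below are assumptions for illustration, not the documented interface - check goagrep’s own help output for the real flags.)

```bash
# Build the precomputed subset database from the word list.
# Flag names here are hypothetical.
goagrep build -w testlist -d words.db -s 3
```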
And then you can match any of the words using:
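(Again, the flag names are hypothetical and shown only for illustration.)

```bash
# Query the database for the closest entry to a possibly misspelled word.
goagrep match -d words.db -w "oliv"
```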
This returns the best match and its Levenshtein score.
You can test it with a big list of words from the University of Michigan.
You can also use goagrep as a library. Here’s an example program:
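(A rough sketch of what library usage might look like; the import path and function names below are assumptions rather than the documented API, so consult the package documentation for the real calls.)

```go
package main

import (
	"fmt"

	// Hypothetical import path; point this at wherever the goagrep package lives.
	"github.com/schollz/goagrep/goagrep"
)

func main() {
	// Build the subset database from a word list (function name and signature assumed).
	goagrep.GenerateDB("testlist", "words.db", 3, false)

	// Look up the closest match and its Levenshtein score (also assumed).
	word, score := goagrep.GetMatch("oliv", "words.db")
	fmt.Println(word, score)
}
```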