There are situations where you want to take the user’s input and match a primary key in a database. But, immediately a problem is introduced: what happens if the user spells the primary key incorrectly? This fuzzy string matching program solves this problem - it takes any string, misspelled or not, and matches to one a specified key list.
Previously I wrote about a method for fast fuzzy string matching using Python. It worked well, but had one problem. It didn’t scale as much as I’d like and it wasn’t very fast (but pretty fast). I found solutions like agrep and tre-agrep there were GNU and better than my program. But I wanted something even faster - and thats where I built
goagrep requires building a precomputed database from the file that has the target strings. Then, when querying,
goagrep splits the search string into smaller subsets, and then finds the corresponding known target strings that contain each subset. It then runs Levenshtein’s algorithm on the new list of target strings to find the best match to the search string. This greatly decreases the search space and thus increases the matching speed.
The subset length dictates how many pieces a word should be cut into, for purposes of finding partial matches for mispelled words. For instance example: a subset length of 3 for the word “olive” would index “oli”, “liv”, and “ive”. This way, if one searched for “oliv” you could still return “olive” since subsets “oli” and “liv” can still grab the whole word and check its Levenshtein distance (which should be very close as its only missing the one letter).
A smaller subset length will be more forgiving (it allows more mispellings), thus more accurate, but it would require more disk and more time to process since there are more words for each subset. A bigger subset length will help save hard drive space and decrease the runtime since there are fewer words that have the same, longer, subset. You can get much faster speeds with longer subset lengths, but keep in mind that this will not be able to match strings that have an error in the middle of the string and are have a length < 2*subset length - 1.
goagrepcan search much longer strings:
agrepis limited to 32 characters while
goagrepis only limited to 500.
tre-agrepis not limited, but is much slower.
goagrepcan handle many mistakes in a string:
agrepis limited to edit distances of 8, while
goagrephas no limit.
goagrepis fast (see benchmarking below), and the speed can be tuned: You can set higher subset lengths to get faster speeds and less accuracy - leaving the tradeoff up to you
Benchmarking using the 319,378 word dictionary (3.5 MB), run with
perf stat -r 50 -d <CMD> using Intel(R) Core(TM) i5-4310U CPU @ 2.00GHz. These benchmarks were run with a single word, and can flucuate ~50% depending on the word.
|goagrep||3 ms||69 MB (subset size = 5)|
|goagrep||7 ms||66 MB (subset size = 4)|
|goagrep||84 ms||58 MB (subset size = 3)|
|agrep||53 ms||3.5 MB (original file)|
|tre-agrep||1,178 ms||3.5 MB (original file)|
First compile a list of your phrases or words that you want to match (see
testlist). Then you can build a
goagrep database using:
And then you can match any of the words using:
which returns the best match and the levenshtein score.
You can test with a big list of words from Univ. Michigan:
You can also use as a library. Here’s an example program (see in