batch_jaro_winkler

Fast batch jaro winkler distance implementation in C99 with Ruby, OCaml and Python bindings.

This project gets its performance from building an optimized model of the candidates ahead of the actual runtime calculations. Supports any encoding.

C99, Python >= 3.3, OCaml >= 4.0 and Ruby >= 2.1 (Warning regarding ruby versions)

Language specific parts:

Python
Ruby
OCaml
C

Benchmark

Linear scale for dramatic effect.

Two datasets are used for the benchmark: English words and Chinese words. These datasets are fairly small (~ 5 MB). The bigger the dataset, the greater the speedup you get by using batch_jaro_winkler over other libraries.

The MB/s metric is the number of megabytes of the original dataset processed per second. The datasets are UTF-8 encoded, so 1 byte = 1 character for English text. For example, an English dataset of 1 million 10-character words is 10 MB, and batch_jaro_winkler with 4 threads can calculate the score of each of these words against a new value in 10 MB / 400 MB/s = 25 ms with min_score=0.0, and in 10 MB / 1.7 GB/s = 6 ms with min_score=0.9.
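The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name scan_time_ms is hypothetical, and the throughput figures are the ones quoted in this section.

```python
def scan_time_ms(dataset_mb, throughput_mb_per_s):
    """Time in milliseconds to score a whole dataset against one input."""
    return dataset_mb * 1000 / throughput_mb_per_s

# 1 million 10-character English words in UTF-8 is about 10 MB.
print(scan_time_ms(10, 400))             # min_score=0.0 at 400 MB/s -> 25.0
print(round(scan_time_ms(10, 1700), 1))  # min_score=0.9 at 1.7 GB/s -> 5.9
```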

Libraries used for comparison: Levenshtein, jellyfish, hotwater, jaro_winkler, fuzzy-string-match, amatch, textdistance, py_stringmatching, jaro, pyjarowinkler

Installation

Python: pip3 install batch_jaro_winkler

OCaml: opam update && opam install batch_jaro_winkler

Ruby: gem install batch_jaro_winkler

Issues?

You need a development version of Python or Ruby. With apt-get, that would be apt-get install python3-dev or apt-get install ruby-dev.
You need a C compiler installed, gcc or clang for example.
You need make for the ruby library.

Examples

import batch_jaro_winkler as bjw

candidates = ['héllo', '中国', 'hiz']
exportable_model = bjw.build_exportable_model(candidates)
runtime_model = bjw.build_runtime_model(exportable_model)
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [('中国', 0.0), ('hiz', 0.5), ('héllo', 0.9666666388511658)]

Use of min_score for each candidate:

candidates = [{ 'candidate': 'héllo', 'min_score': 0.99 }, { 'candidate': '中国', 'min_score': 0.0 }, { 'candidate': 'hiz', 'min_score': 0.4 }]
exportable_model = bjw.build_exportable_model(candidates)
runtime_model = bjw.build_runtime_model(exportable_model)
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [('中国', 0.0), ('hiz', 0.5)]

Use of min_score as runtime argument, which takes precedence over the min scores for each candidate:

candidates = [{ 'candidate': 'héllo', 'min_score': 0.99 }, { 'candidate': '中国', 'min_score': 0.0 }, { 'candidate': 'hiz', 'min_score': 0.4 }]
exportable_model = bjw.build_exportable_model(candidates)
runtime_model = bjw.build_runtime_model(exportable_model)
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz', min_score=0.5)
# res = [('hiz', 0.5), ('héllo', 0.9666666388511658)]

Use of weight, threshold and n_best_results:

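candidates = ['héllo', '中国', 'hiz']
exportable_model = bjw.build_exportable_model(candidates)
runtime_model = bjw.build_runtime_model(exportable_model)
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz', weight=0.2, threshold=0.5, n_best_results=1)
# res = [('héllo', 0.9888888597488403)]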

Correctness

Different libraries calculate scores differently. The output of this project matches the output of the original implementation by Bill Winkler, George McLaughlin and Matt Jaro. See here for more details.

How to use

The exportable model

The first step is to build a model from a set of candidates. This model is a simple string (bytes in Python) that we can store where we like: RAM, disk, a database, S3 etc.

We can optionally set a min_score requirement for each candidate, so that a candidate is only returned at runtime if the matching score is higher than a certain value, except if we manually pass a min_score at runtime, which takes precedence.

We need to choose how many threads we want to use for runtime calculations when we build the exportable model, as this value is used for an optimized internal representation.

The exportable model compresses very well, but decompressing at runtime might be too slow depending on your needs:

>>> import gzip
>>> exportable_model = bjw.build_exportable_model(candidates)
>>> len(exportable_model)
34738694
>>> compressed_exportable_model = gzip.compress(exportable_model)
>>> len(compressed_exportable_model)
9443218
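One way to store a compressed model on disk and load it back is sketched below. The helper names and the file name are hypothetical; the only assumption about the model is that it is a bytes object, as returned by build_exportable_model. Placeholder bytes are used so that the sketch runs without the library installed.

```python
import gzip
import os
import tempfile

def save_model(path, exportable_model):
    """Write the exportable model (a bytes object) gzip-compressed to path."""
    with gzip.open(path, 'wb') as f:
        f.write(exportable_model)

def load_model(path):
    """Read a gzip-compressed exportable model back into bytes."""
    with gzip.open(path, 'rb') as f:
        return f.read()

# Placeholder bytes standing in for a real exportable model.
fake_model = b'placeholder model bytes' * 1000
path = os.path.join(tempfile.mkdtemp(), 'model.bjw.gz')
save_model(path, fake_model)
restored = load_model(path)
print(restored == fake_model)  # True
```

The bytes returned by load_model would then be passed to build_runtime_model. Whether the decompression cost is acceptable depends on how often the model is loaded.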

The runtime model and runtime calculations

Once we have an exportable model, we can make runtime score calculations. A prerequisite is to build a runtime model. This is a very cheap operation, but to prevent us from doing it over and over we can reuse the model for any number of runtime calculations. Beware that while an exportable model can be used by multiple threads at the same time as we only read from it, we write to the runtime model passed when doing runtime calculations, so it isn't thread safe.
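Since a runtime model is written to during calculations, one simple pattern is to give each thread its own runtime model built from the shared exportable model. The sketch below uses threading.local for that. The get_runtime_model helper is hypothetical, and stand_in_builder only stands in for bjw.build_runtime_model so that the example runs without the library.

```python
import threading

_local = threading.local()

def get_runtime_model(exportable_model, build_runtime_model):
    """Return this thread's runtime model, building it on first use."""
    if not hasattr(_local, 'runtime_model'):
        _local.runtime_model = build_runtime_model(exportable_model)
    return _local.runtime_model

def stand_in_builder(exportable_model):
    # Stand-in for bjw.build_runtime_model (assumption, for illustration only).
    return {'model': exportable_model}

results = []
results_lock = threading.Lock()

def worker():
    first = get_runtime_model(b'model', stand_in_builder)
    second = get_runtime_model(b'model', stand_in_builder)
    with results_lock:
        results.append((first, second))

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each thread reused its own model, and no two threads shared one.
print(all(a is b for a, b in results))        # True
print(len({id(a) for a, _ in results}) == 3)  # True
```

This sketch caches a single runtime model per thread, so it assumes that each thread always works with the same exportable model.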

Then come the score calculations. We can perform jaro or jaro winkler distance calculations. We can set the min_score argument so that only candidates matching with at least a certain score are returned. This argument overrides the min_score that we may have set for each candidate when building the exportable model. A high value improves the performance, see the benchmark. We can also set the n_best_results argument, which makes the runtime function return only the best scoring candidates. A small value (< 20% of the dataset size) improves the performance, see the benchmark.
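The way these filters combine can be illustrated with a pure Python sketch applied to already computed scores. The filter_results function is hypothetical and is not how the library works internally (the library prunes candidates during the calculation). The use of >= and the ascending order of the results are inferred from the examples in this README.

```python
import heapq

def filter_results(scores, candidate_min_scores=None, min_score=None, n_best_results=None):
    """scores: dict mapping candidate -> score. Mimics the filtering described above."""
    kept = []
    for candidate, score in scores.items():
        if min_score is not None:
            required = min_score  # the runtime value takes precedence
        elif candidate_min_scores is not None:
            required = candidate_min_scores[candidate]
        else:
            required = 0.0
        if score >= required:
            kept.append((candidate, score))
    if n_best_results is not None:
        kept = heapq.nlargest(n_best_results, kept, key=lambda r: r[1])
    return sorted(kept, key=lambda r: r[1])

scores = {'héllo': 0.9667, '中国': 0.0, 'hiz': 0.5}
mins = {'héllo': 0.99, '中国': 0.0, 'hiz': 0.4}
print(filter_results(scores, mins))                 # [('中国', 0.0), ('hiz', 0.5)]
print(filter_results(scores, mins, min_score=0.5))  # [('hiz', 0.5), ('héllo', 0.9667)]
print(filter_results(scores, n_best_results=1))     # [('héllo', 0.9667)]
```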

How does it work?

One possible approach to performing jaro winkler distance calculations against a set of values is to test each value one after the other. This project leverages the fact that values are known in advance to build a data structure (a model) that prevents us from looping to find matches, using a hash table linking to a list of matches instead. More often than not, you only care about matches that have a score higher than a threshold value. We benefit from this fact by skipping candidates that can't match the threshold as the calculations go on, allowing us to speed up further.

We calculate the jaro winkler distance in 2 steps. The first one calculates the number of matches for a candidate, and populates a 'flags' data structure that is used later on to calculate the number of transpositions.

These would be the flags for 'im marhta yo' as the input and 'martha' as the candidate:

input_flags = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
candidate_flags = [1, 1, 1, 1, 1, 1]

The data structure used to speed up the finding of matching characters is a hash table linking characters to their occurrences in candidates. It is stored this way:

{
  'a': |  cand1 |  cand1 |  cand1 |  cand1 |  nb_oc |  ind1 |  ind2 |  ind3 |  cand2 | ...,
  'b': |  cand1 |  cand1 |  cand1 |  cand1 |  nb_oc |  ind1 |  ind2 |  cand2 |  ...,
}

Where 'a' and 'b' are characters which appear in candidates, 'cand1' and 'cand2' are the indexes of the candidates in the model, 'nb_oc' is the number of occurrences of the character in the candidate and 'ind1', 'ind2' etc. are the indexes of the character's occurrences in the candidate.
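The same structure can be sketched with a Python dict, where lists stand in for the packed byte layout shown above. This is an illustration only: build_char_index is a hypothetical name, and the number of occurrences ('nb_oc') is simply the length of each list of indexes.

```python
from collections import defaultdict

def build_char_index(candidates):
    """Map each character to [(candidate_index, [positions...]), ...]."""
    index = defaultdict(list)
    for cand_ind, candidate in enumerate(candidates):
        # Group the positions of each character within this candidate.
        positions = defaultdict(list)
        for pos, char in enumerate(candidate):
            positions[char].append(pos)
        for char, inds in positions.items():
            index[char].append((cand_ind, inds))
    return dict(index)

index = build_char_index(['martha', 'hat'])
print(index['a'])  # [(0, [1, 5]), (1, [1])]
print(index['h'])  # [(0, [4]), (1, [0])]
```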

Additionally, we can know in advance how many matches we need to satisfy the min_score requirement. This allows us to ignore candidates once we know that they won't be able to match. For example, with a min_score of 1.0:

runtime_input_len = 8, candidate_len = 8 => We know that we must have 8 matches
runtime_input_len = 9, candidate_len = 8 => We know that it is impossible that they match

With a min_score of 0.9:

runtime_input_len = 8, candidate_len = 8 =>
  (3.0 * min_score * candidate_len * runtime_input_len - (candidate_len * runtime_input_len)) / (candidate_len + runtime_input_len)
  => 6.8, we need at least 7 matches
runtime_input_len = 9, candidate_len = 8 =>
  (3.0 * min_score * candidate_len * runtime_input_len - (candidate_len * runtime_input_len)) / (candidate_len + runtime_input_len)
  => 7.2, we need at least 8 matches
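The formula above is consistent with the textbook jaro score taken with zero transpositions, (m / len1 + m / len2 + 1) / 3 >= min_score, solved for the number of matches m. The sketch below reproduces the four cases. The function name is hypothetical, and it only covers the plain jaro score, without the jaro winkler prefix bonus.

```python
import math

def required_nb_matches(min_score, candidate_len, runtime_input_len):
    """Smallest match count that can still reach min_score, or None if impossible."""
    product = candidate_len * runtime_input_len
    bound = (3.0 * min_score * product - product) / (candidate_len + runtime_input_len)
    needed = max(0, math.ceil(bound))
    # A pair can never have more matches than its shorter string has characters.
    if needed > min(candidate_len, runtime_input_len):
        return None
    return needed

print(required_nb_matches(1.0, 8, 8))  # 8
print(required_nb_matches(1.0, 8, 9))  # None
print(required_nb_matches(0.9, 8, 8))  # 7
print(required_nb_matches(0.9, 8, 9))  # 8
```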

This data representation allows us to have an efficient runtime, which looks something like this (simplified):

for i_char, char in enumerate(runtime_input):
  remaining_chars = len(runtime_input) - i_char
  occurrences = char_matches[char]
  for candidate_ind, indexes in occurrences:
    enough_matches_possible = candidates[candidate_ind].nb_matches + remaining_chars >= candidates[candidate_ind].required_nb_matches
    if not enough_matches_possible:
      continue
    for ind in indexes:
      match_in_search_range = ind >= i_char - candidates[candidate_ind].search_range and ind <= i_char + candidates[candidate_ind].search_range
      if match_in_search_range:
        candidates[candidate_ind].nb_matches += 1
        runtime_input_flags[candidate_ind][i_char] = 1
        candidates_flags[candidate_ind][ind] = 1
        break

Once this is done, all that is left to do is to calculate the number of transpositions for possibly matching candidates (candidates that have a number of matches at least equal to the required number of matches) from the flags.
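As an illustration of this last step, the sketch below counts transpositions from the flags of the 'im marhta yo' / 'martha' example given earlier, then computes the textbook jaro score. The function names are hypothetical and the code is a plain Python reading of the standard algorithm, not the library's C implementation.

```python
def count_transpositions(runtime_input, input_flags, candidate, candidate_flags):
    """Walk the flagged characters of both strings in order and count mismatches."""
    matched_input = [c for c, flag in zip(runtime_input, input_flags) if flag]
    matched_cand = [c for c, flag in zip(candidate, candidate_flags) if flag]
    out_of_order = sum(a != b for a, b in zip(matched_input, matched_cand))
    # Two out-of-order characters make one transposition.
    return out_of_order // 2

def jaro_from_counts(nb_matches, nb_transpositions, input_len, candidate_len):
    """Textbook jaro score from the match and transposition counts."""
    if nb_matches == 0:
        return 0.0
    return (nb_matches / input_len
            + nb_matches / candidate_len
            + (nb_matches - nb_transpositions) / nb_matches) / 3.0

runtime_input, candidate = 'im marhta yo', 'martha'
input_flags = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
candidate_flags = [1, 1, 1, 1, 1, 1]
t = count_transpositions(runtime_input, input_flags, candidate, candidate_flags)
print(t)  # 1
print(round(jaro_from_counts(6, t, len(runtime_input), len(candidate)), 4))  # 0.7778
```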

Things that were tried but did not improve performance. They could still bring performance gains if done right:

Instead of iterating over all the occurrences of a character every time, keep a linked list (as offsets in the candidates' array) of possible candidate occurrences for a given character. We can eliminate candidate occurrences when we know a candidate can't match, or when we explored all occurrences for this candidate and this character.
Using bits instead of uint8_t for runtime_input_flags and candidate_flags.
Keep track of the number of potential matches left for a candidate, so that we can skip impossible candidates in try_to_match_occurrence the same way we do with nb_matches + remaining_chars < required_nb_matches.

Things that were not tried:

Sort candidates by jaro distance when building the model. This would greatly optimize cache usage, as potential candidates for a particular runtime_input would all be near one another, making memory accesses more efficient. Right now we are using an approximation of this, sorting by alphabetical order + length.
Better split candidates across each thread's local storage so that each thread takes around the same time to finish. This would greatly improve the multi-threading performance, which caps at around 3 times faster than with 1 thread, even with 6 threads.
If bits are used for the flags, maybe use binary operations for transpositions calculation? Seems unlikely.
Use SIMD instructions, probably tough because there is a lot of branching right now.
Keep track of the current best score when finding matches, to invalidate candidates that are guaranteed to have a smaller score. Same when calculating transpositions.

Python

import batch_jaro_winkler as bjw

candidates = ['héllo', '中国', 'hiz']
exportable_model = bjw.build_exportable_model(candidates)
runtime_model = bjw.build_runtime_model(exportable_model)
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [('中国', 0.0), ('hiz', 0.5), ('héllo', 0.9666666388511658)]

build_exportable_model

build_exportable_model(candidates, nb_runtime_threads=1)

Parameter	Type	Comment
candidates	list-like	A list (or list-like object) containing the strings to match runtime values against. Must respect one of these 2 schemas: `['hi', 'hello']` or `[{ 'candidate': 'hi', 'min_score': 0.5 }, { 'candidate': 'hello', 'min_score': 0.8 }]`. If one candidate has a `min_score`, all of them must have one. If `min_score` is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a `min_score` at runtime, which takes precedence.
nb_runtime_threads	int	The number of threads to use at runtime (`jaro_distance[_bytes]` and `jaro_winkler_distance[_bytes]`).

Returns a bytes object.

build_exportable_model_bytes

build_exportable_model_bytes(char_width, candidates, nb_runtime_threads=1)

Parameter	Type	Comment
char_width	int	Must be one of {1, 2, 4}. The width in bytes of a single character in the strings you provide in the `candidates` parameter. For example, if you use `utf-32`, set `char_width` to 4.
candidates	list-like	A list (or list-like object) containing the strings to match runtime values against. They can be encoded however you like, including in custom encodings, and including with `0` characters in the middle of the encoded strings. Must respect one of these 2 schemas: `[b'hi', b'hello']` or `[{ 'candidate': b'hi', 'min_score': 0.5 }, { 'candidate': b'hello', 'min_score': 0.8 }]`. If one candidate has a `min_score`, all of them must have one. If `min_score` is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a `min_score` at runtime, which takes precedence.
nb_runtime_threads	int	The number of threads to use at runtime (`jaro_distance[_bytes]` and `jaro_winkler_distance[_bytes]`).

Returns a bytes object.
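A possible way to prepare fixed-width candidates for the `_bytes` functions is sketched below. The encode_candidates helper is hypothetical. Note that Python's 'utf-32' codec prepends a byte order mark, which would presumably be seen as an extra character, so the sketch uses 'utf-32-le', which does not.

```python
def encode_candidates(candidates, encoding='utf-32-le'):
    """Encode str candidates to fixed-width bytes (4 bytes per character for utf-32-le)."""
    return [candidate.encode(encoding) for candidate in candidates]

encoded = encode_candidates(['hi', '中国'])
print(encoded[0])  # b'h\x00\x00\x00i\x00\x00\x00'
print(encoded[1])  # b'-N\x00\x00\xfdV\x00\x00'

# Every encoded candidate is a whole number of 4-byte characters.
char_width = 4
print(all(len(e) % char_width == 0 for e in encoded))  # True
```

The encoded list would then be passed as `candidates` together with `char_width=4`.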

build_runtime_model

build_runtime_model(exportable_model)

Parameter	Type	Comment
exportable_model	bytes	An exportable model built with `build_exportable_model[_bytes]`.

Returns an object that you can then pass as argument to one of the runtime functions: jaro_distance[_bytes] and jaro_winkler_distance[_bytes].

jaro_winkler_distance

jaro_winkler_distance(runtime_model, inp, min_score=None, weight=0.1, threshold=0.7, n_best_results=None)

Parameter	Type	Comment
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	str	The input to get scores for.
min_score	float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
weight	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value.
threshold	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value.
n_best_results	int	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns a list of tuples containing the candidates and the matching scores, following this schema: [('中国', 0.0), ('hiz', 0.5)].
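To make the roles of `weight` and `threshold` concrete, here is a sketch of the textbook prefix bonus applied to a jaro score. It is an assumption that the library behaves exactly like this at the boundaries: the strict comparison with `threshold` and the prefix cap of 4 characters come from the standard algorithm, and winkler_adjust is a hypothetical name. The values reproduce the 'hélloz' / 'héllo' scores shown in the examples.

```python
def winkler_adjust(jaro_score, prefix_len, weight=0.1, threshold=0.7):
    """Apply the prefix bonus to a jaro score (prefix capped at 4 characters)."""
    if not 0.0 <= weight <= 0.25:
        raise ValueError('weight must be between 0.0 and 0.25')
    if jaro_score <= threshold:
        return jaro_score  # no bonus at or below the threshold (assumed strict)
    prefix_len = min(prefix_len, 4)
    return jaro_score + prefix_len * weight * (1.0 - jaro_score)

# 'hélloz' vs 'héllo': 5 matches, no transposition, common prefix of 5 characters.
jaro = (5 / 6 + 5 / 5 + 5 / 5) / 3
print(round(winkler_adjust(jaro, 5), 4))              # 0.9667
print(round(winkler_adjust(jaro, 5, weight=0.2), 4))  # 0.9889
print(winkler_adjust(0.5, 1))                         # 0.5 (below the threshold)
```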

jaro_winkler_distance_bytes

jaro_winkler_distance_bytes(char_width, runtime_model, inp, min_score=None, weight=0.1, threshold=0.7, n_best_results=None)

Parameter	Type	Comment
char_width	int	Must be one of {1, 2, 4}. The value used must match the `char_width` passed when calling `build_exportable_model_bytes`. The width in bytes of a single character in the `inp` parameter, as well as in the candidates in the exportable model.
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	bytes	The input to get scores for.
min_score	float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
weight	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value.
threshold	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value.
n_best_results	int	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns a list of tuples containing the candidates (as bytes) and the matching scores, following this schema: [(b'-N\x00\x00\xfdV\x00\x00', 0.0), (b'hiz', 0.5)].

jaro_distance

jaro_distance(runtime_model, inp, min_score=None, n_best_results=None)

Parameter	Type	Comment
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	str	The input to get scores for.
min_score	float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
n_best_results	int	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns a list of tuples containing the candidates and the matching scores, following this schema: [('中国', 0.0), ('hiz', 0.5)].

jaro_distance_bytes

jaro_distance_bytes(char_width, runtime_model, inp, min_score=None, n_best_results=None)

Parameter	Type	Comment
char_width	int	Must be one of {1, 2, 4}. The value used must match the `char_width` passed when calling `build_exportable_model_bytes`. The width in bytes of a single character in the `inp` parameter, as well as in the candidates in the exportable model.
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	bytes	The input to get scores for.
min_score	float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
n_best_results	int	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns a list of tuples containing the candidates (as bytes) and the matching scores, following this schema: [(b'-N\x00\x00\xfdV\x00\x00', 0.0), (b'hiz', 0.5)].

Ruby

require 'batch_jaro_winkler'

candidates = ['héllo', '中国', 'hiz']
exportable_model = BatchJaroWinkler.build_exportable_model(candidates)
runtime_model = BatchJaroWinkler.build_runtime_model(exportable_model)
res = BatchJaroWinkler.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [['中国', 0.0], ['hiz', 0.5], ['héllo', 0.9666666388511658]]

build_exportable_model

build_exportable_model(candidates, nb_runtime_threads: 1)

Parameter	Type	Comment
candidates	array-like	An array (or array-like object) containing the strings to match runtime values against. Must respect one of these 2 schemas: `['hi', 'hello']` or `[{ candidate: 'hi', min_score: 0.5 }, { candidate: 'hello', min_score: 0.8 }]`. If one candidate has a `min_score`, all of them must have one. If `min_score` is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a `min_score` at runtime, which takes precedence.
nb_runtime_threads	Integer	The number of threads to use at runtime (`jaro_distance[_bytes]` and `jaro_winkler_distance[_bytes]`).

Returns an ascii encoded String object.

build_exportable_model_bytes

build_exportable_model_bytes(char_width, candidates, nb_runtime_threads: 1)

Parameter	Type	Comment
char_width	Integer	Must be one of {1, 2, 4}. The width in bytes of a single character in the strings you provide in the `candidates` parameter. For example, if you use `utf-32`, set `char_width` to 4.
candidates	array-like	An array (or array-like object) containing the strings to match runtime values against. They can be encoded however you like, including in custom encodings, and including with `0` characters in the middle of the encoded strings. Must respect one of these 2 schemas: `['hi'.encode('utf-32le'), 'hello'.encode('utf-32le')]` or `[{ candidate: 'hi'.encode('utf-32le'), min_score: 0.5 }, { candidate: 'hello'.encode('utf-32le'), min_score: 0.8 }]`. If one candidate has a `min_score`, all of them must have one. If `min_score` is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a `min_score` at runtime, which takes precedence.
nb_runtime_threads	Integer	The number of threads to use at runtime (`jaro_distance[_bytes]` and `jaro_winkler_distance[_bytes]`).

Returns an ascii encoded String object.

build_runtime_model

build_runtime_model(exportable_model)

Parameter	Type	Comment
exportable_model	String	An exportable model built with `build_exportable_model[_bytes]`.

Returns an object that you can then pass as argument to one of the runtime functions: jaro_distance[_bytes] and jaro_winkler_distance[_bytes].

jaro_winkler_distance

jaro_winkler_distance(runtime_model, inp, min_score: nil, weight: 0.1, threshold: 0.7, n_best_results: nil)

Parameter	Type	Comment
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	String	The input to get scores for.
min_score	Float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
weight	Float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value.
threshold	Float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value.
n_best_results	Integer	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns an array of arrays containing the candidates and the matching scores, following this schema: [['中国', 0.0], ['hiz', 0.5]].

jaro_winkler_distance_bytes

jaro_winkler_distance_bytes(char_width, runtime_model, inp, min_score: nil, weight: 0.1, threshold: 0.7, n_best_results: nil)

Parameter	Type	Comment
char_width	Integer	Must be one of {1, 2, 4}. The value used must match the `char_width` passed when calling `build_exportable_model_bytes`. The width in bytes of a single character in the `inp` parameter, as well as in the candidates in the exportable model.
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	String	The input to get scores for.
min_score	Float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
weight	Float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value.
threshold	Float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value.
n_best_results	Integer	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns an array of arrays containing the candidates and the matching scores, following this schema: [['\u4E2D\u56FD', 0.0], ['hiz', 0.5]].

jaro_distance

jaro_distance(runtime_model, inp, min_score: nil, n_best_results: nil)

Parameter	Type	Comment
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	String	The input to get scores for.
min_score	Float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
n_best_results	Integer	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns an array of arrays containing the candidates and the matching scores, following this schema: [['中国', 0.0], ['hiz', 0.5]].

jaro_distance_bytes

jaro_distance_bytes(char_width, runtime_model, inp, min_score: nil, n_best_results: nil)

Parameter	Type	Comment
char_width	Integer	Must be one of {1, 2, 4}. The value used must match the `char_width` passed when calling `build_exportable_model_bytes`. The width in bytes of a single character in the `inp` parameter, as well as in the candidates in the exportable model.
runtime_model	RuntimeModel	Object returned by `build_runtime_model`.
inp	String	The input to get scores for.
min_score	Float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
n_best_results	Integer	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.

Returns an array of arrays containing the candidates and the matching scores, following this schema: [['\u4E2D\u56FD', 0.0], ['hiz', 0.5]].

OCaml

test_bjw.ml

module Bjw = Batch_jaro_winkler

let () =
  let encoding = Bjw.Encoding.UTF8 in
  let candidates = [("héllo", None) ; ("中国", None) ; ("hiz", None)] in
  let exportable_model = Bjw.build_exportable_model ~encoding candidates in
  let runtime_model = Bjw.build_runtime_model exportable_model in
  let res = Bjw.jaro_winkler_distance ~encoding runtime_model "hélloz" in
  (* res = [("中国", 0.0) ; ("hiz", 0.5) ; ("héllo", 0.9666666388511658)] *)
  List.iter (fun (candidate, score) -> Printf.printf "'%s': %f\n" candidate score) res

dune

(executable
 (name test_bjw)
 (libraries batch_jaro_winkler))

$ dune build
$ ./_build/default/test_bjw.exe
'中国': 0.000000
'hiz': 0.500000
'héllo': 0.966667

build_exportable_model

val build_exportable_model : encoding:Encoding.t -> ?nb_runtime_threads:int ->
  (string * float option) list -> string
let build_exportable_model ~encoding ?nb_runtime_threads:(nb_runtime_threads=1) candidates

Parameter	Type	Comment
encoding	Encoding.t	One of `{ASCII, UTF8, UTF16, UTF32, CHAR_WIDTH_1, CHAR_WIDTH_2, CHAR_WIDTH_4}`. Describes the encoding of the string passed in the `candidates` parameter. You can use a custom fixed-width encoding with `CHAR_WIDTH_1`, `CHAR_WIDTH_2` and `CHAR_WIDTH_4`.
nb_runtime_threads	int	The number of threads to use at runtime (`jaro_distance` and `jaro_winkler_distance`).
candidates	(string * float option) list	A list containing the strings to match runtime values against, and optionally the minimum matching score required at runtime. Must respect one of these 2 schemas: `[("héllo", None) ; ("中国", None)]` or `[("héllo", Some 0.8) ; ("中国", Some 0.5)]`. If one candidate has a `min_score`, all of them must have one. If you give a minimum score for a candidate, it is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a `min_score` at runtime, which takes precedence.

Returns a string.

build_runtime_model

val build_runtime_model : string -> runtime_model
let build_runtime_model exportable_model

Parameter	Type	Comment
exportable_model	string	An exportable model built with `build_exportable_model`.

Returns a runtime_model that you can then pass as argument to one of the runtime functions: jaro_distance and jaro_winkler_distance.

jaro_winkler_distance

val jaro_winkler_distance : encoding:Encoding.t -> ?min_score:float -> ?weight:float ->
  ?threshold:float -> ?n_best_results:int option -> runtime_model -> string -> (string * float) list
let jaro_winkler_distance ~encoding ?min_score:(min_score=(-1.0)) ?weight:(weight=0.1)
  ?threshold:(threshold=0.7) ?n_best_results:(n_best_results=None) runtime_model input

Parameter	Type	Comment
encoding	Encoding.t	One of `{ASCII, UTF8, UTF16, UTF32, CHAR_WIDTH_1, CHAR_WIDTH_2, CHAR_WIDTH_4}`. Describes the encoding of the string passed in the `input` parameter. You can use a custom fixed-width encoding with `CHAR_WIDTH_1`, `CHAR_WIDTH_2` and `CHAR_WIDTH_4`. The encoding must match the one used to build the exportable model.
min_score	float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
weight	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value.
threshold	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value.
n_best_results	int option	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.
runtime_model	runtime_model	Value returned by `build_runtime_model`.
input	string	The input to get scores for.

Returns a list of string * float tuples containing the candidates and the matching scores, following this schema: [("中国", 0.0) ; ("hiz", 0.5)].

jaro_distance

val jaro_distance : encoding:Encoding.t -> ?min_score:float -> ?n_best_results:int option ->
  runtime_model -> string -> (string * float) list
let jaro_distance ~encoding ?min_score:(min_score=(-1.0)) ?n_best_results:(n_best_results=None)
  runtime_model input

Parameter	Type	Comment
encoding	Encoding.t	One of `{ASCII, UTF8, UTF16, UTF32, CHAR_WIDTH_1, CHAR_WIDTH_2, CHAR_WIDTH_4}`. Describes the encoding of the string passed in the `input` parameter. You can use a custom fixed-width encoding with `CHAR_WIDTH_1`, `CHAR_WIDTH_2` and `CHAR_WIDTH_4`. The encoding must match the one used to build the exportable model.
min_score	float	If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
n_best_results	int option	Makes the function return the `n_best_results` best scoring candidates only. Improves performance.
runtime_model	runtime_model	Value returned by `build_runtime_model`.
input	string	The input to get scores for.

Returns a list of string * float tuples containing the candidates and the matching scores, following this schema: [("中国", 0.0) ; ("hiz", 0.5)].

C

The files you need are in the lib folder.

test.c:

#include "batch_jaro_winkler.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char      *candidates[] = { "hello", "hiz" };
    uint32_t  candidates_lengths[] = { 5, 3 };
    uint32_t  exportable_model_size;
    uint32_t  nb_results;

    // char_width = 1 ; nb_candidates = 2 ; nb_runtime_threads = 1
    void *exportable_model = bjw_build_exportable_model(
      (void**)candidates, 1, candidates_lengths, 2, NULL, 1, &exportable_model_size
    );
    if (!exportable_model)
        exit(1);
    void *runtime_model = bjw_build_runtime_model(exportable_model);
    if (!runtime_model)
        exit(1);

    // input_length = 5 ; min_score = -1.0 (deactivate)
    // weight = 0.1 (default value for standard jaro winkler)
    // threshold = 0.7 (default value for standard jaro winkler)
    // n_best_results = 0 (deactivate)
    bjw_result *res = bjw_jaro_winkler_distance(runtime_model, "hallo", 5, -1.0, 0.1, 0.7, 0, &nb_results);
    if (!res)
        exit(1);

    uint32_t best_candidate_ind = 0;
    for (uint32_t i_res = 0; i_res < nb_results; i_res++)
    {
        // Warning: candidates are not null terminated, as the meaning of bytes within candidates
        // depends on the encoding, including for 0.
        printf(
            "{ .candidate = \"%.*s\", .score = %f }\n",
            res[i_res].candidate_length, res[i_res].candidate, res[i_res].score
        );
        if (res[i_res].score > res[best_candidate_ind].score)
            best_candidate_ind = i_res;
    }

    // Important: the 'candidate' field in `bjw_result` is a pointer to somewhere within the exportable model.
    // If you want to keep candidates after the exportable model has been freed, you must copy the data.
    // char_width = 1
    char *best_candidate = malloc(res[best_candidate_ind].candidate_length * 1);
    if (!best_candidate)
        exit(1);
    memcpy(best_candidate, res[best_candidate_ind].candidate, res[best_candidate_ind].candidate_length * 1);
    uint32_t best_candidate_length = res[best_candidate_ind].candidate_length;

    free(res);
    bjw_free_runtime_model(runtime_model);
    free(exportable_model);

    printf("best candidate: \"%.*s\"\n", best_candidate_length, best_candidate);
    free(best_candidate);
    return (0);
}

$ ls -l
-rw-r--r--  1 user  wheel  33490 27 avr 13:12 batch_jaro_winkler.c
-rw-r--r--  1 user  wheel   1111 27 avr 13:12 batch_jaro_winkler.h
-rw-r--r--  1 user  wheel   1533 27 avr 13:12 batch_jaro_winkler_internal.h
-rw-r--r--  1 user  wheel  22514 27 avr 13:12 batch_jaro_winkler_runtime.h
-rw-r--r--  1 user  wheel   2190 27 avr 13:15 test.c
-rw-r--r--  1 user  wheel  78701 27 avr 13:12 uthash.h
$ gcc -O3 batch_jaro_winkler.c test.c
$ ./a.out
{ .candidate = "hiz", .score = 0.511111 }
{ .candidate = "hello", .score = 0.880000 }
best candidate: "hello"
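
The example above scans the results for the single best score. When several top matches are needed, the array can be sorted instead. The following is a minimal sketch that makes no assumption about the order in which the library returns results. `example_result` is a hypothetical stand-in with the same fields as `bjw_result` so that the snippet compiles on its own; with the real library you would include batch_jaro_winkler.h and use `bjw_result` directly.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in mirroring the fields of bjw_result (see the struct
   definition in the C reference below). */
typedef struct
{
    void      *candidate;
    float     score;
    uint32_t  candidate_length;
} example_result;

/* qsort comparator: highest score first. */
static int cmp_score_desc(const void *a, const void *b)
{
    float sa = ((const example_result *)a)->score;
    float sb = ((const example_result *)b)->score;
    return (sa < sb) - (sa > sb);
}

/* Sorts `nb_results` results in place, best match first. */
static void sort_results_desc(example_result *res, uint32_t nb_results)
{
    qsort(res, nb_results, sizeof(*res), cmp_score_desc);
}
```

Sorting only reorders the structs. The `candidate` pointers still point into the exportable model, so the lifetime rule described in the example above still applies.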

bjw_build_exportable_model

void  *bjw_build_exportable_model(
    void **candidates, uint32_t char_width, uint32_t *candidates_lengths, uint32_t nb_candidates,
    float *min_scores, uint32_t nb_runtime_threads, uint32_t *res_model_size
)

Parameter	Type	Comment
candidates	void**	An array of character arrays. Each character must be `char_width` bytes wide. They can be encoded however you like, including in custom encodings, and including with `0` characters in the middle of the encoded strings.
char_width	uint32_t	Must be one of {1, 2, 4}. The width in bytes of a single character in the strings you provide in the `candidates` parameter. For example, if you use `utf-32`, set `char_width` to 4.
candidates_lengths	uint32_t*	Array containing the length of each candidate. If the strings are null-terminated, don't count the last byte when determining the length.
nb_candidates	uint32_t	The number of elements in the `candidates`, `candidates_lengths` and `min_scores` arrays.
min_scores	float*	Can be NULL. If provided, a candidate is only returned at runtime if its matching score is higher than the min score specified here, unless a `min_score` is passed at runtime, in which case the runtime value takes precedence.
nb_runtime_threads	uint32_t	The number of threads to use at runtime (`bjw_jaro_distance` and `bjw_jaro_winkler_distance`).
res_model_size	uint32_t*	The value is set by the function to the size in bytes of the resulting exportable model.

Returns a buffer. You are responsible for freeing it. Warning: if you pass it to bjw_build_runtime_model, keep it in memory for as long as you make runtime calculations, because the runtime functions (bjw_jaro_distance and bjw_jaro_winkler_distance) return pointers to the candidate strings stored inside the exportable model.
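
Since every character must be exactly `char_width` bytes wide, variable-width text such as UTF-8 has to be converted to a fixed-width form before it is passed in. The sketch below decodes UTF-8 into an array of code points, which could then be used with `char_width` set to 4. `utf8_to_utf32` is a hypothetical helper and not part of the library. It stores code points in native byte order, which should be fine as long as candidates and inputs are converted the same way.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decodes UTF-8 into one uint32_t code point per character.
   Returns a malloc'd array and stores the number of characters in *out_length.
   Returns NULL on allocation failure or on malformed/truncated input.
   Sketch only: overlong encodings and surrogates are not rejected. */
static uint32_t *utf8_to_utf32(const char *s, size_t nb_bytes, uint32_t *out_length)
{
    uint32_t *out = malloc((nb_bytes ? nb_bytes : 1) * sizeof(*out));
    uint32_t n = 0;
    size_t i = 0;

    if (!out)
        return NULL;
    while (i < nb_bytes)
    {
        unsigned char c = (unsigned char)s[i];
        uint32_t cp;
        size_t extra; /* number of continuation bytes expected */

        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else { free(out); return NULL; }
        if (extra > nb_bytes - i - 1) { free(out); return NULL; }
        for (size_t k = 1; k <= extra; k++)
        {
            unsigned char cc = (unsigned char)s[i + k];
            if ((cc & 0xC0) != 0x80) { free(out); return NULL; }
            cp = (cp << 6) | (cc & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    *out_length = n;
    return out;
}
```

The character count written to `*out_length` is the kind of value the runtime functions expect for `input_length`, which is documented as a length in characters rather than bytes.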

bjw_build_runtime_model

void  *bjw_build_runtime_model(void *exportable_model)

Parameter	Type	Comment
exportable_model	void*	An exportable model built with `bjw_build_exportable_model`.

Returns a buffer that you can then pass as argument to one of the runtime functions: bjw_jaro_distance and bjw_jaro_winkler_distance. You must call bjw_free_runtime_model when you are done using it.
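
Because bjw_build_exportable_model reports the size of its buffer through `res_model_size`, the exportable model can presumably be written to disk once and read back later before being handed to bjw_build_runtime_model. The sketch below shows only the file handling. `save_model` and `load_model` are hypothetical helper names, and the approach assumes the model is a self-contained byte buffer that is reloaded on the same kind of machine that built it.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes `size` bytes of `model` to `path`. Returns 0 on success, -1 on error. */
static int save_model(const char *path, const void *model, uint32_t size)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(model, 1, size, f);
    if (fclose(f) != 0 || written != size)
        return -1;
    return 0;
}

/* Reads the whole file at `path` into a malloc'd buffer and stores its size
   in *size. Returns NULL on error or if the file is empty.
   The caller frees the buffer. */
static void *load_model(const char *path, uint32_t *size)
{
    FILE *f = fopen(path, "rb");
    void *buf = NULL;
    long len = 0;

    if (!f)
        return NULL;
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) > 0
        && fseek(f, 0, SEEK_SET) == 0 && (buf = malloc((size_t)len)) != NULL)
    {
        if (fread(buf, 1, (size_t)len, f) != (size_t)len)
        {
            free(buf);
            buf = NULL;
        }
        else
            *size = (uint32_t)len;
    }
    fclose(f);
    return buf;
}
```

A buffer returned by `load_model` would take the place of the `exportable_model` argument above, and it must stay allocated for as long as the runtime model built from it is in use.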

bjw_free_runtime_model

void  bjw_free_runtime_model(void *runtime_model)

Parameter	Type	Comment
runtime_model	void*	A runtime model built with `bjw_build_runtime_model`.

Frees the runtime model.

bjw_jaro_winkler_distance

bjw_result  *bjw_jaro_winkler_distance(
    void *runtime_model, void *input, uint32_t input_length, float min_score,
    float weight, float threshold, uint32_t n_best_results, uint32_t *nb_results
)

Parameter	Type	Comment
runtime_model	void*	Object returned by `bjw_build_runtime_model`.
input	void*	The input to get scores for. Must be encoded in the same way as candidates.
input_length	uint32_t	Length of the input in characters, not bytes.
min_score	float	If >= 0.0, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
weight	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes; this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value (0.1).
threshold	float	The jaro winkler algorithm gives a scoring bonus for matching prefixes; this parameter sets the minimum score a candidate must reach for the bonus to be applied. For the standard jaro winkler score calculation, use the default value (0.7).
n_best_results	uint32_t	If > 0, the function returns the `n_best_results` best scoring candidates only. Improves performance.
nb_results	uint32_t*	This value is set by the function to the number of results.

Returns an array of *nb_results bjw_result. You are responsible for freeing the resulting array. Here is the definition for bjw_result, candidate_length is the number of characters, not bytes:

typedef struct
{
  void      *candidate;
  float     score;
  uint32_t  candidate_length;
} bjw_result;
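
To make the roles of `weight` and `threshold` concrete, here is a minimal sketch of the textbook Winkler adjustment applied on top of a plain jaro score. It is not the library's code. The helper names are hypothetical, the prefix is capped at 4 characters as in the standard algorithm, and whether the library compares against the threshold with `>` or `>=` is an assumption.

```c
#include <stdint.h>

/* Number of leading characters shared by a and b, capped at 4. */
static uint32_t common_prefix_length(const char *a, uint32_t a_len,
                                     const char *b, uint32_t b_len)
{
    uint32_t n = 0;
    while (n < 4 && n < a_len && n < b_len && a[n] == b[n])
        n++;
    return n;
}

/* Textbook Winkler adjustment: scores at or below `threshold` are returned
   unchanged, others move towards 1.0 in proportion to the shared prefix. */
static double winkler_adjust(double jaro, uint32_t prefix_len,
                             double weight, double threshold)
{
    if (jaro <= threshold)
        return jaro;
    return jaro + prefix_len * weight * (1.0 - jaro);
}
```

With the values from the example at the top of this section, "hello" and "hallo" share a one-character prefix and have a jaro score of about 0.866667, so the adjusted score is 0.866667 + 1 * 0.1 * (1 - 0.866667) = 0.88. The 0.25 upper bound on `weight` follows from the prefix cap: with 4 matching characters and a weight of 0.25 the bonus factor is exactly 1, so the result cannot exceed 1.0.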

bjw_jaro_distance

bjw_result  *bjw_jaro_distance(
    void *runtime_model, void *input, uint32_t input_length,
    float min_score, uint32_t n_best_results, uint32_t *nb_results
)

Parameter	Type	Comment
runtime_model	void*	Object returned by `bjw_build_runtime_model`.
input	void*	The input to get scores for. Must be encoded in the same way as candidates.
input_length	uint32_t	Length of the input in characters, not bytes.
min_score	float	If >= 0.0, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model.
n_best_results	uint32_t	If > 0, the function returns the `n_best_results` best scoring candidates only. Improves performance.
nb_results	uint32_t*	This value is set by the function to the number of results.

Returns an array of *nb_results bjw_result. You are responsible for freeing the resulting array. Here is the definition for bjw_result, candidate_length is the number of characters, not bytes:

typedef struct
{
  void      *candidate;
  float     score;
  uint32_t  candidate_length;
} bjw_result;
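
For cross-checking scores on small inputs, a straightforward jaro implementation can be handy. The sketch below is the textbook algorithm for single-byte characters, not the optimized batch approach this library uses. `jaro_reference` is a hypothetical name, and details such as the handling of empty strings and the halving of an odd transposition count are choices made for this sketch that other implementations may make differently.

```c
#include <stdint.h>
#include <stdlib.h>

/* Textbook jaro similarity for single-byte characters (char_width = 1).
   Returns a value in [0, 1], or -1.0 on allocation failure. */
static double jaro_reference(const char *a, uint32_t a_len,
                             const char *b, uint32_t b_len)
{
    if (a_len == 0 || b_len == 0)
        return 0.0; /* convention chosen for this sketch */

    uint32_t longest = a_len > b_len ? a_len : b_len;
    uint32_t window = longest / 2 > 0 ? longest / 2 - 1 : 0;
    unsigned char *a_used = calloc(a_len, 1);
    unsigned char *b_used = calloc(b_len, 1);
    uint32_t matches = 0, transpositions = 0;

    if (!a_used || !b_used)
    {
        free(a_used);
        free(b_used);
        return -1.0;
    }
    /* A character of a matches an unused equal character of b within the window. */
    for (uint32_t i = 0; i < a_len; i++)
    {
        uint32_t lo = i > window ? i - window : 0;
        uint32_t hi = i + window + 1 < b_len ? i + window + 1 : b_len;
        for (uint32_t j = lo; j < hi; j++)
        {
            if (!b_used[j] && a[i] == b[j])
            {
                a_used[i] = b_used[j] = 1;
                matches++;
                break;
            }
        }
    }
    /* Walk both matched sequences in order and count positions that differ. */
    for (uint32_t i = 0, j = 0; i < a_len; i++)
    {
        if (!a_used[i])
            continue;
        while (!b_used[j])
            j++;
        if (a[i] != b[j])
            transpositions++;
        j++;
    }
    free(a_used);
    free(b_used);
    if (matches == 0)
        return 0.0;
    double m = matches;
    return (m / a_len + m / b_len + (m - transpositions / 2.0) / m) / 3.0;
}
```

For "hiz" against "hallo" only the leading "h" matches, giving (1/3 + 1/5 + 1) / 3, about 0.511111. This equals the jaro winkler score printed in the example at the top of this section, because a score below the 0.7 threshold receives no prefix bonus.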

Warning regarding Ruby versions

If you use older MRI versions (< 2.5.8 or between 2.6.0 and 2.6.4 included), you may experience memory leaks. They could be related to MRI's string implementation, which was fixed at the end of 2019: https://github.com/ruby/ruby/compare/v2_6_4...v2_6_5#diff-7a2f2c7dfe0bf61d38272aeaf68ac768R2117. Work was done in this library to mitigate the issue, but the absence of leaks is not guaranteed. If you're interested, you can most likely reproduce the leak with the program below. Change utf-32le to utf-32 to watch the memory leak disappear:

while true do
  1000.times do
    # random 10 characters string
    str = (0...10).map{ (65 + rand(26)).chr }.join
    str.encode('utf-32le')
  end
  GC.start(full_mark: true, immediate_sweep: true)
  GC.start
end

The future

A similar approach could probably also benefit other fuzzy matching algorithms.

Something that would be really neat would be the ability to add candidates or remove candidates from an exportable model on the fly. This would allow for more flexible scenarios, where the set of candidates can change very often, as in a database for instance. It would make the project appropriate as a PostgreSQL extension for example.
