Skip to content

A collection of metrics and phonetic algorithms for fuzzy string matching in Elixir.

License

Notifications You must be signed in to change notification settings

vanessaklee/akin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Akin

Akin is a collection of string comparison algorithms for Elixir. This solution was born of a Record Linking project. It combines and modifies The Fuzz and Fuzzy Compare. Algorithms can be called independently or in total to return a map of metrics. This library was built to facilitiate the disambiguation of names but can be used to compare any two binaries.

New! Notebooks

Disambiguation

Run Disambiguation in Livebook

Name Disambiguation

Run Name Disambiguation in Livebook

Table of Contents
  1. Installation
  2. Algorithms
  3. Metrics
  4. Algorithm Definitions
  5. Resources
  6. In Development

Installation

Add a dependency in your mix.exs:

deps: [{:akin, "~> 0.2.0"}]

Algorithms

To see all of the avialable algorithms. Hamming Distance is excluded as it only compares strings of equal length. Hamming may be called directly. See: Single Algorithms

iex> Akin.Util.list_algorithms()
["bag_distance", "substring_set", "sorensen_dice", "jaccard", "jaro_winkler", 
"levenshtein", "metaphone", "double_metaphone", "substring_double_metaphone", "ngram", 
"overlap", "substring_sort", "tversky"]

Metrics

Compare Strings

Compare two strings using all of the available algorithms. The return value is a map of scores for each algorithm.

iex> Akin.compare("weird", "wierd")
%{
 bag_distance: 1.0,
 sorensen_dice: 0.25,
 double_metaphone: 1.0,
 jaccard: 0.14,
 jaro_winkler: 0.94,
 levenshtein: 0.6,
 metaphone: 1.0,
 ngram: 0.25,
 overlap: 0.25,
 tversky: 0.14
}
iex> Akin.compare("beginning", "begining")
%{
  bag_distance: 0.89,
  sorensen_dice: 0.93,
  double_metaphone: 1.0,
  jaccard: 0.88,
  jaro_winkler: 0.95,
  levenshtein: 0.89,
  metaphone: 1.0,
  ngram: 0.88,
  overlap: 1.0,
  tversky: 0.88
}

Options

Comparison accepts options in a Keyword list.

  1. algorithms: algorithms to use in comparision. Accepts the name or a keyword list. Default is algorithms/0.
    1. metric - algorithm metric. Default is both
    • "string": uses string algorithms
    • "phonetic": uses phonetic algorithms
    1. unit - algorithm unit. Default is both.
    • "whole": uses algorithms best suited for whole string comparison (distance)
    • "partial": uses algorithms best suited for partial string comparison (substring)
  2. level - level for double phonetic matching. Default is "normal".
    • "strict": both encodings for each string must match
    • "strong": the primary encoding for each string must match
    • "normal": the primary encoding of one string must match either encoding of other string (default)
    • "weak": either primary or secondary encoding of one string must match one encoding of other string
  3. match_at: an algorith score equal to or above this value is condsidered a match. Default is 0.9
  4. ngram_size: number of contiguous letters to split strings into. Default is 2.
  5. short_length: qualifies as "short" to recieve a shortness boost. Used by Name Metric. Default is 8.
  6. stem: boolean representing whether to compare the stemmed version the strings; uses Stemmer. Default false

Algorithms

Restrict the list of algorithms by name or metric and/or unit.

iex> opts = [algorithms: ["bag_distance", "jaccard", "jaro_winkler"]]
iex> Akin.compare("weird", "wierd", opts) 
%{
bag_distance: 1.0, 
jaccard: 0.14, 
jaro_winkler: 0.94
}
iex> opts = [algorithms: [metric: "phonetic", unit: "whole"]]
iex > Akin.compare("weird", "wierd", opts)
%{
double_metaphone: 1.0, 
metaphone: 1.0
}

n-gram Size

The default ngram size for the algorithms is 2. You can change by setting a value in opts.

iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"]])
%{sorensen_dice: 0.25}
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"], ngram_size: 1])
%{sorensen_dice: 0.8}

Match Level

The default match strictness is "normal" You change it by setting a value in opts. Currently it only affects the outcomes of the substring_set and double_metaphone algorithms

iex> left = "Alice in Wonderland"
iex> right = "Alice's Adventures in Wonderland"
iex> Akin.compare(left, right, [algorithms: ["substring_set"]])
%{substring_set: 0.85}
iex> Akin.compare(left, right, [algorithms: ["substring_set"], level: "weak"])
%{substring_set: 0.85}
iex> left = "which way"
iex> right = "whitch way"
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "weak"])
%{double_metaphone: 1.0}
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "strict"])
%{double_metaphone: 0.0}

Stems

Compare the stemmed version of two strings.

iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"]])
%{bag_distance: 0.57, double_metaphone: 0.0}
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"], stem: true])
%{bag_distance: 1.0, double_metaphone: 1.0}
Additional Examples
iex> Akin.compare("weird", "wierd", algorithms: ["bag_distance", "jaro_winkler", "jaccard"])
%{bag_distance: 1.0, jaccard: 0.14, jaro_winkler: 0.94}
iex> Akin.compare("weird", "wierd", algorithms: [metric: "string", unit: "whole"], ngram_size: 1)
%{
  bag_distance: 1.0,
  jaccard: 0.67,
  jaro_winkler: 0.94,
  levenshtein: 0.6,
  sorensen_dice: 0.8,
  tversky: 1.0
}

Preprocessing

Before being compared, strings are converted to downcase and unicode standard, whitespace is standardized, nontext (like punctuation & emojis) is replaced, and accents are converted. The string is then composed into a struct representing the corpus of data used by the comparison algorithms.

"Alice Liddell" becomes

%Akin.Corpus{
  list: ["alice", "liddell"],
  original: "alice liddell",
  set: #MapSet<["alice", "liddell"]>,
  stems: ["alic", "liddel"],
  string: "aliceliddell"
}

Accents

iex> Akin.compare("Hubert Łępicki", "Hubert Lepicki")
%{
  bag_distance: 0.92,
  dice_sorensen: 0.83,
  double_metaphone: 0.0,
  jaccard: 0.71,
  jaro_winkler: 0.97,
  levenshtein: 0.92,
  metaphone: 0.0,
  ngram: 0.83,
  overlap: 0.83,
  tversky: 0.71
}

Phonemes

iex> Akin.phonemes("virginia") 
["frjn", "frkn"]
iex> Akin.phonemes("beginning")
["bjnnk", "pjnnk", "pknnk"]
iex> Akin.phonemes("wonderland")
["wntrlnt", "antrlnt", "fntrlnt"]

Name Disambiguation

UNDER DEVELOPMENT

Identity is the challenge of author name disambiguation (AND). The aim of AND is to match an author's name to that author when the author appears in a list of many authors. Complexity arises from homonymity (many people with the same name) and synonymity (when one person uses different forms/spellings of their name in publications).

Given the name of an author which is divided into the given, middle, and family name parts (i.e. "Virginia", nil, "Woolf") and a list of possible matching author names, find and return the matches for the author in the list. If initials exist in the left name, a separate comparison is performed for the initals and the sets of the right string.

If the comparison metrics produce a score greater than or equal to 0.9, they considered a match and returned in the list.

iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["v woolfe", "v woolf"]
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["virginia woolfe", "v woolf"]

This may not be what you want. There are likely to be unwanted matches.

iex> Akin.match_names("V. Woolf", ["Victor Woolf", "Virginia Woolf", "V White", "V Woolf", "Virginia Woolfe"])
["v woolf", "virginia woolf", "victor woolf"]

Definitions

Bag Distance

The bag distance is a cheap distance measure which always returns a distance smaller or equal to the edit distance. It's meant to be an efficient approximation of the distance between two strings to quickly rule out strings that are largely different.

Double Metaphone

Calculates the Double Metaphone Phonetic Algorithm metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak.

  • "strict": both encodings for each string must match
  • "strong": the primary encoding for each string must match
  • "normal": the primary encoding of one string must match either encoding of other string (default)
  • "weak": either primary or secondary encoding of one string must match one encoding of other string
Hamming Distance

Note: Hamming algorithm is not used in an of the comparison functions because it requires the strings being compared are of the same length. It can be called directly, however, so it is still a part of this library.

The Hamming distance between two strings of equal length is the number of positions at which the corresponding letters are different. Returns the percentage of change needed to the left string of the comparison of one string (left) with another string (right). Returns 0.0 if the strings are not the same length. Returns 1.0 if the string are equal.

Jaccard Similarity

Calculates the similarity of two strings as the size of the intersection divided by the size of the union. Default ngram: 2.

Jaro-Winkler Similarity

Jaro-Winkler calculates the edit distance between two strings. A score of one denotes equality. Unlike the Jaro Similarity, it modifies the prefix scale to gives a more favorable rating to strings that match from the beginning.

Levenshtein Distance

Compare two strings for their Levenshtein score. The score is determined by finding the edit distance: the minimum number of single-character edits needed to change one word into the other. The distance is substracted from 1.0 and then divided by the longest length between the two strings.

Metaphone

Compares two strings by converting each to an approximate phonetic representation in ASCII and then comparing those phoenetic representations. Returns a 1 if the phoentic representations are an exact match.

N-Gram Similarity

Calculates the ngram distance between two strings. Default ngram: 2.

Overlap Metric

Uses the Overlap Similarity metric to compare two strings by tokenizing the strings and measuring their overlap. Default ngram: 1.

Sørensen–Dice

Sørensen–Dice coefficient is calculated using bigrams. The equation is 2nt / nx + ny where nx is the number of bigrams in string x, ny is the number of bigrams in string y, and nt is the number of bigrams in both strings. For example, the bigrams of night and nacht are {ni,ig,gh,ht} and {na,ac,ch,ht}. They each have four and the intersection is ht.

(2 · 1) / (4 + 4) = 0.25

Substring Double Metaphone

Iterate over the cartesian product of the two lists sending each element through the Double Metaphone using all strictness levels until a true value is found in the list of returned booleans from the Double Metaphone algorithm. Return the percentage of true values found. If true is never returned, return 0. Increases
accuracy for search terms containing more than one word.

Substring Set

Splits the strings on spaces, sorts, re-joins, and then determines Jaro-Winkler distance. Best when the strings contain irrelevent substrings.

Substring Sort

Sorts substrings by words, compares the sorted strings in pairs, and returns the maximum ratio. If one strings is signficantly longer than the other, this method will compare matching substrings only.

Tversky

A generalization of Sørensen–Dice and Jaccard.

Resources

In Development