|
4 | 4 |
|
5 | 5 | A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, etc.) are currently implemented. Check the summary table below for the complete list... |
6 | 6 |
|
7 | | -* [Download](#download) |
8 | | -* [Overview](#overview) |
9 | | -* [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance) |
10 | | -* [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance) |
11 | | -* [Levenshtein](#levenshtein) |
12 | | -* [Normalized Levenshtein](#normalized-levenshtein) |
13 | | -* [Weighted Levenshtein](#weighted-levenshtein) |
14 | | -* [Damerau-Levenshtein](#damerau-levenshtein) |
15 | | -* [Optimal String Alignment](#optimal-string-alignment) |
16 | | -* [Jaro-Winkler](#jaro-winkler) |
17 | | -* [Longest Common Subsequence](#longest-common-subsequence) |
18 | | -* [Metric Longest Common Subsequence](#metric-longest-common-subsequence) |
19 | | -* [N-Gram](#n-gram) |
20 | | -* [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms) |
21 | | - * [Q-Gram](#shingle-n-gram-based-algorithms) |
22 | | - * [Cosine similarity](#shingle-n-gram-based-algorithms) |
23 | | - * [Jaccard index](#shingle-n-gram-based-algorithms) |
24 | | - * [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms) |
25 | | -* [Experimental](#experimental) |
26 | | - * [SIFT4](#sift4) |
27 | | -* [Users](#users) |
| 7 | +- [python-string-similarity](#python-string-similarity) |
| 8 | + - [Download](#download) |
| 9 | + - [Overview](#overview) |
| 10 | + - [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance) |
| 11 | + - [(Normalized) similarity and distance](#normalized-similarity-and-distance) |
| 12 | + - [Metric distances](#metric-distances) |
| 13 | + - [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance) |
| 14 | + - [Levenshtein](#levenshtein) |
| 15 | + - [Normalized Levenshtein](#normalized-levenshtein) |
| 16 | + - [Weighted Levenshtein](#weighted-levenshtein) |
| 17 | + - [Damerau-Levenshtein](#damerau-levenshtein) |
| 18 | + - [Optimal String Alignment](#optimal-string-alignment) |
| 19 | + - [Jaro-Winkler](#jaro-winkler) |
| 20 | + - [Longest Common Subsequence](#longest-common-subsequence) |
| 21 | + - [Metric Longest Common Subsequence](#metric-longest-common-subsequence) |
| 22 | + - [N-Gram](#n-gram) |
| 23 | + - [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms) |
| 24 | + - [Q-Gram](#q-gram) |
| 25 | + - [Cosine similarity](#cosine-similarity) |
| 26 | + - [Jaccard index](#jaccard-index) |
| 27 | + - [Sorensen-Dice coefficient](#sorensen-dice-coefficient) |
| 28 | + - [Overlap coefficient (i.e., Szymkiewicz-Simpson)](#overlap-coefficient-ie-szymkiewicz-simpson) |
| 29 | + - [Experimental](#experimental) |
| 30 | + - [SIFT4](#sift4) |
| 31 | + - [Users](#users) |
28 | 32 |
|
29 | 33 |
|
30 | 34 | ## Download |
@@ -55,6 +59,7 @@ The main characteristics of each implemented algorithm are presented below. The |
55 | 59 | | [Cosine similarity](#cosine-similarity) |similarity<br>distance | Yes | No | Profile | O(m+n) | | |
56 | 60 | | [Jaccard index](#jaccard-index) |similarity<br>distance | Yes | Yes | Set | O(m+n) | | |
57 | 61 | | [Sorensen-Dice coefficient](#sorensen-dice-coefficient) |similarity<br>distance | Yes | No | Set | O(m+n) | | |
| 62 | +| [Overlap coefficient](#overlap-coefficient-ie-szymkiewicz-simpson) |similarity<br>distance | Yes | No | Set | O(m+n) | | |
58 | 63 |
|
59 | 64 | [1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the **dynamic programming** method, which has a cost of O(m.n). For Levenshtein distance, the algorithm is sometimes called the **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes. |
60 | 65 |
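The dynamic programming method described in note [1] can be sketched as follows. This is a minimal illustration and not the library's own code; the function name `wagner_fischer` is hypothetical, and the table is allocated with one extra row and column to hold the empty prefixes.

```python
def wagner_fischer(s0, s1):
    """Levenshtein distance via a full dynamic programming table (illustrative sketch)."""
    m, n = len(s0), len(s1)
    # d[i][j] holds the edit distance between the prefixes s0[:i] and s1[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of s0[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of s1[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s0[i - 1] == s1[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[m][n]


print(wagner_fischer("kitten", "sitting"))
```

Both loops visit every cell once, which is where the O(m.n) cost comes from.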
|
@@ -360,6 +365,11 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in |
360 | 365 |
|
361 | 366 | Distance is computed as 1 - similarity. |
362 | 367 |
|
| 368 | +### Overlap coefficient (i.e., Szymkiewicz-Simpson) |
| 369 | +Very similar to the Jaccard and Sorensen-Dice measures, but this time the similarity is computed as |V1 inter V2| / Min(|V1|,|V2|). It tends to yield higher similarity scores than the other overlap-based coefficients, and it always returns the highest similarity score (1) if the shingle set of one string is a subset of the other's. |
| 370 | + |
| 371 | +Distance is computed as 1 - similarity. |
| 372 | + |
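The formula above can be illustrated with a short, self-contained sketch. It does not use the library's classes: the helpers `shingles`, `overlap_similarity` and `overlap_distance` are hypothetical names, shingles are taken as plain character k-grams, and returning 0 when a string is shorter than k is an assumed convention.

```python
def shingles(s, k=3):
    """Return the set of all length-k substrings of s (hypothetical helper)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def overlap_similarity(s0, s1, k=3):
    """|V1 inter V2| / Min(|V1|, |V2|) over the k-shingle sets of both strings."""
    v0, v1 = shingles(s0, k), shingles(s1, k)
    if not v0 or not v1:
        # Assumed convention: a string shorter than k has no shingles to compare.
        return 0.0
    return len(v0 & v1) / min(len(v0), len(v1))


def overlap_distance(s0, s1, k=3):
    """Distance is computed as 1 - similarity."""
    return 1.0 - overlap_similarity(s0, s1, k)


# Every 3-shingle of "night" also occurs in "nightly", so the similarity is 1.
print(overlap_similarity("night", "nightly"))
print(overlap_distance("night", "nightly"))
```

For the same pair, the Jaccard index over 3-shingles would be 3/5, which shows why the overlap coefficient yields the higher score when one set is contained in the other.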
363 | 373 | ## Experimental |
364 | 374 |
|
365 | 375 | ### SIFT4 |
|