Commit 3cea9e2

Author: Gökhan Ercan
'Overlap Coefficient' added which is very similar to Jaccard and Sorensen-Dice measures.
1 parent d56b44f commit 3cea9e2

3 files changed: 94 additions, 21 deletions

README.md

Lines changed: 31 additions & 21 deletions
@@ -4,27 +4,31 @@

 A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...

-* [Download](#download)
-* [Overview](#overview)
-* [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
-* [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
-* [Levenshtein](#levenshtein)
-* [Normalized Levenshtein](#normalized-levenshtein)
-* [Weighted Levenshtein](#weighted-levenshtein)
-* [Damerau-Levenshtein](#damerau-levenshtein)
-* [Optimal String Alignment](#optimal-string-alignment)
-* [Jaro-Winkler](#jaro-winkler)
-* [Longest Common Subsequence](#longest-common-subsequence)
-* [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
-* [N-Gram](#n-gram)
-* [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
-* [Q-Gram](#shingle-n-gram-based-algorithms)
-* [Cosine similarity](#shingle-n-gram-based-algorithms)
-* [Jaccard index](#shingle-n-gram-based-algorithms)
-* [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms)
-* [Experimental](#experimental)
-* [SIFT4](#sift4)
-* [Users](#users)
+- [python-string-similarity](#python-string-similarity)
+- [Download](#download)
+- [Overview](#overview)
+- [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
+- [(Normalized) similarity and distance](#normalized-similarity-and-distance)
+- [Metric distances](#metric-distances)
+- [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
+- [Levenshtein](#levenshtein)
+- [Normalized Levenshtein](#normalized-levenshtein)
+- [Weighted Levenshtein](#weighted-levenshtein)
+- [Damerau-Levenshtein](#damerau-levenshtein)
+- [Optimal String Alignment](#optimal-string-alignment)
+- [Jaro-Winkler](#jaro-winkler)
+- [Longest Common Subsequence](#longest-common-subsequence)
+- [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
+- [N-Gram](#n-gram)
+- [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
+- [Q-Gram](#q-gram)
+- [Cosine similarity](#cosine-similarity)
+- [Jaccard index](#jaccard-index)
+- [Sorensen-Dice coefficient](#sorensen-dice-coefficient)
+- [Overlap coefficient (i.e., Szymkiewicz-Simpson)](#overlap-coefficient-ie-szymkiewicz-simpson)
+- [Experimental](#experimental)
+- [SIFT4](#sift4)
+- [Users](#users)


 ## Download
## Download
@@ -55,6 +59,7 @@ The main characteristics of each implemented algorithm are presented below. The
 | [Cosine similarity](#cosine-similarity) |similarity<br>distance | Yes | No | Profile | O(m+n) | |
 | [Jaccard index](#jaccard-index) |similarity<br>distance | Yes | Yes | Set | O(m+n) | |
 | [Sorensen-Dice coefficient](#sorensen-dice-coefficient) |similarity<br>distance | Yes | No | Set | O(m+n) | |
+| [Overlap coefficient](#overlap-coefficient-ie-szymkiewicz-simpson) |similarity<br>distance | Yes | No | Set | O(m+n) | |

 [1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the **dynamic programming** method, which has a cost of O(m.n). For Levenshtein distance, the algorithm is sometimes called the **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.

@@ -360,6 +365,11 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in

 Distance is computed as 1 - similarity.

+### Overlap coefficient (i.e., Szymkiewicz-Simpson)
+Very similar to the Jaccard and Sorensen-Dice measures, but here the similarity is computed as |V1 inter V2| / Min(|V1|,|V2|). For non-empty shingle sets, its score is never lower than the Jaccard or Sorensen-Dice score for the same pair. It always returns the highest similarity score (1) when the shingle set of one string is a subset of the other's.
+
+Distance is computed as 1 - similarity.
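
To make the relationship between the three set-based measures concrete, here is a minimal standalone sketch. It does not use the library: `shingles` is a hypothetical, simplified stand-in for the library's profile computation, and the three functions simply apply the formulas quoted in the README text to plain Python sets.

```python
def shingles(s, k=3):
    """Return the set of k-character shingles of s.

    Hypothetical helper for illustration only; it is not the library's
    get_profile and does no whitespace normalisation.
    """
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(v1, v2):
    # |V1 inter V2| / |V1 union V2|
    return len(v1 & v2) / len(v1 | v2)


def sorensen_dice(v1, v2):
    # 2 * |V1 inter V2| / (|V1| + |V2|)
    return 2 * len(v1 & v2) / (len(v1) + len(v2))


def overlap(v1, v2):
    # |V1 inter V2| / min(|V1|, |V2|); both sets must be non-empty
    return len(v1 & v2) / min(len(v1), len(v2))


v1, v2 = shingles("eat"), shingles("eating")
# v1 = {"eat"} is a subset of v2 = {"eat", "ati", "tin", "ing"}
print(jaccard(v1, v2), sorensen_dice(v1, v2), overlap(v1, v2))  # 0.25 0.4 1.0
```

Because min(|V1|, |V2|) is never larger than (|V1| + |V2|) / 2, the overlap score cannot fall below the Sorensen-Dice score, which in turn cannot fall below the Jaccard score.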
+
 ## Experimental

 ### SIFT4

strsimpy/overlap_coefficient.py

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+from .shingle_based import ShingleBased
+from .string_distance import NormalizedStringDistance
+from .string_similarity import NormalizedStringSimilarity
+
+
+class OverlapCoefficient(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
+
+    def __init__(self, k=3):
+        super().__init__(k)
+
+    def distance(self, s0, s1):
+        return 1.0 - self.similarity(s0, s1)
+
+    def similarity(self, s0, s1):
+        if s0 is None:
+            raise TypeError("Argument s0 is NoneType.")
+        if s1 is None:
+            raise TypeError("Argument s1 is NoneType.")
+        if s0 == s1:
+            return 1.0
+        union = set()
+        profile0, profile1 = self.get_profile(s0), self.get_profile(s1)
+        for k in profile0.keys():
+            union.add(k)
+        for k in profile1.keys():
+            union.add(k)
+        inter = int(len(profile0.keys()) + len(profile1.keys()) - len(union))
+        return inter / min(len(profile0),len(profile1))
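
One edge case in `similarity` may be worth a second look: if either profile is empty, the final division is by zero. This would happen for two different strings where one is shorter than `k`, assuming `get_profile` returns an empty profile in that case. The following is a minimal standalone sketch of a guarded variant, not the committed code. `_shingle_set` is a hypothetical stand-in for `get_profile(...).keys()`, and returning 0.0 for an empty shingle set is an assumption about the desired behaviour.

```python
def _shingle_set(s, k):
    # Hypothetical stand-in for the keys of ShingleBased.get_profile(s);
    # yields an empty set when len(s) < k.
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def overlap_similarity(s0, s1, k=3):
    """Overlap coefficient of two strings, guarded against empty shingle sets."""
    if s0 is None:
        raise TypeError("Argument s0 is NoneType.")
    if s1 is None:
        raise TypeError("Argument s1 is NoneType.")
    if s0 == s1:
        return 1.0
    set0, set1 = _shingle_set(s0, k), _shingle_set(s1, k)
    smaller = min(len(set0), len(set1))
    if smaller == 0:
        # Assumption: with no shingles to compare, report no similarity
        # instead of raising ZeroDivisionError.
        return 0.0
    # Set intersection replaces the |A| + |B| - |A union B| bookkeeping.
    return len(set0 & set1) / smaller


def overlap_distance(s0, s1, k=3):
    return 1.0 - overlap_similarity(s0, s1, k)
```

With this guard, `overlap_similarity("ab", "cd")` returns 0.0 instead of raising, while the values for longer inputs such as `"eat"` and `"eating"` are unchanged.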
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+import unittest
+
+from strsimpy.overlap_coefficient import OverlapCoefficient
+
+class TestOverlapCoefficient(unittest.TestCase):
+
+    def test_overlap_coefficient_onestringissubsetofother_return0(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.distance(s1,s2)
+        print("distance: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(0,actual)
+
+    def test_overlap_coefficient_onestringissubset_return1(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.similarity(s1,s2)
+        print("strsim: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(1,actual)
+
+    def test_overlap_coefficient_onestringissubsetofother_return1(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.similarity(s1,s2)
+        print("strsim: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(1,actual)
+
+    def test_overlap_coefficient_halfsimilar_return1(self):
+        sim = OverlapCoefficient(2)
+        s1,s2 = "car","bar"
+        self.assertEqual(1/2,sim.similarity(s1,s2))
+        self.assertEqual(1/2,sim.distance(s1,s2))
+
+if __name__ == "__main__":
+    unittest.main()
