This repository was archived by the owner on Aug 5, 2024. It is now read-only.

Commit d0a578f

Adjust Python3 code
I'm not sure I made the right assumptions about Python3's Unicode handling in my first patch. By hand-constructing the `diffs` value in the test, I created a sequence of code units that `diff_main` in Python3 would _not_ have produced, because it operates on Unicode code points natively when finding the common prefix. I therefore do not think the Python3 library experienced this problem as the others did. It _has_, however, been reporting diff lengths differently than the other languages, and I have left that change in place. Of note, we do not appear to have true harmony between the languages despite appearances. The `lua` wiki page makes this clear, but with Python we at least have the ability to harmonize the meaning of the lengths, and I have done that in this change.
1 parent 4fc0073 commit d0a578f
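
The code-point argument in the commit message can be illustrated with a minimal sketch. The helpers `common_prefix_len` and `utf16_units` are hypothetical names introduced here for illustration and are not part of the library. The sketch compares the two test strings first as Python 3 code points and then as UTF-16 code units.

```python
import struct


def common_prefix_len(a, b):
    """Count the leading elements shared by two sequences (illustrative helper)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (illustrative helper)."""
    raw = s.encode("utf-16-be", "surrogatepass")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


text1 = "\U0001F64B\U0001F64B"
text2 = "\U0001F64B\U0001F64C\U0001F64B"

# Python 3 strings are sequences of code points, so the shared prefix
# is one whole character and no surrogate pair can be split.
print(common_prefix_len(text1, text2))  # 1

# Over UTF-16 code units the shared prefix is D83D DE4B D83D, which ends
# on a lone high surrogate. This is the situation the hand-built test
# data imitated.
print(common_prefix_len(utf16_units(text1), utf16_units(text2)))  # 3
```

Under these assumptions, a prefix scan over Python 3 `str` values stops on a character boundary, while the same scan over code units stops in the middle of a surrogate pair.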

2 files changed: +1 -23 lines changed

python3/diff_match_patch.py

Lines changed: 0 additions & 22 deletions
@@ -1134,16 +1134,6 @@ def diff_levenshtein(self, diffs):
     levenshtein += max(insertions, deletions)
     return levenshtein
 
-  @classmethod
-  def is_high_surrogate(cls, utf16be_bytes):
-    c = struct.unpack('>H', utf16be_bytes)[0]
-    return c >= 0xd800 and c <= 0xdbff
-
-  @classmethod
-  def is_low_surrogate(cls, utf16be_bytes):
-    c = struct.unpack('>H', utf16be_bytes)[0]
-    return c >= 0xdc00 and c <= 0xdfff
-
   def diff_toDelta(self, diffs):
     """Crush the diff into an encoded string which describes the operations
     required to transform text1 into text2.
@@ -1159,18 +1149,6 @@ def diff_toDelta(self, diffs):
     text = []
     last_end = None
     for (op, data) in diffs:
-      encoded = data.encode('utf-16be', 'surrogatepass')
-      this_top = encoded[0:2]
-      this_end = encoded[-2:]
-
-      if self.is_high_surrogate(this_end):
-        encoded = encoded[0:-2]
-
-      if last_end and self.is_high_surrogate(last_end) and self.is_low_surrogate(this_top):
-        encoded = last_end + encoded
-
-      data = encoded.decode('utf-16be', 'surrogateescape')
-      last_end = this_end
       if op == self.DIFF_INSERT:
         # High ascii will raise UnicodeDecodeError. Use Unicode instead.
         data = data.encode("utf-8")

python3/tests/diff_match_patch_test.py

Lines changed: 1 addition & 1 deletion
@@ -445,7 +445,7 @@ def testDiffDelta(self):
     # Convert delta string into a diff.
     self.assertEqual(diffs, self.dmp.diff_fromDelta(text1, delta))
 
-    diffs = [(self.dmp.DIFF_EQUAL, "\ud83d\ude4b\ud83d"), (self.dmp.DIFF_INSERT, "\ude4c\ud83d"), (self.dmp.DIFF_EQUAL, "\ude4b")]
+    diffs = self.dmp.diff_main("\U0001F64B\U0001F64B", "\U0001F64B\U0001F64C\U0001F64B")
     delta = self.dmp.diff_toDelta(diffs)
     self.assertEqual("=2\t+%F0%9F%99%8C\t=2", delta)
 
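
The expected delta in the updated test can be reproduced with a small standalone sketch. It assumes that equalities and deletions are counted in UTF-16 code units and that insertions are percent-encoded as UTF-8. `to_delta_sketch` and `utf16_len` are hypothetical helpers written for illustration and are not the library's implementation. The `diffs` list is the result one would expect from a code-point-based `diff_main` on the two test strings.

```python
from urllib.parse import quote

# Operation codes, mirroring the library's convention (assumed here).
DIFF_DELETE, DIFF_INSERT, DIFF_EQUAL = -1, 1, 0


def utf16_len(text):
    """Length of text in UTF-16 code units (illustrative helper)."""
    return len(text.encode("utf-16-be", "surrogatepass")) // 2


def to_delta_sketch(diffs):
    """Build a delta string, counting lengths in UTF-16 code units (sketch)."""
    tokens = []
    for op, data in diffs:
        if op == DIFF_INSERT:
            # quote() encodes str input as UTF-8 before percent-encoding.
            tokens.append("+" + quote(data, "!~*'();/?:@&=+$,# "))
        elif op == DIFF_DELETE:
            tokens.append("-%d" % utf16_len(data))
        else:
            tokens.append("=%d" % utf16_len(data))
    return "\t".join(tokens)


# One code point on each side of the insertion; each is two UTF-16 units.
diffs = [
    (DIFF_EQUAL, "\U0001F64B"),
    (DIFF_INSERT, "\U0001F64C"),
    (DIFF_EQUAL, "\U0001F64B"),
]
print(repr(to_delta_sketch(diffs)))  # '=2\t+%F0%9F%99%8C\t=2'
```

Each equality spans one code point but is reported as `=2`, which is what counting in UTF-16 code units would give and matches the length convention the commit message describes for the other languages.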
