Commit f44806c

Code comments from JS version; minor doc updates
1 parent 4ee6dcd commit f44806c

4 files changed, +215 -25 lines changed

README.md

Lines changed: 11 additions & 6 deletions
@@ -32,10 +32,15 @@ Please choose 'B' or 'A': B
3232
**References**
3333

3434
1. Ford, L. R., & Johnson, S. M. (1959). A Tournament Problem.
35-
The American Mathematical Monthly, 66(5), 387-389. <[https://doi.org/10.1080/00029890.1959.11989306](https://doi.org/10.1080/00029890.1959.11989306)>
35+
The American Mathematical Monthly, 66(5), 387-389. [https://doi.org/10.1080/00029890.1959.11989306](https://doi.org/10.1080/00029890.1959.11989306)
3636
2. Knuth, D. E. (1998). The Art of Computer Programming: Volume 3: Sorting and Searching (2nd ed.).
37-
Addison-Wesley. <[https://cs.stanford.edu/~knuth/taocp.html#vol3](https://cs.stanford.edu/~knuth/taocp.html#vol3)>
38-
3. <[https://en.wikipedia.org/wiki/Merge-insertion_sort](https://en.wikipedia.org/wiki/Merge-insertion_sort)>
37+
Addison-Wesley. [https://cs.stanford.edu/~knuth/taocp.html#vol3](https://cs.stanford.edu/~knuth/taocp.html#vol3)
38+
3. [https://en.wikipedia.org/wiki/Merge-insertion_sort](https://en.wikipedia.org/wiki/Merge-insertion_sort)
39+
40+
## See Also
41+
42+
* JavaScript / TypeScript version: [https://www.npmjs.com/package/merge-insertion](https://www.npmjs.com/package/merge-insertion)
43+
* This algorithm in action: [https://haukex.github.io/pairrank/](https://haukex.github.io/pairrank/) (select “Efficient”)
3944

4045
## API
4146

@@ -53,7 +58,7 @@ alias of TypeVar(‘T’)
5358
### merge_insertion.Comparator
5459

5560
A user-supplied async function to compare two items.
56-
The argument is a tuple of the two items to be compared; they must not be equal.
61+
The single argument is a tuple of the two items to be compared; they must not be equal.
5762
Must return 0 if the first item is ranked higher, or 1 if the second item is ranked higher.
5863

5964
<a id="merge_insertion.merge_insertion_sort"></a>
@@ -63,8 +68,8 @@ Must return 0 if the first item is ranked higher, or 1 if the second item is ran
6368
Merge-Insertion Sort (Ford-Johnson algorithm) with async comparison.
6469

6570
* **Parameters:**
66-
* **array** – Array of to sort. Duplicate items are not allowed.
67-
* **comparator** – Async comparison function.
71+
* **array** – Array to sort. **Duplicate items are not allowed.**
72+
* **comparator** – Async comparison function as described in [`Comparator`](#merge_insertion.Comparator).
6873
* **Returns:**
6974
A shallow copy of the array sorted in ascending order.
7075

merge_insertion/__init__.py

Lines changed: 78 additions & 12 deletions
@@ -30,19 +30,24 @@
3030
**References**
3131
3232
1. Ford, L. R., & Johnson, S. M. (1959). A Tournament Problem.
33-
The American Mathematical Monthly, 66(5), 387-389. <https://doi.org/10.1080/00029890.1959.11989306>
33+
The American Mathematical Monthly, 66(5), 387-389. https://doi.org/10.1080/00029890.1959.11989306
3434
2. Knuth, D. E. (1998). The Art of Computer Programming: Volume 3: Sorting and Searching (2nd ed.).
35-
Addison-Wesley. <https://cs.stanford.edu/~knuth/taocp.html#vol3>
36-
3. <https://en.wikipedia.org/wiki/Merge-insertion_sort>
35+
Addison-Wesley. https://cs.stanford.edu/~knuth/taocp.html#vol3
36+
3. https://en.wikipedia.org/wiki/Merge-insertion_sort
37+
38+
See Also
39+
--------
40+
41+
* JavaScript / TypeScript version: https://www.npmjs.com/package/merge-insertion
42+
43+
* This algorithm in action: https://haukex.github.io/pairrank/ (select "Efficient")
3744
3845
API
3946
---
4047
4148
.. autoclass:: merge_insertion.T
42-
:members:
4349
4450
.. autoclass:: merge_insertion.Comparator
45-
:members:
4651
4752
.. autofunction:: merge_insertion.merge_insertion_sort
4853
@@ -69,15 +74,15 @@
6974
from typing import TypeVar, Literal
7075
from math import floor, ceil, log2
7176

72-
# NOTICE: This file contains very few code comments because it is a port of
73-
# https://github.com/haukex/merge-insertion.js/blob/main/src/merge-insertion.ts
74-
# Please see that file for detailed code comments and explanations.
75-
7677
#: A type of object that can be compared by a :class:`Comparator` and therefore sorted by
7778
#: :func:`merge_insertion_sort`. Must have sensible support for the equality operators.
7879
T = TypeVar('T')
7980

81+
# Helper that generates the group sizes for _make_groups.
8082
def _group_sizes() -> Generator[int, None, None]:
83+
# <https://en.wikipedia.org/wiki/Merge-insertion_sort>:
84+
# "... the sums of sizes of every two adjacent groups form a sequence of powers of two."
85+
# <https://oeis.org/A014113>: a(0) = 0 and if n>=1, a(n) = 2^n - a(n-1).
8186
prev :int = 0
8287
i :int = 1
8388
while True:
@@ -86,6 +91,8 @@ def _group_sizes() -> Generator[int, None, None]:
8691
prev = cur
8792
i += 1
8893

94+
# Helper function to group and reorder items to be inserted via binary search.
95+
# See also the description within the code of merge_insertion_sort.
8996
def _make_groups(array :Sequence[T]) -> Sequence[tuple[int, T]]:
9097
items = list(enumerate(array))
9198
rv :list[tuple[int, T]] = []
@@ -102,10 +109,12 @@ def _make_groups(array :Sequence[T]) -> Sequence[tuple[int, T]]:
102109
return rv
103110

104111
#: A user-supplied async function to compare two items.
105-
#: The argument is a tuple of the two items to be compared; they must not be equal.
112+
#: The single argument is a tuple of the two items to be compared; they must not be equal.
106113
#: Must return 0 if the first item is ranked higher, or 1 if the second item is ranked higher.
107114
Comparator = Callable[[tuple[T, T]], Awaitable[Literal[0, 1]]]
108115

116+
# Helper function to insert an item into a sorted array via binary search.
117+
# Returns the index **before** which to insert the new item, e.g. `array.insert(index, item)`
109118
async def _bin_insert_index(array :Sequence[T], item :T, comp :Comparator) -> int:
110119
if not array:
111120
return 0
@@ -122,6 +131,7 @@ async def _bin_insert_index(array :Sequence[T], item :T, comp :Comparator) -> in
122131
left = mid + 1
123132
return left
124133

134+
# Finds the index of an object in an array by object identity (instead of equality).
125135
def _ident_find(array :Sequence[T], item :T) -> int:
126136
for i,e in enumerate(array):
127137
if e is item:
@@ -135,6 +145,7 @@ async def merge_insertion_sort(array :Sequence[T], comparator :Comparator) -> Se
135145
:param comparator: Async comparison function as described in :class:`Comparator`.
136146
:return: A shallow copy of the array sorted in ascending order.
137147
"""
148+
# Special cases and error checking
138149
if len(array)<1:
139150
return []
140151
if len(array)==1:
@@ -144,30 +155,84 @@ async def merge_insertion_sort(array :Sequence[T], comparator :Comparator) -> Se
144155
if len(array)==2:
145156
return list(array) if await comparator((array[0], array[1])) else [array[1], array[0]]
146157

147-
pairs :dict[T, T] = {}
158+
# Algorithm description adapted and expanded from <https://en.wikipedia.org/wiki/Merge-insertion_sort>:
159+
# 1. Group the items into ⌊n/2⌋ pairs of elements, arbitrarily, leaving one element unpaired if there is an odd number of elements.
160+
# 2. Perform ⌊n/2⌋ comparisons, one per pair, to determine the larger of the two elements in each pair.
161+
pairs :dict[T, T] = {} # keys are the larger items, values the smaller ones
148162
for i in range(0, len(array)-1, 2):
149163
if await comparator((array[i], array[i+1])):
150164
pairs[array[i+1]] = array[i]
151165
else:
152166
pairs[array[i]] = array[i+1]
153167

168+
# 3. Recursively sort the ⌊n/2⌋ larger elements from each pair, creating an initial sorted output sequence
169+
# of ⌊n/2⌋ of the input elements, in ascending order, using the merge-insertion sort.
154170
larger = await merge_insertion_sort(list(pairs), comparator)
155171

172+
# Build the "main chain" data structure we will use to insert items into (explained a bit more below), while also:
173+
# 4. Insert at the start of the sorted sequence the element that was paired with
174+
# the first and smallest element of the sorted sequence.
175+
# Note that we know the main chain has at least one item here due to the special cases at the beginning of this function.
156176
main_chain :list[list[T]] = [ [ pairs[larger[0]] ], [ larger[0] ] ] + [ [ la, pairs[la] ] for la in larger[1:] ]
157177
assert all( len(i)==2 for i in main_chain[2:] )
158178

179+
# 5. Insert the remaining ⌈n/2⌉−1 items that are not yet in the sorted output sequence into that sequence,
180+
# one at a time, with a specially chosen insertion ordering, as follows:
181+
#
182+
# a. Partition the un-inserted elements yᵢ into groups with contiguous indexes.
183+
# There are two elements y₃ and y₄ in the first group¹, and the sums of sizes of every two adjacent
184+
# groups form a sequence of powers of two. Thus, the sizes of groups are: 2, 2, 6, 10, 22, 42, ...
185+
# b. Order the un-inserted elements by their groups (smaller indexes to larger indexes), but within each
186+
# group order them from larger indexes to smaller indexes. Thus, the ordering becomes:
187+
# y₄, y₃, y₆, y₅, y₁₂, y₁₁, y₁₀, y₉, y₈, y₇, y₂₂, y₂₁, ...
188+
# c. Use this ordering to insert the elements yᵢ into the output sequence. For each element yᵢ,
189+
# use a binary search from the start of the output sequence up to but not including xᵢ to determine
190+
# where to insert yᵢ.²
191+
#
192+
# ¹ My explanation: The items already in the sorted output sequence (the larger elements of each pair) are
193+
# labeled xᵢ and the yet unsorted (smaller) elements are labeled yᵢ, with i starting at 1. However, due
194+
# to step 4 above, the item that would have been labeled y₁ has actually already become element x₁, and
195+
# therefore the element that would have been x₁ is now x₂ and no longer has a paired yᵢ element. It
196+
# follows that the first paired elements are x₃ and y₃, and so the first unsorted element to be inserted
197+
# into the output sequence is y₃. Also noteworthy is that if the input had an odd number of elements,
198+
# the leftover unpaired element is treated as the last yᵢ element.
199+
#
200+
# ² In my opinion, this is lacking detail, and this seems to be true for the other two sources (Ford-Johnson
201+
# and Knuth) as well. So here is my attempt at adding more details to the explanation: The "main chain" is
202+
# always kept in sorted order, therefore, for each item of the main chain that has an associated `smaller`
203+
# item, we know that this smaller item must be inserted *before* that main chain item. The problem I see
204+
# with the various descriptions is that they don't explicitly explain that the insertion process shifts all
205+
# the indices of the array, and due to the nonlinear insertion order, this makes it tricky to keep track of
206+
# the correct array indices over which to perform the insertion search. So instead, below, I use a linear
207+
# search to find the main chain item being operated on each time, which is expensive, but much easier. It
208+
# should also be noted that the leftover unpaired element, if there is one, gets inserted across the whole
209+
# main chain as it exists at the time of its insertion - it may not be inserted last. So even though there
210+
# is still some optimization potential, this algorithm is used in cases where the comparisons are much more
211+
# expensive than the rest of the algorithm, so the cost is acceptable for now.
212+
213+
# Iterate over the groups to be inserted, which are built from the main chain as explained above (in the
214+
# current implementation we don't need the original indices returned by _make_groups). Also, if there was
215+
# a leftover item from an odd input length, treat it as the last "smaller" item. We'll use the fact that
216+
# at this point, all main_chain items contain two elements, so we'll mark the leftover item as a special
217+
# case by having it be the only item with one element.
159218
for _,pair in _make_groups( main_chain[2:] + ( [[array[-1]]] if len(array) % 2 else [] ) ):
160-
if len(pair)==1:
219+
# Determine which item to insert and where.
220+
if len(pair)==1: # See explanation of this special case above.
221+
# This is the leftover item, it gets inserted into the current whole main chain.
161222
item = pair[0]
162223
idx = await _bin_insert_index([ i[0] for i in main_chain ], item, comparator)
163224
else:
164225
assert len(pair)==2
226+
# Locate the pair we're about to insert in the main chain, to limit the extent of the binary search (see also explanation above).
165227
pair_idx = _ident_find(main_chain, pair)
166228
item = pair.pop()
229+
# Locate the index in the main chain where the pair's smaller item needs to be inserted.
167230
idx = await _bin_insert_index([ i[0] for i in main_chain[:pair_idx] ], item, comparator)
231+
# Actually do the insertion.
168232
main_chain.insert(idx, [item])
169233
assert all( len(i)==1 for i in main_chain )
170234

235+
# Turn the "main chain" data structure back into an array of values.
171236
return [ i[0] for i in main_chain ]
172237

173238
def merge_insertion_max_comparisons(n :int) -> int:
@@ -178,4 +243,5 @@ def merge_insertion_max_comparisons(n :int) -> int:
178243
"""
179244
if n<0:
180245
raise ValueError("must specify zero or more items")
246+
# Formula from https://en.wikipedia.org/wiki/Merge-insertion_sort (the sum version should work too)
181247
return n*ceil(log2(3*n/4)) - floor((2**floor(log2(6*n)))/3) + floor(log2(6*n)/2) if n else 0
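
Note on the added comment that "the sum version should work too": the sketch below compares the closed form from the return statement above with the sum form, under the assumption that the sum form is the sum of ⌈log2(3i/4)⌉ for i = 1..n as given on the referenced Wikipedia page. The function names `max_comparisons_closed` and `max_comparisons_sum` are made up for this illustration.

```python
from math import ceil, floor, log2


def max_comparisons_closed(n: int) -> int:
    """Closed form, mirroring the return statement in the diff."""
    if n < 0:
        raise ValueError("must specify zero or more items")
    if n == 0:
        return 0
    return (n * ceil(log2(3 * n / 4))
            - floor((2 ** floor(log2(6 * n))) / 3)
            + floor(log2(6 * n) / 2))


def max_comparisons_sum(n: int) -> int:
    """Sum form (assumed): add up ceil(log2(3*i/4)) for i = 1..n."""
    return sum(ceil(log2(3 * i / 4)) for i in range(1, n + 1))


# Print both forms side by side for small inputs.
for n in range(12):
    print(n, max_comparisons_closed(n), max_comparisons_sum(n))
```

For small inputs the two forms agree, which supports the remark in the comment. This sketch does not test how the floating-point `log2` behaves for very large `n`.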

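Note on the group sizes and insertion order described in the new comments for `_group_sizes` and step 5 of `merge_insertion_sort`: the following is a minimal standalone sketch, not the package's code. It reconstructs the size sequence from the quoted recurrence a(n) = 2^n - a(n-1) and assumes that a final, incomplete group is simply truncated. The names `group_sizes` and `insertion_order` are hypothetical.

```python
from itertools import islice
from typing import Iterator


def group_sizes() -> Iterator[int]:
    """Yield 2, 2, 6, 10, 22, 42, ...: each size is 2**i minus the previous size."""
    prev = 0
    i = 1
    while True:
        cur = 2**i - prev
        yield cur
        prev = cur
        i += 1


def insertion_order(count: int) -> list[int]:
    """Return the indexes i of `count` pending items y_i (starting at y_3) in insertion order."""
    labels = list(range(3, 3 + count))
    order: list[int] = []
    start = 0
    for size in group_sizes():
        if start >= len(labels):
            break
        # Groups are taken in ascending index order, but each group is
        # emitted from its largest index down to its smallest.
        group = labels[start:start + size]  # assumption: last group may be short
        order.extend(reversed(group))
        start += size
    return order


print(list(islice(group_sizes(), 6)))
print(insertion_order(10))
```

For ten pending items this yields the indexes 4, 3, 6, 5, 12, 11, 10, 9, 8, 7, which matches the ordering y₄, y₃, y₆, y₅, y₁₂, ..., y₇ quoted in step 5b of the comments.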
pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
55
[project]
66
name = "merge-insertion"
77
description = "The merge-insertion sort (aka the Ford-Johnson algorithm) is optimized for using few comparisons."
8-
version = "1.1.0"
8+
version = "1.1.1"
99
authors = [ { name="Hauke Dämpfling", email="haukex@zero-g.net" } ]
1010
readme = "README.md"
1111
requires-python = ">=3.10"
