Commit da5475e

change default method to :hash for contains and related functions
1 parent 107b15b commit da5475e

2 files changed: +16 -16 lines changed

docs/src/man/joins.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ The main functions for combining two data sets are `leftjoin`, `innerjoin`, `out

 See [the Wikipedia page on SQL joins](https://en.wikipedia.org/wiki/Join_(SQL)) for more information.

-By default, to match observations, InMemoryDatasets sorts the right data set and uses a binary search algorithm for finding the matches of each observation in the left data set in the right data set based on the passed key column(s), thus, it has better performance when the left data set is larger than the right data set. However, passing `method = :hash` changes the default. The matching is done based on the formatted values of the key column(s), however, using the `mapformats` keyword argument, one may set it to `false` for one or both data sets.
+By default (except for `semijoin` and `antijoin`), to match observations, InMemoryDatasets sorts the right data set and uses a binary search algorithm for finding the matches of each observation in the left data set in the right data set based on the passed key column(s), thus, it has better performance when the left data set is larger than the right data set. However, passing `method = :hash` changes the default. The matching is done based on the formatted values of the key column(s), however, using the `mapformats` keyword argument, one may set it to `false` for one or both data sets.

 For `leftjoin` and `innerjoin` the order of observations of the output data set is the same as their order in the left data set. However, the order of observations from the right table depends on the stability of the sort algorithm. User can set the `stable` keyword argument to `true` to guarantee a stable sort. For `outerjoin` the order of observations from the left data set in the output data set is also the same as their order in the original data set, however, for those observations which are from the right table, there is no specific order.

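The documentation hunk above contrasts two ways of finding matches: sorting the right-hand keys and binary-searching them, or hashing them. The following is a minimal sketch of that difference in plain Julia, using only `Base`. The helper names `contains_sort` and `contains_hash` are hypothetical and are not part of InMemoryDatasets; the package's actual implementation is not shown in this commit and is likely more involved.

```julia
# Hypothetical helper: sort the right keys once, then binary-search each left key.
function contains_sort(left::AbstractVector, right::AbstractVector)
    sorted = sort(right)
    # searchsorted returns the (possibly empty) range of positions equal to k
    return [!isempty(searchsorted(sorted, k)) for k in left]
end

# Hypothetical helper: build a hash set of the right keys, then probe it per left key.
function contains_hash(left::AbstractVector, right::AbstractVector)
    lookup = Set(right)
    return [k in lookup for k in left]
end

left = [1, 2, 3, 4, 2]
right = [4, 2, 7]

by_sort = contains_sort(left, right)
by_hash = contains_hash(left, right)
```

Both helpers return the same boolean vector; they differ only in how the right-hand keys are prepared for lookup, which is the trade-off the `method` keyword exposes.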
src/join/main.jl

Lines changed: 15 additions & 15 deletions
@@ -458,12 +458,12 @@ function DataAPI.outerjoin(dsl::AbstractDataset, dsr::AbstractDataset; on = noth
 end

 """
-contains(main, transaction; on, mapformats = true, alg = HeapSort, stable = false, accelerate = false, method = :sort)
+contains(main, transaction; on, mapformats = true, alg = HeapSort, stable = false, accelerate = false, method = :hash)

 returns a boolean vector where is true when the key for the
 corresponding row in the `main` data set is found in the transaction data set.

-- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:sort`
+- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:hash`

 # Examples

@@ -505,7 +505,7 @@ julia> contains(main, tds, on = :g1 => :group)
 1
 ```
 """
-function Base.contains(main::AbstractDataset, transaction::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, method = :sort, threads::Bool = true)
+function Base.contains(main::AbstractDataset, transaction::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, method = :hash, threads::Bool = true)
 !(method in (:hash, :sort)) && throw(ArgumentError("method must be :hash or :sort"))
 on === nothing && throw(ArgumentError("`on` keyword must be specified"))
 if !(on isa AbstractVector)
@@ -536,7 +536,7 @@ function Base.contains(main::AbstractDataset, transaction::AbstractDataset; on =
 end

 """
-antijoin(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, view = false, accelerate = false, method = :sort)
+antijoin(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, view = false, accelerate = false, method = :hash)

 Opposite to `semijoin`, perform an anti join of two `Datasets`: `dsl` and `dsr`, and return a `Dataset`
 containing rows where keys appear in `dsl` but not in `dsr`.
@@ -556,7 +556,7 @@ rows that have key values appear in `dsr` will be removed.
 you can use the function `getformat` to see the format;
 by setting `mapformats` to a `Bool Vector` of length 2, you can specify whether to use formatted values
 for `dsl` and `dsr`, respectively; for example, passing a `[true, false]` means use formatted values for `dsl` and do not use formatted values for `dsr`.
-- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:sort`
+- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:hash`
 - `alg`: sorting algorithms used, is `HeapSort` (the Heap Sort algorithm) by default;
 it can also be `QuickSort` (the Quicksort algorithm).
 - `stable`: by default is `false`, means that the sorting results have not to be stable;
@@ -639,7 +639,7 @@ julia> antijoin(dsl, dsr, on = :year, mapformats = true) # Use formats for datas
 1 │ 2012 true
 ```
 """
-function DataAPI.antijoin(dsl::AbstractDataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, view = false, method = :sort, threads = true)
+function DataAPI.antijoin(dsl::AbstractDataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, view = false, method = :hash, threads = true)
 !(method in (:hash, :sort)) && throw(ArgumentError("method must be :hash or :sort"))
 if view
 Base.view(dsl, .!contains(dsl, dsr, on = on, mapformats = mapformats, stable = stable, alg = alg, accelerate = accelerate, method = method, threads = threads), :)
@@ -648,7 +648,7 @@ function DataAPI.antijoin(dsl::AbstractDataset, dsr::AbstractDataset; on = nothi
 end
 end
 """
-semijoin(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, view = false, accelerate = false, method = :sort)
+semijoin(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, view = false, accelerate = false, method = :hash)

 Perform a semi join of two `Datasets`: `dsl` and `dsr`, and return a `Dataset`
 containing rows where keys appear in `dsl` and `dsr`.
@@ -668,7 +668,7 @@ rows that have values in `dsl` while do not have matching values `on` keys in `d
 you can use the function `getformat` to see the format;
 by setting `mapformats` to a `Bool Vector` of length 2, you can specify whether to use formatted values
 for `dsl` and `dsr`, respectively; for example, passing a `[true, false]` means use formatted values for `dsl` and do not use formatted values for `dsr`.
-- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:sort`
+- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:hash`
 - `alg`: sorting algorithms used, is `HeapSort` (the Heap Sort algorithm) by default;
 it can also be `QuickSort` (the Quicksort algorithm).
 - `stable`: by default is `false`, means that the sorting results have not to be stable;
@@ -753,7 +753,7 @@ julia> semijoin(dsl, dsr, on = :year, mapformats = true) # Use formats for datas
 3 │ 2020 true
 ```
 """
-function DataAPI.semijoin(dsl::AbstractDataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, view = false, method = :sort, threads = true)
+function DataAPI.semijoin(dsl::AbstractDataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, view = false, method = :hash, threads = true)
 !(method in (:hash, :sort)) && throw(ArgumentError("method must be :hash or :sort"))
 if view
 Base.view(dsl, contains(dsl, dsr, on = on, mapformats = mapformats, stable = stable, alg = alg, accelerate = accelerate, method = method, threads = threads), :)
@@ -762,7 +762,7 @@ function DataAPI.semijoin(dsl::AbstractDataset, dsr::AbstractDataset; on = nothi
 end
 end
 """
-antijoin!(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, accelerate = false, method = :sort)
+antijoin!(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, accelerate = false, method = :hash)

 Opposite to `semijoin`, perform an anti join of two `Datasets`: `dsl` and `dsr`, and change the left table `dsl` into a `Dataset`
 containing rows where keys appear in `dsl` but not in `dsr`.
@@ -782,7 +782,7 @@ rows that have key values appear in `dsr` will be removed.
 you can use the function `getformat` to see the format;
 by setting `mapformats` to a `Bool Vector` of length 2, you can specify whether to use formatted values
 for `dsl` and `dsr`, respectively; for example, passing a `[true, false]` means use formatted values for `dsl` and do not use formatted values for `dsr`.
-- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:sort`
+- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:hash`
 - `alg`: sorting algorithms used, is `HeapSort` (the Heap Sort algorithm) by default;
 it can also be `QuickSort` (the Quicksort algorithm).
 - `stable`: by default is `false`, means that the sorting results have not to be stable;
@@ -880,12 +880,12 @@ julia> dsl
 1 │ 2012 true
 ```
 """
-function antijoin!(dsl::Dataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, method = :sort, threads = true)
+function antijoin!(dsl::Dataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, method = :hash, threads = true)
 !(method in (:hash, :sort)) && throw(ArgumentError("method must be :hash or :sort"))
 deleteat!(dsl, contains(dsl, dsr, on = on, mapformats = mapformats, stable = stable, alg = alg, accelerate = accelerate, method = method, threads = threads))
 end
 """
-semijoin!(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, accelerate = false, method = :sort)
+semijoin!(dsl, dsr; on=nothing, makeunique=false, mapformats=true, alg=HeapSort, stable=false, accelerate = false, method = :hash)

 Perform a semi join of two `Datasets`: `dsl` and `dsr`, and change the left table `dsl` into a `Dataset`
 containing rows where keys appear in `dsl` and `dsr`.
@@ -905,7 +905,7 @@ rows that have values in `dsl` while do not have matching values `on` keys in `d
 you can use the function `getformat` to see the format;
 by setting `mapformats` to a `Bool Vector` of length 2, you can specify whether to use formatted values
 for `dsl` and `dsr`, respectively; for example, passing a `[true, false]` means use formatted values for `dsl` and do not use formatted values for `dsr`.
-- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:sort`
+- `method` is either `:sort` or `:hash` for specifiying the method of match finding, default is `:hash`
 - `alg`: sorting algorithms used, is `HeapSort` (the Heap Sort algorithm) by default;
 it can also be `QuickSort` (the Quicksort algorithm).
 - `stable`: by default is `false`, means that the sorting results have not to be stable;
@@ -1010,7 +1010,7 @@ julia> dsl
 3 │ 2020 true
 ```
 """
-function semijoin!(dsl::Dataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, method = :sort, threads = true)
+function semijoin!(dsl::Dataset, dsr::AbstractDataset; on = nothing, mapformats::Union{Bool, Vector{Bool}} = true, stable = false, alg = HeapSort, accelerate = false, method = :hash, threads = true)
 deleteat!(dsl, .!contains(dsl, dsr, on = on, mapformats = mapformats, stable = stable, alg = alg, accelerate = accelerate, method = method, threads = threads))
 end


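For callers who relied on the previous default, the change in this commit can be checked with a short usage sketch. It assumes the InMemoryDatasets package is installed and that `Dataset` accepts keyword-style column definitions. The column names and values below are invented for illustration, while the `on = :g1 => :group` pair and the `method` keyword are taken from the docstrings in the diff.

```julia
using InMemoryDatasets

# Illustrative data only; not taken from the package's docstring examples.
main = Dataset(g1 = [1, 2, 3, 4], x = [10, 20, 30, 40])
tds = Dataset(group = [2, 4])

# After this commit the default is method = :hash.
by_hash = contains(main, tds, on = :g1 => :group)

# Passing method = :sort explicitly requests the previous default.
by_sort = contains(main, tds, on = :g1 => :group, method = :sort)
```

Both calls are expected to flag the same rows of `main`, since the keyword selects how matches are found rather than what counts as a match.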