Commit 4c6e194

add modify.md
1 parent a61ce2d commit 4c6e194

2 files changed

+184
-6
lines changed

docs/make.jl

Lines changed: 6 additions & 6 deletions
@@ -24,12 +24,12 @@ makedocs(
2424
"Formats" => "man/formats.md",
2525
"Call functions on each observation" => "man/map.md",
2626
"Row-wise operations" => "man/byrow.md",
27-
"Modifying data sets" => "man/modify.md",
28-
"Filtering observations" => "man/filter.md",
29-
"Sorting" => "man/sorting.md",
30-
"Grouping" => "man/grouping.md",
31-
"Aggregating over groups" => "man/aggregation.md",
32-
"Transposing Data" => "man/transpose.md",
27+
"Transform columns" => "man/modify.md",
28+
"Filter observations" => "man/filter.md",
29+
"Sort" => "man/sorting.md",
30+
"Group observations" => "man/grouping.md",
31+
"Aggregation" => "man/aggregation.md",
32+
"Transpose data" => "man/transpose.md",
3333
"Joins" => "man/joins.md"
3434
]
3535
# "API" => Any[

docs/src/man/modify.md

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
1+
# Transforming data sets
2+
3+
# Introduction
4+
5+
The `modify!` function transforms and modifies columns of a data set. It modifies the data set in place and operates on the actual values (rather than the formatted values). To modify a copy of the data instead, use the `modify` function. Both functions take one column of the data set at a time and apply the provided function to that column as a vector; contrast this with the `map!/map` functions, which apply operations to individual observations.
6+
7+
> Note that `modify!/modify` remove the format of a column as soon as its values are updated by a transformation.
8+
9+
# Specifying the transformation
10+
11+
The first argument of these two functions is the data set to be modified, and the remaining arguments are the transformation specifications, i.e.
12+
13+
> `modify!(ds, args...)` or `modify(ds, args...)`
14+
15+
The simplest form of `args` is `col => fun`, which calls `fun` on `col` as a vector and replaces `col` with the output of the call. `col` can be a column index or a column name. Thus, to replace the values of a column called `:x1` in a data set `ds` with their standardised values, we can use the following expression:
16+
17+
> `modify!(ds, :x1 => stdze)`
18+
19+
where `:x1` is a column in `ds`, and `stdze` is a function which subtracts the mean from the values and divides them by their standard deviation. If you don't want to replace the column but would rather create a new column by calling `fun` on `col`, the `col => fun => :newname` form is handy (here `:newname` is the name of the new column). Thus, to standardise the values of a column called `:x1` and store them as a new column in the data set, you can use:
20+
21+
> `modify!(ds, :x1 => stdze => :x1_stdze)`
22+
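As a rough illustration of the kind of function that can be passed as `fun`, the following is a minimal plain-Julia sketch of a standardising function. The name `stdze_sketch` is hypothetical and this is not the package's `stdze`: it assumes the vector contains no `missing` values, and it uses `std` from the `Statistics` standard library (the sample standard deviation).

```julia
using Statistics  # provides mean and std

# Hypothetical stand-in for a standardising function: it receives a whole
# column as a vector and returns a vector of the same length.
stdze_sketch(x) = (x .- mean(x)) ./ std(x)

z = stdze_sketch([1.0, 2.0, 3.0])
```

The point of the sketch is the shape of `fun`: it is called once with the entire column, not once per observation.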
23+
To modify multiple columns of a data set with the same `fun`, we can use the `cols => fun` form, where `cols` is a set of columns: a vector of column indices, a vector of column names, a regular expression which selects variables based on their names, or the `Between` and `Not` types. When `cols` refers to multiple columns, `modify!` automatically expands `cols => fun` to `col1 => fun, col2 => fun, ...`, where `col1` is the first of the selected columns, `col2` is the second, and so on. Thus, to standardise all columns whose names start with `x` in a data set, we can use the following expression:
24+
25+
> `modify!(ds, r"^x" => stdze)`
26+
27+
Note that Julia broadcasting can also be used to specify `args...`, e.g. something like:
28+
29+
> `[1, 2, 3] .=> [stdze, x -> x .- mean(x), x -> x ./ sum(x)] .=> [:stdze_1, :m_2, :m_3]`
30+
31+
will be translated as:
32+
> `1 => stdze => :stdze_1, 2 => (x -> x .- mean(x)) => :m_2, 3 => (x -> x ./ sum(x)) => :m_3`,
33+
34+
and something like:
35+
36+
> `:x1 .=> [sum, sort] .=> [:x1_sum, :x1_sort]`
37+
38+
will be translated as:
39+
40+
> `:x1 => sum => :x1_sum, :x1 => sort => :x1_sort`.
41+
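The expansion described above is ordinary Julia broadcasting over `=>` and can be checked without `modify!`. A small sketch in plain Julia (the variable names `specs` and `first_spec` are only for illustration) shows what the second expression evaluates to:

```julia
# `=>` is right-associative, and Symbols and functions behave as scalars in
# broadcasting, so this builds one `col => (fun => newname)` pair per element.
specs = :x1 .=> [sum, sort] .=> [:x1_sum, :x1_sort]

# Each element is a nested Pair, here :x1 => (sum => :x1_sum).
first_spec = specs[1]
```

The result is a vector with one specification per element of the broadcast vectors, which is why the vectors on either side of `.=>` must have matching lengths.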
42+
## Examples
43+
44+
```jldoctest
45+
julia> ds = Dataset(x1 = 1:5, x2 = [-2, -1, missing, 1, 2],
46+
x3 = [0.0, 0.1, 0.2, missing, 0.4])
47+
5×3 Dataset
48+
Row │ x1 x2 x3
49+
│ identity identity identity
50+
│ Int64? Int64? Float64?
51+
─────┼──────────────────────────────
52+
1 │ 1 -2 0.0
53+
2 │ 2 -1 0.1
54+
3 │ 3 missing 0.2
55+
4 │ 4 1 missing
56+
5 │ 5 2 0.4
57+
58+
julia> modify!(ds, 2:3 => sum)
59+
5×3 Dataset
60+
Row │ x1 x2 x3
61+
│ identity identity identity
62+
│ Int64? Int64? Float64?
63+
─────┼──────────────────────────────
64+
1 │ 1 0 0.7
65+
2 │ 2 0 0.7
66+
3 │ 3 0 0.7
67+
4 │ 4 0 0.7
68+
5 │ 5 0 0.7
69+
70+
julia> modify!(ds, :x1 => x -> x .- mean(x))
71+
5×3 Dataset
72+
Row │ x1 x2 x3
73+
│ identity identity identity
74+
│ Float64? Int64? Float64?
75+
─────┼──────────────────────────────
76+
1 │ -2.0 0 0.7
77+
2 │ -1.0 0 0.7
78+
3 │ 0.0 0 0.7
79+
4 │ 1.0 0 0.7
80+
5 │ 2.0 0 0.7
81+
```
82+
83+
# Accessing modified columns
84+
85+
One of the key features of `modify!/modify` is that these functions have access to all modified or created variables within a single run of the function. That is, every transformation can use any column that has been updated by an earlier entry of `args` or created by the `col => fun => :newname` syntax. In other words, `args...` is processed from left to right, and whenever a column is updated or created, the next operation has access to its (new or updated) values. This is particularly useful in conjunction with `byrow`, which performs row-wise operations.
86+
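The left-to-right behaviour can be pictured with a toy model in plain Julia. This is only a sketch of the idea, not how the package implements it: `apply_specs!` is a hypothetical helper, and columns are modelled as entries of a `Dict` rather than a data set.

```julia
# Toy model: apply `col => fun => newname` specifications one after another,
# so that each step sees the columns written by earlier steps.
function apply_specs!(cols::AbstractDict, specs...)
    for (col, (fun, newname)) in specs
        cols[newname] = fun(cols[col])
    end
    return cols
end

cols = Dict{Symbol,Any}(:x => [1, 2, 3])
apply_specs!(cols,
    :x => (v -> v .* 2) => :y,   # creates :y from :x
    :y => (v -> v .+ 1) => :z)   # uses the freshly created :y
```

Because the second specification reads `:y`, it only works if the first one has already been applied, which is the ordering guarantee described above.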
87+
88+
# Specialised functions
89+
90+
There are two functions in Datasets which are very handy for modifying a data set: `byrow` and `splitter`.
91+
92+
## `byrow`
93+
94+
The `byrow` function is discussed at length in another section as a stand-alone function; however, it can also be used as the `fun` when specifying a transformation in `modify!/modify`. Its syntax differs from the stand-alone usage: when `byrow` is the `fun` part of `args` in `modify!/modify`, we don't need to specify `ds` and `cols`, but all other arguments are the same as in the stand-alone usage.
95+
96+
The main feature of `byrow` inside `modify!/modify` is that it can accept multiple columns as its input argument, as opposed to other functions inside `modify!/modify`, which only accept a single column. This, together with the fact that every transformation inside `modify!/modify` has access to modified columns, helps to express complex transformations in a single run of `modify!/modify`.
97+
98+
The form of `args` when `byrow` is the function is similar to other functions with the following exceptions:
99+
100+
* When `cols` refers to multiple columns in `cols => byrow(...)`, `modify!/modify` will create a new column with a name based on the arguments passed to it. The user can provide a custom name by using the `cols => byrow(...) => :newname` syntax.
101+
* When `col` refers to a single column in `col => byrow(...)`, `modify!/modify` will apply the operation to each value of the column and replace the column with the new values, i.e. it doesn't create a new column.
102+
* To use broadcasting with `byrow`, i.e. to apply the same row-wise operation to multiple columns, the form must be `cols .=> byrow` where `cols` is a vector of column names or column indices (regular expressions cannot be used for this purpose).
103+
104+
## `splitter`
105+
106+
`splitter` is also a specialised function which has a single job: splitting a single column whose values are `Tuple`s into multiple columns. It only operates on a single column, and the values in the column to be split must be `Tuple`s. The form of `args` for `splitter` must be similar to:
107+
108+
> `modify!(ds, col => splitter => [:new_col_1, :new_col_2])`
109+
110+
which means we would like to split `col` into two new columns: `:new_col_1` and `:new_col_2`. Here `col` can be a column index or a column name.
111+
112+
> Note that `splitter` produces as many columns as there are new names given, i.e. if the user provides fewer names than the `Tuple` has components, the output columns will contain only some of the components of the input `Tuple`.
113+
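To make the note concrete, here is a plain-Julia sketch of the idea behind splitting a column of `Tuple`s. It does not call `splitter` itself; `split_column` is a hypothetical helper that mimics the described behaviour of producing one column per supplied name.

```julia
# Hypothetical helper: component i of every tuple goes to the i-th new name.
function split_column(col::AbstractVector{<:Tuple}, newnames::Vector{Symbol})
    return Dict(name => getindex.(col, i) for (i, name) in enumerate(newnames))
end

col = [("Bob", "Smith"), ("John", "Max")]
both = split_column(col, [:first_name, :last_name])

# With fewer names than tuple components, only the leading component is kept.
only_first = split_column(col, [:first_name])
```

In this sketch the number of output columns is driven entirely by the length of `newnames`, which matches the behaviour the note describes.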
114+
## Examples
115+
116+
```jldoctest
117+
julia> body = Dataset(weight = [78.5, 59, 80], height = [160, 171, 183])
118+
3×2 Dataset
119+
Row │ weight height
120+
│ identity identity
121+
│ Float64? Int64?
122+
─────┼────────────────────
123+
1 │ 78.5 160
124+
2 │ 59.0 171
125+
3 │ 80.0 183
126+
127+
julia> modify!(body, :height => byrow(x -> (x/100)^2) => :BMI, [1, 3] => byrow(/) => :BMI)
128+
3×3 Dataset
129+
Row │ weight height BMI
130+
│ identity identity identity
131+
│ Float64? Int64? Float64?
132+
─────┼──────────────────────────────
133+
1 │ 78.5 160 30.6641
134+
2 │ 59.0 171 20.1771
135+
3 │ 80.0 183 23.8884
136+
137+
julia> sale = Dataset(customer = ["Bob Smith", "John Max", "Froon Moore"],
138+
item1_q1 = [23, 43, 50], item2_q1 = [44, 32, 55],
139+
item3_q1 = [45, 45, 54])
140+
3×4 Dataset
141+
Row │ customer item1_q1 item2_q1 item3_q1
142+
│ identity identity identity identity
143+
│ String? Int64? Int64? Int64?
144+
─────┼───────────────────────────────────────────
145+
1 │ Bob Smith 23 44 45
146+
2 │ John Max 43 32 45
147+
3 │ Froon Moore 50 55 54
148+
149+
julia> modify!(sale, 2:4 => byrow(sum) => :total)
150+
3×5 Dataset
151+
Row │ customer item1_q1 item2_q1 item3_q1 total
152+
│ identity identity identity identity identity
153+
│ String? Int64? Int64? Int64? Int64?
154+
─────┼─────────────────────────────────────────────────────
155+
1 │ Bob Smith 23 44 45 112
156+
2 │ John Max 43 32 45 120
157+
3 │ Froon Moore 50 55 54 159
158+
159+
julia> function name_split(x)
160+
spl = split(x, " ")
161+
(string(spl[1]), string(spl[2]))
162+
end
163+
name_split (generic function with 1 method)
164+
165+
julia> modify!(sale, :customer => byrow(name_split),
166+
:customer => splitter => [:first_name, :last_name])
167+
3×7 Dataset
168+
Row │ customer item1_q1 item2_q1 item3_q1 total first_name last_name
169+
│ identity identity identity identity identity identity identity
170+
│ Tuple…? Int64? Int64? Int64? Int64? String? String?
171+
─────┼───────────────────────────────────────────────────────────────────────────────────
172+
1 │ ("Bob", "Smith") 23 44 45 112 Bob Smith
173+
2 │ ("John", "Max") 43 32 45 120 John Max
174+
3 │ ("Froon", "Moore") 50 55 54 159 Froon Moore
175+
176+
```
177+
178+
In the last example, we use `byrow` to apply `name_split` to each row of `:customer`, and since `byrow` receives only one column as its input, `modify!` replaces the column with the new values. Note also that `modify!` has access to these new values, so we can use `splitter` to split the column into two new columns.
