|
| 1 | +# Transforming data sets |
| 2 | + |
| 3 | +# Introduction |
| 4 | + |
| 5 | +The `modify!` function transforms and modifies columns of a data set. It modifies the data set in place and operates on the actual values rather than the formatted values. To modify a copy of the data instead, use the `modify` function. Both functions pass each selected column to the provided function as a whole vector; this contrasts with `map!/map`, which apply operations to individual observations. |
| 6 | + |
| 7 | +> Note that `modify!/modify` remove the format of columns as soon as their values are updated by a given transformation. |
| 8 | +
|
| 9 | +# Specifying the transformation |
| 10 | + |
| 11 | +The first argument of both functions is the data set to be modified, and the remaining arguments are the transformation specifications, i.e. |
| 12 | + |
| 13 | +> `modify!(ds, args...)` or `modify(ds, args...)` |
| 14 | +
|
| 15 | +The simplest form of `args` is `col => fun`, which calls `fun` on `col` as a vector and replaces `col` with the output of the call. `col` can be a column index or a column name. Thus, to replace the values of a column called `:x1` in a data set `ds` with their standardised values, we can use the following expression: |
| 16 | + |
| 17 | +> `modify!(ds, :x1 => stdze)` |
| 18 | +
|
| 19 | +where `:x1` is a column in `ds`, and `stdze` is a function which subtracts the mean from the values and divides the result by their standard deviation. If you don't want to replace the column, but would rather create a new column by calling `fun` on `col`, the `col => fun => :newname` form is handy (here `:newname` is the name of the new column). Thus, to standardise the values of a column called `:x1` and store them as a new column in the data set, you can use |
| 20 | + |
| 21 | +> `modify!(ds, :x1 => stdze => :x1_stdze)` |
| 22 | +
|
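
To make the description of `stdze` concrete, here is a minimal sketch in plain Julia of the computation it is said to perform. `stdze_sketch` is a hypothetical stand-in, not the package's function: it assumes the column has no missing values and uses the default `std` from the `Statistics` standard library, which may differ from the package's choice.

```julia
using Statistics  # mean and std from the standard library

# Hypothetical stand-in for `stdze`: subtract the mean, then divide by the
# standard deviation. Assumes the input has no missing values.
stdze_sketch(x) = (x .- mean(x)) ./ std(x)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
z = stdze_sketch(x1)  # centred values with unit standard deviation
```

The function receives the whole column as a vector, which is the calling convention `col => fun` relies on.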
| 23 | +To modify multiple columns of a data set with the same `fun`, we can use the `cols => fun` form, where `cols` is a set of columns: a vector of column indices, a vector of column names, a regular expression which selects columns based on their names, or the `Between` and `Not` types. When `cols` refers to multiple columns, `modify!` automatically expands `cols => fun` to `col1 => fun, col2 => fun, ...`, where `col1` is the first of the selected columns, `col2` is the second, and so on. Thus, to standardise all columns whose names start with `x` in a data set, we can use the following expression: |
| 24 | + |
| 25 | +> `modify!(ds, r"^x" => stdze)` |
| 26 | +
|
| 27 | +Julia broadcasting can also be used to specify `args...`, e.g. something like: |
| 28 | + |
| 29 | +> `[1, 2, 3] .=> [stdze, x -> x .- mean(x), x -> x ./ sum(x)] .=> [:stdze_1, :m_2, :m_3]` |
| 30 | +
|
| 31 | +will be translated as: |
| 32 | +> `1 => stdze => :stdze_1, 2 => (x -> x .- mean(x)) => :m_2, 3 => (x -> x ./ sum(x)) => :m_3`, |
| 33 | +
|
| 34 | +and something like: |
| 35 | + |
| 36 | +> `:x1 .=> [sum, sort] .=> [:x1_sum, :x1_sort]` |
| 37 | +
|
| 38 | +will be translated as: |
| 39 | + |
| 40 | +> `:x1 => sum => :x1_sum, :x1 => sort => :x1_sort`. |
| 41 | +
|
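
The translation above is ordinary Julia behaviour and can be checked without any data set. The sketch below builds the broadcast specifications and inspects the resulting nested pairs; `center` and `proportion` are hypothetical helpers introduced only for this illustration.

```julia
using Statistics  # for mean

center(x) = x .- mean(x)     # hypothetical helper
proportion(x) = x ./ sum(x)  # hypothetical helper

# Element-wise pairing: one `col => fun => :newname` entry per position.
specs = [1, 2] .=> [center, proportion] .=> [:c_1, :p_2]

# A single column name is treated as a scalar and repeated for each function.
one_to_many = :x1 .=> [sum, sort] .=> [:x1_sum, :x1_sort]

first_spec = specs[1]
first_spec.first   # the column selector
first_spec.second  # the nested `fun => :newname` pair
```

Because `=>` is right-associative, each element has the shape `col => (fun => :newname)`, which matches the `col => fun => :newname` form described earlier.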
| 42 | +## Examples |
| 43 | + |
| 44 | +```jldoctest |
| 45 | +julia> ds = Dataset(x1 = 1:5, x2 = [-2, -1, missing, 1, 2], |
| 46 | + x3 = [0.0, 0.1, 0.2, missing, 0.4]) |
| 47 | +5×3 Dataset |
| 48 | + Row │ x1 x2 x3 |
| 49 | + │ identity identity identity |
| 50 | + │ Int64? Int64? Float64? |
| 51 | +─────┼────────────────────────────── |
| 52 | + 1 │ 1 -2 0.0 |
| 53 | + 2 │ 2 -1 0.1 |
| 54 | + 3 │ 3 missing 0.2 |
| 55 | + 4 │ 4 1 missing |
| 56 | + 5 │ 5 2 0.4 |
| 57 | +
|
| 58 | +julia> modify!(ds, 2:3 => sum) |
| 59 | +5×3 Dataset |
| 60 | + Row │ x1 x2 x3 |
| 61 | + │ identity identity identity |
| 62 | + │ Int64? Int64? Float64? |
| 63 | +─────┼────────────────────────────── |
| 64 | + 1 │ 1 0 0.7 |
| 65 | + 2 │ 2 0 0.7 |
| 66 | + 3 │ 3 0 0.7 |
| 67 | + 4 │ 4 0 0.7 |
| 68 | + 5 │ 5 0 0.7 |
| 69 | +
|
| 70 | +julia> modify!(ds, :x1 => x -> x .- mean(x)) |
| 71 | +5×3 Dataset |
| 72 | + Row │ x1 x2 x3 |
| 73 | + │ identity identity identity |
| 74 | + │ Float64? Int64? Float64? |
| 75 | +─────┼────────────────────────────── |
| 76 | + 1 │ -2.0 0 0.7 |
| 77 | + 2 │ -1.0 0 0.7 |
| 78 | + 3 │ 0.0 0 0.7 |
| 79 | + 4 │ 1.0 0 0.7 |
| 80 | + 5 │ 2.0 0 0.7 |
| 81 | +``` |
| 82 | + |
| 83 | +# Accessing modified columns |
| 84 | + |
| 85 | +One of the key features of `modify!/modify` is that these functions have access to all modified or created columns within a single call. That is, every transformation can use any column that has been updated by an earlier entry of `args`, or any column created by the `col => fun => :newname` syntax. In other words, `args...` is processed from left to right, and whenever a column is updated or created, the next operation sees its new values. This is particularly useful in conjunction with `byrow`, which performs row-wise operations. |
| 86 | + |
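
As a small illustration of this left-to-right behaviour, the sketch below creates a column and then uses it in the very next specification of the same call. The `using InMemoryDatasets` line is an assumption about the package name (the text refers to it as Datasets); the column names are made up for the example.

```julia
using InMemoryDatasets  # assumed package name; the text calls it Datasets

ds = Dataset(x = [1, 2, 3])

# The second specification reads :double, which the first one has just created.
# The anonymous functions are parenthesised so that `=> :newname` is not
# parsed as part of the function body.
modify!(ds,
        :x => (v -> v .* 2) => :double,
        :double => (v -> v .+ 1) => :double_plus_one)
```

Under the rule described above, `:double_plus_one` should be computed from the freshly created `:double` column rather than from `:x`.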
| 87 | + |
| 88 | +# Specialised functions |
| 89 | + |
| 90 | +Two functions in Datasets are particularly handy for modifying a data set: `byrow` and `splitter`. |
| 91 | + |
| 92 | +## `byrow` |
| 93 | + |
| 94 | +The `byrow` function is discussed at length in another section as a stand-alone function; however, it can also be used as the `fun` when specifying a transformation in `modify!/modify`. Its syntax differs from the stand-alone usage in one respect: when `byrow` is the `fun` part of `args` in `modify!/modify`, we don't need to specify `ds` and `cols`. All other arguments are the same as in the stand-alone usage. |
| 95 | + |
| 96 | +The main feature of `byrow` inside `modify!/modify` is that it can accept multiple columns as its input, as opposed to other functions inside `modify!/modify`, which accept only a single column. This, together with the fact that every transformation inside `modify!/modify` has access to modified columns, makes it possible to express complex transformations in a single call of `modify!/modify`. |
| 97 | + |
| 98 | +The form of `args` when `byrow` is the function is similar to other functions with the following exceptions: |
| 99 | + |
| 100 | +* When `cols` refers to multiple columns in `cols => byrow(...)`, `modify!/modify` will create a new column with a name based on the arguments passed to it. The user can provide a custom name by using the `cols => byrow(...) => :newname` syntax. |
| 101 | +* When `col` refers to a single column in `col => byrow(...)`, `modify!/modify` will apply the operation to the individual values of the column and replace the column with the new values, i.e. it doesn't create a new column. |
| 102 | +* To use broadcasting with `byrow`, i.e. to apply the same row-wise operation to multiple columns, the form must be `cols .=> byrow` where `cols` is a vector of column names or column indices (a regular expression cannot be used for this purpose). |
| 103 | + |
| 104 | +## `splitter` |
| 105 | + |
| 106 | +`splitter` is also a specialised function, with a single job: splitting a single column of `Tuple` values into multiple columns. It operates on a single column only, and the values in the column to be split must be `Tuple`s. The form of `args` for `splitter` must be similar to: |
| 107 | + |
| 108 | +> `modify!(ds, col => splitter => [:new_col_1, :new_col_2])` |
| 109 | +
|
| 110 | +which splits `col` into two new columns, `:new_col_1` and `:new_col_2`. Here `col` can be a column index or a column name. |
| 111 | + |
| 112 | +> Note, `splitter` produces as many columns as the length of the given new names, i.e. if the user provides fewer names than needed, the output columns will only contain partial components of the input `Tuple`. |
| 113 | +
|
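
The effect of supplying fewer names than tuple components can be pictured with a toy model in plain Julia. This is only a sketch of the behaviour described in the note, not the package's implementation; `split_tuples` is a hypothetical helper that returns a `Dict` rather than adding columns to a data set.

```julia
# Toy model of the splitting step; not the package's implementation.
customers = [("Bob", "Smith"), ("John", "Max"), ("Froon", "Moore")]

# Hypothetical helper: build one vector per requested name, taking the
# i-th tuple component for the i-th name.
split_tuples(col, newnames) =
    Dict(name => getindex.(col, i) for (i, name) in enumerate(newnames))

both = split_tuples(customers, [:first_name, :last_name])
partial = split_tuples(customers, [:first_name])  # fewer names than components
```

In this model, `partial` contains only the first component of each tuple, mirroring the note's point that the number of names decides how many components are kept.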
| 114 | +## Examples |
| 115 | + |
| 116 | +```jldoctest |
| 117 | +julia> body = Dataset(weight = [78.5, 59, 80], height = [160, 171, 183]) |
| 118 | +3×2 Dataset |
| 119 | + Row │ weight height |
| 120 | + │ identity identity |
| 121 | + │ Float64? Int64? |
| 122 | +─────┼──────────────────── |
| 123 | + 1 │ 78.5 160 |
| 124 | + 2 │ 59.0 171 |
| 125 | + 3 │ 80.0 183 |
| 126 | +
|
| 127 | +julia> modify!(body, :height => byrow(x -> (x/100)^2) => :BMI, [1, 3] => byrow(/) => :BMI) |
| 128 | +3×3 Dataset |
| 129 | + Row │ weight height BMI |
| 130 | + │ identity identity identity |
| 131 | + │ Float64? Int64? Float64? |
| 132 | +─────┼────────────────────────────── |
| 133 | + 1 │ 78.5 160 30.6641 |
| 134 | + 2 │ 59.0 171 20.1771 |
| 135 | + 3 │ 80.0 183 23.8884 |
| 136 | +
|
| 137 | +julia> sale = Dataset(customer = ["Bob Smith", "John Max", "Froon Moore"], |
| 138 | + item1_q1 = [23, 43, 50], item2_q1 = [44, 32, 55], |
| 139 | + item3_q1 = [45, 45, 54]) |
| 140 | +3×4 Dataset |
| 141 | + Row │ customer item1_q1 item2_q1 item3_q1 |
| 142 | + │ identity identity identity identity |
| 143 | + │ String? Int64? Int64? Int64? |
| 144 | +─────┼─────────────────────────────────────────── |
| 145 | + 1 │ Bob Smith 23 44 45 |
| 146 | + 2 │ John Max 43 32 45 |
| 147 | + 3 │ Froon Moore 50 55 54 |
| 148 | +
|
| 149 | +julia> modify!(sale, 2:4 => byrow(sum) => :total) |
| 150 | +3×5 Dataset |
| 151 | + Row │ customer item1_q1 item2_q1 item3_q1 total |
| 152 | + │ identity identity identity identity identity |
| 153 | + │ String? Int64? Int64? Int64? Int64? |
| 154 | +─────┼───────────────────────────────────────────────────── |
| 155 | + 1 │ Bob Smith 23 44 45 112 |
| 156 | + 2 │ John Max 43 32 45 120 |
| 157 | + 3 │ Froon Moore 50 55 54 159 |
| 158 | +
|
| 159 | +julia> function name_split(x) |
| 160 | + spl = split(x, " ") |
| 161 | + (string(spl[1]), string(spl[2])) |
| 162 | + end |
| 163 | +name_split (generic function with 1 method) |
| 164 | +
|
| 165 | +julia> modify!(sale, :customer => byrow(name_split), |
| 166 | + :customer => splitter => [:first_name, :last_name]) |
| 167 | +3×7 Dataset |
| 168 | + Row │ customer item1_q1 item2_q1 item3_q1 total first_name last_name |
| 169 | + │ identity identity identity identity identity identity identity |
| 170 | + │ Tuple…? Int64? Int64? Int64? Int64? String? String? |
| 171 | +─────┼─────────────────────────────────────────────────────────────────────────────────── |
| 172 | + 1 │ ("Bob", "Smith") 23 44 45 112 Bob Smith |
| 173 | + 2 │ ("John", "Max") 43 32 45 120 John Max |
| 174 | + 3 │ ("Froon", "Moore") 50 55 54 159 Froon Moore |
| 175 | +
|
| 176 | +``` |
| 177 | + |
| 178 | +In the last example, we use `byrow` to apply `name_split` on each row of `:customer`, and since there is only one column as the input of `byrow`, `modify!` replaces the column with the new values. Also, note that the `modify!` function has access to these new values and we can use `splitter` to split the column into two new columns. |