Commit 6ca16e6

[df] Add more docs to the Snapshot with variations section
1 parent fcf1cdc commit 6ca16e6

1 file changed: +25 -4 lines changed

tree/dataframe/src/RDataFrame.cxx

Lines changed: 25 additions & 4 deletions
@@ -1265,10 +1265,28 @@ In that case, RDataFrame will snapshot the filtered columns in a memory-efficien
12651265
default-constructed object in case of classes. If none of the filters pass like in row 6, the entire event is omitted from the snapshot.
12661266
12671267
To tell apart a genuine `0` (like `x` in row 0) from a variation that didn't pass the selection, RDataFrame writes a bitmask for each event, indicating which variations
1268-
are valid (see last column). A mapping of column names to this bitmask is placed in the same file as the output dataset, and automatically loaded when
1269-
RDataFrame opens a file that was snapshot with variations.
1270-
Attempting to read such missing values with RDataFrame will produce an error, but RDataFrame can either skip these values or fill in defaults as
1271-
described in the \ref missing-values "section on dealing with missing values".
1268+
are valid (see last column). The bitmask is implemented as a 64-bit `std::bitset` in memory, written to the output
1269+
dataset as a `std::uint64_t`. For every 64 columns, a new bitmask column is added to the output dataset.
1270+
1271+
Each column that might contain invalid values is connected to exactly one bit in one bitmask. A mapping of column names
1272+
to the corresponding bitmask is placed in the same file as the output dataset, with a name that follows the pattern
1273+
`"R_rdf_branchToBitmaskMapping_<NAME_OF_THE_DATASET>"`. It is of type
1274+
`std::unordered_map<std::string, std::pair<std::string, unsigned int>>`, and maps a column name to the name of the
1275+
bitmask column and the index of the relevant bit. For example, in the same file as the dataset "Events" there would be
1276+
an object named `R_rdf_branchToBitmaskMapping_Events`. This object could, for example, describe a connection such as:
1277+
1278+
~~~
1279+
muon_pt --> (R_rdf_mask_Events_0, 42)
1280+
~~~
1281+
1282+
which means that the validity of the entries in `muon_pt` is established by the bit `42` in the bitmask found in the
1283+
column `R_rdf_mask_Events_0`.
1284+
1285+
When RDataFrame opens a file, it checks for the existence of this mapping between columns and bitmasks, and loads it automatically if found. As such,
1286+
RDataFrame handles the bitmasks transparently, with no action required from the user.
1287+
1288+
When the corresponding bit labels a value as invalid, reading that value yields a missing value. The semantics of such a scenario follow the
1289+
rules described in the \ref missing-values "section on dealing with missing values" and can be dealt with accordingly.
12721290
12731291
\note Snapshot with variations is currently restricted to single-threaded TTree snapshots.
12741292
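The lookup described in the added documentation can be sketched in plain C++ without ROOT. This is an illustration only, not RDataFrame's implementation: `EventMasks`, `IsValid` and `MaskColumnName` are hypothetical names, a set bit is assumed to mean "valid", columns absent from the mapping are assumed to be always valid, and the `R_rdf_mask_<dataset>_<index>` naming is inferred from the single example `R_rdf_mask_Events_0`.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Same type as the mapping object stored next to the dataset:
// column name -> (bitmask column name, index of the relevant bit).
using BranchToBitmaskMapping =
   std::unordered_map<std::string, std::pair<std::string, unsigned int>>;

// Hypothetical stand-in for one event: the std::uint64_t value read from
// each bitmask column.
using EventMasks = std::unordered_map<std::string, std::uint64_t>;

// Hypothetical helper. Assumption: a set bit means the column holds a valid
// value for this event, and a column without a mapping entry is always valid.
bool IsValid(const BranchToBitmaskMapping &mapping, const EventMasks &masks,
             const std::string &column)
{
   const auto it = mapping.find(column);
   if (it == mapping.end())
      return true; // not tracked by any bitmask
   const std::string &maskName = it->second.first;
   const unsigned int bitIndex = it->second.second;
   assert(bitIndex < 64 && "each bitmask holds 64 bits");
   // Throws std::out_of_range if the bitmask column is missing from the event.
   const std::bitset<64> bits(masks.at(maskName));
   return bits.test(bitIndex);
}

// Hypothetical helper. Assumption: the n-th tracked column (0-based) lands in
// bitmask number n / 64, following the naming seen in the example above.
std::string MaskColumnName(const std::string &dataset, unsigned int n)
{
   return "R_rdf_mask_" + dataset + "_" + std::to_string(n / 64);
}
```

With the mapping `muon_pt --> (R_rdf_mask_Events_0, 42)`, `IsValid` reports `muon_pt` as valid for an event only when bit 42 of that event's `R_rdf_mask_Events_0` value is set. Under the stated naming assumption, a 65th tracked column would be assigned to `R_rdf_mask_Events_1`.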
@@ -1780,6 +1798,9 @@ more of its entries. For example:
17801798
- When joining different datasets horizontally according to some index value
17811799
(e.g. the event number), if the index does not find a match in one or more
17821800
other datasets for a certain entry.
1801+
- If, for a certain event, a column is invalid because it results from a Snapshot
1802+
with systematic variations, and that variation didn't pass its filters. For
1803+
more details, see \ref snapshot-with-variations.
17831804
17841805
For example, suppose that column "y" does not have a value for entry 42:
17851806
