A `Dataset[T]` is a mapping that allows pipelining of functions in a readable syntax, returning an example of type `T`.

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

fruits_and_cost = (
    ('apple', 5),
    ('pear', 7),
    ('banana', 14),
)

dataset = (
    Dataset.from_subscriptable(fruits_and_cost)
    .starmap(lambda fruit, cost: (
        fruit,
        cost * 2,
    ))
)

assert dataset[2] == ('banana', 28)
```

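The pipelining idea can be sketched in plain Python: each `map` call conceptually composes a new function onto the existing pipeline (an illustration of the concept, not the library's internals):

```python
def compose(outer, inner):
    # Apply the existing pipeline first, then the newly mapped function.
    return lambda x: outer(inner(x))

pipeline = lambda x: x                         # initial pipeline: identity
pipeline = compose(lambda n: n + 1, pipeline)  # like .map(lambda n: n + 1)
pipeline = compose(lambda n: n * 2, pipeline)  # like .map(lambda n: n * 2)

assert pipeline(3) == 8  # (3 + 1) * 2
```
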
## Class Methods

### `from_subscriptable`

```python
from_subscriptable(data: Subscriptable[T]) -> Dataset[T]
```

Create a `Dataset` based on a subscriptable, i.e. an object that implements `__getitem__` and `__len__`.

#### Parameters

- `data`: Any object that implements `__getitem__` and `__len__`

#### Returns

- A new `Dataset` instance

#### Notes

Should only be used for simple examples, as a `Dataset` created with this method does not support methods that require a source dataframe, such as `Dataset.split` and `Dataset.subset`.

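Any object with these two methods qualifies; a minimal hand-rolled subscriptable might look like this (a conceptual sketch, not from the library):

```python
# A minimal subscriptable: __getitem__ and __len__ are all that
# Dataset.from_subscriptable requires of its input.
class Squares:
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        return index ** 2

squares = Squares(5)
assert len(squares) == 5
assert squares[3] == 9
```
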
### `from_dataframe`

```python
from_dataframe(df: pd.DataFrame) -> Dataset[pd.Series]
```

Create a `Dataset` based on a `pandas.DataFrame`.

#### Parameters

- `df`: Source pandas DataFrame

#### Returns

- A new `Dataset` instance where `__getitem__` returns a row from the dataframe

#### Notes

`Dataset.map` should be given a function that takes a row from the dataframe as input.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import pandas as pd
from datastream import Dataset

dataset = (
    Dataset.from_dataframe(pd.DataFrame(dict(
        number=[1, 2, 3],
    )))
    .map(lambda row: row['number'] + 1)
)

assert dataset[-1] == 4
```

### `from_paths`

```python
from_paths(paths: List[str], pattern: str) -> Dataset[pd.Series]
```

Create a `Dataset` from paths using a regex pattern that extracts information from the path itself.

#### Parameters

- `paths`: List of file paths
- `pattern`: Regex pattern with named groups to extract information from paths

#### Returns

- A new `Dataset` instance where `__getitem__` returns a row from the generated dataframe

#### Notes

`Dataset.map` should be given a function that takes a row from the dataframe as input.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset = (
    Dataset.from_paths(
        ['damage/1.png', 'damage/2.png'],
        pattern=r'^(?P<class_name>\w+)/(?P<index>\d+)\.png$',
    )
    .map(lambda row: row['class_name'])
)

assert dataset[-1] == 'damage'
```

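The named groups in the pattern become columns in the underlying dataframe; the extraction step can be sketched with the standard library (a simplified illustration with hypothetical paths, not the library's implementation):

```python
import re

import pandas as pd

# Hypothetical paths; each named group in the pattern becomes a column.
paths = ['damage/1.png', 'intact/2.png']
pattern = r'(?P<class_name>\w+)/(?P<index>\d+)\.png'

df = pd.DataFrame([
    dict(path=path, **re.match(pattern, path).groupdict())
    for path in paths
])

assert list(df['class_name']) == ['damage', 'intact']
assert list(df['index']) == ['1', '2']  # captured groups are strings
```
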
## Instance Methods

### `map`

```python
map(self, function: Callable[[T], U]) -> Dataset[U]
```

Creates a new dataset with the function added to the dataset pipeline.

#### Parameters

- `function`: Function to apply to each example

#### Returns

- A new `Dataset` with the mapping function added to the pipeline

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: number + 1)
)

assert dataset[-1] == 4
```

### `starmap`

```python
starmap(self, function: Callable[..., U]) -> Dataset[U]
```

Creates a new dataset with the function added to the dataset pipeline.

#### Parameters

- `function`: Function that accepts multiple arguments unpacked from the pipeline output

#### Returns

- A new `Dataset` with the mapping function added to the pipeline

#### Notes

The dataset's pipeline should return an iterable that will be expanded as arguments to the mapped function.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: (number, number + 1))
    .starmap(lambda number, plus_one: number + plus_one)
)

assert dataset[-1] == 7
```

### `subset`

```python
subset(self, function: Callable[[pd.DataFrame], pd.Series]) -> Dataset[T]
```

Select a subset of the dataset using a function that receives the source dataframe as input.

#### Parameters

- `function`: Function that takes a DataFrame and returns a boolean mask

#### Returns

- A new `Dataset` containing only the selected examples

#### Notes

This function can still be called after multiple operations, such as mapping functions, as it uses the source dataframe.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import pandas as pd
from datastream import Dataset

dataset = (
    Dataset.from_dataframe(pd.DataFrame(dict(
        number=[1, 2, 3],
    )))
    .map(lambda row: row['number'])
    .subset(lambda df: df['number'] <= 2)
)

assert dataset[-1] == 2
```

### `split`

```python
split(
    self,
    key_column: str,
    proportions: Dict[str, float],
    stratify_column: Optional[str] = None,
    filepath: Optional[str] = None,
    seed: Optional[int] = None,
) -> Dict[str, Dataset[T]]
```

Split the dataset into multiple parts.

#### Parameters

- `key_column`: Column to use as unique identifier for examples
- `proportions`: Dictionary mapping split names to proportions
- `stratify_column`: Optional column to use for stratification
- `filepath`: Optional path to save/load the split configuration
- `seed`: Optional random seed for reproducibility

#### Returns

- Dictionary mapping split names to `Dataset` instances

#### Notes

Optionally you can stratify on a column in the source dataframe or save the split to a json file.
If you are sure that the split strategy will not change, then you can safely use a seed instead of a filepath.

Saved splits can continue from the old split and handle:

- Adapt after removing examples from dataset
- Adapt to new stratification

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import numpy as np
import pandas as pd
from datastream import Dataset

split_datasets = (
    Dataset.from_dataframe(pd.DataFrame(dict(
        index=np.arange(100),
        number=np.random.randn(100),
    )))
    .map(lambda row: row['index'])
    .split(
        key_column='index',
        proportions=dict(train=0.8, test=0.2),
        seed=700,
    )
)

assert len(split_datasets['train']) == 80
assert split_datasets['test'][0] == 3
```

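The proportional assignment itself can be sketched in plain Python with a seeded shuffle (a simplified, hypothetical illustration of the idea; the library's actual algorithm also handles stratification and saved splits):

```python
import random

def split_keys(keys, proportions, seed=0):
    # Shuffle deterministically, then slice the keys according to proportions.
    keys = list(keys)
    random.Random(seed).shuffle(keys)
    splits, start = {}, 0
    for name, proportion in proportions.items():
        end = start + round(proportion * len(keys))
        splits[name] = keys[start:end]
        start = end
    return splits

splits = split_keys(range(100), dict(train=0.8, test=0.2), seed=1)

assert len(splits['train']) == 80
assert len(splits['test']) == 20
```
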
### `zip_index`

```python
zip_index(self) -> Dataset[Tuple[T, int]]
```

Zip the output with its underlying `Dataset` index.

#### Returns

- A new `Dataset` where each example is a tuple of `(output, index)`

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset = Dataset.from_subscriptable([4, 5, 6]).zip_index()

assert dataset[0] == (4, 0)
```

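Conceptually this mirrors `enumerate` with the tuple order reversed (a plain-Python comparison):

```python
values = [4, 5, 6]

# (output, index) tuples, matching the documented zip_index output shape.
zipped = [(value, index) for index, value in enumerate(values)]

assert zipped == [(4, 0), (5, 1), (6, 2)]
```
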
### `cache`

```python
cache(self, key_column: str) -> Dataset[T]
```

Cache an intermediate step in memory, based on a key column.

#### Parameters

- `key_column`: Column to use as cache key

#### Returns

- A new `Dataset` with caching enabled

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
import pandas as pd
from datastream import Dataset

df = pd.DataFrame(dict(key=['a', 'b'], value=[1, 2]))

dataset = Dataset.from_dataframe(df).cache('key')

assert dataset[0]['value'] == 1
```

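The keyed caching idea can be sketched as a dictionary memo keyed on the chosen column (a simplified illustration, not the library's implementation):

```python
def cached_by_key(function, key_column):
    # Store each result under its key so repeated lookups skip recomputation.
    memo = {}
    def wrapper(row):
        key = row[key_column]
        if key not in memo:
            memo[key] = function(row)
        return memo[key]
    return wrapper

calls = []

def expensive(row):
    calls.append(row['key'])  # track how often the real work runs
    return row['value'] * 2

fast = cached_by_key(expensive, key_column='key')
row = dict(key='a', value=1)

assert fast(row) == 2
assert fast(row) == 2  # second call served from the memo
assert calls == ['a']  # the expensive function ran only once
```
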
### `concat`

```python
concat(datasets: List[Dataset[T]]) -> Dataset[T]
```

Concatenate multiple datasets together so that they behave like a single dataset.

#### Parameters

- `datasets`: List of datasets to concatenate

#### Returns

- A new `Dataset` combining all input datasets

#### Notes

Consider using `Datastream.merge` instead if you have multiple data sources, as it allows you to control the number of samples from each source in the training batches.

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset1 = Dataset.from_subscriptable([1, 2])
dataset2 = Dataset.from_subscriptable([3, 4])

combined = Dataset.concat([dataset1, dataset2])

assert len(combined) == 4
assert combined[2] == 3
```

### `combine`

```python
combine(datasets: List[Dataset]) -> Dataset[Tuple]
```

Zip multiple datasets together so that all combinations of examples are possible.

#### Parameters

- `datasets`: List of datasets to combine

#### Returns

- A new `Dataset` yielding tuples of all possible combinations

#### Notes

Creates tuples like `(example1, example2, ...)` for all possible combinations (i.e. the cartesian product).

#### Examples

<!-- pytest-codeblocks:importorskip(datastream)-->

```python
from datastream import Dataset

dataset1 = Dataset.from_subscriptable([1, 2])
dataset2 = Dataset.from_subscriptable([3, 4])

combined = Dataset.combine([dataset1, dataset2])

assert len(combined) == 4
```
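
The all-combinations behavior matches `itertools.product` from the standard library (a conceptual sketch of the idea, not the library's code):

```python
from itertools import product

# Two hypothetical sources: combining them yields every pairing.
letters = ['a', 'b']
numbers = [1, 2, 3]

combinations = list(product(letters, numbers))

assert len(combinations) == len(letters) * len(numbers)  # 2 * 3 = 6
assert combinations[0] == ('a', 1)
```
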