
Commit b0fe747

Feature additional docs (#222)

* wip
* wip
* doc updates
* function doc changes only
* function doc changes only

1 parent 168b6f1 commit b0fe747
File tree: 5 files changed, +62 additions, -26 deletions

5 files changed

+62
-26
lines changed

CHANGELOG.md (1 addition, 1 deletion)

@@ -152,7 +152,7 @@ For example:
 
 To use an older DB runtime version in your notebook, you can use the following code in your notebook:
 
-```commandline
+```shell
 %pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
 ```

dbldatagen/data_generator.py (10 additions, 6 deletions)

@@ -901,20 +901,24 @@ def withStructColumn(self, colName, fields=None, asJson=False, **kwargs):
         a struct of the specified fields.
 
         :param colName: name of column
-        :param fields: list of fields to compose as a struct valued column
+        :param fields: list of elements to compose as a struct valued column (each being a string or tuple), or a dict
+            outlining the structure of the struct column
         :param asJson: If False, generate a struct valued column. If True, generate a JSON string column
+        :param kwargs: keyword arguments to pass to the underlying column generators as per `withColumn`
         :return: A modified in-place instance of data generator allowing for chaining of calls
             following the Builder pattern
 
         .. note::
            Additional options for the field specification may be specified as keyword arguments.
 
-           The field specification may be :
-           - a list of field references (strings) which will be used as both the field name and the SQL expression
-           - a list of tuples of the form (field_name, field_expression) where field_name is the name of the field
-           - a Python dict outlining the structure of the struct column. The keys of the dict are the field names
+           The fields specification specified by the `fields` argument may be :
 
-           When using the ``struct`` form of the field specifications, a field whose value is a list will be treated
+           - A list of field references (`strings`) which will be used as both the field name and the SQL expression
+           - A list of tuples of the form **(field_name, field_expression)** where `field_name` is the name of the
+             field. In that case, the `field_expression` string should be a SQL expression to generate the field value
+           - A Python dict outlining the structure of the struct column. The keys of the dict are the field names
+
+           When using the `dict` form of the field specifications, a field whose value is a list will be treated
            as creating a SQL array literal.
         """
dbldatagen/utils.py (10 additions, 7 deletions)

@@ -9,16 +9,18 @@
 """
 
 import functools
-import warnings
-from datetime import timedelta
-import re
 import json
+import re
 import time
+import warnings
+from datetime import timedelta
+
 
 import jmespath
 
 
 def deprecated(message=""):
-    """ Define a deprecated decorator without dependencies on 3rd party libraries
+    """
+    Define a deprecated decorator without dependencies on 3rd party libraries
 
     Note there is a 3rd party library called `deprecated` that provides this feature but goal is to only have
     dependencies on packages already used in the Databricks runtime
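The docstring above notes that the decorator avoids third-party dependencies. A minimal sketch of such a decorator using only the standard library (an illustration of the pattern, not the actual dbldatagen implementation) looks like:

```python
import functools
import warnings


def deprecated(message=""):
    """Sketch (not dbldatagen's code): decorator factory that emits a
    DeprecationWarning each time the wrapped function is called."""
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's metadata
        def wrapper(*args, **kwargs):
            warnings.warn(f"{func.__name__} is deprecated. {message}",
                          category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Using `functools.wraps` keeps `__name__` and the docstring of the decorated function intact, which matters for generated API documentation.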
@@ -275,7 +277,8 @@ def strip_margins(s, marginChar):
 
 
 def split_list_matching_condition(lst, cond):
-    """ Split a list on elements that match a condition
+    """
+    Split a list on elements that match a condition
 
     This will find all matches of a specific condition in the list and split the list into sub lists around the
     element that matches this condition.
@@ -288,9 +291,9 @@ def split_list_matching_condition(lst, cond):
 
        splitListOnCondition(x, lambda el: el == 'id')
 
-    result:
+    Result:
     `[['id'], ['city_name'], ['id'], ['city_id', 'city_pop'],
-    ['id'], ['city_id', 'city_pop', 'city_id', 'city_pop'], ['id']]`
+     ['id'], ['city_id', 'city_pop', 'city_id', 'city_pop'], ['id']]`
 
     :arg lst: list of items to perform condition matches against
     :arg cond: lambda function or function taking single argument and returning True or False
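The docstring's example implies the following behavior: elements matching the condition become singleton sublists, and runs of non-matching elements are grouped together. A standard-library sketch consistent with that example (an illustration, not necessarily the actual dbldatagen implementation):

```python
def split_list_matching_condition(lst, cond):
    """Sketch: split lst around elements where cond(el) is True.
    Matching elements become their own sublists; consecutive
    non-matching elements are kept together."""
    result, current = [], []
    for el in lst:
        if cond(el):
            if current:              # close the current run, if any
                result.append(current)
                current = []
            result.append([el])      # matching element stands alone
        else:
            current.append(el)
    if current:                      # flush a trailing run
        result.append(current)
    return result
```

Feeding it the flattened form of the docstring's expected result (the input `x` is not shown in the diff, but it is recoverable by flattening the output) reproduces that result exactly.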

docs/source/generating_column_data.rst (24 additions, 10 deletions)

@@ -13,6 +13,7 @@ This includes:
 - Whether the generated data set will be a streaming or batch data set
 - How the column data should be generated and what the dependencies for each column are
 - How random and psuedo-random data is generated
+- The structure for structured columns and JSON valued columns
 
 .. seealso::
    See the following links for more details:
@@ -22,6 +23,7 @@ This includes:
    * Controlling how existing columns are generated - :data:`~dbldatagen.data_generator.DataGenerator.withColumnSpec`
    * Adding column generation specs in bulk - :data:`~dbldatagen.data_generator.DataGenerator.withColumnSpecs`
    * Options for column generation - :doc:`options_and_features`
+   * Generating JSON and complex data - :doc:`generating_json_data`
 
 Column data is generated for all columns whether imported from a schema or explicitly added
 to a data specification. However, column data can be omitted from the final output, allowing columns to be used
@@ -35,15 +37,16 @@ These control the data generation process.
 
 The data generation process itself is deferred until the data generation instance ``build`` method is executed.
 
-So until the ``build`` method is invoked, the data generation specification is in initialization mode.
+So until the :data:`~dbldatagen.data_generator.DataGenerator.build` method is invoked, the data generation
+specification is in initialization mode.
 
 Once ``build`` has been invoked, the data generation instance holds state about the data set generated.
 
 While ``build`` can be invoked a subsequent time, making further modifications to the definition post build before
 calling ``build`` again is not recommended. We recommend the use of the ``clone`` method to make a new data generation
 specification similar to an existing one if further modifications are needed.
 
-See :data:`~dbldatagen.data_generator.DataGenerator.clone` for further information.
+See the method :data:`~dbldatagen.data_generator.DataGenerator.clone` for further information.
 
 Adding columns to a data generation spec
 ----------------------------------------
@@ -55,18 +58,21 @@ specification.
 When building the data generation spec, the ``withSchema`` method may be used to add columns from an existing schema.
 This does _not_ prevent the use of ``withColumn`` to add new columns.
 
-Use ``withColumn`` to define a new column. This method takes a parameter to specify the data type.
-See :data:`~dbldatagen.data_generator.DataGenerator.withColumn`.
+| Use ``withColumn`` to define a new column. This method takes a parameter to specify the data type.
+| See the method :data:`~dbldatagen.data_generator.DataGenerator.withColumn` for further details.
 
 Use ``withColumnSpec`` to define how a column previously defined in a schema should be generated. This method does not
 take a data type property, but uses the data type information defined in the schema.
-See :data:`~dbldatagen.data_generator.DataGenerator.withColumnSpec`.
+See the method :data:`~dbldatagen.data_generator.DataGenerator.withColumnSpec` for further details.
 
-Use ``withColumnSpecs`` to define how multiple columns imported from a schema should be generated.
-As the pattern matching may inadvertently match an unintended column, it is permitted to override the specification
-added through this method by a subsequent call to ``withColumnSpec`` to change the definition of how a specific column
-should be generated
-See :data:`~dbldatagen.data_generator.DataGenerator.withColumnSpecs`.
+| Use ``withColumnSpecs`` to define how multiple columns imported from a schema should be generated.
+  As the pattern matching may inadvertently match an unintended column, it is permitted to override the specification
+  added through this method by a subsequent call to ``withColumnSpec`` to change the definition of how a specific column
+  should be generated.
+| See the method :data:`~dbldatagen.data_generator.DataGenerator.withColumnSpecs` for further details.
+
+Use the method :data:`~dbldatagen.data_generator.DataGenerator.withStructColumn` for simpler creation of struct and
+JSON valued columns.
 
 By default all columns are marked as being dependent on an internal ``id`` seed column.
 Use the ``baseColumn`` attribute to mark a column as being dependent on another column or set of columns.
@@ -85,6 +91,7 @@ Use of the base column attribute has several effects:
 
 If you need to generate a field with the same name as the seed column (by default `id`), you may override
 the default seed column name in the constructor of the data generation spec through the use of the
+``seedColumnName`` parameter.
 
 
 Note that Spark SQL is case insensitive with respect to column names.
@@ -127,6 +134,12 @@ For example, the following code will generate rows with varying numbers of synth
 
     df = ds.build()
 
+| The helper method ``withStructColumn`` of the ``DataGenerator`` class enables simpler definition of structured
+  and JSON valued columns.
+| See the documentation for the method :data:`~dbldatagen.data_generator.DataGenerator.withStructColumn` for
+  further details.
+
+
 The mechanics of column data generation
 ---------------------------------------
 The data set is generated when the ``build`` method is invoked on the data generation instance.
@@ -168,3 +181,4 @@ This has several implications:
 However it does not reorder the building sequence if there is a reference to a column that will be built later in the
 SQL expression.
 To enforce the dependency, you must use the `baseColumn` attribute to indicate the dependency.
+
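The deferred-build and clone semantics this page describes (mutations accumulate until ``build``; ``clone`` forks a spec for further changes) can be illustrated with a toy builder. This is purely illustrative, using hypothetical names, and is not the dbldatagen `DataGenerator` class:

```python
import copy


class SpecBuilder:
    """Toy sketch of the deferred-build pattern described in the docs:
    column definitions accumulate until build(); clone() forks the spec
    so further modifications don't disturb the already-built original."""

    def __init__(self):
        self.columns = []
        self.built = False

    def withColumn(self, name):
        self.columns.append(name)
        return self                     # return self for method chaining

    def clone(self):
        new_spec = copy.deepcopy(self)  # independent copy of the definition
        new_spec.built = False          # clone starts in initialization mode
        return new_spec

    def build(self):
        self.built = True               # generation happens only here
        return list(self.columns)
```

The key design point mirrored here is that every `with...` call returns the instance, so specifications read as one fluent chain, while `clone()` gives a safe path for post-build modifications.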
docs/source/generating_json_data.rst (17 additions, 2 deletions)

@@ -12,7 +12,7 @@ Generating JSON data
 There are several methods for generating JSON data:
 
 - Generate a dataframe and save it as JSON will generate full data set as JSON
-- Generate JSON valued fields using SQL functions such as `named_struct` and `to_json`
+- Generate JSON valued fields using SQL functions such as `named_struct` and `to_json`.
 
 Writing dataframe as JSON data
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -178,7 +178,8 @@ written as:
                   expr="named_struct('event_type', event_type, 'event_ts', event_ts)",
                   baseColumn=['event_type', 'event_ts'])
 
-To simplify the specification of struct valued columns, the defined value of `INFER_DATATYPE` can be used in place of
+
+To simplify the specification of struct valued columns, the defined value of `INFER_DATATYPE` can be used in place of
 the datatype when the `expr` attribute is specified. This will cause the datatype to be inferred from the expression.
 
 In this case, the previous code would be written as follows:
@@ -191,6 +192,20 @@ In this case, the previous code would be written as follows:
 
 The helper method ``withStructColumn`` can also be used to simplify the specification of struct valued columns.
 
+Using this method, the previous code can be written as one of the following options:
+
+.. code-block:: python
+
+   # Use either form to create the struct valued field
+   .withStructColumn("event_info1", fields=['event_type', 'event_ts'])
+   .withStructColumn("event_info2", fields={'event_type': 'event_type',
+                                            'event_ts': 'event_ts'})
+
+In the case of the second variant, the expression following the struct field name can be any arbitrary SQL string. It
+can also generate JSON for the same definition.
+
+See the following documentation for more details: :data:`~dbldatagen.data_generator.DataGenerator.withStructColumn`
+
 Generating JSON valued fields
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
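The page pairs `named_struct` with `to_json` for JSON valued fields; with ``asJson=True``, a struct definition conceptually becomes a JSON string column. A hedged sketch of that wrapping for the dict form of the fields spec (an assumption about the generated SQL, not the library's actual code):

```python
def json_struct_expr(fields):
    """Illustrative sketch (not dbldatagen code): build a
    to_json(named_struct(...)) Spark SQL expression from a dict
    fields spec, as an asJson=True struct column conceptually would."""
    # dict keys are field names; values are SQL expressions for the values
    args = ", ".join(f"'{name}', {expr}" for name, expr in fields.items())
    return f"to_json(named_struct({args}))"
```

Evaluated by Spark SQL, such an expression renders each row's struct as a JSON string, which is what distinguishes a JSON valued field from a plain struct valued field.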