Commit b28602d

Misc doc changes (#268)
* wip * wip * wip * wip * wip
1 parent 02d529e commit b28602d

3 files changed

+132
-25
lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -3,6 +3,11 @@
33
## Change History
44
All notable changes to the Databricks Labs Data Generator will be documented in this file.
55

6+
### Unreleased
7+
8+
#### Changed
9+
* Updated documentation for generating text data.
10+
611

712
### Version 0.3.6 Post 1
813

@@ -25,6 +30,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
2530
* This version marks the changing minimum version of Databricks runtime to 10.4 LTS and later releases.
2631
* While there are no known incompatibilities with Databricks 9.1 LTS, we will not test against this release
2732

33+
2834
### Version 0.3.5
2935

3036
#### Changed

dbldatagen/column_spec_options.py

Lines changed: 26 additions & 0 deletions
@@ -47,6 +47,9 @@ class ColumnSpecOptions(object):
4747
:param baseColumn: Either the string name of the base column, or a list of columns to use to
4848
control data generation. The option ``baseColumns`` is an alias for ``baseColumn``.
4949
50+
:param baseColumnType: Determines how the value is derived from the base column. Possible values are 'auto',
51+
'hash', 'raw_values', 'values'
52+
5053
:param values: List of discrete values for the column. Discrete values for the column can be strings, numbers
5154
or constants conforming to type of column
5255
@@ -105,6 +108,29 @@ class ColumnSpecOptions(object):
105108
106109
:param escapeSpecialChars: if True, require escape for all special chars in template
107110
111+
When a column's value is derived from the value of another column, the `baseColumn` and `baseColumnType` options
112+
can be used to control how the value is derived. The `baseColumn` option can be used to specify the name of the
113+
base column, and the `baseColumnType` option can be used to specify how the value is derived from the base column.
114+
115+
The following values are permitted for the `baseColumnType` option:
116+
117+
- 'auto': Automatically determine the base column type based on the column type of the base column.
118+
- 'hash': Use a hash of the base column(s) value to derive the value of the new column.
119+
- 'raw_values': Use the raw values of the base column to derive the value of the new column.
120+
- 'values': Use the values of the base column to derive the value of the new column.
121+
122+
The `baseColumn` option can be used to specify the name of the base column. If the `baseColumn` option is not
123+
specified, the value of the new column will be derived from the seed or `id` column.
124+
125+
The `baseColumnType` option is optional. If it is not specified, the value of the new column will be derived
126+
based on the column type of the base column.
127+
128+
The derivation from `raw_values` differs from `values` in that the `raw_values` option will use the raw values
129+
of the base column to derive the value of the new column, while the `values` option will use the values of the
130+
base column to derive the value of the new column after scaling to the range or implied range of the new column.
131+
132+
For example, a column with four categorical values, 'A', 'B', 'C', 'D', has an implied range of 0 .. 3.
133+
108134
.. note::
109135
If the `dataRange` parameter is specified as well as the `minValue`, `maxValue` or `step`,
110136
the results are undetermined.
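
The difference between the `raw_values`, `values` and `hash` settings described in the docstring above can be illustrated with a small pure-Python model. This is only a sketch, not the library's implementation: the helper `derive_index` is hypothetical, the scaling step is assumed here to behave like a modulo over the implied range, and SHA-256 stands in for whatever hash the library actually uses. The `auto` setting is not modelled.

```python
import hashlib

# Four categorical values, giving an implied range of 0 .. 3
CATEGORIES = ["A", "B", "C", "D"]


def derive_index(base_value, n_values, base_column_type="values"):
    """Hypothetical helper: map a base column value to an index.

    Assumptions for illustration only:
    - 'raw_values' passes the base value through unchanged
    - 'values' scales into the implied range (modelled as a modulo)
    - 'hash' hashes the base value, then scales the hash into the range
    """
    if base_column_type == "raw_values":
        return base_value
    if base_column_type == "values":
        return base_value % n_values
    if base_column_type == "hash":
        digest = hashlib.sha256(str(base_value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % n_values
    raise ValueError(f"unsupported base_column_type: {base_column_type}")


# With 'values', base values 0 .. 5 wrap around the four categories
wrapped = [CATEGORIES[derive_index(i, len(CATEGORIES))] for i in range(6)]
```

Under these assumptions, `raw_values` leaves a base value of 5 as 5, while `values` maps it into the range 0 .. 3 before the category is looked up.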

docs/source/textdata.rst

Lines changed: 100 additions & 25 deletions
@@ -28,9 +28,9 @@ The following example illustrates generating data for specific ranges of values:
2828
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
2929
partitions=4, randomSeedMethod="hash_fieldname")
3030
.withIdOutput()
31-
.withColumn("code3", StringType(), values=['online', 'offline', 'unknown'])
32-
.withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True, percentNulls=0.05)
33-
.withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
31+
.withColumn("code3", "string", values=['online', 'offline', 'unknown'])
32+
.withColumn("code4", "string", values=['a', 'b', 'c'], random=True, percentNulls=0.05)
33+
.withColumn("code5", "string", values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
3434
)
3535
3636
Generating text from existing values
@@ -84,7 +84,7 @@ The following example illustrates its use:
8484
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
8585
partitions=4, randomSeedMethod="hash_fieldname")
8686
.withIdOutput()
87-
.withColumnSpec("sample_text", text=dg.ILText(paragraphs=(1, 4),
87+
.withColumn("sample_text", "string", text=dg.ILText(paragraphs=(1, 4),
8888
sentences=(2, 6)))
8989
)
9090
@@ -96,7 +96,12 @@ Using the general purpose text generator
9696

9797
The ``template`` attribute allows specification of templated text generation.
9898

99-
Here are some examples of its use to generate dummy email addresses, ip addressed and phone numbers
99+
.. note ::
100+
The ``template`` option is shorthand for ``text=dg.TemplateGenerator(template=...)``
101+
102+
This can be specified with different options covering how escapes are handled and customizing the word list
103+
- see the `TemplateGenerator` documentation for more details.
104+
100105
101106
.. code-block:: python
102107
@@ -105,27 +110,25 @@ Here are some examples of its use to generate dummy email addresses, ip addresse
105110
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
106111
partitions=4, randomSeedMethod="hash_fieldname")
107112
.withIdOutput()
108-
.withColumnSpec("email",
113+
.withColumn("email", "string",
109114
template=r'\w.\w@\w.com|\w@\w.co.u\k')
110-
.withColumnSpec("ip_addr",
115+
.withColumn("ip_addr", "string",
111116
template=r'\n.\n.\n.\n')
112-
.withColumnSpec("phone",
117+
.withColumn("phone", "string",
113118
template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')
119+
120+
# the following implements the same pattern as for `phone` but using the `TemplateGenerator` class
121+
.withColumn("phone2", "string",
122+
text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))
114123
)
115124
116125
df = df_spec.build()
117126
num_rows=df.count()
118127
119128
The implementation of the template expansion uses the underlying `TemplateGenerator` class.
120129

121-
.. note ::
122-
The ``template`` option is shorthand for ``text=dg.TemplateGenerator(template=...)``
123-
124-
This can be specified in multiple modes - see the `TemplateGenerator` documentation for more details.
125-
126-
127130
TemplateGenerator options
128-
---------------------------------------------
131+
-------------------------
129132

130133
The template generator generates text from a template to allow for generation of synthetic credit card numbers,
131134
VINs, IBANs and many other structured codes.
@@ -154,9 +157,27 @@ It uses the following special chars:
154157
W Insert a random uppercase word from the ipsum lorem word set. Always escaped
155158
======== ======================================
156159

160+
In all other cases, the char itself is used.
161+
162+
The setting of the ``escapeSpecialChars`` option determines how the template generator interprets the special chars.
163+
164+
If set to False, which is the default, then the special char does not need to be escaped to have its special
165+
meaning. But the special char must be escaped to be treated as a literal char.
166+
167+
So the template ``r"\dr_\v"`` will generate the values ``"dr_0"`` ... ``"dr_999"`` when used via the template option
168+
and applied to the values zero to 999.
169+
Here the character `d` is escaped to avoid interpretation as a special character.
170+
171+
If set to True, then the special char only has its special meaning when preceded by an escape.
172+
173+
So the option `text=dg.TemplateGenerator(r'dr_\v', escapeSpecialChars=True)` will generate the values
174+
``"dr_0"`` ... ``"dr_999"`` when applied to the values zero to 999.
175+
176+
This conforms to earlier implementations for backwards compatibility.
177+
157178
.. note::
158-
If escape is used and ``escapeSpecialChars`` is False, then the following
159-
char is assumed to have no special meaning.
179+
Setting the argument `escapeSpecialChars=False` means that the special char does not need to be escaped to
180+
be treated as a special char. But it must be escaped to be treated as a literal char.
160181

161182
If the ``escapeSpecialChars`` option is set to True, then the following char only has its special
162183
meaning when preceded by an escape.
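
The two escape modes described in this hunk can be modelled with a toy expander. This is a sketch for illustration, not the `TemplateGenerator` implementation: the function `expand` is hypothetical and handles only two of the special chars, `d` (a random digit) and the always-escaped `\v` (the base value).

```python
import random


def expand(template, base_value, escape_special_chars=False, rng=None):
    """Toy model of template expansion covering only `d` and `\\v`."""
    rng = rng or random.Random(0)
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        escaped = False
        # A backslash marks the following char as escaped
        if ch == "\\" and i + 1 < len(template):
            escaped = True
            i += 1
            ch = template[i]
        if ch == "v" and escaped:
            # \v is always escaped and inserts the base value
            out.append(str(base_value))
        elif ch == "d" and escaped == escape_special_chars:
            # `d` is special when unescaped in the default mode,
            # or when escaped with escape_special_chars=True
            out.append(str(rng.randint(0, 9)))
        else:
            # in all other cases, the char itself is used
            out.append(ch)
        i += 1
    return "".join(out)


default_mode = expand(r"\dr_\v", 7)                             # escaped d is literal
escaped_mode = expand(r"dr_\v", 7, escape_special_chars=True)   # bare d is literal
```

Both calls produce the same `dr_` prefix followed by the base value, which mirrors the two `"dr_0"` ... `"dr_999"` examples in the documentation text above.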
@@ -165,20 +186,74 @@ It uses the following special chars:
165186

166187
A special case exists for ``\\v`` - if immediately followed by a digit 0 - 9, the underlying base value
167188
is interpreted as an array of values and the nth element is retrieved where `n` is the digit specified.
168-
189+
169190
The ``escapeSpecialChars`` is set to False by default for backwards compatibility.
170191

171192
To use the ``escapeSpecialChars`` option, use the variant
172-
``text=dg.TemplateGenerator(template=...), escapeSpecialChars=True``
193+
``text=dg.TemplateGenerator(template=..., escapeSpecialChars=True)``
173194

174-
In all other cases, the char itself is used.
175195

176-
The setting of the ``escapeSpecialChars`` determines how templates generate data.
196+
Using a custom word list
197+
^^^^^^^^^^^^^^^^^^^^^^^^
198+
199+
The template generator also allows specification of a custom word list. This is a list of words that can be
200+
used in the template generation. The default word list is the `ipsum lorem` word list.
201+
202+
While the `values` option allows for the specification of a list of categorical values, this is transmitted as part of
203+
the generated SQL. The use of the `TemplateGenerator` object with a custom word list allows for specification of much
204+
larger lists of possible values without the need to transmit them as part of the generated SQL.
205+
206+
For example, the following code snippet illustrates the use of a custom word list:
207+
208+
.. code-block:: python
209+
210+
import dbldatagen as dg
211+
212+
names = ['alpha', 'beta', 'gamma', 'lambda', 'theta']
213+
214+
df_spec = (
215+
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
216+
partitions=4, randomSeedMethod="hash_fieldname")
217+
.withIdOutput()
218+
.withColumn("email", "string",
219+
template=r'\w.\w@\w.com|\w@\w.co.u\k')
220+
.withColumn("ip_addr", "string",
221+
template=r'\n.\n.\n.\n')
222+
.withColumn("phone", "string",
223+
template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')
224+
225+
# implements the same pattern as for `phone` but using the `TemplateGenerator` class
226+
.withColumn("phone2", "string",
227+
text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))
228+
229+
# uses a custom word list
230+
.withColumn("name", "string",
231+
text=dg.TemplateGenerator(r'\w \w|\w \w \w|\w \a. \w',
232+
escapeSpecialChars=True,
233+
extendedWordList=names))
234+
)
235+
236+
df = df_spec.build()
237+
display(df)
238+
239+
Here the `names` variable is a list of names that can be used in the template generation.
240+
241+
While this is a short list in this case, it could be a much larger list of names either
242+
specified as a literal, or read from another dataframe, file, table or produced from another source.
243+
244+
As this is not transmitted as part of the generated SQL, it allows for much larger lists of possible values.
245+
246+
Other forms of text value lookup
247+
--------------------------------
248+
249+
The use of the `values` option and the `template` option with a `TemplateGenerator` instance allow for generation of
250+
data when the range of possible values is known.
177251

178-
If set to False, then the template ``r"\\dr_\\v"`` will generate the values ``"dr_0"`` ... ``"dr_999"`` when applied
179-
to the values zero to 999. This conforms to earlier implementations for backwards compatibility.
252+
But what about scenarios when the list of data is read from a different table or some other form of lookup?
180253

181-
If set to True, then the template ``r"dr_\\v"`` will generate the values ``"dr_0"`` ... ``"dr_999"``
182-
when applied to the values zero to 999. This conforms to the preferred style going forward
254+
As the output of the data generation `build()` method is a regular PySpark DataFrame, it is possible to join the
255+
generated data with other data sources to generate the required data.
183256

257+
In these cases, the generator can be specified to produce lookup keys that can be used to join with the
258+
other data sources.
184259
