@@ -28,9 +28,9 @@ The following example illustrates generating data for specific ranges of values:
2828 dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
2929 partitions=4, randomSeedMethod="hash_fieldname")
3030 .withIdOutput()
31- .withColumn("code3", StringType(), values=['online', 'offline', 'unknown'])
32- .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True, percentNulls=0.05)
33- .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
31+ .withColumn("code3", "string", values=['online', 'offline', 'unknown'])
32+ .withColumn("code4", "string", values=['a', 'b', 'c'], random=True, percentNulls=0.05)
33+ .withColumn("code5", "string", values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
3434 )
3535
3636 Generating text from existing values
@@ -84,7 +84,7 @@ The following example illustrates its use:
8484 dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
8585 partitions=4, randomSeedMethod="hash_fieldname")
8686 .withIdOutput()
87- .withColumnSpec("sample_text", text=dg.ILText(paragraphs=(1, 4),
87+ .withColumn("sample_text", "string", text=dg.ILText(paragraphs=(1, 4),
8888 sentences=(2, 6)))
8989 )
9090
@@ -96,7 +96,12 @@ Using the general purpose text generator
9696
9797The ``template`` attribute allows specification of templated text generation.
9898
99- Here are some examples of its use to generate dummy email addresses, ip addressed and phone numbers
99+ .. note::
100+ The ``template`` option is shorthand for ``text=dg.TemplateGenerator(template=...)``
101+
102+ The generator accepts options that control how escapes are handled and that customize the word list;
103+ see the `TemplateGenerator` documentation for more details.
104+
100105
101106 .. code-block:: python
102107
@@ -105,27 +110,25 @@ Here are some examples of its use to generate dummy email addresses, ip addresse
105110 dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
106111 partitions=4, randomSeedMethod="hash_fieldname")
107112 .withIdOutput()
108- .withColumnSpec("email",
113+ .withColumn("email", "string",
109114 template=r'\w.\w@\w.com|\w@\w.co.u\k')
110- .withColumnSpec("ip_addr",
115+ .withColumn("ip_addr", "string",
111116 template=r'\n.\n.\n.\n')
112- .withColumnSpec("phone",
117+ .withColumn("phone", "string",
113118 template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')
119+
120+ # the following implements the same pattern as for `phone` but using the `TemplateGenerator` class
121+ .withColumn("phone2", "string",
122+ text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))
114123 )
115124
116125 df = df_spec.build()
117126 num_rows = df.count()
118127
119128 The implementation of the template expansion uses the underlying `TemplateGenerator` class.
120129
121- .. note ::
122- The ``template`` option is shorthand for ``text=dg.TemplateGenerator(template=...)``
123-
124- This can be specified in multiple modes - see the `TemplateGenerator` documentation for more details.
125-
126-
127130TemplateGenerator options
128- ---------------------------------------------
131+ -------------------------
129132
130133The template generator generates text from a template to allow for generation of synthetic credit card numbers,
131134VINs, IBANs and many other structured codes.
@@ -154,9 +157,27 @@ It uses the following special chars:
154157 W Insert a random uppercase word from the ipsum lorem word set. Always escaped
155158 ======== ======================================
156159
160+ In all other cases, the char itself is used.
161+
162+ The ``escapeSpecialChars`` setting determines how the template generator interprets the special chars.
163+
164+ If set to False (the default), a special char does not need to be escaped to have its special
165+ meaning, but it must be escaped to be treated as a literal char.
166+
167+ So the template ``r"\dr_\v"`` will generate the values ``"dr_0"`` ... ``"dr_999"`` when used via the template option
168+ and applied to the values zero to 999.
169+ Here the character `d` is escaped to avoid interpretation as a special character.
170+
171+ If set to True, then a special char only has its special meaning when preceded by an escape.
172+
173+ So the option `text=dg.TemplateGenerator(r'dr_\v', escapeSpecialChars=True)` will generate the values
174+ ``"dr_0"`` ... ``"dr_999"`` when applied to the values zero to 999.
175+
176+ The default of False conforms to earlier implementations, for backwards compatibility.
177+
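The two escape modes can be illustrated with a small, self-contained sketch. This is a toy expander written for this note, not the `TemplateGenerator` implementation: the function name ``expand_template`` and its parameters are hypothetical, it handles only the ``d`` (random digit) and ``\v`` (base value) special chars, and it assumes that ``\v`` must always be escaped.

```python
import random


def expand_template(template, base_value, escape_special_chars=False, rng=None):
    """Toy expansion supporting only `d` (random digit) and `\\v` (base value).

    Hypothetical helper for illustration; it is not part of dbldatagen.
    """
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        escaped = False
        # A backslash escapes the char that follows it.
        if ch == "\\" and i + 1 < len(template):
            escaped = True
            i += 1
            ch = template[i]
        if ch == "v" and escaped:
            # Assumption: `\v` inserts the base value in both modes.
            out.append(str(base_value))
        elif ch == "d" and escaped == escape_special_chars:
            # False mode: a bare `d` is special. True mode: only `\d` is special.
            out.append(str(rng.randint(0, 9)))
        else:
            # Everything else, including an escaped `d` in False mode, is literal.
            out.append(ch)
        i += 1
    return "".join(out)


print(expand_template(r"\dr_\v", 7))                            # dr_7
print(expand_template(r"dr_\v", 7, escape_special_chars=True))  # dr_7
print(expand_template("ddd-dd", 0, rng=random.Random(0)))       # five random digits around a dash
```

In the default mode the leading ``d`` has to be written as ``\d`` to stay literal, while with ``escape_special_chars=True`` the same output comes from the unescaped ``dr_\v``.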
157178.. note::
158- If escape is used and `` escapeSpecialChars `` is False, then the following
159- char is assumed to have no special meaning .
179+ Setting the argument `escapeSpecialChars=False` means that a special char does not need to be escaped to
180+ be treated as a special char, but it must be escaped to be treated as a literal char.
160181
161182 If the ``escapeSpecialChars`` option is set to True, then the following char only has its special
162183 meaning when preceded by an escape.
@@ -165,20 +186,74 @@ It uses the following special chars:
165186
166187 A special case exists for ``\\v``: if immediately followed by a digit 0-9, the underlying base value
167188 is interpreted as an array of values and the nth element is retrieved, where `n` is the digit specified.
168-
189+
169190 The ``escapeSpecialChars`` option is set to False by default for backwards compatibility.
170191
171192 To use the ``escapeSpecialChars`` option, use the variant
172- ``text=dg.TemplateGenerator(template=...) , escapeSpecialChars=True ``
193+ ``text=dg.TemplateGenerator(template=..., escapeSpecialChars=True)``
173194
174- In all other cases, the char itself is used.
175195
176- The setting of the ``escapeSpecialChars `` determines how templates generate data.
196+ Using a custom word list
197+ ^^^^^^^^^^^^^^^^^^^^^^^^
198+
199+ The template generator also allows specification of a custom word list. This is a list of words that can be
200+ used in the template generation. The default word list is the `ipsum lorem` word list.
201+
202+ While the `values` option allows for the specification of a list of categorical values, that list is transmitted as part of
203+ the generated SQL. Using the `TemplateGenerator` object with a custom word list allows for specification of much
204+ larger lists of possible values without the need to transmit them as part of the generated SQL.
205+
206+ For example, the following code snippet illustrates the use of a custom word list:
207+
208+ .. code-block:: python
209+
210+ import dbldatagen as dg
211+
212+ names = ['alpha', 'beta', 'gamma', 'lambda', 'theta']
213+
214+ df_spec = (
215+ dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
216+ partitions=4, randomSeedMethod="hash_fieldname")
217+ .withIdOutput()
218+ .withColumn("email", "string",
219+ template=r'\w.\w@\w.com|\w@\w.co.u\k')
220+ .withColumn("ip_addr", "string",
221+ template=r'\n.\n.\n.\n')
222+ .withColumn("phone", "string",
223+ template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')
224+
225+ # implements the same pattern as for `phone` but using the `TemplateGenerator` class
226+ .withColumn("phone2", "string",
227+ text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))
228+
229+ # uses a custom word list
230+ .withColumn("name", "string",
231+ text=dg.TemplateGenerator(r'\w \w|\w \w \w|\w \a. \w',
232+ escapeSpecialChars=True,
233+ extendedWordList=names))
234+ )
235+
236+ df = df_spec.build()
237+ display(df)
238+
239+ Here the `names` variable is a list of names that can be used in the template generation.
240+
241+ While this is a short list in this case, it could be a much larger list of names, either
242+ specified as a literal or read from another dataframe, file, table or other source.
243+
244+ As this list is not transmitted as part of the generated SQL, it allows for much larger lists of possible values.
245+
246+ Other forms of text value lookup
247+ --------------------------------
248+
249+ The `values` option and the `text` option with a `TemplateGenerator` instance allow for generation of
250+ data when the range of possible values is known.
177251
178- If set to False, then the template ``r"\\dr_\\v" `` will generate the values ``"dr_0" `` ... ``"dr_999" `` when applied
179- to the values zero to 999. This conforms to earlier implementations for backwards compatibility.
252+ But what about scenarios where the list of values is read from a different table or some other form of lookup?
180253
181- If set to True, then the template `` r"dr_\\v" `` will generate the values `` "dr_0" `` ... `` "dr_999" ``
182- when applied to the values zero to 999. This conforms to the preferred style going forward
254+ As the output of the data generation `build()` method is a regular PySpark DataFrame, it is possible to join the
255+ generated data with other data sources to generate the required data.
183256
257+ In these cases, the generator can be configured to produce lookup keys that can be used to join with the
258+ other data sources.
184259