Commit 41fc07f

Modified files to build for Databricks runtime 11.3 LTS compliant versions
Modified the build and dependency files to target Databricks runtime 11.3 LTS compliant versions, as earlier runtimes will not be supported beyond March 2025. This allows Python 3.9 / Apache Spark 3.3.0 to be used as the minimum versions and brings in important updates to streaming (avoiding the need for version-specific unit tests in other commits).
1 parent 087bb02 commit 41fc07f
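
As context for the new version floor, here is a minimal sketch of a startup guard enforcing the Python 3.9 / Spark 3.3.0 minimums described above. It is not part of this commit, and the helper name is hypothetical:

    # Sketch only: enforce the DBR 11.3 LTS minimums (Python 3.9, Spark 3.3.0).
    # The helper name check_minimum_runtime is hypothetical, not project code.
    import sys
    import pyspark

    def check_minimum_runtime():
        if sys.version_info < (3, 9):
            raise RuntimeError(f"Python 3.9+ required, found {sys.version.split()[0]}")
        spark_version = tuple(int(p) for p in pyspark.__version__.split(".")[:3])
        if spark_version < (3, 3, 0):
            raise RuntimeError(f"Spark 3.3.0+ required, found {pyspark.__version__}")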

19 files changed (+94, -74 lines)

.github/workflows/push.yml

Lines changed: 2 additions & 2 deletions

@@ -31,10 +31,10 @@ jobs:
           sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
           java -version
 
-      - name: Set up Python 3.8
+      - name: Set up Python 3.9.21
         uses: actions/setup-python@v5
         with:
-          python-version: '3.8.12'
+          python-version: '3.9.21'
           cache: 'pipenv'
 
       - name: Check Python version

.github/workflows/release.yml

Lines changed: 2 additions & 2 deletions

@@ -24,10 +24,10 @@ jobs:
           sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
           java -version
 
-      - name: Set up Python 3.8
+      - name: Set up Python 3.9.21
         uses: actions/setup-python@v5
         with:
-          python-version: '3.8.12'
+          python-version: '3.9.21'
           cache: 'pipenv'
 
       - name: Check Python version

CHANGELOG.md

Lines changed: 7 additions & 0 deletions

@@ -8,6 +8,13 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 #### Fixed
 * Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime
 
+#### Changed
+* Changed base Databricks runtime version to DBR 11.3 LTS (based on Apache Spark 3.3.0)
+
+#### Added
+* Added support for serialization to/from JSON format
+
+
 ### Version 0.4.0 Hotfix 2
 
 #### Fixed
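
For the "serialization to/from JSON format" entry above, the general shape is a dict-based round trip like the generic sketch below. This is illustrative only and deliberately avoids naming dbldatagen's actual methods:

    # Generic illustration of the to-dict / from-dict JSON round trip that a
    # SerializableToDict-style mixin implies. This is NOT the dbldatagen API.
    import json

    class ExampleSpec:
        def __init__(self, name, minValue=0, maxValue=None):
            self.name, self.minValue, self.maxValue = name, minValue, maxValue

        def to_init_dict(self):
            # capture constructor arguments so the object can be rebuilt later
            return {"name": self.name, "minValue": self.minValue, "maxValue": self.maxValue}

        @classmethod
        def from_init_dict(cls, options):
            return cls(**options)

    spec = ExampleSpec("code1", minValue=1, maxValue=100)
    restored = ExampleSpec.from_init_dict(json.loads(json.dumps(spec.to_init_dict())))
    assert restored.maxValue == 100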

CONTRIBUTING.md

Lines changed: 7 additions & 10 deletions

@@ -19,10 +19,7 @@ Dependent packages are not installed automatically by the `dbldatagen` package.
 
 ## Python compatibility
 
-The code has been tested with Python 3.8.12 and later.
-
-Older releases were tested with Python 3.7.5 but as of this release, it requires the Databricks
-runtime 9.1 LTS or later.
+The code has been tested with Python 3.9.21 and later.
 
 ## Checking your code for common issues
 
@@ -46,7 +43,7 @@ Our recommended mechanism for building the code is to use a `conda` or `pipenv`
 But it can be built with any Python virtualization environment.
 
 ### Spark dependencies
-The builds have been tested against Spark 3.2.1. This requires the OpenJDK 1.8.56 or later version of Java 8.
+The builds have been tested against Spark 3.3.0. This requires the OpenJDK 1.8.56 or later version of Java 8.
 The Databricks runtimes use the Azul Zulu version of OpenJDK 8 and we have used these in local testing.
 These are not installed automatically by the build process, so you will need to install them separately.
 
@@ -75,7 +72,7 @@ To build with `pipenv`, perform the following commands:
 - Run `make dist` from the main project directory
 - The resulting wheel file will be placed in the `dist` subdirectory
 
-The resulting build has been tested against Spark 3.2.1
+The resulting build has been tested against Spark 3.3.0
 
 ## Creating the HTML documentation
 
@@ -161,19 +158,19 @@ See https://legacy.python.org/dev/peps/pep-0008/
 
 # Github expectations
 When running the unit tests on Github, the environment should use the same environment as the latest Databricks
-runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 10.4 onwards,
+runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 11.3 onwards,
 unit tests will be run on the environment corresponding to the latest LTS release.
 
-Libraries will use the same versions as the earliest supported LTS release - currently 10.4 LTS
+Libraries will use the same versions as the earliest supported LTS release - currently 11.3 LTS
 
 This means for the current build:
 
 - Use of Ubuntu 22.04 for the test runner
 - Use of Java 8
-- Use of Python 3.11
+- Use of Python 3.9.21 when testing / building the image
 
 See the following resources for more information
 - https://docs.databricks.com/en/release-notes/runtime/15.4lts.html
-- https://docs.databricks.com/en/release-notes/runtime/10.4lts.html
+- https://docs.databricks.com/en/release-notes/runtime/11.3lts.html
 - https://github.com/actions/runner-images/issues/10636

Pipfile

Lines changed: 8 additions & 8 deletions

@@ -10,7 +10,7 @@ sphinx = ">=2.0.0,<3.1.0"
 nbsphinx = "*"
 numpydoc = "==0.8"
 pypandoc = "*"
-ipython = "==7.31.1"
+ipython = "==7.32.0"
 pydata-sphinx-theme = "*"
 recommonmark = "*"
 sphinx-markdown-builder = "*"
 
@@ -19,13 +19,13 @@ prospector = "*"
 
 [packages]
 numpy = "==1.22.0"
-pyspark = "==3.1.3"
-pyarrow = "==4.0.1"
-wheel = "==0.38.4"
-pandas = "==1.2.4"
-setuptools = "==65.6.3"
-pyparsing = "==2.4.7"
+pyspark = "==3.3.0"
+pyarrow = "==7.0.0"
+wheel = "==0.37.0"
+pandas = "==1.3.4"
+setuptools = "==58.0.4"
+pyparsing = "==3.0.4"
 jmespath = "==0.10.0"
 
 [requires]
-python_version = "3.8.12"
+python_version = "3.9.21"
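
As a quick sanity check after creating an environment from this Pipfile, a small sketch (not part of the commit, purely illustrative) that verifies the active interpreter and key pins match:

    # Sketch: confirm the active environment matches the pins above.
    import platform
    import pandas
    import pyspark

    assert platform.python_version().startswith("3.9"), platform.python_version()
    assert pyspark.__version__ == "3.3.0", pyspark.__version__
    assert pandas.__version__ == "1.3.4", pandas.__version__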

README.md

Lines changed: 3 additions & 3 deletions

@@ -83,8 +83,8 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
 contains details of installation using alternative mechanisms.
 
 ## Compatibility
-The Databricks Labs Data Generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
-compatible with the Databricks runtime 10.4 LTS and later releases. For full Unity Catalog support,
+The Databricks Labs Data Generator framework can be used with Pyspark 3.3.0 and Python 3.9.21 or later. These are
+compatible with the Databricks runtime 11.3 LTS and later releases. For full Unity Catalog support,
 we recommend using Databricks runtime 13.2 or later (Databricks 13.3 LTS or above preferred)
 
 For full library compatibility for a specific Databricks Spark release, see the Databricks
 
@@ -155,7 +155,7 @@ The GitHub repository also contains further examples in the examples directory.
 
 ## Spark and Databricks Runtime Compatibility
 The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
-older LTS versions at least from 10.4 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
+older LTS versions at least from 11.3 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
 including `current` and `preview`.
 
 While we don't specifically drop support for older runtimes, changes in Pyspark APIs or

dbldatagen/column_generation_spec.py

Lines changed: 11 additions & 7 deletions

@@ -95,7 +95,7 @@ class ColumnGenerationSpec(SerializableToDict):
     # restrict spurious messages from java gateway
     logging.getLogger("py4j").setLevel(logging.WARNING)
 
-    def __init__(self, name, colType=None, minValue=0, maxValue=None, step=1, prefix='', random=False,
+    def __init__(self, name, colType=None, *, minValue=0, maxValue=None, step=1, prefix='', random=False,
                  distribution=None, baseColumn=None, randomSeed=None, randomSeedMethod=None,
                  implicit=False, omit=False, nullable=True, debug=False, verbose=False,
                  seedColumnName=DEFAULT_SEED_COLUMN,
 
@@ -529,18 +529,22 @@ def _setup_logger(self):
         else:
             self.logger.setLevel(logging.WARNING)
 
-    def _computeAdjustedRangeForColumn(self, colType, c_min, c_max, c_step, c_begin, c_end, c_interval, c_range,
+    def _computeAdjustedRangeForColumn(self, colType, c_min, c_max, c_step, *, c_begin, c_end, c_interval, c_range,
                                        c_unique):
         """Determine adjusted range for data column
         """
         assert colType is not None, "`colType` must be non-None instance"
 
         if type(colType) is DateType or type(colType) is TimestampType:
-            return self._computeAdjustedDateTimeRangeForColumn(colType, c_begin, c_end, c_interval, c_range, c_unique)
+            return self._computeAdjustedDateTimeRangeForColumn(colType, c_begin, c_end, c_interval,
+                                                               c_range=c_range,
+                                                               c_unique=c_unique)
         else:
-            return self._computeAdjustedNumericRangeForColumn(colType, c_min, c_max, c_step, c_range, c_unique)
+            return self._computeAdjustedNumericRangeForColumn(colType, c_min, c_max, c_step,
+                                                              c_range=c_range,
+                                                              c_unique=c_unique)
 
-    def _computeAdjustedNumericRangeForColumn(self, colType, c_min, c_max, c_step, c_range, c_unique):
+    def _computeAdjustedNumericRangeForColumn(self, colType, c_min, c_max, c_step, *, c_range, c_unique):
         """Determine adjusted range for data column
 
         Rules:
 
@@ -589,7 +593,7 @@ def _computeAdjustedNumericRangeForColumn(self, colType, c_min, c_max, c_step, c
 
         return result
 
-    def _computeAdjustedDateTimeRangeForColumn(self, colType, c_begin, c_end, c_interval, c_range, c_unique):
+    def _computeAdjustedDateTimeRangeForColumn(self, colType, c_begin, c_end, c_interval, *, c_range, c_unique):
         """Determine adjusted range for Date or Timestamp data column
         """
         effective_begin, effective_end, effective_interval = None, None, None
 
@@ -656,7 +660,7 @@ def _getUniformRandomSQLExpression(self, col_name):
         else:
             return "rand()"
 
-    def _getScaledIntSQLExpression(self, col_name, scale, base_columns, base_datatypes=None, compute_method=None,
+    def _getScaledIntSQLExpression(self, col_name, scale, base_columns, *, base_datatypes=None, compute_method=None,
                                    normalize=False):
         """ Get scaled numeric expression
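
The `*` markers added in the signatures above make the trailing parameters keyword-only, so accidental positional calls now fail fast instead of silently binding to the wrong parameter. A standalone sketch of the effect (not dbldatagen code):

    # Standalone illustration of the keyword-only pattern introduced above.
    def make_column(name, colType=None, *, minValue=0, maxValue=None):
        return name, colType, minValue, maxValue

    make_column("code1", "int", minValue=1, maxValue=100)  # OK: options passed by keyword
    # make_column("code1", "int", 1, 100)  # TypeError: too many positional arguments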

dbldatagen/data_analyzer.py

Lines changed: 2 additions & 2 deletions

@@ -92,7 +92,7 @@ def _displayRow(self, row):
 
         return ", ".join(results)
 
-    def _addMeasureToSummary(self, measureName, summaryExpr="''", fieldExprs=None, dfData=None, rowLimit=1,
+    def _addMeasureToSummary(self, measureName, *, summaryExpr="''", fieldExprs=None, dfData=None, rowLimit=1,
                              dfSummary=None):
         """ Add a measure to the summary dataframe
 
@@ -340,7 +340,7 @@ def _generatorDefaultAttributesFromType(cls, sqlType, colName=None, dataSummary=
         return result
 
     @classmethod
-    def _scriptDataGeneratorCode(cls, schema, dataSummary=None, sourceDf=None, suppressOutput=False, name=None):
+    def _scriptDataGeneratorCode(cls, schema, *, dataSummary=None, sourceDf=None, suppressOutput=False, name=None):
         """
         Generate outline data generator code from an existing dataframe

dbldatagen/data_generator.py

Lines changed: 5 additions & 5 deletions

@@ -76,7 +76,7 @@ class DataGenerator(SerializableToDict):
 
     # logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.NOTSET)
 
-    def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
+    def __init__(self, sparkSession=None, name=None, *, randomSeedMethod=None,
                  rows=1000000, startingId=0, randomSeed=None, partitions=None, verbose=False,
                  batchSize=None, debug=False, seedColumnName=DEFAULT_SEED_COLUMN,
                  random=False,
 
@@ -782,7 +782,7 @@ def _checkColumnOrColumnList(self, columns, allowId=False):
                             f" column `{columns}` must refer to defined column")
         return True
 
-    def withColumnSpec(self, colName, minValue=None, maxValue=None, step=1, prefix=None,
+    def withColumnSpec(self, colName, *, minValue=None, maxValue=None, step=1, prefix=None,
                        random=None, distribution=None,
                        implicit=False, dataRange=None, omit=False, baseColumn=None, **kwargs):
         """ add a column specification for an existing column
 
@@ -842,7 +842,7 @@ def hasColumnSpec(self, colName):
         """
         return colName in self._columnSpecsByName
 
-    def withColumn(self, colName, colType=StringType(), minValue=None, maxValue=None, step=1,
+    def withColumn(self, colName, colType=StringType(), *, minValue=None, maxValue=None, step=1,
                    dataRange=None, prefix=None, random=None, distribution=None,
                    baseColumn=None, nullable=True,
                    omit=False, implicit=False, noWarn=False,
 
@@ -1058,7 +1058,7 @@ def withStructColumn(self, colName, fields=None, asJson=False, **kwargs):
 
         return newDf
 
-    def _generateColumnDefinition(self, colName, colType=None, baseColumn=None,
+    def _generateColumnDefinition(self, colName, colType=None, baseColumn=None, *,
                                   implicit=False, omit=False, nullable=True, **kwargs):
         """ generate field definition and column spec
 
@@ -1591,7 +1591,7 @@ def scriptTable(self, name=None, location=None, tableFormat="delta", asHtml=Fals
 
         return results
 
-    def scriptMerge(self, tgtName=None, srcName=None, updateExpr=None, delExpr=None, joinExpr=None, timeExpr=None,
+    def scriptMerge(self, tgtName=None, srcName=None, *, updateExpr=None, delExpr=None, joinExpr=None, timeExpr=None,
                     insertExpr=None,
                     useExplicitNames=True,
                     updateColumns=None, updateColumnExprs=None,
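
After this change, option arguments to the public builder methods must be passed by keyword. A usage sketch in the style of the README examples (assumes an ambient `spark` session; the column options shown are illustrative):

    # Usage sketch: options such as minValue/maxValue are now keyword-only.
    import dbldatagen as dg

    ds = (
        dg.DataGenerator(sparkSession=spark, name="test_data", rows=1000, partitions=4)
        .withColumn("code1", "integer", minValue=100, maxValue=200)
        .withColumn("code2", "string", values=["a", "b", "c"], random=True)
    )
    df = ds.build()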

dbldatagen/text_generator_plugins.py

Lines changed: 2 additions & 2 deletions

@@ -69,7 +69,7 @@ class _FnCallContext:
     def __init__(self, txtGen):
         self.textGenerator = txtGen
 
-    def __init__(self, fn, init=None, initPerBatch=False, name=None, rootProperty=None):
+    def __init__(self, fn, *, init=None, initPerBatch=False, name=None, rootProperty=None):
         super().__init__()
         assert fn is not None or callable(fn), "Function must be provided with signature fn(context, oldValue)"
         assert init is None or callable(init), "Init function must be a callable function or lambda if passed"
 
@@ -284,7 +284,7 @@ class FakerTextFactory(PyfuncTextFactory):
 
     _defaultFakerTextFactory = None
 
-    def __init__(self, locale=None, providers=None, name="FakerText", lib=None,
+    def __init__(self, *, locale=None, providers=None, name="FakerText", lib=None,
                  rootClass=None):
 
         super().__init__(name)
