
Commit dad4372

Update developer tooling (#320)
* Initial commit
* Add project file
* Fix streaming test
* Fix streaming test
* Fix actions
* Fix actions
* Fix project file
* Fix streaming test
* Cover spark_singleton.py
* Update project file, makefile, and documentation
* Updated dependencies, makefile, and developer docs
* Update dependencies, tools, and tests
* Update actions
* Update actions
* Remove explicit py4j dependency
1 parent 2beb541 commit dad4372

23 files changed (+692, -2110 lines changed)

.github/dependabot.yml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+version: 2
+updates:
+  - package-ecosystem: "pip"
+    directory: "/"
+    schedule:
+      interval: "daily"
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "daily"

.github/workflows/push.yml

Lines changed: 34 additions & 24 deletions
@@ -7,12 +7,24 @@ on:
     branches: [master]
 
 jobs:
+  fmt:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4.2.2
+
+      - name: Format files
+        run: make dev fmt
+
+      - name: Fail on differences
+        run: git diff --exit-code
+
   tests:
     # Ubuntu latest no longer installs Python 3.9 by default so install it
-    runs-on: ubuntu-22.04
+    runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@v4
+        uses: actions/checkout@v4.2.2
        with:
          fetch-depth: 0
 
@@ -26,35 +38,33 @@ jobs:
      #     key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
      #     restore-keys: |
      #       ${{ runner.os }}-go-
-      - name: Set Java 8
-        run: |
-          sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
-          java -version
 
-      - name: Set up Python 3.10.12
-        uses: actions/setup-python@v5
+      - name: Set up JDK 17
+        uses: actions/setup-java@v4
        with:
-          python-version: '3.10.12'
-          cache: 'pipenv'
-
-      - name: Check Python version
-        run: python --version
+          distribution: 'temurin' # Can also use 'zulu', 'adopt', etc.
+          java-version: '17'
 
-      - name: Install pip
-        run: python -m pip install --upgrade pip
+      - name: Get Java version
+        run: java -version
 
-      - name: Install
-        run: pip install pipenv
+      #- name: Set Java 8
+      #  run: |
+      #    sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
+      #    java -version
 
-      - name: Install dependencies
-        run: pipenv install --dev
+      - name: Install Python
+        uses: actions/setup-python@v5
+        with:
+          cache: 'pip'
+          cache-dependency-path: '**/pyproject.toml'
+          python-version: '3.10'
 
-      - name: Lint
-        run: |
-          pipenv run prospector --profile prospector.yaml
+      - name: Install Hatch
+        run: pip install hatch
 
-      - name: Run tests
-        run: make test
+      - name: Run unit tests
+        run: make dev test
 
       - name: Publish test coverage to coverage site
         uses: codecov/codecov-action@v4

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -36,3 +36,15 @@ docs/source/reference/api/*.rst
 .coverage
 htmlcov/
 .coverage.xml
+
+# IDE-specific folders — prevent local/editor config files from polluting source control.
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+.idea/
+# Cursor IDE
+# Cursor is an AI-powered code editor. The .cursor/ directory contains IDE-specific
+# settings and configurations similar to other IDEs.
+.cursor/

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -7,13 +7,22 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 
 #### Fixed
 * Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime
+* Refactored `DataAnalyzer` and `BasicStockTickerProvider` to comply with ANSI SQL standards
+* Removed internal modification of `SparkSession`
 
 #### Changed
 * Changed base Databricks runtime version to DBR 13.3 LTS (based on Apache Spark 3.4.1) - minimum supported version
   of Python is now 3.10.12
+* Updated build tooling to use [hatch](https://hatch.pypa.io/latest/)
+* Moved dependencies and tool configuration to [pyproject.toml](pyproject.toml)
+* Removed dependencies provided by the Databricks Runtime
+* Updated Git actions
+* Updated [makefile](makefile)
+* Updated [CONTRIBUTING.md](CONTRIBUTING.md)
 
 #### Added
 * Added support for serialization to/from JSON format
+* Added Ruff and mypy tooling
 
 
 ### Version 0.4.0 Hotfix 2
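The changelog above mentions new support for serializing generation specs to and from JSON, but that API is not shown in this diff. Below is a minimal, hypothetical sketch of what a round trip could look like: the `saveToJson` / `loadFromJson` method names are placeholders for illustration only, while `SparkSingleton`, `DataGenerator`, `withColumn`, and `build` are existing dbldatagen calls.

```python
# Hypothetical sketch: saveToJson / loadFromJson are placeholder names, not confirmed API.
import dbldatagen as dg

spark = dg.SparkSingleton.getLocalInstance("json serialization example")

# Build a simple generation spec using the existing builder API.
spec = (dg.DataGenerator(sparkSession=spark, name="example_spec", rows=1000)
        .withColumn("value", "long", minValue=1, maxValue=1000)
        .withColumn("category", "string", values=["a", "b", "c"]))

json_text = spec.saveToJson()                         # placeholder: serialize the spec to JSON text
restored = dg.DataGenerator.loadFromJson(json_text)   # placeholder: rebuild the spec from JSON text
restored.build().show(5)
```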

CONTRIBUTING.md

Lines changed: 58 additions & 71 deletions
@@ -11,68 +11,45 @@ state this explicitly, by submitting any copyrighted material via pull request,
 other means you agree to license the material under the project's Databricks license and
 warrant that you have the legal authority to do so.
 
-# Building the code
+# Development Setup
 
-## Package Dependencies
-See the contents of the file `python/require.txt` to see the Python package dependencies.
-Dependent packages are not installed automatically by the `dbldatagen` package.
+## Python Compatibility
 
-## Python compatibility
+The code supports Python 3.10+ and has been tested with Python 3.10 and later.
 
-The code has been tested with Python 3.9.21 and later.
+## Quick Start
 
-## Checking your code for common issues
+```bash
+# Install development dependencies
+make dev
 
-Run `make dev-lint` from the project root directory to run various code style checks.
-These are based on the use of `prospector`, `pylint` and related tools.
+# Format and lint code
+make fmt   # Format with ruff and fix issues
+make lint  # Check code quality
 
-## Setting up your build environment
-Run `make buildenv` from the root of the project directory to setup a `pipenv` based build environment.
+# Run tests
+make test  # Run tests
 
-Run `make create-dev-env` from the root of the project directory to
-set up a conda based virtualized Python build environment in the project directory.
-
-You can use alternative build virtualization environments or simply install the requirements
-directly in your environment.
-
-
-## Build steps
+# Build package
+make build  # Build with modern build system
+```
 
-Our recommended mechanism for building the code is to use a `conda` or `pipenv` based development process.
+## Development Tools
 
-But it can be built with any Python virtualization environment.
+All development tools are configured in `pyproject.toml`.
 
-### Spark dependencies
-The builds have been tested against Apache Spark 3.4.1.
-The Databricks runtimes use the Azul Zulu version of OpenJDK 8 and we have used these in local testing.
-These are not installed automatically by the build process, so you will need to install them separately.
+## Dependencies
 
-### Building with Conda
-To build with `conda`, perform the following commands:
-  - `make create-dev-env` from the main project directory to create your conda environment, if using
-  - activate the conda environment - e.g `conda activate dbl_testdatagenerator`
-  - install the necessary dependencies in your conda environment via `make install-dev-dependencies`
-
-  - use the following to build and run the tests with a coverage report
-    - Run `make dev-test-with-html-report` from the main project directory.
+All dependencies are defined in `pyproject.toml`:
 
-  - Use the following command to make the distributable:
-    - Run `make dev-dist` from the main project directory
-  - The resulting wheel file will be placed in the `dist` subdirectory
-
-### Building with Pipenv
-To build with `pipenv`, perform the following commands:
-  - `make buildenv` from the main project directory to create your conda environment, if using
-  - install the necessary dependencies in your conda environment via `make install-dev-dependencies`
-
-  - use the following to build and run the tests with a coverage report
-    - Run `make test-with-html-report` from the main project directory.
+- `[project.dependencies]` lists dependencies necessary to run the `dbldatagen` library
+- `[tool.hatch.envs.default]` lists the default environment necessary to develop, test, and build the `dbldatagen` library
 
-  - Use the following command to make the distributable:
-    - Run `make dist` from the main project directory
-  - The resulting wheel file will be placed in the `dist` subdirectory
+## Spark Dependencies
 
-The resulting build has been tested against Spark 3.4.1
+The builds have been tested against Spark 3.4.1+. This requires OpenJDK 1.8.56 or later version of Java 8.
+The Databricks runtimes use the Azul Zulu version of OpenJDK 8.
+These are not installed automatically by the build process.
 
 ## Creating the HTML documentation
 
@@ -82,7 +59,10 @@ The main html document will be in the file (relative to the root of the build di
 `./docs/docs/build/html/index.html`
 
 ## Building the Python wheel
-Run `make clean dist` from the main project directory.
+
+```bash
+make build  # Clean and build the package
+```
 
 # Testing
 
@@ -102,22 +82,15 @@ spark = dg.SparkSingleton.getLocalInstance("<name to flag spark instance>")
 
 The name used to flag the spark instance should be the test module or test class name.
 
-## Running unit / integration tests
-
-If using an environment with multiple Python versions, make sure to use virtual env or
-similar to pick up correct python versions. The make target `create`
-
-If necessary, set `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to correct versions of Python.
-
-To run the tests using a `conda` environment:
-  - Run `make dev-test` from the main project directory to run the unit tests.
+## Running Tests
 
-  - Run `make dev-test-with-html-report` to generate test coverage report in `htmlcov/inxdex.html`
+```bash
+# Run all tests
+make test
 
-To run the tests using a `pipenv` environment:
-  - Run `make test` from the main project directory to run the unit tests.
+If using an environment with multiple Python versions, make sure to use virtual env or similar to pick up correct python versions.
 
-  - Run `make test-with-html-report` to generate test coverage report in `htmlcov/inxdex.html`
+If necessary, set `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to correct versions of Python.
 
 # Using the Databricks Labs data generator
 The recommended method for installation is to install from the PyPi package
@@ -147,30 +120,44 @@ For example, the following code downloads the release V0.2.1
 
 > '%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl'
 
-# Coding Style
+# Code Quality and Style
+
+## Automated Formatting
+
+Code can be automatically formatted and linted with the following commands:
+
+```bash
+# Format code and fix issues automatically
+make fmt
+
+# Check code quality without making changes
+make lint
+```
 
-The code follows the Pyspark coding conventions.
+## Coding Conventions
 
-Basically it follows the Python PEP8 coding conventions - but method and argument names used mixed case starting
-with a lower case letter rather than underscores following Pyspark coding conventions.
+The code follows PySpark coding conventions:
+- Python PEP8 standards with some PySpark-specific adaptations
+- Method and argument names use mixed case starting with lowercase (following PySpark conventions)
+- Line length limit of 120 characters
 
-See https://legacy.python.org/dev/peps/pep-0008/
+See the [Python PEP8 Guide](https://peps.python.org/pep-0008/) for general Python style guidelines.
 
 # Github expectations
-When running the unit tests on Github, the environment should use the same environment as the latest Databricks
-runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 13.3 LTS onwards,
+When running the unit tests on GitHub, the environment should use the same environment as the latest Databricks
+runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 13.3 onwards,
 unit tests will be run on the environment corresponding to the latest LTS release.
 
 Libraries will use the same versions as the earliest supported LTS release - currently 13.3 LTS
 
 This means for the current build:
 
-- Use of Ubuntu 22.04.2 LTS for the test runner
+- Use of Ubuntu 22.04 for the test runner
 - Use of Java 8
 - Use of Python 3.10.12 when testing / building the image
 
 See the following resources for more information
 = https://docs.databricks.com/en/release-notes/runtime/15.4lts.html
-- https://docs.databricks.com/aws/en/release-notes/runtime/13.3lts
+- https://docs.databricks.com/en/release-notes/runtime/11.3lts.html
 - https://github.com/actions/runner-images/issues/10636
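For reference, here is a minimal sketch of a unit test that follows the testing guidance in the CONTRIBUTING.md changes above: the Spark session is obtained via `dg.SparkSingleton.getLocalInstance` and named after the test module. The fixture layout and column definitions are illustrative assumptions, not code from this commit.

```python
# Illustrative test sketch; column definitions and fixture scope are assumptions.
import pytest
import dbldatagen as dg


@pytest.fixture(scope="module")
def spark():
    # Name the Spark instance after the test module, as recommended above.
    return dg.SparkSingleton.getLocalInstance("test_basic_generation")


def test_generates_expected_row_count(spark):
    spec = (dg.DataGenerator(sparkSession=spark, name="test_data", rows=100, partitions=4)
            .withColumn("value", "long", minValue=1, maxValue=100)
            .withColumn("category", "string", values=["a", "b", "c"]))
    df = spec.build()
    assert df.count() == 100
```

Running `make dev test` (as wired into the updated workflow above) would pick up a test like this along with the rest of the suite.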

Pipfile

Lines changed: 0 additions & 31 deletions
This file was deleted.
