
Commit dad4372

Update developer tooling (#320)
* Initial commit
* Add project file
* Fix streaming test
* Fix streaming test
* Fix actions
* Fix actions
* Fix project file
* Fix streaming test
* Cover spark_singleton.py
* Update project file, makefile, and documentation
* Updated dependencies, makefile, and developer docs
* Update dependencies, tools, and tests
* Update actions
* Update actions
* Remove explicit py4j dependency
1 parent 2beb541 commit dad4372

23 files changed (+692, -2110 lines changed)

.github/dependabot.yml

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+version: 2
+updates:
+  - package-ecosystem: "pip"
+    directory: "/"
+    schedule:
+      interval: "daily"
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "daily"

.github/workflows/push.yml

Lines changed: 34 additions & 24 deletions
@@ -7,12 +7,24 @@ on:
     branches: [master]
 
 jobs:
+  fmt:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4.2.2
+
+      - name: Format files
+        run: make dev fmt
+
+      - name: Fail on differences
+        run: git diff --exit-code
+
   tests:
     # Ubuntu latest no longer installs Python 3.9 by default so install it
-    runs-on: ubuntu-22.04
+    runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@v4
+        uses: actions/checkout@v4.2.2
        with:
          fetch-depth: 0
 
@@ -26,35 +38,33 @@ jobs:
      #     key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
      #     restore-keys: |
      #       ${{ runner.os }}-go-
-      - name: Set Java 8
-        run: |
-          sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
-          java -version
 
-      - name: Set up Python 3.10.12
-        uses: actions/setup-python@v5
+      - name: Set up JDK 17
+        uses: actions/setup-java@v4
        with:
-          python-version: '3.10.12'
-          cache: 'pipenv'
-
-      - name: Check Python version
-        run: python --version
+          distribution: 'temurin' # Can also use 'zulu', 'adopt', etc.
+          java-version: '17'
 
-      - name: Install pip
-        run: python -m pip install --upgrade pip
+      - name: Get Java version
+        run: java -version
 
-      - name: Install
-        run: pip install pipenv
+      #- name: Set Java 8
+      #  run: |
+      #    sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
+      #    java -version
 
-      - name: Install dependencies
-        run: pipenv install --dev
+      - name: Install Python
+        uses: actions/setup-python@v5
+        with:
+          cache: 'pip'
+          cache-dependency-path: '**/pyproject.toml'
+          python-version: '3.10'
 
-      - name: Lint
-        run: |
-          pipenv run prospector --profile prospector.yaml
+      - name: Install Hatch
+        run: pip install hatch
 
-      - name: Run tests
-        run: make test
+      - name: Run unit tests
+        run: make dev test
 
       - name: Publish test coverage to coverage site
         uses: codecov/codecov-action@v4

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -36,3 +36,15 @@ docs/source/reference/api/*.rst
 .coverage
 htmlcov/
 .coverage.xml
+
+# IDE-specific folders — prevent local/editor config files from polluting source control.
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+.idea/
+# Cursor IDE
+# Cursor is an AI-powered code editor. The .cursor/ directory contains IDE-specific
+# settings and configurations similar to other IDEs.
+.cursor/

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -7,13 +7,22 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 
 #### Fixed
 * Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime
+* Refactored `DataAnalyzer` and `BasicStockTickerProvider` to comply with ANSI SQL standards
+* Removed internal modification of `SparkSession`
 
 #### Changed
 * Changed base Databricks runtime version to DBR 13.3 LTS (based on Apache Spark 3.4.1) - minimum supported version
   of Python is now 3.10.12
+* Updated build tooling to use [hatch](https://hatch.pypa.io/latest/)
+* Moved dependencies and tool configuration to [pyproject.toml](pyproject.toml)
+* Removed dependencies provided by the Databricks Runtime
+* Updated Git actions
+* Updated [makefile](makefile)
+* Updated [CONTRIBUTING.md](CONTRIBUTING.md)
 
 #### Added
 * Added support for serialization to/from JSON format
+* Added Ruff and mypy tooling
 
 
 ### Version 0.4.0 Hotfix 2
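The changelog above mentions new support for serializing generation specs to and from JSON, but that API is not shown in this diff. Below is a minimal, hypothetical sketch of what a round trip could look like: the `saveToJson` / `loadFromJson` method names are placeholders for illustration only, while `SparkSingleton`, `DataGenerator`, `withColumn`, and `build` are existing dbldatagen calls.

```python
# Hypothetical sketch: saveToJson / loadFromJson are placeholder names, not confirmed API.
import dbldatagen as dg

spark = dg.SparkSingleton.getLocalInstance("json serialization example")

# Build a simple generation spec using the existing builder API.
spec = (dg.DataGenerator(sparkSession=spark, name="example_spec", rows=1000)
        .withColumn("value", "long", minValue=1, maxValue=1000)
        .withColumn("category", "string", values=["a", "b", "c"]))

json_text = spec.saveToJson()                         # placeholder: serialize the spec to JSON text
restored = dg.DataGenerator.loadFromJson(json_text)   # placeholder: rebuild the spec from JSON text
restored.build().show(5)
```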

CONTRIBUTING.md

Lines changed: 58 additions & 71 deletions
@@ -11,68 +11,45 @@ state this explicitly, by submitting any copyrighted material via pull request,
 other means you agree to license the material under the project's Databricks license and
 warrant that you have the legal authority to do so.
 
-# Building the code
+# Development Setup
 
-## Package Dependencies
-See the contents of the file `python/require.txt` to see the Python package dependencies.
-Dependent packages are not installed automatically by the `dbldatagen` package.
+## Python Compatibility
 
-## Python compatibility
+The code supports Python 3.10+ and has been tested with Python 3.10 and later.
 
-The code has been tested with Python 3.9.21 and later.
+## Quick Start
 
-## Checking your code for common issues
+```bash
+# Install development dependencies
+make dev
 
-Run `make dev-lint` from the project root directory to run various code style checks.
-These are based on the use of `prospector`, `pylint` and related tools.
+# Format and lint code
+make fmt   # Format with ruff and fix issues
+make lint  # Check code quality
 
-## Setting up your build environment
-Run `make buildenv` from the root of the project directory to setup a `pipenv` based build environment.
+# Run tests
+make test  # Run tests
 
-Run `make create-dev-env` from the root of the project directory to
-set up a conda based virtualized Python build environment in the project directory.
-
-You can use alternative build virtualization environments or simply install the requirements
-directly in your environment.
-
-
-## Build steps
+# Build package
+make build  # Build with modern build system
+```
 
-Our recommended mechanism for building the code is to use a `conda` or `pipenv` based development process.
+## Development Tools
 
-But it can be built with any Python virtualization environment.
+All development tools are configured in `pyproject.toml`.
 
-### Spark dependencies
-The builds have been tested against Apache Spark 3.4.1.
-The Databricks runtimes use the Azul Zulu version of OpenJDK 8 and we have used these in local testing.
-These are not installed automatically by the build process, so you will need to install them separately.
+## Dependencies
 
-### Building with Conda
-To build with `conda`, perform the following commands:
-  - `make create-dev-env` from the main project directory to create your conda environment, if using
-  - activate the conda environment - e.g `conda activate dbl_testdatagenerator`
-  - install the necessary dependencies in your conda environment via `make install-dev-dependencies`
-
-  - use the following to build and run the tests with a coverage report
-    - Run `make dev-test-with-html-report` from the main project directory.
+All dependencies are defined in `pyproject.toml`:
 
-  - Use the following command to make the distributable:
-    - Run `make dev-dist` from the main project directory
-  - The resulting wheel file will be placed in the `dist` subdirectory
-
-### Building with Pipenv
-To build with `pipenv`, perform the following commands:
-  - `make buildenv` from the main project directory to create your conda environment, if using
-  - install the necessary dependencies in your conda environment via `make install-dev-dependencies`
-
-  - use the following to build and run the tests with a coverage report
-    - Run `make test-with-html-report` from the main project directory.
+- `[project.dependencies]` lists dependencies necessary to run the `dbldatagen` library
+- `[tool.hatch.envs.default]` lists the default environment necessary to develop, test, and build the `dbldatagen` library
 
-  - Use the following command to make the distributable:
-    - Run `make dist` from the main project directory
-  - The resulting wheel file will be placed in the `dist` subdirectory
+## Spark Dependencies
 
-The resulting build has been tested against Spark 3.4.1
+The builds have been tested against Spark 3.4.1+. This requires OpenJDK 1.8.56 or later version of Java 8.
+The Databricks runtimes use the Azul Zulu version of OpenJDK 8.
+These are not installed automatically by the build process.
 
 ## Creating the HTML documentation
 
@@ -82,7 +59,10 @@ The main html document will be in the file (relative to the root of the build di
 `./docs/docs/build/html/index.html`
 
 ## Building the Python wheel
-Run `make clean dist` from the main project directory.
+
+```bash
+make build  # Clean and build the package
+```
 
 # Testing
 
@@ -102,22 +82,15 @@ spark = dg.SparkSingleton.getLocalInstance("<name to flag spark instance>")
 
 The name used to flag the spark instance should be the test module or test class name.
 
-## Running unit / integration tests
-
-If using an environment with multiple Python versions, make sure to use virtual env or
-similar to pick up correct python versions. The make target `create`
-
-If necessary, set `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to correct versions of Python.
-
-To run the tests using a `conda` environment:
-  - Run `make dev-test` from the main project directory to run the unit tests.
+## Running Tests
 
-  - Run `make dev-test-with-html-report` to generate test coverage report in `htmlcov/inxdex.html`
+```bash
+# Run all tests
+make test
 
-To run the tests using a `pipenv` environment:
-  - Run `make test` from the main project directory to run the unit tests.
+If using an environment with multiple Python versions, make sure to use virtual env or similar to pick up correct python versions.
 
-  - Run `make test-with-html-report` to generate test coverage report in `htmlcov/inxdex.html`
+If necessary, set `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to correct versions of Python.
 
 # Using the Databricks Labs data generator
 The recommended method for installation is to install from the PyPi package
@@ -147,30 +120,44 @@ For example, the following code downloads the release V0.2.1
 
 > '%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl'
 
-# Coding Style
+# Code Quality and Style
+
+## Automated Formatting
+
+Code can be automatically formatted and linted with the following commands:
+
+```bash
+# Format code and fix issues automatically
+make fmt
+
+# Check code quality without making changes
+make lint
+```
 
-The code follows the Pyspark coding conventions.
+## Coding Conventions
 
-Basically it follows the Python PEP8 coding conventions - but method and argument names used mixed case starting
-with a lower case letter rather than underscores following Pyspark coding conventions.
+The code follows PySpark coding conventions:
+- Python PEP8 standards with some PySpark-specific adaptations
+- Method and argument names use mixed case starting with lowercase (following PySpark conventions)
+- Line length limit of 120 characters
 
-See https://legacy.python.org/dev/peps/pep-0008/
+See the [Python PEP8 Guide](https://peps.python.org/pep-0008/) for general Python style guidelines.
 
 # Github expectations
-When running the unit tests on Github, the environment should use the same environment as the latest Databricks
-runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 13.3 LTS onwards,
+When running the unit tests on GitHub, the environment should use the same environment as the latest Databricks
+runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 13.3 onwards,
 unit tests will be run on the environment corresponding to the latest LTS release.
 
 Libraries will use the same versions as the earliest supported LTS release - currently 13.3 LTS
 
 This means for the current build:
 
-- Use of Ubuntu 22.04.2 LTS for the test runner
+- Use of Ubuntu 22.04 for the test runner
 - Use of Java 8
 - Use of Python 3.10.12 when testing / building the image
 
 See the following resources for more information
 = https://docs.databricks.com/en/release-notes/runtime/15.4lts.html
-- https://docs.databricks.com/aws/en/release-notes/runtime/13.3lts
+- https://docs.databricks.com/en/release-notes/runtime/11.3lts.html
 - https://github.com/actions/runner-images/issues/10636
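For reference, here is a minimal sketch of a unit test that follows the testing guidance in the CONTRIBUTING.md changes above: the Spark session is obtained via `dg.SparkSingleton.getLocalInstance` and named after the test module. The fixture layout and column definitions are illustrative assumptions, not code from this commit.

```python
# Illustrative test sketch; column definitions and fixture scope are assumptions.
import pytest
import dbldatagen as dg


@pytest.fixture(scope="module")
def spark():
    # Name the Spark instance after the test module, as recommended above.
    return dg.SparkSingleton.getLocalInstance("test_basic_generation")


def test_generates_expected_row_count(spark):
    spec = (dg.DataGenerator(sparkSession=spark, name="test_data", rows=100, partitions=4)
            .withColumn("value", "long", minValue=1, maxValue=100)
            .withColumn("category", "string", values=["a", "b", "c"]))
    df = spec.build()
    assert df.count() == 100
```

Running `make dev test` (as wired into the updated workflow above) would pick up a test like this along with the rest of the suite.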

Pipfile

Lines changed: 0 additions & 31 deletions
This file was deleted.
