
Commit d836665

Merge pull request #3 from geekwhocodes/feature/sub-packages

Feature/sub packages

2 parents 83916b3 + 8f65ad3, commit d836665

34 files changed: +1,565 −549 lines

.github/workflows/test.yml

Lines changed: 40 additions & 0 deletions
```yaml
name: Publish to Test PyPI

on:
  push:
    branches:
      - 'feature*'

jobs:
  test-and-publish:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install dependencies
        run: poetry install

      - name: Run tests
        run: poetry run pytest

      - name: Build the package
        run: poetry build

      - name: Publish to Test PyPI
        env:
          POETRY_PYPI_TOKEN_TESTPYPI: ${{ secrets.TEST_PYPI_TOKEN }}
        run: |
          poetry config repositories.testpypi https://test.pypi.org/legacy/
          poetry publish -r testpypi --build
```

README.md

Lines changed: 89 additions & 115 deletions
````diff
@@ -1,163 +1,137 @@
-# Apache PySpark Custom Data Source Template
 
-This repository provides a template for creating a custom data source for Apache PySpark. It is designed to help developers extend PySpark's data source API to support custom data ingestion and storage mechanisms.
+# pyspark-msgraph-source
 
+A **PySpark DataSource** for reading data from the **Microsoft Graph API**, enabling easy access to resources such as **SharePoint list items**.
 
-## Motivation
-
-When developing custom PySpark data sources, I encountered several challenges that made the development process frustrating:
-
-1. **Environment Setup Complexity**: Setting up a development environment for PySpark data source development was unnecessarily complex, with multiple dependencies and version conflicts.
-
-2. **Test Data Management**: Managing test data and maintaining consistent test environments across different machines was challenging.
-
-3. **Debugging Issues**: The default setup made it difficult to debug custom data source code effectively, especially when dealing with Spark's distributed nature.
-
-4. **Documentation Gaps**: Existing documentation for custom data source development was scattered and often incomplete.
-
-This template repository aims to solve these pain points and provide a streamlined development experience.
+---
 
 ## Features
+- Entra ID authentication
+  Securely authenticate with Microsoft Graph using `DefaultAzureCredential`, supporting local development and production seamlessly.
 
-- Pre-configured development environment
-- Ready-to-use test infrastructure
-- Example implementation
-- Automated tests setup
-- Debug-friendly configuration
+- Automatic pagination handling
+  Fetches all paginated data from Microsoft Graph without manual intervention.
 
-## Getting Started
+- Dynamic schema inference
+  Detects the schema of a resource by sampling data, so you don't need to define it manually.
 
-Follow these steps to set up and use this repository:
+- Simple configuration with `.option()`
+  Configure resources and query parameters directly in your Spark read options.
 
-### Prerequisites
+- Zero external ingestion services
+  No additional services such as Azure Data Factory or Logic Apps are needed; ingest data into Spark directly from Microsoft Graph.
 
-- Docker
-- Visual Studio Code
-- Python 3.11
+- Extensible resource providers
+  Add custom resource providers to support more Microsoft Graph endpoints as needed.
 
-### Creating a Repository from This Template
+- Pluggable architecture
+  Dynamically load resource providers without modifying core logic.
 
-To create a new repository based on this template:
+- Optimized for PySpark
+  Designed to work natively with Spark's DataFrame API for big data processing.
 
-1. Go to the [GitHub repository](https://github.com/geekwhocodes/pyspark-custom-datasource-template).
-2. Click the **Use this template** button.
-3. Select **Create a new repository**.
-4. Choose a repository name, visibility (public or private), and click **Create repository from template**.
-5. Clone your new repository:
+- Secure by design
+  Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoded sensitive data.
 
-   ```sh
-   git clone https://github.com/your-username/your-new-repository.git
-   cd your-new-repository
-   ```
+---
 
-### Setup
+## Installation
 
-1. **Open the repository in Visual Studio Code:**
-
-   ```sh
-   code .
-   ```
-
-2. **Build and start the development container:**
-
-   Open the command palette (Ctrl+Shift+P) and select `Remote-Containers: Reopen in Container`.
+```bash
+pip install pyspark-msgraph-source
+```
 
-3. **Initialize the environment:**
+---
 
-   The environment is initialized automatically by the `init-env.sh` script referenced in `devcontainer.json`.
+## ⚡ Quickstart
 
-### Project Structure
+### 1. Authentication
 
-The project follows this structure:
+This package uses [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).
+Ensure you're authenticated:
 
-```
-.
-├── src/
-│   ├── fake_source/          # Default fake data source implementation
-│   │   ├── __init__.py
-│   │   ├── source.py         # Implementation of the fake data source
-│   │   ├── schema.py         # Schema definitions (if applicable)
-│   │   └── utils.py          # Helper functions (if needed)
-│   ├── tests/                # Unit tests for the custom data source
-│   │   ├── __init__.py
-│   │   ├── test_source.py    # Tests for the data source
-│   │   └── conftest.py       # Test configuration and fixtures
-├── .devcontainer/            # Development container setup files
-│   ├── Dockerfile
-│   ├── devcontainer.json
-│   ├── scripts/
-│   │   ├── init-env.sh       # Initialization script for setting up the environment
-├── pyproject.toml            # Project dependencies and build system configuration
-├── README.md                 # Project documentation
-├── LICENSE                   # License file
+```bash
+az login
 ```
 
-### Usage
+Or set environment variables:
+```bash
+export AZURE_CLIENT_ID=<your-client-id>
+export AZURE_TENANT_ID=<your-tenant-id>
+export AZURE_CLIENT_SECRET=<your-client-secret>
+```
 
-By default, this template includes a **fake data source** that generates mock data. You can use it as-is or replace it with your own implementation.
+### 2. Example Usage
 
-1. **Register the custom data source:**
+```python
+from pyspark.sql import SparkSession
 
-   ```python
-   from pyspark.sql import SparkSession
-   from fake_source.source import FakeDataSource
+spark = SparkSession.builder \
+    .appName("MSGraphExample") \
+    .getOrCreate()
 
-   spark = SparkSession.builder.getOrCreate()
-   spark.dataSource.register(FakeDataSource)
-   ```
+from pyspark_msgraph_source.core.source import MSGraphDataSource
+spark.dataSource.register(MSGraphDataSource)
 
-2. **Read data using the custom data source:**
+df = spark.read.format("msgraph") \
+    .option("resource", "list_items") \
+    .option("site-id", "<YOUR_SITE_ID>") \
+    .option("list-id", "<YOUR_LIST_ID>") \
+    .option("top", 100) \
+    .option("expand", "fields") \
+    .load()
 
-   ```python
-   df = spark.read.format("fake").load()
-   df.show()
-   ```
+df.show()
 
-3. **Run tests:**
+# With an explicit schema
 
-   ```sh
-   pytest
-   ```
+df = spark.read.format("msgraph") \
+    .option("resource", "list_items") \
+    .option("site-id", "<YOUR_SITE_ID>") \
+    .option("list-id", "<YOUR_LIST_ID>") \
+    .option("top", 100) \
+    .option("expand", "fields") \
+    .schema("id string, Title string") \
+    .load()
 
-### Customization
+df.show()
 
-To replace the fake data source with your own:
+```
 
-1. **Rename the package folder:**
+---
 
-   ```sh
-   mv src/fake_source src/your_datasource_name
-   ```
+## Supported Resources
 
-2. **Update imports in `source.py` and other files:**
+| Resource     | Description           |
+|--------------|-----------------------|
+| `list_items` | SharePoint list items |
+| *(more coming soon)* |               |
 
-   ```python
-   from your_datasource_name.source import CustomDataSource
-   ```
+---
 
-3. **Update `pyproject.toml` to reflect the new package name.**
+## Development
 
-4. **Modify the schema and options in `source.py` to fit your use case.**
+Coming soon...
 
-### References
-1. [Microsoft Learn - PySpark custom data sources](https://learn.microsoft.com/en-us/azure/databricks/pyspark/datasources)
+---
 
-### License
+## Troubleshooting
 
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+| Issue                          | Solution                                 |
+|--------------------------------|------------------------------------------|
+| `ValueError: resource missing` | Add `.option("resource", "list_items")`  |
+| Empty dataframe                | Verify IDs, permissions, and access      |
+| Authentication failures        | Check Azure credentials and login status |
 
-### Contact
+---
 
-For issues and questions, please use the GitHub Issues section.
+## 📄 License
 
+[MIT License](LICENSE)
 
-### Need Help Setting Up a Data Intelligence Platform with Databricks?
-If you need expert guidance on setting up a modern data intelligence platform using Databricks, we can help. Our consultancy specializes in:
+---
 
-- Custom data source development for Databricks and Apache Spark
-- Optimizing ETL pipelines for performance and scalability
-- Data governance and security using Unity Catalog
-- Building ML & AI solutions on Databricks
+## 📚 Resources
 
-🚀 [Contact us](https://www.linkedin.com/in/geekwhocodes/) for a consultation and take your data platform to the next level.
+- [Microsoft Graph API](https://learn.microsoft.com/en-us/graph/overview)
+- [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential)
````
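Before wiring the connector into Spark, it can help to confirm that `DefaultAzureCredential` actually resolves a credential in your environment. This is a minimal sketch using the `azure-identity` package; the `/.default` scope is the standard way to request a Graph token carrying the identity's existing permissions:

```python
# Minimal sketch: confirm DefaultAzureCredential can obtain a Microsoft Graph
# token before pointing Spark at the connector. Assumes azure-identity is installed.
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# ".default" requests whatever Graph permissions the identity already holds.
token = credential.get_token("https://graph.microsoft.com/.default")
print("Token acquired; expires (epoch seconds):", token.expires_on)
```

If this fails, `az login` or the three `AZURE_*` environment variables from the Quickstart are the first things to check.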

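The README advertises extensible resource providers while the Development section is still marked "coming soon". As a rough sketch of the idea only: the class shape, method name, and option keys below are hypothetical and not the package's documented API. Conceptually, a provider translates Spark read options into a Microsoft Graph endpoint URL:

```python
# Hypothetical sketch of what a resource provider could look like; the class
# shape, method name, and option keys are assumptions, not the package's
# documented API.
from dataclasses import dataclass


@dataclass
class UsersProvider:
    """Maps spark.read options onto the Microsoft Graph /users endpoint."""

    resource_name: str = "users"

    def build_url(self, options: dict) -> str:
        # "$top" and "$select" are standard Microsoft Graph query parameters;
        # values would arrive via .option("top", ...) and .option("select", ...).
        params = []
        if "top" in options:
            params.append(f"$top={options['top']}")
        if "select" in options:
            params.append(f"$select={options['select']}")
        query = "?" + "&".join(params) if params else ""
        return f"https://graph.microsoft.com/v1.0/users{query}"


# Example: the URL built for .option("top", 100)
print(UsersProvider().build_url({"top": 100}))
```

Until the development guide lands, treat this as an illustration of the concept rather than a recipe.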
docs/api/core/async-iterator.md

Lines changed: 3 additions & 0 deletions
# Async To Sync Iterator

::: pyspark_msgraph_source.core.async_iterator
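The module name suggests a bridge between an async Microsoft Graph client and PySpark's synchronous reader interface. A generic illustration of that async-to-sync pattern follows; this is a minimal sketch of the technique, not the package's actual implementation:

```python
# Minimal sketch of the async-to-sync iterator pattern: drive an async
# iterator from synchronous code by running one __anext__ at a time.
import asyncio
from typing import AsyncIterator, Iterator, TypeVar

T = TypeVar("T")


def to_sync(source: AsyncIterator[T]) -> Iterator[T]:
    """Drain an async iterator from synchronous code, one item at a time."""
    loop = asyncio.new_event_loop()
    try:
        while True:
            try:
                # __anext__() returns a coroutine; run it to completion
                # to pull the next item out of the async iterator.
                yield loop.run_until_complete(source.__anext__())
            except StopAsyncIteration:
                break
    finally:
        loop.close()


async def _demo() -> AsyncIterator[int]:
    for i in range(3):
        yield i


print(list(to_sync(_demo())))  # [0, 1, 2]
```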

docs/api/core/client.md

Lines changed: 3 additions & 0 deletions
# Base Client

::: pyspark_msgraph_source.core.base_client

docs/api/core/models.md

Lines changed: 3 additions & 0 deletions
# Core Models

::: pyspark_msgraph_source.core.models

docs/api/core/resource-provider.md

Lines changed: 3 additions & 0 deletions
# Resource Provider

::: pyspark_msgraph_source.core.resource_provider

docs/api/core/source.md

Lines changed: 3 additions & 0 deletions
# Source

::: pyspark_msgraph_source.core.source

docs/api/core/utils.md

Lines changed: 3 additions & 0 deletions
# Utils

::: pyspark_msgraph_source.core.utils

docs/api/index.md

Lines changed: 14 additions & 0 deletions
# API Reference

Welcome to the API reference for `pyspark-msgraph-source`.

Below are the available modules and submodules:

## Core
- [Core Overview](core.md)

## Utils
- [Utils Helpers](utils.md)

## API Client
- [API Client](api_client.md)

docs/api/resources/index.md

Lines changed: 33 additions & 0 deletions
# Available Resources

This page lists the Microsoft Graph resources currently supported by the `pyspark-msgraph-source` connector.

---

## Supported Resources

| Resource Name | Description                            | Read more                      |
|---------------|----------------------------------------|--------------------------------|
| `list_items`  | Retrieves items from a SharePoint list | [Configuration](list-items.md) |

---

## Adding New Resources

Want to add support for more resources?
Check out the [Contributing Guide](contributing.md) to learn how to extend the connector!

---

## Notes
- Resources may require specific Microsoft Graph API permissions.
- Pagination, authentication, and schema inference are handled automatically.

---

## Request New Resources

Is your desired resource not listed here?
Open an [issue](https://github.com/geekwhocodes/pyspark-msgraph-source/issues) to request it!
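For context on the note that pagination is handled automatically: Microsoft Graph returns results in pages, and each response carries the next page's URL in `@odata.nextLink`. A minimal sketch of the loop a client must run, using plain `requests` and a pre-acquired bearer token; this illustrates the Graph convention, not the connector's internal code:

```python
# Minimal sketch of Microsoft Graph pagination with plain requests; the
# connector does this for you, this only illustrates the convention.
import requests


def fetch_all(url: str, token: str) -> list[dict]:
    """Follow @odata.nextLink until every page has been read."""
    headers = {"Authorization": f"Bearer {token}"}
    items: list[dict] = []
    while url:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        payload = response.json()
        items.extend(payload.get("value", []))
        # Graph includes @odata.nextLink only while more pages remain.
        url = payload.get("@odata.nextLink")
    return items
```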
