
Commit ce161bc

add documentation
1 parent 446257a commit ce161bc

File tree

14 files changed: +342 −143 lines changed

README.md

Lines changed: 79 additions & 119 deletions
@@ -1,163 +1,123 @@
-# Apache PySpark Custom Data Source Template
-
-This repository provides a template for creating a custom data source for Apache PySpark. It is designed to help developers extend PySpark’s data source API to support custom data ingestion and storage mechanisms.
+# pyspark-msgraph-source

+A **PySpark DataSource** for reading data from the **Microsoft Graph API**, providing easy access to resources such as **SharePoint list items**.

-## Motivation
-
-When developing custom PySpark data sources, I encountered several challenges that made the development process frustrating:
-
-1. **Environment Setup Complexity**: Setting up a development environment for PySpark data source development was unnecessarily complex, with multiple dependencies and version conflicts.
-
-2. **Test Data Management**: Managing test data and maintaining consistent test environments across different machines was challenging.
-
-3. **Debugging Issues**: The default setup made it difficult to debug custom data source code effectively, especially when dealing with Spark's distributed nature.
-
-4. **Documentation Gaps**: Existing documentation for custom data source development was scattered and often incomplete.
-
-This template repository aims to solve these pain points and provide a streamlined development experience.
-
+---

## Features
+- **Entra ID Authentication**: Securely authenticate with Microsoft Graph using `DefaultAzureCredential`, supporting local development and production seamlessly.

-- Pre-configured development environment
-- Ready-to-use test infrastructure
-- Example implementation
-- Automated tests setup
-- Debug-friendly configuration
-
-## Getting Started
-
-Follow these steps to set up and use this repository:
+- **Automatic Pagination Handling**: Fetches all paginated data from Microsoft Graph without manual intervention.

-### Prerequisites
+- **Dynamic Schema Inference**: Automatically detects the schema of the resource by sampling data, so you don't need to define it manually.

-- Docker
-- Visual Studio Code
-- Python 3.11
+- **Simple Configuration with `.option()`**: Configure resources and query parameters directly in your Spark read options.

-### Creating a Repository from This Template
+- **Zero External Ingestion Services**: No additional services such as Azure Data Factory or Logic Apps are needed; data is ingested into Spark directly from Microsoft Graph.

-To create a new repository based on this template:
+- **Extensible Resource Providers**: Add custom resource providers to support more Microsoft Graph endpoints as needed.

-1. Go to the [GitHub repository](https://github.com/geekwhocodes/pyspark-custom-datasource-template).
-2. Click the **Use this template** button.
-3. Select **Create a new repository**.
-4. Choose a repository name, visibility (public or private), and click **Create repository from template**.
-5. Clone your new repository:
+- **Pluggable Architecture**: Dynamically load resource providers without modifying core logic (see the sketch after this section).

-```sh
-git clone https://github.com/your-username/your-new-repository.git
-cd your-new-repository
-```
+- **Optimized for PySpark**: Designed to work natively with Spark's DataFrame API for big data processing.

-### Setup
+- **Secure by Design**: Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoded secrets.

-1. **Open the repository in Visual Studio Code:**
+---
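As an illustration of the pluggable provider idea mentioned above: the class and method names below are hypothetical, not the package's actual interface (which lives in `pyspark_msgraph_source.core.resource_provider`), but a minimal resource provider might look roughly like this:

```python
# Hypothetical sketch only -- not the connector's real base class.
from dataclasses import dataclass


@dataclass
class UsersResourceProvider:
    """Maps a hypothetical 'users' resource name to its Graph endpoint."""
    name: str = "users"
    endpoint: str = "https://graph.microsoft.com/v1.0/users"
    # Read options the provider is willing to forward as OData parameters.
    allowed_params: tuple = ("top", "select", "filter")

    def build_url(self, options: dict) -> str:
        """Turn Spark .option(...) values into OData query parameters."""
        params = [f"${k}={v}" for k, v in options.items() if k in self.allowed_params]
        return self.endpoint + ("?" + "&".join(params) if params else "")


# e.g. UsersResourceProvider().build_url({"top": 50})
# -> "https://graph.microsoft.com/v1.0/users?$top=50"
```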

-   ```sh
-   code .
-   ```
+## Installation

-2. **Build and start the development container:**
-
-   Open the command palette (Ctrl+Shift+P) and select `Remote-Containers: Reopen in Container`.
+```bash
+pip install pyspark-msgraph-source
+```

-3. **Initialize the environment:**
+---

-   The environment will be initialized automatically by running the `init-env.sh` script defined in the `devcontainer.json` file.
+## ⚡ Quickstart

-### Project Structure
+### 1. Authentication

-The project follows this structure:
+This package uses [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).
+Ensure you're authenticated:

+```bash
+az login
```
-.
-├── src/
-│   ├── fake_source/          # Default fake data source implementation
-│   │   ├── __init__.py
-│   │   ├── source.py         # Implementation of the fake data source
-│   │   ├── schema.py         # Schema definitions (if applicable)
-│   │   └── utils.py          # Helper functions (if needed)
-│   ├── tests/                # Unit tests for the custom data source
-│   │   ├── __init__.py
-│   │   ├── test_source.py    # Tests for the data source
-│   │   └── conftest.py       # Test configuration and fixtures
-├── .devcontainer/            # Development container setup files
-│   ├── Dockerfile
-│   ├── devcontainer.json
-│   ├── scripts/
-│   │   └── init-env.sh       # Initialization script for setting up the environment
-├── pyproject.toml            # Project dependencies and build system configuration
-├── README.md                 # Project documentation
-├── LICENSE                   # License file
-```
-
-### Usage
-
-By default, this template includes a **fake data source** that generates mock data. You can use it as-is or replace it with your own implementation.

-1. **Register the custom data source:**
-
-   ```python
-   from pyspark.sql import SparkSession
-   from fake_source.source import FakeDataSource
-
-   spark = SparkSession.builder.getOrCreate()
-   spark.dataSource.register(FakeDataSource)
-   ```
+Or set environment variables:
+```bash
+export AZURE_CLIENT_ID=<your-client-id>
+export AZURE_TENANT_ID=<your-tenant-id>
+export AZURE_CLIENT_SECRET=<your-client-secret>
+```
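A quick sanity check before running Spark jobs (a sketch using the `azure-identity` package directly, separate from this connector): confirm that a credential actually resolves and can obtain a Graph token.

```python
# DefaultAzureCredential tries environment variables, managed identity,
# the Azure CLI login, and other sources in order, using the first that works.
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Raises if no credential in the chain can authenticate.
token = credential.get_token("https://graph.microsoft.com/.default")
print("Token acquired; expires at:", token.expires_on)
```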

-2. **Read data using the custom data source:**
+### 2. Example Usage

-   ```python
-   df = spark.read.format("fake").load()
-   df.show()
-   ```
+```python
+from pyspark.sql import SparkSession

-3. **Run tests:**
+spark = SparkSession.builder \
+    .appName("MSGraphExample") \
+    .getOrCreate()

-   ```sh
-   pytest
-   ```
+from pyspark_msgraph_source.core.source import MSGraphDataSource
+spark.dataSource.register(MSGraphDataSource)

-### Customization
+df = spark.read.format("msgraph") \
+    .option("resource", "list_items") \
+    .option("site-id", "<YOUR_SITE_ID>") \
+    .option("list-id", "<YOUR_LIST_ID>") \
+    .option("top", 100) \
+    .option("expand", "fields") \
+    .load()

-To replace the fake data source with your own:
+df.show()
+```
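Because the connector infers the schema by sampling the resource, it is worth confirming what Spark derived before relying on specific columns:

```python
# Inspect the schema inferred from sampled Microsoft Graph responses.
df.printSchema()
```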

-1. **Rename the package folder:**
+---

-   ```sh
-   mv src/fake_source src/your_datasource_name
-   ```
+## Supported Resources

-2. **Update imports in `source.py` and other files:**
+| Resource     | Description           |
+|--------------|-----------------------|
+| `list_items` | SharePoint list items |
+| *(more coming soon...)* |            |

-   ```python
-   from your_datasource_name.source import CustomDataSource
-   ```
+---

-3. **Update `pyproject.toml` to reflect the new package name.**
+## Development

-4. **Modify the schema and options in `source.py` to fit your use case.**
+Coming soon...

-### References
-1. [Microsoft Learn - PySpark custom data sources](https://learn.microsoft.com/en-us/azure/databricks/pyspark/datasources)
+---

-### License
+## Troubleshooting

-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+| Issue                          | Solution                                  |
+|--------------------------------|-------------------------------------------|
+| `ValueError: resource missing` | Add `.option("resource", "list_items")`   |
+| Empty dataframe                | Verify IDs, permissions, and access       |
+| Authentication failures        | Check Azure credentials and login status  |
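For authentication failures, one quick check (assuming the Azure CLI is your credential source) is to confirm the CLI session is still valid:

```bash
# Shows the active account and subscription; fails if the login has expired.
az account show --output table
```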

-### Contact
+---

-For issues and questions, please use the GitHub Issues section.
+## 📄 License

+[MIT License](LICENSE)

-### Need Help Setting Up a Data Intelligence Platform with Databricks?
-If you need expert guidance on setting up a modern data intelligence platform using Databricks, we can help. Our consultancy specializes in:
+---

-- Custom data source development for Databricks and Apache Spark
-- Optimizing ETL pipelines for performance and scalability
-- Data governance and security using Unity Catalog
-- Building ML & AI solutions on Databricks
+## 📚 Resources

-🚀 [Contact us](https://www.linkedin.com/in/geekwhocodes/) for a consultation and take your data platform to the next level.
+- [Microsoft Graph API](https://learn.microsoft.com/en-us/graph/overview)
+- [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential)

docs/api/core/async-iterator.md

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-# Core Engine
+# Async To Sync Iterator

::: pyspark_msgraph_source.core.async_iterator

docs/api/core/client.md

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-# Core Engine
+# Base Client

::: pyspark_msgraph_source.core.base_client

docs/api/core/models.md

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-# Core Engine
+# Core Models

::: pyspark_msgraph_source.core.models

docs/api/core/resource-provider.md

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-# Core Engine
+# Resource Provider

::: pyspark_msgraph_source.core.resource_provider

docs/api/core/source.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Source
+
+::: pyspark_msgraph_source.core.source

docs/api/core/utils.md

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-# Core Engine
+# Utils

::: pyspark_msgraph_source.core.utils

docs/api/resources/index.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# Available Resources
+
+This page lists the Microsoft Graph resources currently supported by the `pyspark-msgraph-source` connector.
+
+---
+
+## Supported Resources
+
+| Resource Name | Description                            | Read more                      |
+|---------------|----------------------------------------|--------------------------------|
+| `list_items`  | Retrieves items from a SharePoint list | [Configuration](list-items.md) |
+
+---
+
+## Adding New Resources
+
+Want to add support for more resources?
+Check out the [Contributing Guide](contributing.md) to learn how to extend the connector!
+
+---
+
+## Notes
+
+- Resources may require specific Microsoft Graph API permissions.
+- Pagination, authentication, and schema inference are handled automatically.
+
+---
+
+## Request New Resources
+
+Is your desired resource not listed here?
+Open an [issue](https://github.com/geekwhocodes/pyspark-msgraph-source/issues) to request it!

docs/api/resources/list-items.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# Resource - List Items
+
+
+::: pyspark_msgraph_source.resources.list_items

docs/getting-started.md

Whitespace-only changes.
