# pyspark-msgraph-source

A **PySpark DataSource** for seamlessly reading data from the **Microsoft Graph API**, enabling easy access to resources like **SharePoint list items** and more.

---

## Features

- **Entra ID Authentication**: Securely authenticate with Microsoft Graph using `DefaultAzureCredential`, which works seamlessly for both local development and production.
- **Automatic Pagination Handling**: Fetches all paginated data from Microsoft Graph without manual intervention.
- **Dynamic Schema Inference**: Detects the schema of a resource by sampling data, so you don't need to define it manually.
- **Simple Configuration with `.option()`**: Configure resources and query parameters directly in your Spark read options.
- **Zero External Ingestion Services**: No additional services such as Azure Data Factory or Logic Apps are needed; data is ingested into Spark directly from Microsoft Graph.
- **Extensible Resource Providers**: Add custom resource providers to support more Microsoft Graph endpoints as needed (see the sketch after this list).
- **Pluggable Architecture**: Resource providers are loaded dynamically, without changes to core logic.
- **Optimized for PySpark**: Designed to work natively with Spark's DataFrame API for big data processing.
- **Secure by Design**: Credentials and secrets are handled using Azure Identity best practices, with no hardcoded sensitive data.
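
The provider abstraction is what makes the source extensible: each resource pairs a name with a Graph endpoint template and the reader options it requires. The following is a rough illustration only; `ResourceSpec`, `BaseResourceProvider`, and the field names are hypothetical, not this package's actual API.

```python
# Illustrative sketch only: these class names are hypothetical,
# not the actual classes shipped by pyspark-msgraph-source.
from dataclasses import dataclass


@dataclass
class ResourceSpec:
    name: str                     # value passed to .option("resource", ...)
    endpoint: str                 # Graph endpoint template with {placeholders}
    required_options: tuple = ()  # reader options the provider needs


class BaseResourceProvider:
    """Turns Spark reader options into a concrete Graph request URL."""

    spec: ResourceSpec

    def build_url(self, options: dict) -> str:
        missing = [o for o in self.spec.required_options if o not in options]
        if missing:
            raise ValueError(f"Missing required options: {missing}")
        # Map option keys like "site-id" onto template placeholders like {site_id}
        params = {k.replace("-", "_"): v for k, v in options.items()}
        return "https://graph.microsoft.com/v1.0" + self.spec.endpoint.format(**params)


class ListItemsProvider(BaseResourceProvider):
    """How the built-in `list_items` resource could be described."""
    spec = ResourceSpec(
        name="list_items",
        endpoint="/sites/{site_id}/lists/{list_id}/items",
        required_options=("site-id", "list-id"),
    )


provider = ListItemsProvider()
print(provider.build_url({"site-id": "<SITE_ID>", "list-id": "<LIST_ID>"}))
```

A provider like this could then be discovered dynamically (for example via entry points), which is what the pluggable-architecture bullet refers to.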

---

## Installation

```bash
pip install pyspark-msgraph-source
```

---

## ⚡ Quickstart

### 1. Authentication

This package uses [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).
Ensure you're authenticated:

```bash
az login
```

Or set environment variables:

```bash
export AZURE_CLIENT_ID=<your-client-id>
export AZURE_TENANT_ID=<your-tenant-id>
export AZURE_CLIENT_SECRET=<your-client-secret>
```
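
Before submitting a Spark job, it can help to confirm that the credential chain resolves at all. This smoke test uses only the public `azure-identity` API:

```python
from azure.identity import DefaultAzureCredential

# Requests a token for the Microsoft Graph default scope; this fails fast
# if no credential in the chain (environment variables, managed identity,
# Azure CLI login, ...) can authenticate.
credential = DefaultAzureCredential()
token = credential.get_token("https://graph.microsoft.com/.default")
print("Token acquired; expires at", token.expires_on)
```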

### 2. Example Usage

```python
from pyspark.sql import SparkSession

from pyspark_msgraph_source.core.source import MSGraphDataSource

spark = SparkSession.builder \
    .appName("MSGraphExample") \
    .getOrCreate()

# Register the data source with the session before reading
spark.dataSource.register(MSGraphDataSource)

# Read SharePoint list items, letting the source infer the schema
df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .load()

df.show()

# Read with an explicit schema instead of relying on inference
df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .schema("id string, Title string") \
    .load()

df.show()
```
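
The string passed to `.schema(...)` is standard Spark DDL. An explicit `StructType` is equivalent and can be easier to maintain for wider lists:

```python
from pyspark.sql.types import StringType, StructField, StructType

# Same columns as the DDL string "id string, Title string"
schema = StructType([
    StructField("id", StringType(), True),
    StructField("Title", StringType(), True),
])

df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .schema(schema) \
    .load()
```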

---

## Supported Resources

| Resource | Description |
|--------------|-----------------------------|
| `list_items` | SharePoint list items |
| *(more coming soon...)* | |
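
If you don't have the `site-id` and `list-id` values for `list_items` at hand, you can look them up from the Graph API directly. A small sketch using `requests`; the hostname `contoso.sharepoint.com` and site path `/sites/TeamSite` are placeholders for your own tenant:

```python
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://graph.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

# Resolve a site ID from its hostname and server-relative path
site = requests.get(
    "https://graph.microsoft.com/v1.0/sites/contoso.sharepoint.com:/sites/TeamSite",
    headers=headers,
    timeout=30,
).json()
print("site-id:", site["id"])

# Enumerate the site's lists to find the list ID
lists = requests.get(
    f"https://graph.microsoft.com/v1.0/sites/{site['id']}/lists",
    headers=headers,
    timeout=30,
).json()
for item in lists["value"]:
    print(item["id"], item["displayName"])
```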

---

## Development

Coming soon...

---

## Troubleshooting

| Issue | Solution |
|---------------------------------|----------------------------------------------|
| `ValueError: resource missing` | Add `.option("resource", "list_items")` to the read |
| Empty DataFrame | Verify the site and list IDs and that your identity has read access |
| Authentication failures | Check your Azure credentials and login status (`az login`) |

---

## 📄 License

[MIT License](LICENSE)

---

## 📚 Resources

- [Microsoft Graph API](https://learn.microsoft.com/en-us/graph/overview)
- [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential)