
Commit 18810dc: Fix readme with new dataflint version
Parent: 2be29d3

1 file changed: README.md (47 additions, 12 deletions)
@@ -60,10 +60,17 @@ See [Our Features](https://dataflint.gitbook.io/dataflint-for-spark/overview/our
 ### Scala
 
 Install DataFlint OSS via sbt:
+For Spark 3.X:
 ```sbt
-libraryDependencies += "io.dataflint" %% "spark" % "0.2.3"
+libraryDependencies += "io.dataflint" %% "spark" % "0.6.1"
 ```
 
+For Spark 4.X:
+```sbt
+libraryDependencies += "io.dataflint" %% "dataflint-spark4" % "0.6.1"
+```
+
+
 Then instruct spark to load the DataFlint OSS plugin:
 ```scala
 val spark = SparkSession
@@ -76,10 +83,20 @@ val spark = SparkSession
 ### PySpark
 Add these 2 configs to your pyspark session builder:
 
+For Spark 3.X:
+```python
+builder = pyspark.sql.SparkSession.builder
+...
+.config("spark.jars.packages", "io.dataflint:spark_2.12:0.6.1") \
+.config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
+...
+```
+
+For Spark 4.X:
 ```python
 builder = pyspark.sql.SparkSession.builder
 ...
-.config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.3") \
+.config("spark.jars.packages", "io.dataflint:dataflint-spark4_2.13:0.6.1") \
 .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
 ...
 ```
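Since the updated README now branches on the Spark major version, a small illustrative Python helper (hypothetical, not part of DataFlint's API) can sketch how the Maven coordinate introduced in this commit would be chosen:

```python
def dataflint_package(spark_version: str) -> str:
    """Return the DataFlint Maven coordinate for a given Spark version.

    The coordinates are the ones this commit introduces (0.6.1); the
    helper itself is an illustrative sketch, not part of DataFlint.
    """
    major = int(spark_version.split(".")[0])
    if major >= 4:
        # Spark 4.X artifact is published for Scala 2.13
        return "io.dataflint:dataflint-spark4_2.13:0.6.1"
    # Spark 3.X default artifact (Scala 2.12; a _2.13 build also exists)
    return "io.dataflint:spark_2.12:0.6.1"


print(dataflint_package("3.5.1"))  # io.dataflint:spark_2.12:0.6.1
print(dataflint_package("4.0.0"))  # io.dataflint:dataflint-spark4_2.13:0.6.1
```

The result would be passed to `.config("spark.jars.packages", ...)` exactly as in the snippets above.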
@@ -90,14 +107,22 @@ Alternatively, install DataFlint OSS with **no code change** as a spark ivy pack
 
 ```bash
 spark-submit
---packages io.dataflint:spark_2.12:0.2.3 \
+--packages io.dataflint:spark_2.12:0.6.1 \
+--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
+...
+```
+
+For Spark 4.X:
+```bash
+spark-submit
+--packages io.dataflint:dataflint-spark4_2.13:0.6.1 \
 --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
 ...
 ```
 
 ### Additional installation options
 
-* There is also support for scala 2.13, if your spark cluster is using scala 2.13 change package name to io.dataflint:spark_**2.13**:0.2.3
+* There is also support for scala 2.13, if your spark cluster is using scala 2.13 change package name to io.dataflint:spark_**2.13**:0.6.1
 * For more installation options, including for **python** and **k8s spark-operator**, see [Install on Spark docs](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark)
 * For installing DataFlint OSS in **spark history server** for observability on completed runs see [install on spark history server docs](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark-history-server)
 * For installing DataFlint OSS on **DataBricks** see [install on databricks docs](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks)
@@ -112,17 +137,27 @@ The plugin exposes an additional HTTP resoures for additional metrics not availa
 
 For more information, see [how it works docs](https://dataflint.gitbook.io/dataflint-for-spark/overview/how-it-works)
 
-## Medium Articles
+## Articles
+
+* [AWS engineering blog post featuring DataFlint - Centralize Apache Spark observability on Amazon EMR on EKS with external Spark History Server](https://aws.amazon.com/blogs/big-data/centralize-apache-spark-observability-on-amazon-emr-on-eks-with-external-spark-history-server/)
+
+* [Wix engineering blog post featuring DataFlint - How Wix Built the Ultimate Spark-as-a-Service Platform](https://www.wix.engineering/post/how-wix-built-the-ultimate-spark-as-a-service-platform-part1)
+
+* [Cloudera Community - How to integrated DataFlint in CDP](https://community.cloudera.com/t5/Community-Articles/How-to-integrated-DataFlint-in-CDP/ta-p/383681)
+
+* [Dataminded engineering blog post featuring DataFlint - Running thousands of Spark applications without losing your cool](https://medium.com/datamindedbe/running-thousands-of-spark-applications-without-losing-your-cool-969208a2d655)
+
+* [Data Engineering Weekly #156 - Featuring DataFlint](https://www.dataengineeringweekly.com/p/data-engineering-weekly-156)
 
-* [Fixing small files performance issues in Apache Spark using DataFlint OSS](https://medium.com/@menishmueli/fixing-small-files-performance-issues-in-apache-spark-using-dataflint-49ffe3eb755f)
+* [Medium Blog Post - Fixing small files performance issues in Apache Spark using DataFlint](https://medium.com/@menishmueli/fixing-small-files-performance-issues-in-apache-spark-using-dataflint-49ffe3eb755f)
 
-* [Are Long Filter Conditions in Apache Spark Leading to Performance Issues?](https://medium.com/@menishmueli/are-long-filter-conditions-in-apache-spark-leading-to-performance-issues-0b5bc6c0f94a)
+* [Medium Blog Post - Are Long Filter Conditions in Apache Spark Leading to Performance Issues?](https://medium.com/@menishmueli/are-long-filter-conditions-in-apache-spark-leading-to-performance-issues-0b5bc6c0f94a)
 
-* [Optimizing update operations to Apache Iceberg tables using DataFlint OSS](https://medium.com/dev-genius/optimizing-update-operations-to-apache-iceberg-tables-using-dataflint-e4e372e75b8a)
+* [Medium Blog Post - Optimizing update operations to Apache Iceberg tables using DataFlint](https://medium.com/dev-genius/optimizing-update-operations-to-apache-iceberg-tables-using-dataflint-e4e372e75b8a)
 
-* [Did you know that your Apache Spark logs might be leaking PIIs?](https://medium.com/system-weakness/did-you-know-that-your-apache-spark-logs-might-be-leaking-piis-06f2a0e8a82c)
+* [Medium Blog Post - Did you know that your Apache Spark logs might be leaking PIIs?](https://medium.com/system-weakness/did-you-know-that-your-apache-spark-logs-might-be-leaking-piis-06f2a0e8a82c)
 
-* [Cost vs Speed: measuring Apache Spark performance with DataFlint OSS](https://medium.com/@menishmueli/cost-vs-speed-measuring-apache-spark-performance-with-dataflint-c5f909ebe229)
+* [Medium Blog Post - Cost vs Speed: measuring Apache Spark performance with DataFlint](https://medium.com/@menishmueli/cost-vs-speed-measuring-apache-spark-performance-with-dataflint-c5f909ebe229)
 
 
 ## Compatibility Matrix
@@ -136,8 +171,8 @@ DataFlint OSS require spark version 3.2 and up, and supports both scala versions
 | Standalone |||
 | Kubernetes Spark Operator |||
 | EMR |||
-| Dataproc || |
-| HDInsights || |
+| Dataproc || |
+| HDInsights || |
 | Databricks |||
 
 For more information, see [supported versions docs](https://dataflint.gitbook.io/dataflint-for-spark/overview/supported-versions)
