[SPARK-54410][SQL] Fix read support for the variant logical type annotation #53120
Conversation
cashmand left a comment
Thanks for making this PR! Just a few small questions and suggestions.
```scala
val PARQUET_IGNORE_VARIANT_ANNOTATION =
  buildConf("spark.sql.parquet.ignoreVariantAnnotation")
    .doc("When true, ignore the variant logical type annotation and treat the Parquet " +
```
Should we mark this conf as .internal()? I think the main use case is to simplify debugging issues with the raw variant bytes, but let me know if there's a reason for this conf that I'm missing. Assuming my understanding is right, maybe we can also mention the intended use case in the doc comment.
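For illustration, a minimal sketch of what the suggested change could look like (the doc wording about the use case is hypothetical):

```scala
val PARQUET_IGNORE_VARIANT_ANNOTATION =
  buildConf("spark.sql.parquet.ignoreVariantAnnotation")
    .internal()
    .doc("When true, ignore the variant logical type annotation and treat the Parquet " +
      "column in the same way as the underlying struct type. Intended for debugging, " +
      "e.g. extracting the raw variant value/metadata bytes.")
    .version("4.1.0")
    .booleanConf
    .createWithDefault(false)
```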
+1
```diff
   throw QueryCompilationErrors.invalidVariantWrongNumFieldsError()
 }
-val valueAndMetadata = Seq("value", "metadata").map { colName =>
+val Seq(v, m) = Seq("value", "metadata").map { colName =>
```
Suggested change:

```diff
-val Seq(v, m) = Seq("value", "metadata").map { colName =>
+val Seq(value, metadata) = Seq("value", "metadata").map { colName =>
```
dongjoon-hyun left a comment
This should target Apache Spark 4.2.0, @harshmotw-db. Please fix the config version: `.version("4.1.0")`.
Hi @dongjoon-hyun, this is the last piece of variant type support in 4.1. We have been using the Parquet variant logical type when writing Spark variant to Parquet (already in branch-4.1), and we should also support reading it back.
To @cloud-fan, I'm not sure this is the last piece. However, if you want, I'd like to ask you to revise the PR title as a kind of bug fix.
```scala
buildConf("spark.sql.parquet.ignoreVariantAnnotation")
  .doc("When true, ignore the variant logical type annotation and treat the Parquet " +
    "column in the same way as the underlying struct type")
  .version("4.1.0")
```
If this is a bug fix, this should be 4.0.2, @harshmotw-db and @cloud-fan.
Not sure if the parquet version we use in Spark 4.0 has the variant logical type. I'll leave it to @harshmotw-db
> Not sure if the parquet version we use in Spark 4.0 has the variant logical type. I'll leave it to @harshmotw-db

Thanks. We can continue our discussion if we are not sure. AFAIK, it means there is no regression in Apache Spark 4.1.0 from Apache Spark 4.0.0.
For the record, if this is an improvement, it should be 4.2.0 according to the Apache Spark community policy, @harshmotw-db and @cloud-fan.
Given that Spark 4.1 has upgraded the parquet version, which includes the variant logical type, I think 4.1 should support reading parquet files with native variant type fields?
IIUC, we can say that it's still simply an unsupported feature, like variant was in Apache Spark 4.0.0. It's too late if this is an improvement, @cloud-fan.
This PR is practically a fix already. The earlier PR added a temporary workaround for reading variant data, mainly for testing purposes (see this line). Essentially, the existing code behaves as if `ignoreVariantAnnotation = false`. This PR just implements this more formally, so we actually make sure that the target type matches the actual parquet type.
Why don't you revise the PR title so that it literally reads as a fix, @harshmotw-db?
Also, the `ParquetRowConverter` fix is essential: currently, when `VARIANT_ALLOW_READING_SHREDDED = false`, the reader is broken when the parquet schema is `struct<metadata, value>` instead of `struct<value, metadata>`.
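A rough sketch of the order-independent idea (a hypothetical helper, not the exact PR code): resolve the variant subfields by name rather than by position.

```scala
import scala.jdk.CollectionConverters._
import org.apache.parquet.schema.{GroupType, Type}

// Hypothetical sketch: look up the variant subfields by name so that both
// struct<value, metadata> and struct<metadata, value> layouts are handled.
def variantFields(group: GroupType): (Type, Type) = {
  val byName = group.getFields.asScala.map(f => f.getName -> f).toMap
  // Throws if either field is missing; real code would raise a proper error.
  (byName("value"), byName("metadata"))
}
```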
Sure, in practice it is a fix. I need to head out for an hour, and I will change the PR title after that.
I removed my review request from this PR.
| "column in the same way as the underlying struct type") | ||
| .version("4.1.0") | ||
| .booleanConf | ||
| .createWithDefault(false) |
When should this be true?
It's mainly for debugging purposes, if we need to extract the raw variant bytes by specifying the schema as, say, `struct<value: Binary, metadata: Binary>`.
Well, for that purpose, let's remove this configuration. You can use `logDebug` instead.
Correct me if I'm wrong, but I don't think `logDebug` would be helpful here if we want to extract variant columns into a custom schema in a Spark DataFrame. This config is a good tool to debug issues in a Parquet file.
May I ask why you think that? You told me that it's mainly for debugging purposes, right?

> Correct me if I'm wrong, but I don't think `logDebug` would be helpful here if we want to extract variant columns into a custom schema in a Spark DataFrame. This config is a good tool to debug issues in a Parquet file.
I have added a new test, variant logical type annotation - ignore variant annotation, to demonstrate this point.

If the `ignoreVariantAnnotation` config is enabled, you can read a parquet file with an underlying variant column into a struct-of-binaries schema. So for a variant column `v`, you could run

`spark.read.format("parquet").schema("v struct<value: BINARY, metadata: BINARY>").load(...)`

and it would load the value and metadata columns into these fields, even though the data is logically not a struct of two binaries but a variant. People could use this to debug the physical variant values.

If the config is disabled, which is the default, this read would give an error and you would need to read variant columns into a variant schema.
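For concreteness, a minimal sketch of the debugging flow described above (the path and column name are illustrative):

```scala
// Enable the escape hatch, then read the raw variant bytes of column `v`
// as a plain struct of binaries.
spark.conf.set("spark.sql.parquet.ignoreVariantAnnotation", "true")
val raw = spark.read
  .format("parquet")
  .schema("v struct<value: BINARY, metadata: BINARY>")
  .load("/tmp/variant_data")  // illustrative path
raw.selectExpr("v.value", "v.metadata").show()
```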
```scala
    convertInternal(groupColumn, None)
  case v: VariantLogicalTypeAnnotation if v.getSpecVersion == 1 =>
    if (ignoreVariantAnnotation) {
      convertInternal(groupColumn)
```
I don't understand why we need to maintain this logic purely for debugging purposes.
```diff
-/** Parquet converter for unshredded Variant */
+// Parquet converter for unshredded Variant.
+@deprecated("We use this converter when the `spark.sql.variant.allowReadingShredded` config " +
```
This is not a public API, and we don't need to use the deprecated annotation. We can just put it as a normal code comment.
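i.e., something like this sketch (the tail of the message is truncated in the diff above, so it is elided here too):

```scala
// Parquet converter for unshredded Variant. We use this converter when the
// `spark.sql.variant.allowReadingShredded` config ...
```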
Thank you, @harshmotw-db. Could you resolve @cloud-fan's remaining comments, #53120 (comment) and #53120 (comment)?
@dongjoon-hyun @cloud-fan Thanks for reviewing! I have addressed your comments.
dongjoon-hyun left a comment
+1, LGTM. Thank you, @harshmotw-db and @cloud-fan .
Merged to master/4.1 for Apache Spark 4.1.0.
[SPARK-54410][SQL] Fix read support for the variant logical type annotation

Closes #53120 from harshmotw-db/harshmotw-db/variant_annotation_write.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit da7389b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
PR #53005 introduced a fix where the Spark parquet writer would annotate variant columns with the parquet variant logical type. That PR had an ad-hoc fix on the reader side for validation. This PR formally allows Spark to read parquet files with the Variant logical type.
It also introduces an unrelated fix in ParquetRowConverter so that Spark can read variant columns regardless of the order in which the value and metadata fields are stored.
Why are the changes needed?
The variant logical type annotation has formally been adopted as part of the parquet spec and is part of the parquet-java 1.16.0 library. Therefore, Spark should be able to read files containing data annotated as such.
Does this PR introduce any user-facing change?
Yes, it allows users to read parquet files with the variant logical type annotation.
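For example, a minimal sketch (the path and field name are illustrative), assuming a file whose `v` column carries the variant logical type annotation:

```scala
// The annotated column now reads back as a Spark VARIANT column.
val df = spark.read.parquet("/tmp/variant_annotated")  // illustrative path
df.printSchema()  // v: variant (nullable = true)
df.selectExpr("variant_get(v, '$.field', 'string')").show()  // `$.field` is hypothetical
```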
How was this patch tested?
Existing test from PR #53005, where we wrote data with the variant logical type and tested reads using an ad-hoc solution.
Was this patch authored or co-authored using generative AI tooling?
No