[SPARK-54388][SS] Introduce StatePartitionReader that scan raw bytes for Single ColFamily #53104

zifeif2 · 2025-11-18T01:04:57Z

What changes were proposed in this pull request?

This PR introduces a new StatePartitionReader, StatePartitionReaderAllColumnFamilies, to support offline repartitioning.
StatePartitionReaderAllColumnFamilies is invoked when the user sets the option readAllColumnFamilies to true.

We have the StateDataSource Reader, which allows customers to read the rows in an operator state store using the DataFrame API, just like they read a normal table. But it currently only supports reading one column family in the state store at a time.

This change allows reading all the state rows in all the column families, so that we can repartition them at once. That lets us read the entire state store, repartition the rows, and then save the repartitioned state rows to the cloud. It also improves performance, since we no longer have to read each column family separately. The state is read as of the last committed batch version.

Since each column family can have a different schema, the returned DataFrame exposes the key and value rows as raw bytes, with the following columns:

partition_key (string)
key_bytes (binary)
value_bytes (binary)
column_family_name (string)
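
As a rough illustration of the row shape listed above, here is a minimal plain-Scala sketch with no Spark dependency. The case class `RawStateRow`, the helper `groupByColumnFamily`, and the column family names `default` and `timers` are hypothetical and exist only for this example; they are not the types or names used in the PR. The sketch only shows why carrying `column_family_name` next to the raw bytes is useful: a consumer can split rows per column family before decoding each group with that family's own schema.

```scala
// Hypothetical model of one row returned by the all-column-families reader.
// Field names mirror the columns in the PR description; the Scala types are assumptions.
final case class RawStateRow(
    partitionKey: String,
    keyBytes: Array[Byte],
    valueBytes: Array[Byte],
    columnFamilyName: String)

object RawStateRowExample {
  // Group rows by column family so each group can later be decoded
  // with the schema that belongs to that column family.
  def groupByColumnFamily(rows: Seq[RawStateRow]): Map[String, Seq[RawStateRow]] =
    rows.groupBy(_.columnFamilyName)

  def main(args: Array[String]): Unit = {
    // Made-up sample rows; the byte contents carry no meaning.
    val rows = Seq(
      RawStateRow("0", Array[Byte](1, 2), Array[Byte](10), "default"),
      RawStateRow("0", Array[Byte](3), Array[Byte](20, 21), "timers"),
      RawStateRow("1", Array[Byte](4), Array[Byte](30), "default"))

    groupByColumnFamily(rows).foreach { case (cf, rs) =>
      println(s"$cf -> ${rs.size} row(s)")
    }
  }
}
```

Because the key and value stay opaque byte arrays here, nothing in the sketch depends on a per-family schema, which is the property the description relies on.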

Why are the changes needed?

See above

Does this PR introduce any user-facing change?

No

How was this patch tested?

See the unit tests. They verify the schema and also validate that the data is serialized to bytes correctly by comparing it against the DataFrame returned by the normal state reader.

Was this patch authored or co-authored using generative AI tooling?

Yes. haiku, sonnet.

micheal-o

Did a quick pass. It is in the right direction. Just needs some changes.

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/utils/SchemaUtil.scala

...main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala

...rc/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreConf.scala

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/utils/SchemaUtil.scala

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

...he/spark/sql/execution/datasources/v2/state/StatePartitionReaderAllColumnFamiliesSuite.scala

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

...he/spark/sql/execution/datasources/v2/state/StatePartitionAllColumnFamiliesReaderSuite.scala

common/utils/src/main/resources/error/error-conditions.json

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionErrors.scala

micheal-o · 2025-11-25T06:43:17Z

Also CI is failing due to linter error for your changes. PTAL

micheal-o

Did another round of review. It is almost there. Thanks

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionErrors.scala

...he/spark/sql/execution/datasources/v2/state/StatePartitionAllColumnFamiliesReaderSuite.scala

micheal-o · 2025-11-27T05:06:23Z

@zifeif2 please also fix the CI failure.

micheal-o

Stamped with some minor comments. Mostly looks good now. Thanks

micheal-o · 2025-12-02T01:31:25Z

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala

+        Seq(INTERNAL_ONLY_READ_ALL_COLUMN_FAMILIES, STATE_VAR_NAME))
+    }
+
+    if (internalOnlyReadAllColumnFamilies && joinSide != JoinSideValues.none) {


You forgot to address this comment: https://github.com/apache/spark/pull/53104/files#r2567180585

micheal-o · 2025-12-02T01:31:39Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionErrors.scala

+
+  def unsupportedStateStoreProviderError(
+      checkpointLocation: String,
+      providerClass: String): StateRepartitionUnsupportedProviderError = {


nit: return type StateRepartitionInvalidCheckpointError

micheal-o · 2025-12-02T01:31:49Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionErrors.scala

    subClass = "UNSUPPORTED_OFFSET_SEQ_VERSION",
    messageParameters = Map("version" -> version.toString))
+
+class StateRepartitionUnsupportedProviderError(


still haven't fixed this indentation

github-actions bot added SQL STRUCTURED STREAMING labels Nov 18, 2025

zifeif2 changed the title ~~[WIP] [SPARK-54388][SS] Introduce StatePartitionReader that scan raw bytes for Single ColFamily~~ [SPARK-54388][SS] Introduce StatePartitionReader that scan raw bytes for Single ColFamily Nov 18, 2025

micheal-o reviewed Nov 18, 2025

zifeif2 force-pushed the repartition-reader-single-cf branch from 42540b1 to 0a878e9 Compare November 19, 2025 02:54

github-actions bot removed the STRUCTURED STREAMING label Nov 19, 2025

zifeif2 commented Nov 19, 2025

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/utils/SchemaUtil.scala (outdated)

micheal-o reviewed Nov 21, 2025

zifeif2 force-pushed the repartition-reader-single-cf branch from 932e054 to 279ddf5 Compare November 22, 2025 00:47

github-actions bot added the STRUCTURED STREAMING label Nov 22, 2025

zifeif2 commented Nov 24, 2025

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala (outdated)

zifeif2 commented Nov 24, 2025

...he/spark/sql/execution/datasources/v2/state/StatePartitionAllColumnFamiliesReaderSuite.scala (outdated)

zifeif2 force-pushed the repartition-reader-single-cf branch from c10eed0 to 99e2412 Compare November 24, 2025 20:08

micheal-o reviewed Nov 25, 2025

zifeif2 force-pushed the repartition-reader-single-cf branch from 194364d to b46e8d1 Compare November 26, 2025 08:18

micheal-o reviewed Nov 27, 2025

zifeif2 force-pushed the repartition-reader-single-cf branch from 0f8f7d3 to 392d498 Compare December 1, 2025 17:07

Ubuntu and others added 11 commits December 1, 2025 22:33

scan simple operator state

ac4bd31

add test and support for HDFS

f14e024

remove unused code

0cd1330

address comment

f6e15ed

refactor test

158c846

add more test

2129dcb

address comment

251f306

small changes

fa776e1

fix small issue

63e0753

address commenet

aee5732

get keySchema from stateStoreColFamilySchemaOpt

48521c3

zifeif2 force-pushed the repartition-reader-single-cf branch from 392d498 to 48521c3 Compare December 1, 2025 22:33

micheal-o approved these changes Dec 2, 2025
