[SPARK-54397][PYTHON] Make UserDefinedType hashable
#53113
Conversation
|
also cc @gaogaotiantian |
|
Apologies about the comment above. I misunderstood the code. It looks a bit shady, but if the user uses it as we expect, it should work properly. I was thinking about the actual objects, when the UDT is only created by ... However, I still have some concerns (the code itself still has a smell). Is it possible that the user creates different UDT instances with the same type? For example:
```python
class VectorType(UserDefinedType):
    def __init__(self, dimension):
        self.dimension = dimension

    def __repr__(self):
        return f"Vector({self.dimension})"

type1 = VectorType(2)  # a 2-d vector
type2 = VectorType(3)  # a 3-d vector
assert type1 == type2                # True
assert hash(type1) == hash(type2)    # False
```
Is this valid code that users can write? Also, is it possible for the user to create their own container type like ...? |
|
@gaogaotiantian That's a good point.
```python
assert type1 == type2  # True
```
I expect this case to be `False`. E.g., the builtin type `DecimalType`:
```python
>>> from pyspark.sql.types import DecimalType
>>> DecimalType(10, 0) == DecimalType(20, 0)
False
```
We may need to revisit this. |
|
I think the general direction should be making `__eq__` and `__hash__` consistent. |
|
I was trying to cache the converters
@gaogaotiantian sounds good, another solution is to just remove the `__eq__` method in UDT, then it should use `DataType.__eq__` (see `spark/python/pyspark/sql/types.py`, lines 115 to 125 at 9380e91).
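For reference, the `DataType` methods referenced above look roughly like this (a paraphrase for illustration, not the verbatim source at that commit):
```python
from typing import Any

class DataType:
    """Simplified paraphrase of pyspark.sql.types.DataType for illustration."""

    def __repr__(self) -> str:
        return self.__class__.__name__ + "()"

    def __hash__(self) -> int:
        # The hash is derived from the string form of the type.
        return hash(str(self))

    def __eq__(self, other: Any) -> bool:
        # Equal when the classes match and all instance attributes match.
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other: Any) -> bool:
        return not self.__eq__(other)
```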
|
|
@zhengruifeng using `__dict__` for comparison does not always work, because `__slots__` keeps attributes out of `__dict__`. For example:
```python
class Vector(UserDefinedType):
    __slots__ = ["dimension"]

    def __init__(self, dimension):
        super(Vector, self).__init__()
        self.dimension = dimension

type1 = Vector(3)
type2 = Vector(2)
assert type1.__dict__ == type2.__dict__  # This will be falsely True
```
This is a valid (even encouraged) usage of Python classes. I know it might not be common, but our code will have issues. I think the safest way is to force users to write their own `__eq__` and `__hash__`. |
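As an illustration of writing your own `__eq__` and `__hash__`, a hypothetical UDT subclass could define the pair explicitly so that equal parameters imply equal hashes (a sketch, not code from this PR; `MyVectorUDT` and its field are made up):
```python
from pyspark.sql.types import ArrayType, DoubleType, UserDefinedType

class MyVectorUDT(UserDefinedType):
    """Hypothetical UDT whose equality and hash both depend on `dimension`."""

    def __init__(self, dimension: int = 0):
        self.dimension = dimension

    @classmethod
    def sqlType(cls) -> ArrayType:
        return ArrayType(DoubleType(), False)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, MyVectorUDT) and self.dimension == other.dimension

    def __hash__(self) -> int:
        # Keep the contract: objects that compare equal must hash equally.
        return hash((type(self), self.dimension))

assert MyVectorUDT(3) == MyVectorUDT(3)
assert hash(MyVectorUDT(3)) == hash(MyVectorUDT(3))
assert MyVectorUDT(3) != MyVectorUDT(2)
```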
@gaogaotiantian how does this affect existing UDTs written by users? |
|
So, the only case where it will work is when the two objects are the same object:
```python
type1 = Vector(3)
type2 = type1
assert type2 == type1
```
We can even enforce this by raising an error. However, it's possible that these proposals are a bit too strict and we want to provide some heuristics to the users. Or we just have to keep some backward compatibility for UDT, falling back to the default implementation. |
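A minimal sketch of the identity-only behavior described above, assuming we simply fall back to `object` semantics (`StrictUDT` is a made-up name, not part of this PR):
```python
from pyspark.sql.types import UserDefinedType

class StrictUDT(UserDefinedType):
    """Sketch: only the very same object compares (and hashes) equal."""

    def __eq__(self, other: object) -> bool:
        return self is other

    # Identity-based hash, consistent with the identity-based __eq__ above.
    __hash__ = object.__hash__

udt = StrictUDT()
alias = udt
assert alias == udt              # same object
assert hash(alias) == hash(udt)
assert StrictUDT() != udt        # a freshly built instance is never equal
```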
|
I agree that we don't have a strict `__eq__` contract here. |
|
```python
type1 = Vector(3)
type2 = Vector(3)
assert type1 == type2
```
|
|
I also like a default impl of `is`; it supports the use case of caching. |
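For context on the caching use case, hashable type instances can be used directly as dictionary keys; below is a simplified sketch (the cache and the `build_converter` helper are made-up names, not Spark APIs):
```python
from typing import Any, Callable, Dict

from pyspark.sql.types import DataType

# Hypothetical cache keyed by type instances; this only works if the types are hashable.
_converter_cache: Dict[DataType, Callable[[Any], Any]] = {}

def build_converter(dt: DataType) -> Callable[[Any], Any]:
    # Placeholder: a real converter would dispatch on the concrete data type.
    return lambda value: value

def get_converter(dt: DataType) -> Callable[[Any], Any]:
    """Return a cached converter for `dt`, building it on first use."""
    converter = _converter_cache.get(dt)
    if converter is None:
        converter = build_converter(dt)
        _converter_cache[dt] = converter
    return converter
```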
@gaogaotiantian We can modify the builtin UDTs (VectorUDT/MatrixUDT/etc) but I am not sure whether existing UDTs written by users would be affected. |
|
Yeah for now falling back to the default implementation is not too bad. We can change it when users have issues. It should cover most of the use cases and the users can always implement their own if they are not happy with it. |
|
cc @HyukjinKwon and @ueshin would you mind taking another look? thanks |
dongjoon-hyun left a comment
+1, LGTM. Merged to master for Apache Spark 4.2.0.
Thank you, @zhengruifeng and all.
### What changes were proposed in this pull request?
Fix the hashability of `UserDefinedType`
### Why are the changes needed?
UDT is not hashable, e.g.
```
In [11]: from pyspark.testing.objects import ExamplePointUDT
In [12]: e = ExamplePointUDT()
In [13]: {e: 0}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[13], line 1
----> 1 {e: 0}
TypeError: unhashable type: 'ExamplePointUDT'
In [14]: from pyspark.ml.linalg import VectorUDT
In [15]: v = VectorUDT()
In [16]: {v: 1}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[16], line 1
----> 1 {v: 1}
TypeError: unhashable type: 'VectorUDT'
```
see https://docs.python.org/3/reference/datamodel.html#object.__hash__
> If a class does not define an __eq__() method it should not define a __hash__() operation either; if it defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections.
> A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None. When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.abc.Hashable).
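For illustration only, the core of such a fix is to give the class an explicit `__hash__` again once `__eq__` is overridden, for example by reusing the parent's hash. This is a hedged sketch of the idea (`HashableTypeSketch` is a made-up class), not necessarily the exact change in this PR:
```python
from pyspark.sql.types import DataType

class HashableTypeSketch(DataType):
    """Toy type: overriding __eq__ would otherwise set __hash__ to None."""

    def __eq__(self, other: object) -> bool:
        return type(self) is type(other)

    # Restore hashability explicitly; without this line, instances of this
    # class would raise "unhashable type" when used as dict keys.
    __hash__ = DataType.__hash__

t = HashableTypeSketch()
assert {t: 0}[t] == 0  # usable as a dict key again
```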
### Does this PR introduce _any_ user-facing change?
yes, `hash(udt)` will work after this fix
### How was this patch tested?
added tests
### Was this patch authored or co-authored using generative AI tooling?
no
Closes apache#53113 from zhengruifeng/type_hashable.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>