[SPARK-54397][PYTHON] Make UserDefinedType hashable
#53113
Conversation
|
also cc @gaogaotiantian |
|
Apologies about the comment above. I misunderstood the code. It looks a bit shady, but if the user uses it as we expect, it should work properly. I was thinking about the actual objects, when the UDT is only created by ... However, I still have some concerns (the code itself still has a smell). Is it possible that the user creates different UDT instances with the same type? For example:
```python
class VectorType(UserDefinedType):
    def __init__(self, dimension):
        self.dimension = dimension

    def __repr__(self):
        return f"Vector({self.dimension})"

type1 = VectorType(2)  # a 2-d vector
type2 = VectorType(3)  # a 3-d vector
assert type1 == type2                # True
assert hash(type1) == hash(type2)    # False
```
Is this valid code that users can write? Also, is it possible for the user to create their own container type like ...? |
|
@gaogaotiantian That's a good point.
```python
assert type1 == type2  # True
```
I expect this case to be `False`. E.g., the builtin type `DecimalType`:
```python
>>> from pyspark.sql.types import DecimalType
>>> DecimalType(10, 0) == DecimalType(20, 0)
False
```
We may need to revisit this. |
|
I think the general direction should be making `__eq__` and `__hash__` consistent. |
|
I was trying to cache the converters
@gaogaotiantian sounds good, another solution is to just remove the `__eq__` method in UDT, then it should use `DataType.__eq__` (see `spark/python/pyspark/sql/types.py`, lines 115 to 125 at 9380e91).
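For reference, the `DataType` methods referenced above look roughly like this (a paraphrase for illustration, not the verbatim source at that commit):
```python
from typing import Any

class DataType:
    """Simplified paraphrase of pyspark.sql.types.DataType for illustration."""

    def __repr__(self) -> str:
        return self.__class__.__name__ + "()"

    def __hash__(self) -> int:
        # The hash is derived from the string form of the type.
        return hash(str(self))

    def __eq__(self, other: Any) -> bool:
        # Equal when the classes match and all instance attributes match.
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other: Any) -> bool:
        return not self.__eq__(other)
```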
|
|
@zhengruifeng using `__dict__` for comparison does not always work, because `__slots__` keeps attributes out of `__dict__`. For example:
```python
class Vector(UserDefinedType):
    __slots__ = ["dimension"]

    def __init__(self, dimension):
        super(Vector, self).__init__()
        self.dimension = dimension

type1 = Vector(3)
type2 = Vector(2)
assert type1.__dict__ == type2.__dict__  # This will be falsely True
```
This is a valid (even encouraged) usage of Python classes. I know it might not be common, but our code will have issues. I think the safest way is to force users to write their own `__eq__` and `__hash__`. |
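As an illustration of writing your own `__eq__` and `__hash__`, a hypothetical UDT subclass could define the pair explicitly so that equal parameters imply equal hashes (a sketch, not code from this PR; `MyVectorUDT` and its field are made up):
```python
from pyspark.sql.types import ArrayType, DoubleType, UserDefinedType

class MyVectorUDT(UserDefinedType):
    """Hypothetical UDT whose equality and hash both depend on `dimension`."""

    def __init__(self, dimension: int = 0):
        self.dimension = dimension

    @classmethod
    def sqlType(cls) -> ArrayType:
        return ArrayType(DoubleType(), False)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, MyVectorUDT) and self.dimension == other.dimension

    def __hash__(self) -> int:
        # Keep the contract: objects that compare equal must hash equally.
        return hash((type(self), self.dimension))

assert MyVectorUDT(3) == MyVectorUDT(3)
assert hash(MyVectorUDT(3)) == hash(MyVectorUDT(3))
assert MyVectorUDT(3) != MyVectorUDT(2)
```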
@gaogaotiantian how does this affect existing UDTs written by users? |
|
So, the only case where it will work is when the two objects are the same object:
```python
type1 = Vector(3)
type2 = type1
assert type2 == type1
```
We can even enforce this by raising an error. However, it's possible that these proposals are a bit too strict and we want to provide some heuristics to the users. Or we just have to keep some backward compatibility for UDT, falling back to the default implementation. |
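A minimal sketch of the identity-only behavior described above, assuming we simply fall back to `object` semantics (`StrictUDT` is a made-up name, not part of this PR):
```python
from pyspark.sql.types import UserDefinedType

class StrictUDT(UserDefinedType):
    """Sketch: only the very same object compares (and hashes) equal."""

    def __eq__(self, other: object) -> bool:
        return self is other

    # Identity-based hash, consistent with the identity-based __eq__ above.
    __hash__ = object.__hash__

udt = StrictUDT()
alias = udt
assert alias == udt              # same object
assert hash(alias) == hash(udt)
assert StrictUDT() != udt        # a freshly built instance is never equal
```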
|
I agree that we don't have a strict `__eq__` contract here. |
|
```python
type1 = Vector(3)
type2 = Vector(3)
assert type1 == type2
```
|
|
I also like a default impl of `is`; it supports the use case of caching. |
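For context on the caching use case, hashable type instances can be used directly as dictionary keys; below is a simplified sketch (the cache and the `build_converter` helper are made-up names, not Spark APIs):
```python
from typing import Any, Callable, Dict

from pyspark.sql.types import DataType

# Hypothetical cache keyed by type instances; this only works if the types are hashable.
_converter_cache: Dict[DataType, Callable[[Any], Any]] = {}

def build_converter(dt: DataType) -> Callable[[Any], Any]:
    # Placeholder: a real converter would dispatch on the concrete data type.
    return lambda value: value

def get_converter(dt: DataType) -> Callable[[Any], Any]:
    """Return a cached converter for `dt`, building it on first use."""
    converter = _converter_cache.get(dt)
    if converter is None:
        converter = build_converter(dt)
        _converter_cache[dt] = converter
    return converter
```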
@gaogaotiantian We can modify the builtin UDTs (VectorUDT/MatrixUDT/etc) but I am not sure whether existing UDTs written by users would be affected. |
|
Yeah for now falling back to the default implementation is not too bad. We can change it when users have issues. It should cover most of the use cases and the users can always implement their own if they are not happy with it. |
|
cc @HyukjinKwon and @ueshin would you mind taking another look? thanks |
dongjoon-hyun left a comment
+1, LGTM. Merged to master for Apache Spark 4.2.0.
Thank you, @zhengruifeng and all.
### What changes were proposed in this pull request?
Fix the hashability of `UserDefinedType`
### Why are the changes needed?
UDT is not hashable, e.g.
```
In [11]: from pyspark.testing.objects import ExamplePointUDT
In [12]: e = ExamplePointUDT()
In [13]: {e: 0}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[13], line 1
----> 1 {e: 0}
TypeError: unhashable type: 'ExamplePointUDT'
In [14]: from pyspark.ml.linalg import VectorUDT
In [15]: v = VectorUDT()
In [16]: {v: 1}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[16], line 1
----> 1 {v: 1}
TypeError: unhashable type: 'VectorUDT'
```
see https://docs.python.org/3/reference/datamodel.html#object.__hash__
> If a class does not define an __eq__() method it should not define a __hash__() operation either; if it defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections.
> A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None. When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.abc.Hashable).
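For illustration only, the core of such a fix is to give the class an explicit `__hash__` again once `__eq__` is overridden, for example by reusing the parent's hash. This is a hedged sketch of the idea (`HashableTypeSketch` is a made-up class), not necessarily the exact change in this PR:
```python
from pyspark.sql.types import DataType

class HashableTypeSketch(DataType):
    """Toy type: overriding __eq__ would otherwise set __hash__ to None."""

    def __eq__(self, other: object) -> bool:
        return type(self) is type(other)

    # Restore hashability explicitly; without this line, instances of this
    # class would raise "unhashable type" when used as dict keys.
    __hash__ = DataType.__hash__

t = HashableTypeSketch()
assert {t: 0}[t] == 0  # usable as a dict key again
```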
### Does this PR introduce _any_ user-facing change?
yes, `hash(udt)` will work after this fix
### How was this patch tested?
added tests
### Was this patch authored or co-authored using generative AI tooling?
no
Closes apache#53113 from zhengruifeng/type_hashable.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>