[GEN-2381] Pandas handling of nullable cells #1272
base: develop
Conversation
…ts and that the data is stored and retrieved accurately.
```python
column_type = entity.columns[column].column_type
cell_value = matching_row[column].values[0]
if not hasattr(row, column) or cell_value != getattr(row, column):
```
Do you know why this part is needed even after using `.convert_dtypes()`? Dan mentioned that the results uploaded to Synapse Tables were OK after she added `convert_dtypes()` in her investigation. Is this due to `itertuples` coercing values into native types, so that we need the explicit handling of NAs below?
`convert_dtypes` doesn't handle arrays of values, and it was running into issues when checking for differences between arrays that contained null values.
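A minimal sketch of the behavior being described (the column names here are hypothetical, purely for illustration): `DataFrame.convert_dtypes()` converts scalar columns to pandas nullable dtypes, but a column holding Python lists stays dtype `object`, so NA values inside the lists are untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "scalar": [1.0, np.nan],          # converted to a nullable dtype
        "array": [[1, 2], [1, np.nan]],   # stays object; np.nan survives inside
    }
)
converted = df.convert_dtypes()

print(converted.dtypes)
# The "scalar" column becomes a nullable extension dtype, while the
# "array" column is still plain object — convert_dtypes never looks
# inside the lists, so the NaN in [1, np.nan] is not replaced by pd.NA.
print(converted["array"].dtype)  # object
```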
```diff
     values = DataFrame(values).convert_dtypes()
 elif isinstance(values, str):
-    values = csv_to_pandas_df(filepath=values, **kwargs)
+    values = csv_to_pandas_df(filepath=values, **kwargs).convert_dtypes()
```
Is the plan to also add this everywhere, e.g. in the query function?
Yes, I think it should be added there. @danlu1, if you wanted, you could take the changes from here.
I can take a closer look at the code. I think either adding `.convert_dtypes` here or adding it right after `read_csv` in `csv_to_pandas_df` would work.
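As a rough sketch of the second option, here is what calling `convert_dtypes()` right after `read_csv` could look like (the wrapper name `csv_to_nullable_df` is hypothetical; the repo's actual helper is `csv_to_pandas_df`). The point is that an empty CSV cell, which `read_csv` loads as `np.nan`, becomes `pd.NA` in a nullable column:

```python
import io
import pandas as pd

def csv_to_nullable_df(filepath_or_buffer, **kwargs) -> pd.DataFrame:
    """Read a CSV, then convert columns to pandas nullable dtypes."""
    return pd.read_csv(filepath_or_buffer, **kwargs).convert_dtypes()

csv_text = "id,score\n1,10\n2,\n"   # second row has an empty "score" cell
df = csv_to_nullable_df(io.StringIO(csv_text))

print(df.dtypes)                    # columns use nullable extension dtypes
print(df["score"].isna().tolist())  # [False, True] — empty cell is pd.NA
```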
For the upsert-rows part, is the scope of this limited to when you use a DataFrame to directly upsert rows to the Synapse Table? Would upserting from a CSV be unaffected?
When upserting from a CSV we still need to read it into a DataFrame in order to do the comparison that finds the cells of data which changed.
```python
except (TypeError, ValueError):
    # If comparison fails, assume they differ
    values_differ = True
```
Updating the last screenshot I took from ChatGPT, based on my testing with pandas 2.3.1 and numpy 2.3.1 / numpy 2.1.3:

There is no TypeError, and comparing pd.NA with pd.NA or np.nan outputs `<NA>`.
Also, `convert_dtypes` converts an empty cell to `<NA>` (which indicates pd.NA).
However, if pd.NA is inside a np.array, the comparison errors out with `TypeError: boolean value of NA is ambiguous`. Example:
```python
>>> cell_value = np.array([1, 2, pd.NA])
>>> row_value = np.array([1, 2, pd.NA])
>>> cell_value != row_value
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
```
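This TypeError is what the guarded comparison in the diff is protecting against. A standalone sketch of that pattern (the helper name `values_differ` is illustrative, not the PR's actual function): wrap the element-wise comparison and, if pd.NA makes it raise, conservatively treat the values as differing.

```python
import numpy as np
import pandas as pd

def values_differ(cell_value, row_value) -> bool:
    """Return True if the two array-like cell values appear to differ."""
    try:
        return bool(np.any(cell_value != row_value))
    except (TypeError, ValueError):
        # pd.NA in an object array makes the comparison ambiguous;
        # assume the values differ rather than silently skipping the row.
        return True

print(values_differ(np.array([1, 2, 3]), np.array([1, 2, 3])))          # False
print(values_differ(np.array([1, 2, pd.NA]), np.array([1, 2, pd.NA])))  # True
```

Note the asymmetry this creates: two identical arrays that both contain pd.NA are reported as differing, which is the "assume they differ" fallback rather than a true comparison.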
```python
# Convert items to int
df[col] = df[col].apply(
    lambda x: (
        [int(item) for item in x] if isinstance(x, list) else x
```
This might not work if the list has NAs. I ran into `ValueError: cannot convert float NaN to integer` when the NA is np.nan, and `TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType'` when the NA is pd.NA.
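One possible NA-aware variant (a sketch only; `to_int_list` is a hypothetical name, not code from this PR): skip items that are missing instead of calling `int()` on np.nan or pd.NA.

```python
import numpy as np
import pandas as pd

def to_int_list(x):
    """Convert list items to int, passing NA values through unchanged."""
    if not isinstance(x, list):
        return x
    # pd.isna handles both np.nan and pd.NA, avoiding the
    # ValueError / TypeError raised by int() on missing values.
    return [item if pd.isna(item) else int(item) for item in x]

print(to_int_list([1.0, np.nan, 3.0]))  # [1, nan, 3]
print(to_int_list([1.0, pd.NA]))        # [1, <NA>]
print(to_int_list("not a list"))        # not a list (returned unchanged)
```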
```python
# Convert items to bool
df[col] = df[col].apply(
    lambda x: (
        [bool(item) for item in x] if isinstance(x, list) else x
```
`bool(np.nan)` outputs `True`, and `bool(pd.NA)` errors out with `TypeError: boolean value of NA is ambiguous`.

Problem:

Solution:
Use `convert_dtypes` and the `dtype` argument when reading in a CSV to a pandas DF.

Testing: