Conversation

@BryanFauble (Member):

  • Verified that the system correctly handles null values during upserts and that the data is stored and retrieved accurately.

Problem:

  • Null values were not handled consistently during upserts: comparisons between NA values (None, pd.NA, np.nan) behave differently depending on dtype, so cell comparisons could fail or report spurious differences.

Solution:

  • Use the suggestion from @danlu1 to use both convert_dtypes and the dtype argument when reading a CSV into a pandas DataFrame (see the sketch below)
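A minimal sketch of that suggestion, assuming a hypothetical file and column names (not the actual Genie code): explicit dtypes pin known columns at read time, and convert_dtypes() upgrades the rest to nullable pandas dtypes so empty cells come in as pd.NA.

import pandas as pd

# "data.csv" and "sample_id" are hypothetical; dtype pins columns with
# known types, convert_dtypes() makes the remaining columns nullable.
df = pd.read_csv(
    "data.csv",
    dtype={"sample_id": "string"},
).convert_dtypes()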

Testing:

  • More testing within Genie is needed

BryanFauble requested review from @danlu1 and @rxu17 on November 6, 2025 at 18:44.
column_type = entity.columns[column].column_type
cell_value = matching_row[column].values[0]
if not hasattr(row, column) or cell_value != getattr(row, column):

Contributor:

Do you know why this part is needed even after using .convert_dtypes()? Dan mentioned that the Synapse Table upload results were OK after adding convert_dtypes() in her investigation. Is this due to itertuples coercing values into native pandas dtypes, so that we need the explicit handling of NAs below?

BryanFauble (Member, Author):

convert_dtypes doesn't handle arrays of values, and the comparison was running into issues when checking for differences between arrays containing null values.
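A small illustration of that point (hypothetical data, not from the PR): convert_dtypes() upgrades scalar columns to nullable dtypes, but a column holding Python lists stays object dtype, so NAs inside the lists are untouched.

import pandas as pd

# The scalar column becomes Int64, but the list column stays object,
# so pd.NA values inside the lists are not normalized.
df = pd.DataFrame({"a": [1, None], "b": [[1, 2], [3, pd.NA]]}).convert_dtypes()
print(df.dtypes)  # a -> Int64, b -> object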

    values = DataFrame(values).convert_dtypes()
elif isinstance(values, str):
-   values = csv_to_pandas_df(filepath=values, **kwargs)
+   values = csv_to_pandas_df(filepath=values, **kwargs).convert_dtypes()
Contributor:

Is the plan to also add this everywhere, e.g. in the query function?

BryanFauble (Member, Author):

Yes, I think it should be added there as well. @danlu1, feel free to take the changes from here if you'd like.

Contributor:

I can take a closer look at the code. I think either adding .convert_dtypes here or right after read_csv in csv_to_pandas_df would work.

Contributor:

Is the scope of this, for the upsert rows part, limited to when you want to use a DataFrame to directly upsert rows to the Synapse Table? Would upserting from a CSV be unaffected?

BryanFauble (Member, Author):

When trying to upsert from a CSV, we need to read it into a DataFrame in order to do the comparison that finds the cells of data which changed, so both paths are affected.
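A hedged sketch of that flow (function and variable names are hypothetical, not the actual client code): read the CSV with nullable dtypes, locate the matching row, and compare cell by cell with the try/except guard shown in the diff below.

import pandas as pd

def changed_columns(csv_path: str, row: dict) -> list:
    df = pd.read_csv(csv_path).convert_dtypes()
    matching_row = df.iloc[0]  # assume the matching row was already located
    changed = []
    for column, cell_value in matching_row.items():
        try:
            # bool() forces evaluation; a pd.NA result raises TypeError here
            values_differ = bool(cell_value != row.get(column))
        except (TypeError, ValueError):
            # If comparison fails, assume they differ
            values_differ = True
        if values_differ:
            changed.append(column)
    return changed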

except (TypeError, ValueError):
    # If comparison fails, assume they differ
    values_differ = True

danlu1 (Contributor), Nov 7, 2025:

These new edits are awesome. This fixes the issue that comparing two NAs in Python depends heavily on their datatypes: None, pd.NA, np.nan. I found the screenshot below helpful.
[Screenshot: NA comparison behavior, 2025-11-06]

danlu1 (Contributor), Nov 10, 2025:

Updating the last screenshot, which I had taken from ChatGPT, based on my own testing on pandas 2.3.1 with numpy 2.3.1 and numpy 2.1.3:
[Screenshot: updated NA comparison results, 2025-11-10]

There is no TypeError, and the comparison outputs <NA> when comparing pd.NA with pd.NA or with np.nan.
Also, convert_dtypes converts empty cells to <NA> (which indicates pd.NA).
If pd.NA is in a np.array, the comparison errors out with TypeError: boolean value of NA is ambiguous. Example:

>>> cell_value = np.array([1, 2, pd.NA])
>>> row_value = np.array([1, 2, pd.NA])
>>> cell_value != row_value
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
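As a side note, a hedged sketch (not the PR's approach): for a pure equality check, pandas' Series.equals treats NAs in matching positions as equal, which sidesteps the ambiguous truth value entirely.

>>> import pandas as pd
>>> pd.Series([1, 2, pd.NA]).equals(pd.Series([1, 2, pd.NA]))
True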


# Convert items to int
df[col] = df[col].apply(
    lambda x: (
        [int(item) for item in x] if isinstance(x, list) else x
    )
)
Contributor:

This might not work if the list has NAs. I ran into ValueError: cannot convert float NaN to integer when the NA is np.nan, and TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType' when the NA is pd.NA.
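One possible fix, as a hedged sketch (df and col are placeholders, not the actual code): guard each item with pd.isna, which recognizes both np.nan and pd.NA, and pass NA items through unchanged.

import pandas as pd

df = pd.DataFrame({"col": [[1.0, 2.0, pd.NA], None]})  # hypothetical data
col = "col"

# Convert items to int, skipping NAs instead of calling int() on them
df[col] = df[col].apply(
    lambda x: (
        [item if pd.isna(item) else int(item) for item in x]
        if isinstance(x, list)
        else x
    )
)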

# Convert items to bool
df[col] = df[col].apply(
    lambda x: (
        [bool(item) for item in x] if isinstance(x, list) else x
    )
)
Contributor:

bool(np.nan) outputs True, and bool(pd.NA) errors out with TypeError: boolean value of NA is ambiguous.
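The same pd.isna guard would work here, again as a sketch with placeholder names:

import pandas as pd

df = pd.DataFrame({"col": [[1, 0, pd.NA], None]})  # hypothetical data
col = "col"

# Convert items to bool, skipping NAs; bool(np.nan) would silently give True
# and bool(pd.NA) would raise, so NA items pass through unchanged
df[col] = df[col].apply(
    lambda x: (
        [item if pd.isna(item) else bool(item) for item in x]
        if isinstance(x, list)
        else x
    )
)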
