[GEN-2381] Pandas handling of nullable cells #1272
base: develop
Conversation
…ts and that the data is stored and retrieved accurately.
```python
column_type = entity.columns[column].column_type
cell_value = matching_row[column].values[0]
if not hasattr(row, column) or cell_value != getattr(row, column):
```
Do you know why this part is needed even after using `.convert_dtypes()`? Dan mentioned that the results uploaded to Synapse Tables were OK after she added `convert_dtypes()` in her investigation. Is this due to `itertuples` coercing values into native types, so that we need the explicit handling of NAs below?
`convert_dtypes` doesn't handle arrays of values, and it was running into issues when checking for differences between arrays that contained null values.
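A minimal sketch of the behavior being described (the column names here are hypothetical, purely for illustration): `DataFrame.convert_dtypes()` converts scalar columns to pandas nullable dtypes, but a column holding Python lists stays dtype `object`, so NA values inside the lists are untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "scalar": [1.0, np.nan],          # converted to a nullable dtype
        "array": [[1, 2], [1, np.nan]],   # stays object; np.nan survives inside
    }
)
converted = df.convert_dtypes()

print(converted.dtypes)
# The "scalar" column becomes a nullable extension dtype, while the
# "array" column is still plain object — convert_dtypes never looks
# inside the lists, so the NaN in [1, np.nan] is not replaced by pd.NA.
print(converted["array"].dtype)  # object
```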
```diff
     values = DataFrame(values).convert_dtypes()
 elif isinstance(values, str):
-    values = csv_to_pandas_df(filepath=values, **kwargs)
+    values = csv_to_pandas_df(filepath=values, **kwargs).convert_dtypes()
```
Is the plan to also add this everywhere, e.g. in the query function?
Yes, I think it should be added there. @danlu1, if you wanted, you could take the changes from here.
I can take a closer look at the code. I think either adding `.convert_dtypes` here or adding it right after `read_csv` in `csv_to_pandas_df` would work.
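As a rough sketch of the second option, here is what calling `convert_dtypes()` right after `read_csv` could look like (the wrapper name `csv_to_nullable_df` is hypothetical; the repo's actual helper is `csv_to_pandas_df`). The point is that an empty CSV cell, which `read_csv` loads as `np.nan`, becomes `pd.NA` in a nullable column:

```python
import io
import pandas as pd

def csv_to_nullable_df(filepath_or_buffer, **kwargs) -> pd.DataFrame:
    """Read a CSV, then convert columns to pandas nullable dtypes."""
    return pd.read_csv(filepath_or_buffer, **kwargs).convert_dtypes()

csv_text = "id,score\n1,10\n2,\n"   # second row has an empty "score" cell
df = csv_to_nullable_df(io.StringIO(csv_text))

print(df.dtypes)                    # columns use nullable extension dtypes
print(df["score"].isna().tolist())  # [False, True] — empty cell is pd.NA
```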
For the upsert-rows part, is the scope of this limited to when you use a DataFrame to directly upsert rows to the Synapse Table? Would upserting from a CSV be unaffected?
When upserting from a CSV we still need to read it into a DataFrame in order to do the comparison that finds the cells of data which changed.
```python
except (TypeError, ValueError):
    # If comparison fails, assume they differ
    values_differ = True
```
Updating the last screenshot I took from ChatGPT, based on my testing with pandas 2.3.1 and numpy 2.3.1 / numpy 2.1.3:

There is no TypeError, and comparing pd.NA with pd.NA or np.nan outputs `<NA>`.
Also, `convert_dtypes` converts an empty cell to `<NA>` (which indicates pd.NA).
However, if pd.NA is inside a np.array, the comparison errors out with `TypeError: boolean value of NA is ambiguous`. Example:
```python
>>> cell_value = np.array([1, 2, pd.NA])
>>> row_value = np.array([1, 2, pd.NA])
>>> cell_value != row_value
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
```
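This TypeError is what the guarded comparison in the diff is protecting against. A standalone sketch of that pattern (the helper name `values_differ` is illustrative, not the PR's actual function): wrap the element-wise comparison and, if pd.NA makes it raise, conservatively treat the values as differing.

```python
import numpy as np
import pandas as pd

def values_differ(cell_value, row_value) -> bool:
    """Return True if the two array-like cell values appear to differ."""
    try:
        return bool(np.any(cell_value != row_value))
    except (TypeError, ValueError):
        # pd.NA in an object array makes the comparison ambiguous;
        # assume the values differ rather than silently skipping the row.
        return True

print(values_differ(np.array([1, 2, 3]), np.array([1, 2, 3])))          # False
print(values_differ(np.array([1, 2, pd.NA]), np.array([1, 2, pd.NA])))  # True
```

Note the asymmetry this creates: two identical arrays that both contain pd.NA are reported as differing, which is the "assume they differ" fallback rather than a true comparison.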
```python
# Convert items to int
df[col] = df[col].apply(
    lambda x: (
        [int(item) for item in x] if isinstance(x, list) else x
```
This might not work if the list has NAs. I ran into `ValueError: cannot convert float NaN to integer` when the NA is np.nan, and `TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType'` when the NA is pd.NA.
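One possible NA-aware variant (a sketch only; `to_int_list` is a hypothetical name, not code from this PR): skip items that are missing instead of calling `int()` on np.nan or pd.NA.

```python
import numpy as np
import pandas as pd

def to_int_list(x):
    """Convert list items to int, passing NA values through unchanged."""
    if not isinstance(x, list):
        return x
    # pd.isna handles both np.nan and pd.NA, avoiding the
    # ValueError / TypeError raised by int() on missing values.
    return [item if pd.isna(item) else int(item) for item in x]

print(to_int_list([1.0, np.nan, 3.0]))  # [1, nan, 3]
print(to_int_list([1.0, pd.NA]))        # [1, <NA>]
print(to_int_list("not a list"))        # not a list (returned unchanged)
```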
```python
# Convert items to bool
df[col] = df[col].apply(
    lambda x: (
        [bool(item) for item in x] if isinstance(x, list) else x
```
`bool(np.nan)` outputs `True`, and `bool(pd.NA)` errors out with `TypeError: boolean value of NA is ambiguous`.

Problem:

Solution:
Use `convert_dtypes` and the `dtype` argument when reading in a CSV to a pandas DF.

Testing: