-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# /// script
# dependencies = [
# "openpyxl",
# "pandas",
# ]
# ///
import pandas as pd
# Assuming "some_file.xlsx" exists and contains a sheet "Sheet1" with
# some data, for example cell A1="A Header" and cell A2=1234
with pd.ExcelFile("some_file.xlsx") as excel_file:
ws = excel_file.book["Sheet1"]
dim_before = (ws.max_row, ws.max_column)
print(f"Dimensions of Sheet1: {dim_before=}")
df = excel_file.parse(sheet_name=ws.title)
dim_after = (ws.max_row, ws.max_column)
print(f"Dimensions of Sheet1: {dim_after=}")
assert dim_before == dim_after # assert fails!
# Expect same values
# But we get dim_before=(2, 1) or whatever is in the file
# and dim_after=(None, None) - unexpectedly dimensions are now NoneIssue Description
When using Pandas to read tables from Excel worksheets, the parse function messes with the internal dimension variables - it sets max_row and max_column to None.
This is caused by pandas/io/excel/_openpyxl.py lines 610 + 611 (the last 2 in the following snippet):
def get_sheet_data(
self, sheet, file_rows_needed: int | None = None
) -> list[list[Scalar]]:
if self.book.read_only:
sheet.reset_dimensions()
The reset_dimensions call resets the values.
Why is this call done, is it necessary? On read-only sheets only?! Can those two lines be removed? Unfortunately there is no comment as to why they are needed. Removing them fixes the issue for me - but would that break otherwise? In any case, I do not expect a "read" operation to change max_row and max_column.
The problem is that I need to read multiple tables from different sheets. So I usually first try to find the position of the tables in the file using openpyxl code on the excel_file.book, then I read the table using excel_file.parse with corresponding parameters (in particular skiprows and skipfooter) And afterwards I still read more stuff from the same sheet. [Excel files from customers can be a big PITA, customers like to put multiple tables one one sheet or add hidden sheets, hidden rows&cols - that would almost be a feature request: read_excel(..., read_hidden_cols=False)]
It came as a surprise that the dimensions are lost after reading.
This is an issue for openpyxl / xlsx files only: when reading an old-style xls file with xlrd instead, the sheet dimension can be found via book.sheet_by_name("Sheet1").nrows and ncols - and they do not change after reading.
Expected Behavior
Dimensions should be left alone.
Installed Versions
Limited to relevant parts