Skip to content

BUG: ExcelFile.parse sets sheet.max_row and sheet.max_column to None #63010

@behrenhoff

Description

@behrenhoff

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# /// script
# dependencies = [
#     "openpyxl",
#     "pandas",
# ]
# ///
import pandas as pd

# Assuming "some_file.xlsx" exists and contains a sheet "Sheet1" with
# some data, for example cell A1="A Header" and cell A2=1234

with pd.ExcelFile("some_file.xlsx") as excel_file:
    ws = excel_file.book["Sheet1"]

    dim_before = (ws.max_row, ws.max_column)
    print(f"Dimensions of Sheet1: {dim_before=}")

    df = excel_file.parse(sheet_name=ws.title)

    dim_after = (ws.max_row, ws.max_column)

    print(f"Dimensions of Sheet1: {dim_after=}")

assert dim_before == dim_after  # assert fails!

# Expect same values
# But we get dim_before=(2, 1) or whatever is in the file
# and dim_after=(None, None) - unexpectedly dimensions are now None

Issue Description

When using Pandas to read tables from Excel worksheets, the parse function messes with the internal dimension variables - it sets max_row and max_column to None.

This is caused by pandas/io/excel/_openpyxl.py lines 610 + 611 (the last 2 in the following snippet):

    def get_sheet_data(
        self, sheet, file_rows_needed: int | None = None
    ) -> list[list[Scalar]]:
        if self.book.read_only:
            sheet.reset_dimensions()

The reset_dimensions call resets the values.

Why is this call done, is it necessary? On read-only sheets only?! Can those two lines be removed? Unfortunately there is no comment as to why they are needed. Removing them fixes the issue for me - but would that break otherwise? In any case, I do not expect a "read" operation to change max_row and max_column.

The problem is that I need to read multiple tables from different sheets. So I usually first try to find the position of the tables in the file using openpyxl code on the excel_file.book, then I read the table using excel_file.parse with corresponding parameters (in particular skiprows and skipfooter) And afterwards I still read more stuff from the same sheet. [Excel files from customers can be a big PITA, customers like to put multiple tables one one sheet or add hidden sheets, hidden rows&cols - that would almost be a feature request: read_excel(..., read_hidden_cols=False)]

It came as a surprise that the dimensions are lost after reading.

This is an issue for openpyxl / xlsx files only: when reading an old-style xls file with xlrd instead, the sheet dimension can be found via book.sheet_by_name("Sheet1").nrows and ncols - and they do not change after reading.

Expected Behavior

Dimensions should be left alone.

Installed Versions

Limited to relevant parts

INSTALLED VERSIONS ------------------ python : 3.12.3 pandas : 2.3.3 openpyxl : 3.1.5

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions