Merge branch 'pandas-dev:main' into sort-api-ref-in-alpha-order-2

DoNguyenHung · web-flow · commit 40bbdd34dca0 · 2025-08-19T16:47:56.000-04:00
diff --git a/.github/workflows/wheels.yml b/.github/workflows/wheels.yml
@@ -189,7 +189,7 @@ jobs:
         # installing wheel here because micromamba step was skipped
         if: matrix.buildplat[1] == 'win_arm64'
         shell: bash -el {0}
-        run: python -m pip install wheel
+        run: python -m pip install wheel anaconda-client
 
       - name: Validate wheel RECORD
         shell: bash -el {0}
diff --git a/README.md b/README.md
@@ -19,9 +19,9 @@
 **pandas** is a Python package that provides fast, flexible, and expressive data
 structures designed to make working with "relational" or "labeled" data both
 easy and intuitive. It aims to be the fundamental high-level building block for
-doing practical, **real world** data analysis in Python. Additionally, it has
-the broader goal of becoming **the most powerful and flexible open source data
-analysis / manipulation tool available in any language**. It is already well on
+doing practical, **real-world** data analysis in Python. Additionally, it has
+the broader goal of becoming **the most powerful and flexible open-source data
+analysis/manipulation tool available in any language**. It is already well on
 its way towards this goal.
 
 ## Table of Contents
@@ -64,7 +64,7 @@ Here are just a few of the things that pandas does well:
     data sets
   - [**Hierarchical**][mi] labeling of axes (possible to have multiple
     labels per tick)
-  - Robust IO tools for loading data from [**flat files**][flat-files]
+  - Robust I/O tools for loading data from [**flat files**][flat-files]
     (CSV and delimited), [**Excel files**][excel], [**databases**][db],
     and saving/loading data from the ultrafast [**HDF5 format**][hdfstore]
   - [**Time series**][timeseries]-specific functionality: date range
@@ -138,7 +138,7 @@ or for installing in [development mode](https://pip.pypa.io/en/latest/cli/pip_in
 
 
 ```sh
-python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true
+python -m pip install -ve . --no-build-isolation --config-settings editable-verbose=true
 ```
 
 See the full instructions for [installing from source](https://pandas.pydata.org/docs/dev/development/contributing_environment.html).
@@ -155,7 +155,7 @@ has been under active development since then.
 
 ## Getting Help
 
-For usage questions, the best place to go to is [StackOverflow](https://stackoverflow.com/questions/tagged/pandas).
+For usage questions, the best place to go is [Stack Overflow](https://stackoverflow.com/questions/tagged/pandas).
 Further, general questions and discussions can also take place on the [pydata mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata).
 
 ## Discussion and Development
diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst
@@ -14,10 +14,108 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~
 
-.. _whatsnew_300.enhancements.enhancement1:
+.. _whatsnew_300.enhancements.string_dtype:
 
-Enhancement1
-^^^^^^^^^^^^
+Dedicated string data type by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Historically, pandas represented string columns with NumPy ``object`` data type.
+This representation has numerous problems: it is not specific to strings (any
+Python object can be stored in an ``object``-dtype array, not just strings) and
+it is often not very efficient (in terms of both performance and memory usage).
+
+Starting with pandas 3.0, a dedicated string data type is enabled by default
+(backed by PyArrow under the hood, if installed, otherwise falling back to being
+backed by NumPy ``object``-dtype). This means that pandas will start inferring
+columns containing string data as the new ``str`` data type when creating pandas
+objects, such as in constructors or I/O functions.
+
+Old behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: object
+
+New behavior:
+
+.. code-block:: python
+
+    >>> pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: str
+
+The string data type used in these scenarios will mostly behave as NumPy
+``object`` dtype would, including missing value semantics and general operations
+on these columns.
+
+The main characteristics of the new string data type:
+
+- Inferred by default for string data (instead of object dtype)
+- The ``str`` dtype can only hold strings (or missing values), in contrast to
+  ``object`` dtype. Setting a non-string value fails.
+- The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
+  missing value semantics as the other default dtypes.
+
+These intentional changes can have breaking consequences, for example when checking
+whether the ``.dtype`` is object dtype or checking the exact missing value sentinel.
+See the :ref:`string_migration_guide` for more details on the behaviour changes
+and how to adapt your code to the new default.
+
+.. seealso::
+
+    `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
+
+
+.. _whatsnew_300.enhancements.copy_on_write:
+
+Copy-on-Write
+^^^^^^^^^^^^^
+
+The new "copy-on-write" behaviour in pandas 3.0 changes how pandas operates
+with respect to copies and views. A summary of the changes:
+
+1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
+   i.e. including accessing a DataFrame column as a Series) or any method returning a
+   new DataFrame or Series, always *behaves as if* it were a copy in terms of user
+   API.
+2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
+   to do this is to directly modify that object itself.
+
+The main goal of this change is to make the user API more consistent and
+predictable. There is now a clear rule: *any* subset or returned
+Series/DataFrame **always** behaves as a copy of the original, and thus never
+modifies the original (before pandas 3.0, whether a derived object would be a
+copy or a view depended on the exact operation performed, which was often
+confusing).
+
+Because every single indexing step now behaves as a copy, "chained assignment"
+(updating a DataFrame with multiple setitem steps) will stop working. Since
+chained assignment now consistently never works, the
+``SettingWithCopyWarning`` is removed.
+
+The new behavioral semantics are explained in more detail in the
+:ref:`user guide about Copy-on-Write <copy_on_write>`.
+
+A secondary goal is to improve performance by avoiding unnecessary copies. As
+mentioned above, every new DataFrame or Series returned from an indexing
+operation or method *behaves* as a copy, but under the hood pandas will use
+views as much as possible, and only copy when needed to guarantee the "behaves
+as a copy" behaviour (this is the actual "copy-on-write" mechanism used as an
+implementation detail).
+
+Some of the behaviour changes described above are breaking changes in pandas
+3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
+2.3 to get deprecation warnings for a subset of those changes. The
+:ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
+process in more detail.
+
+.. seealso::
+
+    `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__
 
 .. _whatsnew_300.enhancements.enhancement2:
 
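Note on the two whatsnew entries above: both warn that existing code can break, in particular exact `object`-dtype checks, exact missing-value sentinel checks, and chained assignment. The sketch below shows patterns that should give the same result before and after the new defaults. The variable names are illustrative, and it is an assumption on my part (to be confirmed against the linked migration guides) that `pandas.api.types.is_string_dtype`, `Series.isna`, and a single `.loc` assignment are the preferred replacements.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# String dtype: avoid `ser.dtype == object`, which depends on the default dtype.
ser = pd.Series(["a", "b"])
holds_strings = is_string_dtype(ser)

# Missing values: avoid testing for one specific sentinel (None vs. NaN).
with_missing = pd.Series(["a", None])
missing_mask = with_missing.isna().tolist()

# Copy-on-Write: replace chained assignment such as
#     df[df["a"] > 1]["b"] = 0
# with a single setitem step on the object being modified.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.loc[df["a"] > 1, "b"] = 0
updated_b = df["b"].tolist()
```

The single `.loc` call modifies `df` directly, which is the "directly modify that object itself" rule from the Copy-on-Write summary.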
diff --git a/pandas/_config/config.py b/pandas/_config/config.py
@@ -693,8 +693,8 @@ def _get_registered_option(key: str):
 
 def _translate_key(key: str) -> str:
     """
-    if key id deprecated and a replacement key defined, will return the
-    replacement key, otherwise returns `key` as - is
+    If `key` is deprecated and a replacement key is defined, return the
+    replacement key; otherwise return `key` as-is.
     """
     d = _get_deprecated_option(key)
     if d:
diff --git a/pandas/_version.py b/pandas/_version.py
@@ -581,7 +581,7 @@ def render_git_describe(pieces):
 def render_git_describe_long(pieces):
     """TAG-DISTANCE-gHEX[-dirty].
 
-    Like 'git describe --tags --dirty --always -long'.
+    Like 'git describe --tags --dirty --always --long'.
     The distance/hash is unconditional.
 
     Exceptions:
diff --git a/pandas/core/accessor.py b/pandas/core/accessor.py
@@ -88,7 +88,7 @@ def _add_delegate_accessors(
         cls
             Class to add the methods/properties to.
         delegate
-            Class to get methods/properties and doc-strings.
+            Class to get methods/properties and docstrings.
         accessors : list of str
             List of accessors to add.
         typ : {'property', 'method'}
@@ -159,7 +159,7 @@ def delegate_names(
     Parameters
     ----------
     delegate : object
-        The class to get methods/properties & doc-strings.
+        The class to get methods/properties & docstrings.
     accessors : Sequence[str]
         List of accessor to add.
     typ : {'property', 'method'}
diff --git a/pandas/core/arrays/boolean.py b/pandas/core/arrays/boolean.py
@@ -378,7 +378,7 @@ def _logical_method(self, other, op):  # type: ignore[override]
         elif is_list_like(other):
             other = np.asarray(other, dtype="bool")
             if other.ndim > 1:
-                raise NotImplementedError("can only perform ops with 1-d structures")
+                return NotImplemented
             other, mask = coerce_to_array(other, copy=False)
         elif isinstance(other, np.bool_):
             other = other.item()
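Note on the hunk above: switching from `raise NotImplementedError(...)` to `return NotImplemented` changes who gets to answer. Returning `NotImplemented` tells Python's binary-operator protocol that this operand declines, so Python tries the reflected method on the other operand and raises `TypeError` only if that declines as well. A minimal standalone sketch of that protocol, using hypothetical classes rather than pandas code:

```python
class Declines:
    """Hypothetical left operand that cannot handle the other operand."""

    def __and__(self, other):
        # Decline instead of raising, so Python can try other.__rand__.
        return NotImplemented


class Handles:
    """Hypothetical right operand that knows how to combine with anything."""

    def __rand__(self, other):
        return "handled by reflected operand"


# Declines.__and__ declines, so Python falls back to Handles.__rand__.
result = Declines() & Handles()

# If no operand can handle the operation, Python raises TypeError itself.
error = None
try:
    Declines() & object()
except TypeError as exc:
    error = exc
```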
diff --git a/pandas/core/base.py b/pandas/core/base.py
@@ -90,7 +90,7 @@
 
 class PandasObject(DirNamesMixin):
     """
-    Baseclass for various pandas objects.
+    Base class for various pandas objects.
     """
 
     # results from calls to methods decorated with cache_readonly get added to _cache
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
@@ -10216,6 +10216,7 @@ def shift(
         suffix : str, optional
             If str and periods is an iterable, this is added after the column
             name and before the shift value for each shifted column name.
+            For `Series` this parameter is unused and defaults to `None`.
 
         Returns
         -------
diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py
@@ -1926,7 +1926,7 @@ def _setitem_with_indexer(self, indexer, value, name: str = "iloc") -> None:
                     labels = index.insert(len(index), key)
 
                     # We are expanding the Series/DataFrame values to match
-                    #  the length of thenew index `labels`.  GH#40096 ensure
+                    #  the length of the new index `labels`.  GH#40096 ensure
                     #  this is valid even if the index has duplicates.
                     taker = np.arange(len(index) + 1, dtype=np.intp)
                     taker[-1] = -1
diff --git a/pandas/io/api.py b/pandas/io/api.py
@@ -1,5 +1,5 @@
 """
-Data IO api
+Data I/O API
 """
 
 from pandas.io.clipboards import read_clipboard
diff --git a/pandas/io/common.py b/pandas/io/common.py
@@ -1,4 +1,4 @@
-"""Common IO api utilities"""
+"""Common I/O API utilities"""
 
 from __future__ import annotations
 
diff --git a/pandas/io/formats/style_render.py b/pandas/io/formats/style_render.py
@@ -6,6 +6,7 @@
     Sequence,
 )
 from functools import partial
+import pathlib
 import re
 from typing import (
     TYPE_CHECKING,
@@ -70,7 +71,9 @@ class StylerRenderer:
     Base class to process rendering a Styler with a specified jinja2 template.
     """
 
-    loader = jinja2.PackageLoader("pandas", "io/formats/templates")
+    this_dir = pathlib.Path(__file__).parent.resolve()
+    template_dir = this_dir / "templates"
+    loader = jinja2.FileSystemLoader(template_dir)
     env = jinja2.Environment(loader=loader, trim_blocks=True)
     template_html = env.get_template("html.tpl")
     template_html_table = env.get_template("html_table.tpl")
diff --git a/pandas/io/parquet.py b/pandas/io/parquet.py
@@ -464,8 +464,12 @@ def to_parquet(
 
         .. versionadded:: 2.1.0
 
-    kwargs
-        Additional keyword arguments passed to the engine.
+    **kwargs
+        Additional keyword arguments passed to the engine:
+
+        * For ``engine="pyarrow"``: passed to :func:`pyarrow.parquet.write_table`
+          or :func:`pyarrow.parquet.write_to_dataset` (when using ``partition_cols``)
+        * For ``engine="fastparquet"``: passed to :func:`fastparquet.write`
 
     Returns
     -------
@@ -585,7 +589,11 @@ def read_parquet(
         .. versionadded:: 3.0.0
 
     **kwargs
-        Any additional kwargs are passed to the engine.
+        Additional keyword arguments passed to the engine:
+
+        * For ``engine="pyarrow"``: passed to :func:`pyarrow.parquet.read_table`
+        * For ``engine="fastparquet"``: passed to
+          :meth:`fastparquet.ParquetFile.to_pandas`
 
     Returns
     -------
diff --git a/pandas/tests/arithmetic/test_numeric.py b/pandas/tests/arithmetic/test_numeric.py
@@ -862,6 +862,19 @@ def test_modulo_zero_int(self):
             expected = Series([np.nan, 0.0])
             tm.assert_series_equal(result, expected)
 
+    def test_non_1d_ea_raises_notimplementederror(self):
+        # GH#61866
+        ea_array = array([1, 2, 3, 4, 5], dtype="Int64").reshape(5, 1)
+        np_array = np.array([1, 2, 3, 4, 5], dtype=np.int64).reshape(5, 1)
+
+        msg = "can only perform ops with 1-d structures"
+
+        with pytest.raises(NotImplementedError, match=msg):
+            ea_array * np_array
+
+        with pytest.raises(NotImplementedError, match=msg):
+            np_array * ea_array
+
 
 class TestAdditionSubtraction:
     # __add__, __sub__, __radd__, __rsub__, __iadd__, __isub__
diff --git a/pandas/tests/io/formats/style/test_html.py b/pandas/tests/io/formats/style/test_html.py
@@ -1,3 +1,4 @@
+import pathlib
 from textwrap import (
     dedent,
     indent,
@@ -18,7 +19,9 @@
 
 @pytest.fixture
 def env():
-    loader = jinja2.PackageLoader("pandas", "io/formats/templates")
+    project_dir = pathlib.Path(__file__).parent.parent.parent.parent.parent.resolve()
+    template_dir = project_dir / "io" / "formats" / "templates"
+    loader = jinja2.FileSystemLoader(template_dir)
     env = jinja2.Environment(loader=loader, trim_blocks=True)
     return env
 
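Note on the fixture above: the five chained `.parent` calls are easy to miscount. The sketch below checks the arithmetic on a pure path (no filesystem access, so `resolve()` is left out) and shows that `parents[4]` names the same directory. The relative path is taken from the diff header; using `parents[4]` is a suggestion, not what the commit does.

```python
from pathlib import PurePosixPath

# Location of the test module relative to the repository root.
test_file = PurePosixPath("pandas/tests/io/formats/style/test_html.py")

# What the fixture does: walk up five levels (style, formats, io, tests, pandas).
chained = test_file.parent.parent.parent.parent.parent

# Equivalent and easier to audit: index into the sequence of ancestors.
indexed = test_file.parents[4]

# Both end at the package directory, from which the templates are located.
template_dir = indexed / "io" / "formats" / "templates"
```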
diff --git a/pandas/tests/tseries/holiday/test_holiday.py b/pandas/tests/tseries/holiday/test_holiday.py
@@ -340,7 +340,7 @@ class TestHolidayCalendar(AbstractHolidayCalendar):
     tm.assert_index_equal(date_interval_high, expected_results)
 
 
-def test_holidays_with_timezone_specified_but_no_occurences():
+def test_holidays_with_timezone_specified_but_no_occurrences():
     # GH 54580
     # _apply_rule() in holiday.py was silently dropping timezones if you passed it
     # an empty list of holiday dates that had timezone information
diff --git a/pandas/tests/tslibs/test_parsing.py b/pandas/tests/tslibs/test_parsing.py
@@ -402,7 +402,8 @@ def test_hypothesis_delimited_date(
         request.applymarker(
             pytest.mark.xfail(
                 reason="parse_datetime_string cannot reliably tell whether "
-                "e.g. %m.%Y is a float or a date"
+                "e.g. %m.%Y is a float or a date",
+                strict=False,
             )
         )
     date_string = test_datetime.strftime(date_format.replace(" ", delimiter))
diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -2,7 +2,7 @@
 
 - Created: 23 December 2022
 - Status: Implemented
-- Discussion: [#39584](https://github.com/pandas-dev/pandas/pull/50402)
+- Discussion: [#50424](https://github.com/pandas-dev/pandas/pull/50424)
 - Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche))
 - Revision: 1