@@ -14,10 +14,108 @@ including other versions of pandas.
14 14 Enhancements
15 15 ~~~~~~~~~~~~
16 16
17- .. _whatsnew_300.enhancements.enhancement1:
17+ .. _whatsnew_300.enhancements.string_dtype:
18 18
19- Enhancement1
20- ^^^^^^^^^^^^
19+ Dedicated string data type by default
20+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
21+
22+ Historically, pandas represented string columns with the NumPy ``object`` data type.
23+ This representation has several problems: it is not specific to strings (any
24+ Python object can be stored in an ``object``-dtype array, not just strings) and
25+ it is often inefficient, in terms of both performance and memory usage.
26+
27+ Starting with pandas 3.0, a dedicated string data type is enabled by default.
28+ It is backed by PyArrow if installed, and otherwise falls back to NumPy
29+ ``object``-dtype. As a result, pandas now infers
30+ columns containing string data as the new ``str`` data type when creating pandas
31+ objects, such as in constructors or IO functions.
32+
33+ Old behavior:
34+
35+ .. code-block:: python
36+
37+     >>> pd.Series(["a", "b"])
38+     0    a
39+     1    b
40+     dtype: object
41+
42+ New behavior:
43+
44+ .. code-block:: python
45+
46+     >>> pd.Series(["a", "b"])
47+     0    a
48+     1    b
49+     dtype: str
50+
51+ The string data type that is used in these scenarios will mostly behave as NumPy
52+ ``object`` dtype would, including missing value semantics and general operations on these
53+ columns.
54+
55+ The main characteristics of the new string data type are:
56+
57+ - It is inferred by default for string data (instead of ``object`` dtype).
58+ - The ``str`` dtype can only hold strings (or missing values), in contrast to
59+   ``object`` dtype. Assigning a non-string value with setitem fails.
60+ - The missing value sentinel is always ``NaN`` (``np.nan``) and follows the same
61+   missing value semantics as the other default dtypes.
62+
63+ These intentional changes can have breaking consequences, for example when checking
64+ whether ``.dtype`` is ``object`` dtype or checking for the exact missing value sentinel.
65+ See the :ref:`string_migration_guide` for more details on the behaviour changes
66+ and how to adapt your code to the new default.
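
As a rough illustration of the two pitfalls named above, the following sketch shows checks that do not depend on the exact dtype or missing value sentinel. The variable names are illustrative only, and this is one possible approach rather than the pattern prescribed by the migration guide.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

ser = pd.Series(["a", "b"])

# Fragile: only true while strings are stored as object dtype.
# looks_like_strings = ser.dtype == object

# More portable: is_string_dtype accepts both object dtype and the
# dedicated string dtype when given the dtype itself.
looks_like_strings = is_string_dtype(ser.dtype)

# Detect missing values with isna() instead of comparing against a
# specific sentinel such as None or np.nan.
with_missing = pd.Series(["a", None])
missing_mask = with_missing.isna()

print(looks_like_strings)
print(missing_mask.tolist())
```

Note that ``is_string_dtype`` also returns ``True`` for ``object`` dtype, so it does not by itself prove that every element is a string.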
67+
68+ .. seealso::
69+
70+     `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
71+
72+
73+ .. _whatsnew_300.enhancements.copy_on_write:
74+
75+ Copy-on-Write
76+ ^^^^^^^^^^^^^
77+
78+ The new "copy-on-write" behaviour in pandas 3.0 changes
79+ how pandas operates with respect to copies and views. A summary of the changes:
80+
81+ 1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
82+    including accessing a DataFrame column as a Series) or any method returning a
83+    new DataFrame or Series always *behaves as if* it were a copy in terms of the user
84+    API.
85+ 2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
86+    to do this is to directly modify that object itself.
87+
88+ The main goal of this change is to make the user API more consistent and
89+ predictable. There is now a clear rule: *any* subset or returned
90+ Series/DataFrame **always** behaves as a copy of the original, and thus never
91+ modifies the original (before pandas 3.0, whether a derived object would be a
92+ copy or a view depended on the exact operation performed, which was often
93+ confusing).
94+
95+ Because every single indexing step now behaves as a copy, this also means that
96+ "chained assignment" (updating a DataFrame with multiple setitem steps) will
97+ stop working. Because this now consistently never works, the
98+ ``SettingWithCopyWarning`` is removed.
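
A minimal sketch of how a chained assignment might be rewritten as a single step that modifies the DataFrame directly. The column names and values are made up for illustration, and the chained form is left commented out.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Chained assignment: two separate setitem steps. Under Copy-on-Write the
# intermediate df["a"] behaves as a copy, so df would be left unchanged.
# df["a"][df["b"] > 15] = 0

# Single-step alternative: select rows and column in one .loc call so the
# assignment targets df itself.
df.loc[df["b"] > 15, "a"] = 0

print(df["a"].tolist())
```

The ``.loc`` form updates ``df`` regardless of whether Copy-on-Write is active, which makes it a convenient target when migrating older code.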
99+
100+ The new behavioral semantics are explained in more detail in the
101+ :ref:`user guide about Copy-on-Write <copy_on_write>`.
102+
103+ A secondary goal is to improve performance by avoiding unnecessary copies. As
104+ mentioned above, every new DataFrame or Series returned from an indexing
105+ operation or method *behaves* as a copy, but under the hood pandas will use
106+ views as much as possible, and only copy when needed to guarantee the "behaves
107+ as a copy" behaviour (this is the actual "copy-on-write" mechanism used as an
108+ implementation detail).
109+
110+ Some of the behaviour changes described above are breaking changes in pandas
111+ 3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas
112+ 2.3 to get deprecation warnings for a subset of those changes. The
113+ :ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade
114+ process in more detail.
115+
116+ .. seealso::
117+
118+     `PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write <https://pandas.pydata.org/pdeps/0007-copy-on-write.html>`__
21 119
22 120 .. _whatsnew_300.enhancements.enhancement2:
23 121