@@ -51,53 +51,117 @@ def get_dummies(
5151 """
5252 Convert categorical variable into dummy/indicator variables.
5353
54- Each variable is converted in as many 0/1 variables as there are different
55- values. Columns in the output are each named after a value; if the input is
56- a DataFrame, the name of the original variable is prepended to the value.
54+ This function converts categorical data into binary (0/1) data, also known as
55+ one-hot encoding or dummy variables. It's commonly used in statistical modeling
56+ and machine learning. For more details, see the :ref:`reshaping` section in
57+ the user guide.
5758
5859 Parameters
5960 ----------
6061 data : array-like, Series, or DataFrame
61- Data of which to get dummy indicators .
62+ Data to encode. If a DataFrame, can work on multiple columns .
6263 prefix : str, list of str, or dict of str, default None
63- A string to be prepended to DataFrame column names.
64- Pass a list with length equal to the number of columns
65- when calling get_dummies on a DataFrame. Alternatively, `prefix`
66- can be a dictionary mapping column names to prefixes.
64+ String to prepend to column names.
65+ * If a string, the same prefix is used for all columns
66+ * If a list, it should have the same length as the number of columns
67+ * If a dict, maps column names to prefixes
68+ * If None, no prefix is used
6769 prefix_sep : str, list of str, or dict of str, default '_'
68- Should you choose to prepend DataFrame column names with a prefix, this
69- is the separator/delimiter to use between the two. Alternatively,
70- `prefix_sep` can be a list with length equal to the number of columns,
71- or a dictionary mapping column names to separators.
70+ Separator between prefix and dummy column names.
71+ * If a string, the same separator is used for all columns
72+ * If a list, should have same length as number of columns
73+ * If a dict, maps column names to separators
7274 dummy_na : bool, default False
73- If True, a NaN indicator column will be added even if no NaN values are present.
74- If False, NA values are encoded as all zero.
75+ Add a column to indicate NaN values:
76+ * If True, creates NA column even if no NaN values present
77+ * If False, NA values are encoded as all zeros
7578 columns : list-like, default None
76- Column names in the DataFrame to be encoded.
77- If `columns` is None then all the columns with
78- `object`, `string`, or `category` dtype will be converted.
79+ Which columns to encode:
80+ * If None, encodes all object, string, and category columns
81+ * If list-like, encodes only specified columns
7982 sparse : bool, default False
80- Whether the dummy-encoded columns should be backed by
81- a :class:`SparseArray` (True) or a regular NumPy array (False).
83+ If True, return SparseArray (save memory for many zeros)
84+ If False, return regular NumPy array
8285 drop_first : bool, default False
83- Whether to get k-1 dummies out of k categorical levels by removing the
84- first level.
86+ Whether to drop the first category level:
87+ * If True, drops first level (avoid collinearity in models)
88+ * If False, keeps all levels
8589 dtype : dtype, default bool
86- Data type for new columns. Only a single dtype is allowed .
90+ Data type for dummy columns. Must be a single dtype.
8791
8892 Returns
8993 -------
9094 DataFrame
91- Dummy-coded data. If `data` contains other columns than the
92- dummy-coded one(s), these will be prepended, unaltered, to the result.
95+ Dummy-coded data:
96+ * Each categorical level becomes a new column of 1s and 0s
97+ * Original non-encoded columns are included unchanged
98+ * Each new column name combines the original column name,
99+ prefix_sep, and the encoded level
93100
94101 See Also
95102 --------
96- Series.str.get_dummies : Convert Series of strings to dummy codes.
97- :func:`~pandas.from_dummies` : Convert dummy codes to categorical ``DataFrame``.
103+ Series.str.get_dummies : Convert string Series to dummy codes.
104+ from_dummies : Convert dummy codes back to categorical DataFrame.
105+ DataFrame.astype : Convert dtypes of DataFrame columns.
106+ pandas.Categorical : Represent categorical data.
98107
99108 Notes
100109 -----
110+ * Dummy variables are commonly used in statistical models and machine
111+ learning to convert categorical data into a format suitable for
112+ numerical processing.
113+ * The `drop_first` option is useful for avoiding the "dummy variable trap"
114+ where perfectly correlated dummy variables can cause problems in
115+ regression models.
116+ * For sparse matrices, using `sparse=True` can significantly reduce
117+ memory usage when data has many zeros.
118+
119+ Examples
120+ --------
121+ Basic usage for a single column:
122+
123+ >>> s = pd.Series(list('abca'))
124+ >>> pd.get_dummies(s)
125+ a b c
126+ 0 1 0 0
127+ 1 0 1 0
128+ 2 0 0 1
129+ 3 1 0 0
130+
131+ With NaN values:
132+
133+ >>> s = pd.Series(list('abcaa'))
134+ >>> s.loc[3] = np.nan
135+ >>> pd.get_dummies(s, dummy_na=True)
136+ a b c NaN
137+ 0 1 0 0 0
138+ 1 0 1 0 0
139+ 2 0 0 1 0
140+ 3 0 0 0 1
141+ 4 1 0 0 0
142+
143+ With `drop_first=True`:
144+
145+ >>> pd.get_dummies(s, drop_first=True)
146+ b c
147+ 0 0 0
148+ 1 1 0
149+ 2 0 1
150+ 3 0 0
151+ 4 0 0
152+
153+ With DataFrame input and custom prefixes:
154+
155+ >>> df = pd.DataFrame({
156+ ... 'A': ['a', 'b', 'a'],
157+ ... 'B': ['b', 'a', 'c'],
158+ ... 'C': [1, 2, 3]
159+ ... })
160+ >>> pd.get_dummies(df, prefix=['col1', 'col2'])
161+ C col1_a col1_b col2_a col2_b col2_c
162+ 0 1 1 0 0 1 0
163+ 1 2 0 1 1 0 0
164+ 2 3 1 0 0 0 1
101165 Reference :ref:`the user guide <reshaping.dummies>` for more examples.
102166
103167 Examples
@@ -372,48 +436,108 @@ def from_dummies(
372436 """
373437 Create a categorical ``DataFrame`` from a ``DataFrame`` of dummy variables.
374438
375- Inverts the operation performed by :func:`~pandas.get_dummies`.
439+ This function converts dummy/indicator variables (typically 0s and 1s) back into
440+ categorical variables, essentially inverting :func:`~pandas.get_dummies`. For
441+ more details, see the :ref:`reshaping` section in the user guide.
376442
377443 .. versionadded:: 1.5.0
378444
379445 Parameters
380446 ----------
381447 data : DataFrame
382- Data which contains dummy-coded variables in form of integer columns of
383- 1's and 0's .
448+ DataFrame containing dummy-coded variables (columns of 0s and 1s).
449+ Each group of dummy columns represents one original categorical variable .
384450 sep : str, default None
385- Separator used in the column names of the dummy categories they are
386- character indicating the separation of the categorical names from the prefixes.
387- For example, if your column names are 'prefix_A' and 'prefix_B',
388- you can strip the underscore by specifying sep='_' .
451+ Separator used in the dummy column names between the prefix and category.
452+ For example, if columns are 'color_red', 'color_blue', use sep='_' to
453+ identify 'color' as the original variable name and ['red', 'blue'] as
454+ the categories .
389455 default_category : None, Hashable or dict of Hashables, default None
390- The default category is the implied category when a value has none of the
391- listed categories specified with a one, i.e. if all dummies in a row are
392- zero. Can be a single value for all variables or a dict directly mapping
393- the default categories to a prefix of a variable. The default category
394- will be coerced to the dtype of ``data.columns`` if such coercion is
395- lossless, and will raise otherwise.
456+ Category to use for rows where all dummy values are 0:
457+ * If None, raises error when a row has all zeros
458+ * If Hashable, uses this value for all variables
459+ * If dict, maps each variable prefix to its default category
460+ Value will be coerced to match column dtype if possible.
396461
397462 Returns
398463 -------
399464 DataFrame
400- Categorical data decoded from the dummy input-data.
465+ A DataFrame with categorical columns decoded from dummy variables.
466+ Each group of dummy columns is converted back to a single
467+ categorical column.
468+
469+ See Also
470+ --------
471+ get_dummies : Convert categorical variable(s) to dummy variables.
472+ Categorical : Pandas Categorical type for categorical data.
473+ DataFrame.astype : Convert DataFrame columns to different types.
474+
475+ Notes
476+ -----
477+ * The function assumes each group of dummy columns represents one
478+ original categorical variable
479+ * Column names must follow the pattern: prefix + sep + category
480+ * Each row should have at most one 1 in each group of dummies
481+ (unless using default_category)
482+ * NA values are not allowed in the dummy columns
483+
484+ Examples
485+ --------
486+ Convert dummy columns back to a single categorical column:
487+
488+ >>> df = pd.DataFrame({
489+ ... 'color_red': [1, 0, 0],
490+ ... 'color_blue': [0, 1, 0],
491+ ... 'color_green': [0, 0, 1],
492+ ... 'size': [1, 2, 3]
493+ ... })
494+ >>> pd.from_dummies(df, sep='_')
495+ size color
496+ 0 1 red
497+ 1 2 blue
498+ 2 3 green
499+
500+ With a default category for rows of all zeros:
501+
502+ >>> df = pd.DataFrame({
503+ ... 'color_red': [1, 0, 0, 0],
504+ ... 'color_blue': [0, 1, 0, 0],
505+ ... 'color_green': [0, 0, 1, 0]
506+ ... })
507+ >>> pd.from_dummies(df, sep='_', default_category='unknown')
508+ color
509+ 0 red
510+ 1 blue
511+ 2 green
512+ 3 unknown
513+
514+ With different defaults for different variables:
515+
516+ >>> df = pd.DataFrame({
517+ ... 'color_red': [1, 0, 0],
518+ ... 'color_blue': [0, 0, 0],
519+ ... 'size_S': [0, 1, 0],
520+ ... 'size_M': [0, 0, 0]
521+ ... })
522+ >>> defaults = {'color': 'unknown', 'size': 'L'}
523+ >>> pd.from_dummies(df, sep='_', default_category=defaults)
524+ color size
525+ 0 red L
526+ 1 unknown S
527+ 2 unknown L
401528
402529 Raises
403530 ------
404531 ValueError
405- * When the input ``DataFrame`` ``data`` contains NA values.
406- * When the input ``DataFrame`` ``data`` contains column names with separators
407- that do not match the separator specified with ``sep``.
408- * When a ``dict`` passed to ``default_category`` does not include an implied
409- category for each prefix.
410- * When a value in ``data`` has more than one category assigned to it.
411- * When ``default_category=None`` and a value in ``data`` has no category
412- assigned to it.
532+ * When input contains NA values
533+ * When column names don't match the sep pattern
534+ * When default_category dict is missing categories
535+ * When a row has multiple 1s in one dummy group
536+ * When a row has all 0s and no default_category
413537 TypeError
414- * When the input ``data`` is not of type `` DataFrame``.
415- * When the input ``DataFrame`` ``data`` contains non- dummy data.
416- * When the passed `` sep`` is of a wrong data type.
538+ * When input is not a DataFrame
539+ * When columns don't contain dummy data
540+ * When sep is not a string
417541 * When the passed ``default_category`` is of a wrong data type.
418542
419543 See Also
0 commit comments