30 changes: 30 additions & 0 deletions doc/source/user_guide/io.rst
@@ -1671,6 +1671,36 @@ function takes a number of arguments. Only the first is required.
* ``chunksize``: Number of rows to write at a time
* ``date_format``: Format string for datetime objects

Floating Point Precision on Writing and Reading to CSV Files
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Floating Point Precision inaccuracies when writing and reading to CSV files happen due to how the numeric data is represented and parsed in pandas.
Member

Suggested change
Floating Point Precision inaccuracies when writing and reading to CSV files happen due to how the numeric data is represented and parsed in pandas.
Floating point precision inaccuracies when writing and reading to CSV files happen due to how the numeric data is represented and parsed in pandas.

Member

I think the content in its current state talks about implementation details and is a bit harsh on pandas, to the extent that I think it's missing the larger point that floating point values are by nature not exact.
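
To make the "not exact by nature" point concrete, here is a minimal sketch in plain Python with no pandas involved. It is illustrative only and not part of the proposed doc change; the variable names are hypothetical.

```python
from decimal import Decimal

# 0.1 and 0.2 have no exact binary representation, so their sum
# is not the same 64-bit float as the literal 0.3.
total = 0.1 + 0.2
sum_matches_literal = total == 0.3

# Decimal(float) exposes the exact value the float actually stores.
stored_value_of_tenth = Decimal(0.1)

# Converting to text and back is lossy with 15 significant digits,
# but 17 significant digits identify a 64-bit float uniquely.
short_roundtrip = float("%.15g" % total) == total
full_roundtrip = float("%.17g" % total) == total

print(sum_matches_literal, short_roundtrip, full_roundtrip)
print(stored_value_of_tenth)
```

The same text conversion happens whenever floats are written to any text format, which is why the effect shows up in CSV files regardless of the library that writes them.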

Taking a step back - what is the overall goal that this documentation is trying to achieve?

Author

Hi Will, thank you for the feedback. The overall goal is to explain that, by default, due to computer arithmetic outside of our control, floating point numbers are not always stored or returned with exact accuracy.

My intent with this doc addition is to show that floating point numbers cannot always be stored precisely, and that differences can arise when values are converted to text and later read back. To help with this, pandas provides options such as the float_format parameter (for writing) and the float_precision="round_trip" parameter (for reading), which help preserve precision when writing to and reading from CSV, so that values come back just as they were.
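
As a rough sketch of how those two options interact (assuming pandas is installed; the value and labels are borrowed from the example in the diff, and the `%.6g` format is only a hypothetical stand-in for "too few digits"):

```python
from io import StringIO

import pandas as pd

x0 = 18292498239.824
df = pd.DataFrame({"One": [x0]}, index=["bignum"])

# Too few significant digits: information is discarded at write time,
# so no reader setting can recover the original value.
lossy_csv = df.to_csv(float_format="%.6g")
lossy = pd.read_csv(StringIO(lossy_csv), index_col=0)

# 17 significant digits on write, plus the round-trip parser on read.
exact_csv = df.to_csv(float_format="%.17g")
exact = pd.read_csv(
    StringIO(exact_csv), index_col=0, float_precision="round_trip"
)

lossy_equal = lossy.iloc[0, 0] == x0
exact_equal = exact.iloc[0, 0] == x0
print(f"lossy write equal? {lossy_equal}; 17-digit write equal? {exact_equal}")
```

The first comparison should come out unequal and the second equal, which is the behavior the new section is meant to document.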

During the write process, pandas converts all the numeric values into text that is stored as bytes in the CSV file. However, when we read the CSV back, pandas parses those
Member

Suggested change
During the write process, pandas converts all the numeric values into text that is stored as bytes in the CSV file. However, when we read the CSV back, pandas parses those
During the write process, pandas converts all the numeric values into text that is stored as bytes in the CSV file. However, when the CSV is read back, pandas parses those

text values and converts them back into typed values (floats, integers, strings), which is where floating point precision can be lost.
The conversion is not guaranteed to be exact: small differences in representation between the original and the reloaded DataFrame can occur, leading to precision loss.

* ``float_format``: Format string for floating point numbers

``df.to_csv('file.csv', float_format='%.17g')`` specifies the floating point format used when writing to the CSV file. In this example, each value is written with 17 significant digits, which is enough to uniquely identify any 64-bit float.

``df = pd.read_csv('file.csv', float_precision='round_trip')`` allows for floating point precision to be specified when reading from the CSV file. This is guaranteed to round-trip values after writing to a file and Pandas will read the numbers without losing or changing decimal places.
Member

Suggested change
``df = pd.read_csv('file.csv', float_precision='round_trip')`` allows for floating point precision to be specified when reading from the CSV file. This is guaranteed to round-trip values after writing to a file and Pandas will read the numbers without losing or changing decimal places.
``df = pd.read_csv('file.csv', float_precision='round_trip')`` allows for floating point precision to be specified when reading from the CSV file. This is guaranteed to round-trip values after writing to a file and pandas will read the numbers without losing or changing decimal places.


.. ipython:: python

   from io import StringIO

   x0 = 18292498239.824
   df1 = pd.DataFrame({'One': [x0]}, index=["bignum"])

   csv_string = df1.to_csv(float_format='%.17g')
   df2 = pd.read_csv(StringIO(csv_string), index_col=0, float_precision='round_trip')

   x1 = df1.iloc[0, 0]
   x2 = df2.iloc[0, 0]

   print(f"x0 = {x0}; x1 = {x1}; Are they equal? {x0 == x1}")
   print(f"x0 = {x0}; x2 = {x2}; Are they equal? {x0 == x2}")

Writing a formatted string
++++++++++++++++++++++++++
