Commit 98b671c

Create Pandas documentation
1 parent 5dd96a6 commit 98b671c

2 files changed

+147
-0
lines changed

README.md

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,8 @@ In `docs/`:
 
 - [`Numerical Issues.md`](/docs/Numerical%20Issues.md): Information about detecting and resolving numerical issues.
 
+- [`Pandas.md`](/docs/Pandas.md): Crash course on the Pandas data manipulation library.
+
 Finally, you can generate documentation for the SWITCH modules by running `pydoc -w switch_model` after having installed
 SWITCH. This will build HTML documentation files from python doc strings which
 will include descriptions of each module, their intentions, model

docs/Pandas.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# Using Pandas

[Pandas](https://pandas.pydata.org/) is a Python library used for data analysis and manipulation.

In SWITCH, Pandas is mainly used to create graphs and to write output files after solving.

This document gives a brief overview of the key concepts and commands
needed to get started with Pandas. Many more thorough resources for learning
Pandas are available online, including entire courses.

Finally, the Pandas [documentation](https://pandas.pydata.org/docs/)
and [API reference](https://pandas.pydata.org/docs/reference/index.html#api) should be your go-to resources
when trying to learn something new about Pandas.

## Key Concepts

### DataFrame

Dataframes are Pandas' way of storing tabular data.
They have rows, columns and labelled axes (e.g. row or column names).
Dataframes are the primary Pandas data structure. When manipulating data,
the common practice is to store your main dataframe in a variable called `df`.

### Series

A series can be thought of as a single column in a dataframe.
It's a 1-dimensional labelled array of values.
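
As a minimal sketch of these two structures, the snippet below builds a small dataframe from a dictionary and pulls one column out as a series. The column names and values are made up for illustration and are not SWITCH data.

```python
import pandas as pd

# Hypothetical data, made up for illustration only.
df = pd.DataFrame(
    {
        "country": ["Canada", "Mexico", "Canada"],
        "capacity_mw": [100.0, 250.0, 75.0],
    }
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column labels

# Selecting a single column gives a Series that keeps the dataframe's row labels.
capacity = df["capacity_mw"]
print(type(capacity).__name__)

# A Series supports the usual aggregations, such as a sum over its values.
total_capacity = capacity.sum()
```

Because the series keeps the row labels of the dataframe it came from, results computed on a column can be lined up with the original rows again.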

### Indexes

Pandas has two ways of working with dataframes: with or without custom indexes.
Custom indexes are essentially labels for each row. For example, the following
dataframe has 4 columns (A, B, C, D) and a custom index (the date).

```
                   A         B         C         D
2000-01-01  0.815944 -2.093889  0.677462 -0.982934
2000-01-02 -1.688796 -0.771125 -0.119608 -0.308316
2000-01-03 -0.527520  0.314343  0.852414 -1.348821
2000-01-04  0.133422  3.016478 -0.443788 -1.514029
2000-01-05 -1.451578  0.455796  0.559009 -0.247087
```

The same dataframe can be expressed without the custom index as follows.
Here the date is a column just like the others and the index is the default
numbered index.

```
         date         A         B         C         D
0  2000-01-01  0.815944 -2.093889  0.677462 -0.982934
1  2000-01-02 -1.688796 -0.771125 -0.119608 -0.308316
2  2000-01-03 -0.527520  0.314343  0.852414 -1.348821
3  2000-01-04  0.133422  3.016478 -0.443788 -1.514029
4  2000-01-05 -1.451578  0.455796  0.559009 -0.247087
```

Using custom indexes is quite powerful but more advanced. When starting
out it's best to avoid custom indexes.
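
To make the two layouts concrete, here is a small sketch of switching between them with `set_index` and `reset_index`. It reuses a few values from the tables above; the variable names are only illustrative.

```python
import pandas as pd

# Start without a custom index: "date" is an ordinary column.
df = pd.DataFrame(
    {
        "date": ["2000-01-01", "2000-01-02", "2000-01-03"],
        "A": [0.815944, -1.688796, -0.527520],
    }
)

# Promote the "date" column to a custom index (returns a new dataframe).
indexed = df.set_index("date")

# With a custom index, rows can be looked up by their label.
value = indexed.loc["2000-01-02", "A"]

# Move the index back into a regular column and restore the 0, 1, 2, ... index.
restored = indexed.reset_index()
```

If a dataframe you receive already has a custom index, calling `reset_index()` is a simple way to get back to the plain layout recommended here.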

### Chaining

Most commands you apply to a dataframe *return* a new object, usually a new dataframe.
That is, by default commands *do not* modify the dataframe they're called on.

For example, the following has no effect.

`df.sort_values("country")`

Instead, you should update your variable with the returned result.
For example,

`df = df.sort_values("country")`

This allows you to "chain" multiple operations together. E.g.

`df = df.sort_values("country").rename(...).some_other_command(...)`
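
The sketch below illustrates both points on a made-up dataframe: a result that is not assigned is simply discarded, and several operations can be chained into one expression. It also shows that `groupby` on its own returns an intermediate grouped object rather than a dataframe, so it is normally followed by an aggregation such as `.sum()`.

```python
import pandas as pd

# Hypothetical data, made up for illustration only.
df = pd.DataFrame(
    {
        "country": ["Mexico", "Canada", "Canada"],
        "capacity_mw": [250.0, 100.0, 100.0],
    }
)

# The result is discarded here, so df still has all three rows afterwards.
df.drop_duplicates()

# Chain several operations and keep the final result.
result = (
    df.drop_duplicates()
    .rename(columns={"capacity_mw": "capacity"})
    .sort_values("country")
    .reset_index(drop=True)
)

# groupby needs an aggregation step before it yields tabular data again.
totals = df.groupby("country")["capacity_mw"].sum().reset_index()
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer chains readable.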

## Useful commands

- `df = pandas.read_csv(filepath, index_col=False)`. This command
reads a CSV file from `filepath` and returns a dataframe that gets stored
in `df`. `index_col=False` ensures that no custom index is automatically
created.

- `df.to_csv(filepath, index=False)`.
This command writes a dataframe to `filepath`. `index=False` makes
sure you don't write the index to the file. This should
be used if your index is just `0, 1, 2, ...`, in which case you probably
don't want to write your index to the file.

- `df["column_name"]`: Returns a *Series* containing the values for that column.

- `df[["column_1", "column_2"]]`: Returns a *DataFrame* containing only the specified columns.

- `df[df["column_name"] == "some_value"]`: Returns a dataframe with only the rows
where the condition in the square brackets is met. In this case we filter out
all the rows where the value under `column_name` is not `"some_value"`.

- `df.merge(other_df, on=["key_1", "key_2"])`: Merges `df` with `other_df`
where the columns over which we are merging are columns `key_1` and `key_2`.

- `df.info()`: Prints the columns in the dataframe and some info about each column.

- `df.head()`: Returns the first rows of the dataframe (5 by default).

- `df.drop_duplicates()`: Returns a dataframe with duplicate rows removed.

- `Series.unique()`: Returns an array of the unique values in the series, in order of appearance.
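
The commands above can be tried out together in a short, self-contained sketch. To avoid depending on files on disk, it reads CSV text from an in-memory buffer and asks `to_csv` to return a string instead of writing to `filepath`. The table contents and column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a file on disk.
csv_text = """project,tech,zone
P1,Wind,North
P2,Solar,North
P3,Wind,South
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=False)

tech = df["tech"]                 # a Series
subset = df[["project", "tech"]]  # a DataFrame with two columns
wind = df[df["tech"] == "Wind"]   # only the rows where the condition holds
techs = df["tech"].unique()       # array of distinct values

# A second hypothetical table that shares the "zone" column.
zones = pd.DataFrame({"zone": ["North", "South"], "region": ["A", "B"]})
merged = df.merge(zones, on="zone")

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_out = merged.to_csv(index=False)
```

Passing a real path to `read_csv` and `to_csv` works the same way; only the source and destination change.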

## Example

This example shows how we can use Pandas to generate a more useful view
of our generation plants from the SWITCH input files.

```python
import pandas as pd

# READ

gen_projects = pd.read_csv("generation_projects_info.csv", index_col=False)
costs = pd.read_csv("gen_build_costs.csv", index_col=False)
predetermined = pd.read_csv("gen_build_predetermined.csv", index_col=False)

# JOIN TABLES
gen_projects = gen_projects.merge(
    costs,
    on="GENERATION_PROJECT",
)

gen_projects = gen_projects.merge(
    predetermined,
    on=["GENERATION_PROJECT", "build_year"],
    how="left",  # Makes a left join
)

# FILTER
# When uncommented will filter out all the projects that aren't wind.
# gen_projects = gen_projects[gen_projects["gen_energy_source"] == "Wind"]

# WRITE
gen_projects.to_csv("projects.csv", index=False)
```

If you run the code snippet above, it will create a `projects.csv` file
containing the project data, cost data and prebuild data all in one file.
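
The `how="left"` argument matters because not every project and build year has a predetermined entry. The sketch below uses tiny made-up tables (the `cost` and `capacity` columns are hypothetical stand-ins, not the real SWITCH column names) to compare a left join with the default inner join.

```python
import pandas as pd

# Hypothetical stand-ins for the cost and predetermined-build tables.
costs = pd.DataFrame(
    {
        "GENERATION_PROJECT": ["P1", "P1", "P2"],
        "build_year": [2010, 2030, 2030],
        "cost": [1.0, 2.0, 3.0],
    }
)
predetermined = pd.DataFrame(
    {
        "GENERATION_PROJECT": ["P1"],
        "build_year": [2010],
        "capacity": [50.0],
    }
)

keys = ["GENERATION_PROJECT", "build_year"]

# Left join: every row of `costs` is kept; unmatched rows get NaN for `capacity`.
left = costs.merge(predetermined, on=keys, how="left")

# Default (inner) join: only rows with a match in both tables survive.
inner = costs.merge(predetermined, on=keys)
```

With the inner join, projects without a predetermined build would silently disappear from the output, which is why the left join is the safer choice in the example.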
