chris1610
diff --git a/‎notebooks/case_study_weather/1-dwd_konverter_download.ipynb‎
Lines changed: 109 additions & 0 deletions b/‎notebooks/case_study_weather/1-dwd_konverter_download.ipynb‎
Lines changed: 109 additions & 0 deletions
diff --git a/‎notebooks/case_study_weather/2-dwd_konverter_extract.ipynb‎
Lines changed: 98 additions & 0 deletions b/‎notebooks/case_study_weather/2-dwd_konverter_extract.ipynb‎
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,109 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Import temperature data from the DWD and process it\n",
+    "\n",
+    "This notebook pulls historical temperature data from the DWD server and formats it for future use in other projects. The data is delivered in a hourly frequencs in a .zip file for each of the available weather stations. To use the data, we need everythin in a single .csv-file, all stations side-by-side. Also, we need the daily average.\n",
+    "\n",
+    "To reduce computing time, we also crop all data earlier than 2007. \n",
+    "\n",
+    "Files should be executed in the following pipeline:\n",
+    "* 1-dwd_konverter_download\n",
+    "* 2-dwd_konverter_extract\n",
+    "* 3-dwd_konverter_build_df\n",
+    "* 4-dwd_konverter_final_processing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1.) Download files from the DWD-API\n",
+    "Here we download all relevant files from the DWS Server. The DWD Server is http-based, so we scrape the download page for all links that match 'stundenwerte_TU_.\\*_hist.zip' and download them to the folder 'download'. \n",
+    "\n",
+    "Link to the relevant DWD-page: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Done\n"
+     ]
+    }
+   ],
+   "source": [
+    "import requests\n",
+    "import re\n",
+    "from bs4 import BeautifulSoup\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# Set base values\n",
+    "download_folder = Path.cwd() / 'download'\n",
+    "base_url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'\n",
+    "\n",
+    "\n",
+    "# Initiate Session and get the Index-Page\n",
+    "with requests.Session() as s:\n",
+    "    resp = s.get(base_url)\n",
+    "\n",
+    "# Parse the Index-Page for all relevant <a href> \n",
+    "soup = BeautifulSoup(resp.content)\n",
+    "links = soup.findAll(\"a\", href=re.compile(\"stundenwerte_TU_.*_hist.zip\"))\n",
+    "\n",
+    "# For testing, only download 10 files\n",
+    "file_max = 10\n",
+    "dl_count = 0\n",
+    "\n",
+    "#Download the .zip files to the download_folder\n",
+    "for link in links:\n",
+    "    zip_response = requests.get(base_url + link['href'], stream=True)\n",
+    "    # Limit the downloads while testing\n",
+    "    dl_count += 1\n",
+    "    if dl_count > file_max:\n",
+    "        break\n",
+    "    with open(Path(download_folder) / link['href'], 'wb') as file:\n",
+    "        for chunk in zip_response.iter_content(chunk_size=128):\n",
+    "            file.write(chunk)  \n",
+    "    \n",
+    "print('Done')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
@@ -0,0 +1,98 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Import temperature data from the DWD and process it\n",
+    "\n",
+    "This notebook pulls historical temperature data from the DWD server and formats it for future use in other projects. The data is delivered in a hourly frequencs in a .zip file for each of the available weather stations. To use the data, we need everythin in a single .csv-file, all stations side-by-side. Also, we need the daily average.\n",
+    "\n",
+    "To reduce computing time, we also crop all data earlier than 2007. \n",
+    "\n",
+    "Files should be executed in the following pipeline:\n",
+    "* 1-dwd_konverter_download\n",
+    "* 2-dwd_konverter_extract\n",
+    "* 3-dwd_konverter_build_df\n",
+    "* 4-dwd_konverter_final_processing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2.) Extract all .zip-archives\n",
+    "In this next step, we extract a single file from all the downloaded .zip files and save them to the 'import' folder. Beware, there is going to be a lot of data (~6 GB of .csv files)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Done'"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from pathlib import Path\n",
+    "import glob\n",
+    "import re\n",
+    "from zipfile import ZipFile\n",
+    "\n",
+    "# Folder definitions\n",
+    "download_folder = Path.cwd() / 'download'\n",
+    "import_folder = Path.cwd() / 'import'\n",
+    "\n",
+    "# Find all .zip files and generate a list\n",
+    "unzip_files = glob.glob('download/stundenwerte_TU_*_hist.zip')\n",
+    "\n",
+    "# Set the name pattern of the file we need\n",
+    "regex_name = re.compile('produkt.*')\n",
+    "\n",
+    "# Open all files, look for files that match ne regex pattern, extract to 'import'\n",
+    "for file in unzip_files:\n",
+    "    with ZipFile(file, 'r') as zipObj:\n",
+    "        list_of_filenames = zipObj.namelist()\n",
+    "        extract_filename = list(filter(regex_name.match, list_of_filenames))[0]\n",
+    "        zipObj.extract(extract_filename, import_folder)\n",
+    "\n",
+    "display('Done')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}