CodingChallenge submission #37

naoking158 · 2024-07-16T04:58:06Z

Sorry to trouble you while you are busy; I would appreciate your review.

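Commands run:

```sh
$ ruff check --fix; ruff format
```

This graph plots, for each day:
- the total consumption
- the median and the 10th-90th percentile range

This graph plots, for each area and each day:
- the total consumption
- the median and the 10th-90th percentile range

This graph plots a specific user's total daily consumption together with the daily median consumption of the area that user belongs to.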

kmamoru

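Thank you for submitting the challenge!
Overall the code was easy to read, and the functions were well scoped without being too large.
I have left a few comments and questions; please take a look when you have time.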

kmamoru · 2024-07-16T06:13:12Z

dashboard/consumption/models.py

+    # bulk_update does not call Model.save(), so updated_at is not updated
+    # As a workaround, replace objects
+    # ref: https://scrapbox.io/shimizukawa/django_bulk_update_%E6%99%82%E3%81%ABupdated_at%E3%82%92%E6%9B%B4%E6%96%B0%E3%81%99%E3%82%8B
+    objects = models.manager.BaseManager.from_queryset(QuerySet)()
+


The trick for updating updated_at on bulk_update is a nice one; it makes sense 👍

kmamoru · 2024-07-16T06:14:47Z

dashboard/consumption/management/commands/import.py


    def handle(self, *args, **options):
-        print("Implement me!")
+        data_dir = Path(settings.BASE_DIR).parent / 'data'


Great that you are using pathlib!

kmamoru · 2024-07-16T06:19:22Z

dashboard/consumption/management/commands/import.py

+    return users_to_create, users_to_update
+
+
+def import_user_data(csv_file_path, batch_size=10000):


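[Question]
When data is imported, errors may be raised if the CSV content is invalid.
Is there anything you could do to make investigating such failures easier, or to make handling the error less painful?

I think the errors that can be anticipated should be handled with try-except.

As far as I know, two kinds of errors can be expected:

A decoding error caused by the character encoding: UnicodeDecodeError

A decoding error caused by file encryption: UnicodeDecodeError

Below is a function that handles these errors and catches any other error so that it can be logged.
Replacing pd.read_csv with this function should make investigation and handling easier.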
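from pathlib import Path

import pandas as pd


def load_csv(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f'csv not found: {path}')

    encodings = ["utf-8-sig", "cp932", "shift_jis", "euc-jp", "iso-2022-jp"]
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding, engine="python")
        except UnicodeDecodeError:
            # Could not decode with the current `encoding`; try the next one.
            continue
        except Exception as e:
            # Attach the file path and keep the original error as the cause.
            # This block is unnecessary if no extra context is needed.
            raise RuntimeError(f'failed to read csv: {path}') from e

    # An encrypted file also raises `UnicodeDecodeError`, which the loop above silently skips.
    # If none of the listed encodings worked, the file may be encrypted, so raise here.
    # ValueError is used because UnicodeDecodeError cannot be built from a message alone.
    raise ValueError(f'csv may be encrypted or use an unsupported encoding: {path}')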

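Note: why use utf-8-sig instead of utf-8

utf-8-sig is the codec that handles UTF-8 with a BOM.

A UTF-8 file with a BOM can also be read with the utf-8 codec.
However, the BOM then remains at the start of the text, so when a CSV is processed by column name, for example, the first column name no longer matches the expected string, which can cause bugs.

Conversely, when a UTF-8 file without a BOM is read with utf-8-sig,
there is no BOM at the start, so the text is read as-is.
(Reference: encodings.utf_8_sig: UTF-8 codec with BOM signature)

Therefore, specifying utf-8-sig wherever UTF-8 is expected is the safe choice.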
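The difference between the two codecs can be checked with the standard library alone. The following is a minimal sketch; the sample header row and values are made up for illustration and are not taken from the challenge data.

```python
import codecs

# The same text encoded with and without a BOM (sample content is illustrative).
text = "id,area,tariff\n1,a1,t2\n"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

# Plain utf-8 keeps the BOM as U+FEFF at the start of the decoded text,
# so the first column name is no longer equal to "id".
first_column_plain = with_bom.decode("utf-8").split(",")[0]

# utf-8-sig strips a leading BOM when present and changes nothing otherwise.
first_column_sig = with_bom.decode("utf-8-sig").split(",")[0]
decoded_without_bom = without_bom.decode("utf-8-sig")

print(repr(first_column_plain))     # '\ufeffid'
print(repr(first_column_sig))       # 'id'
print(decoded_without_bom == text)  # True
```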

Thank you for the quick reply!
Measures such as writing logs or skipping the problematic file altogether would be useful!

kmamoru · 2024-07-16T06:20:20Z

dashboard/consumption/management/commands/import.py

+from pathlib import Path
+from typing import Any, Iterable
+
+import pandas as pd


Great that you are using pandas for the CSV handling!

kmamoru · 2024-07-16T08:14:22Z

dashboard/consumption/management/commands/import.py

+    )
+
+    with transaction.atomic():
+        for i in range(0, len(consumption_data_to_create), batch_size):


range(0, -9940, 10000)

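When I ran python manage.py import, no users were imported...
With a CSV of 60 users, the body of this for loop never seems to run.

[Question]
How could this be fixed?

Sorry, none of the batch-processing code handled indices correctly.

I fixed it in the following commit.
307fad6

The excerpt below shows part of the change: instead of len(users_to_create) - batch_size,
I now use len(users_to_create) // batch_size, so that the case where the number of rows is smaller than batch_size is also handled.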

-    with transaction.atomic():
-        for i in range(0, len(users_to_create) - batch_size, batch_size):
-            if len(users_to_create) - i >= batch_size:
-                User.objects.bulk_create(users_to_create[i : i + batch_size])
-            else:
-                User.objects.bulk_create(users_to_create[i:])
-        for i in range(0, len(users_to_create) - batch_size, batch_size):
-            if len(users_to_update) - i >= batch_size:
-                User.objects.bulk_update(users_to_update[i : i + batch_size], ['area', 'tariff'])
-            else:
-                User.objects.bulk_update(users_to_update[i:], ['area', 'tariff'])
+    with transaction.atomic():
+        for i in range(len(users_to_create) // batch_size + 1):
+            # Skip the iteration so that IndexError does not occur
+            if i * batch_size == len(users_to_create):
+                continue
+            User.objects.bulk_create(users_to_create[i * batch_size : (i + 1) * batch_size])
+
+        for i in range(len(users_to_update) // batch_size + 1):
+            # Skip the iteration so that IndexError does not occur
+            if i * batch_size == len(users_to_update):
+                continue
+            User.objects.bulk_update(
+                users_to_update[i * batch_size : (i + 1) * batch_size], ['area', 'tariff']
+            )

Thank you for the fix!
range does have its quirks...
Your fix looks fine to me, but the same processing can also be implemented as follows!
(This is based on the Django source code for bulk_create and bulk_update!)

User.objects.bulk_create(users_to_create, batch_size=batch_size)
User.objects.bulk_update(users_to_update, fields=['area', 'tariff'], batch_size=batch_size)

It is listed as the second argument in the API reference, yet I somehow overlooked it. How embarrassing…
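The indexing problem discussed in this thread can be reproduced without Django. The sketch below uses a hypothetical helper, `iter_batches` (not part of the submission), to show why `range(0, len(items) - batch_size, batch_size)` is empty for 60 rows while `range(0, len(items), batch_size)` still yields one short batch.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices holding at most batch_size items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # range(0, len(items), batch_size) is empty only when items is empty,
    # so a list shorter than batch_size still produces one (short) batch.
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


users = list(range(60))

# The original loop bounds: range(0, -9940, 10000) contains no values.
buggy_starts = list(range(0, len(users) - 10000, 10000))
small_input_sizes = [len(batch) for batch in iter_batches(users, 10000)]
uneven_sizes = [len(batch) for batch in iter_batches(list(range(25)), 10)]

print(buggy_starts)       # []
print(small_input_sizes)  # [60]
print(uneven_sizes)       # [10, 10, 5]
```

Slicing past the end of a list returns a shorter slice rather than raising, so no special case is needed for the final batch.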

kmamoru · 2024-07-16T08:16:56Z

dashboard/consumption/management/commands/import.py

+    consumption_data_to_create, consumption_data_to_update = (
+        make_consumption_data_list_to_create_and_update(
+            combined_df=combined_df,
+            existing_consumptions=existing_consumptions,
+            existing_users=existing_users,
+        )
+    )


Separating creation from update is good!

kmamoru · 2024-07-16T08:45:46Z

dashboard/consumption/templates/consumption/summary.html

Summary page

kmamoru · 2024-07-16T08:47:17Z

dashboard/consumption/templates/consumption/detail.html

shira-182

Thank you for submitting the Challenge!
Despite the limited time, I think you wrote solid code that meets the requirements!

shira-182 · 2024-07-16T10:06:25Z

REPORT.md

+The reason is to prioritize processing speed.
+
+In this application numerical exactness is not important,
+so there is no need to choose `DecimalField` at the cost of processing speed.


Thank you for comparing the types!
Agreed 👍
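The trade-off behind this choice can be illustrated in plain Python, independent of Django's field types. This is only a sketch with made-up readings: binary floats carry small rounding errors that exact decimal arithmetic avoids, which matters for money but rarely for chart data.

```python
from decimal import Decimal

# Two illustrative readings, parsed once as float and once as Decimal.
readings = ["0.1", "0.2"]
float_total = sum(float(r) for r in readings)
decimal_total = sum(Decimal(r) for r in readings)

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_is_exact = float_total == 0.3
# Decimal arithmetic on these inputs is exact.
decimal_is_exact = decimal_total == Decimal("0.3")
# The float error is far below anything visible on a consumption chart.
float_is_close = abs(float_total - 0.3) < 1e-9

print(float_is_exact)    # False
print(decimal_is_exact)  # True
print(float_is_close)    # True
```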

shira-182 · 2024-07-16T10:20:38Z

dashboard/consumption/chart/generate.py

+)
+
+
+def plot_total_consumption(df: pd.DataFrame, percentiles: pd.DataFrame) -> Figure:


The type hints are thorough, which is good 👍

shira-182 · 2024-07-16T10:22:56Z

dashboard/consumption/models.py

+class User(BaseModel):
+    id = models.IntegerField(primary_key=True, help_text='User ID')
+    area = models.CharField(max_length=3, help_text='Area')
+    tariff = models.CharField(max_length=3, help_text='Customs duty')


[FYI] The naming is confusing, but tariff here refers to the electricity rate plan.

shira-182 · 2024-07-16T10:24:52Z

dashboard/consumption/tests.py

+from consumption.models import Consumption, User
+
+
+class StatisticsTests(TestCase):


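It would be best to have tests for the custom command as well!

You are absolutely right.
I think tests are especially important wherever indices and conditional branches appear.
This time there were no such tests and the batch processing turned out to be buggy, so I am keenly aware of how important testing is…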

shira-182 · 2024-07-16T10:29:17Z

requirements.txt

 psycopg2==2.9.3
+pandas==2.2.2
+matplotlib==3.9.1
+ruff==0.5.0


Nice choice going with ruff 👏

t-miyao

Thank you for your work on this!
I left several comments, but they are mostly IMO, so no changes are required.

t-miyao · 2024-07-17T01:15:48Z

dashboard/consumption/management/commands/import.py

+            users_to_update.append(user)
+        else:
+            users_to_create.append(User(id=row['id'], area=row['area'], tariff=row['tariff']))
+


For processing this simple, I think it can be done with pandas API operations alone, without a for loop.

df_user_has_id = df[df.id.isin(existing_users)].to_dict(orient="records")
df_user_has_no_id = df[~df.id.isin(existing_users)].to_dict(orient="records")
users_to_update = [User(**user) for user in df_user_has_id]
users_to_create = [User(**user) for user in df_user_has_no_id]

Thank you, that is good to learn 🙇

t-miyao · 2024-07-17T01:17:30Z

dashboard/consumption/management/commands/import.py

+
+    # Check the column names
+    if not all(column in df.columns for column in ['id', 'area', 'tariff']):
+        raise ValueError("CSV file must contain 'id', 'area', and 'tariff' columns")


Checking for an invalid CSV column layout is good!
Checking the types of the contents as well would be even better.
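One way the suggested content check could look is sketched below with the standard `csv` module instead of pandas. The function name `validate_user_rows` and the error messages are hypothetical; the three-character limit mirrors the `max_length=3` on `area` and `tariff` in the model shown earlier in this review.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "area", "tariff")


def validate_user_rows(csv_text: str) -> list:
    """Return parsed rows, raising ValueError on a bad header or bad value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"CSV file is missing columns: {missing}")

    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            user_id = int(row["id"])
        except (TypeError, ValueError):
            raise ValueError(
                f"line {line_no}: id must be an integer, got {row['id']!r}"
            ) from None
        for column in ("area", "tariff"):
            value = row[column]
            # Short rows yield None here; empty and over-long values are rejected too.
            if not value or len(value) > 3:
                raise ValueError(
                    f"line {line_no}: {column} must be 1-3 characters, got {value!r}"
                )
        rows.append({"id": user_id, "area": row["area"], "tariff": row["tariff"]})
    return rows


valid_rows = validate_user_rows("id,area,tariff\n1,a1,t2\n")
print(valid_rows)
```

Including the line number in the message makes it easier to locate the offending row when an import fails.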

t-miyao · 2024-07-17T01:21:27Z

dashboard/consumption/management/commands/import.py

+                continue
+            User.objects.bulk_update(
+                users_to_update[i * batch_size : (i + 1) * batch_size], ['area', 'tariff']
+            )


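Why did you make batch_size variable?
I think a fixed value would be fine here.

I made it variable because the appropriate batch_size can differ with the environment, for example network bandwidth and memory.
I also believe magic numbers should be avoided as far as possible, so I keep the value in a variable even when a fixed value would suffice.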

t-miyao · 2024-07-17T01:23:25Z

dashboard/consumption/management/commands/import.py

+        df['user_id'] = int(user_id)
+        all_dfs.append(df)
+
+    combined_df = pd.concat(all_dfs, ignore_index=True)


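Processing everything together rather than one file at a time is good!
However, be aware that memory may be exhausted if large files or a large number of files arrive.

I overlooked memory usage.
I should have split the processing with that in mind.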
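A chunked approach to the memory concern could look like the following sketch, written with the standard library rather than pandas. `iter_row_chunks` is a hypothetical helper and the sample rows are made up; pandas offers a comparable mechanism through the `chunksize` parameter of `read_csv`.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_row_chunks(handle: TextIO, chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size parsed rows without loading the whole file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    reader = csv.DictReader(handle)
    while True:
        # islice pulls only the next chunk_size rows from the reader.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Five illustrative rows using the datetime and consumption columns.
sample_text = "datetime,consumption\n" + "".join(
    f"2016-07-15 00:{minute:02d}:00,{minute}.0\n" for minute in range(5)
)
chunk_sizes = [len(chunk) for chunk in iter_row_chunks(io.StringIO(sample_text), 2)]
print(chunk_sizes)  # [2, 2, 1]
```

Each chunk can be written to the database and then discarded, so peak memory depends on the chunk size rather than on the total input size.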

t-miyao · 2024-07-17T01:25:09Z

dashboard/consumption/management/commands/import.py

+                    datetime=row['datetime'],
+                    consumption=row['consumption'],
+                )
+            )


This, too, could be done with pandas API operations alone rather than a for loop.

t-miyao · 2024-07-17T01:28:13Z

dashboard/consumption/management/commands/import.py

+    # Fetch the existing consumption data in bulk
+    existing_data = Consumption.objects.filter(
+        user_id__in=user_ids, datetime__in=combined_df['datetime'].tolist()
+    )


combined_df may contain the same datetime more than once, so I think using combined_df['datetime'].unique().tolist() would make the IN clause shorter.
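The effect of deduplicating before building the IN clause can be shown with plain Python. This is a sketch with made-up timestamps; `dict.fromkeys` is used here as an order-preserving stand-in for the pandas `unique()` call mentioned in the comment.

```python
# Several users report at the same timestamps, so the combined list repeats values.
datetimes = [
    "2016-07-15 00:00:00",
    "2016-07-15 00:30:00",
    "2016-07-15 00:00:00",
    "2016-07-15 00:30:00",
    "2016-07-15 01:00:00",
]

# dict keys are unique and keep insertion order, so this deduplicates in order.
unique_datetimes = list(dict.fromkeys(datetimes))

print(len(datetimes), len(unique_datetimes))  # 5 3
```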

t-miyao · 2024-07-17T01:31:02Z

dashboard/consumption/models.py

+    updated_at = models.DateTimeField(auto_now=True, blank=True, null=True)
+
+    # bulk_update does not call Model.save(), so updated_at is not updated
+    # As a workaround, replace objects


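I think you could simply include updated_at as one of the fields when calling bulk_update.

UPDATE_FIELDS = ["area", "tariff", "updated_at"]
MODEL.objects.bulk_update(updates, fields=UPDATE_FIELDS)

Indeed, for this case that may be simpler and better.

However, considering the risk of forgetting the updated_at handling when bulk_update is run somewhere else in the future,
or when QuerySet.update, which does not call Model.save(), is run,
I decided that implementing it as a hook would be the better option.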

t-miyao · 2024-07-17T01:31:44Z

dashboard/consumption/models.py

+        return f'User {self.id} - Area: {self.area} - Tariff: {self.tariff}'
+
+
+class Consumption(models.Model):


I think Consumption could inherit from BaseModel as well.

You are right.
This is an inheritance mistake on my part…

MasaruFukazawa · 2024-07-17T01:20:40Z

dashboard/consumption/models.py

+
+    class Meta:
+        constraints = [
+            models.UniqueConstraint(fields=['user', 'datetime'], name='unique_user_datetime')


Making user and datetime a composite unique constraint is excellent.

MasaruFukazawa · 2024-07-17T01:23:55Z

dashboard/consumption/chart/generate.py

@@ -0,0 +1,163 @@
+import base64


The import order follows PEP 8, which is good.

Standard library imports

Related third-party imports

Local application/library-specific imports

MasaruFukazawa · 2024-07-17T01:39:02Z

dashboard/consumption/management/commands/import.py

+            )


 class Command(BaseCommand):


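No implementation needed; I would just like to hear your thoughts.
If the batch is run on a schedule rather than manually, we would want to keep execution logs.
In that case, where in the code would you place logs, and what kind?

I think it would be good to add logs at the start and at the end of the whole batch process that make those events clear,
and logs immediately before calls such as bulk_update so that progress is visible.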
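The log placement described in this answer could be sketched as follows with the standard `logging` module. The function `run_batches`, the logger name, and the message wording are all hypothetical; `write_batch` stands in for a call such as bulk_update.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("consumption.import")


def run_batches(
    items: Sequence, batch_size: int, write_batch: Callable[[Sequence], None]
) -> int:
    """Write items in batches, logging the start, each batch, and the end."""
    total = len(items)
    logger.info("import started: %d rows, batch_size=%d", total, batch_size)
    written = 0
    try:
        for start in range(0, total, batch_size):
            batch = items[start : start + batch_size]
            # Logged just before the write so a hang or crash can be located.
            logger.info("writing rows %d-%d of %d", start + 1, start + len(batch), total)
            write_batch(batch)
            written += len(batch)
    except Exception:
        # logger.exception records the traceback along with the progress made.
        logger.exception("import failed after %d of %d rows", written, total)
        raise
    logger.info("import finished: %d rows written", written)
    return written


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
rows_written = run_batches(list(range(25)), 10, lambda batch: None)
```

For a scheduled job, the timestamped start and finish lines also make it possible to see how long each run took.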

MasaruFukazawa · 2024-07-17T02:56:17Z

dashboard/consumption/management/commands/import.py

+    existing_users = User.objects.in_bulk(df['id'].tolist())
+    users_to_create, users_to_update = make_user_list_to_create_and_update(df, existing_users)
+
+    with transaction.atomic():


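No changes needed; I would just like to hear your thoughts.
It is good that you are mindful of transactions.
However, as it stands, if an error occurs partway through writing the consumption data after the user writes have succeeded, it looks as though only the consumption data changes would be rolled back.

If the user writes and the consumption data writes are treated as a single unit of work, where should transaction.atomic() be placed?

> If the user writes and the consumption data writes are treated as a single unit of work, where should transaction.atomic() be placed?

With the current structure, this can be achieved by placing transaction.atomic() inside the handle function so that it wraps both import_user_data and import_all_consumption_data.
However, that would also enclose processing unrelated to the database, so I think the structure should be changed so that the data to create and update is prepared first,
and the user writes and consumption data writes are then performed together at the end.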
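The "prepare first, then write everything in one transaction" structure can be demonstrated with the standard `sqlite3` module. This is a sketch that swaps in sqlite3 for Django's transaction.atomic(); the table names, the `import_all` function, and the sample rows are hypothetical.

```python
import sqlite3


def import_all(conn: sqlite3.Connection, users: list, consumptions: list) -> None:
    """Write users and consumptions as one unit: both succeed or neither does."""
    # `with conn:` commits when the block succeeds and rolls back on an exception.
    with conn:
        conn.executemany(
            "INSERT INTO users (id, area, tariff) VALUES (?, ?, ?)", users
        )
        conn.executemany(
            "INSERT INTO consumptions (user_id, datetime, consumption) VALUES (?, ?, ?)",
            consumptions,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, area TEXT, tariff TEXT)")
conn.execute(
    "CREATE TABLE consumptions ("
    "user_id INTEGER, datetime TEXT, consumption REAL, UNIQUE (user_id, datetime))"
)
conn.commit()

users = [(1, "a1", "t1")]
# The duplicate (user_id, datetime) pair violates the unique constraint.
bad_consumptions = [(1, "2016-07-15 00:00:00", 1.0), (1, "2016-07-15 00:00:00", 2.0)]

try:
    import_all(conn, users, bad_consumptions)
except sqlite3.IntegrityError:
    pass

# The user insert was rolled back together with the failed consumption insert.
users_after_failure = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(users_after_failure)  # 0
```

All data preparation happens before `import_all` is called, so the transaction stays open only for the database writes themselves.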

naoking158 added 17 commits July 16, 2024 11:06

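feat: add dependency libraries

cc0aaf6

feat: add the ruff configuration file

c3d4b26

style: apply ruff lint and format

8b4db38

Commands run:

```sh
$ ruff check --fix; ruff format
```

feat: define models

82382ad

chore: register the models on the admin page

fa6f17b

feat: implement the user import feature

26de8f6

feat: implement the consumption import feature

3ac2dd5

feat: create a directory for chart-related files

7c3b3c4

feat: implement chart generation for total daily consumption

8cce9e6

This graph plots, for each day:
- the total consumption
- the median and the 10th-90th percentile range

feat: implement chart generation for total daily consumption by area

ebed81b

This graph plots, for each area and each day:
- the total consumption
- the median and the 10th-90th percentile range

feat: implement chart generation for a specific user's total daily consumption

17cf970

This graph plots a specific user's total daily consumption together with the daily median consumption of the area that user belongs to.

tests: add tests for statistics.py

1a94626

feat: update layout.html

facb59d

feat: add the URL for the detail page

49cd823

feat: update the summary page

f984b36

feat: update the details page

ba3998f

docs: add REPORT.md

5b690ba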

kmamoru reviewed Jul 16, 2024

shira-182 reviewed Jul 16, 2024

fix: correct the index handling in the import feature's batch processing

307fad6

t-miyao reviewed Jul 17, 2024

MasaruFukazawa reviewed Jul 17, 2024

