Right now, we store all the events that ever happened into one big .json file on the disk where the wasp-bot is running.
Each time wasp-bot runs, it fetches new events from Posthog, adds those into the .json file, and then does the calculations while having all the data from the .json file loaded in the memory (and often crunching all of it).
This worked well for a long time, but we have now accumulated quite a bit of data, and we get more data/events each day, so it is becoming unmanageable: the file is 450MB in size, and the calculations take too long.
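To illustrate why this stops scaling: every run re-parses the entire file into memory before any calculation happens. A rough sketch of that flow (names and shapes here are hypothetical, not the actual wasp-bot code):

```typescript
import * as fs from "fs";

type Event = { distinctId: string; event: string; timestamp: string };

// The whole ~450MB file is parsed into memory on every run.
function loadAllEvents(path: string): Event[] {
  return JSON.parse(fs.readFileSync(path, "utf8"));
}

// New events fetched from Posthog get appended, the full array is
// rewritten to disk, and all calculations then crunch the full array.
function appendNewEvents(path: string, newEvents: Event[]): Event[] {
  const all = loadAllEvents(path).concat(newEvents);
  fs.writeFileSync(path, JSON.stringify(all));
  return all;
}
```

Both memory use and run time grow linearly with the total history, which is why the approaches below focus on loading less.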
Solutions:
- Cache everything we can, load as little as we need.
Right now we do need all the events, because we calculate things like each user's first use of Wasp and similar, so we need the raw event data from the very start until today. But we could avoid that by caching not just raw event data but also intermediate results. E.g. we could calculate each user's starting date once, cache it, and only update it for new users. Then we don't need the old raw event data any more (at least not for that reason).
Yeah, so a lot of calculations likely never need to be repeated, because they happened in the past, unless we change the logic of how they are done. So we should break our calculations down into intermediate (or even final) results, cache those as much as we can, and reuse them; then we will need much less data loaded into memory and can crunch it much faster. Would it even make sense to have some kind of small "framework" for defining such "intermediate computations", so that they get cached and reused automatically? Hm.
- Consider how we store the data, so we can load only what we need.
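The intermediate-results idea above could be sketched roughly like this: keep a cache of each user's first-seen date, and on each run update it using only the newly fetched events, so the old raw events are no longer needed for this calculation. All names here are hypothetical:

```typescript
type Event = { distinctId: string; timestamp: string };

// userId -> ISO timestamp of that user's first event
// (a cached intermediate result, persisted between runs).
type FirstSeenCache = Map<string, string>;

// Update the cache using only the *new* events fetched since the last run.
function updateFirstSeen(cache: FirstSeenCache, newEvents: Event[]): FirstSeenCache {
  const updated = new Map(cache);
  for (const e of newEvents) {
    const prev = updated.get(e.distinctId);
    // ISO-8601 timestamps compare correctly as plain strings.
    if (prev === undefined || e.timestamp < prev) {
      updated.set(e.distinctId, e.timestamp);
    }
  }
  return updated;
}
```

The cache only ever grows by one entry per new user, so a run touches data proportional to the new events, not the whole history.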
Instead of one big file, maybe we should use multiple files (e.g. one per quarter), so we can easily load only a portion of the data.
Or maybe we should instead store the data in SQLite, or something more serious like BigQuery or Redshift or similar; we should investigate.
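The per-quarter file idea boils down to routing each event to a partition key derived from its timestamp, so a calculation over recent data only has to load a few small files. A minimal sketch, with hypothetical names:

```typescript
type Event = { distinctId: string; timestamp: string };

// Map an ISO timestamp to its quarter label, e.g. "2024-07-15" -> "2024-Q3".
function quarterOf(isoTimestamp: string): string {
  const d = new Date(isoTimestamp);
  const quarter = Math.floor(d.getUTCMonth() / 3) + 1;
  return `${d.getUTCFullYear()}-Q${quarter}`;
}

// Group events by quarter; each group would be written to its own file,
// e.g. events-2024-Q3.json, and loaded independently.
function groupByQuarter(events: Event[]): Map<string, Event[]> {
  const groups = new Map<string, Event[]>();
  for (const e of events) {
    const key = quarterOf(e.timestamp);
    const group = groups.get(key) ?? [];
    group.push(e);
    groups.set(key, group);
  }
  return groups;
}
```

A database like SQLite would give us the same range-restricted loading for free via an index on the timestamp column, plus incremental appends without rewriting everything.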