Optimize our events cache storing and usage #63

@Martinsos

Description

Right now, we store all the events that ever happened in one big .json file on the disk of the machine where wasp-bot runs.
Each time wasp-bot runs, it fetches new events from PostHog, appends them to the .json file, and then does its calculations with all the data from the .json file loaded in memory (often crunching all of it).

This worked well for a long time, but we have now accumulated quite a bit of data, and we get more events each day, so it is becoming unmanageable: the file is 450 MB in size, and the calculations take too long.
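To make the problem concrete, the current flow looks roughly like this (a minimal sketch; all names here are illustrative, not the real wasp-bot code):

```typescript
// Sketch of the current flow: the whole event history is loaded, extended,
// saved back, and crunched on every run. Names are hypothetical.
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

interface Event { distinctId: string; event: string; timestamp: string; }

function loadAllEvents(cacheFile: string): Event[] {
  return fs.existsSync(cacheFile)
    ? JSON.parse(fs.readFileSync(cacheFile, "utf8"))
    : [];
}

// On each run, everything, old and new, is held in memory at once.
function run(cacheFile: string, fetchNewEvents: () => Event[]): Event[] {
  const all = [...loadAllEvents(cacheFile), ...fetchNewEvents()];
  fs.writeFileSync(cacheFile, JSON.stringify(all)); // file only ever grows
  return all; // analytics then iterate over the full history
}

// Example: two runs against a scratch file in the temp dir.
const cache = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "wasp-")), "events.json");
run(cache, () => [{ distinctId: "u1", event: "signup", timestamp: "2024-01-01T00:00:00Z" }]);
const all = run(cache, () => [{ distinctId: "u2", event: "signup", timestamp: "2024-01-02T00:00:00Z" }]);
```

Both the file size and the in-memory working set grow linearly with total history, which is why this stops scaling.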

Solutions:

  1. Cache everything we can; load as little as we need.
    Right now we really do need all the events, because we calculate things like when each user first used Wasp, which requires the raw event data from the very start until today. We could avoid that by caching not just raw event data but also intermediate results. E.g. we could calculate each user's starting date once, cache it, and only update it for new users; then we no longer need the old raw event data (at least not for that reason).
    More generally, a lot of calculations likely never need to be repeated, because they concern the past, unless we change the logic behind them. So we should break our calculations down into intermediate (or even final) results, cache those as much as we can, and reuse them; then we need much less data loaded into memory and can crunch it much faster. Would it even make sense to have some kind of small "framework" for defining such "intermediate computations", so they get cached and reused automatically? Hm.
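The per-user starting-date example from point 1 could be cached incrementally along these lines (a hedged sketch; `FirstSeenCache` and `updateFirstSeen` are hypothetical names, not existing wasp-bot code):

```typescript
// Intermediate-result caching: each user's first-event date is computed once,
// persisted, and afterwards only updated from *new* events, so the old raw
// events are no longer needed for this calculation. Names are hypothetical.

interface Event { distinctId: string; timestamp: string; }

// Cached intermediate result: userId -> ISO timestamp of their first event.
type FirstSeenCache = Record<string, string>;

function updateFirstSeen(cache: FirstSeenCache, newEvents: Event[]): FirstSeenCache {
  const updated = { ...cache };
  for (const e of newEvents) {
    const prev = updated[e.distinctId];
    if (prev === undefined || e.timestamp < prev) {
      updated[e.distinctId] = e.timestamp; // ISO-8601 strings compare chronologically
    }
  }
  return updated;
}
```

The cache itself would be persisted between runs (as a small file, or a table), so each run only feeds it the newly fetched events.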
  2. Reconsider how we store the data, so we can load only what we need.
    Instead of one big file, maybe we should use multiple files (e.g. one per quarter), so we can easily load only a portion of the data.
    Or maybe we should instead store the data in SQLite, or something more serious like BigQuery or Redshift or similar; we should investigate.
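The per-quarter idea from point 2 could be sketched like this (illustrative only; the bucketing scheme and function names are assumptions, not a decided design):

```typescript
// Per-quarter partitioning: events are bucketed into one group per quarter,
// so a calculation over, say, one quarter only loads that file instead of the
// whole 450 MB history. Names and file layout are hypothetical.

interface Event { distinctId: string; timestamp: string; }

// "2024-05-17T..." -> "2024-Q2"
function quarterOf(isoTimestamp: string): string {
  const year = isoTimestamp.slice(0, 4);
  const month = Number(isoTimestamp.slice(5, 7));
  return `${year}-Q${Math.ceil(month / 3)}`;
}

function partitionByQuarter(events: Event[]): Map<string, Event[]> {
  const parts = new Map<string, Event[]>();
  for (const e of events) {
    const q = quarterOf(e.timestamp);
    if (!parts.has(q)) parts.set(q, []);
    parts.get(q)!.push(e);
  }
  return parts; // each entry would be written to e.g. events-2024-Q2.json
}
```

SQLite would give us the same "load only what we need" property via indexed queries (e.g. by timestamp range) without us managing the file layout ourselves, at the cost of a schema and a migration of the existing data.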
