Skip to content

Conversation

@ziadsawalha
Copy link

Giving this a stab. Feedback welcome!


if use_batches
io = file_or_content_io
separator = detect_separator(io.read(1024))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should extract separator to a method and memoize it as it's called twice.

batch_size: batch_size,
use_batches: use_batches
)
end
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of overwriting initialize, what do you think about extracting csv_reader to a method and memoizing it?

end
end

def sanitize_content_stream(io)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great method but it's not used. Can you make the Enumerator use it?

@pcreux
Copy link
Owner

pcreux commented Sep 3, 2024

This is great @ziadsawalha ! I added a couple of comments.

Could creating a new class simplify the codebase? e.g. Runner could toggle between CSVBulkReader and CSVBatchReader depending on use_batches.

Regarding the public API, use_batches doesn't sound great IMO. What about batch_load?

We could release this as the next minor release and I would like to release a major version where we remove the virtus dependency. Thoughts?

@ziadsawalha
Copy link
Author

Thanks for the good feedback. I'll submit updates when I get to it... hopefully tonight (Europe).

@ziadsawalha
Copy link
Author

I've updated the PR with the following changes:

  • Extracted bulk/streaming code to a CSVBatchReader which inherits from CSVReader (I opted not to rename CSVReader ... unless you insist!)
  • Extended Config and Runner to support enabling batch loading
  • Renamed flag to batch_load
  • Added some tests (I think more are needed)
  • Minor edits to files I touched

I struggled with trying to support streams that don't have rewind (which means we need to detect the separator and headers on load and then not try to restart the stream). That complicated things a bit, IMO.

More feedback welcome while I try to add more tests and validate this in real life use cases.

@pcreux
Copy link
Owner

pcreux commented Sep 16, 2024

Great, thank you for making these changes @ziadsawalha !

How does this work against real-life use cases?

What kind of streams don't have rewind? S3?

Could we close and re-open (or duplicate?!) the stream if it doesn't support rewinding?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants