Conversation

@kshedden

Not sure what you have in mind in terms of data format, metadata, etc. Let me know and I will revise the PR as needed.

@josef-pkt
Member

@vincentarelbundock
I think when we build this out we will use a layout like
https://github.com/vincentarelbundock/Rdatasets
which would put nhanes one level lower, in a csv folder.

In general:
There was discussion on the nipy mailing list about making installable Python dataset packages. That makes sense if users will want most of the available data, or if the packages don't get too large, but not so much if we want to use just a few datasets, as in Rdatasets.
I didn't pay a lot of attention to the details of dataset packages and meta information. For now the Rdatasets pattern plus our datasets inside statsmodels seems to be enough.
It's possible to rethink this in the future if someone is interested. I saw that there are also related dataset packages for Julia (one of them a translation of Vincent's Rdatasets), which will have similar installation and license/copyright questions as we do.

@josef-pkt
Member

On a specific question:
Is the NHANES .gz file an archive with a single csv file, or does it contain a collection of csv files?
What's the advantage of using an archive instead of a plain csv file?

I'm fine either way, but AFAIK we would have to write py2/py3-compatible helper functions to get the data out of an archive file. (The statespace notebooks are doing that, and it was what prompted me to look into creating smdatasets.)

@kshedden
Author

There's just a single file in there, in csv format. It's only compressed to save space/bandwidth.

I don't feel strongly about this.
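
For reference, a minimal sketch of the kind of helper discussed above, assuming Python 3 only and a .gz file that wraps exactly one csv file. The function name `read_gzipped_csv` and the file name in the usage note are hypothetical, not part of this PR or of statsmodels, and the sketch does not attempt py2 compatibility.

```python
import csv
import gzip


def read_gzipped_csv(path, encoding="utf-8"):
    """Read a gzip-compressed csv file and return its rows as a list of dicts.

    Hypothetical helper: assumes the .gz file wraps a single csv file
    whose first row holds the column names.
    """
    # Text mode ("rt") decompresses and decodes in one step; newline=""
    # leaves line-ending handling to the csv module.
    with gzip.open(path, "rt", encoding=encoding, newline="") as handle:
        return list(csv.DictReader(handle))
```

With a file such as `nhanes.csv.gz` (a placeholder name), `read_gzipped_csv("nhanes.csv.gz")` would return one dict per data row, with every value still a string. Type conversion is left to the caller.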
