Wanted: open data suitable for a data science project!
Every year, we ask our second-year bachelor students to do a project. We give them a (big) dataset (preferably in JSON or as a bunch of files to be combined), pose a research question, and ask them to perform data cleaning and exploration related to that question using #Rstats.
I've used most of the obvious choices, so I'm turning to Fedi to find new, interesting datasets. If you don't have any suggestions, sharing this post helps too!
@JorisMeys have you heard of Jeffrey Epstein?
-
@JorisMeys the VAST challenge would be perfect for this! https://github.com/vast-challenge
Background: https://datastori.es/data-stories-24-vast-challenge/
-
@JorisMeys It might be an obvious choice, but the UK Data Service hosts many datasets. A minority of the datasets are open, and they tend to be pretty clean already, but the massive scale of some of them makes them interesting.
-
@JorisMeys For example, the 2019 World Risk Poll is a global survey of fears and attitudes towards risk. It has lots of demographic detail too so you can simulate the effect of different sampling strategies.
It's a 70 MB plaintext spreadsheet with 150,000 rows. It's large enough that on my last visit to the Apple Store, I could compare devices by manually timing how long it took to open the file. (It took 7 seconds on anything with an M4 chip; 5 seconds with an M5.)
https://datacatalogue.ukdataservice.ac.uk/studies/study/8739#details
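A minimal sketch of what comparing sampling strategies could look like, in Python and with made-up data standing in for the poll. The `region` and `worry` fields, the group sizes and the group means are all invented for illustration; none of them come from the World Risk Poll itself.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, rng):
    """Draw n rows uniformly at random, without replacement."""
    return rng.sample(rows, n)


def stratified_sample(rows, n, key, rng):
    """Draw about n rows, giving each stratum a share proportional to its size."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, k))
    return sample


def mean(rows, field):
    """Arithmetic mean of one numeric field."""
    return sum(row[field] for row in rows) / len(rows)


# Synthetic population: three hypothetical regions with different average scores.
rng = random.Random(42)
population = [
    {"region": region, "worry": rng.gauss(mu, 1.0)}
    for region, mu, size in [("A", 2.0, 6000), ("B", 5.0, 3000), ("C", 8.0, 1000)]
    for _ in range(size)
]

srs = simple_random_sample(population, 500, rng)
strat = stratified_sample(population, 500, "region", rng)

print("population mean:", round(mean(population, "worry"), 3))
print("simple random  :", round(mean(srs, "worry"), 3))
print("stratified     :", round(mean(strat, "worry"), 3))
```

Repeating the two draws many times and comparing the spread of the estimates is the natural next step; with real survey data the strata would be demographic columns from the file.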
-
@adenoz Thanks, I didn't know that one. Very interesting, and indeed the kind of data structure I am looking for. It's a bit far from their major (they're bio-engineering students), so I might opt for another dataset closer to that if I find one. But this one is definitely flagged and stored for future use.
@JorisMeys Students in bio-engineering might enjoy finding out facts about macromolecular structures from the PDB.
Both the RCSB PDB and PDBe portals offer APIs to query the meta-data about structures deposited in the PDB:
https://www.rcsb.org/docs/programmatic-access/web-apis-overview
https://www.ebi.ac.uk/pdbe/pdbe-rest-api
The very basics let you replicate the online dashboards on these portals, showing number of entries deposited or released per year. But the APIs give access to every piece of meta-data in there, so you can really ask sophisticated questions. I played a bit with it a while ago, see some examples here: https://guillawme.github.io/insights-from-the-pdb/
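A rough sketch of the per-year tally, in Python. It assumes each entry record is a JSON object carrying its dates under `rcsb_accession_info` (with `deposit_date` and `initial_release_date` as ISO strings), which is my reading of the RCSB entry metadata and should be checked against the API docs linked above. The sample records and their IDs are invented placeholders, not real PDB entries.

```python
from collections import Counter


def entries_per_year(entries, date_field="deposit_date"):
    """Count entries per calendar year from an ISO date string in each record.

    Assumes the date sits at entry["rcsb_accession_info"][date_field];
    records without that field are skipped.
    """
    counts = Counter()
    for entry in entries:
        date = entry.get("rcsb_accession_info", {}).get(date_field)
        if date:
            counts[int(date[:4])] += 1
    return dict(sorted(counts.items()))


# Invented placeholder records in the assumed shape.
sample_entries = [
    {"id": "demo1", "rcsb_accession_info": {"deposit_date": "1984-03-07T00:00:00+0000"}},
    {"id": "demo2", "rcsb_accession_info": {"deposit_date": "1984-11-20T00:00:00+0000"}},
    {"id": "demo3", "rcsb_accession_info": {"deposit_date": "1990-06-01T00:00:00+0000"}},
    {"id": "demo4", "rcsb_accession_info": {}},  # no date: skipped
]

print(entries_per_year(sample_entries))
```

Passing `date_field="initial_release_date"` would give the released-per-year view instead, under the same assumption about field names.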
-
@JorisMeys @genenetwork How big is big? For biology/genetics the GeneNetwork repository (https://genenetwork.org) has a lot of whole-transcriptome datasets. Each is not huge but some interesting analyses could also be done by combining them in creative ways. I've used this resource a lot for bioinformatics training.
Most of the datasets can be downloaded as flatfiles and the API returns JSON.
-
@JorisMeys as other folks chiming in with their national data portals are doing, here are 3'100 OGD datasets from Switzerland with a JSON format: https://opendata.swiss/en/dataset?q=&res_format=JSON&sort=max%28res_latest_issued%2C+res_latest_modified%29+desc
-
@JorisMeys You could use Crossref metadata. Point them to api.crossref.org and let them explore millions of records about different kinds of scholarly content.
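A small sketch of a first query, in Python. It uses the `/works` endpoint with the `query` and `rows` parameters and expects the records under `message.items`. The helper names and the sample payload are made up for illustration, and the network call is left as a commented-out usage line so the parsing can be tried offline.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.crossref.org/works"


def build_works_url(query, rows=5):
    """Build a /works search URL from a free-text query and a row limit."""
    return BASE_URL + "?" + urllib.parse.urlencode({"query": query, "rows": rows})


def fetch_works(query, rows=5):
    """Fetch one page of results as a parsed JSON object (needs network access)."""
    with urllib.request.urlopen(build_works_url(query, rows)) as response:
        return json.load(response)


def summarize_items(payload):
    """Reduce each record to DOI, first title (if any) and type."""
    items = payload.get("message", {}).get("items", [])
    return [
        {
            "doi": item.get("DOI"),
            "title": (item.get("title") or [None])[0],
            "type": item.get("type"),
        }
        for item in items
    ]


# Hand-written payload in the expected shape; the DOIs are placeholders.
sample_payload = {
    "status": "ok",
    "message": {
        "items": [
            {"DOI": "10.0000/example.1", "title": ["An example article"], "type": "journal-article"},
            {"DOI": "10.0000/example.2", "type": "dataset"},  # records may lack a title
        ]
    },
}

print(build_works_url("protein folding"))
print(summarize_items(sample_payload))
# Live usage: summarize_items(fetch_works("protein folding"))
```

Records are not uniform (titles, authors and dates can be missing or nested), which is part of what makes this a decent cleaning exercise.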