class: center, middle

# Tidy your data

### unleash the power of data science tools

https://github.com/stijnvanhoey/workshop_ywp20

YWP workshop, 13 February 2020

Stijn Van Hoey
@SVanHoey
stijnvanhoey
![:scale 20%](./static/img/logo_fluves.jpg)

---
class: center, middle, section_background

# Who

---
class: left, middle

## Previously

[@Biomath](http://www.biomath.ugent.be/), Ghent University

.center[![:scale 80%](./static/img/intro_biomath.gif)]

---
class: left, middle

## Currently

- Research Software Engineer [@Fluves](https://www.fluves.com/)

.center[![:scale 80%](./static/img/intro_fluves.png)]

---
class: center, middle

## Currently

- Research Software Engineer [@Fluves](https://www.fluves.com/)

.center[![:scale 80%](./static/img/leak_fluves.png)]

Leak & intrusion detection in pipelines
using fiber optic technology

---
class: left, middle

## Freelance developer and teacher

.center[![:scale 90%](./static/img/intro_teaching.png)]

---
class: center, middle, section_background

# Motivation

---
class: middle, center

.center[![:scale 100%](./static/img/good_enough_practices_computational_science.png)]

.footnote[Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., Teal, T., Wilson, G. (2016).
Good Enough Practices for Scientific Computing, 1–30.]

---
class: middle, center

.center[![:scale 100%](./static/img/good_enough_practices_computational_science_sec4.png)]

.footnote[Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., Teal, T., Wilson, G. (2016).
Good Enough Practices for Scientific Computing, 1–30.]

---
class: center, middle

.emphasize.right[
Create the data you wish
to see in the world ]

--
count: false

.emphasize.left[
Data that is easy to model,
visualise and aggregate ]

---
class: center, middle, section_background

# Tidy your data

---
class: center, middle
background-image: url(./static/img/tidy_data_paper.png)

.footnote[Wickham, H. (2014).
Tidy Data. Journal of Statistical Software,
Vol. 59, Issue 10. doi:10.18637/jss.v059.i10]

---
class: center, middle

| WWTP | Treatment A | Treatment B |
|:------|-------------|-------------|
| Destelbergen | 8.0 | 6.3 |
| Landegem | 7.5 | 5.2 |
| Dendermonde | 8.3 | 6.2 |
| Eeklo | 6.5 | 7.2 |

---
class: center, middle

| WWTP | Treatment | pH |
|:------|:-------------:|:-------------:|
| Destelbergen | A | 8.0 |
| Landegem | A | 7.5 |
| Dendermonde | A | 8.3 |
| Eeklo | A | 6.5 |
| Destelbergen | B | 6.3 |
| Landegem | B | 5.2 |
| Dendermonde | B | 6.2 |
| Eeklo | B | 7.2 |

---
class: center, middle

.center[![:scale 100%](./static/img/tidy_data_scheme.png)]

---
class: center, middle, subsection_background

# Action!

---
class: center, middle

Download the [untidy data](https://github.com/stijnvanhoey/workshop_ywp20/raw/master/data/data_messy.xls)

.center[![:scale 80%](./static/img/spreadsheet_messy_data.png)]

---
class: left

Data from a [research paper](https://www.hindawi.com/journals/jchem/2019/5405016/):

> Seasonal and Spatial Variation of Dissolved Oxygen and Nutrients in Padaviya Reservoir, Sri Lanka

Water quality (vertical distribution) data from Padaviya Reservoir, an ancient man-made irrigation reservoir. The original data is available on [figshare](https://figshare.com/articles/Padaviya_Reservoir_Water_Quality_Data_-_2016/8971775), but it was _mistreated_ for the sake of this workshop...

.center[![:scale 60%](./static/img/waterquality_data_paper.png)]

.footnote[Siriwardana, C., Cooray, A. T., Liyanage, S. S. and Koliyabandara, S. M. P. A. (2019),
Seasonal and Spatial Variation of Dissolved Oxygen and Nutrients in Padaviya Reservoir, Sri Lanka.
Journal of Chemistry, Vol. 2019, doi:10.1155/2019/5405016.]

---
class: left, middle

## Exercise

Starting from [`data_messy.xls`](https://github.com/stijnvanhoey/workshop_ywp20/raw/master/data/data_messy.xls):

1. Download the [untidy data](https://github.com/stijnvanhoey/workshop_ywp20/raw/master/data/data_messy.xls)
2. Open the data in a spreadsheet program
3. You will see two tabs, with measurements from January and from May
4. Work together with the person next to you. Identify what is wrong with this spreadsheet (take notes) and transform the data into a single tidy data set.

__Tip:__

- Never modify your original (raw) data!
- Keep track of the steps you take during your cleanup

---
class: center, middle

...

---
class: left, middle

### Remember:

* __Never modify your raw data__. Always make a copy before making any changes.
* __Keep track__ of all of the steps you take to clean your data in a plain text file.
* Save your data in an [__open and static format__](http://www.datacarpentry.org/spreadsheet-ecology-lesson/05-exporting-data/) such as CSV.
* Organize your data according to __tidy data__ principles.

__Tip:__ You can also [add data validation](https://datacarpentry.org/spreadsheet-ecology-lesson/04-quality-control/index.html) in spreadsheets to prevent accidentally entering invalid data!

.footnote[__Note:__ This lesson is a derivative of the [Data Carpentry spreadsheet workshop](https://datacarpentry.org/spreadsheet-ecology-lesson/01-format-data/index.html). All credit to the Data Carpentry community!]

---
class: center, middle, section_background

# Tidy data visualisation

---
class: center, middle, subsection_background

# Action!

---
class: center, middle

Click on the launch button:

[![:scale 50%](https://mybinder.org/badge_logo.svg)](http://mybinder.org/v2/gh/stijnvanhoey/workshop_ywp20/master?urlpath=lab)

Wait (_fingers crossed_) until you see:

.center[![:scale 80%](./static/img/jupyter_intro.png)]

---
class: center, middle

In the left menu: first, double-click `src`; next, double-click `visualisations.ipynb`. You should see:

![:scale 80%](./static/img/jupyter_notebook.png)

---
class: left

## Before we start...

![:scale 100%](./static/img/sticky_concept.png)

---
class: center, middle

...

---
class: center, middle, section_background

# To conclude

---
class: center, middle

## Tidy and GoG all the way?

> With only one categorical variable or a single time series
(e.g. continuous logging), the added value is LOW.

| ID | variable 1 | variable 2 |
|----|------------|------------|
| 1 | 0.2 | 0.8 |
| 2 | 0.3 | 0.1 |
| ... | ... | ... |

| datetime | station 1 | station 2 |
|------------|-------------|-------------|
| 2017-12-20 17:50 | 0.2 | 0.8 |
| 2017-12-20 17:51 | 0.3 | 0.1 |
| ... | ... | ... |

> When working with different experiments, different conditions, (factorial) experimental designs,
the added value is HIGH.

---
class: center, middle

.center[![:scale 100%](./static/img/data_organization_spreadsheet.png)]

---
class: center, middle

.center[![:scale 100%](./static/img/good_enough_practices_computational_science.png)]

---
class: left, middle

## More?

* [Data Carpentry](http://www.datacarpentry.org/lessons/) and [Software Carpentry](https://software-carpentry.org/lessons/) courses
* Doctoral schools Ghent University [data manipulation course](https://github.com/jorisvandenbossche/DS-python-data-analysis)
* rOpenSci [reproducibility guide](http://ropensci.github.io/reproducibility-guide/)

---
class: center, middle

Stijn Van Hoey
@SVanHoey
stijnvanhoey
![:scale 20%](./static/img/logo_fluves.jpg)
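
---
class: left

## Backup: from wide to tidy in code

The reshaping from the wide WWTP table to the tidy one shown earlier can also be scripted instead of done by hand. The following is a minimal sketch, assuming pandas is available; it is not prescribed by these slides, and the names `wide`, `tidy` and `mean_ph` are illustrative only.

```python
import pandas as pd

# Wide layout from the example slide: one column per treatment.
wide = pd.DataFrame(
    {
        "WWTP": ["Destelbergen", "Landegem", "Dendermonde", "Eeklo"],
        "Treatment A": [8.0, 7.5, 8.3, 6.5],
        "Treatment B": [6.3, 5.2, 6.2, 7.2],
    }
)

# melt() stacks the treatment columns into one (Treatment, pH) pair per row,
# so that every row is a single observation.
tidy = wide.melt(id_vars="WWTP", var_name="Treatment", value_name="pH")

# Keep only the treatment label ("A" or "B") as the variable value.
tidy["Treatment"] = tidy["Treatment"].str.replace("Treatment ", "", regex=False)

# With tidy data, aggregating per treatment is a single groupby.
mean_ph = tidy.groupby("Treatment")["pH"].mean()

print(tidy)
print(mean_ph)
```

Working on a copy in code leaves the raw spreadsheet untouched, and the script itself serves as the record of the cleanup steps.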