Data fundamentals

Materials from class on Tuesday, August 9, 2022

Introduction
Task 1: Describe data
Task 2: Diagnose data
- UK General Election Results 2019
- UK General Election Results 2017 and 2019
Task 3: Fix data
Task 4: Compute from data
References

By the end of this homework session you should be able to:

Describe data according to its structure and contents.
Calculate descriptive summaries over datasets.
Apply high-level functions in dplyr and tidyr for working with data.

Introduction

This homework requires you to apply the concepts and skills developed in the class session on data fundamentals. Complete each of the tasks and be sure to save your work.

Task 1: Describe data

Complete the data description table below by identifying the measurement level of each variable in the (fictional) New York bikeshare stations dataset.

| Variable name | Variable value | Measurement level |
|---|---|---|
| `name` | “Central Park” | |
| `capacity` | 80 | |
| `rank_capacity` | 45 | |
| `date_opened` | “2014-05-23” | |
| `longitude` | -74.00149746 | |
| `latitude` | 40.74177603 | |
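
If it helps to get started, you can look at how R stores each of these values. The sketch below builds a one-row tibble from the table above (the object name `station` is my own and does not appear in the template) and lists the class of each column. Treat the result as a hint only: the storage type of a column does not determine its measurement level, so you still need to reason about what each variable means.

```r
library(tibble)

# One-row tibble using the values from the table above.
# `station` is an illustrative name, not an object from the course template.
station <- tibble(
  name = "Central Park",
  capacity = 80L,
  rank_capacity = 45L,
  date_opened = as.Date("2014-05-23"),
  longitude = -74.00149746,
  latitude = 40.74177603
)

# First class of each column: a storage type, not a measurement level.
col_classes <- vapply(station, function(x) class(x)[1], character(1))
print(col_classes)
```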

Task 2: Diagnose data

Below are two different tables with results from UK General Elections. We will be working with these data in the next session. Identify whether or not each table is in tidy format (Wickham 2014). If a table is not, provide a layout for a tidy version. No need to use code here, just edit the markdown table. If you’re struggling to work out how to organise markdown tables, you may wish to use this tables generator.

UK General Election Results 2019

| party | percent_vote | num_mps |
|---|---|---|
| Conservative | 43.6 | 365 |
| Labour | 32.2 | 202 |
| Scottish National Party | 3.9 | 48 |
| Liberal Democrats | 11.6 | 11 |
| Democratic Unionist Party | 0.8 | 8 |

UK General Election Results 2017 and 2019

| party | percent_vote_2017 | num_mps_2017 | percent_vote_2019 | num_mps_2019 |
|---|---|---|---|---|
| Conservative | 42.4 | 317 | 43.6 | 365 |
| Labour | 40.0 | 262 | 32.2 | 202 |
| Scottish National Party | 3.0 | 35 | 3.9 | 48 |
| Liberal Democrats | 7.4 | 12 | 11.6 | 11 |
| Democratic Unionist Party | 0.9 | 10 | 0.8 | 8 |
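
Although no code is needed for this task, it may help to see how a reshaped layout could be produced once you have decided on one. The sketch below is one candidate only, not the required answer: it takes the first two rows of the 2017 and 2019 table and moves the year out of the column names into its own variable with `pivot_longer()`. The object names `elections_wide` and `elections_tidy` are my own.

```r
library(tidyr)
library(tibble)

# First two rows of the 2017 and 2019 table, typed in by hand.
elections_wide <- tribble(
  ~party,         ~percent_vote_2017, ~num_mps_2017, ~percent_vote_2019, ~num_mps_2019,
  "Conservative", 42.4,               317,           43.6,               365,
  "Labour",       40.0,               262,           32.2,               202
)

# Split each column name into a measure and a year.
# ".value" keeps the measure part as a column name; the year becomes a variable.
elections_tidy <- pivot_longer(
  elections_wide,
  cols = -party,
  names_to = c(".value", "year"),
  names_pattern = "(.*)_(\\d{4})"
)

print(elections_tidy)
```

Note that `year` comes out as a character column here; whether to convert it to an integer is a design choice that depends on how you plan to use it.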

Task 3: Fix data

In the 02-template.Rmd file for this session I have provided links to two derived tables (ny_spread_columns and ny_spread_rows) from the New York bikeshare trip data that are not in a tidy format.

Using functions from dplyr and tidyr reorganise these data so that they conform to the rules of tidy data (Wickham 2014).

A candidate tidy organisation of the data is below. Each row is an origin-destination pair for a weekday or weekend, and each variable describes:

o_station: station id of the origin station
d_station: station id of the destination station
wkday: identifies whether the OD pair describes weekday or weekend trips
count: count of trips recorded for that observation (OD pair and weekday/weekend)
dist: total distance (cumulative) in km of all trips recorded for that observation (OD pair and weekday/weekend)
duration: total duration in minutes (cumulative) of all trips recorded for that observation (OD pair and weekday/weekend)

You may wish to start with reorganising ny_spread_rows, as I deliberately made ny_spread_columns quite challenging. The intention here was to replicate the sorts of arduous data formatting operations that need to be performed when working with real datasets. As always there are different approaches to this, but it can be achieved with pivot_longer(), pivot_wider() and a call to separate(). This may be one to post to the course Slack.

## # A tibble: 386,762 x 6
##    o_station d_station wkday   count  dist duration
##        <int>     <int> <chr>   <int> <dbl>    <dbl>
##  1        72       116 weekend     1  1.15     18.2
##  2        72       127 weekend    10 18.0     339.
##  3        72       128 weekend     2  3.18     69.6
##  4        72       146 weekend    12 27.6     402.
##  5        72       151 weekend     2  2.87     54.9
##  6        72       161 weekend     2  2.52     64.8
##  7        72       164 weekend     5 13.3      73.3
##  8        72       167 weekend     1  2.07     17.2
##  9        72       168 weekend     2  1.70     42.7
## 10        72       173 weekend     9  9.59    194.
## # … with 386,752 more rows
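
To illustrate the general pattern, here is a minimal sketch on a toy table. It is not the solution for the provided files: the column names in `toy_spread_columns` (for example `count_weekend`) are my assumption about what a "spread columns" layout might look like, and the weekday values are made up. Check the actual column names in ny_spread_columns before adapting it.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide layout: one column per measure and day type.
# Column names and weekday values are invented for illustration.
toy_spread_columns <- tribble(
  ~o_station, ~d_station, ~count_weekend, ~count_weekday, ~dist_weekend, ~dist_weekday, ~duration_weekend, ~duration_weekday,
  72L,        116L,       1,              3,              1.15,          3.40,          18.2,              51.0,
  72L,        127L,       10,             4,              18.0,          7.10,          339,               120
)

toy_tidy <- toy_spread_columns %>%
  # 1. Stack every measure column into name/value pairs.
  pivot_longer(
    cols = -c(o_station, d_station),
    names_to = "measure_wkday",
    values_to = "value"
  ) %>%
  # 2. Split names such as "count_weekend" into measure and day type.
  separate(measure_wkday, into = c("measure", "wkday"), sep = "_") %>%
  # 3. Spread the measures back out so each becomes its own column.
  pivot_wider(names_from = measure, values_from = value) %>%
  # Counts were coerced to double while stacked; restore the integer type.
  mutate(count = as.integer(count))

print(toy_tidy)
```

The longer-then-wider round trip is needed because the toy column names encode two things at once (the measure and the day type); a table like ny_spread_rows, where measures are stacked in rows, would typically need only the `pivot_wider()` step.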

Task 4: Compute from data

Using dplyr functions, calculate the average distance, duration and speed of trips for each observation. Print to the console the 10 most heavily cycled OD pairs (and their associated summary statistics), separately for weekdays and weekends. You may wish to join on your ny_stations table in order to fetch the station names corresponding to the origin and destination stations.
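
A minimal sketch of one way to approach this, on a toy table in the tidy layout from Task 3. The object names `toy_tidy` and `toy_summary` and the new column names are my own, and the weekday row is made up. I assume here that "most heavily cycled" means the largest `count`, and that speed is wanted in km per hour, so the duration in minutes is divided by 60.

```r
library(dplyr)
library(tibble)

# Toy data in the candidate tidy layout; the weekday row is invented.
toy_tidy <- tribble(
  ~o_station, ~d_station, ~wkday,    ~count, ~dist, ~duration,
  72L,        116L,       "weekend", 1L,     1.15,  18.2,
  72L,        127L,       "weekend", 10L,    18.0,  339,
  72L,        128L,       "weekend", 2L,     3.18,  69.6,
  72L,        127L,       "weekday", 4L,     7.10,  120
)

toy_summary <- toy_tidy %>%
  mutate(
    avg_dist = dist / count,            # km per trip
    avg_duration = duration / count,    # minutes per trip
    avg_speed = dist / (duration / 60)  # km per hour over all trips
  ) %>%
  # Rank within weekday and weekend separately, keeping up to 10 rows each.
  group_by(wkday) %>%
  slice_max(count, n = 10) %>%
  ungroup()

print(toy_summary)
```

By default `slice_max()` keeps ties, so a group can return more than 10 rows if several OD pairs share the same count; set `with_ties = FALSE` if you need exactly 10.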

References

Wickham, H. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.

Thu, 21 Apr 2022