Data fundamentals
Contents
By the end of this homework session you should be able to:
- Describe data according to its structure and contents.
- Calculate descriptive summaries over datasets.
- Apply high-level functions in
dplyr
andtidyr
for working with data.
Introduction
This homework requires you to apply the concepts and skills developed in the class session on data fundamentals. Do complete each of the tasks and be sure to save your work.
Task 1: Describe data
Complete the data description table below identifying the measurement level of each variable in the (fictional) New York bikeshare stations dataset below.
Variable name | Variable value | Measurement level |
---|---|---|
name |
“Central Park” | |
capacity |
80 | |
rank_capacity |
45 | |
date_opened |
“2014-05-23” | |
longitude |
-74.00149746 | |
latitude |
40.74177603 |
Task 2: Diagnose data
Below are two different tables with results from UK General Elections. We will be working with these data in the next session. Identify whether or not each is in tidy format (Wickham 2014). If they are not, provide a layout for a tidy version. No need to use code here, just edit the markdown table. If you’re struggling to work out how to organise markdown tables, you may wish to use this tables generator.
UK General Election Results 2019
party | percent_vote | num_mps |
---|---|---|
Conservative | 43.6 | 365 |
Labour | 32.2 | 202 |
Scottish National Party | 3.9 | 48 |
Liberal Democrats | 11.6 | 11 |
Democratic Union Party | 0.8 | 9 |
UK General Election Results 2017 and 2019
party | percent_vote_2017 | num_mps_2017 | percent_vote_2019 | num_mps_2019 |
---|---|---|---|---|
Conservative | 42.4 | 317 | 43.6 | 365 |
Labour | 40.0 | 262 | 32.2 | 202 |
Scottish National Party | 3.0 | 35 | 3.9 | 48 |
Liberal Democrats | 7.4 | 12 | 11.6 | 11 |
Democratic Union Party | 0.9 | 10 | 0.8 | 8 |
Task 3: Fix data
In the 02-template.Rmd file for this session I have provided links to two derived tables (ny_spread_columns
and ny_spread_rows
) from the New York bikeshare trip data that are not in a tidy format.
Using functions from dplyr
and tidyr
reorganise these data so that they conform to the rules of tidy data (Wickham 2014).
A candidate tidy organisation of the data is below. Each row is an origin-destination pair for a weekday or weekend, and each variable describes:
o_station
: station id of the origin stationd_station
: station id of the destination stationwkday
: identifies whether the OD pair describes weekday or weekend ny_tripscount
: count of trips recorded for that observation (OD pair and weekday/weekend)dist
: total distance (cumulative) in kms of all trips recorded for that observation (OD pair and weekday/weekend)duration
: total duration in minutes (cumulative) of all trips recorded for that observation (OD pair and weekday/weekend)
ny_spread_rows
as I deliberately made ny_spread_columns
quite challenging. The intention here was to replicate the sorts of arduous data formatting operations that need to be performed when working with real datasets. As always there are different approaches to this, but it can be achieved with use of pivot_longer
, pivot_wider
, plus a call to separate()
. This may be one to post to the course Slack.
## # A tibble: 386,762 x 6
## o_station d_station wkday count dist duration
## <int> <int> <chr> <int> <dbl> <dbl>
## 1 72 116 weekend 1 1.15 18.2
## 2 72 127 weekend 10 18.0 339.
## 3 72 128 weekend 2 3.18 69.6
## 4 72 146 weekend 12 27.6 402.
## 5 72 151 weekend 2 2.87 54.9
## 6 72 161 weekend 2 2.52 64.8
## 7 72 164 weekend 5 13.3 73.3
## 8 72 167 weekend 1 2.07 17.2
## 9 72 168 weekend 2 1.70 42.7
## 10 72 173 weekend 9 9.59 194.
## # … with 386,752 more rows
Task 4: Compute from data
Using dplyr
functions, calculate the average distance, duration and speed of trips occurring for each observation. Print out to the Console the top 10 most heavily cycled OD pairs (and their associated summary statistics) separately for weekdays and weekends. You may wish to join on your ny_stations
table in order to fetch the station names corresponding to the origin and destination stations.