Data fundamentals

Materials from class on Tuesday, August 9, 2022

Contents

By the end of this homework session you should be able to:

  • Describe data according to its structure and contents.
  • Calculate descriptive summaries over datasets.
  • Apply high-level functions in dplyr and tidyr for working with data.

Introduction

This homework requires you to apply the concepts and skills developed in the class session on data fundamentals. Do complete each of the tasks and be sure to save your work.

Task 1: Describe data

Complete the data description table below identifying the measurement level of each variable in the (fictional) New York bikeshare stations dataset below.

Variable name Variable value Measurement level
name “Central Park”
capacity 80
rank_capacity 45
date_opened “2014-05-23”
longitude -74.00149746
latitude 40.74177603

Task 2: Diagnose data

Below are two different tables with results from UK General Elections. We will be working with these data in the next session. Identify whether or not each is in tidy format (Wickham 2014). If they are not, provide a layout for a tidy version. No need to use code here, just edit the markdown table. If you’re struggling to work out how to organise markdown tables, you may wish to use this tables generator.

UK General Election Results 2019

party percent_vote num_mps
Conservative 43.6 365
Labour 32.2 202
Scottish National Party 3.9 48
Liberal Democrats 11.6 11
Democratic Union Party 0.8 9

UK General Election Results 2017 and 2019

party percent_vote_2017 num_mps_2017 percent_vote_2019 num_mps_2019
Conservative 42.4 317 43.6 365
Labour 40.0 262 32.2 202
Scottish National Party 3.0 35 3.9 48
Liberal Democrats 7.4 12 11.6 11
Democratic Union Party 0.9 10 0.8 8

Task 3: Fix data

In the 02-template.Rmd file for this session I have provided links to two derived tables (ny_spread_columns and ny_spread_rows) from the New York bikeshare trip data that are not in a tidy format.

Using functions from dplyr and tidyr reorganise these data so that they conform to the rules of tidy data (Wickham 2014).

A candidate tidy organisation of the data is below. Each row is an origin-destination pair for a weekday or weekend, and each variable describes:

  • o_station: station id of the origin station
  • d_station: station id of the destination station
  • wkday: identifies whether the OD pair describes weekday or weekend ny_trips
  • count: count of trips recorded for that observation (OD pair and weekday/weekend)
  • dist: total distance (cumulative) in kms of all trips recorded for that observation (OD pair and weekday/weekend)
  • duration: total duration in minutes (cumulative) of all trips recorded for that observation (OD pair and weekday/weekend)

You may wish to start with reorganising ny_spread_rows as I deliberately made ny_spread_columns quite challenging. The intention here was to replicate the sorts of arduous data formatting operations that need to be performed when working with real datasets. As always there are different approaches to this, but it can be achieved with use of pivot_longer, pivot_wider, plus a call to separate(). This may be one to post to the course Slack.

## # A tibble: 386,762 x 6
##    o_station d_station wkday   count  dist duration
##        <int>     <int> <chr>   <int> <dbl>    <dbl>
##  1        72       116 weekend     1  1.15     18.2
##  2        72       127 weekend    10 18.0     339.
##  3        72       128 weekend     2  3.18     69.6
##  4        72       146 weekend    12 27.6     402.
##  5        72       151 weekend     2  2.87     54.9
##  6        72       161 weekend     2  2.52     64.8
##  7        72       164 weekend     5 13.3      73.3
##  8        72       167 weekend     1  2.07     17.2
##  9        72       168 weekend     2  1.70     42.7
## 10        72       173 weekend     9  9.59    194.
## # … with 386,752 more rows

Task 4: Compute from data

Using dplyr functions, calculate the average distance, duration and speed of trips occurring for each observation. Print out to the Console the top 10 most heavily cycled OD pairs (and their associated summary statistics) separately for weekdays and weekends. You may wish to join on your ny_stations table in order to fetch the station names corresponding to the origin and destination stations.

References

Wickham, H. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.