Data fundamentals: Describe, wrangle, tidy

Materials from class on Tuesday, August 9, 2022

Contents

Session outcomes

By the end of this session you should gain the following knowledge:

  • Learn the vocabulary and concepts used to describe data.
  • Appreciate the characteristics and importance of tidy data (Wickham 2014) for data processing and analysis.

By the end of this session you should gain the following practical skills:

  • Load flat file datasets in RStudio via querying an API.
  • Calculate descriptive summaries over datasets.
  • Apply high-level functions in dplyr and tidyr for working with data.
  • Create statistical graphics that expose high-level structure in data for cleaning purposes.

Introduction

This session covers some of the basics around how to describe and organise data. Whilst this might sound prosaic, there are several reasons why being able to consistently describe a dataset is important. First, it is the initial step in any analysis and helps delimit the research themes and technical procedures that can be deployed. This is especially relevant to modern Data Science-type workflows (like those supported by the tidyverse), where it is common to apply the same analysis templates when working over data. Describing your dataset with a consistent vocabulary enables you to identify which analysis templates to reuse. The second reason relates to the point made in session 1 that Social Data Science projects usually involve repurposing datasets for social science research for the first time. It is often not obvious whether the data contain sufficient detail and structure to characterise the target behaviours to be researched and the target populations they are assumed to represent. This leads to additional levels of uncertainty and places greater importance on the initial step of data processing, description and exploration.

Through the session we will learn both language and concepts for describing and thinking about data, but also how to deploy some of the most important data processing and organisation techniques in R to wrangle over a real dataset. We will be working throughout with data from New York’s Citibike scheme, accessed through the bikedata package, an API to Citibike’s publicly available origin-destination trip data.

The idea of applying a consistent vocabulary to describing your data applies especially to working with modern visualization toolkits (ggplot2, Tableau, vega-lite), and will be covered in some detail during the next session as we introduce Visualization Fundamentals and the Grammar of Graphics.

Concepts

Data structure

In the course we will work with data frames in R. These are spreadsheet-like representations where rows are observations (cases/records) and columns are variables. Each variable (column) in a data frame is a vector, and those vectors must be of equal length. Where observations have missing values for certain variables – and so would otherwise violate this equal-length requirement – the missing values must be substituted with something, usually NA. This constraint occasionally causes difficulties, for example when working with variables that contain values of different length for an observation. In these cases we create a special class of column, a list-column, to which we’ll return later in the course.
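As a minimal illustration (made-up values, not the Citibike data), each column of a data frame is an equal-length vector, with any missing value recorded explicitly as NA:

library(tibble)

# Each column (variable) is a vector of length three; the missing trip duration
# is substituted with NA so the equal-length requirement holds.
trips <- tibble(
  user_type = c("Subscriber", "Customer", "Subscriber"),
  duration_mins = c(12.4, NA, 31.0)
)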

Organising data according to this simple structure – rows as observations, columns as variables – makes working with data more straightforward. A specific set of tools, made available via the tidyverse, can be deployed for doing most data tidying tasks (Wickham 2014).

Types of variable

Table 1: A breakdown of Stevens (1946) variable types, operators and measures of central tendency that can be applied to each.
| Measurement | Description | Example | Operators | Midpoint | Dispersion |
| --- | --- | --- | --- | --- | --- |
| *Categories* | | | | | |
| Nominal | Non-orderable categories | Political parties; street names | = ≠ | mode | entropy |
| Ordinal | Orderable categories | Terrorism threat levels | = ≠ < > | median | percentile |
| *Measures* | | | | | |
| Interval | Numeric measurements | Temperatures; years | = ≠ < > + − | mean | variance |
| Ratio | Numeric measurements; counts | Distances; prices | = ≠ < > + − × ÷ | mean | variance |

A classification you may have encountered for describing variables is that developed by Stevens (1946), which considers the level of measurement of a variable. Stevens (1946) classed variables into two groups: variables that describe categories of things and variables that describe measurements of things. Categories include attributes like gender, title, Subscribers or Casual users of a bikeshare scheme and ranked orders (1st, 2nd, 3rd largest etc.). Measurements include quantities like distance, age, travel time, number of journeys made via a bikeshare scheme.

Categories can be further subdivided into those that are unordered (nominal) and those that are ordered (ordinal). Measurements can also be subdivided. Interval measurements are quantities that can be ordered and where the difference between two values is meaningful. Ratio measurements have both these properties, but also have a meaningful 0 – where 0 means the absence of something – and where the ratio of two values can be computed. The most commonly cited example of an interval measurement is temperature (in degrees C). Temperatures can be ordered and compared additively, but 0 degrees C does not mean the absence of temperature and 20 degrees C is not twice as “hot” as 10 degrees C.

Why is this important? The measurement level of a variable determines the types of data analysis procedures that can be performed and therefore allows us to efficiently make decisions when working with a dataset for the first time (Table 1).
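As a quick illustration with made-up values, an operation that is meaningful for a ratio variable (a mean) is not meaningful for a nominal one, where a mode is the appropriate summary:

# Illustrative only (made-up values).
user_type <- c("Subscriber", "Customer", "Subscriber")  # nominal variable
trip_dist <- c(1.2, 3.4, 0.8)                           # ratio variable (kms)

names(which.max(table(user_type)))  # mode -- a sensible summary for a nominal variable
mean(trip_dist)                     # mean -- sensible only for interval/ratio variables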

Types of observation

Observations either together form an entire population, or a subset – a sample – that we expect to represent a target population.

You no doubt will be familiar with these concepts, but we have to think a little more about this in Social Data Science applications as we may often be working with datasets that are so-called population-level. The Citibike dataset is a complete, population-level dataset in that every journey made through the scheme is recorded. Whether or not this is truly a population-level dataset, however, depends on the analysis purpose. When analysing the bikeshare dataset are we interested only in describing use within the Citibike scheme? Or are we taking the patterns observed through our analysis to make claims and inferences about cycling more generally?

If the latter, then there are problems, as the level of detail we have on our sample is pretty trivial compared to traditional datasets, where data collection activities are deliberately designed with a specified target population in mind. It may therefore be difficult to gauge how representative Citibike users and Citibike cycling are of New York’s general cycling population. The flipside is that passively collected data do not suffer from problems such as non-response bias and social-desirability bias that affect traditionally collected datasets.

Tidy data

I mentioned that we would be working with data frames organised such that columns always and only refer to variables and rows always and only refer to observations. This arrangement, called tidy (Wickham 2014), has two key advantages. First, if data are arranged in a consistent way, then it is easier to apply and reuse tools for wrangling them due to data having the same underlying structure. Second, placing variables into columns, with each column containing a vector of values, means that we can take advantage of R’s vectorised functions for transforming data – we will demonstrate this in the technical element of this session.
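As a quick preview of what is meant by a vectorised function (toy, made-up values): because a variable occupies a column vector, a transformation is written once and applied element-wise to every observation.

# Convert a vector of trip durations from seconds to minutes in one expression.
durations_secs <- c(1062, 3810, 1017)
durations_mins <- durations_secs / 60
durations_mins
## [1] 17.70 63.50 16.95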

The three rules for tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Drug treatment dataset

To elaborate further, we can use the example given in Wickham (2014), a drug treatment dataset in which two different treatments were administered to participants.

The data could be represented as:

Table 2: Table 1 of Wickham (2014).
person treatment_a treatment_b
John Smith — 2
Jane Doe 16 11
Mary Johnson 3 1

An alternative organisation could be:

Table 3: Alternative organisation of Table 1 of Wickham (2014).
treatment John Smith Jane Doe Mary Johnson
treatment_a — 16 3
treatment_b 2 11 1

Both present the same information unambiguously – Table 3 is simply Table 2 transposed. However, neither is tidy as the observations are spread across both the rows and columns. This means that we need to apply different procedures to extract, perform computations on, and visually represent, these data.

Much better would be to organise the table into a tidy form. To do this we need to identify the variables:

  1. person: a categorical nominal variable which takes three values: John Smith, Jane Doe, Mary Johnson.
  2. treatment: a categorical nominal variable which takes values: a and b.
  3. result: a measurement (ratio, I think) variable with six recorded values (including the missing value): —, 16, 3, 2, 11, 1.

Each observation is then a test result returned for each combination of person and treatment.

So, a tidy organisation for this dataset would be:

Table 4: Tidy version of Table 1 of Wickham (2014).
person treatment result
John Smith a —
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 3
Mary Johnson b 1

Gapminder population dataset

In chapter 12 of Wickham and Grolemund (2017), the benefits of this layout, particularly for working with R, are demonstrated using the gapminder dataset. I recommend reading this short chapter in full. We will be applying similar approaches in the technique part of this class (which follows shortly) and also the homework. To consolidate our conceptual understanding of tidy data let’s quickly look at the gapminder data, as it is a dataset we’re probably more likely to encounter.

First, a tidy version of the data:

Table 5: Tidy excerpt from gapminder dataset.
country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

So the variables:

  1. country: a categorical nominal variable.
  2. year: a date (cyclic ratio) variable.
  3. cases: a ratio (count) variable.
  4. population: a ratio (count) variable.

Each observation is therefore a recorded count of cases and population for a country in a year.

An alternative organisation of this dataset that appears in Wickham and Grolemund (2017) is below. This is untidy as the observations are spread across two rows. This makes operations that we might want to perform on the cases and population variables – for example computing exposure rates – somewhat tedious.

Table 6: Untidy excerpt of gapminder dataset: observations spread across rows
country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362

Imagine that the gapminder dataset instead reported values of cases separately by gender. A type of representation I’ve often seen in social science domains, probably as it is helpful for data entry, is where observations are spread across the columns. This too creates problems for performing aggregate functions, but also for specifying visualization designs (in ggplot2) as we will discover in the next session.

Table 7: Untidy possible excerpt of gapminder dataset: observations spread across columns
country year f_cases m_cases f_population m_population
Afghanistan 1999 447 298 9993400 9993671
Afghanistan 2000 1599 1067 10296280 10299080
Brazil 1999 16982 20755 86001181 86005181
Brazil 2000 39440 41048 87251329 87253569
China 1999 104007 108252 636451250 636464022
China 2000 104746 109759 640212600 640215983

Techniques

The technical element to this session involves importing, describing, transforming and tidying data from a large bikeshare scheme – New York’s Citibike scheme.

  • Download the 02-template.Rmd file for this session and save it to the reports folder of your comp-sds project that you created in session 1.
  • Open your comp-sds project in RStudio and load the template file by clicking File > Open File ... > reports/02-template.Rmd.

Import

In the template file there is a discussion of how to set up your R session with key packages – tidyverse, fst, lubridate, sf – and also the bikedata package for accessing bikeshare data.

Available via the bikedata package are trip and occupancy data for a number of bikeshare schemes. We will work with data from New York’s Citibike scheme for June 2020. A list of all cities covered by the bikedata package is below:

bike_cities()
##    city     city_name      bike_system
## 1    bo        Boston           Hubway
## 2    ch       Chicago            Divvy
## 3    dc Washington DC CapitalBikeShare
## 4    gu   Guadalajara           mibici
## 5    la   Los Angeles            Metro
## 6    lo        London        Santander
## 7    mo      Montreal             Bixi
## 8    mn   Minneapolis         NiceRide
## 9    ny      New York         Citibike
## 10   ph  Philadelphia           Indego
## 11   sf      Bay Area       FordGoBike

In the template there are code chunks demonstrating how to download and process these data using bikedata’s API. This is mainly for illustrative purposes and the code chunks take some time to execute. We ultimately use the fst package for serializing and reading in these data. So I suggest you ignore the import code and calls to the bikedata API and instead follow the instructions for downloading and reading in the .fst file with the trips data and also the .csv file containing stations data, with:

# Create subdirectory in data folder for storing bike data.
if(!dir.exists(here("data", "bikedata"))) dir.create(here("data", "bikedata"))

# Read in .csv file of stations data from url.
tmp_file <- tempfile()
url <- "https://www.roger-beecham.com/datasets/ny_stations.csv"
curl::curl_download(url, tmp_file, mode="wb")
ny_stations <- read_csv(tmp_file)

# Read in .fst file of trips data from url.
tmp_file <- tempfile()
url <- "https://www.roger-beecham.com/datasets/ny_trips.fst"
curl::curl_download(url, tmp_file, mode="wb")
ny_trips <- read_fst(tmp_file)

# Write out to subdirectory for future use.
write_fst(ny_trips, here("data", "ny_trips.fst"))
write_csv(ny_stations, here("data", "ny_stations.csv"))

# Clean workspace.
rm(url, tmp_file)

fst implements various operations in the background, such as multi-threading and compression, to reduce the time and disk space needed to read and write data. It therefore makes it possible to work with large datasets in-memory in R rather than connecting to a database and serving up summaries/subsets to be loaded into R. We will be working with just 2 million records, but with fst it is possible to work in-memory with much larger datasets – in Lovelace et al. (2020) we ended up working with 80 million+ trip records.

If you completed the reading and research from the Session 1 Homework, some of the above should be familiar to you. The key functions to look at are read_csv() and read_fst(), into which we pass the path to the file. In this case we created a tempfile() within the R session. We then write these data out and save locally to the project’s data folder. This is useful as we only want to download the data once. In the write_*() functions we reference this location using the here package’s here() function. here is really useful for reliably creating paths relative to your project’s root. To read in these data for future sessions:

# Read in these local copies of the trips and stations data.
ny_trips <- read_fst(here("data", "ny_trips.fst"))
ny_stations <- read_csv(here("data", "ny_stations.csv"))

Notice that we use assignment here (<-) so that these data are loaded as objects and appear in the Environment pane of your RStudio window. An efficient description of data import with read_csv() is also in Chapter 11 of Wickham and Grolemund (2017).

ny_stations and ny_trips are data frames, spreadsheet type representations containing observations in rows and variables in columns. Inspecting the layout of the stations data with View(ny_stations) you will notice that the top line is the header and contains column (variable) names.

Figure 1: ny_trips and ny_stations as they appear when calling View().

Describe

There are several functions for generating a quick overview of a data frame’s contents. glimpse(<dataset-name>) is particularly useful. It provides a summary of the data frame dimensions – we have c. 1.9 million trip observations in ny_trips and 11 variables1. The function also prints out the object type for each of these variables, with the variables of type int, chr or dbl in this case.

glimpse(ny_trips)
## Rows: 1,882,273
## Columns: 11
## $ id               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21…
## $ city             <chr> "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "…
## $ trip_duration    <dbl> 1062, 3810, 1017, 226, 1437, 355, 99, 1810, 87, 2714, 2096, 1611, 529, 69…
## $ start_time       <chr> "2020-06-01 00:00:03", "2020-06-01 00:00:03", "2020-06-01 00:00:09", "202…
## $ stop_time        <chr> "2020-06-01 00:17:46", "2020-06-01 01:03:33", "2020-06-01 00:17:06", "202…
## $ start_station_id <chr> "ny3419", "ny366", "ny389", "ny3255", "ny367", "ny248", "ny3232", "ny3263…
## $ end_station_id   <chr> "ny3419", "ny336", "ny3562", "ny505", "ny497", "ny247", "ny390", "ny496",…
## $ bike_id          <chr> "39852", "37558", "37512", "39674", "21093", "39594", "43315", "16571", "…
## $ user_type        <chr> "Customer", "Subscriber", "Customer", "Customer", "Customer", "Subscriber…
## $ birth_year       <chr> "1997", "1969", "1988", "1969", "1997", "1990", "1938", "1995", "1971", "…
## $ gender           <dbl> 2, 0, 2, 0, 2, 1, 2, 2, 2, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
glimpse(ny_stations)
## Rows: 1,010
## Columns: 6
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
## $ city      <chr> "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "ny", "n…
## $ stn_id    <chr> "ny116", "ny119", "ny120", "ny127", "ny128", "ny143", "ny144", "ny146", "ny150",…
## $ name      <chr> "W 17 St & 8 Ave", "Park Ave & St Edwards St", "Lexington Ave & Classon Ave", "B…
## $ longitude <chr> "-74.00149746", "-73.97803415", "-73.95928168", "-74.00674436", "-74.00297088", …
## $ latitude  <chr> "40.74177603", "40.69608941", "40.68676793", "40.73172428", "40.72710258", "40.6…
Table 8: A breakdown of data types in R.
Type Description
lgl Logical – vectors that can contain only TRUE or FALSE values
int Integers – whole numbers
dbl Double – real numbers with decimals
chr Character – text strings
dttm Date-times – a date + a time
fctr Factors – represent categorical variables of fixed and potentially orderable values

The object type of a variable in a data frame relates to that variable’s measurement level. It is often useful to convert to types with greater specificity. For example, we may wish to convert the start_time and stop_time variables to a date-time format so that various time-related functions could be used. For efficient storage, we may wish to convert the station identifier variables to int types by removing the redundant “ny” text which prefaces start_station_id, end_station_id and stn_id. The geographic coordinates are currently stored as type chr. These could be regarded as quantitative variables, floating points with decimals. So converting to type dbl or as a POINT geometry type (more on this later in the course) may be sensible.
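By way of a sketch (assumed and simplified – the full, slightly more involved versions are in the template file), conversions of this sort can be expressed with mutate():

library(tidyverse)
library(lubridate)

ny_trips <- ny_trips %>%
  mutate(
    start_time = as_datetime(start_time),                                 # chr -> dttm
    stop_time = as_datetime(stop_time),                                   # chr -> dttm
    start_station_id = as.integer(str_remove(start_station_id, "ny")),    # drop "ny" prefix, chr -> int
    end_station_id = as.integer(str_remove(end_station_id, "ny")),
    birth_year = as.integer(birth_year)
  )

ny_stations <- ny_stations %>%
  mutate(across(c(longitude, latitude), as.numeric))                      # chr -> dbl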

In the 02-template.Rmd file there are code chunks for doing these conversions. There are some slightly more involved data transform procedures in this code. Don’t fixate too much on these, but the upshot can be seen when running glimpse() on the converted data frames:

glimpse(ny_trips)
## Rows: 1,882,273
## Columns: 10
## $ id               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
## $ trip_duration    <dbl> 1062, 3810, 1017, 226, 1437, 355, 99, 1810, 87, 2714, 2096, 1611, 529, 695, 206,…
## $ start_time       <dttm> 2020-06-01 00:00:03, 2020-06-01 00:00:03, 2020-06-01 00:00:09, 2020-06-01 00:00…
## $ stop_time        <dttm> 2020-06-01 00:00:03, 2020-06-01 00:00:03, 2020-06-01 00:00:09, 2020-06-01 00:00…
## $ start_station_id <int> 3419, 366, 389, 3255, 367, 248, 3232, 3263, 390, 319, 237, 3630, 3610, 3708, 465…
## $ end_station_id   <int> 3419, 336, 3562, 505, 497, 247, 390, 496, 3232, 455, 3263, 3630, 3523, 3740, 379…
## $ bike_id          <int> 39852, 37558, 37512, 39674, 21093, 39594, 43315, 16571, 28205, 41760, 30745, 380…
## $ user_type        <chr> "Customer", "Subscriber", "Customer", "Customer", "Customer", "Subscriber", "Sub…
## $ birth_year       <int> 1997, 1969, 1988, 1969, 1997, 1990, 1938, 1995, 1971, 1989, 1990, 1969, 1984, 19…
## $ gender           <chr> "female", "unknown", "female", "unknown", "female", "male", "female", "female", …
glimpse(ny_stations)
## Rows: 1,010
## Columns: 5
## $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, …
## $ stn_id    <int> 116, 119, 120, 127, 128, 143, 144, 146, 150, 151, 157, 161, 164, 167, 168, 173, 174, 19…
## $ name      <chr> "W 17 St & 8 Ave", "Park Ave & St Edwards St", "Lexington Ave & Classon Ave", "Barrow S…
## $ longitude <dbl> -74.00150, -73.97803, -73.95928, -74.00674, -74.00297, -73.99338, -73.98069, -74.00911,…
## $ latitude  <dbl> 40.74178, 40.69609, 40.68677, 40.73172, 40.72710, 40.69240, 40.69840, 40.71625, 40.7208…

Transform

Transform with dplyr

Table 9: dplyr functions (verbs) for manipulating data frames.
function() Description
filter() Picks rows (observations) if their values match a specified criteria
arrange() Reorders rows (observations) based on their values
select() Picks a subset of columns (variables) by name (or name characteristics)
rename() Changes the name of columns in the data frame
mutate() Adds new columns (or variables)
group_by() Chunks the dataset into groups for grouped operations
summarise() Calculates single-row (non-grouped) or multiple-row (if grouped) summary values
… and more

dplyr is one of the most important packages for supporting modern data analysis workflows. The package provides a grammar of data manipulation, with access to functions that can be variously combined to support most data processing and transformation activity. Once you become familiar with dplyr functions (or verbs) you will find yourself generating analysis templates to re-use whenever you work on a new dataset.

All dplyr functions work in the same way:

  1. Start with a data frame.
  2. Pass some arguments to the function which control what you do to the data frame.
  3. Return the updated data frame.

So every dplyr function expects a data frame and will always return a data frame.

Use pipes %>% with dplyr

dplyr is most effective when its functions are chained together – you will see this shortly as we explore the New York bikeshare data. This chaining of functions can be achieved using the pipe operator (%>%). Pipes are used for passing information in a program. They take the output of a set of code (a dplyr specification) and make it the input of the next set (another dplyr specification).

Pipes can be easily applied to dplyr functions, and the functions of all packages that form the tidyverse. I mentioned in Session 1 that ggplot2 provides a framework for specifying a layered grammar of graphics (more on this in Session 3). Together with the pipe operator (%>%), dplyr supports a layered grammar of data manipulation.
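To make this concrete, the two specifications below are equivalent – the first nests function calls, the second chains the same operations with the pipe so the code reads in the order the operations are performed. This is a sketch using the ny_trips data imported earlier; the same summary is worked through properly below.

# Nested: read inside-out.
arrange(summarise(group_by(ny_trips, gender), count=n()), desc(count))

# Piped: read top-to-bottom.
ny_trips %>%
  group_by(gender) %>%
  summarise(count=n()) %>%
  arrange(desc(count))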

count() rows

This might sound a little abstract so let’s use and combine some dplyr functions to generate some statistical summaries on the New York bikeshare data.

First we’ll count the number of trips made in June 2020 by gender. dplyr has a convenience function for counting, so we could run the code below, also in the 02-template.Rmd for this session. I’ve commented the code block to convey what each line achieves.

ny_trips %>%  # Take the ny_trips data frame.
  count(gender, sort=TRUE) # Run the count function over the data frame and set the sort parameter to TRUE.
##    gender       n
## 1    male 1044621
## 2  female  586361
## 3 unknown  251291

There are a few things happening in the count() function. It takes the gender variable from ny_trips, organises or groups the rows in the data frame according to its values (female | male | unknown), counts the rows and then orders the summarised output descending on the counts.

summarise() over rows

Table 10: A breakdown of aggregate functions commonly used with summarise().
Function Description
n() Counts the number of observations
n_distinct(var) Counts the number of unique observations
sum(var) Sums the values of observations
max(var)|min(var) Finds the min|max values of observations
mean(var)|median(var)|sd(var)| ... Calculates central tendency of observations
... Many more

Often you will want to do more than simply counting and you may also want to be more explicit in the way the data frame is grouped for computation. We’ll demonstrate this here with a more involved analysis of the usage data and using some key aggregate functions (Table 10).

A common workflow is to combine group_by() and summarise(), and in this case arrange() to replicate the count() example.

ny_trips %>% # Take the ny_trips data frame.
  group_by(gender) %>% # Group by gender.
  summarise(count=n()) %>% # Count the number of observations per group.
  arrange(desc(count)) # Arrange the grouped and summarised (collapsed) rows according to count.
## # A tibble: 3 x 2
##  gender    count
##  <chr>     <int>
## 1 male    1044621
## 2 female   586361
## 3 unknown  251291

In ny_trips there is a variable measuring trip duration in seconds (trip_duration) and distinguishing casual users from those formally registered to use the scheme (user_type - Customer vs. Subscriber). It may be instructive to calculate some summary statistics to see how trip duration varies between these groups.

The code below uses group_by(), summarise() and arrange() in exactly the same way, but with the addition of other aggregate functions profiles the trip_duration variable according to central tendency and by user_type.

ny_trips %>% # Take the ny_trips data frame.
  group_by(user_type) %>% # Group by user type.
  summarise( # Summarise over the grouped rows, generate a new variable for each type of summary.
    count=n(),
    avg_duration=mean(trip_duration/60),
    median_duration=median(trip_duration/60),
    sd_duration=sd(trip_duration/60),
    min_duration=min(trip_duration/60),
    max_duration=max(trip_duration/60)
    ) %>%
  arrange(desc(count)) # Arrange on the count variable.

## # A tibble: 2 x 6
##  user_type    count avg_duration median_duration sd_duration min_duration max_duration
##  <chr>        <int>        <dbl>           <dbl>       <dbl>        <dbl>        <dbl>
## 1 Subscriber 1306688         20.2            14.4        110.         1.02         33090
## 2 Customer    575585         43.6            23.2        393.         1.02         46982

Clearly there are some outlier trips that may need to be examined. Bikeshare schemes are built to incentivise short journeys of <30 minutes, but the maximum trip duration recorded above is clearly erroneous – 32 days. Excluding these sorts of outliers by calculating trip durations at the 95th percentile is instructive. The max trip duration at the 95th percentile for Subscribers was almost 27 minutes and for Customers was 1 hour 26 mins. It makes sense that more casual users may have longer trip durations, as they are more likely to be tourists or occasional cyclists using the scheme for non-utility trips.
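Percentile figures of this sort can be generated with an additional aggregate function in the summarise() call – a sketch, not the exact code used above:

# 95th percentile of trip duration (in minutes) by user type.
ny_trips %>%
  group_by(user_type) %>%
  summarise(duration_95th = quantile(trip_duration/60, probs=0.95))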

Returning to the breakdown of usage by gender, an interesting question is whether or not the male-female split in bikeshare is similar to that of the cycling population of New York City as a whole. This might tell us something about whether the bikeshare scheme could be representative of wider cycling. This could be achieved with the code below. A couple of new additions: we use filter() to remove observations where the gender of the cyclist is unknown. We also use mutate() for the first time, which allows us to modify or create new variables.

ny_trips %>% # Take the ny_trips data frame.
  filter(gender != "unknown") %>% # Filter out rows with the value "unknown" on gender.
  group_by(gender) %>% # Group by gender.
  summarise(count=n()) %>% # Count the number of observations per group.
  mutate(prop=count/sum(count)) %>% # Add a new column called `prop`, divide the value in the row for the variable count by the sum of the count variable across all rows.
  arrange(desc(count)) # Arrange on the count variable.

## # A tibble: 2 x 3
##  gender   count  prop
##  <chr>    <int> <dbl>
## 1 male   1044621 0.640
## 2 female  586361 0.360

As I’ve commented each line you hopefully get a sense of what is happening in the code above. I mentioned that dplyr functions read like verbs. This is a very deliberate design decision. With the code laid out as above – each dplyr verb occupying a single line, separated by a pipe (%>%) – you can generally understand the code with a cursory glance. There are obvious benefits to this. Once you are familiar with dplyr it becomes very easy to read, write and share code.

Remembering that pipes (%>%) take the output of a set of code and make it the input of the next set, what do you think would happen if you were to comment out the call to arrange() in the code block above? Try it for yourself. You will notice that I use separate lines for each call to the pipe operator. This is good practice for supporting readability of your code.

Manipulate dates with lubridate

Let’s continue this investigation of usage by gender, and whether bikeshare might be representative of regular cycling, by profiling how usage varies over time. To do this we will need to work with date-time variables. The lubridate package provides various convenience functions for this.

In the code block below we extract the day of week and hour of day from the start_time variable using lubridate’s accessor functions. Documentation on these can be accessed in the usual way (?<function-name>), but reading down the code it should be clear to you how this works. Next we count the number of trips made by hour of day, day of week and gender. The summarised data frame will be re-used several times in our analysis, so we store it as an object with a suitable name (ny_temporal) using the assignment operator.

# Create a hod dow summary by gender and assign it the name "ny_temporal".
ny_temporal <- ny_trips %>%  # Take the ny_trips data frame.
  mutate(
    day=wday(start_time, label=TRUE), # Create a new column identifying day of week.
    hour=hour(start_time)) %>% # Create a new column identifying hour of day.
  group_by(gender, day, hour) %>% # Group by day, hour, gender.
  summarise(count=n()) %>% # Count the grouped rows.
  ungroup()

Whether or not to store derived data tables, like the newly assigned ny_temporal, in a session is not an easy decision. You want to try to avoid cluttering your Environment pane with many data objects. Often when generating charts it is necessary to create these sorts of derived tables as input data (to ggplot2) – and so when doing visual data analysis you may end up with an unhelpfully large number of these derived tables. The general rule I apply: if the derived table is to be used >3 times in a data analysis or is computationally intensive, assign it (<-) to an object.

In Figure 2 below these derived data are plotted. The template contains ggplot2 code for creating the graphic. Don’t obsess too much on it – more on this next session. The plot demonstrates a familiar weekday-weekend pattern of usage. Trip frequencies peak in the morning and evening rush hours during weekdays and mid/late-morning and afternoon during weekends. This is consistent with typical travel behaviour. Notice though that the weekday afternoon peak is much larger than the morning peak. There are several speculative explanations for this and re-running the plot on Subscriber users only may be instructive. A secondary observation is that whilst men and women share this overall pattern of usage, the relative number of trips taken each day of week varies. Men make many more trips at peak times during the start of the week than they do later in the week. The same pattern does not appear for women. This is certainly something to follow up on, for example by collecting data over a longer period of time.


Figure 2: Line charts generated with ggplot2. Plot data computed using dplyr and lubridate.

Our analysis is based on data from June 2020, a time when New York residents were emerging from lockdown. It would be instructive to compare with data from a non-Covid year. If there is a very clear contrast in usage between this data and a control (non-Covid) year, this suggests bikeshare data may be used for monitoring behavioural change. The fact that bikeshare is collected continuously makes this possible. Check out Jo Wood’s work analysing Covid-related change in movement behaviours across a range of cities.

Relate tables with join()

Trip distance is not recorded directly in the ny_trips table, but may be important for profiling usage behaviour. Calculating trip distance is eminently achievable as the ny_trips table contains the origin and destination station of every trip and the ny_stations table contains coordinates corresponding to those stations. To relate the two tables, we need to specify a join between them.

A sensible approach is to:

  1. Select all uniquely cycled trip pairs (origin-destination pairs) that appear in the ny_trips table.
  2. Bring in the corresponding coordinate pairs representing the origin and destination stations by joining on the ny_stations table.
  3. Calculate the distance between the coordinate pairs representing the origin and destination.

The code below is one way of achieving this.

od_pairs <- ny_trips %>% # Take the ny_trips data frame.
  select(start_station_id, end_station_id) %>% unique() %>% # Select trip origin and destination (OD) station columns and extract unique OD pairs.
  left_join(ny_stations %>% select(stn_id, longitude, latitude), by=c("start_station_id"="stn_id")) %>% # Select lat, lon columns from ny_stations and join on the origin column.
  rename(o_lon=longitude, o_lat=latitude) %>% # Rename new lat, lon columns -- associate with origin station.
  left_join(ny_stations %>% select(stn_id, longitude, latitude), by=c("end_station_id"="stn_id")) %>% # Select lat, lon columns from ny_stations and join on the destination column.
  rename(d_lon=longitude, d_lat=latitude) %>%  # Rename new lat, lon columns -- associate with destination station.
  rowwise() %>% # For computing distance calculation one row-at-a-time.
  mutate(dist=geosphere::distHaversine(c(o_lon, o_lat), c(d_lon, d_lat))/1000) %>% # Calculate distance (distHaversine expects lon, lat order) and express in kms.
  ungroup()

The code block above introduces some new functions: select() to pick or drop variables, rename() to rename variables and a convenience function for calculating straight-line distance from pairs of longitude-latitude coordinates (distHaversine()). The key function to emphasise is the left_join(). If you’ve worked with relational databases and SQL, dplyr’s join functions will be familiar. In a left_join, all the values from the main table are retained – the one on the left, ny_trips – and variables from the table on the right (ny_stations) are added. We specify explicitly the variables on which the tables should be joined with the by= parameter, the station id columns in this case. If there is a station id in ny_trips that doesn’t exist in ny_stations then NA is returned.
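Following this logic, a quick check on the join (assumed, not in the template) is to look for any origin-destination pairs whose coordinates came back as NA – that is, station ids in ny_trips with no match in ny_stations:

# Any rows returned here identify station ids present in ny_trips but missing
# from ny_stations (their joined coordinate columns are NA).
od_pairs %>%
  filter(is.na(o_lon) | is.na(o_lat) | is.na(d_lon) | is.na(d_lat))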

Other join functions provided by dplyr are in the table below. Rather than discussing each, I recommend consulting Chapter 13 of Wickham and Grolemund (2017).

Table 11: A breakdown of dplyr join functions (*_join(x, y)).
left_join() all rows from x
right_join() all rows from y
full_join() all rows from both x and y
semi_join() all rows from x where there are matching values in y, keeping just columns from x
inner_join() all rows from x where there are matching values in y, returning all combinations in the case of multiple matches
anti_join() all rows from x where there are not matching values in y, never duplicating rows of x

Figure 3: Histograms generated with ggplot2. Plot data computed using dplyr and lubridate

From the newly created distance variable, we can calculate the average (mean) trip distance for the 1.9m trips – 1.6km. This might seem very short, but remember that the distance calculation is problematic in that these are straight-line distances between pairs of docking stations. Really we should be calculating network distances derived from the cycle network in New York. A separate reason – discovered when generating a histogram on the dist variable – is that there are a large number of trips (124,403) that start and end at the same docking station. Initially these might seem to be unsuccessful hires – people failing to undock a bike for example. We could investigate this further by paying attention to the docking stations at which same origin-destination trips occur, as in the code block below.

ny_trips %>%
  filter(start_station_id==end_station_id) %>%
  group_by(start_station_id) %>% summarise(count=n()) %>%
  left_join(ny_stations %>%  select(stn_id, name), by=c("start_station_id"="stn_id")) %>%
  arrange(desc(count))

## # A tibble: 958 x 3
##    start_station_id count name
##    <chr>            <int> <chr>
##  1 ny3423            2017 West Drive & Prospect Park West
##  2 ny3881            1263 12 Ave & W 125 St
##  3 ny514             1024 12 Ave & W 40 St
##  4 ny3349             978 Grand Army Plaza & Plaza St West
##  5 ny3992             964 W 169 St & Fort Washington Ave
##  6 ny3374             860 Central Park North & Adam Clayton Powell Blvd
##  7 ny3782             837 Brooklyn Bridge Park - Pier 2
##  8 ny3599             829 Franklin Ave & Empire Blvd
##  9 ny3521             793 Lenox Ave & W 111 St
## 10 ny2006             782 Central Park S & 6 Ave
## # … with 948 more rows

All of the top 10 docking stations are either in parks, near parks or located along river promenades. This, coupled with the fact that these trips occur in much greater relative numbers for casual than regular users (Customer vs Subscriber), is further evidence that these are valid trips.

Write functions of your own

Through most of the course we will be making use of functions written by others – mainly those developed for packages that form the tidyverse and therefore that follow a consistent syntax. However, there may be times where you need to abstract over some of your code to make functions of your own. Chapter 19 of Wickham and Grolemund (2017) presents some helpful guidelines around the circumstances under which the data scientist typically tends to write functions. Most often this is when you find yourself copy-pasting the same chunks of code with minimal adaptation.

Functions have three key characteristics:

  1. They are (usually) named – the name should be expressive and communicate what the function does (we talk about dplyr verbs).
  2. They have brackets <function()> usually containing arguments – inputs which determine what the function does and returns.
  3. Immediately following <function()> are curly brackets {} used to contain the body – code that performs a distinct task, described by the function’s name.

Effective functions are short, perform single discrete operations and are intuitive.

You will recall that in the ny_trips table there is a variable called birth_year. From this we can derive cyclists’ approximate age. Below I have written a function get_age() for doing this. The function expects two arguments: yob – a year of birth; yref – a reference year (both passed as date-times). In the body, lubridate’s as.period() function is used to calculate the time elapsed in years, which is the value that the function returns. Once defined, and loaded into the session by being executed, it can be used (as below).

# Function for calculating time elapsed between two dates in years (age).
get_age <- function(yob, yref) {
    period <- lubridate::as.period(lubridate::interval(yob, yref),unit = "year")
    return(period$year)
}

ny_trips <- ny_trips %>% # Take the ny_trips data frame.
  mutate(
    age=get_age(as.POSIXct(birth_year, format="%Y"), as.POSIXct("2020", format="%Y")) # Calculate age from birth_year.
    )

We can use the two new derived variables – distance travelled and age – in our analysis. In Figure 4, we explore how approximate travel speeds vary by age, gender and trip distance. The code used to generate the summary data and plot is in the template file. Again the average “speed” calculation should be treated very cautiously as it is based on straight-line distances and it is very difficult to select out “utility” from “leisure” trips. I have tried to do the latter by selecting trips that occur only on weekdays and that are made by Subscriber cyclists. Additionally, due to the heavy subsetting the data become a little volatile for certain age groups and so I’ve aggregated the age variable into 5-year bands. Collecting more data is probably a good idea.

There are nevertheless some interesting patterns. Men tend to cycle at faster speeds than do women, although this gap narrows with the older age groups. The effect of age on speed cycled is more apparent for the longer trips. This trend is reasonably strong, although the volatility in the older age groups for trips >4.5km suggests we probably need more data and a more involved analysis to establish this. For example, it may be that the comparatively rare occurrence of trips in the 65-70 age group is made by only a small subset of cyclists. A larger dataset would result in a regression to the mean effect and negate any noise caused by outlier individuals. Certainly Figure 4 is an interesting data graphic – and the type of exploratory analysis demonstrated here, using dplyr functions, is most definitely consistent with that identified in the previous session when introducing Social Data Science.


Figure 4: Line charts generated with ggplot2. Plot data computed using dplyr and lubridate

Tidy

The ny_trips and ny_stations data already comply with the rules for tidy data (Wickham 2014). Each row in ny_trips is a distinct trip and each row in ny_stations a distinct station. However throughout the course we will undoubtedly encounter datasets that need to be reshaped. There are two key functions to learn here, made available via the tidyr package: pivot_longer() and pivot_wider(). pivot_longer() is used to tidy data in which observations are spread across columns, as in Table 7 (the gapminder dataset). pivot_wider() is used to tidy data in which observations are spread across rows, as in Table 6 (the gapminder dataset). You will find yourself using these functions, particularly pivot_longer(), not only for fixing messy data, but for flexibly reshaping data for use in ggplot2 specifications (more on this in sessions 3 and 4) or joining tables.

A quick breakdown of pivot_longer:

pivot_longer(
  data,
  cols, # Columns to pivot longer (across rows).
  names_to="name", # Name of the column to create from values held in spread *column names*.
  values_to="name" # Name of column to create from values stored in spread *cells*.
  )

A quick breakdown of pivot_wider:

pivot_wider(
  data,
  names_from, # Column in the long format which contains what will be column names in the wide format.
  values_from # Column in the long format which contains what will be values in the new wide format.
  )

In the homework you will be tidying some messy derived tables based on the bikeshare data using both of these functions, but we can demonstrate their purpose in tidying the messy gapminder data in Table 7. Remember that these data were messy as the observations by gender were spread across the columns:

untidy_wide
##   country     year  f_cases m_cases f_population m_population
##   <chr>       <chr> <chr>   <chr>   <chr>        <chr>
## 1 Afghanistan 1999  447     298     9993400      9993671
## 2 Afghanistan 2000  1599    1067    10296280     10299080
## 3 Brazil      1999  16982   20755   86001181     86005181
## 4 Brazil      2000  39440   41048   87251329     87253569
## 5 China       1999  104007  108252  636451250    636464022
## 6 China       2000  104746  109759  640212600    640215983

First we need to gather the problematic columns with pivot_longer().

untidy_wide %>%
  pivot_longer(cols=c(f_cases: m_population), names_to=c("gender_count_type"), values_to=c("counts"))

##   country     year  gender_count_type       counts
##   <chr>       <chr> <chr>                   <chr>
##  1 Afghanistan 1999  f_cases                447
##  2 Afghanistan 1999  m_cases                298
##  3 Afghanistan 1999  f_population           9993400
##  4 Afghanistan 1999  m_population           9993671
##  5 Afghanistan 2000  f_cases                1599
##  6 Afghanistan 2000  m_cases                1067
##  7 Afghanistan 2000  f_population           10296280
##  8 Afghanistan 2000  m_population           10299080
##  9 Brazil      1999  f_cases                16982
## 10 Brazil      1999  m_cases                20755
## # … with 14 more rows

So this has usefully collapsed the dataset by gender, but we now have a problem similar to that in Table 6, where observations are spread across the rows – in this instance cases and population are better treated as separate variables. This can be fixed by separating the gender_count_type variable and then spreading the values of the new count_type variable (cases, population) across the columns. Hopefully you can see how this gets us to the tidy gapminder data structure in Table 5.

untidy_wide %>%
  pivot_longer(cols=c(f_cases: m_population), names_to=c("gender_count_type"), values_to=c("counts")) %>%
  separate(col=gender_count_type, into=c("gender", "count_type"), sep="_")

##    country     year  gender count_type counts
##    <chr>       <chr> <chr>  <chr>      <chr>
##  1 Afghanistan 1999  f      cases      447
##  2 Afghanistan 1999  m      cases      298
##  3 Afghanistan 1999  f      population 9993400
##  4 Afghanistan 1999  m      population 9993671
##  5 Afghanistan 2000  f      cases      1599
##  6 Afghanistan 2000  m      cases      1067
##  7 Afghanistan 2000  f      population 10296280
##  8 Afghanistan 2000  m      population 10299080
##  9 Brazil      1999  f      cases      16982
## 10 Brazil      1999  m      cases      20755
## # … with 14 more rows

untidy_wide %>%
  pivot_longer(cols=c(f_cases: m_population), names_to=c("gender_count_type"), values_to=c("counts")) %>%
  separate(col=gender_count_type, into=c("gender", "count_type"), sep="_") %>%
  pivot_wider(names_from=count_type, values_from=counts)

##    country     year  gender cases  population
##    <chr>       <chr> <chr>  <chr>  <chr>
##  1 Afghanistan 1999  f      447    9993400
##  2 Afghanistan 1999  m      298    9993671
##  3 Afghanistan 2000  f      1599   10296280
##  4 Afghanistan 2000  m      1067   10299080
##  5 Brazil      1999  f      16982  86001181
##  6 Brazil      1999  m      20755  86005181
##  7 Brazil      2000  f      39440  87251329
##  8 Brazil      2000  m      41048  87253569
##  9 China       1999  f      104007 636451250
## 10 China       1999  m      108252 636464022
## 11 China       2000  f      104746 640212600
## 12 China       2000  m      109759 640215983

Conclusions

Developing the vocabulary and technical skills to systematically describe and organise data is crucial to modern data analysis. This session has covered the fundamentals here: that data consist of observations and variables of different types (Stevens 1946) and that in order to work effectively with datasets, especially in a functional way in R, these data must be organised according to the rules of tidy data (Wickham 2014). Most of the session content was dedicated to the techniques that enable these concepts to be operationalised. We covered how to download, transform and reshape a reasonably large set of data from New York’s Citibike scheme. In doing so, we generated insights that might inform further data collection and analysis activity. In the next session we will apply and extend this conceptual and technical knowledge as we introduce the fundamentals of visual data analysis and ggplot2’s grammar of graphics.

References

Lovelace, R., R. Beecham, E. Heinen, E. Vidal Tortosa, Y. Yang, C. Slade, and A. Roberts. 2020. “Is the London Cycle Hire Scheme Becoming More Inclusive? An Evaluation of the Shifting Spatial Distribution of Uptake Based on 70 Million Trips.” Transportation Research Part A: Policy and Practice 140 (October): 1–15.
Stevens, S. 1946. “On the Theory of Scales of Measurement.” Science 103 (2684): 677–80.
Wickham, H. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.
Wickham, H., and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, California: O’Reilly Media.

  1. Note that the version of the trips data downloaded from my external repo contains a sample of just 500k records – this is not ideal, but is due to data storage limits on that repo.↩︎