Visualization fundamentals: Codify, map, evaluate

Materials from class on Wednesday, August 10, 2022

Introduction
Concepts
Techniques
Conclusions
References

By the end of this session you should gain the following knowledge:

Recognise the characteristics of effective data graphics.
Understand that there is a grammar of graphics, and that this grammar underpins modern visualization toolkits (ggplot, vega-lite and Tableau).
Be aware of the vocabulary used by these toolkits – that of encoding data through visual channels.
Be able to select appropriate visual channels given a data item’s measurement type.
Appreciate how visual channels and evidence of their encoding effectiveness can be used to evaluate data graphics.

By the end of this session you should gain the following practical skills:

Write ggplot2 specifications that represent data using marks (geoms) and encoding channels (aesthetics – colour and position).

Introduction

This session outlines the fundamentals of visualization design. It offers a position on what effective data graphics should do, before discussing in detail the processes that take place when creating data graphics. You will learn that there is a framework – a vocabulary and grammar – for supporting this process which, combined with established knowledge around visual perception, can be used to describe, evaluate and create effective data graphics. Talking about a vocabulary and grammar of data and graphics may sound alien and abstract, the preserve of Computer Scientists. However, through an analysis of 2019 General Election results data we will demonstrate that these concepts underpin most visual data analysis.

Watch Miriah Meyer’s TEDx talk, Information Visualization for Scientific Discovery, which provides a nice introduction to many of the concepts covered in the session.

Concepts

Characteristics of effective data graphics

Data graphics take numerous forms and are used in many different ways by scientists, journalists, designers and many more. Whilst the intentions of those producing data graphics varies, those that are effective generally have the following characteristics:

Represent complex datasets graphically to expose structure, connections and comparisons that could not be achieved easily via other means.
Are data rich: present many numbers in a small space.
Reveal patterns at several levels of detail: from broad overview to fine structure.
Have elegance: emphasise dimensions of a dataset without extraneous details.
Generate an aesthetic response that encourages people to engage with the data or question.

Considering these characteristics, take a look at the data graphics below, which present an analysis of the 2016 US Presidential Election. Use the links to read the full stories and accompanying data analyses.

Maps of 2016 US presidential election results. Left - two-colour choropleth in [Medium](https://medium.com/thepensivepost/understanding-rural-america-d9695a6b3516). Right - information-rich data graphic in [The Washington Post](https://www.washingtonpost.com/graphics/politics/2016-election/election-results-from-coast-to-coast/).

Figure 1: Maps of 2016 US presidential election results. Left - two-colour choropleth in Medium. Right - information-rich data graphic in The Washington Post.

Both maps use 2016 county-level results data, but the The Washington Post graphic encodes many more data items than the Medium post (see Table 1 below).

It is not simply the data density that makes the Washington Post graphic successful. The authors usefully incorporate annotations and transformations in order to support comparison and emphasise structure. By varying the height of triangles according to the number of votes cast, the thickness according to whether or not the result for Trump/Clinton was a landslide and rotating the scrollable map 90 degrees, the very obvious differences between metropolitan, densely populated coastal counties that voted emphatically for Clinton and the vast number of suburban, provincial town and rural counties (everywhere else) that voted Trump, are exposed.

Table 1: Data items encoded in the Washington Post and Medium articles.
Data item	Data measurement level	Medium	Washington Post
county location	`Interval`
county result	`Nominal`
state result	`Nominal`
county votes cast (~pop size)	`Ratio`
county result margin	`Ratio`
county result landslide	`Nominal`

Grammar of Graphics

Data graphics visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading, and color.

Tufte (1983)

In evidence in the Washington Post graphic is a judicious mapping of data to visuals, underpinned by a secure understanding of analysis context. This act of carefully considering how best to leverage visual systems given the available data and analysis priorities is key to designing effective data graphics.

In the late 1990s Leland Wilkinson, a Computer Scientist and Statistician, introduced the Grammar of Graphics as an approach that captures this process of turning data into visuals. Wilkinson (1999)’s thesis is that if graphics can be described in a consistent way according to their structure and composition, then the process of generating graphics of different types can be systematised. This has obvious benefits for building visualization toolkits: it makes it easy to specify chart types and combinations and helps formalise the process of designing data visualizations. vega-lite, Tableau and ggplot2 are all underpinned by Grammar of Graphics thinking.

Wilkinson (1999)’s grammar separates the construction of a data graphic into a series of components. Below are the components of the Layered Grammar of Graphics on which ggplot2 is based (Wickham 2010), a slight edit to Wilkinson (1999)’s original work.

Components of @wickham_layered_2010's Layered Grammar of Graphics.

Figure 2: Components of Wickham (2010)’s Layered Grammar of Graphics.

The seven components in Figure 2 are together used to create ggplot2 specifications. The aspects to emphasise at this stage are those in emphasis, which are required in any ggplot2 specification: the data containing the variables of interest, the geom or marks to be used to represent data and the aesthetic (mapping=aes(...)) attributes, or visual channels through which variables are to be encoded.

To demonstrate this, let’s generate some scatterplots based on the 2019 General Election data we will be analysing later in the session. Two variables worth exploring for association here are: con_1719, the change in Conservative vote share by constituency between 2017-2019, and leave_hanretty, the size of the Leave vote in the 2016 EU referendum, estimated at Parliamentary Constituency level (see Hanretty 2017).

Figure 3: Plots, grammars and associated ggplot2 specifications for the scatterplot.

In Figure 3 are three plots and associated ggplot2 specifications. Reading-off the graphics and the associated code, you should get a feel for how ggplot2 specifications are constructed:

Start with a data frame, in this case each observation is an electoral result for a Parliamentary Constituency. In the ggplot2 spec this is passed using the pipe operator (data_ge %>%). We also identify the variables we wish to encode and their measurement type. Remembering last session’s materials, both con_1719 and leave_hanretty are ratio scale variables.
Next is the encoding (mapping=aes()), which determines how the data are to be mapped to visual channels. A scatterplot is a 2d representation in which horizontal and vertical position varies in a meaningful way, in response to the values of a data set. Here the values of leave_hanretty are mapped along the x-axis and the values of con_1719 are mapped along the y-axis.
Finally, we represent individual data items with marks using the geom_point geometry.

In the middle plot, the grammar is updated such that the points are coloured according to winning_party, a variable of type categorical nominal. In the bottom plot constituencies that flipped from Labour-to-Conservative between 2017-19 are emphasised by varying the transparency (alpha) of points. I have described flipped as an ordinal variable, but strictly it is a nominal (binary) variable. Due to the way it is encoded in the plot – constituencies that flipped (flipped=TRUE) are given greater visual emphasis – I think it is more appropriate to call it an ordinal variable.

It is understandable if at this stage the specifications in Figure 3 still seem alien to you. We will be updating, expanding and refining ggplot2 specifications throughout this course to support all aspects of modern data analysis – from data cleaning and exploratory analysis through to model evaluation and communication.

Marks and visual channels

Effective data visualization design is concerned with representing data through marks and visual channels in a way that best conveys the properties of the data that are to be depicted.

via Jo Wood

You might have noticed that in my descriptions of ggplot2 specifications I introduced marks as another term for geometry and visual encoding channels as another term for aesthetics. I also paid special attention to the data types that are being encoded.

Marks are graphical elements such as bars, lines, points, ellipses that can be used to represent data items – in ggplot2, these are accessed through the functions prefaced with geom_*. Visual channels are attributes such as colour, size, position that, when mapped to data, control the appearance of marks in response to the values of a dataset. Not all channels are equally effective. In fact we can say confidently that for particular data types and tasks, some channels perform better than others.

Marks and channels are terms used in the interface of Tableau and in vega-lite specifications. They are also used widely in Information Visualization, an academic discipline devoted to the study of data graphics, and most notably by Tamara Munzner (2014) in her textbook Visualization Analysis and Design. Munzner (2014)’s work is important and widely adopted as it synthesises over foundational research in Information Visualization and Cognitive Science testing how effective different visual channels are at supporting different tasks.

Visual channels to which data items can be encoded, as they appear in @munzner_visualization_2014.

Figure 4: Visual channels to which data items can be encoded, as they appear in Munzner (2014).

Figure 4 is taken from Chapter 5 of Munzner (2014) and lists the main visual channels with which data might be encoded. The grouping and order of the figure is meaningful. Channels are grouped according to the tasks to which they are best suited and then ordered according to their effectiveness at supporting those tasks. To the left are magnitude:order channels – those that are best suited to tasks aimed at quantifying data items. To the right are identity:category channels – those that are most suited to supporting tasks that involve isolating, grouping and associating data items.

We can use this organisation of visual channels to make decisions about appropriate encodings given a variable’s measurement level. If we wished to convey the magnitude of something, for example a quantitative (ratio) variable like the size of the Conservative vote share in a constituency, we might select a channel that has good quantitative effectiveness – position on a common scale or length. If we wished to also effectively identify and associate constituencies according to the political party that was elected, a categorical nominal variable, we might select a channel that has good associative properties such as colour hue.

Evaluating designs

The effectiveness rankings of visual channels in Figure 4 are not simply based on Munzner’s preference. They are informed by detailed experimental work – W. Cleveland and McGill (1984), later replicated by Heer and Bostock (2010) – which involved conducting controlled experiments testing people’s ability to make judgements from graphical elements. We can use Figure 4 to help make decisions around which data item to encode with which visual channel. This is particularly useful when designing data-rich graphics, where several data items are to be encoded simultaneously. The figure also offers a low cost way of evaluating different designs against their encoding effectiveness. To illustrate this, we can use Munzner’s ranking of channels to evaluate the Washington Post graphic discussed in Figure 1.

Table 2: Encoding effectiveness for Washington Post graphic that emphasises *vote margin and size* of counties using triangle marks.
Mark	Data item	Type	Channel	Rank
Magnitude:Order
	Location	`interval`	position in x,y	quant
	Votes cast (~pop size)	`ratio`	length	quant
	Margin	`ratio`	length	quant
	Landslide	`ordinal`	area	quant
Identify:Category
	Winner	`nominal`	colour hue	cat

Table 2 provides a summary of the encodings used in the version of the graphic emphasising vote margin and size. US counties are represented using a peak-shaped mark (). The key purpose of the graphic is to depict the geography of voting outcomes, and so the most effective quantitative channel – position on an aligned scale – is used to order the county marks () with a 2D geographic arrangement. With the positional channels taken, the two quantitive measures, votes cast and result margin, are encoded with the next highest ranked channel, 1D length: height varies according to number of votes cast and width according to result margin. The marks are additionally encoded with two categorical variables: whether the county-level result was a landslide and also the ultimate winner. Since the intention is to give greater visual saliency to counties that resulted in a landslide, this as an ordinal variable, encoded with a quantitative channel: 2D area. The winning party, a categorical nominal variable, is encoded using colour hue.

Each of the encoding choices used in the graphic follow conventional wisdom in that data items are encoded using visual channels that are appropriate to their measurement level. Glancing down the “rank” column we can also argue that the graphic has high effectiveness. Whilst technically spatial region is the most effective channel for encoding nominal data, it is already in use in our graphic as the marks are arranged by geographic position. Additionally, it makes sense to distinguish Republican and Democrat wins using the colours with which they are always represented. Given the fact that the positional channels are in use to represent geographic location, length to represent votes cast and vote margin, the only superior visual channel to 2D area that could be used to encode the landslide variable is orientation. There are very good reasons for not varying the orientation of the marks. Most obvious is that this would clearly undermine perception of length encodings used to represent the vote margin (width) and absolute vote size (height).

Data visualization design almost always involves trade-offs. When deciding on a design configuration, it is necessary to prioritise analysis tasks and data and match representations and encodings that are most effective to the tasks that are most important. This then constrains the encoding options for less important data items and tasks. Good visualization design is sensitive to this interplay between tasks, data and encoding.

Symbolisation

Symbolization is the process of encoding something with meaning in order to represent something else. Effective symbol design requires that the relationship between a symbol and the information that symbol represents (the referent) be clear and easily interpreted.

White (2017)

Implicit in the discussion above, and when making design decisions, is the importance of symbolisation. Scrolling through the original Washington Post article, the overall pattern that can be discerned is of population-dense coastal and metropolitan counties voting Democrat – densely-packed, tall, wide and blue marks – contrasted against population-sparse rural and small town areas voting Republican – short, wide and red marks. The graphic evokes a distinctive landscape of voting behaviour, emphasised by its caption: “The peaks and valleys of Trump and Clinton’s support”.

Symbolisation is used equally well in the variant of the graphic emphasising two-party Swing between the 2012 and 2016 elections. Each county is represented as a | mark. The Swing variable is then encoded by continuously varying mark angles: counties swinging Republican are angled to the right /; counties swinging Democrat are angled to the left \. Although angle is a less effective channel at encoding quantities than is length, there are obvious links to the political phenomena in the symbolisation – angled right for counties that moved to the right politically. Additionally, the variable itself might be regarded as cyclic – or at least it has a ceiling with an important mid-point that requires emphasis. It is worth taking a second look at the full graphic here. Since there is spatial autocorrelation in case trajectories, we quickly assemble from the graphic dominant patterns of Swing to the Republicans (Great Lakes, rural East Coast), predictable Republican stasis (the mid west) and to detect more isolated, locally exceptional Swings to the Democrats (rapidly urbanising counties in the deep south).

Checking perceptual rankings

I mentioned that Munzner’s effectiveness ordering of visual channels is informed by empirical evidence – controlled experiments that examine perceptual abilities at making judgements from graphical primitives. It is worth elaborating a little on this experimental work, and on how established knowledge in Cognitive Science can be used to inform design choices.

W. S. Cleveland (1993) emphasises three perceptual activities that take place when we make sense of data graphics:

Detection : the element of the graphic must be easily discernible.
Assembly : the process of identifying patterns and structure within the graphical elements of the visualization.
Estimation : the process of making comparisons of the magnitudes of data items from the visual elements used.

These activities can be related to the categories of task outlined earlier. Detection is especially important for selective and associative tasks that involve isolating and grouping data items, whilst estimation is necessary for tasks that are orderable and quantitative, involving the ranking and reading-off of quantities.

Detection and preattentive processing

A useful distinction when considering graphical cognition is between processes that are attentive and pre-attentive (Ware 2008). Attentive processing describes the conscious processing that happens when we attempt to make sense of a visual field. Preattentive processing happens unconsciously and is the type of cognitive processing that allows something to be understood ‘at a glance’. Visual items that immediately pop-out to us induce preattentive processing.

The ability to provoke pop-out – making some things on a data graphic more easily detectible than others – relates to detection. It can be useful for supporting selective and associative tasks, and so is often used in a data graphic to encode categorical variables. For example, in the Washington Post graphic the use of colour hue to differentiate and group together counties that voted Republican or Democrat. Preattentive processes can also apply to assembly. We naturally construct and assemble patterns that are smooth and continuous when perceiving a graphic and so deviations from this continuity are often attended to unconsciously. An example here would be those urbanising counties in the deep South, which were locally exceptional in swinging to Democrat (to the left).

We can test this preattentive processing by using visual encoding channels to assist a task that requires us to select and associate visual items. Below are a set of data graphics containing 200 numbers. The graphics are currently hidden, but can be revealed by clicking the icon. For each graphic I want you to scan across the number, isolate or select the number 3, then group or associate the 3s together and count the number of instances that they occur. Speed is important here – so work as quickly as you can.

First, a set of numbers without applying any special encoding to the number 3.

Encoding: none