Learning Objectives

By the end of this practical lab you will be able to:

Basic Plotting in Base R

Base R includes functions for creating graphics. These are flexible, but static graphics are also very commonly created with the popular ggplot2 package. In this practical we will introduce base R plotting functions, ggplot2, and Plot.ly as a way of creating interactive graphics.

First we will read in some 2011 census data for the UK that we will use for the practical.

census <- read.csv("./data/census_small.csv")

This should look as follows…

head(census)
##        Code                     Ward PCT_Good_Health PCT_Higher_Managerial
## 1 E05000886 Allerton and Hunts Cross        48.97327             10.091491
## 2 E05000887                  Anfield        42.20538              2.912621
## 3 E05000888               Belle Vale        40.84911              3.931920
## 4 E05000889                  Central        58.62832              7.019923
## 5 E05000890                Childwall        51.90538             10.787704
## 6 E05000891                   Church        53.39201             17.437790
##   PCT_Social_Rented_Households
## 1                    13.005190
## 2                    22.772576
## 3                    42.555119
## 4                    18.363917
## 5                     6.937488
## 6                     3.025153

As we showed in an earlier practical (4. Descriptive Statistics), we can provide a summary of the attributes using the summary() function:

summary(census)
##         Code                          Ward    PCT_Good_Health
##  E05000886: 1   Allerton and Hunts Cross: 1   Min.   :37.32  
##  E05000887: 1   Anfield                 : 1   1st Qu.:43.47  
##  E05000888: 1   Belle Vale              : 1   Median :46.36  
##  E05000889: 1   Central                 : 1   Mean   :46.68  
##  E05000890: 1   Childwall               : 1   3rd Qu.:48.96  
##  E05000891: 1   Church                  : 1   Max.   :58.63  
##  (Other)  :24   (Other)                 :24                  
##  PCT_Higher_Managerial PCT_Social_Rented_Households
##  Min.   : 2.418        Min.   : 3.025              
##  1st Qu.: 3.701        1st Qu.:15.372              
##  Median : 5.215        Median :21.930              
##  Mean   : 7.012        Mean   :26.629              
##  3rd Qu.: 9.962        3rd Qu.:34.563              
##  Max.   :17.438        Max.   :57.993              
## 

However, it is also useful to graphically present the distributions. We can create a histogram using the hist() function, with additional options to specify the labels and color (these use hex values).

# Histogram
hist(census$PCT_Good_Health, col="#00bfff", xlab="Percent", main="Histogram") 
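
The bins that hist() draws can also be set and inspected directly. The sketch below is illustrative only: it uses the six PCT_Good_Health values printed by head(census) above rather than the full file, and the break points (35 to 60 in steps of 5) and the names good_health and bin_breaks are assumptions chosen for the example.

```r
# First six PCT_Good_Health values, copied from the head(census) output above
good_health <- c(48.97327, 42.20538, 40.84911, 58.62832, 51.90538, 53.39201)

# Assumed break points: bins of width 5 between 35 and 60
bin_breaks <- seq(35, 60, by = 5)

# plot = FALSE returns the histogram object without drawing it
h <- hist(good_health, breaks = bin_breaks, plot = FALSE)

# One count per bin; the counts sum to the number of values
h$counts

# Draw the histogram with the same breaks and the styling used above
hist(good_health, breaks = bin_breaks, col = "#00bfff",
     xlab = "Percent", main = "Histogram")
```

Changing the step in seq() is one way to see how wider or narrower bins alter the shape of the histogram.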

We might also be interested in the relationship between two variables. In the following plot, we show how the proportion of people who identify themselves as in good health within an area relates to the proportion of people who are living within socially rented housing.

plot(census$PCT_Good_Health,census$PCT_Social_Rented_Households,cex=.7,main="Good Health and Social Housing", xlab="% Good Health",ylab="% Social Housing",col="#00bfff",pch=19)

As was shown in a previous practical (see 4. Descriptive Statistics), a mean can be calculated as follows:

mean(census$PCT_Good_Health)
## [1] 46.67722

We can then use this to test each of the numbers contained in the “PCT_Good_Health” column.

census$PCT_Good_Health < mean(census$PCT_Good_Health)
##  [1] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [12] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
## [23] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

This returns a mix of TRUE and FALSE values, one for each ward. We can then combine this test with the ifelse() function to create a new variable called “target”. Rather than TRUE and FALSE, ifelse() returns the values specified by its second and third parameters. In this case, these are the strings “Yes” and “No”.

# Calculate a target for PCT in good health
census$target <- ifelse(census$PCT_Good_Health < mean(census$PCT_Good_Health),"Yes","No")
# Calculate a target for PCT social housing
census$target2 <- ifelse(census$PCT_Social_Rented_Households < mean(census$PCT_Social_Rented_Households),"Yes","No")
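
As a small worked example of the same logic, the sketch below applies the test and ifelse() to the six PCT_Good_Health values shown by head(census), using the city-wide mean printed earlier. The object names good_health, city_mean and below_mean are illustrative and are not part of the practical's data.

```r
# First six PCT_Good_Health values, copied from the head(census) output above
good_health <- c(48.97327, 42.20538, 40.84911, 58.62832, 51.90538, 53.39201)

# City-wide mean reported by mean(census$PCT_Good_Health) above
city_mean <- 46.67722

# Logical test: TRUE where a ward is below the city mean
below_mean <- good_health < city_mean

# ifelse(test, yes, no) maps TRUE to "Yes" and FALSE to "No"
target <- ifelse(below_mean, "Yes", "No")
target

# TRUE counts as 1, so sum() gives the number of wards below the mean
sum(below_mean)
```

The result is a character vector of the same length as the input, which is why it can be assigned straight into a new column of the data frame.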

You will now see that these values have been added as new variables in the data frame object:

head(census)
##        Code                     Ward PCT_Good_Health PCT_Higher_Managerial
## 1 E05000886 Allerton and Hunts Cross        48.97327             10.091491
## 2 E05000887                  Anfield        42.20538              2.912621
## 3 E05000888               Belle Vale        40.84911              3.931920
## 4 E05000889                  Central        58.62832              7.019923
## 5 E05000890                Childwall        51.90538             10.787704
## 6 E05000891                   Church        53.39201             17.437790
##   PCT_Social_Rented_Households target target2
## 1                    13.005190     No     Yes
## 2                    22.772576    Yes     Yes
## 3                    42.555119    Yes      No
## 4                    18.363917     No     Yes
## 5                     6.937488     No     Yes
## 6                     3.025153     No     Yes

A basic bar chart showing the frequency of zones within each category can be generated as follows:

#Create a table of the results
counts <- table(census$target)

barplot(counts, main="Target Distribution", xlab="Target",col="#00bfff")

You can also create stacked and side by side bar charts:

#Create a table of the results
counts <- table(census$target, census$target2)

#Create stacked bar chart
barplot(counts, main="Target Distribution", xlab="Target",col=c("#00bfff","#00cc66"),legend = rownames(counts))

#Create side by side bar chart
barplot(counts, main="Target Distribution", xlab="Target",col=c("#00bfff","#00cc66"),legend = rownames(counts),beside=TRUE)
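
It can help to look at the two-way table itself before plotting it. The following sketch is a minimal illustration that uses only the six “target” and “target2” values shown by head(census) above, so the counts are not those of the full dataset.

```r
# First six values of "target" and "target2", copied from the head(census) output above
target  <- c("No", "Yes", "Yes", "No", "No", "No")
target2 <- c("Yes", "Yes", "No", "Yes", "Yes", "Yes")

# Cross-tabulate: rows are the levels of target, columns are the levels of target2
counts <- table(target, target2)
counts

# Each column becomes one bar (or one group of bars when beside = TRUE);
# each row becomes one coloured segment, labelled in the legend by rownames(counts)
barplot(counts, beside = TRUE, col = c("#00bfff", "#00cc66"),
        legend.text = rownames(counts), xlab = "target2",
        main = "Target Distribution")
```

Reading the table this way shows that the x axis of the stacked and side by side charts corresponds to the second variable passed to table(), while the colours correspond to the first.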

We will now read in another dataset that shows the population of different racial groups within New York City between 1970 and 2010.

#Read data
racial <- read.csv("./data/NYC_Pop.csv")
#Create a plot for the total population without an x-axis label
plot(racial$Population,type = "o", col = "red", xlab = "Year", ylab = "Population", main = "Population over time",xaxt = "n")
# Add axis label
axis(1, at=1:5, labels=racial$Year)

It is also possible to add multiple lines to the plot using the lines() function:

#Create a plot for the White population (in units of 100k) without an x-axis label
plot((racial$White)/100000,type = "o", col = "green", xlab = "Year", ylab = "Population (100k)", main = "Population over time",xaxt = "n",ylim=c(0,max(racial$White/100000)))

lines(racial$Black/100000, type = "o", col = "red")
lines(racial$Asian/100000, type = "o", col = "orange")
lines(racial$Hispanic_Latino/100000, type = "o", col = "blue")

# Add axis label
axis(1, at=1:5, labels=racial$Year)

#Add a legend
legend("topright", c("White","Black","Asian","Hispanic / Latino"), cex=0.8, col=c("green","red","orange","blue"),pch=1, lty=1)
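
In the plot above the y-axis limit is taken from the first series only, so a group with larger values would be clipped. The sketch below shows one way to derive the limit from all groups and to avoid hard-coding the number of years. The data frame racial_demo is hypothetical: its column names follow the code above, but every value is a made-up placeholder and not real population data.

```r
# Hypothetical stand-in for NYC_Pop.csv: placeholder values only
racial_demo <- data.frame(
  Year = c(1970, 1980, 1990, 2000, 2010),
  White = c(50, 40, 35, 30, 28),
  Black = c(15, 17, 19, 20, 19),
  Asian = c(1, 3, 6, 9, 12),
  Hispanic_Latino = c(12, 14, 18, 22, 55)
)

groups <- c("White", "Black", "Asian", "Hispanic_Latino")
group_cols <- c("green", "red", "orange", "blue")

# y-axis limit taken from the largest value in any group, not just the first
y_max <- max(racial_demo[, groups])

# Positions 1..n along the x axis, so the code does not assume exactly five rows
x_pos <- seq_along(racial_demo$Year)

plot(x_pos, racial_demo$White, type = "o", col = group_cols[1], xlab = "Year",
     ylab = "Population (placeholder units)", main = "Population over time",
     xaxt = "n", ylim = c(0, y_max))

# Add the remaining groups one at a time
for (i in 2:length(groups)) {
  lines(x_pos, racial_demo[[groups[i]]], type = "o", col = group_cols[i])
}

# Label the x axis with the years and add a legend
axis(1, at = x_pos, labels = racial_demo$Year)
legend("topleft", groups, cex = 0.8, col = group_cols, pch = 1, lty = 1)
```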

Basic Graphing with ggplot2

The ggplot2 library provides a range of functions that make graphing and visualization of your data both visually appealing and simple to implement. There are two ways in which graphs can be created in ggplot2: the first is ggplot(), which we will discuss later, and the second is qplot(), which has a simplified syntax.

library(ggplot2)

Bar Charts

We can first create a bar chart using the categorical column (“target”) of the data frame object “census”. The “geom” argument tells qplot() what sort of plot to make. As you may remember from earlier in this practical, the target variable flags wards within Liverpool where the percentage of people in good health was less than the city mean.

qplot(target,  data=census, geom="bar")

Histogram

We can create a histogram by changing the “geom” and variable being plotted. Try adjusting the bin width, which alters the bins into which the values of the “PCT_Social_Rented_Households” column are aggregated.

qplot(PCT_Social_Rented_Households, data=census, geom="histogram",binwidth=10)

Scatterplot

Another very common type of graph is a scatterplot, which typically plots the values of two continuous variables against one another on the x and y axes of the graph. This graph looks at the relationship between the percentage of people in socially rented housing and the percentage who are occupied in higher managerial roles. The default plot type is a scatterplot, so in the next couple of examples we do not include geom = "point"; however, this could be added and would return the same result (try it!).

qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census)

Adding colours or shapes

In the previous graph, all the points were black. If we instead map the colour of the points to a categorical variable, we can highlight its groups; in this case we use the “target” column.

qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census,colour=target)

Alternatively, you can also use “shape” to keep the points as black, but alter their shape by the factor variable.

qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census,shape=target)

Adding a smoothed line

If we want to add a trend line to the plot, this is also possible by adding an additional value to the “geom” parameter.

qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census,geom = c("point","smooth"))
## `geom_smooth()` using method = 'loess'

We might also want a simpler linear regression line, which requires two further parameters: “method” and “formula”. Recent versions of ggplot2 no longer pass these through qplot() (they are ignored with the warning “Ignoring unknown parameters: method, formula”), so we instead add a geom_smooth() layer that takes both parameters.

qplot(PCT_Social_Rented_Households, PCT_Higher_Managerial, data = census) + geom_smooth(method="lm", formula=y~x)

Line Plots

To illustrate how to create line plots we will read in some economic data downloaded from the Office for National Statistics which concerns household expenditure since 1948.

household_ex <- read.csv("./data/expenditure.csv")

We can then have a quick look at the data and check on the data class.

head(household_ex)
##   Year Millions
## 1 1948   191274
## 2 1949   194639
## 3 1950   200097
## 4 1951   197686
## 5 1952   197993
## 6 1953   206868
str(household_ex)
## 'data.frame':    67 obs. of  2 variables:
##  $ Year    : int  1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
##  $ Millions: int  191274 194639 200097 197686 197993 206868 215626 224699 226305 231475 ...

We can now attempt to plot the data.

qplot(Year, Millions, data = household_ex, geom = "line")

Changing axis and labels

On the y axis, ggplot2 has defaulted to using scientific notation. We can change this; however, we will swap to the main ggplot() syntax in order to do so. The first stage is to set up the plot, telling ggplot() what data to use and which “aesthetic mappings” (variables!) will be passed to the plotting function. aes() is itself a function, although it is only used within ggplot2 plotting calls. The plot is stored in a variable “p”.

p <- ggplot(household_ex, aes(Year, Millions))

If you just typed “p” into the console, no line would be drawn, because you still need to tell ggplot() which type of graphical output is desired (depending on the ggplot2 version, this gives either an error or an empty plot). We do this by adding layers using the “+” symbol.

p + geom_line()
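
To make the idea of adding to a plot with “+” more concrete, the following sketch inspects a plot object before and after a geom is added. It is illustrative only: household_demo holds just the six rows shown by head(household_ex) above, and the names p_demo and p_line are hypothetical.

```r
library(ggplot2)

# First six rows of the expenditure data, copied from the head(household_ex) output above
household_demo <- data.frame(
  Year = 1948:1953,
  Millions = c(191274, 194639, 200097, 197686, 197993, 206868)
)

# Set up the plot: data and aesthetic mappings only, no layers yet
p_demo <- ggplot(household_demo, aes(Year, Millions))
length(p_demo$layers)

# Adding a geom with "+" returns a new plot object that contains one layer
p_line <- p_demo + geom_line()
length(p_line$layers)

# Further layers can be stacked in the same way, e.g. points on top of the line
p_line + geom_point()
```

Because “+” returns a new object rather than changing the original, the result has to be printed or assigned (as with p_line) to be kept.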

Swapping out the scientific notation requires another package called “scales”. Once loaded, we can then add an additional parameter onto the graph.

library(scales)
p + geom_line() + scale_y_continuous(labels = comma)
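
The comma object passed to labels is a function from the scales package. As a quick sketch of what it does to the axis values, it can be called directly on a few numbers; the vector millions below simply reuses the first three values from head(household_ex).

```r
library(scales)

# First three "Millions" values from the head(household_ex) output above
millions <- c(191274, 194639, 200097)

# comma() returns character strings with thousands separators
comma(millions)
```

Note that scale_y_continuous(labels = comma) is given the function itself, without brackets, so that ggplot2 can apply it to whichever axis breaks it chooses.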

We can also change the x and y axis labels:

# Add scale labels
p <- p + geom_line() + scale_y_continuous(labels = comma) + labs(x="Years", y="Millions (£)")
# Plot p
p
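
The same ggplot() syntax can reproduce the plots made earlier with qplot(), which is worth knowing because qplot() is deprecated as of ggplot2 3.4.0. The sketch below is one possible translation, not the only one. It uses only the six rows shown by head(census), and the object names census_demo, bar_plot, hist_plot and scatter_plot are hypothetical.

```r
library(ggplot2)

# First six rows of the census data, copied from the head(census) output above
census_demo <- data.frame(
  PCT_Higher_Managerial = c(10.091491, 2.912621, 3.931920,
                            7.019923, 10.787704, 17.437790),
  PCT_Social_Rented_Households = c(13.005190, 22.772576, 42.555119,
                                   18.363917, 6.937488, 3.025153),
  target = c("No", "Yes", "Yes", "No", "No", "No")
)

# Bar chart: counts of each target category
bar_plot <- ggplot(census_demo, aes(target)) + geom_bar()

# Histogram with a bin width of 10
hist_plot <- ggplot(census_demo, aes(PCT_Social_Rented_Households)) +
  geom_histogram(binwidth = 10)

# Scatterplot coloured by target, with a linear regression line
scatter_plot <- ggplot(census_demo,
                       aes(PCT_Social_Rented_Households, PCT_Higher_Managerial)) +
  geom_point(aes(colour = target)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Print any of the objects to draw it
scatter_plot
```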

Making Interactive Plots

Making an interactive plot is very easy with the plotly package.

install.packages("plotly")
#Load package
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Firstly, sign up to Plot.ly online, and then set up R to use Plot.ly by setting your username and an API key. The key is available here, and you need to click the “Regenerate Key” button.

Once you have these details, enter them in the Sys.setenv() functions as follows and run:

# Set username
Sys.setenv("plotly_username"="your_plotly_username")

# Set API
Sys.setenv("plotly_api_key"="your_api_key")
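
To check that the values have been stored for the current R session, they can be read back with Sys.getenv(). This is a minimal sketch using the same placeholder strings as above; it only confirms that the environment variables are set, not that the credentials are valid.

```r
# Placeholder credentials, as in the example above; replace with your own
Sys.setenv("plotly_username" = "your_plotly_username")
Sys.setenv("plotly_api_key" = "your_api_key")

# Sys.getenv() returns the stored value, or "" if the variable is not set
Sys.getenv("plotly_username")

# nzchar() is TRUE when a non-empty key has been stored
nzchar(Sys.getenv("plotly_api_key"))
```

Values set with Sys.setenv() last only for the current session, so they need to be set again after R is restarted.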

Making an interactive plot becomes very simple when you already have a ggplot2 object created - earlier we created “p” which we can now make interactive with ggplotly():

# Create interactive plot
ggplotly(p)

Further resources / training