Learning Objectives

By the end of this practical lab you will be able to:

Manipulate LODES data from census blocks into an aggregate geography
Format origin-destination data so that it is suitable for mapping
Map flows using base plot and ggpolot

Interaction data

The dynamics of urban systems are captured by various types of interaction data with varying degrees of spatio-temporal granularity. In this practical you will explore various ways in which interaction data can be mapped; including the estimation and representation of routes from co-ordinate pairs, mapping GPS trails and how very large origin-destination flows can be summarized.

Estimating and representing routes

Many interaction data within cities are simply pairs of origin and destination locations, with some flow recorded between them. A good source of such data which are available within many municipalities relate to the location and flow between bike share docking stations. Many of the operators of these systems now make these data openly available.

We will now read in the September 2015 - August 2016 data from the Bay Area Bike Share, SF, USA. Although the original zip files contain other data, we will just read in the data related to the station location and also the use records.

#Read in data
stations <- read.csv("./data/201608_station_data.csv")
trips <- read.csv("./data/201608_trip_data.csv")

Each of the stations has various attributes and cover a series of locations within the bay area - in this case, we will subset to only those within San Francisco.

head(stations)

##   station_id                              name      lat      long
## 1          2 San Jose Diridon Caltrain Station 37.32973 -121.9018
## 2          3             San Jose Civic Center 37.33070 -121.8890
## 3          4            Santa Clara at Almaden 37.33399 -121.8949
## 4          5                  Adobe on Almaden 37.33141 -121.8932
## 5          6                  San Pedro Square 37.33672 -121.8941
## 6          7              Paseo de San Antonio 37.33380 -121.8869
##   dockcount landmark installation
## 1        27 San Jose     8/6/2013
## 2        15 San Jose     8/5/2013
## 3        11 San Jose     8/6/2013
## 4        19 San Jose     8/5/2013
## 5        15 San Jose     8/7/2013
## 6        15 San Jose     8/7/2013

# Limit to SF stations
stations <- stations[stations$landmark == "San Francisco",]

library(ggmap)

#Get background map for Chicago
SanFran <- get_map(location = c(-122.405,37.79), zoom = 14,color = 'bw')

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=37.79,-122.405&zoom=14&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

#Basic point plot
ggmap(SanFran) + geom_point(data = stations, aes(x = long, y = lat, colour = "red")) + 
theme_bw() +
theme(axis.line = element_blank(),
      axis.text = element_blank(),
      axis.title=element_blank(),
      axis.ticks = element_blank(),
      legend.key = element_blank(),
      legend.position="none",
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      panel.background = element_blank())

Turning now to the trips taken between these origin and destinations, these are ordered within the data frame as a trip per row, with each trip giving a various details including the start and end terminals. Using these data we will create a table of aggregate origin destination flows - however, we will only consider those flows between the stations identified as “San Francisco” and therefore shown on the map above.

# View top six rows of the trips data
head(trips)

##   Trip.ID Duration    Start.Date                           Start.Station
## 1  913465      746 9/1/2015 0:10 San Francisco Caltrain 2 (330 Townsend)
## 2  913466      969 9/1/2015 0:15                         Clay at Battery
## 3  913467      233 9/1/2015 0:15                        Davis at Jackson
## 4  913468      213 9/1/2015 1:29                         Clay at Battery
## 5  913469      574 9/1/2015 1:33                       Steuart at Market
## 6  913470      623 9/1/2015 1:36       San Jose Diridon Caltrain Station
##   Start.Terminal      End.Date                             End.Station
## 1             69 9/1/2015 0:23                 San Francisco City Hall
## 2             41 9/1/2015 0:31                    Washington at Kearny
## 3             42 9/1/2015 0:19                Commercial at Montgomery
## 4             41 9/1/2015 1:32                       Steuart at Market
## 5             74 9/1/2015 1:42 San Francisco Caltrain 2 (330 Townsend)
## 6              2 9/1/2015 1:47                               Japantown
##   End.Terminal Bike.. Subscriber.Type Zip.Code
## 1           58    238      Subscriber    94107
## 2           46     16      Subscriber    94133
## 3           45    534      Subscriber    94111
## 4           74    312      Subscriber    94107
## 5           69    279      Subscriber    94107
## 6            9    261      Subscriber    95112

# Get a list of the station IDs within SF
s_SF <- unique(stations$station_id)

# Limit trips to those with both origin and destination within the SF subset
trips_SF <- trips[(trips$Start.Terminal %in% s_SF) & (trips$End.Terminal %in% s_SF),]

# Create an table with origins and destination pairs
OD_trips_SF <- table(trips$Start.Terminal,trips$End.Terminal)
#View the top six rows
head(OD_trips_SF)

##    
##       2   3   4   5   6   7   8   9  10  11  12  13  14  16  21  22  23
##   2  67 193 971 354 402 390 118 350  96 289  93 209  26 135   0   0   0
##   3 148 118  23  31  39  24  13  23  11  18  15   8  20  12   0   0   0
##   4 932  35  96  14  14  29  11  10  11  11  49   8 206   4   0   0   0
##   5 372  29   9  58  51  26  53  15  37  22  15  14  17   3   0   0   0
##   6 396  51  25  37  81  38  31 137  30  16  18 131  53  47   0   0   0
##   7 322  27  18  17  54  55  60  51  15  24  13  30  17  83   0   0   0
##    
##      24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  41
##   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   6   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##    
##      42  45  46  47  48  49  50  51  54  55  56  57  58  59  60  61  62
##   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   6   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##    
##      63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  80  82
##   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  19   0
##   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  12   0
##   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   0
##   5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   0
##   6   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  50   0
##   7   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0
##    
##      83  84  88  89  90  91
##   2   0 314   1   8   0   0
##   3   0  19   1   8   0   0
##   4   0  13   0   1   0   0
##   5   0  65   0   0   0   0
##   6   0  57   0  17   0   0
##   7   0  43   0   1   0   0

If you remember from an earlier practical we can convert a table object into a data frame as follows which turns it from a wide to a narrow format:

# Create a data frame of the OD pairs
OD_trips_SF_Narrow <- data.frame(OD_trips_SF)
# Create sensible column names
colnames(OD_trips_SF_Narrow) <- c("Origin","Destination","Trips")

We will now identify the top ten most frequently ridden origin-destination pairs

# Sorts the trips in decending order
OD_trips_SF_Narrow <- OD_trips_SF_Narrow[order(OD_trips_SF_Narrow$Trips,decreasing = TRUE),]
# Get the top 10 trips
top10 <- OD_trips_SF_Narrow[1:10,]
top10

##      Origin Destination Trips
## 4272     65          69  3341
## 4274     67          69  3035
## 3593     50          60  2986
## 3010     61          50  2781
## 4637     60          74  2605
## 3019     70          50  2593
## 3667     50          61  2531
## 4355     74          70  2482
## 4334     51          70  2384
## 4333     50          70  2362

We will now add origin and destination latitude and longitude co-ordinates by merging with the stations data. First the origin locations:

# Add origin co-ordinates
top10 <- merge(top10,stations, by.x="Origin",by.y="station_id", all.x=TRUE)
# Remove unwanted columns
top10 <- subset(top10, select=c("Origin","Destination","Trips","lat","long"))
# Change column names
colnames(top10) <- c("Origin","Destination","Trips","O_lat","O_long")

And then the destinations:

# Add destination co-ordinates
top10 <- merge(top10,stations, by.x="Destination",by.y="station_id", all.x=TRUE)
# Remove unwanted columns
top10 <- subset(top10, select=c("Origin","Destination","Trips","O_lat","O_long","lat","long"))
# Change column names
colnames(top10) <- c("Origin","Destination","Trips","O_lat","O_long","D_lat","D_long")

One of the simplest ways of calculating a route is to use the Google maps API which is implemented in the googleway package.

# Install package
install.packages("googleway")

# Load package
library(googleway)

For this you will need to get a Google maps API key:

# Set your key
key <- "your_api_key"

We will then extract an origin destination pair from our top10 object, and then use the google_directions() function to generate a route - this is then converted to a set of lat lon waypoints using decode_pl():

# Using the first origin/destination
x <- 1 # You can change this between 1 - 10 to view each of the routes
origin <- c(top10[x,"O_lat"],top10[x,"O_long"])
destination <- c(top10[x,"D_lat"],top10[x,"D_long"])

# get the directions from Google Maps API
res <- google_directions(origin = origin,destination = destination,key = key, mode= "bicycling")

# Convert the results to co-ordinates
df_polyline <- decode_pl(res$routes$overview_polyline$points)

# See the top six rows
head(df_polyline)

##        lat       lon
## 1 37.77671 -122.3951
## 2 37.77686 -122.3953
## 3 37.77700 -122.3951
## 4 37.77723 -122.3948
## 5 37.77825 -122.3936
## 6 37.77859 -122.3931

These can then be mapped with ggmap:

ggmap(SanFran) +
  geom_path(aes(x = lon, y = lat), color = "red", size = 0.5, data = df_polyline, lineend = "round")

We can extend the above to run a conditional statement with the for() function which does something (in this case, what is in brackets) until a condition is satisfied. Here loop changes the value of x from 1 to the number of rows in the top10 object (i.e. 10), and for each change in x the code between the { and } is run. For loops are very helpful to run a block of code multiple times.

Because x is changed from 1-10 on each run, we can use this value in various helpful ways, firstly to select a particular row from the data frame top10, and second to act as an ID for each set of routes extracted.

tmp <-  data.frame(lat = numeric(0), lon = numeric(0), ID = numeric(0), Trips = numeric(0))

for (x in 1:nrow(top10)) {

# Get origins and destinations
origin <- c(top10[x,"O_lat"],top10[x,"O_long"])
destination <- c(top10[x,"D_lat"],top10[x,"D_long"])
  
# get the directions from Google Maps API
res <- google_directions(origin = origin,destination = destination,key = key, mode= "bicycling")

# Convert the results to co-ordinates
df_polyline <- decode_pl(res$routes$overview_polyline$points)

# Add a route ID and Trips to the data frame
df_polyline$ID <- x
df_polyline$Trips <- top10[x,"Trips"]

# Append the results to the tmp object
tmp <- rbind(tmp,df_polyline)

}

We can now visualize this using the ID as a factor which shows each route as a separate color.

ggmap(SanFran) +
  geom_path(aes(x = lon, y = lat,color = as.factor(ID)), size = 0.5, data = tmp, lineend = "round")

To enable some more experimentation with the flow data visualization without having to generate all the potential routes yourself, we have run these already for all origin destination station pairs where the flow was greater than 0. We will load these now:

# Load flows
load("./data/All_Flows.Rdata")
# Show the top six rows of the table
head(All_Flows)

##        lat       lon ID Trips
## 1 37.33245 -121.8903  1     4
## 2 37.33209 -121.8900  1     4
## 3 37.33227 -121.8900  1     4
## 4 37.33230 -121.8900  1     4
## 5 37.33230 -121.8900  1     4
## 6 37.33282 -121.8904  1     4

We can now show these on a map - we use the group option within the aes to tell ggmap that these are id that separate the routes, otherwise the whole set of co-ordinates are interpreted as a single route. You can remove these and generate the plot again to see what happens.

ggmap(SanFran) +
  geom_path(aes(x = lon, y = lat, group = ID), data = All_Flows)

## Warning: Removed 2506 rows containing missing values (geom_path).

For those who know the topography of San Francisco will understand why the Google routes have been calculated to avoid the central area.

We can now use the trip information to adjust the plot - for example, to scale the routes by the flow volume. We add the size option, but also divide the flows by 1000 to make the line widths an acceptable size. Thicker lines represent greater flows. We have also added the location of the stations in red.

ggmap(SanFran) +
  geom_path(aes(x = lon, y = lat, group = ID), data = All_Flows,size = All_Flows$Trips/1000) +
  geom_point(data=stations, aes(long, lat),colour="red")

## Warning: Removed 2506 rows containing missing values (geom_path).

There are a lot of other adjustment options, for example, we can darken the map and change the line color:

ggmap(SanFran,darken = 0.8) +
  geom_path(aes(x = lon, y = lat, group = ID), data = All_Flows, size = All_Flows$Trips/1000, colour = "white") +
  geom_point(data=stations, aes(long, lat),colour="red")

## Warning: Removed 2506 rows containing missing values (geom_path).

Or color the lines by intensity of flow; plus, we have also added some labels for the station ID using geom_text():

ggmap(SanFran,darken = 0.8) +
  geom_path(aes(x = lon, y = lat, group = ID,colour = All_Flows$Trips), data = All_Flows, size = All_Flows$Trips/1000) +
  scale_colour_gradient(low="#900C3F", high="#FFC300") +
  geom_point(data=stations, aes(long, lat),colour="red") +
  geom_text(data = stations,aes(x = long, y = lat, label = station_id), check_overlap = TRUE, colour="#FFFFFF",hjust=-0.6)

## Warning: Removed 2506 rows containing missing values (geom_path).

We will now clean up this plot by removing the unwanted elements and changing the title of the legend:

ggmap(SanFran,darken = 0.8) +
  geom_path(aes(x = lon, y = lat, group = ID,colour = All_Flows$Trips), data = All_Flows, size = All_Flows$Trips/1000) +
  scale_colour_gradient(low="#900C3F", high="#FFC300",name="Trips") +
  geom_point(data=stations, aes(long, lat),colour="red") +
  geom_text(data = stations,aes(x = long, y = lat, label = station_id), check_overlap = TRUE, colour="#FFFFFF",hjust=-0.6) +
  theme (
        axis.text = element_blank (), 
        axis.title = element_blank (),
        axis.ticks = element_blank ()
        )

## Warning: Removed 2506 rows containing missing values (geom_path).

Tracking data

In the previous example we estimated the routes that cyclists may have taken as part of a bike share scheme - in reality the true routes taken would be divergent from these estimated paths which are essentially an optimized shortest path (based on how google calculate these for cyclists), and as such are a set of co-ordinates that follow the road topology.

For some urban analytics applications we may have tracking data gathered by GPS. For this example we will use a sample of the GeoLife data. This GPS trajectory dataset was collected by 182 users during a period of three years from April 2007 to August 2012, and was part of a Microsoft Research Asia project. We will import a sample of this data which are the records for a single user. There are multiple text files with the file extension “.pit”, each of which relate to a different journey. The first six rows of these data can be discarded.

In total there are 395 of these files which as you could imagine would take a very long time to import one by one. As such, we have written a small block of code that will import these for you and store each imported data frame in a list. This is an alternative to the loop that was presented in the last section.

library(lubridate)

# Create a list of the files to import
file_list <- list.files("./data/GeoLife", full=T)
count <- length(file_list)

# This function imports a file
file_con <- lapply(file_list, function(x){
  tab <- read.table(x, head=F, quote = "\"", skip = 6, sep = ",") # import file
  colnames(tab) <- c("lat","lon","zero","alt","days","date","time") # change the column headings
  tab <- subset(tab,select=c("lat","lon","date","time")) # discard unwanted columns
  tab$date <- ymd(tab$date) # Set the time date
  tab$time <- hms(tab$time) # Set the time format
  tab$ID <- strsplit(x,"/|\\.")[[1]][5] # Uses the file name as an ID
  return(tab)
})

# Run the function and rbind the data frames together
file_con_df <- do.call(rbind, file_con)

The imported files are very similar in structure to the combined routes we explored in the last section. However, here, each GPS trail is separated by an ID.

head(file_con_df)

##        lat      lon       date        time             ID
## 1 39.99997 116.3271 2008-10-23 17H 58M 52S 20081023175852
## 2 40.00001 116.3272 2008-10-23 17H 58M 58S 20081023175852
## 3 40.00001 116.3271 2008-10-23  17H 59M 3S 20081023175852
## 4 40.00002 116.3271 2008-10-23  17H 59M 8S 20081023175852
## 5 40.00000 116.3271 2008-10-23 17H 59M 13S 20081023175852
## 6 40.00002 116.3271 2008-10-23 17H 59M 18S 20081023175852

We will first create a simple map of the densest area of the activity:

#Get background map 
beijing <- get_map(location = c(116.3244, 39.99202), zoom = 13,color = 'bw')

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=39.99202,116.3244&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

#Create plot
ggmap(beijing) +
  geom_path(aes(x = lon, y = lat, group = ID), data = file_con_df)

## Warning: Removed 27092 rows containing missing values (geom_path).

With GPS gathered data the tracks can be messy and will not necessarily snap to network features such as paths or road. One way to make a GPS map more readily interpretable we can adjust the alpha of the plot.

ggmap(beijing) +
  geom_path(aes(x = lon, y = lat, group = ID), data = file_con_df,alpha=0.1)

## Warning: Removed 27092 rows containing missing values (geom_path).

We can make this clearer by adjusting some of the colors:

ggmap(beijing,darken = 0.8) +
  geom_path(aes(x = lon, y = lat, group = ID), data = file_con_df,alpha=0.1,colour="#FFFFFF")

## Warning: Removed 27092 rows containing missing values (geom_path).

We can also use the wday function to split the gps data into weekdays or weekend and see if there are different patterns visible:

ggmap(beijing,darken = 0.8) +
  # plots the weekday
  geom_path(aes(x = lon, y = lat, group = ID), data = file_con_df[!wday(file_con_df$date) %in% c(0,7),],alpha=0.1,colour="green") +
  # plots weekend
  geom_path(aes(x = lon, y = lat, group = ID), data = file_con_df[wday(file_con_df$date) %in% c(0,7),],alpha=0.1,colour="white")

## Warning: Removed 18524 rows containing missing values (geom_path).

## Warning: Removed 10489 rows containing missing values (geom_path).

We can also use a facet grid to visualize the patterns by a temporal period - in this case by hour:

# Create a new hour of the day variable
file_con_df$hour <- hour(file_con_df$time)

# Facet plot
ggmap(beijing,darken = 0.8, legend='none') +
  geom_path(aes(x = lon, y = lat, group = ID), data = file_con_df,alpha=0.2,colour="white") +
  facet_wrap(~hour,ncol=3) +
  theme (
        axis.text = element_blank (), 
        axis.title = element_blank (),
        axis.ticks = element_blank ()
        )

## Warning: Removed 35037 rows containing missing values (geom_path).

Interaction data

At more disaggregate temporal scales flow data are captured within many urban contexts through more traditional survey data or censuses. These commonly are created for questions around home-work locations or migration; and as such for a variety of variables it is often possible to map aggregate flows both internally and within urban systems.

In this final section of the practical we will map some longitudinal employer-household dynamics data from the US Census Bureau. There are a range of different data within this collection, but for these purposes we will utilize an extract of the origin destination data - LEHD Origin-Destination Employment Statistics (LODES). These are available for each state, however, the extract we consider here is for 2014 in Texas. This is at a block level.

First we will read in the origin-destination data and a cross walk file which gives a list of higher aggregations for the blocks. We have cut down both tables from the full file to save space.

# Import OD
load("./data/tx_od_main_JT00_2014.Rdata")
# Import cross walk
XWalk <- read.csv("./data/tx_xwalk.csv")
# We will also turn off scientific notation for this example as the block codes are stored as a very large number
options(scipen=999)

# View top six rows of OD data
head(OD)

##         w_geocode       h_geocode S000
## 1 480019501001001 480019501002117    1
## 2 480019501001001 480019501003007    1
## 3 480019501001001 480019501003080    1
## 4 480019501001001 480739508012094    1
## 5 480019501001001 482030201032067    1
## 6 480019501001001 482139501001101    1

There are a lot of variables within the file, the details of which can be found here; however, for this practical we will just be using “S000” which relates to the total number of jobs alongside the origin (homes - h_geocode) and destination (work - w_geocode) block codes.

We will view the crosswalk data frame and then we need to append this to the OD data. We merge this twice as the tract codes are needed for both the origin and destination blocks.

# View the top six rows of data
head(XWalk)

##        tabblk2010        trct
## 1 480019501001000 48001950100
## 2 480019501001001 48001950100
## 3 480019501001002 48001950100
## 4 480019501001003 48001950100
## 5 480019501001004 48001950100
## 6 480019501001005 48001950100

# Merge onto the home block code
OD <- merge(OD,XWalk, by.x = "h_geocode", by.y = "tabblk2010", all.x= TRUE)
# Change column names
colnames(OD) <- c("h_geocode","w_geocode","S000","h_geocode_trct")

# Merge onto the work block code
OD <- merge(OD,XWalk, by.x = "w_geocode", by.y = "tabblk2010", all.x= TRUE)
# Change column names
colnames(OD) <- c("h_geocode","w_geocode","S000","h_geocode_trct","w_geocode_trct")

We can then aggregate the flows into the origin and destination tracts:

# Aggregate flows into Tracts
OD_Tract <- aggregate(data=OD, S000 ~ h_geocode_trct + w_geocode_trct, sum)

Before we can plot the flows we need the location of the tracts, which we can extract from a shapefile of the zone locations:

library(rgdal)

We will then read in the tract polygons for Texas downloaded from the US Census Beurueau site:

# Import spatial data
TX_SP <- readOGR("./data/Texas_Tract.geojson", "OGRGeoJSON")

## OGR data source with driver: GeoJSON 
## Source: "./data/Texas_Tract.geojson", layer: "OGRGeoJSON"
## with 5254 features
## It has 9 fields

# Convert to WGS84
TX_SP <- spTransform(TX_SP, CRS("+init=epsg:4326"))

We will now have a look at the content of the data frame associated with TX_SP and then use the coordinates() function which extracts the centroid of a zone to build a new lookup:

# View the top 6 rows
head(TX_SP@data)

##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID  NAME LSAD
## 0      48      005  000102 1400000US48005000102 48005000102  1.02   CT
## 1      48      005  001002 1400000US48005001002 48005001002 10.02   CT
## 2      48      005  001300 1400000US48005001300 48005001300    13   CT
## 3      48      007  950100 1400000US48007950100 48007950100  9501   CT
## 4      48      007  950400 1400000US48007950400 48007950400  9504   CT
## 5      48      021  950200 1400000US48021950200 48021950200  9502   CT
##       ALAND    AWATER
## 0 156392425   2427401
## 1 225237385   1495765
## 2 403384219  36228741
## 3 474927337 467095102
## 4   6040756         0
## 5  37063780     31442

# Create lookup table
TX_tract_centroids <- data.frame(TX_SP@data$GEOID,coordinates(TX_SP))
# Change column names
colnames(TX_tract_centroids) <- c("Tract","lon","lat")
# View the top six rows of the new data
head(TX_tract_centroids)

##         Tract       lon      lat
## 0 48005000102 -94.87112 31.32737
## 1 48005001002 -94.70740 31.16608
## 2 48005001300 -94.52019 31.19509
## 3 48007950100 -96.94047 28.23178
## 4 48007950400 -97.07097 28.01728
## 5 48021950200 -97.37416 30.35373

We will now merge the tract coordinates onto the tract flow data frame - again, we will do this twice and adjust the column names so both origin and destinations are coded.

# Add home lat lon
OD_Tract <- merge(OD_Tract,TX_tract_centroids,by.x="h_geocode_trct", by.y="Tract",all.x=TRUE)
# Fix column names
colnames(OD_Tract) <- c("h_geocode_trct","w_geocode_trct","S000","h_lon","h_lat")

# Add work lat lon
OD_Tract <- merge(OD_Tract,TX_tract_centroids,by.x="w_geocode_trct", by.y="Tract",all.x=TRUE)
# Fix column names
colnames(OD_Tract) <- c("w_geocode_trct","h_geocode_trct","S000","h_lon","h_lat","w_lon","w_lat")

We can now create a map - we will first create a map connecting tracts with flows of over 70

# Get base map
texas <- get_map(location = "Texas", zoom = 6,color = 'bw')

# Create plot
ggmap(texas) + 
      geom_segment(data=OD_Tract[OD_Tract$S000 > 70,],aes(y = h_lat, x = h_lon, yend = w_lat, xend = w_lon))

This is a bit messy, but we can improve this plot by adjusting some colors, line widths and transparency:

# Create plot
ggmap(texas,darken = 0.8) + 
      geom_segment(data=OD_Tract[OD_Tract$S000 > 70,],aes(y = h_lat, x = h_lon, yend = w_lat, xend = w_lon),colour= "white", alpha= 0.1, size=0.2)

We will now create a map at a city scale; in this case Houston. We have adjusted some of the parameters again, and also increased the number of origin-destinations:

# Get base map
Houston <- get_map(location = "Houston, TX", zoom = 10,color = 'bw')

# Create plot
ggmap(Houston,darken = 0.8) + 
      geom_segment(data=OD_Tract[OD_Tract$S000 > 5,],aes(y = h_lat, x = h_lon, yend = w_lat, xend = w_lon),colour= "white", alpha= 0.01, size=0.2)

## Warning: Removed 274660 rows containing missing values (geom_segment).

The problem with this plot is that all the origin destination pairs are being considered uniformly, and there is no consideration of the volume of flows. We can adjust this using the scale_alpha_continuous() option.

# Create plot
ggmap(Houston,darken = 0.8) + 
      geom_segment(data=OD_Tract[OD_Tract$S000 > 5,],aes(y = h_lat, x = h_lon, yend = w_lat, xend = w_lon, alpha= S000), size=0.3, colour = "white") +
  scale_alpha_continuous(range = c(0.004, 0.3))  +
  theme ( legend.position="none",
        axis.text = element_blank (), 
        axis.title = element_blank (),
        axis.ticks = element_blank ()
        )

## Warning: Removed 274660 rows containing missing values (geom_segment).

Further resources / training

Mapping flows in R

Data Viz 4 - Mapping Flows

Learning Objectives

Interaction data

Estimating and representing routes

Tracking data

Interaction data

Further resources / training