Learning Objectives

By the end of this practical lab you will be able to:

Matrix and Data Frames

The two main object types used to store tabular data in R are the data frame and the matrix. A data frame can contain columns of different types (e.g. character, numeric), whereas a matrix holds values of a single type. You can create either manually within R, or by reading in other common formats such as spreadsheets or CSV files.
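The difference between the two object types can be seen in a minimal sketch. The names df and m and the values are made up for illustration only; converting the mixed-type data frame to a matrix forces every value into a single type.

```r
# Toy data (illustrative only): a data frame keeps one type per column
df <- data.frame(name = c("A", "B", "C"), value = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)
column_types <- sapply(df, class)   # class of each column

# Converting to a matrix forces a single type, so the numbers become text
m <- as.matrix(df)
matrix_type <- typeof(m)

column_types
matrix_type
```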

A data frame can be created using the data.frame() function.

#Create two vectors
a <- rep(2010:2017, each = 4) # this uses the rep() function to repeat values
b <- round(runif(32, 0, 40)) # runif can be used to generate random numbers - in this case between 0 and 40
#Create data frame
c <- data.frame(a,b)

You can type c into the console to return the whole data frame; however, you might just want to look at the top few rows. This can be achieved with the head() function:

#head returns the top six rows
head(c)
##      a  b
## 1 2010 34
## 2 2010 19
## 3 2010 38
## 4 2010  1
## 5 2011 26
## 6 2011 22
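Because runif() draws random numbers, your values will differ from those printed above. The following sketch assumes you want repeatable output and so fixes the random seed (the seed value is arbitrary); it also shows the n argument of head() and the companion tail() function. The object names are hypothetical.

```r
set.seed(1)  # arbitrary seed, chosen only so the example is repeatable
year <- rep(2010:2017, each = 4)
count <- round(runif(32, 0, 40))
df <- data.frame(year, count)

first_three <- head(df, n = 3)  # top three rows
last_two <- tail(df, n = 2)     # bottom two rows
dim(df)                         # number of rows and columns
```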

The matrix() function can be used as follows to create a tabular object of a single data type:

#Create a vector of numbers
a <- 1:25 #The colon signifies a range
a
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25
#Create a matrix with 5 rows and 5 columns
b <- matrix(a,nrow=5, ncol=5)
b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25

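It is possible to multiply a numeric matrix by a constant or by another matrix. Note that the * operator multiplies element by element; the matrix product uses the %*% operator instead.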

#Multiply b by 10
b*10
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   10   60  110  160  210
## [2,]   20   70  120  170  220
## [3,]   30   80  130  180  230
## [4,]   40   90  140  190  240
## [5,]   50  100  150  200  250
#Multiply b by b (element by element)
b*b
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   36  121  256  441
## [2,]    4   49  144  289  484
## [3,]    9   64  169  324  529
## [4,]   16   81  196  361  576
## [5,]   25  100  225  400  625
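The output above is an element-by-element product. As a small sketch of how this differs from the matrix product, the following uses a hypothetical 2 x 2 matrix m and compares the * and %*% operators.

```r
m <- matrix(1:4, nrow = 2, ncol = 2)   # columns are (1, 2) and (3, 4)

elementwise <- m * m      # each cell multiplied by the matching cell
product <- m %*% m        # rows of the first matrix times columns of the second

elementwise
product
```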

When a matrix prints, the columns and rows show their index as a set of numbers within square brackets. These can be used to extract values from the matrix. These are formatted as [row number, column number]. For example:

#Extract first row
b[1,]
## [1]  1  6 11 16 21
#Extract fourth column
b[,4]
## [1] 16 17 18 19 20
#Extract third and fourth columns
b[,3:4] # The colon is used to define a numeric vector between the two numbers
##      [,1] [,2]
## [1,]   11   16
## [2,]   12   17
## [3,]   13   18
## [4,]   14   19
## [5,]   15   20
#Extract first and fifth rows
b[c(1,5),] # The c() is used to create a numeric vector with the numbers separated by a comma
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    5   10   15   20   25
#Extract the value in the third row and fourth column
b[3,4]
## [1] 18
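Two further indexing options are sometimes useful and are sketched here on the same 5 x 5 matrix: a negative index removes rows or columns, and drop = FALSE stops a single column from collapsing into a plain vector. The object names on the left are hypothetical.

```r
b <- matrix(1:25, nrow = 5, ncol = 5)

without_first_row <- b[-1, ]               # a negative index drops that row
fourth_col_vector <- b[, 4]                # a single column collapses to a vector
fourth_col_matrix <- b[, 4, drop = FALSE]  # drop = FALSE keeps the matrix shape

dim(without_first_row)
dim(fourth_col_matrix)
```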

In the data frame that you created earlier, the columns were labelled with names rather than index numbers; however, you can still use the square brackets to extract values in the same way as for a matrix.

You can also reference the column names themselves using the $ symbol, for example:

#Return all the values in the column called "a"
c$a
##  [1] 2010 2010 2010 2010 2011 2011 2011 2011 2012 2012 2012 2012 2013 2013
## [15] 2013 2013 2014 2014 2014 2014 2015 2015 2015 2015 2016 2016 2016 2016
## [29] 2017 2017 2017 2017
#A different way of returning the column called "a"
c[,"a"]
##  [1] 2010 2010 2010 2010 2011 2011 2011 2011 2012 2012 2012 2012 2013 2013
## [15] 2013 2013 2014 2014 2014 2014 2015 2015 2015 2015 2016 2016 2016 2016
## [29] 2017 2017 2017 2017
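As a short sketch of square-bracket indexing on a data frame, the following uses a small made-up data frame (df and its values are illustrative only) and extracts a row, a column and a single value by position. The double-bracket form is another way of returning a column by name.

```r
df <- data.frame(a = c(2010, 2011, 2012), b = c(34, 19, 38))

first_row <- df[1, ]      # one row, returned as a data frame
second_col <- df[, 2]     # one column, returned as a vector
single_value <- df[3, 2]  # third row, second column
same_column <- df[["b"]]  # double brackets, equivalent to df$b
```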

We can also find out what the column names of a data frame are using the colnames() function:

colnames(c)
## [1] "a" "b"

Or, we can also use the same function to set new column names:

colnames(c) <- c("Year","Count")

Getting External Data into R

For most urban analytics you are more likely to be reading external data into R than creating data objects from scratch. Tabular data are commonly stored in text files such as CSV, or in spreadsheets, and explicitly spatial data will likely be stored in formats such as Shapefiles. In this section you will learn how to read data stored in these formats into R.

Reading Tabular Data

A common way of storing data externally is in .csv files. These are text files with a very simple format in which columns of attributes are separated by a comma (see footnote 1) and each row ends with a line break.
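To make the format concrete, the sketch below reads two tiny pieces of delimited text, one separated by commas and one by the bar character. The text and object names are invented for illustration; the text argument is used only so that the example does not depend on a file being present.

```r
# Illustrative text only; in practice this would be the content of a file
csv_text <- "UID,pop\n1,2371\n2,2975"
bar_text <- "UID|pop\n1|2371\n2|2975"

# Comma is the default separator for read.csv()
from_csv <- read.csv(text = csv_text)

# read.table() accepts any separator through the sep argument
from_bar <- read.table(text = bar_text, sep = "|", header = TRUE)

from_csv
from_bar
```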

In the following example you will read in some U.S. Census Bureau, 2010-2014 American Community Survey (ACS) 5-Year Estimate data. This was downloaded from the American Fact Finder website. The data are for census tracts in San Francisco and relate to median earnings in the past 12 months.

Reading CSV files into R uses the read.csv() function:

#Read CSV file - creates a data frame called earnings
earnings <- read.csv("./data/ACS_14_5YR_S2001_with_ann.csv")

#Show column headings
colnames(earnings)
## [1] "UID"        "pop"        "pop_m"      "earnings"   "earnings_m"
#UID - Tract ID
#pop - estimated total population over 16 with income
#pop_m - estimated total population over 16 with income (margin of error)
#earnings - estimated median earnings
#earnings_m - estimated median earnings (margin of error)

It is possible to show the structure of the object using the str() function.

str(earnings)
## 'data.frame':    197 obs. of  5 variables:
##  $ UID       : num  6.08e+09 6.08e+09 6.08e+09 6.08e+09 6.08e+09 ...
##  $ pop       : int  2371 2975 2748 3668 1562 2172 2716 3173 3517 4030 ...
##  $ pop_m     : int  226 315 324 442 198 374 315 338 258 521 ...
##  $ earnings  : Factor w/ 195 levels "-","100204","100583",..: 110 172 105 108 6 14 30 113 193 139 ...
##  $ earnings_m: Factor w/ 194 levels "**","10048","10068",..: 54 19 14 134 115 181 125 167 29 140 ...

This shows that the object is a data frame with 197 rows and 5 variables. For each of the attributes the class is shown (num = numeric; int = integer and Factor, with the number of levels). The read.csv() function guesses the column types when the data are read into R.

One issue you might notice is that the earnings and earnings_m variables have been read in as a Factor. The reason these columns were not read as integers (like the pop and pop_m columns) is the presence of two non-numeric values, which are shown as “**” and “-”. In ACS data these two symbols indicate that there were either no sample observations or too few sample observations to make a calculation. (From R 4.0.0 onwards, read.csv() no longer converts text to factors by default, so in recent versions these columns would appear as chr instead.)

Issues such as these are quite common when reading in external data; and we will look at how this can be corrected later.
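One option, separate from the correction used later in this practical, is to tell read.csv() which values represent missing data at the point of reading. The sketch below uses a few invented rows that mimic the ACS markers; the na.strings argument is part of read.csv(), but the data and object names are assumptions for illustration.

```r
# Illustrative rows only, mimicking the "-" and "**" markers in the ACS file
acs_text <- "UID,earnings,earnings_m\n1,49954,15503\n2,-,**\n3,110076,5331"

# na.strings lists the values that should be read as missing (NA)
acs <- read.csv(text = acs_text, na.strings = c("-", "**"))
str(acs)
```

With the markers treated as missing, the remaining values can be read as numbers directly.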

Not all tabular data are distributed as text files; another very common format is Microsoft Excel (.xls or .xlsx). Unlike .csv files, there is no built-in function to read these formats; however, extension packages exist (e.g. XLConnect).

#Download and install package
install.packages("XLConnect")
#Load package
library(XLConnect)

The following code downloads an Excel File from the London Data Store and then reads this into R.

download.file("https://files.datapress.com/london/dataset/number-bicycle-hires/2016-11-16T08:14:05/tfl-daily-cycle-hires.xls","./data/tfl-daily-cycle-hires.xls")
#Read workbook
workbook <- loadWorkbook("./data/tfl-daily-cycle-hires.xls")

#Read the Data Sheet
cycle_hire <- readWorksheet(workbook, sheet="Data")

Reading Spatial Data

Spatial data are distributed in a variety of formats, but commonly as Shapefiles. These can be read into R using a number of packages; this is illustrated here with “rgdal”. The following code loads a Census Tract Shapefile which was downloaded from SF OpenData.

#Download and install package
install.packages("rgdal")
#Load package
library(rgdal)

# Read Shapefile
SF <- readOGR(dsn = "./data", layer = "tl_2010_06075_tract10")
## OGR data source with driver: ESRI Shapefile 
## Source: "./data", layer: "tl_2010_06075_tract10"
## with 197 features
## It has 12 fields

This has created a SpatialPolygonsDataFrame object (see footnote 2), and you can view the tract boundaries using the plot() function:

plot(SF)

The San Francisco peninsula is shown, however, the formal boundaries extend into the ocean and also include the Farallon Islands. For cartographic purposes it may not be desirable to show these extents, and later we will explore how these can be cleaned up.

A SpatialPolygonsDataFrame is an object that contains a series of ‘slots’ holding different items of data:

#The slotNames() function prints their names
slotNames(SF)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"

The objects stored within the slots can be accessed using the “@” symbol:

#Show the top rows of the data object
head(SF@data)
##   STATEFP10 COUNTYFP10 TRACTCE10     GEOID10 NAME10       NAMELSAD10
## 0        06        075    016500 06075016500    165 Census Tract 165
## 1        06        075    016400 06075016400    164 Census Tract 164
## 2        06        075    016300 06075016300    163 Census Tract 163
## 3        06        075    016100 06075016100    161 Census Tract 161
## 4        06        075    016000 06075016000    160 Census Tract 160
## 5        06        075    015900 06075015900    159 Census Tract 159
##   MTFCC10 FUNCSTAT10 ALAND10 AWATER10  INTPTLAT10   INTPTLON10
## 0   G5020          S  370459        0 +37.7741958 -122.4477884
## 1   G5020          S  309097        0 +37.7750995 -122.4369729
## 2   G5020          S  245867        0 +37.7760456 -122.4295509
## 3   G5020          S  368901        0 +37.7799831 -122.4286631
## 4   G5020          S  158236        0 +37.7823363 -122.4224838
## 5   G5020          S  295388        0 +37.7833400 -122.4293428

The “data” slot contains a data frame with a row of attributes for each of the spatial polygons contained within the SF object; thus, each row equates to one polygon. Other slots contain useful information such as the spatial projection.
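Slots are a general feature of R's S4 object system rather than something specific to spatial data. As a rough sketch of the idea, the following defines a hypothetical class called ToyLayer with just two slots; it is far simpler than a SpatialPolygonsDataFrame and is only intended to show how slotNames() and “@” behave.

```r
library(methods)  # part of base R; provides setClass() and new()

# Hypothetical class, far simpler than a SpatialPolygonsDataFrame
setClass("ToyLayer", representation(data = "data.frame", proj4string = "character"))

layer <- new("ToyLayer",
             data = data.frame(GEOID10 = c("06075016500", "06075016400")),
             proj4string = "+proj=longlat")

slot_names <- slotNames(layer)   # names of the slots
attribute_table <- layer@data    # "@" pulls out one slot
n_features <- nrow(attribute_table)
```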

Creating Spatial Data

Sometimes it is necessary to create a spatial object from scratch, which is most common for point data given that only a single pair of co-ordinates is required for each feature. This can be achieved using the SpatialPointsDataFrame() function, which is used in this example to create a 311 point dataset. 311 data record non-emergency calls within the US, and in this case are those which occurred within San Francisco between January and December 2016. The 311 data used here have been simplified from the original data to only a few variables, and those calls without spatial references have been removed.

# Read csv into R
data_311 <- read.csv("./data/311.csv")
# Have a look at the structure
head(data_311)
##    CaseID                     Category      Lat       Lon
## 1 6405492               Street Defects 37.82269 -122.3632
## 2 6590944                 Sewer Issues 37.81054 -122.3634
## 3 5646247            Abandoned Vehicle 37.72862 -122.3647
## 4 5547584     Graffiti Public Property 37.72528 -122.3658
## 5 6407484 Street and Sidewalk Cleaning 37.72541 -122.3659
## 6 5503177       Temporary Sign Request 37.81994 -122.3663
# Create the SpatialPointsDataFrame
SP_311 <- SpatialPointsDataFrame(coords = data.frame(data_311$Lon, data_311$Lat), data = data.frame(data_311$CaseID,data_311$Category), proj4string = SF@proj4string)

# Show the results
plot(SP_311)
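The 311 file used above had already been filtered so that every call has a location. If you were starting from unfiltered records, a check along the following lines might be used before creating the points. This is a sketch on invented records, not the procedure used to prepare the practical data.

```r
# Toy records (illustrative only); two of them lack a usable location
calls <- data.frame(CaseID = c(1, 2, 3, 4),
                    Lat = c(37.82, NA, 37.73, 37.72),
                    Lon = c(-122.36, -122.36, NA, -122.37))

# complete.cases() is TRUE only for rows with no missing values in the chosen columns
has_location <- complete.cases(calls[, c("Lat", "Lon")])
calls_located <- calls[has_location, ]
```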

Subsetting Data

It is often necessary to subset data, either restricting a data frame to a set of columns or rows or, in the case of spatial data, creating an extract for a particular set of geographic features. Subsetting can be done in a number of different ways:

#Create a table of frequencies by the categories used within the 311 data
table(data_311$Category)
## 
##         311 External Request            Abandoned Vehicle 
##                          179                         5655 
##   Blocked Street or SideWalk      Catch Basin Maintenance 
##                          941                          161 
##                   Color Curb             Damaged Property 
##                          146                         1966 
##       DPW Volunteer Programs             General Requests 
##                            7                         5976 
##    Graffiti Private Property     Graffiti Public Property 
##                         7436                         9500 
##             Illegal Postings    Interdepartmental Request 
##                         1137                           90 
##           Litter Receptacles                MUNI Feedback 
##                         1552                         1928 
##                 Noise Report        Rec and Park Requests 
##                          626                         1304 
## Residential Building Request                 Sewer Issues 
##                          295                         2432 
##                SFHA Requests             Sidewalk or Curb 
##                          693                         1822 
##                  Sign Repair Street and Sidewalk Cleaning 
##                         1219                        41683 
##               Street Defects                 Streetlights 
##                         2146                         2155 
##       Temporary Sign Request             Tree Maintenance 
##                         1045                         1764 
##    Unpermitted Cab Complaint 
##                            6
# Use the subset() function to extract rows from the data which relate to Sewer Issues
sewer_issues <- subset(data_311,Category == "Sewer Issues")

# Use the square brackets "[]" to perform the same task
sewer_issues <- data_311[data_311$Category == "Sewer Issues",]

# Extract a list of IDs for the "Sewer Issues"
sewer_issues_IDs <- subset(data_311,Category == "Sewer Issues", select = "CaseID")
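The same two approaches extend to more than one condition. The sketch below uses a small invented table (the values and the latitude threshold are arbitrary) to show %in% for matching several categories and & for combining conditions.

```r
calls <- data.frame(CaseID = 1:5,
                    Category = c("Sewer Issues", "Noise Report", "Streetlights",
                                 "Sewer Issues", "Noise Report"),
                    Lat = c(37.81, 37.72, 37.73, 37.70, 37.82))

# %in% matches any of several values
sewer_or_lights <- calls[calls$Category %in% c("Sewer Issues", "Streetlights"), ]

# & requires both conditions to hold
northern_sewer <- subset(calls, Category == "Sewer Issues" & Lat > 37.75)
```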

Subsetting can also be useful for spatial data. In the example above the full extent of San Francisco was plotted, however, for cartographic purposes it may be preferable to remove the “Farallon Islands”. This has a GEOID10 of “06075980401” which can be used to remove this from a plot:

plot(SF[SF@data$GEOID10 != "06075980401",]) # Removes Farallon Islands from the plot

This can also be quite useful if you want to plot only a single feature, for example:

plot(SF[SF@data$GEOID10 == "06075980401",]) # Only plots Farallon Islands

You can also use the same syntax to create a new object - for example:

SF <- SF[SF@data$GEOID10 != "06075980401",] # Overwrites the SF object

Clipping Spatial Data

Clipping is a process of subsetting using overlapping spatial data. The following code uses the outline of the coast of the U.S. to clip the boundaries of the SF spatial data frame object:

#Load library
library("raster")
#Read in coastal outline (Source from - https://www.census.gov/geo/maps-data/data/cbf/cbf_counties.html)
coast <- readOGR(dsn = "./Data/", layer = "cb_2015_us_county_500k")
## OGR data source with driver: ESRI Shapefile 
## Source: "./Data/", layer: "cb_2015_us_county_500k"
## with 3233 features
## It has 9 fields
SF_clipped <- crop(SF, coast) # Clip the SF spatial data frame object to the coastline

#Plot the results
plot(SF_clipped)

We will now perform a similar operation on the SP_311 object:

library(spatialEco)
SP_311_PIP <- point.in.poly(SP_311, coast) # Keep the 311 points that fall within the coastline polygons

You can now see that this has subset the data to the extent of the peninsula area of San Francisco (see the previous plot):

plot(SP_311_PIP)

Merging Tabular Data

So far we have utilized a single data frame or spatial object; however, it is often the case that in order to generate information, data from multiple sources are required. Where data share a common “key”, this can be used to combine / link tables together. This might for example be an identifier for a zone, and is one of the reasons why most statistical agencies adopt a standard set of geographic codes to identify areas.

The “earnings” data imported earlier included a UID column which relates to a Tract ID. We can now import an additional data table called bachelors, which includes the same ID.

#Read CSV file - creates a data frame called bachelors
bachelors <- read.csv("./data/ACS_14_5YR_S1501_with_ann.csv")

#UID - Tract ID
#Bachelor_Higher - Bachelor degree or higher %
#Bachelor_Higher_m - Bachelor degree or higher % (margin of error)

Using the matching ID columns on both datasets we can link them together to create a new object with the merge() function:

#Perform the merge
SF_Tract_ACS <- merge(x=earnings,y=bachelors,by.x="UID",by.y="UID")
SF_Tract_ACS <- merge(earnings,bachelors,by="UID")# An alternative method to the above, but a shortened version as the ID columns are the same on both tables
#You can also use all.x=TRUE (or all.y=TRUE) to keep all the rows from either the x or y table - for more details type ?merge()
#The combined table now looks like
head(SF_Tract_ACS) # shows the top of the table
##          UID  pop pop_m earnings earnings_m Bachelor_Higher
## 1 6075010100 2371   226    49954      15503            44.3
## 2 6075010200 2975   315    75984      10892            58.8
## 3 6075010300 2748   324    47586      10549            48.8
## 4 6075010400 3668   442    48931       6531            35.9
## 5 6075010500 1562   198   110076       5331            39.2
## 6 6075010600 2172   374    22074       8666            34.1
##   Bachelor_Higher_m
## 1              19.5
## 2              30.8
## 3              13.2
## 4              26.1
## 5              33.7
## 6                23
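To illustrate the all.x option mentioned in the comments above, here is a minimal sketch with two invented tables whose IDs only partly overlap. The values are made up; only the behaviour of merge() is of interest.

```r
earnings <- data.frame(UID = c(1, 2, 3), earnings = c(49954, 75984, 47586))
bachelors <- data.frame(UID = c(1, 3, 4), Bachelor_Higher = c(44.3, 48.8, 35.9))

# Default: only UIDs found in both tables are kept
inner <- merge(earnings, bachelors, by = "UID")

# all.x = TRUE: every row of earnings is kept, with NA where bachelors has no match
left <- merge(earnings, bachelors, by = "UID", all.x = TRUE)
```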

Removing and Creating Attributes

It is sometimes necessary to remove variables from a tabular object or to create new values. In the following example we will remove some unwanted columns in the SF_clipped object, leaving just the zone id for each polygon.

#Remind yourself what the data look like...
head(SF_clipped@data)
##   STATEFP10 COUNTYFP10 TRACTCE10     GEOID10 NAME10       NAMELSAD10
## 0        06        075    016500 06075016500    165 Census Tract 165
## 1        06        075    016400 06075016400    164 Census Tract 164
## 2        06        075    016300 06075016300    163 Census Tract 163
## 3        06        075    016100 06075016100    161 Census Tract 161
## 4        06        075    016000 06075016000    160 Census Tract 160
## 5        06        075    015900 06075015900    159 Census Tract 159
##   MTFCC10 FUNCSTAT10 ALAND10 AWATER10  INTPTLAT10   INTPTLON10
## 0   G5020          S  370459        0 +37.7741958 -122.4477884
## 1   G5020          S  309097        0 +37.7750995 -122.4369729
## 2   G5020          S  245867        0 +37.7760456 -122.4295509
## 3   G5020          S  368901        0 +37.7799831 -122.4286631
## 4   G5020          S  158236        0 +37.7823363 -122.4224838
## 5   G5020          S  295388        0 +37.7833400 -122.4293428
SF_clipped@data <- data.frame(SF_clipped@data[,"GEOID10"]) #Makes a new version of the @data slot with just the values of the GEOID10 column - this is wrapped with the data.frame() function

#The data frame within the data slot now looks as follows
head(SF_clipped)
##   SF_clipped.data....GEOID10..
## 1                  06075016500
## 2                  06075016400
## 3                  06075016300
## 4                  06075016100
## 5                  06075016000
## 6                  06075015900

One thing you may not like on this new data frame is the column heading which has got a bit messy. We can clean this up using the colnames() function.

colnames(SF_clipped@data) <- "GEOID10" #Update column names
head(SF_clipped@data) #Check the updated values
##       GEOID10
## 1 06075016500
## 2 06075016400
## 3 06075016300
## 4 06075016100
## 5 06075016000
## 6 06075015900

These tract IDs are supposed to match those in the “SF_Tract_ACS” object; however, if you are very observant you will notice that there is one issue: the IDs above have a leading zero.

head(SF_Tract_ACS) # show the top of the SF_Tract_ACS object
##          UID  pop pop_m earnings earnings_m Bachelor_Higher
## 1 6075010100 2371   226    49954      15503            44.3
## 2 6075010200 2975   315    75984      10892            58.8
## 3 6075010300 2748   324    47586      10549            48.8
## 4 6075010400 3668   442    48931       6531            35.9
## 5 6075010500 1562   198   110076       5331            39.2
## 6 6075010600 2172   374    22074       8666            34.1
##   Bachelor_Higher_m
## 1              19.5
## 2              30.8
## 3              13.2
## 4              26.1
## 5              33.7
## 6                23

As such, in this instance we will create a new column on the SF_Tract_ACS data frame with a new ID that will match the SF GEOID10 column. We can achieve this using the $ symbol and will call this new variable “GEOID10”.

# Creates a new variable with a leading zero
SF_Tract_ACS$GEOID10 <- paste0("0",SF_Tract_ACS$UID)
head(SF_Tract_ACS)
##          UID  pop pop_m earnings earnings_m Bachelor_Higher
## 1 6075010100 2371   226    49954      15503            44.3
## 2 6075010200 2975   315    75984      10892            58.8
## 3 6075010300 2748   324    47586      10549            48.8
## 4 6075010400 3668   442    48931       6531            35.9
## 5 6075010500 1562   198   110076       5331            39.2
## 6 6075010600 2172   374    22074       8666            34.1
##   Bachelor_Higher_m     GEOID10
## 1              19.5 06075010100
## 2              30.8 06075010200
## 3              13.2 06075010300
## 4              26.1 06075010400
## 5              33.7 06075010500
## 6                23 06075010600
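Two alternatives to paste0() are sketched below, assuming the IDs should always be 11 characters wide. sprintf() pads a number to a fixed width, and the colClasses argument of read.csv() reads an ID as text so that the leading zero is never lost. The input text is invented for illustration.

```r
uid <- c(6075010100, 6075010200)      # numeric IDs, as read from the CSV

# Pad to a fixed width of 11 digits with leading zeros
geoid <- sprintf("%011.0f", uid)

# Alternatively, read the ID as text so that nothing is lost (illustrative text input)
tracts <- read.csv(text = "GEOID10,pop\n06075010100,2371",
                   colClasses = c(GEOID10 = "character"))
```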

If you remember from earlier in this practical, the earnings data had some values that were stored as factors rather than numerics or integers, and the same is true for the bachelors data and now the combined SF_Tract_ACS object. We can check this again as follows:

str(SF_Tract_ACS)
## 'data.frame':    197 obs. of  8 variables:
##  $ UID              : num  6.08e+09 6.08e+09 6.08e+09 6.08e+09 6.08e+09 ...
##  $ pop              : int  2371 2975 2748 3668 1562 2172 2716 3173 3517 4030 ...
##  $ pop_m            : int  226 315 324 442 198 374 315 338 258 521 ...
##  $ earnings         : Factor w/ 195 levels "-","100204","100583",..: 110 172 105 108 6 14 30 113 193 139 ...
##  $ earnings_m       : Factor w/ 194 levels "**","10048","10068",..: 54 19 14 134 115 181 125 167 29 140 ...
##  $ Bachelor_Higher  : Factor w/ 159 levels "-","0","1.5",..: 104 128 110 88 95 85 43 118 151 120 ...
##  $ Bachelor_Higher_m: Factor w/ 151 levels "**","1.6","10.2",..: 62 111 20 97 117 82 20 95 42 93 ...
##  $ GEOID10          : chr  "06075010100" "06075010200" "06075010300" "06075010400" ...

We can also remove the UID column. A quick way of doing this for a single variable is to use “NULL”:

SF_Tract_ACS$UID <- NULL

We will now convert the factor variables to numerics. The first stage will be to remove the “-” and “**” characters from the variables with the gsub() function, replacing these with NA values. This also has the effect of converting the factors to characters.

#Replace the "-" and "**" characters
SF_Tract_ACS$earnings <- gsub("-",NA,SF_Tract_ACS$earnings,fixed=TRUE) #replace the "-" values with NA
SF_Tract_ACS$earnings_m <- gsub("**",NA,SF_Tract_ACS$earnings_m,fixed=TRUE) #replace the "**" values with NA
SF_Tract_ACS$Bachelor_Higher <- gsub("-",NA,SF_Tract_ACS$Bachelor_Higher,fixed=TRUE) #replace the "-" values with NA
SF_Tract_ACS$Bachelor_Higher_m <- gsub("**",NA,SF_Tract_ACS$Bachelor_Higher_m,fixed=TRUE) #replace the "**" values with NA

We will now convert these to numeric values:

SF_Tract_ACS$earnings <- as.numeric(SF_Tract_ACS$earnings)
SF_Tract_ACS$earnings_m <- as.numeric(SF_Tract_ACS$earnings_m)
SF_Tract_ACS$Bachelor_Higher <- as.numeric(SF_Tract_ACS$Bachelor_Higher)
SF_Tract_ACS$Bachelor_Higher_m <- as.numeric(SF_Tract_ACS$Bachelor_Higher_m )
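The conversion above works because gsub() had already turned the factors into characters. Calling as.numeric() directly on a factor returns its internal level codes rather than the values you see printed, which is an easy mistake to make. The sketch below shows the difference on three invented values.

```r
raw <- factor(c("49954", "-", "110076"))   # toy values, illustrative only

codes <- as.numeric(raw)                   # internal level codes, NOT the earnings

as_text <- as.character(raw)               # recover the original text first
as_text[as_text %in% c("-", "**")] <- NA   # mark the placeholder symbols as missing
values <- as.numeric(as_text)              # now the numbers are the real values

codes
values
```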

Now all the variables other than the “GEOID10” are stored as integers or numerics:

str(SF_Tract_ACS)
## 'data.frame':    197 obs. of  7 variables:
##  $ pop              : int  2371 2975 2748 3668 1562 2172 2716 3173 3517 4030 ...
##  $ pop_m            : int  226 315 324 442 198 374 315 338 258 521 ...
##  $ earnings         : num  49954 75984 47586 48931 110076 ...
##  $ earnings_m       : num  15503 10892 10549 6531 5331 ...
##  $ Bachelor_Higher  : num  44.3 58.8 48.8 35.9 39.2 34.1 21.4 51.3 83 52.6 ...
##  $ Bachelor_Higher_m: num  19.5 30.8 13.2 26.1 33.7 23 13.2 25.8 17 25.3 ...
##  $ GEOID10          : chr  "06075010100" "06075010200" "06075010300" "06075010400" ...

Merging Spatial Data

It is also possible to join tabular data onto a spatial object (e.g. SpatialPolygonsDataFrame) in the same way as with regular data frames. In this example, we will join the newly created SF_Tract_ACS data onto the SF_clipped object.

SF_clipped <- merge(SF_clipped,SF_Tract_ACS, by="GEOID10") # merge
head(SF_clipped@data)#show the attribute data
##        GEOID10  pop pop_m earnings earnings_m Bachelor_Higher
## 56 06075016500 3973   428    50901      10181            25.7
## 55 06075016400 3076   249    52870       7839            56.7
## 54 06075016300 3907   853    40522      11269            63.2
## 52 06075016100 2247   388    23906       8405            50.2
## 51 06075016000 1670   247    59583      15262            70.8
## 50 06075015900 2177   241    38774       6988            23.5
##    Bachelor_Higher_m
## 56              17.2
## 55              22.3
## 54              23.9
## 52              22.7
## 51              27.2
## 50              15.2
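A join onto spatial features only makes sense if each attribute row ends up against the right geometry, so the order of the features must not change. The sketch below shows one order-preserving way of looking up attributes on plain data frames using match(). It is an illustration with invented tables, not a description of how the merge() method for spatial objects works internally.

```r
polygons <- data.frame(GEOID10 = c("06075016500", "06075016400", "06075016300"))
acs <- data.frame(GEOID10 = c("06075016300", "06075016500"),
                  pop = c(3907, 3973))

# match() gives, for each polygon, the row of acs with the same ID (NA if none)
row_in_acs <- match(polygons$GEOID10, acs$GEOID10)

# Index acs by those positions; the polygon order is left untouched
polygons$pop <- acs$pop[row_in_acs]
polygons
```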

Spatial Joins

Earlier in this practical we created a SpatialPointsDataFrame which we later cropped using the point.in.poly() function to create the “SP_311_PIP” object. As a reminder of what this looks like, it is plotted below:

plot(SP_311_PIP)

We will now clean up the associated data frame by removing all of the attributes apart from the category (“data_311.Category”) and then add a sensible column name.

SP_311_PIP@data <- data.frame(SP_311_PIP@data[,"data_311.Category"])#subset data
colnames(SP_311_PIP@data) <- "Category" #update column names

Although point.in.poly() was used to clip a dataset to an extent earlier, the other really useful feature of this point in polygon function is that it also appends the attributes of the polygon to the point. For example, we might be interested in finding out which census tract each of the 311 calls falls within. As such, we will implement another point in polygon analysis to create a new object SF_clipped_311:

SF_clipped_311 <- point.in.poly(SP_311_PIP, SF) # point in polygon
#Cleanup the attributes
SF_clipped_311@data <- SF_clipped_311@data[,c("GEOID10","Category")] #note that we don't need to use the data.frame() function as we are keeping more than one column
#Show the top rows of the data
head(SF_clipped_311@data)
##       GEOID10                     Category
## 1 06075980600            Abandoned Vehicle
## 2 06075980600     Graffiti Public Property
## 3 06075980600 Street and Sidewalk Cleaning
## 4 06075980600 Street and Sidewalk Cleaning
## 5 06075980600 Street and Sidewalk Cleaning
## 6 06075980600 Street and Sidewalk Cleaning
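To give a feel for what a point in polygon test involves, the following is a minimal base R sketch of the classic ray casting approach, applied to two hypothetical square “tracts” and three toy points. It is only an illustration of the general idea under simplified assumptions (simple polygons, no holes, planar co-ordinates) and is not the implementation used by point.in.poly().

```r
# Ray casting: count how many polygon edges a horizontal ray from the point crosses.
# An odd number of crossings means the point is inside.
point_in_polygon <- function(px, py, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- FALSE
  j <- n                                   # previous vertex; starts at the last one
  for (i in seq_len(n)) {
    straddles <- (poly_y[i] > py) != (poly_y[j] > py)
    if (straddles) {
      # x position where the edge meets the horizontal line through the point
      x_cross <- poly_x[i] + (py - poly_y[i]) * (poly_x[j] - poly_x[i]) /
        (poly_y[j] - poly_y[i])
      if (px < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Two hypothetical square "tracts" side by side, and three toy points
tracts <- list(
  T1 = list(x = c(0, 4, 4, 0), y = c(0, 0, 4, 4)),
  T2 = list(x = c(4, 8, 8, 4), y = c(0, 0, 4, 4))
)
pts <- data.frame(x = c(2, 6, 9), y = c(2, 1, 2))

# Append the ID of the containing tract to each point (NA if none contains it)
pts$tract <- NA_character_
for (id in names(tracts)) {
  hit <- mapply(point_in_polygon, pts$x, pts$y,
                MoreArgs = list(poly_x = tracts[[id]]$x, poly_y = tracts[[id]]$y))
  pts$tract[hit] <- id
}
pts
```

The final loop mirrors the attribute-appending behaviour described above: each point gains the ID of the polygon that contains it.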

Writing out and saving your data

In order to share data it is often useful to write data frames or spatial objects back out of R as external files. This is very simple, and R supports multiple formats. In these examples, a CSV file and a Shapefile are both created.

#In this example we write out a CSV file from the data slot of the SpatialPointsDataFrame SF_clipped_311
write.csv(SF_clipped_311@data,"311_Tract_Coded.csv")

This has created a CSV file “311_Tract_Coded.csv” in your working directory; we will use this in the next practical class - “Basic SQL”.
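By default write.csv() also writes the row names as an extra first column. If that column is not wanted, a sketch along the following lines could be used; it writes a small invented table to a temporary file and reads it back, keeping the tract IDs as text so that the leading zero survives.

```r
out <- data.frame(GEOID10 = c("06075980600", "06075016500"),
                  Category = c("Abandoned Vehicle", "Sewer Issues"))

path <- tempfile(fileext = ".csv")          # temporary file, used only for this example
write.csv(out, path, row.names = FALSE)     # omit the row-number column

# Read the IDs back as text so that the leading zero survives
back <- read.csv(path, colClasses = "character")
back
```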

It is also possible to write out a Shapefile:

#This will write out a Shapefile for San Francisco - note, a warning is returned as the column names are a little longer than are allowed within a Shapefile and as such are automatically shortened.
writeOGR(SF_clipped, ".", "SF_clipped", driver="ESRI Shapefile")

Further resources / training


  1. A range of different delimiters can be used in addition to a comma, the most common being tab, although characters that are not commonly used in text, such as bar (|), are sometimes used.

  2. If a Shapefile containing points or lines is imported into R, then this creates a SpatialPointsDataFrame or SpatialLinesDataFrame respectively.