Saturday, February 25, 2012

The undiscovered country – a tutorial on plotting maps in R

The ability to handle maps and geospatial images is always a nice trick to have up your sleeve. Almost any sizeable report will contain a map – of a locality, of a country, or of the world. However, very few analysts have the ability to produce these plots for themselves and often resort to using Google Earth snapshots, images that they have drawn in paint, or – heaven forefend – a map that they found on Google Images. Making maps appears to be a skill held only by a few citizens of some undiscovered country, from whose bourn few analysts return.

In this post we chart the shores of the undiscovered country, and show that drawing maps, such as the one shown in Figure 1, is not so difficult. We will make use of R's powerful functions that allow plotting of maps, locations on those maps, and connecting lines that illustrate links between points of interest.
Figure 1: Map of Australia's state capitals. Connections have been selected arbitrarily for illustrative purposes.

We will also make use of Python, as I have found it far easier to use the Python bindings for the GoogleMaps API. Our approach consists of three steps:
  1. Work out what you want to plot: the region, the nodes (ie, the locations), and the edges (ie, the connections between nodes).
  2. Get the coordinates of the nodes.
  3. Read and plot the data in R.
Step 1: What do you want to plot?

As a proud Australian, I am trying to increase the profile of my tiny country. All too often, the maps that I see on R bloggers are of the United States. Granted, the US has 300 million people and, despite the inexorable rise of China, remains the economic super-power of the world. Kudos to Uncle Sam. But Australia can boast a similar landmass, a bucket-load of natural resources, and more species of poisonous snakes and man-eating reptiles than any other country on the planet (a point that we have down-played since the unsuccessful tourism slogan "come to Australia and you might get eaten"). No, this article is going to be printed next to a map of the Land Downunder dammit!

To illustrate how one can plot points, label them, and connect them with lines, we will plot each of Australia's state capitals, and connect selected pairs with lines. These lines will be based on the geodesic between the two points that uses the geosphere package (see here). To complete this task, there are three pieces of input data that you will require:  a shape-file, a list of nodes, and an edge-list.

Shapefile
 I have used a shape-file located here: http://www.gadm.org/country. This website also contains shape files for other countries and is definitely worth a look.

List of nodes
The nodes could be anything from airports to electricity transmission hubs. In my case, the nodes are just the capital cities of each of the states and territories. The list, which I have stored in the CSV file "CapitalCities.csv", consists of two columns: the name and the state/territory of each city.

Edge-list
The edge-list contains two columns: the 'from' column and the 'to' column. Each entry represents a connection between two nodes. In this instance, the two columns are interchangeable as the network is not directed, however, it would be easy to incorporate this feature. The edge-list that I have used can be accessed here.

Step 2: Get the coordinates of the nodes

Getting coordinates of a specific location is a simple affair thanks to the wonders of Google Maps and Google Earth. The problem is not how to get the data, but instead how to get it programmatically. For ten nodes, you can just as easily do this manually. For 1000 nodes, you can either find some intern to sit in front of Google Earth for the next week, or accept that there must be a better way.

I am sure that there are solutions to this problem that use R, but I have not yet found anything that is as quick and easy as the Python wrappers for the Google Maps API. (If you are unfamiliar with Python, then I advise you to take the time to learn the basics – your time will not have been wasted.) The module can be downloaded at http://pypi.python.org/pypi/googlemaps/ or via easy_install. An API key is needed to interface with the GoogleMaps API, but this is easy enough to obtain – just search for 'Google Maps API key' and follow the instructions.

The following script retrieves the latitude and longitude of each city in the file 'CapitalCities.csv', and copies them into a new file 'CapCoordinates.csv'.

from googlemaps import GoogleMaps
import csv

ifile=csv.reader(open('CapitalCities.csv','rb'),delimiter=',')
ofile=open('CapCoordinates.csv','wb')

w=csv.writer(ofile,delimiter=',')
w.writerow(['Name','State','Latitude','Longitude'])

gmaps=GoogleMaps(API_KEY)

count=0

for row in ifile:
   if count!=0:
      address=str(row[0])+" "+str(row[1])
      lat, lng = gmaps.address_to_latlng(address)
      w.writerow([str(row[0]),str(row[1]),str(lat),str(lng)])
      print row[0],lat,lng

   count+=1

ofile.close() 


Step 3: Plotting the map in R

At last we arrive at the R portion of this article. We will need to load four packages: the maps, sp, maptools, and geosphere libraries. Below is the complete code.

# Load necessary packages
library(maps)
library(sp)
library(maptools)
library(geosphere)

# Function that returns coordinates of a given location
getCoords<-function(location,df){
  return(df[match(x=location,df[,"Name"]),][,c(4,3)])
}

# Plot a great circle from 'from' to 'to' names.
plotLine<-function(from,to,df){
  inter<-gcIntermediate(p1=getCoords(from,df),
                      p2=getCoords(to,df),
                      n=50,addStartEnd=TRUE)
  lines(inter, col="green",cex=2,lwd=2)
}

# Read in the Australia Shapefile
ozDATA<-readShapeSpatial("AUS_adm1")

# Read CSV file containing coordinates and edges
nodes<-read.csv(file="CapCoordinates.csv",sep=",",header=TRUE)
edges<-read.csv(file="EdgeListCapCities.csv",sep=",",header=TRUE)

# Plot the graph itself
plot(ozDATA,lwd=0.01,bg="#242424",col="#0000CD",ylim=c(-46,-10),xlim=c(125,145))

# Plot the nodes on the graph
points(nodes$Longitude,nodes$Latitude,pch=16,cex=1.5)

# Label the nodes
text(nodes$Longitude,nodes$Latitude,nodes$Name,cex=1,adj=0,pos=2,col="#C2C2C2")

# Plot each of the edges between nodes
for (j in 1:nrow(edges)){
   plotLine(edges[j,]$From,edges[j,]$To,nodes) 
}

Line-by-line walkthrough

We start by reading in the Australia shape-file using the readShapeSpatial function from the sp package.
ozDATA<-readShapeSpatial("AUS_adm1")
Then we read in our list of nodes and freshly obtained coordinates, as well as the edge list.
nodes<-read.csv(file="CapCoordinates.csv",sep=",",header=TRUE)
edges<-read.csv(file="EdgeListCapCities.csv",sep=",",header=TRUE)
Having loaded all the necessary data, we can start plotting.
plot(ozDATA,lwd=0.01,bg="#242424",col="#0000CD",ylim=c(-46,-10),xlim=c(125,145))
We plot the shape-file using the plot function with the following arguments:
  • lwd – the line width;
  • bg – the background colour;
  • col – the colour of the regions (usually landmasses);
  • ylim – a 2-tuple containing the latitude boundaries for the plot; and
  • xlim – a 2-tuple containing the longitude boundaries for the plot.
Now we want to plot the cities (ie, the nodes) on this map. To do this we simply pass the coordinates to the points function. The parameters pch and cex set the type and size of the marker, respectively.
points(nodes$Longitude,nodes$Latitude,pch=16,cex=1.5)
Next we want to add labels to each of the nodes.
text(nodes$Longitude,nodes$Latitude,nodes$Name,cex=1,adj=0,pos=2,col="#C2C2C2")
The text function is used in almost identical fashion to the points function, but with an additional argument that contains the names of the nodes. We also add two additional arguments: adj and pos. The adj parameter adjusts the location of the labels. In this instance we have set it to zero, as it will be overwritten by our specification of pos. The pos parameter takes a value between 1 and 4 that specifies whether the label will be below, to the left, above, or to the right of the specified coordinates. (In this instance we have chosen to have labels to the left of the nodes.)

All that remains is to plot the lines between the respective cities. We loop through each of the edges and call the plotLine function.
for (j in 1:nrow(edges)){
   plotLine(edges[j,]$From,edges[j,]$To,nodes) 
}

getCoords<-function(location,df){
  return(df[match(x=location,df[,"Name"]),][,c(4,3)])
}

plotLine<-function(from,to,df){
  inter<-gcIntermediate(p1=getCoords(from,df),
                      p2=getCoords(to,df),
                      n=50,addStartEnd=TRUE)
  lines(inter, col="green",cex=2,lwd=2)
}
plotLine simply takes each pair of nodes in the edge-list, and uses the gcIntermediate function (as defined in the geosphere package) to join the two nodes with the shortest edge that lies upon a great circle (ie, the geodesic between the two nodes). To simplify this function, I have defined an additional function, getCoords, that takes as arguments a string (the node name) and a data frame (the nodes and their coordinates) and returns the coordinates of that particular node. After the for loop is completed, we arrive at the graph shown in Figure 1.

Conclusion
Nothing here was particularly complicated. We decided what we wanted to plot, collected the data, and used R to handle the data and produce the image. However, the applications for the tools that we have used are practically endless. To that end, I would be keen to hear from people who have been exploiting R's mapping functions across different applications. Ideally, I would like to showcase a range of applications in my next posting. Feel free to leave comments and links to your websites.

Until next time, enjoy being one of the few analysts who has visited the undiscovered country and can make maps in R with ease.