Musings of a forgetful functor

Saturday, February 25, 2012

The undiscovered country – a tutorial on plotting maps in R

The ability to handle maps and geospatial images is always a nice trick to have up your sleeve. Almost any sizeable report will contain a map – of a locality, of a country, or of the world. However, very few analysts have the ability to produce these plots for themselves and often resort to using Google Earth snapshots, images that they have drawn in paint, or – heaven forefend – a map that they found on Google Images. Making maps appears to be a skill held only by a few citizens of some undiscovered country, from whose bourn few analysts return.

In this post we chart the shores of the undiscovered country, and show that drawing maps, such as the one shown in Figure 1, is not so difficult. We will make use of R's powerful functions that allow plotting of maps, locations on those maps, and connecting lines that illustrate links between points of interest.

Figure 1: Map of Australia's state capitals. Connections have been selected arbitrarily for illustrative purposes.

We will also make use of Python, as I have found it far easier to use the Python bindings for the GoogleMaps API. Our approach consists of three steps:

Work out what you want to plot: the region, the nodes (ie, the locations), and the edges (ie, the connections between nodes).
Get the coordinates of the nodes.
Read and plot the data in R.

Step 1: What do you want to plot?

As a proud Australian, I am trying to increase the profile of my tiny country. All too often, the maps that I see on R bloggers are of the United States. Granted, the US has 300 million people and, despite the inexorable rise of China, remains the economic super-power of the world. Kudos to Uncle Sam. But Australia can boast a similar landmass, a bucket-load of natural resources, and more species of poisonous snakes and man-eating reptiles than any other country on the planet (a point that we have down-played since the unsuccessful tourism slogan "come to Australia and you might get eaten"). No, this article is going to be printed next to a map of the Land Downunder dammit!

To illustrate how one can plot points, label them, and connect them with lines, we will plot each of Australia's state capitals, and connect selected pairs with lines. These lines will be based on the geodesic between the two points that uses the geosphere package (see here). To complete this task, there are three pieces of input data that you will require: a shape-file, a list of nodes, and an edge-list.

Shapefile
I have used a shape-file located here: http://www.gadm.org/country. This website also contains shape files for other countries and is definitely worth a look.

List of nodes
The nodes could be anything from airports to electricity transmission hubs. In my case, the nodes are just the capital cities of each of the states and territories. The list, which I have stored in the CSV file "CapitalCities.csv", consists of two columns: the name and the state/territory of each city.

Edge-list
The edge-list contains two columns: the 'from' column and the 'to' column. Each entry represents a connection between two nodes. In this instance, the two columns are interchangeable as the network is not directed, however, it would be easy to incorporate this feature. The edge-list that I have used can be accessed here.

Step 2: Get the coordinates of the nodes

Getting coordinates of a specific location is a simple affair thanks to the wonders of Google Maps and Google Earth. The problem is not how to get the data, but instead how to get it programmatically. For ten nodes, you can just as easily do this manually. For 1000 nodes, you can either find some intern to sit in front of Google Earth for the next week, or accept that there must be a better way.

I am sure that there are solutions to this problem that use R, but I have not yet found anything that is as quick and easy as the Python wrappers for the Google Maps API. (If you are unfamiliar with Python, then I advise you to take the time to learn the basics – your time will not have been wasted.) The module can be downloaded at http://pypi.python.org/pypi/googlemaps/ or via easy_install. An API key is needed to interface with the GoogleMaps API, but this is easy enough to obtain – just search for 'Google Maps API key' and follow the instructions.

The following script retrieves the latitude and longitude of each city in the file 'CapitalCities.csv', and copies them into a new file 'CapCoordinates.csv'.

from googlemaps import GoogleMaps
import csv

ifile=csv.reader(open('CapitalCities.csv','rb'),delimiter=',')
ofile=open('CapCoordinates.csv','wb')

w=csv.writer(ofile,delimiter=',')
w.writerow(['Name','State','Latitude','Longitude'])

gmaps=GoogleMaps(API_KEY)

count=0

for row in ifile:
   if count!=0:
      address=str(row[0])+" "+str(row[1])
      lat, lng = gmaps.address_to_latlng(address)
      w.writerow([str(row[0]),str(row[1]),str(lat),str(lng)])
      print row[0],lat,lng

   count+=1

ofile.close()

Step 3: Plotting the map in R

At last we arrive at the R portion of this article. We will need to load four packages: the maps, sp, maptools, and geosphere libraries. Below is the complete code.

# Load necessary packages
library(maps)
library(sp)
library(maptools)
library(geosphere)

# Function that returns coordinates of a given location
getCoords<-function(location,df){
  return(df[match(x=location,df[,"Name"]),][,c(4,3)])
}

# Plot a great circle from 'from' to 'to' names.
plotLine<-function(from,to,df){
  inter<-gcIntermediate(p1=getCoords(from,df),
                      p2=getCoords(to,df),
                      n=50,addStartEnd=TRUE)
  lines(inter, col="green",cex=2,lwd=2)
}

# Read in the Australia Shapefile
ozDATA<-readShapeSpatial("AUS_adm1")

# Read CSV file containing coordinates and edges
nodes<-read.csv(file="CapCoordinates.csv",sep=",",header=TRUE)
edges<-read.csv(file="EdgeListCapCities.csv",sep=",",header=TRUE)

# Plot the graph itself
plot(ozDATA,lwd=0.01,bg="#242424",col="#0000CD",ylim=c(-46,-10),xlim=c(125,145))

# Plot the nodes on the graph
points(nodes$Longitude,nodes$Latitude,pch=16,cex=1.5)

# Label the nodes
text(nodes$Longitude,nodes$Latitude,nodes$Name,cex=1,adj=0,pos=2,col="#C2C2C2")

# Plot each of the edges between nodes
for (j in 1:nrow(edges)){
   plotLine(edges[j,]$From,edges[j,]$To,nodes) 
}

Line-by-line walkthrough

We start by reading in the Australia shape-file using the readShapeSpatial function from the sp package.

ozDATA<-readShapeSpatial("AUS_adm1")

Then we read in our list of nodes and freshly obtained coordinates, as well as the edge list.

nodes<-read.csv(file="CapCoordinates.csv",sep=",",header=TRUE)
edges<-read.csv(file="EdgeListCapCities.csv",sep=",",header=TRUE)

Having loaded all the necessary data, we can start plotting.

plot(ozDATA,lwd=0.01,bg="#242424",col="#0000CD",ylim=c(-46,-10),xlim=c(125,145))

We plot the shape-file using the plot function with the following arguments:

lwd – the line width;
bg – the background colour;
col – the colour of the regions (usually landmasses);
ylim – a 2-tuple containing the latitude boundaries for the plot; and
xlim – a 2-tuple containing the longitude boundaries for the plot.

Now we want to plot the cities (ie, the nodes) on this map. To do this we simply pass the coordinates to the points function. The parameters pch and cex set the type and size of the marker, respectively.

points(nodes$Longitude,nodes$Latitude,pch=16,cex=1.5)

Next we want to add labels to each of the nodes.

text(nodes$Longitude,nodes$Latitude,nodes$Name,cex=1,adj=0,pos=2,col="#C2C2C2")

The text function is used in almost identical fashion to the points function, but with an additional argument that contains the names of the nodes. We also add two additional arguments: adj and pos. The adj parameter adjusts the location of the labels. In this instance we have set it to zero, as it will be overwritten by our specification of pos. The pos parameter takes a value between 1 and 4 that specifies whether the label will be below, to the left, above, or to the right of the specified coordinates. (In this instance we have chosen to have labels to the left of the nodes.)

All that remains is to plot the lines between the respective cities. We loop through each of the edges and call the plotLine function.

for (j in 1:nrow(edges)){
   plotLine(edges[j,]$From,edges[j,]$To,nodes) 
}

getCoords<-function(location,df){
  return(df[match(x=location,df[,"Name"]),][,c(4,3)])
}

plotLine<-function(from,to,df){
  inter<-gcIntermediate(p1=getCoords(from,df),
                      p2=getCoords(to,df),
                      n=50,addStartEnd=TRUE)
  lines(inter, col="green",cex=2,lwd=2)
}

plotLine simply takes each pair of nodes in the edge-list, and uses the gcIntermediate function (as defined in the geosphere package) to join the two nodes with the shortest edge that lies upon a great circle (ie, the geodesic between the two nodes). To simplify this function, I have defined an additional function, getCoords, that takes as arguments a string (the node name) and a data frame (the nodes and their coordinates) and returns the coordinates of that particular node. After the for loop is completed, we arrive at the graph shown in Figure 1.

Conclusion
Nothing here was particularly complicated. We decided what we wanted to plot, collected the data, and used R to handle the data and produce the image. However, the applications for the tools that we have used are practically endless. To that end, I would be keen to hear from people who have been exploiting R's mapping functions across different applications. Ideally, I would like to showcase a range of applications in my next posting. Feel free to leave comments and links to your websites.

Until next time, enjoy being one of the few analysts who has visited the undiscovered country and can make maps in R with ease.

Tuesday, December 27, 2011

Web scraping with Python - the dark side of data

In searching for some information on web-scrapers, I found a great presentation given at Pycon in 2010 by Asheesh Laroia. I thought this might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites. The presentation can be found here:

http://python.mirocommunity.org/video/1616/pycon-2010-scrape-the-web-stra

Highlights (at least from my perspective)

Screen scraping is not about regular expressions. It is just too hard to use pattern matching for these tasks, as the tags can change regularly and have significant maintenance issues.
BeautifulSoup is the go-to html parser for poor quality source. I have used this in the past and am pleased to hear that I was not too far off the money!
Configuration of User Agent settings is discussed in detail, as well as other mechanisms that websites exploit to stop you from scraping content
Good description of how to use the Live HTTP Headers add-on for Firefox.
A thought-provoking discussion about APIs, and comments that suggest that their maintenance and support is woefully inadequate. I was interested to hear his views, as they imply that scraping may be the only alternative when you really need data that is highly inaccessible.

Other notes

The mechanise package features heavily in the examples for this presentation. The following link provides some good examples of how to use mechanise to automate forms:
http://wwwsearch.sourceforge.net/mechanize/forms.html

There was also some mention of how Javascript causes problems for web scrapers, although this problem can be overcome via the use of web-drivers such as Selenium (see http://pypi.python.org/pypi/selenium) and Watir. I have used safari-watir before, and from my experience it can perform many complex data gathering tasks with relative ease.

Please feel free to post your comments about your experiences with screen scraping, and other tools that you use to collect web data for R.

Saturday, July 2, 2011

The R apply function – a tutorial with examples

Today I had one of those special moments that is uniquely associated with R. One of my colleagues was trying to solve what I term an 'Excel problem'. That is, one where the problem magically disappears once a programming language is employed. Put simply, the problem was to take a range, and randomly shift the elements of the list in order. For example, 12345 could become 34512 or 51234.

The list in question had forty-thousand elements, and this process needed to be repeated numerous times as part of a simulation. Try doing this in Excel and you will go insane: the shift function is doable but resource intensive. After ten minutes of waiting for your VBA script to run you will be begging for mercy or access to a supercomputer. However, in R the same can be achieved with the function:

translate<-function(x){
  if (length(x)!=1){
    r<-sample(1:(length(x)),1)
    x<-append(x[r:length(x)],x[1:r-1])
  }
  return(x)

My colleague ran this function against his results several thousand times and had the pleasure of seeing his results spit out in less than thirty seconds: problem solved. Ain't R grand.

More R magic courtesy of the apply function

The translate function above is not rocket science, but it does demonstrate how powerful a few lines of R can be. This is best exemplified by the incredible functionality offered by the apply function. However, I have noticed that this tool is often under-utilised by less experienced R users.

The usage from the R Documenation is as follows:

apply(X, MARGIN, FUN, ...)

where:

X is an array or matrix;
MARGIN is a variable that determines whether the function is applied over rows (MARGIN=1), columns (MARGIN=2), or both (MARGIN=c(1,2));
FUN is the function to be applied.

In essence, the apply function allows us to make entry-by-entry changes to data frames and matrices. If MARGIN=1, the function accepts each row of X as a vector argument, and returns a vector of the results. Similarly, if MARGIN=2 the function acts on the columns of X. Most impressively, when MARGIN=c(1,2) the function is applied to every entry of X. As for the FUN argument, this can be anything from a standard R function, such as sum or mean, to a custom function like translate above.

An illustrative example
Consider the code below:

# Create the matrix
m<-matrix(c(seq(from=-98,to=100,by=2)),nrow=10,ncol=10)

# Return the product of each of the rows
apply(m,1,prod)

# Return the sum of each of the columns
apply(m,2,sum)

# Return a new matrix whose entries are those of 'm' modulo 10
apply(m,c(1,2),function(x) x%%10)

In the last example, we apply a custom function to every entry of the matrix. Without this functionality, we would be at something of a disadvantage using R versus that old stalwart of the analyst: Excel. But with the apply function we can edit every entry of a data frame with a single line command. No autofilling, no wasted CPU cycles.

In the next edition of this blog, I will return to looking at R's plotting capabilities with a focus on the ggplot2 package. In the meantime, enjoy using the apply function and all it has to offer.