Tuesday, December 27, 2011

Web scraping with Python - the dark side of data

In searching for some information on web scrapers, I found a great presentation given at PyCon 2010 by Asheesh Laroia. I thought this might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites. The presentation can be found here:

http://python.mirocommunity.org/video/1616/pycon-2010-scrape-the-web-stra

Highlights (at least from my perspective)
  • Screen scraping is not about regular expressions. Pattern matching is simply too fragile for these tasks: the markup changes regularly, so regex-based scrapers carry significant maintenance costs.
  • BeautifulSoup is the go-to HTML parser for poor-quality source. I have used this in the past and am pleased to hear that I was not too far off the money!
  • Configuration of User-Agent settings is discussed in detail, as are other mechanisms that websites use to stop you from scraping their content.
  • Good description of how to use the Live HTTP Headers add-on for Firefox.
  • A thought-provoking discussion about APIs, and comments suggesting that their maintenance and support are woefully inadequate. I was interested to hear his views, as they imply that scraping may be the only alternative when the data you need is otherwise inaccessible.
Other notes

The mechanize package features heavily in the examples for this presentation. The following link provides some good examples of how to use mechanize to automate forms:
http://wwwsearch.sourceforge.net/mechanize/forms.html

There was also some mention of how JavaScript causes problems for web scrapers, although this can be overcome with web drivers such as Selenium (see http://pypi.python.org/pypi/selenium) and Watir. I have used safari-watir before, and in my experience it can perform many complex data-gathering tasks with relative ease.

Please feel free to post your comments about your experiences with screen scraping, and other tools that you use to collect web data for R.

Saturday, July 2, 2011

The R apply function – a tutorial with examples

Today I had one of those special moments that is uniquely associated with R. One of my colleagues was trying to solve what I term an 'Excel problem'. That is, one where the problem magically disappears once a programming language is employed. Put simply, the problem was to take a list of values and rotate its elements by a random offset, keeping their order. For example, 12345 could become 34512 or 51234.

The list in question had forty-thousand elements, and this process needed to be repeated numerous times as part of a simulation. Try doing this in Excel and you will go insane: the shift function is doable but resource intensive. After ten minutes of waiting for your VBA script to run you will be begging for mercy or access to a supercomputer. However, in R the same can be achieved with the function:
translate<-function(x){
  # Rotate the vector by a random offset, preserving the order of the elements
  if (length(x)!=1){
    # Pick a random cut point
    r<-sample(seq_along(x),1)
    # Move the tail (from position r onwards) to the front;
    # seq_len(r-1) is empty when r==1, leaving the vector unchanged
    x<-c(x[r:length(x)],x[seq_len(r-1)])
  }
  return(x)
}
My colleague ran this function against his results several thousand times and had the pleasure of seeing his results spit out in less than thirty seconds: problem solved. Ain't R grand.
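For what it is worth, here is a minimal usage sketch (the vector and the number of repetitions are invented for illustration, and far smaller than my colleague's forty-thousand-element list); replicate simply runs translate over and over, as you would in a simulation:
# A minimal sketch with made-up inputs; each column of 'rotations' is
# one randomly rotated copy of the original vector
x<-1:10
rotations<-replicate(5,translate(x))
rotations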

More R magic courtesy of the apply function
The translate function above is not rocket science, but it does demonstrate how powerful a few lines of R can be. This is best exemplified by the incredible functionality offered by the apply function. However, I have noticed that this tool is often under-utilised by less experienced R users.

The usage from the R Documentation is as follows:
apply(X, MARGIN, FUN, ...)

where:
  • X is an array or matrix;
  • MARGIN is a variable that determines whether the function is applied over rows (MARGIN=1), columns (MARGIN=2), or both (MARGIN=c(1,2));
  • FUN is the function to be applied.
In essence, the apply function allows us to make entry-by-entry changes to data frames and matrices. If MARGIN=1, the function accepts each row of X as a vector argument, and returns a vector of the results. Similarly, if MARGIN=2 the function acts on the columns of X. Most impressively, when MARGIN=c(1,2) the function is applied to every entry of X. As for the FUN argument, this can be anything from a standard R function, such as sum or mean, to a custom function like translate above.

An illustrative example
Consider the code below:
# Create the matrix
m<-matrix(c(seq(from=-98,to=100,by=2)),nrow=10,ncol=10)

# Return the product of each of the rows
apply(m,1,prod)

# Return the sum of each of the columns
apply(m,2,sum)

# Return a new matrix whose entries are those of 'm' modulo 10
apply(m,c(1,2),function(x) x%%10) 

In the last example, we apply a custom function to every entry of the matrix. Without this functionality, we would be at something of a disadvantage using R versus that old stalwart of the analyst: Excel. But with the apply function we can edit every entry of a data frame with a single line of code. No autofilling, no wasted CPU cycles.
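As one more hedged illustration (the scores matrix below is invented purely for this example), the same mechanism lets us reuse a custom function such as translate on every row of a matrix:
# A small sketch assuming the translate function from above is loaded;
# the 'scores' matrix is made-up data
scores<-matrix(1:20,nrow=4,ncol=5)

# apply passes each row to translate as a vector; the rotated rows come
# back as columns, so we transpose to restore the original shape
rotated<-t(apply(scores,1,translate))
rotated

The transpose is needed because apply assembles the row-wise results as the columns of its output.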

In the next edition of this blog, I will return to looking at R's plotting capabilities with a focus on the ggplot2 package. In the meantime, enjoy using the apply function and all it has to offer.

Friday, June 24, 2011

Multiple plots in R: lesson zero

Today, in one of my more productive days, I managed to create a sleek R script that plotted several histograms in a lattice, allowing for easy identification of the underlying trend. Although the majority of the time taken consisted of collecting the data and making various adjustments, it took a not inconsiderable amount of work to write the code.

As I was cursing the apply function – not for the last time I am sure – I suddenly realised the insane level of productivity that I have come to see as "par for the course". The level of computational analysis that can be conducted in a few hours with nothing more than a desktop PC, a broadband connection, and copious amounts of caffeine is phenomenal. No longer can a lack of computing power or software be blamed for a lack of productivity growth: rate-limiting factors are now exclusively human.

Here at least is my contribution to the collective intelligence of biologicals. I have noticed that the most common reason that people avoid R is that they cannot rapidly make graphs that meet the high standards of their clients. In the next few editions of this blog we will build up a basic repertoire of plotting techniques, focusing on graphics that are sickeningly impressive.

Of course, Rome was not built in a day, and a thorough knowledge of R plotting cannot be built in one. Instead we will progress one layer at a time, adding additional levels of complexity and functionality. We start with a simple script that allows us to plot several graphs at the same time, each with a different value of a key variable. Here is the output:

Figure 1: Multiple plots using par

The code
# Clear all objects
rm(list=ls())

# Create a data set using random variables
df<-data.frame(x=rnorm(160),y=runif(160),
               a=sample(c(1,0,-1,10),size=160,replace=TRUE))

df$z<-with(df,{3*a*y+x})

# Create a function that plots the value of "z" against the "y" value
plotM<-function(l){
  
  df.temp<-df[df$a==l,]
  plot(df.temp$y,df.temp$z,xlab="Y Value",ylab="Z Value",
  main=paste("Value of key variable: ",toString(l)))
  abline(lm(df.temp$z~df.temp$y),col="red")
  
}

# Create a grid to plot the different values of "a"
par(mfrow=c(2,2))

# Loop through each value of "a" and call the plotM function
for (i in c(1,0,-1,10)){
    plotM(i)
}
  
Walk-through
The first few lines of code create a data frame with 4 variables: x, y, z and a. Three of these variables are randomly generated, with the z variable dependent upon the other 3. Suppose that we are analysing this data set, unaware of the relationship between the x, y and z variables. A preliminary inspection shows that there are only 4 observed values of a. It seems sensible to plot z against y for each of these different a values.

The relevant code to create this plot starts with the function "plotM", standing for "plot Multiple". This function takes an argument "l" that determines which value of the variable a we are plotting. The line

df.temp<-df[df$a==l,]

filters the data frame to include only those rows where the variable a=l. The next line simply creates a standard R plot of z against y. Finally, we use the abline function to add a linear fit that highlights the trend. Too easy.

Now we come to the fun part. Using par(mfrow=c(2,2)), we create a 2x2 grid in which to place the next four plots. Whatever plots we now create will be placed sequentially into this grid. Hence we can iterate over a vector containing the values of a, calling plotM each time. This gives us our grid.
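One housekeeping note, offered as a hedged aside rather than as part of the script above: par returns the previous settings when you change them, so you can restore the single-panel layout once the grid is full:
# A minimal sketch: par returns the old settings, so we can put the
# device back to one plot per page after filling the 2x2 grid
old.par<-par(mfrow=c(2,2))

for (i in c(1,0,-1,10)){
    plotM(i)
}

# Restore the previous single-panel layout
par(old.par)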

Comments
Okay, so the graph does not look like it came from NASA – or, to be honest, with NASA's ailing reputation maybe it does. But note that once we have created the plotM function, we only have to write 3 lines of code to make 4 separate charts. Moreover, the amount of code would not increase even if we were plotting 100 charts.

Of course, we have not yet even drawn upon any of R's custom plotting packages. In the next edition of this blog we will look at how to use the ggplot2 package to add colour and a wide range of other features to our graphs.

Monday, May 16, 2011

Keep It Linear Stupid: the cost of achieving perfection under pressure

"That gives you 48 hours to develop a methodology – since we haven't done one of these before – prototype the model, build it, calibrate it to client expectations, and write the report. But don't worry, the client is paying a premium for the rush."

Every analyst has been in this sort of situation. The client wants something now, and your boss has agreed to deliver it. The trouble is that before that perfectly crafted 3-line email is sent, accompanied by a hefty invoice, you actually have to build the thing. What is more, lack of time is no excuse for sloppy work. No mistakes; after all, we're professionals.

Working to a deadline
The Mythical Man-Month is the best book about managing projects to a deadline that I have ever read. Although focused on software engineering, the book has lessons that apply to project management in any technical field. The author, Frederick P. Brooks, provides a mathematical argument for why adding manpower to a technical project that is running late makes it later (Brooks' Law). The problem is that the time spent getting new people up to speed (initiation) and co-ordinating the extra manpower (co-ordination) exceeds the additional productivity they bring. There is a fixed time-cost to incorporate a new worker onto a project. Unless the project has a long way to go, adding people just slows it down.

An obvious corollary to Brooks' Law is that there is a minimum time required to complete any project, and that this minimum is directly proportional to the project's complexity. Moreover, this minimum time corresponds to a unique number of workers: beyond some point, additional manpower has decreasing returns to scale because of co-ordination and initiation costs. What does this mean for your 48-hour rush-job? It is a one-man, or at most two-man, show. Looks like you will be staying back in the office on your own.

No mistakes...well sort of
The big consultancy houses and banks have built a powerful image for their industries over time. That image is in every logo, every website, every banner at a conference. The image is one of flawlessness, of crystalline perfection, of exacting standards.

This notion of a clinical, relentless pursuit of perfection has managed to convince the non-technical members of society that for the right price perfection is not only achievable, but the norm. On the first day of my first job, I remember my boss emphasising the many processes that were in place to prevent errors. My boss was, naturally, a manager. Managers do not run the numbers: if no-one finds an error then they assume there were none to find. I rapidly realised that although it was unacceptable for errors to be discovered, it was perfectly acceptable for errors to exist.

A one page document can contain perfect spelling; a single table of data can be copied exactly from another source. But the problem for the analyst is managing complexity. Large documents are more likely to house inconsistencies; models that attempt to capture more variables are more prone to calculation errors. The Law of Large Numbers sucks, doesn't it.

Of course, the notion of perfection is somewhat flexible for the rush-job. The odd inaccurate assumption is forgivable, as is a minor inconsistency in the results. But if there is a major mistake, don't dare blame the deadline. So how do you avoid mistakes during a rush-job? The answer is a variation of the "Keep It Simple Stupid" (KISS) rule.

Keep it Linear Stupid
Let's assume for a second that you are young and naive, some might even say a cock-eyed optimist who has gotten caught up in the world of international modelling intrigue. Despite the impending deadline, you nevertheless are determined to capture every aspect of the phenomenon at hand. One crucial variable is clearly quadratic, and to assume it is linear would be a travesty against all that mathematics stands for. Was it all for nothing, Archimedes?

You have a choice in front of you: you pick the blue line – the complexity ends, you wake up in your bed and believe whatever you want to believe. You pick the red line – you stay in mathematical wonderland and this model is going to show you just how deep the rabbit-hole goes. You are young, you are fearless, you are stupid – you choose the red line.
"I know what you're asking yourself: why oh why didn't I pick the blue line?"

Down the rabbit-hole
Okay, so it's not as bad as waking up in a petri dish to find that you have a massive FireWire port in the back of your skull. But by 2 am on the second day, you are going to wish that you could go back and choose that blue line – I guarantee it.

You see, the problem with a rush job is that you don't have time to go back. The client expects the results to make sense, and there is always one number that can't be explained by anything other than an error. But when you go to iron it out, that red line comes back to haunt you. Every tiny alteration to the inputs results in a non-linear change to the results.

Ordinarily this would be fine: given time, you would be able to rework the model, but time is precisely what you don't have. Every time you pin down one set of numbers, another one jumps out of its place. Non-linearity is hard to explain and even harder to defend, especially when you know that there may be errors. If you are lucky, you will be able to explain away the aberration. If not, you have a report that may as well be lit with a neon sign saying "Error: Division by zero".

The price of sanity
Okay, so you learned your lesson. Next time you get a rush job, you restrict your analysis to the one-dimensional case (and even that seems gutsy), tether your key variables to a stake in the numerical equivalent of Alcatraz, and assume that everything from the inflation rate to the diffusion of information throughout the economy can be approximated by a linear function. Nice job.

You have done well by your boss, you have done well by yourself. You have even done right by your client, as they have the report they wanted. But what good is it? The purpose of quantitative analysis is to enlighten decision making, to add something that could not have been uncovered by logic alone. Once you have removed all complexity, you have also removed all value. You have created consistency and clarity, but at a vulgar cost. In truth, you have obscured more than you have explained.

An analyst often has little choice in the matter at the time, but they can make clear how banal the results of these sorts of exercises are. More time may be an option; absolute internal consistency may not be essential. If the alternative is simplification ad absurdum, it may be better to pass on the job altogether.

Keep it simple, keep it trivial. Keep it linear? Stupid.

Wednesday, April 27, 2011

Getting a first-class education without the luxury price tag: iTunes University

Permit me to paint a picture. You have just finished your undergraduate degree in a business or quantitative discipline (eg, economics, computer science, or mathematics). You finished with first-class honours, so you have landed a great job in a medium-sized firm. You are fed up to the teeth with formal study, and are looking to gain some experience in the real world and finally repay your student loans. The time has come for your study to start paying off...or has it?

The problem is that in the modern world there is an arms race for qualifications. As demand for skilled graduates has increased, so has the number of students completing university, and as a result more and more members of the workforce are pursuing advanced degrees. The trend has been further fuelled by behemoth corporate factories financing postgraduate education. Masters degrees have become the norm, and no sooner have students left university than they are back, this time in the evening after work.

I am highly sceptical of these Masters programs, particularly those in my home country of Australia. The entry requirements for these degrees are extremely low, most notably in my own area of applied mathematics. Demand for postgraduate qualifications in mathematical finance is so great that universities have slowly lowered the level of prerequisite knowledge. Many of these students have barely completed first year mathematics, but are placed in classes alongside honours students (who often coach them through the course).

As the quality of candidature drops, so does the quality of the course. Students are so concerned about passing that they have no interest in engaging with the material. Courses are crammed into a single evening, usually a three-to-four-hour block, allowing those who work full-time to attend with minimal inconvenience. Of course, by halfway through the second hour no one is paying attention, and everyone is secretly hoping that an impromptu visit from the fire department will put everyone out of their misery.

If you are looking for an extra line on your resume then go right ahead and sign up to one of these programs. But what if you genuinely want to gain further education? I guess you can always consider a PhD. It still represents a real qualification that is going to last you for the duration (we hope!). But if you were squeamish about signing up for a Masters, a PhD is likely to make you throw your rifle into the cornfield and run for the hills. After all, you don't want to become an academic, you just want to ensure that you continue learning.

Until recently the obvious answer was self-education and, for all intents and purposes, it still is. Picking up a few books on your area of expertise (eg, computational economics, operations research, stochastic analysis) and working through them as though you were embarking on a PhD is a reasonable way to simulate the experience. Many of my friends who have stayed at university to complete doctorates have noted that they could do most of their work at home, provided that they were still funded to do so. Of course, the greatest benefits of being enrolled in a formal program stem from interaction with other researchers, and this is hard to replicate on your own. However, there are also advantages to working in industry where you are constantly faced with real world problems that can direct and focus your research.

It is, however, this focus and structure that is so often lacking from self-education. With the advent of the internet, this too can be achieved without needing to drop everything to go back to university. The solution is now freely available on iTunes.

Truly open universities: iTunes U
The availability of online university courses has exploded in the last year. My first exposure to online courses was through MIT OpenCourseWare. I watched a series of lectures on Linear Algebra to supplement a course that I was studying at the time. To be honest, the course I was taking was far more in depth and of a far higher quality, but I really benefited from being able to watch a completely different course in its entirety and from the comfort of my own home. What was so revolutionary about MIT OpenCourseWare was its attempt to give online users the full course experience. Every lecture was posted online, and all class materials were available for download. It's not exactly like being enrolled, but it is about 90% of the way there. The problem was that only a very limited number of courses were available. The 101 course was there but 102 often was not. Darn.

In the last 12 months, however, more and more courses have become available online. Some of the best are offered by Stanford, MIT, and Yale and are all available at the iTunes Store in the iTunes University section (iTunes U). Below is a short list of some of the highlights:
  • Convex Optimization (A and B) – Taught by Stephen Boyd – Stanford University
  • Financial Markets – Taught by Robert Shiller – Yale University
  • Information and Entropy – Taught by Paul Penfield & Seth Lloyd – MIT
Sure, you are not going to be able to replicate a degree course-for-course, but chances are you can cover most of the main subjects and fill in the gaps with self-study by reading through the recommended texts.

Better than the real thing – the multiplier effect
The major advantage of online courses is that they are free and thus available to the financially challenged (or alternatively those who don't want to fork out a hundred grand for a degree). As a consequence of this, these courses can actually be better than the real thing. Pray tell how so, I hear you ask?

I found during my university degree that it was sometimes subjects outside of one's discipline that were the most valuable. In my case, I gained more from a couple of well-chosen courses in computer science than I would have from a year of further mathematics. I like to think of cross-disciplinary study as having a multiplier effect – the military meaning as opposed to the economic term. On the battlefield snipers are often referred to as having a multiplier effect, as the presence of a sniper increases the effectiveness of all other members of the unit. In the same way, a basic training in a neighbouring discipline can leverage your existing knowledge.

As online courses can be taken without cost, they are custom made for this type of 'field-hopping' (tell me if you come up with a better term). With no program requirements or restrictions on your choice of electives, you are free to study what you like. I have found that this improves the learning experience, and provides a truly liberal education.

Of course, you still have to put in the hours, do the tutorial exercises, and gain mastery of the material. Whatever degree you enroll in, no matter how prestigious, the final responsibility will always rest with you. Sandstone spires and grassy quadrangles do not a genius make. As online courses increase in quality and number, I expect that this simple truth will become ever more evident. Make sure you aren't left behind!

Thursday, April 21, 2011

Survival skills for today's analyst

I suffer a little from the age-old affliction of contrarianism. If a software package is used by the majority of the population, I assume it is flawed, highly limited, and its continued use will ultimately result in the downfall of the human race. Conversely, I am always extremely interested in a piece of software that has spread no further than the ivory tower in which it was first conceived.

The most longstanding example of this is my profound preference for the statistical computing language, R, over Microsoft Excel – a program in which I have begrudgingly developed an extremely high level of expertise. As every analyst knows, in the world of statistical software Excel is like McDonald's, Burger King, Pizza Hut, and KFC all rolled into one. It is so prepackaged and devoid of customisation, yet so ubiquitous that we cannot do without it. Like the fast-food chains, we loathe Excel because it always produces the same graphs, the same simple statistical analyses. Yet when we find ourselves lost in a strange, unfriendly foreign country we go running back to the grid lines of Excel to order a Big Mac. As soon as we enter the jungle, our survival skills are found wanting.

Yes my friends, like it or not, Excel is here to stay although not for lack of alternatives. The fact is that it is the user-friendly nature of this program that has been the key to its success. A friend of mine once put it thus: "Excel has allowed a generation of knowledge workers to survive without being able to program."

In truth, the driving force behind Excel's success is simple: Excel is easy. Oh I know that the die-hards will talk about how it is a superior visual tool, and that spreadsheets allow for increased transparency in financial models. But this argument falls flat on its face when we introduce macros to the equation: if spreadsheets are about transparency, then why do we add VBA scripts that the user can neither see nor understand? And if we are happy to use scripts at some level, why on earth do we need to do everything else in a cumbersome visual environment?

Furthermore, the simple ends to which Excel users put their tools are demonstrated by the tiny fraction of users who use the (admittedly limited) functionality afforded by VBA. That so many users can get by without loops, functions, or any notion of encapsulation is testament to the primitive uses to which Excel is put: it is just a big button calculator with an autofill feature. Surely there are more skills that we need to survive in the analytical jungle.

Finding an alternative
As I write this, I am sure that I have just alienated the entire community of so-called "Power Excel Users". But many engineers, economists, and scientists will surely agree that Excel is too limited to be the only quantitative tool that you have available in your office. The problem is finding software that you can successfully use in an office environment, and that is worth investing the time in learning.

The most important obstacle to overcome is the cost barrier. One of my friends, Rex, works for a major insurance group in their risk division. As far as I can tell, there are few organisations as willing to shell out money on analytical software as an insurance company. As a result, Rex regularly tells me about the wonderful software package that they just bought for $X million. These packages are highly customised and very user friendly (that's why they cost big dollars). The problem is, what happens when the company's systems change, or when you need to solve a new problem? Moreover, how does Rex do his job when he no longer has access to the software (ie, if he moves to another job)? The cost of these highly customised packages means that they are not useful tools to acquire for your repertoire. As a rule of thumb, if it costs more than the latest version of Excel then assume that it is not portable: you cannot take it with you.

Enter R
Since I am a contrarian, I am sure that my advice should be taken with a grain (if not a barrel) of salt. However, I believe that there is now a viable alternative to Excel: R. R has been around for a long time, but it has taken a while to gain the following that it so rightfully deserves.

R is completely free and thus available at your fingertips wherever you go. No need to negotiate with the boss about breaking the budget for some fancy new piece of software. Download the binary, install it, and you are good to go. The advantage of this is not just that it is freely available, but that you can rely on it being available.

That just leaves its functionality, and my friends the good news is that R has functionality in spades. Take a quick look at its graphical features and you will see that almost any chart or graph you can dream of can be generated in R. In addition, the R community is continually adding new packages with new functions. In the last few years, the development of these packages has exploded in line with growth in the user base.

Transcending Excel and transitioning to R
Having used R for a reasonable amount of time, I find it hard to see why other analysts struggle day-in day-out with Excel. However, the great barrier to using R is that it is one step closer to all-out coding. Run through an interpreter, R seems strange and frightening to the non-programmer. If you have never learned a programming language, then chances are it will take you some time to shift to R.

Another issue is the need for other people to have the ability to review, check, and edit your work. Unless your boss is up to speed with R or is willing for your work to be checked by another R-literate colleague, you may have to stick with Excel for the moment.

There is, however, great scope for the analyst to grow their organisation into R over time. Whenever you are asked to do a self-contained piece of work independently, try doing it in R. I tend to go overboard and try to create advanced graphics that showcase R's capabilities. The majority of the time, people ask how I made the graph and are then keen to see what else R can do.

Into the jungle
As the old saying goes, "to the man that has only a hammer, every problem looks like a nail". At the moment, there are an awful lot of organisations that are filled with people who only have Excel and every problem sure looks like a spreadsheet.

I believe that analysts that fail to expand their toolkit tend to lose the ability to solve new problems. The generation of knowledge workers who are now in their 40s may have been lucky enough to survive on nothing more than their spreadsheet skills. However, as a twenty-something making my way in the business world, I cannot see how an analyst will be able to survive without some high-powered programming in their utility belt. R may not be enough on its own, but it seems like a good starting point.

Good luck in the analytical jungle.