Skip to content

Reading OECD.Stat into R

March 2, 2014

EDIT (17-8-14): I no longer use the XML2R package to pull SDMX data. There is a new package called RSDMX which makes the task described below a lot easier. You can find it and some examples here – https://github.com/opensdmx/rsdmx 

OECD.Stat is a commonly used statistics portal in the research world but there are no easy ways (that I know of) to query it straight from R. There are two main benefits of querying OECD.Stat straight from R:

1. Create reproducible analysis (something that is easily lost if you have to download excel files)

2. Make tweaks to analysis easily

There are three main ways I could see to collect data from OECD.Stat

  1. Find the unique name for each dataset and scrape the html from the dataset’s landing page (e.g. the unique URL for pension operating expenses is http://stats.oecd.org/Index.aspx?DataSetCode=PNNI_NEW). This probably would have been the easiest way to scrape the data but it doesn’t offer the flexibility that the other two options do.
  2. Use the OECD Open Data API . This was the avenue I explored initially but it doesn’t seem that the API functionality is fully built yet.
  3. Use the SDMX query provided under the export tab on the OECD.Stat site. This URL query can be easily edited to change the selected countries, observation range and even datasets.

I went with option 3, using the SDMX query.

To get the query that you need to use in R, navigate to your dataset and click

export -> SDMX (XML) (as per picture below)

sdmx2

Then, copy everything in the ‘SDMX DATA URL’ box

Query

In the example below I am using the trade union density dataset.

Getting the SDMX URL as described above for the trade union dataset gives us a very long URL as it contains a lot countries. I cut it down in this example for clarity to:

http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/UN_DEN/AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD/OECD?startTime=1960&endTime=2012

The important parts of the URL above are

  1. UN_DEN – This is the Trade Union Density dataset code
  2. AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD – Unsurprisingly, this is the list of countries we are querying, you can delete countries or if you know the ISO country codes you can add to it.
  3. startTime=1960&endTime=2012 – Change the date range as you please.

Note that many datasets have a lot more options on offer so there is usually a bunch more junk after the dataset code in the URL.

The following code R code creates a melted data frame, ready for use in analysis or ggplot or rCharts. I make use of Carson Sievert’s XML2R package. All you need to do is paste your own SDMX URL into the relevant spot.

library(XML2R)

file <- "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/UN_DEN/AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD/OECD?startTime=1960&endTime=2012"

obs <- XML2Obs(file)
tables <- collapse_obs(obs)

# The data we care about is stored in the following three nodes
# We only care about the country variable in the keys node
keys <- tables[["MessageGroup//DataSet//Series//SeriesKey//Value"]]
dates <- tables[["MessageGroup//DataSet//Series//Obs//Time"]]
values <- tables[["MessageGroup//DataSet//Series//Obs//ObsValue"]]

# Extract the country part of the keys table
# Have to use both COU and COUNTRY as OECD don't use a standard name
country_list <- keys[keys[,1]== "COU" | keys[,1]== "COUNTRY"]
# The country names are stored in the middle third of the above list
country_list <- country_list[(length(country_list)*1/3+1):(length(country_list)*2/3)]

# Bind the existing date and value vectors
dat <- cbind.data.frame(as.numeric(dates[,1]),as.numeric(values[,1]))
colnames(dat) <- c('date', 'value')

# Add the country variable
# This code maps a new country each time the diff(dat$date)<=0 ...
# ...as there are a different number of readings for each country
# This is not particularly robust
dat$country <- c(country_list[1], country_list[cumsum(diff(dat$date) <= 0) + 1])
#created this as too many sig figs make the rChart ugly
dat$value2 <- signif(dat$value,2)

head(dat)

This should create a data frame for nearly all annualised OECD.Stat data. You will need to do some work with the dates if you want to use more frequently reported data (e.g. quarterly, monthly data).

Once the data is set up like this it is very easy to visualise.

library(ggplot2)
ggplot(dat) + geom_line(aes(date,value,colour=country))

Rplot

Or just as easy but a bit cooler, use rCharts to make it interactive

# library(devtools)
# install_github('rCharts', 'ramnathv', ref = 'dev')
library(rCharts)

n1 <- nPlot(value2 ~ date, group = "country", data = dat, type = "lineChart")
n1$chart(forceY = c(0))

#To publish to a gist on github use this code, it will produce the url for viewing
#(you need a github account)
n1$publish('Union Density over time', host = 'gist')

Link to the interactive is here (or click on the image below)

Union Density

Advertisements

From → R

15 Comments
  1. Anonymous permalink

    Great post. Minor typo: missing closed parenthesis in the ggplot command.

  2. Thanks! Fixed

  3. I whipped up a gist to replicate the scraping in a more robust fashion — https://gist.github.com/cpsievert/9356058

    • Thanks, Carson. I will check it out and implement!

    • Hi Carson, I just tried to implement your update (it looks a lot more robust) but the problem I am having is that there are multiple ‘concepts’ listed under the ‘SeriesKey//Value’ identifier. If you look at this gist it replicates the error whereby it assigns the first SeriesKey//Value value of ‘APLANS’ (under the concept “DTYP”) instead of allocating the “COU” ‘concept’ of Aus. Gist of code is here – https://gist.github.com/timcameron/9594660

      • In this new file, it isn’t very clear (at least to me) how the info under the “SeriesKey” node is related to the information in the “Obs” nodes. In particular, there are six different “concepts” but only three “observations”. Are you sure there is a sensible relation between “concepts” and “observations” in this file?

  4. There appears to be six concepts for each country being observed (no mater how many time observations there are)

    The concept I am interested in is “COU” (the country). The other five just appear to be settings (e.g. unit of currency that ‘MessageGroup//DataSet//Series//Obs//ObsValue’ is measured in)

  5. Hi Carson,

    I tried to use your provided script (thanks in advance!) but I get the following error.

    Error in function (type, msg, asError = TRUE) :
    Could not resolve host:; No data record of requested type

    Am I doing something wrong or did they close this awesome way of data extraction?

    Kind regards,

    Robert

  6. Attilio permalink

    Hi, you could also try using RJSDMX (on CRAN or Github) to get SDMX data. You just need to hit:

    tslist = getSDMX(‘OECD’, ‘UN_DEN/AUS+CAN+FRA+DEU+NZL+GBR+USA+OECD’, start=’1960′, end=’2012′)

    and you’ll get an R list of zoo time series.

    The command is not tied to the OECD provider, you could use it the same way with all the available SDMX providers (e.g. ECB, Eurostat…)

    Cheers

  7. helaldu permalink

    Hi tcamm,

    I tried the code you have provided, but in the country its says .

    date value country value2
    1 2000 85.81791 86
    2 2001 85.85960 86
    3 2002 84.65066 85
    4 2003 84.89245 85
    5 2004 86.77672 87
    6 2005 89.96999 90

    how can fix that issue?

    Thanks in advance.

  8. helaldu permalink

    date value country value2
    1 2000 85.81791 NA 86
    2 2001 85.85960 NA 86
    3 2002 84.65066 NA 85
    4 2003 84.89245 NA 85
    5 2004 86.77672 NA 87
    6 2005 89.96999 NA 90

  9. There is also this package https://github.com/expersso/OECD/blob/master/readme.md to directly get OECD data from within R

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: