Chapter 4 Get existing data

There are multiple ways to get data into R. In this chapter we introduce different strategies for getting data into R from a variety of political data sources. First, we look at data included in packages. Second, we show how you can find datasets online and introduce a resource with links to many political datasets. Third, we introduce a series of packages that make it easy to get data into R.

Throughout the chapter we will use the tidyverse package, so make sure to load it.

library("tidyverse")

4.1 Using data from data packages

A lot of the packages we are working with, including packages in the tidyverse, include datasets. To illustrate this, we will be using the package poliscidata (see note 1 at the end of this chapter). The first thing we will need to do is to install the package.

install.packages("poliscidata")

Next, we will need to load the package with library().

library("poliscidata")

There are multiple datasets in the poliscidata package. We will focus on the dataset states, a dataset with variables about the 50 states in the United States. We use the function names() to get a list of all variables in the data frame states (it takes up a lot of space but gives an indication of the variety of variables in the data frame).

names(states)
  [1] "abort_rank3"       "abortion_rank12"   "adv_or_more"      
  [4] "ba_or_more"        "cig_tax12"         "cig_tax12_3"      
  [7] "conserv_advantage" "conserv_public"    "dem_advantage"    
 [10] "govt_worker"       "gun_rank3"         "gun_rank11"       
 [13] "gun_scale11"       "hr_cons_rank11"    "hr_conserv11"     
 [16] "hr_lib_rank11"     "hr_liberal11"      "hs_or_more"       
 [19] "obama2012"         "obama_win12"       "pop2000"          
 [22] "pop2010"           "pop2010_hun_thou"  "popchng0010"      
 [25] "popchngpct"        "pot_policy"        "prochoice"        
 [28] "prolife"           "relig_cath"        "relig_prot"       
 [31] "relig_high"        "relig_low"         "religiosity3"     
 [34] "romney2012"        "smokers12"         "stateid"          
 [37] "to_0812"           "uninsured_pct"     "abort_rate05"     
 [40] "abort_rate08"      "abortlaw3"         "abortlaw10"       
 [43] "alcohol"           "attend_pct"        "battle04"         
 [46] "blkleg"            "blkpct04"          "blkpct08"         
 [49] "blkpct10"          "bush00"            "bush04"           
 [52] "carfatal"          "carfatal07"        "cig_tax"          
 [55] "cig_tax_3"         "cigarettes"        "college"          
 [58] "conpct_m"          "cons_hr06"         "cons_hr09"        
 [61] "cook_index"        "cook_index3"       "defexpen"         
 [64] "demhr11"           "dem_hr09"          "demnat06"         
 [67] "dempct_m"          "demstate06"        "demstate09"       
 [70] "demstate13"        "density"           "division"         
 [73] "earmarks_pcap"     "evm"               "evo"              
 [76] "evo2012"           "evr2012"           "gay_policy"       
 [79] "gay_policy2"       "gay_policy_con"    "gay_support"      
 [82] "gay_support3"      "gb_win00"          "gb_win04"         
 [85] "gore00"            "gun_check"         "gun_dealer"       
 [88] "gun_murder10"      "gun_rank_rev"      "gunlaw_rank"      
 [91] "gunlaw_rank3_rev"  "gunlaw_scale"      "hispanic04"       
 [94] "hispanic08"        "hispanic10"        "indpct_m"         
 [97] "kerry04"           "libpct_m"          "mccain08"         
[100] "modpct_m"          "nader00"           "obama08"          
[103] "obama_win08"       "over64"            "permit"           
[106] "pop_18_24"         "pop_18_24_10"      "prcapinc"         
[109] "region"            "relig_import"      "religiosity"      
[112] "reppct_m"          "rtw"               "secularism"       
[115] "secularism3"       "seniority_sen2"    "south"            
[118] "state"             "to_0004"           "to_0408"          
[121] "trnout00"          "trnout04"          "unemploy"         
[124] "union04"           "union07"           "union10"          
[127] "urban"             "vep00_turnout"     "vep04_turnout"    
[130] "vep08_turnout"     "vep12_turnout"     "womleg_2007"      
[133] "womleg_2010"       "womleg_2011"       "womleg_2015"      
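To see which datasets a package ships before picking one, base R's data() function can list them. The following is a minimal sketch that uses the built-in datasets package as a stand-in, so that it runs without extra installs. Replacing "datasets" with "poliscidata" is assumed to work the same way once that package is installed.

```r
# List the datasets bundled with a package.
# "datasets" ships with R; swap in "poliscidata" after installing that package.
available <- data(package = "datasets")$results

# Each row describes one dataset: its name ("Item") and a short title ("Title").
head(available[, c("Item", "Title")])
```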

While the data is available once the package is loaded, it does not appear in the Environment window. To make the data frame show up there, we can assign states to an object of the same name.

states <- states

Now we can see in the Environment window that we have 50 observations of 135 variables. We will be using this data later; for now we simply confirm that it contains actual data. Using the table() function we can show the distribution of observations in the gay_policy variable, which measures Billman's policy scale (4 ordinal categories).

table(states$gay_policy)

     Most liberal           Liberal      Conservative Most conservative 
                6                14                10                20 

Here we see that 6 states have a most liberal score, 14 have a liberal score, 10 have a conservative score, and 20 have a most conservative score.
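Counts are often easier to compare as shares. The following is a minimal sketch that retypes the four counts from the table above by hand, so that it runs without poliscidata, and converts them to proportions with prop.table(). With the package loaded, prop.table(table(states$gay_policy)) should give the same shares.

```r
# Counts copied from the table() output above.
counts <- c(
  "Most liberal"      = 6,
  "Liberal"           = 14,
  "Conservative"      = 10,
  "Most conservative" = 20
)

# prop.table() divides each count by the total number of states.
shares <- prop.table(counts)
round(shares, 2)
```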

4.2 Download data from webpages

A lot of the political datasets you will find are available online and can be downloaded for free. A free resource with an overview of political datasets can be found here: https://github.com/erikgahner/PolData

In this overview of political datasets, you can find datasets on different topics (international relations, political institutions, democracy etc.). For each dataset you will also be able to see whether it is possible to download the data for free, and if so, what the link to the dataset is.

To illustrate this, we can find the link to download the Global Media Freedom dataset. The dataset is available as a .csv file, so we can read it into R with the read.csv() function.

gmd <- read.csv(
  "http://faculty.uml.edu/Jenifer_whittenwoodring/GMFD_V2.csv"
  )

The dataset consists of the following four variables: id, year, country, mediascore.
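After importing a file, it is worth checking that the columns arrived with the expected names and types. The following is a minimal sketch that uses a few made-up rows with the same four column names, so that it runs without a network connection. The values are placeholders, not real media freedom scores, and the object name gmd_toy is chosen for this example.

```r
# Made-up rows for illustration only; the column names follow the text above.
csv_text <- "id,year,country,mediascore
1,2000,Country A,1
2,2000,Country B,3
3,2001,Country A,2"

# read.csv() can parse a character string through its text argument.
gmd_toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

names(gmd_toy)  # column names
str(gmd_toy)    # type of each column
nrow(gmd_toy)   # number of rows
```

The same three calls can be run on the gmd object created above.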

In the next sections, we will introduce different packages that can make it easier to work with different datasets.

4.3 Data: European Social Survey (essurvey)

To get data from the European Social Survey (ESS), we will be using the essurvey package (Cimentada, 2018). If you do not already have a (free) account, the first step is to go online and create one: http://www.europeansocialsurvey.org/user/new

The next thing you need to do is to install the package.

install.packages("essurvey")

And then load the package.

library("essurvey")

Now you need to set the email you used to register an account. If you don’t do this, ESS will not be able to confirm that you have an account, and you will not be able to get access to the data.

set_email("your@mail.com")

There are multiple functions to use in order to get data, and for an overview of some of them, check out https://ropensci.github.io/essurvey/.

Here, we will provide an example of how to reproduce the main result in Larsen (2018). We use the import_country() function to import data from Denmark in Round 6 of the ESS.

ess <- import_country("Denmark", 6)

All the recodings are made with the mutate() function.

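ess <- ess %>%
  mutate(
    # values above 10 fall outside the 0-10 scale and are set to missing
    stfgov = ifelse(stfgov > 10, NA, stfgov),
    # 0 = interviewed before day 19 of month 2, 1 = interviewed after it;
    # interviews on day 19 itself match no condition and become NA
    reform = case_when(inwmme < 2 ~ 0,
                       inwmme == 2 & inwdde < 19 ~ 0,
                       inwmme == 2 & inwdde > 19 ~ 1,
                       inwmme > 2 ~ 1,
                       TRUE ~ NA_real_)
  )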
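To check how the reform conditions treat individual interview dates, the same rules can be applied to a handful of made-up dates. The following is a minimal sketch in base R, so that it runs without the ESS data. The helper name code_reform and the toy values are assumptions made for this example, not part of the original analysis.

```r
# Hypothetical helper that restates the case_when() conditions in base R.
code_reform <- function(month, day) {
  out <- rep(NA_real_, length(month))
  out[which(month < 2)] <- 0
  out[which(month == 2 & day < 19)] <- 0
  out[which(month == 2 & day > 19)] <- 1
  out[which(month > 2)] <- 1
  out
}

# Made-up interview dates: one per branch, plus day 19 of month 2.
toy <- data.frame(
  inwmme = c(1, 2, 2, 2, 3),
  inwdde = c(15, 18, 19, 20, 1)
)
toy$reform <- code_reform(toy$inwmme, toy$inwdde)
toy
```

The third row (month 2, day 19) satisfies none of the conditions, so it stays NA.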

The regression model can then be estimated with the lm() function.

lm(stfgov ~ reform, data = ess)
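With a binary predictor such as reform, the slope estimated by lm() equals the difference between the two group means. The following is a minimal sketch on simulated data that illustrates this. The data frame sim, the group sizes, and the effect of -0.5 are arbitrary choices for this example and have nothing to do with the ESS results.

```r
set.seed(1)

# Simulated data: 50 observations before and 50 after a hypothetical reform.
sim <- data.frame(reform = rep(c(0, 1), each = 50))
sim$stfgov <- 5 - 0.5 * sim$reform + rnorm(100)

fit <- lm(stfgov ~ reform, data = sim)
coef(fit)

# The slope on reform matches the difference in group means.
diff_means <- mean(sim$stfgov[sim$reform == 1]) -
  mean(sim$stfgov[sim$reform == 0])
all.equal(unname(coef(fit)["reform"]), diff_means)
```

Calling summary(fit) adds standard errors and p-values to the coefficients.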

4.4 Data: Manifesto Project Dataset (manifestoR)

To use data from the Manifesto Project Dataset, you need to create an account as well. This can be done at: https://manifesto-project.wzb.eu/signup

Next, install and load the package manifestoR (Lewandowski & Merz, 2018).

# install the package
install.packages("manifestoR")

# load the package
library("manifestoR")

You now need to go to your profile page at https://manifesto-project.wzb.eu/. You will need to click on the button to get an API key. You can now click ‘download API Key file (txt)’ and place this file in your working directory - or copy your key and use the code below.

mp_setapikey(key = "yourKeyHere")
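Hard-coding the key in a script makes it easy to share it by accident. One option is to read the key with Sys.getenv(). The sketch below assumes you have stored the key in an environment variable called MANIFESTO_API_KEY, a name chosen for this example (it could be set in ~/.Renviron, for instance).

```r
# Read the key from an environment variable (the variable name is an assumption).
api_key <- Sys.getenv("MANIFESTO_API_KEY")

# Only register the key if it is set and manifestoR is installed.
if (nzchar(api_key) && requireNamespace("manifestoR", quietly = TRUE)) {
  manifestoR::mp_setapikey(key = api_key)
} else {
  message("Set MANIFESTO_API_KEY and install manifestoR before downloading data.")
}
```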

You are now able to download text data from the Manifesto Project into R. We use the mp_corpus() function to download election programme texts and codings, in this case from Denmark.

manifesto_dk <- mp_corpus(countryname == "Denmark")

To see some of the content from the manifesto data, you can try the code below.

head(content(manifesto_dk[[1]]))

If you want to find a more detailed description of how to look at the data, please see https://cran.r-project.org/web/packages/manifestoR/vignettes/manifestoRworkflow.pdf.

4.5 Data: Varieties of Democracy (vdem)

To get data from Varieties of Democracy into R, we are going to use the vdem package (Coppedge et al., 2017). This package is not on CRAN, and accordingly, we cannot use install.packages() to install it. Instead, we will have to use the function install_github() as it is on GitHub. In order to do this, you need to have the package devtools. To install this package, you can uncomment the first line below. The second line says that we are using the install_github() function from the devtools package (with ::).

#install.packages("devtools")
devtools::install_github("xmarquez/vdem")
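If you rerun a script often, it can be convenient to install only what is missing. The following is a minimal sketch; the helper missing_packages is a hypothetical name introduced for this example. The install calls are left commented out so that running the block does not change your library.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("devtools", "vdem"))
to_install

# Uncomment to install only what is missing:
# if ("devtools" %in% to_install) install.packages("devtools")
# if ("vdem" %in% to_install) devtools::install_github("xmarquez/vdem")
```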

When the package is installed, use library() to load it.

library("vdem")

To get the main democracy indices from the data, we can use the extract_vdem() function.

vdem_data <- extract_vdem(section_number = 1)

This gives us a dataset with 17,604 observations of 55 variables. To see the first observations, use head() (output not shown).

head(vdem_data)

4.6 Data: World Development Indicators (WDI)

To use data from the World Bank’s World Development Indicators, we can use the WDI package (Arel-Bundock, 2018). For more information on the World Development Indicators, see https://datacatalog.worldbank.org/dataset/world-development-indicators. First, install and load the package.

# install the package
install.packages("WDI")

# load the package
library("WDI")

To search for data in the WDI, you can use the WDIsearch() function. In the example below, we search for data on GDP.

WDIsearch("gdp")

This returns the indicator and name of the data in WDI. We can see that the indicator for GDP per capita, PPP (constant 2005 international $) is NY.GDP.PCAP.PP.KD. To save the data, use the function WDI(), where you specify the indicator as well as the countries and years you want the data from.

wdi <- WDI(indicator = "NY.GDP.PCAP.PP.KD",
           country = c("US", "GB"),
           start = 1960,
           end = 2012)

4.7 Data: GitHub repositories

A lot of data today is available in GitHub repositories. To get data from GitHub, one good package to use is the package RCurl. The first thing to do is to load the package (and if you have not already installed the package, you need to do this prior to loading it).

library("RCurl")

Next, find the dataset on GitHub that you would like to use. When you find the dataset, you should click on ‘Raw’ to get to the raw dataset, as shown in Figure 4.1.

Figure 4.1: How to get to the raw dataset file on GitHub

Copy the URL of the raw dataset and use the getURL() function in R to download its content. In the example below, we download data on Danish opinion polls and save the raw text in the object gh_url (despite its name, the object holds the content of the file, not the URL).

gh_url <- getURL("https://raw.githubusercontent.com/erikgahner/polls/master/polls.csv")

Last, because the dataset is in .csv format, we use read.csv() with the text argument to turn the downloaded content into a data frame, which we save in the object gh_data.

gh_data <- read.csv(text = gh_url)
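In recent versions of R, read.csv() also accepts a URL directly, so for plain .csv files the download and parsing steps can be combined. The following is a minimal sketch of a wrapper that returns NULL instead of stopping when a file cannot be read. The helper name read_csv_safely is an assumption for this example, and a temporary local file stands in for the raw GitHub URL so that the block runs offline.

```r
# Hypothetical helper: read a CSV from a URL or file path, returning NULL on failure.
read_csv_safely <- function(path) {
  tryCatch(
    read.csv(path, stringsAsFactors = FALSE),
    error = function(e) {
      message("Could not read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Local demonstration: a temporary file stands in for a raw GitHub URL.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)
demo <- read_csv_safely(tmp)
demo

# With a network connection, the helper can be pointed at the raw URL instead:
# gh_data <- read_csv_safely(
#   "https://raw.githubusercontent.com/erikgahner/polls/master/polls.csv"
# )
```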

  1. For more information on the package and the datasets it includes, see: https://cran.r-project.org/web/packages/poliscidata/poliscidata.pdf