Chapter 5 Get existing data

There are multiple ways you can get data into R. In this chapter we introduce different strategies for getting data into R from a variety of political data sources. First, we look at data included in packages. Second, we show how you can find datasets online and introduce a resource with a lot of links to political datasets. Third, we introduce a series of different packages that makes it easy to get data into R.

Throughout the chapter we will use the tidyverse package so make sure to load this.

library("tidyverse")

5.1 Using data from data packages

A lot of the packages we are working with, including packages in the tidyverse, include datasets. To illustrate this, we will be using the package poliscidata.8 The first thing we will need to do is to install the package.

install.packages("poliscidata")

Next, we will need to load the package with library().

library("poliscidata")

There are multiple datasets in the poliscidata package. We will focus on the dataset states, a dataset with variables about the 50 states in the United States. We use the function names() to get a list of all variables in the data frame states (it takes up a lot of space but gives an indication of the variety of variables in the data frame).

names(states)
##   [1] "abort_rank3"       "abortion_rank12"   "adv_or_more"       "ba_or_more"        "cig_tax12"        
##   [6] "cig_tax12_3"       "conserv_advantage" "conserv_public"    "dem_advantage"     "govt_worker"      
##  [11] "gun_rank3"         "gun_rank11"        "gun_scale11"       "hr_cons_rank11"    "hr_conserv11"     
##  [16] "hr_lib_rank11"     "hr_liberal11"      "hs_or_more"        "obama2012"         "obama_win12"      
##  [21] "pop2000"           "pop2010"           "pop2010_hun_thou"  "popchng0010"       "popchngpct"       
##  [26] "pot_policy"        "prochoice"         "prolife"           "relig_cath"        "relig_prot"       
##  [31] "relig_high"        "relig_low"         "religiosity3"      "romney2012"        "smokers12"        
##  [36] "stateid"           "to_0812"           "uninsured_pct"     "abort_rate05"      "abort_rate08"     
##  [41] "abortlaw3"         "abortlaw10"        "alcohol"           "attend_pct"        "battle04"         
##  [46] "blkleg"            "blkpct04"          "blkpct08"          "blkpct10"          "bush00"           
##  [51] "bush04"            "carfatal"          "carfatal07"        "cig_tax"           "cig_tax_3"        
##  [56] "cigarettes"        "college"           "conpct_m"          "cons_hr06"         "cons_hr09"        
##  [61] "cook_index"        "cook_index3"       "defexpen"          "demhr11"           "dem_hr09"         
##  [66] "demnat06"          "dempct_m"          "demstate06"        "demstate09"        "demstate13"       
##  [71] "density"           "division"          "earmarks_pcap"     "evm"               "evo"              
##  [76] "evo2012"           "evr2012"           "gay_policy"        "gay_policy2"       "gay_policy_con"   
##  [81] "gay_support"       "gay_support3"      "gb_win00"          "gb_win04"          "gore00"           
##  [86] "gun_check"         "gun_dealer"        "gun_murder10"      "gun_rank_rev"      "gunlaw_rank"      
##  [91] "gunlaw_rank3_rev"  "gunlaw_scale"      "hispanic04"        "hispanic08"        "hispanic10"       
##  [96] "indpct_m"          "kerry04"           "libpct_m"          "mccain08"          "modpct_m"         
## [101] "nader00"           "obama08"           "obama_win08"       "over64"            "permit"           
## [106] "pop_18_24"         "pop_18_24_10"      "prcapinc"          "region"            "relig_import"     
## [111] "religiosity"       "reppct_m"          "rtw"               "secularism"        "secularism3"      
## [116] "seniority_sen2"    "south"             "state"             "to_0004"           "to_0408"          
## [121] "trnout00"          "trnout04"          "unemploy"          "union04"           "union07"          
## [126] "union10"           "urban"             "vep00_turnout"     "vep04_turnout"     "vep08_turnout"    
## [131] "vep12_turnout"     "womleg_2007"       "womleg_2010"       "womleg_2011"       "womleg_2015"      
## [136] "cig_taxes"         "cig_taxes_cat"

While the data is available, it is not possible to see in the Environment window. To see the data frame, we can save states in an object of the same name.

states <- states

Now we can see in the Environment window that we have 50 observations of 135 variables. We will be using this data later, but for now we will see that we have actual data. Using the table() function we can show the distribution of observations in the gay_policy variable, showing data on the Billman’s policy scale (4 ordinal categories).

table(states$gay_policy)
## 
##      Most liberal           Liberal      Conservative Most conservative 
##                 6                14                10                20

Here we see that 6 states have a most liberal score, 14 have a liberal score, 10 have a conservative score, and 6 have a most conservative score.

5.2 Download data from webpages

A lot of the political datasets you will find are available online and can be downloaded for free. A free resource with an overview of political datasets can be found here: https://github.com/erikgahner/PolData

In this dataset with political datasets, you can find datasets from different topics (international relations, political institutions, democracy etc.). For each dataset you will also be able to see whether it is possible to download the data for free, and if so, what the link to the dataset is.

To illustrate this, we can find the link to download the Global Media Freedom dataset. The dataset is available as a .csv file and get into R with the read.csv() function.

gmd <- read.csv(
  "http://faculty.uml.edu/Jenifer_whittenwoodring/GMFD_V2.csv"
  )

The dataset consists of the following four variables: id, year, country, mediascore.

In the next sections, we will introduce different packages, that can make it easier to work with different datasets.

5.3 Data: European Social Survey (essurvey)

To get data from European Social Survey (ESS), we will be using the essurvey package (Cimentada, 2018). If you do not have a free user, the first step is to go online and create a user: http://www.europeansocialsurvey.org/user/new

The next thing you need to do is to install the package.

install.packages("essurvey")

And then load the package.

library("essurvey")

Now you need to set the email you used to register an account. If you don’t do this, ESS will not be able to confirm that you have an account, and you will not be able to get access to the data.

set_email("your@mail.com")

There are multiple functions to use in order to get data, and for an overview of some of them, check out https://ropensci.github.io/essurvey/.

Here, we will provide an example on how to reproduce the main result in Larsen (2018). Here we use the import_country() function to import data from Denmark in Round 6 of the ESS.

ess <- import_country("Denmark", 6)

All the recodings are made with the mutate() function.

ess <- ess %>%
  mutate(
    stfgov = ifelse(stfgov > 10, NA, stfgov),
    reform = case_when(inwmme < 2 ~ 0,
                       inwmme == 2 & inwdde < 19 ~ 0,
                       inwmme == 2 & inwdde > 19 ~ 1,
                       inwmme > 2 ~ 1,
                       TRUE ~ NA_real_)
  )

And the regression model can be achieved with the lm() function.

lm(stfgov ~ reform, data=ess)

5.4 Data: General Social Survey (gssr)

To get data from the General Social Survey, we will use the gssr package (Kieran Healy, 2019). This package is not on CRAN yet, so we are going to use the devtools package to install it and then use library() to load it (as always, remember, you only need to install the package once).

devtools::install_github("kjhealy/gssr")

library("gssr")

The easiest way to get the GSS data is to simple ask to get all of the data in an object called gss_all with data(). At the time of writing, this data frame includes 64814 observations and 6108 variables.

data(gss_all)

However, there are several other functions that might be help conditional upon what you would like to study. Accordingly, we recommend that you consult https://kjhealy.github.io/gssr/ for various examples on how to access the data and the documentation of different variables.

5.5 Data: American National Election Study (anesr)

To access data from the American National Election Study, we are going to use the package anesr (Martherus, 2019). As with the gssr package, this package is currently not available on CRAN so we are going to use the install_github() function from the devtools package. Once that is done, we load the package.

devtools::install_github("jamesmartherus/anesr")

library("anesr")

To open a window with all the datasets available via the package, simply run this line:

data(package="anesr")

In the window (not shown here), we can see that there is a dataset called timeseries_2016. To access this data, use the data() function.

data(timeseries_2016)

This returns an object called timeseries_2016 with 4270 observations and 1842 variables.

Before using any publicly available data, it’s good to keep two things in mind. First, when you are downloading data, you are using resources. For example, when you are downloading 4270 observations and 1842 variables, you are using a server. For that reason, we recommend that you don’t do this often but save the data you are working with in a local file. There is no reason to download the data again and again (especially not if you are running all of your code multiple times a day).

In addition, dataset often comes with specific terms of condition. For the ANES, these include:

  • Use these datasets solely for research or statistical purposes and not for investigation of specific survey respondents.
  • Make no use of the identity of any survey respondent(s) discovered intentionally or inadvertently, and to advise ANES of any such discovery (anes@electionstudies.org)
  • Cite ANES data and documentation in your work that makes use of the data and documentation. Authors of publications based on ANES data should send citations of their published works to ANES for inclusion in our bibliography of related publications.
  • You acknowledge that the original collector of the data, ANES, and the relevant funding agency/agencies bear no responsibility for use of the data or for interpretations or inferences based upon such uses.
  • ANES is not responsible for any errors in these datasets - any mistakes are mine.

5.6 Data: Manifesto Project Dataset (manifestoR)

To use data from the Manifesto Project Dataset, you need to create an account as well. This can be done at: https://manifesto-project.wzb.eu/signup

Next, install and load the package manifestoR (Lewandowski & Merz, 2018).

# install the package
install.packages("manifestoR")

# load the package
library("manifestoR")

You now need to go to your profile page at https://manifesto-project.wzb.eu/. You will need to click on the button to get an API key. You can now click ‘download API Key file (txt)’ and place this file in your working directory - or copy your key and use the code below.

mp_setapikey(key = "yourKeyHere")

You are now able to download text data from the Manifesto Project into R. We use the mp_corpus() function to download election programmes texts and codings, in this case from Denmark.

manifesto_dk <- mp_corpus(countryname == "Denmark")

To see some of the content from the manifesto data, you can try the code below.

head(content(manifesto_dk[[1]]))

If you want to find a more detailed description of how to look at the data, please see https://cran.r-project.org/web/packages/manifestoR/vignettes/manifestoRworkflow.pdf.

5.7 Data: Varieties of Democracy (vdem)

To get data from Varieties of Democracy into R, we are going to use the vdem package (Coppedge et al., 2017). This package is not on CRAN, and accordingly, we cannot use install.packages() to install it. Instead, we will have to use the function install_github() as it is on GitHub. In order to do this, you need to have the package devtools. To install this package, you can uncomment the first line below. The second line says that we are using the install_github() function from the devtools package (with ::).

#install.packages("devtools")
devtools::install_github("xmarquez/vdem")

When the package is installed, use library() to load it.

library("vdem")

To get the main democracy indices from the data, we can use the extract_vdem() function.

vdem_data <- extract_vdem(section_number = 1)

This gives us a dataset with 17,604 observations of 55 variables. To see the first observations, use head() (output not shown).

head(vdem_data)

5.8 Data: World Development Indicators (WDI)

To use data from the World Bank’s World Development Indicators, we can use the WDI package (Arel-Bundock, 2018). For more information on the World Development Indicators, see https://datacatalog.worldbank.org/dataset/world-development-indicators. First, install and load the package.

# install the package
install.packages("WDI")

# load the package
library("WDI")

To search for data in the WDI, you can use the WDIsearch() function. In the example below, we search for data on GDP.

WDIsearch("gdp")

This returns the indicator and name of the data in WDI. We can see that the indicator for GDP per capita, PPP (constant 2005 international $) is NY.GDP.PCAP.PP.KD. To save the data, use the function WDI(), where you specify the indicator as well as the countries and years you want the data from.

wdi <- WDI(indicator="NY.GDP.PCAP.PP.KD", 
           country=c("US", "GB"), 
           start = 1960, 
           end = 2012)

The WDI package is only one example of packages among others that can be used to access international statistics. For a tutorial on accessing additional international statistics into R, see this guide: A Guide to Getting International Statistics into R

5.9 Data: GitHub repositories

A lot of data today is available in GitHub repositories. To get data from GitHub, we can simply use the read_csv() function. First, find the dataset on GitHub that you would like to use. When you find the dataset, you should click on ‘Raw’ to get to the raw dataset, as shown in Figure 5.1.

How to get to the raw dataset file on GitHub

Figure 5.1: How to get to the raw dataset file on GitHub

Copy the url of the raw dataset and use the read_csv() function in R to load the content of the dataset into R. In the example below, we save a data frame on Danish opinion polls in an object called polls_raw.

polls_raw <- read_csv("https://raw.githubusercontent.com/erikgahner/polls/master/polls.csv")

  1. For more information on the package and the included datasets, see: https://cran.r-project.org/web/packages/poliscidata/poliscidata.pdf