Chapter 3 Data management

There are multiple ways to manage data in R and in particular different ways to create and change variables in a data frame. In this chapter, we show different ways of working with data frames with a focus on how to change and create new variables. Noteworthy, there are multiple packages we can use to manipulate data frames, but the best is without a doubt dplyr (Hadley Wickham & Francois, 2016). This is part of the tidyverse package so you do not need to install any new packages if you have already installed tidyverse. The syntax is inspired by SQL and if you want to learn SQL at some point, you will have an advantage from having used the dplyr package.

The package provides some basic functions making it easy to work with data frames. These functions include select(), filter(), arrange(), rename(), mutate() and summarize().7 select() allows you to pick variables by their names. filter() allows you to pick observations by their values. arrange() allows you to reorder the rows. rename() allows you to rename columns. mutate() allows you to create new variables based on the values of old variables. summarize() allows you to collapse many values to a single summary.

All these functions rely on data frames. In other words, you can not use these functions on other types of data in R. Furthermore, they all return a new data frame that you will need to save in a new object or overwrite the existing object with your data frame.

As the dplyr package is part of the tidyverse, the first thing we do is to call the tidyverse.

library("tidyverse")

We will use the dataset we created in the previous chapter. If you do not have it, you can use the script below to create the data frame again.

party <- c("Conservative Party", "Labour Party", "Scottish National Party", 
       "Liberal Democrats", "Democratic Unionist Party", "Sinn Féin") 

leader <- c("Theresa May", "Jeremy Corbyn", "Nicola Sturgeon", 
            "Tim Farron", "Arlene Foster", "Gerry Adams")

votes <- c(42.4, 40.0, 3.0, 7.4, 0.9, 0.7)

seats <- c(317, 262, 35, 12, 10, 7)

seats_change <- c(-13, 30, -21, 4, 2, 3)

uk2017 <- data.frame(party, leader, votes, seats, seats_change)

To see the information in the dataset, use head().

head(uk2017)
##                       party          leader votes seats seats_change
## 1        Conservative Party     Theresa May  42.4   317          -13
## 2              Labour Party   Jeremy Corbyn  40.0   262           30
## 3   Scottish National Party Nicola Sturgeon   3.0    35          -21
## 4         Liberal Democrats      Tim Farron   7.4    12            4
## 5 Democratic Unionist Party   Arlene Foster   0.9    10            2
## 6                 Sinn Féin     Gerry Adams   0.7     7            3

3.1 Selecting variables: select()

When we work with large datasets, we often want to select the few variables that are of key interest to our project. For this task the select() function is perfect. If we only want to have information on the party name and the votes in the uk2017 data frame, we can write:

select(uk2017, party, votes)
##                       party votes
## 1        Conservative Party  42.4
## 2              Labour Party  40.0
## 3   Scottish National Party   3.0
## 4         Liberal Democrats   7.4
## 5 Democratic Unionist Party   0.9
## 6                 Sinn Féin   0.7

Again, this is not saved in a new data frame. If we want to save this in a new data frame, say uk2017_pv, we need to assign the output from select() to our object.

uk2017_pv <- select(uk2017, party, votes)

There are multiple different functions that can help us find specific variables in the data frame. We can use contains(), if we want to include variables that contain a specific word in the variable name. In the example below we look for variables that contain the text seat.

select(uk2017, contains("seat"))
##   seats seats_change
## 1   317          -13
## 2   262           30
## 3    35          -21
## 4    12            4
## 5    10            2
## 6     7            3

Other noteworthy functions similar to contains() that can be of help are functions such as starts_with(), ends_with(), matches(), num_range(), one_of() and everything(). The last function, everything() is helpful if we want to move a variable to the beginning of our data frame.

select(uk2017, votes, everything())
##   votes                     party          leader seats seats_change
## 1  42.4        Conservative Party     Theresa May   317          -13
## 2  40.0              Labour Party   Jeremy Corbyn   262           30
## 3   3.0   Scottish National Party Nicola Sturgeon    35          -21
## 4   7.4         Liberal Democrats      Tim Farron    12            4
## 5   0.9 Democratic Unionist Party   Arlene Foster    10            2
## 6   0.7                 Sinn Féin     Gerry Adams     7            3

Last, we can use the negative sign if we want to remove a variable from the data frame.

select(uk2017, -leader)
##                       party votes seats seats_change
## 1        Conservative Party  42.4   317          -13
## 2              Labour Party  40.0   262           30
## 3   Scottish National Party   3.0    35          -21
## 4         Liberal Democrats   7.4    12            4
## 5 Democratic Unionist Party   0.9    10            2
## 6                 Sinn Féin   0.7     7            3

3.2 Selecting observations: filter()

To select only some of the observations in our data frame, but for all variables, we can use the function filter(). In the example below we select the observations in our data frame with a positive value on seats_change (i.e. greater than 0).

filter(uk2017, seats_change > 0)
##                       party        leader votes seats seats_change
## 1              Labour Party Jeremy Corbyn  40.0   262           30
## 2         Liberal Democrats    Tim Farron   7.4    12            4
## 3 Democratic Unionist Party Arlene Foster   0.9    10            2
## 4                 Sinn Féin   Gerry Adams   0.7     7            3

Importantly, we are not making any changes to the data frame uk2017. Again, this will only hapen if we replace our existing data frame or create a new data frame. In the example below we create a new data frame, uk2017_seatlosers, with the observations losing seats from 2015 to 2017.

uk2017_seatlosers <- filter(uk2017, seats_change < 0)

uk2017_seatlosers
##                     party          leader votes seats seats_change
## 1      Conservative Party     Theresa May  42.4   317          -13
## 2 Scottish National Party Nicola Sturgeon   3.0    35          -21

If we want to select two parties, say the Conservative Party and the Labour Party, we can use %in% to specify this.

filter(uk2017, party %in% c("Conservative Party", "Labour Party"))
##                party        leader votes seats seats_change
## 1 Conservative Party   Theresa May  42.4   317          -13
## 2       Labour Party Jeremy Corbyn  40.0   262           30

Similarly, by adding an ! in front of party we can get the parties that are not the Conservative Party and the Labour Party.

filter(uk2017, !party %in% c("Conservative Party", "Labour Party"))
##                       party          leader votes seats seats_change
## 1   Scottish National Party Nicola Sturgeon   3.0    35          -21
## 2         Liberal Democrats      Tim Farron   7.4    12            4
## 3 Democratic Unionist Party   Arlene Foster   0.9    10            2
## 4                 Sinn Féin     Gerry Adams   0.7     7            3

Last, if we want to drop observations that contain missing values on specific variables, we can use the function drop_na().

3.3 Sorting observations: arrange()

We can use the function arrange() if we want to change the order of observations. In the example below we sort our data frame according to how many votes the party got, with the party getting the least votes in the top of our data frame.

arrange(uk2017, votes)
##                       party          leader votes seats seats_change
## 1                 Sinn Féin     Gerry Adams   0.7     7            3
## 2 Democratic Unionist Party   Arlene Foster   0.9    10            2
## 3   Scottish National Party Nicola Sturgeon   3.0    35          -21
## 4         Liberal Democrats      Tim Farron   7.4    12            4
## 5              Labour Party   Jeremy Corbyn  40.0   262           30
## 6        Conservative Party     Theresa May  42.4   317          -13

If we prefer to have the parties with the greatest number of votes in the top, we can use the negative sign (-).

arrange(uk2017, -votes)
##                       party          leader votes seats seats_change
## 1        Conservative Party     Theresa May  42.4   317          -13
## 2              Labour Party   Jeremy Corbyn  40.0   262           30
## 3         Liberal Democrats      Tim Farron   7.4    12            4
## 4   Scottish National Party Nicola Sturgeon   3.0    35          -21
## 5 Democratic Unionist Party   Arlene Foster   0.9    10            2
## 6                 Sinn Féin     Gerry Adams   0.7     7            3

Alternatively, you can use the desc() function.

arrange(uk2017, desc(votes))
##                       party          leader votes seats seats_change
## 1        Conservative Party     Theresa May  42.4   317          -13
## 2              Labour Party   Jeremy Corbyn  40.0   262           30
## 3         Liberal Democrats      Tim Farron   7.4    12            4
## 4   Scottish National Party Nicola Sturgeon   3.0    35          -21
## 5 Democratic Unionist Party   Arlene Foster   0.9    10            2
## 6                 Sinn Féin     Gerry Adams   0.7     7            3

3.4 Rename variables: rename()

In the case that we have a variable we would prefer having another name, we can use the function rename(). In the example below we change the name of party to party_name.

rename(uk2017, party_name = party)
##                  party_name          leader votes seats seats_change
## 1        Conservative Party     Theresa May  42.4   317          -13
## 2              Labour Party   Jeremy Corbyn  40.0   262           30
## 3   Scottish National Party Nicola Sturgeon   3.0    35          -21
## 4         Liberal Democrats      Tim Farron   7.4    12            4
## 5 Democratic Unionist Party   Arlene Foster   0.9    10            2
## 6                 Sinn Féin     Gerry Adams   0.7     7            3

3.5 Create variables: mutate()

The best way to create a new variable from existing variables in our data frame is to use the function mutate(). In the example below we create a new variable, votes_m with information on how many percentage points a party is from the average number of votes a party got in the election.

mutate(uk2017, votes_m = votes - mean(votes))
##                       party          leader votes seats seats_change    votes_m
## 1        Conservative Party     Theresa May  42.4   317          -13  26.666667
## 2              Labour Party   Jeremy Corbyn  40.0   262           30  24.266667
## 3   Scottish National Party Nicola Sturgeon   3.0    35          -21 -12.733333
## 4         Liberal Democrats      Tim Farron   7.4    12            4  -8.333333
## 5 Democratic Unionist Party   Arlene Foster   0.9    10            2 -14.833333
## 6                 Sinn Féin     Gerry Adams   0.7     7            3 -15.033333

We can also use the sum() function to find the proportion of seats a party got in a variable, seats_prop.

mutate(uk2017, seats_prop = seats / sum(seats))
##                       party          leader votes seats seats_change seats_prop
## 1        Conservative Party     Theresa May  42.4   317          -13 0.49300156
## 2              Labour Party   Jeremy Corbyn  40.0   262           30 0.40746501
## 3   Scottish National Party Nicola Sturgeon   3.0    35          -21 0.05443235
## 4         Liberal Democrats      Tim Farron   7.4    12            4 0.01866252
## 5 Democratic Unionist Party   Arlene Foster   0.9    10            2 0.01555210
## 6                 Sinn Féin     Gerry Adams   0.7     7            3 0.01088647

3.6 The pipe operator: %>%

So far we have looked at a series of different functions. In most cases we want to combine these functions, e.g. when we both have to select specific variables and observations. Luckikly, there is nothing against using one function nested within another, as the example below shows.

filter(select(uk2017, party, votes), seats_change > 0)
##                       party votes
## 1              Labour Party  40.0
## 2         Liberal Democrats   7.4
## 3 Democratic Unionist Party   0.9
## 4                 Sinn Féin   0.7

The problem is that it can be complicated to read, especially as the number of functions we use increases. Furthermore, the likelihood of making a stupid mistake, e.g. by including an extra ( or ) increases substantially. We can use the pipe operator, %>%, to make our code more readable.

The operator relies on a step-wise logic so we first specify the data frame and then a line for each function we want to run on the data frame.

In the example below we do the same as above but in a way that is easier to follow.

uk2017 %>% 
  select(party, votes) %>%
  filter(seats_change > 0)
##                       party votes
## 1              Labour Party  40.0
## 2         Liberal Democrats   7.4
## 3 Democratic Unionist Party   0.9
## 4                 Sinn Féin   0.7

On the first line, we show that we are using the data frame uk2017. We end this line with %>%, telling R that we are not done yet but will have to put this into the function on the line below. The next line uses the input from the previous line and selects party and votes from the data frame. This line also ends with the pipe, %>%. The third line shows the observations in our data frame where seats_change is greater than 0. Note that we did not select seats_change as a variable with select(), so this is not crucial in order to use it (as long as it is in the uk2017 data frame). Last, we do not end with a pipe as we are done and do not want to do more to our data frame.

If we want to change the names of our variables, we can use setNames().

uk2017 %>% 
  filter(seats_change > 0) %>%
  select(party, votes) %>%
  setNames(c("party_name", "vote_share"))
##                  party_name vote_share
## 1              Labour Party       40.0
## 2         Liberal Democrats        7.4
## 3 Democratic Unionist Party        0.9
## 4                 Sinn Féin        0.7

Hopefully, you will soon be using the pipe operator a lot. Instead of having to type >, % and > again and again, there are good shortcuts available in RStudio to create the pipe, specifically Ctrl+Shift+M (Windows) and Cmd+Shift+M (Mac).

Last, if you want to get information on what happens at the individual steps, you can install the tidylog package. This package adds information on what each dplyr function does. In the example below we use select() and filter() with tidylog.

uk2017 %>% 
  tidylog::select(party, votes) %>%
  tidylog::filter(seats_change > 0)
## select: dropped 3 variables (leader, seats, seats_change)
## filter: removed 2 rows (33%), 4 rows remaining
##                       party votes
## 1              Labour Party  40.0
## 2         Liberal Democrats   7.4
## 3 Democratic Unionist Party   0.9
## 4                 Sinn Féin   0.7

We can see that we dropped 3 variables with select() and removed 2 out of 6 rows with filter(). If you prefer to get all this information for all your code, you can include library("tidylog") in your script (after you have loaded the tidyverse package).

3.7 Running functions on variables: apply()

If we would like to run a function on some of our rows or columns, we can use the function apply(). For example, we can get the average number of votes and seats for parties with a positive value on seats_change (i.e. parties with an increase in seats from 2015 to 2017).

The addition here is the function apply() on the data frame used above. The first thing we specify here is MARGIN, i.e. whether we want to run a function on our rows (1) or columns (2). The next thing we specify is the function together with any relevant options.

uk2017 %>%
  filter(seats_change > 0) %>%
  select(votes, seats) %>%
  apply(MARGIN = 2, FUN = mean, na.rm = TRUE)
## votes seats 
## 12.25 72.75

In the case you want to apply a function to both rows and columns, you will have to specify c(1, 2). It is not important to mention MARGIN or FUN if you have the order right. In other words, we can simplify our example to the code below.

uk2017 %>%
  filter(seats_change > 0) %>%
  select(votes, seats) %>%
  apply(2, mean)
## votes seats 
## 12.25 72.75

3.8 Aggregating variables: summarize() and group_by()

If we want to create new variables with aggregated information, similar to the information we got in the previous section, we can use the function summarize(). In the example below we get a data frame with information on the number of observatins, given by n(), the minimum number of votes a party got (votes_min), the maximum number of votes a party got (votes_max) and the average number of votes a party got (votes_mean) (all in percentages).

uk2017 %>%
  summarize(party = n(), 
            votes_min = min(votes), 
            votes_max = max(votes), 
            votes_mean = mean(votes))
##   party votes_min votes_max votes_mean
## 1     6       0.7      42.4   15.73333

If we want this information for different groups, we can supply with group_by(). In the example below we get the same information for parties with an increase in seats from 2015 to 2017 and not.

uk2017 %>%
  group_by(seats_change > 0) %>%
  summarize(party = n(), 
            votes_min = min(votes), 
            votes_max = max(votes), 
            votes_mean = mean(votes))
## # A tibble: 2 x 5
##   `seats_change > 0` party votes_min votes_max votes_mean
##   <lgl>              <int>     <dbl>     <dbl>      <dbl>
## 1 FALSE                  2       3        42.4       22.7
## 2 TRUE                   4       0.7      40         12.2

In the example, you can see the aggregated information. The information next to TRUE is the aggregated information for the observations where seats_change is greater than 0. Last, often you want to conduct additional work on your data frame where it should no longer be grouped anymore. In that case, you can use ungroup().

3.9 Recoding variables: case_when()

In a lot of cases we want to recode the information in a single variable. To do this, we can use case_when(). Let us use the leader variable in uk2017 as an example.

uk2017$leader 
## [1] Theresa May     Jeremy Corbyn   Nicola Sturgeon Tim Farron      Arlene Foster   Gerry Adams    
## Levels: Arlene Foster Gerry Adams Jeremy Corbyn Nicola Sturgeon Theresa May Tim Farron

In the case that we want to create a new leader variable (in this example called leader_new) where Theresa May is replaced with Boris Johnson, we can do that with the code below.

uk2017 %>% 
  mutate(leader_new = case_when(leader == "Theresa May" ~ "Boris Johnson", 
                                TRUE ~ as.character(leader))
         )
##                       party          leader votes seats seats_change      leader_new
## 1        Conservative Party     Theresa May  42.4   317          -13   Boris Johnson
## 2              Labour Party   Jeremy Corbyn  40.0   262           30   Jeremy Corbyn
## 3   Scottish National Party Nicola Sturgeon   3.0    35          -21 Nicola Sturgeon
## 4         Liberal Democrats      Tim Farron   7.4    12            4      Tim Farron
## 5 Democratic Unionist Party   Arlene Foster   0.9    10            2   Arlene Foster
## 6                 Sinn Féin     Gerry Adams   0.7     7            3     Gerry Adams

Using case_when() is a great way to make recodings easily. As you can see, you first specify what condition a variable should satisfy, before you use ~ to tell what the new variable in mutate should be for this variable. At the end, we use TRUE to specify what all other values on the variable should be (in this example the other leaders in our variable).

Noteworthy, we do not save the leader_new variable in our data frame. If we want to save the changes, we can save the new variable to our data frame with the code below.

uk2017 <- uk2017 %>% 
  mutate(leader_new = case_when(leader == "Theresa May" ~ "Boris Johnson", 
                                TRUE ~ as.character(leader))
         )

dplyr in the tidyverse also has a recode() function, and the car package (Fox & Weisberg, 2011) has a similar function worth exploring. However, it is recommended that you stick with mutate() and case_when().

3.10 Getting a variable from a data frame: pull()

Often you do something to a data frame but might be interested in just working with one variable within that data frame. To do this, pull() is a great function. Let’s say we want to get the average votes for parties with a positive value on seats_changes. Here, we first use filter() to get the observations we want to look at and then use pull() with the votes variable to use mean() on this variable

uk2017 %>% 
  filter(seats_change > 0) %>% 
  pull(votes) %>% 
  mean()
## [1] 12.25

Here, we can see that the average support for parties with a positive value on seats_change is 12.25.


  1. For another good introduction to dplyr, see: Managing Data Frames with the dplyr package.