Chapter 3 Data management

There are multiple ways to manage data in R and in particular different ways to create and change variables in a data frame. In this chapter, we show different ways of working with data frames with a focus on how to change and create new variables. Noteworthy, there are multiple packages we can use to manipulate data frames, but the best is without a doubt dplyr (Hadley Wickham & Francois, 2016). This is part of the tidyverse package so you do not need to install any new packages if you have already installed tidyverse.

The package provides some basic functions making it easy to work with data frames. These functions include select(), filter(), arrange(), rename(), mutate() and summarize().7 select() allows you to pick variables by their names. filter() allows you to pick observations by their values. arrange() allows you to reorder the rows. rename() allows you to rename columns. mutate() allows you to create new variables based on the values of old variables. summarize() allows you to collapse many values to a single summary.

All these functions rely on data frames. In other words, you can not use these functions on other types of data in R. Furthermore, they all return a new data frame that you will need to save in a new object or overwrite the existing object with your data frame.

As the dplyr package is part of the tidyverse, the first thing we do is to call the tidyverse.

library("tidyverse")

We will use the dataset we created in the previous chapter. If you do not have it, you can use the script below to create the data frame again.

party <- c("Conservative Party", "Labour Party", "Scottish National Party", 
       "Liberal Democrats", "Democratic Unionist Party", "Sinn Féin") 

leader <- c("Theresa May", "Jeremy Corbyn", "Nicola Sturgeon", 
            "Tim Farron", "Arlene Foster", "Gerry Adams")

votes <- c(42.4, 40.0, 3.0, 7.4, 0.9, 0.7)

seats <- c(317, 262, 35, 12, 10, 7)

seats_change <- c(-13, 30, -21, 4, 2, 3)

uk2017 <- data.frame(party, leader, votes, seats, seats_change)

To see the information in the dataset, use head().

head(uk2017)
                      party          leader votes seats seats_change
1        Conservative Party     Theresa May  42.4   317          -13
2              Labour Party   Jeremy Corbyn  40.0   262           30
3   Scottish National Party Nicola Sturgeon   3.0    35          -21
4         Liberal Democrats      Tim Farron   7.4    12            4
5 Democratic Unionist Party   Arlene Foster   0.9    10            2
6                 Sinn Féin     Gerry Adams   0.7     7            3

3.1 Selecting variables: select()

When we work with large datasets, we often want to select the few variables that are of key interest to our project. For this task the select() function is perfect. If we only want to have information on the party name and the votes in the uk2017 data frame, we can write:

select(uk2017, party, votes)
                      party votes
1        Conservative Party  42.4
2              Labour Party  40.0
3   Scottish National Party   3.0
4         Liberal Democrats   7.4
5 Democratic Unionist Party   0.9
6                 Sinn Féin   0.7

Again, this is not saved in a new data frame. If we want to save this in a new data frame, say uk2017_pv, we need to assign the output from select() to our object.

uk2017_pv <- select(uk2017, party, votes)

There are multiple different functions that can help us find specific variables in the data frame. We can use contains(), if we want to include variables that contain a specific word in the variable name. In the example below we look for variables that contain the text seat.

select(uk2017, contains("seat"))
  seats seats_change
1   317          -13
2   262           30
3    35          -21
4    12            4
5    10            2
6     7            3

Other noteworthy functions similar to contains() that can be of help are functions such as starts_with(), ends_with(), matches(), num_range(), one_of() and everything(). The last function, everything() is helpful if we want to move a variable to the beginning of our data frame.

select(uk2017, votes, everything())
  votes                     party          leader seats seats_change
1  42.4        Conservative Party     Theresa May   317          -13
2  40.0              Labour Party   Jeremy Corbyn   262           30
3   3.0   Scottish National Party Nicola Sturgeon    35          -21
4   7.4         Liberal Democrats      Tim Farron    12            4
5   0.9 Democratic Unionist Party   Arlene Foster    10            2
6   0.7                 Sinn Féin     Gerry Adams     7            3

Last, we can use the negative sign if we want to remove a variable from the data frame.

select(uk2017, -leader)
                      party votes seats seats_change
1        Conservative Party  42.4   317          -13
2              Labour Party  40.0   262           30
3   Scottish National Party   3.0    35          -21
4         Liberal Democrats   7.4    12            4
5 Democratic Unionist Party   0.9    10            2
6                 Sinn Féin   0.7     7            3

3.2 Selecting observations: filter()

To select only some of the observations in our data frame, but for all variables, we can use the function filter(). In the example below we select the observations in our data frame with a positive value on seats_change (i.e. greater than 0).

filter(uk2017, seats_change > 0)
                      party        leader votes seats seats_change
1              Labour Party Jeremy Corbyn  40.0   262           30
2         Liberal Democrats    Tim Farron   7.4    12            4
3 Democratic Unionist Party Arlene Foster   0.9    10            2
4                 Sinn Féin   Gerry Adams   0.7     7            3

Importantly, we are not making any changes to the data frame uk2017. Again, this will only hapen if we replace our existing data frame or create a new data frame. In the example below we create a new data frame, uk2017_seatlosers, with the observations losing seats from 2015 to 2017.

uk2017_seatlosers <- filter(uk2017, seats_change < 0)

uk2017_seatlosers
                    party          leader votes seats seats_change
1      Conservative Party     Theresa May  42.4   317          -13
2 Scottish National Party Nicola Sturgeon   3.0    35          -21

Last, if we want to drop observations that contain missing values on specific variables, we can use the function drop_na().

3.3 Sorting observations: arrange()

We can use the function arrange() if we want to change the order of observations. In the example below we sort our data frame according to how many votes the party got, with the party getting the least votes in the top of our data frame.

arrange(uk2017, votes)
                      party          leader votes seats seats_change
1                 Sinn Féin     Gerry Adams   0.7     7            3
2 Democratic Unionist Party   Arlene Foster   0.9    10            2
3   Scottish National Party Nicola Sturgeon   3.0    35          -21
4         Liberal Democrats      Tim Farron   7.4    12            4
5              Labour Party   Jeremy Corbyn  40.0   262           30
6        Conservative Party     Theresa May  42.4   317          -13

If we prefer to have the parties with the greatest number of votes in the top, we can use the negative sign (-).

arrange(uk2017, -votes)
                      party          leader votes seats seats_change
1        Conservative Party     Theresa May  42.4   317          -13
2              Labour Party   Jeremy Corbyn  40.0   262           30
3         Liberal Democrats      Tim Farron   7.4    12            4
4   Scottish National Party Nicola Sturgeon   3.0    35          -21
5 Democratic Unionist Party   Arlene Foster   0.9    10            2
6                 Sinn Féin     Gerry Adams   0.7     7            3

Alternatively, you can use the desc() function.

arrange(uk2017, desc(votes))
                      party          leader votes seats seats_change
1        Conservative Party     Theresa May  42.4   317          -13
2              Labour Party   Jeremy Corbyn  40.0   262           30
3         Liberal Democrats      Tim Farron   7.4    12            4
4   Scottish National Party Nicola Sturgeon   3.0    35          -21
5 Democratic Unionist Party   Arlene Foster   0.9    10            2
6                 Sinn Féin     Gerry Adams   0.7     7            3

3.4 Rename variables: rename()

In the case that we have a variable we would prefer having another name, we can use the function rename(). In the example below we change the name of party to party_name.

rename(uk2017, party_name = party)
                 party_name          leader votes seats seats_change
1        Conservative Party     Theresa May  42.4   317          -13
2              Labour Party   Jeremy Corbyn  40.0   262           30
3   Scottish National Party Nicola Sturgeon   3.0    35          -21
4         Liberal Democrats      Tim Farron   7.4    12            4
5 Democratic Unionist Party   Arlene Foster   0.9    10            2
6                 Sinn Féin     Gerry Adams   0.7     7            3

3.5 Create variables: mutate()

The best way to create a new variable from existing variables in our data frame is to use the function mutate(). In the example below we create a new variable, votes_m with information on how many percentage points a party is from the average number of votes a party got in the election.

mutate(uk2017, votes_m = votes - mean(votes))
                      party          leader votes seats seats_change
1        Conservative Party     Theresa May  42.4   317          -13
2              Labour Party   Jeremy Corbyn  40.0   262           30
3   Scottish National Party Nicola Sturgeon   3.0    35          -21
4         Liberal Democrats      Tim Farron   7.4    12            4
5 Democratic Unionist Party   Arlene Foster   0.9    10            2
6                 Sinn Féin     Gerry Adams   0.7     7            3
     votes_m
1  26.666667
2  24.266667
3 -12.733333
4  -8.333333
5 -14.833333
6 -15.033333

We can also use the sum() function to find the proportion of seats a party got in a variable, seats_prop.

mutate(uk2017, seats_prop = seats / sum(seats))
                      party          leader votes seats seats_change
1        Conservative Party     Theresa May  42.4   317          -13
2              Labour Party   Jeremy Corbyn  40.0   262           30
3   Scottish National Party Nicola Sturgeon   3.0    35          -21
4         Liberal Democrats      Tim Farron   7.4    12            4
5 Democratic Unionist Party   Arlene Foster   0.9    10            2
6                 Sinn Féin     Gerry Adams   0.7     7            3
  seats_prop
1 0.49300156
2 0.40746501
3 0.05443235
4 0.01866252
5 0.01555210
6 0.01088647

3.6 The pipe operator: %>%

So far we have looked at a series of different functions. In most cases we want to combine these functions, e.g. when we both have to select specific variables and observations. Luckikly, there is nothing against using one function nested within another, as the example below shows.

filter(select(uk2017, party, votes), seats_change > 0)
                      party votes
1              Labour Party  40.0
2         Liberal Democrats   7.4
3 Democratic Unionist Party   0.9
4                 Sinn Féin   0.7

The problem is that it can be complicated to read, especially as the number of functions we use increases. Furthermore, the likelihood of making a stupid mistake, e.g. by including an extra ( or ) increases substantially. We can use the pipe operator, %>%, to make our code more readable.

The operator relies on a step-wise logic so we first specify the data frame and then a line for each function we want to run on the data frame.

In the example below we do the same as above but in a way that is easier to follow.

uk2017 %>% 
  select(party, votes) %>%
  filter(seats_change > 0)
                      party votes
1              Labour Party  40.0
2         Liberal Democrats   7.4
3 Democratic Unionist Party   0.9
4                 Sinn Féin   0.7

On the first line, we show that we are using the data frame uk2017. We end this line with %>%, telling R that we are not done yet but will have to put this into the function on the line below. The next line uses the input from the previous line and selects party and votes from the data frame. This line also ends with the pipe, %>%. The third line shows the observations in our data frame where seats_change is greater than 0. Note that we did not select seats_change as a variable with select(), so this is not crucial in order to use it (as long as it is in the uk2017 data frame). Last, we do not end with a pipe as we are done and do not want to do more to our data frame.

3.7 Running functions on variables: apply()

If we would like to run a function on some of our rows or columns, we can use the function apply(). For example, we can get the average number of votes and seats for parties with a positive value on seats_change (i.e. parties with an increase in seats from 2015 to 2017).

The addition here is the function apply() on the data frame used above. The first thing we specify here is MARGIN, i.e. whether we want to run a function on our rows (1) or columns (2). The next thing we specify is the function together with any relevant options.

uk2017 %>%
  filter(seats_change > 0) %>%
  select(votes, seats) %>%
  apply(MARGIN = 2, FUN = mean, na.rm = TRUE)
votes seats 
12.25 72.75 

In the case you want to apply a function to both rows and columns, you will have to specify c(1, 2). It is not important to mention MARGIN or FUN if you have the order right. In other words, we can simplify our example to the code below.

uk2017 %>%
  filter(seats_change > 0) %>%
  select(votes, seats) %>%
  apply(2, mean)
votes seats 
12.25 72.75 

3.8 Aggregating variables: summarize() and group_by()

If we want to create new variables with aggregated information, similar to the information we got in the previous section, we can use the function summarize(). In the example below we get a data frame with information on the number of observatins, given by n(), the minimum number of votes a party got (votes_min), the maximum number of votes a party got (votes_max) and the average number of votes a party got (votes_mean) (all in percentages).

uk2017 %>%
  summarize(party = n(), 
            votes_min = min(votes), 
            votes_max = max(votes), 
            votes_mean = mean(votes))
  party votes_min votes_max votes_mean
1     6       0.7      42.4   15.73333

If we want this information for different groups, we can supply with group_by(). In the example below we get the same information for parties with an increase in seats from 2015 to 2017 and not.

uk2017 %>%
  group_by(seats_change > 0) %>%
  summarize(party = n(), 
            votes_min = min(votes), 
            votes_max = max(votes), 
            votes_mean = mean(votes))
# A tibble: 2 x 5
  `seats_change > 0` party votes_min votes_max votes_mean
  <lgl>              <int>     <dbl>     <dbl>      <dbl>
1 FALSE                  2       3        42.4       22.7
2 TRUE                   4       0.7      40         12.2

In the example, you can see the aggregated information. T is short for TRUE and is the aggregated information for the observations where seats_change is greater than 0.

3.9 Recoding variables: recode()

In a lot of cases we want to recode the information in a single variable. To do this, we can use recode(). Importantly, this function works for individual variables and not for a data frame. Let us use the leader variable in uk2017 as an example.

uk2017$leader 
[1] Theresa May     Jeremy Corbyn   Nicola Sturgeon Tim Farron     
[5] Arlene Foster   Gerry Adams    
6 Levels: Arlene Foster Gerry Adams Jeremy Corbyn ... Tim Farron

In the case that we want to replace Tim Farron in the variable with a new guy, we can do that with the code below.

recode(uk2017$leader, "Tim Farron" = "New guy")
[1] Theresa May     Jeremy Corbyn   Nicola Sturgeon New guy        
[5] Arlene Foster   Gerry Adams    
6 Levels: Arlene Foster Gerry Adams Jeremy Corbyn ... New guy

Noteworthy, we do not create any changes to the leader variable. If we want to save the changes, we can save the new variable to our data frame.

uk2017$leader_new <- recode(uk2017$leader, "Tim Farron" = "New guy")

uk2017$leader_new
[1] Theresa May     Jeremy Corbyn   Nicola Sturgeon New guy        
[5] Arlene Foster   Gerry Adams    
6 Levels: Arlene Foster Gerry Adams Jeremy Corbyn ... New guy

Last, dplyr in the tidyverse is not the only package with a recode() function. The package car (Fox & Weisberg, 2011) has a similar function worth exploring.


  1. For another good introduction to dplyr, see: Managing Data Frames with the dplyr package.