Chapter 4 Manipulating text

We introduced text in the previous chapter. In this chapter, we will show how to manipulate text as strings and factors. We will use the states dataset from the poliscidata package. For more on this data, see Chapter 5.

library("poliscidata")
## Registered S3 method overwritten by 'gdata':
##   method         from  
##   reorder.factor gplots
states <- states

You can use View(states) to get a sense of the 50 observations and the 135 variables.

4.1 Strings

There are a few functions that are great to use, when you use strings. First, paste() makes it easy to connect two strings.

paste("Hello", "World")
## [1] "Hello World"

As you can see, there is a space between the two strings we that we connect. If you would like not to have a space between the two strings, we can use paste0().

paste0("face", "book")
## [1] "facebook"

Most of the relevant functions we can use will be in the package stringr. It is part of the tidyverse but can also be called individually. As you have already installed the tidyverse by now, it is not necessary to install the package again.

library("stringr")

4.1.1 Regular expression (regex)

When working with strings, and in particular when manipulating strings, it is useful to rely on regular expression. Regular expression is a formal language for specifying text strings. Before you can fully leverage the functions available in the stringr package, we suggest that you familiarize yourself with the basics of regular expression (regex for short).

What we will do here is to briefly introduce the basics. If you would like additional material, we recommend that you check out https://github.com/ziishaned/learn-regex and https://github.com/aloisdg/awesome-regex.

To illustrate the usefulness of regex, we will use the functions grep() and grepl(). The former gives us information about the place of a string in a vector. The latter returns a logical vector with information on whether the text we are looking for is present or not. Let us save four pieces of text in an object called trump_text.

trump_text <- c("Trump", "trump", "trump is a 0", "Trump is a loser")

To see where “trump” is mentioned in this object, we can first use grep(). Notice that we first specify the text we will like to find and then the object.

grep("trump", trump_text)
## [1] 2 3

The output shows that “trump” is present in the second and third place in the object. We can use grepl() to get a similar result.

grepl("trump", trump_text)
## [1] FALSE  TRUE  TRUE FALSE

The only difference here is that we get a TRUE or FALSE for each element in the object.

Let us say, however, that we want to get information on whether “trump” or “Trump” is mentioned, i.e. that we do not care about whether it is with a lower t or not. To do this, we can use [] as a disjunction. Within this, we can say that we are looking for both “trump” and “Trump.”

grepl("[Tt]rump", trump_text)
## [1] TRUE TRUE TRUE TRUE

The output not shows that either Trump or trump is present in all four elements. Next, we can use the same logic to explore whether a number is present in any of the elements.

grepl("[0123456789]", trump_text)
## [1] FALSE FALSE  TRUE FALSE

We can see that a number is present in the third element. However, we do not need to include all numbers and we will get the same result by specifying a range of numbers from 0 to 9 (the same logic can be applied to letters, e.g. [a-o] will include all letters from a to o in the alphabet.

grepl("[0-9]", trump_text)
## [1] FALSE FALSE  TRUE FALSE

If you want to check whether at least one of two different words is mentioned, e.g. Trump or trump, you can use the “|”-sign.

grepl("Trump|trump", trump_text)
## [1] TRUE TRUE TRUE TRUE

A few other characters that are good to know about:

  • “.” means any character. Tr.mp will match both Trump and Tramp.
  • “\.” is a period (i.e., the "" sign escapes the regex)
  • “?” is for a character that might be there or not (for example, colou?r matches both color and colour)
  • “*” will be any number of similar characters (for example, wo*w will match wow, woow, wooow etc.)

4.1.2 Changing the case of strings

Some of the functions have relatively simple purposes, such as str_to_upper() (which convert all characters in a string to upper case), str_to_lower() (which convert all characters in a string to lower case), str_to_title() (which convert the first letter in each word in a string to upper case) and str_to_sentence() (which convert the first letter in a sentence to upper case and everything else to lower case).

str_to_upper("Quantitative Politics with R")
## [1] "QUANTITATIVE POLITICS WITH R"
str_to_lower("Quantitative Politics with R")
## [1] "quantitative politics with r"
str_to_title("Quantitative Politics with R")
## [1] "Quantitative Politics With R"
str_to_sentence("Quantitative Politics with R")
## [1] "Quantitative politics with r"

4.1.3 Subset and mutate strings

We can use str_sub() to get a part of the text we are looking at. Say we want to get the first four characters of a string, we can specify start = 1 and end = 4.

str_sub("Quantitative Politics with R", start = 1, end = 4)
## [1] "Quan"

If we would like to get the last four characters, we can simply specify start = -4 as the option.

str_sub("Quantitative Politics with R", start = -4)
## [1] "th R"

Here, we are going to look at cigarette taxes, and namely on whether the cigarette taxes are in the low, middle or high category. To look at this we will use the cig_tax12_3 variable in the states data frame.

table(states$cig_tax12_3)

We can see that the names for these categories are LoTax, MidTax and HiTax. With the code below we use str_replace_all() to replace the characters with new characters, e.g. HiTax becomes High taxes.

states$cig_taxes <- str_replace_all(states$cig_tax12_3, 
                                    c("HiTax" = "High taxes", 
                                      "MidTax" = "Middle taxes",
                                      "LoTax" = "Low taxes"))

table(states$cig_taxes)
## 
##   High taxes    Low taxes Middle taxes 
##           15           17           18

For examples on more of the functions available in the stringr package, see this introduction.

Three other functions that can come in useful are from the tidyr package: separate(), unite() and extract(). tidyr is also part of the tidyverse and if you load the tidyverse package, you do not need to load tidyr. These functions come with multiple options and we suggest that you consult the documentations for these in order to see examples and the different options, e.g. help("separate").

4.2 Factors

For the cigarette taxes we have worked with above, these are categorical data that we can order. To work with ordered and unordered categories, factors is a class in R class that makes these categories good to work with. In brief, categorical data is a variable with a fixed set of possible values. This is also useful when you want to use a non-alphabetical ordering of the values in a variable.

For factors, we are going to use the package forcats. This package is also part of the tidyverse.

library("forcats")

We create a new variable, cig_taxes_cat as a factor variable and then we see what levels we have (and the order of these).

states$cig_taxes_cat <- factor(states$cig_taxes)

levels(states$cig_taxes_cat)
## [1] "High taxes"   "Low taxes"    "Middle taxes"

As we can see, these levels are now in the wrong order (sorted alphabetically). We can use the fct_relevel() to specify the order of the categories (from low to high).

states$cig_taxes_cat <- fct_relevel(states$cig_taxes_cat, 
                                    "Low taxes", 
                                    "Middle taxes", 
                                    "High taxes")

levels(states$cig_taxes_cat)
## [1] "Low taxes"    "Middle taxes" "High taxes"

This will become useful later on when we want to make sure that the categories in a data visualisation has the correct order.

For additional guidance on the functions available in the forcats package, see https://forcats.tidyverse.org/.

4.3 Dates and time

To work with dates and time in R, there are two useful packages. The first is hms that is good with hours, minutes and seconds. The second is lubridate that is good with dates. Let us take a closer look at how to work with seconds, minutes and hours by loading the package hms.

library("hms")

This package is useful if you want to easily convert minutes into hours, for example 500 minutes into hours. As lubridate also have an hms(), we will use :: to tell that we are using the function hms() in the hms package.

hms::hms(min = 500)
## 08:20:00

As we can see, this gives 8 hours and 20 minutes. We can also specify hours, minutes and seconds to get the time as POSIXct.

hms::hms(hour = 15, min = 90, seconds = 12)
## 16:30:12

For dates, we will first load the package lubridate.

library("lubridate")
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:hms':
## 
##     hms
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

This package has several functions that are useful in terms of working with dates. For example, we can use ymd() if we have text that has the year, month and day.

ymd("2019/09/30")
## [1] "2019-09-30"

The package can also work with months as text, such as:

mdy("September 30, 2019")
## [1] "2019-09-30"

The good thing about this is that we can work with the date information. Let us first save the date in an object called date and use year() to get the year out of the variable.

# Save September 30, 2019 in object
date <- ymd("2019-09-30")

# Get year
year(date)
## [1] 2019

Similarly, we can get the week number out of the date.

week(date)
## [1] 39

We can see that this date was in week 39. We can use wday() to get the number of the day in the week this was.

wday(date)
## [1] 2

If we would rather prefer the name of the day, we can use label and abbr as options (the latter option in order to get the full day name).

wday(date, label = TRUE, abbr = FALSE)
## [1] Monday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

If you would like to get the difference between two dates, you can simply subtract one date from the other as in the example below.

ymd("2019-09-30") - ymd("2019-09-01")
## Time difference of 29 days

Some of the relevant functions in the lubridate package are: year() (year), month() (month), week() (week number), day() (day of month), wday() (day week), qday() (day of quarter), and yday() (day of year).

Last, if you have information on a date in a different format (i.e. not as used above), you can specify this format. For example, if the data is 30/09/2020, you can specify the format with %d/%m/%Y to turn it into a date.

as.Date("30/09/2020", format = "%d/%m/%Y")
## [1] "2020-09-30"

Here is a list of some of the most useful formats:

  • %d = day as number (0-31)
  • %a = abbreviated weekday
  • %A = unabbreviated weekday
  • %m = month (00-12)
  • %b = abbreviated month
  • %B = unabbrevidated month
  • %y = 2 digit year
  • %Y = four digit year