Chapter 4 Manipulating text
We introduced text in the previous chapter. In this chapter, we will show how to manipulate text as strings and factors. We will use the
states dataset from the
poliscidata package. For more on this data, see Chapter 5.
## Registered S3 method overwritten by 'gdata': ## method from ## reorder.factor gplots
You can use
View(states) to get a sense of the 50 observations and the 135 variables.
There are a few functions that are great to use, when you use strings. First,
paste() makes it easy to connect two strings.
##  "Hello World"
As you can see, there is a space between the two strings we that we connect. If you would like not to have a space between the two strings, we can use
##  "facebook"
Most of the relevant functions we can use will be in the package
stringr. It is part of the
tidyverse but can also be called individually. As you have already installed the
tidyverse by now, it is not necessary to install the package again.
4.1.1 Regular expression (regex)
When working with strings, and in particular when manipulating strings, it is useful to rely on regular expression. Regular expression is a formal language for specifying text strings. Before you can fully leverage the functions available in the
stringr package, we suggest that you familiarize yourself with the basics of regular expression (regex for short).
What we will do here is to briefly introduce the basics. If you would like additional material, we recommend that you check out https://github.com/ziishaned/learn-regex and https://github.com/aloisdg/awesome-regex.
To illustrate the usefulness of regex, we will use the functions
grepl(). The former gives us information about the place of a string in a vector. The latter returns a logical vector with information on whether the text we are looking for is present or not. Let us save four pieces of text in an object called
<- c("Trump", "trump", "trump is a 0", "Trump is a loser")trump_text
To see where “trump” is mentioned in this object, we can first use
grep(). Notice that we first specify the text we will like to find and then the object.
##  2 3
The output shows that “trump” is present in the second and third place in the object. We can use
grepl() to get a similar result.
##  FALSE TRUE TRUE FALSE
The only difference here is that we get a
FALSE for each element in the object.
Let us say, however, that we want to get information on whether “trump” or “Trump” is mentioned, i.e. that we do not care about whether it is with a lower t or not. To do this, we can use  as a disjunction. Within this, we can say that we are looking for both “trump” and “Trump.”
##  TRUE TRUE TRUE TRUE
The output not shows that either Trump or trump is present in all four elements. Next, we can use the same logic to explore whether a number is present in any of the elements.
##  FALSE FALSE TRUE FALSE
We can see that a number is present in the third element. However, we do not need to include all numbers and we will get the same result by specifying a range of numbers from 0 to 9 (the same logic can be applied to letters, e.g. [a-o] will include all letters from a to o in the alphabet.
##  FALSE FALSE TRUE FALSE
If you want to check whether at least one of two different words is mentioned, e.g. Trump or trump, you can use the “|”-sign.
##  TRUE TRUE TRUE TRUE
A few other characters that are good to know about:
- “.” means any character. Tr.mp will match both Trump and Tramp.
- “\.” is a period (i.e., the "" sign escapes the regex)
- “?” is for a character that might be there or not (for example, colou?r matches both color and colour)
- “*” will be any number of similar characters (for example, wo*w will match wow, woow, wooow etc.)
4.1.2 Changing the case of strings
Some of the functions have relatively simple purposes, such as
str_to_upper() (which convert all characters in a string to upper case),
str_to_lower() (which convert all characters in a string to lower case),
str_to_title() (which convert the first letter in each word in a string to upper case) and
str_to_sentence() (which convert the first letter in a sentence to upper case and everything else to lower case).
str_to_upper("Quantitative Politics with R")
##  "QUANTITATIVE POLITICS WITH R"
str_to_lower("Quantitative Politics with R")
##  "quantitative politics with r"
str_to_title("Quantitative Politics with R")
##  "Quantitative Politics With R"
str_to_sentence("Quantitative Politics with R")
##  "Quantitative politics with r"
4.1.3 Subset and mutate strings
We can use
str_sub() to get a part of the text we are looking at. Say we want to get the first four characters of a string, we can specify
start = 1 and
end = 4.
str_sub("Quantitative Politics with R", start = 1, end = 4)
##  "Quan"
If we would like to get the last four characters, we can simply specify
start = -4 as the option.
str_sub("Quantitative Politics with R", start = -4)
##  "th R"
Here, we are going to look at cigarette taxes, and namely on whether the cigarette taxes are in the low, middle or high category. To look at this we will use the
cig_tax12_3 variable in the
states data frame.
We can see that the names for these categories are LoTax, MidTax and HiTax. With the code below we use
str_replace_all() to replace the characters with new characters, e.g. HiTax becomes High taxes.
$cig_taxes <- str_replace_all(states$cig_tax12_3, statesc("HiTax" = "High taxes", "MidTax" = "Middle taxes", "LoTax" = "Low taxes")) table(states$cig_taxes)
## ## High taxes Low taxes Middle taxes ## 15 17 18
For examples on more of the functions available in the
stringr package, see this introduction.
Three other functions that can come in useful are from the
tidyr is also part of the
tidyverse and if you load the
tidyverse package, you do not need to load
tidyr. These functions come with multiple options and we suggest that you consult the documentations for these in order to see examples and the different options, e.g.
For the cigarette taxes we have worked with above, these are categorical data that we can order. To work with ordered and unordered categories, factors is a class in
R class that makes these categories good to work with. In brief, categorical data is a variable with a fixed set of possible values. This is also useful when you want to use a non-alphabetical ordering of the values in a variable.
For factors, we are going to use the package
forcats. This package is also part of the
We create a new variable,
cig_taxes_cat as a factor variable and then we see what levels we have (and the order of these).
$cig_taxes_cat <- factor(states$cig_taxes) states levels(states$cig_taxes_cat)
##  "High taxes" "Low taxes" "Middle taxes"
As we can see, these levels are now in the wrong order (sorted alphabetically). We can use the
fct_relevel() to specify the order of the categories (from low to high).
$cig_taxes_cat <- fct_relevel(states$cig_taxes_cat, states"Low taxes", "Middle taxes", "High taxes") levels(states$cig_taxes_cat)
##  "Low taxes" "Middle taxes" "High taxes"
This will become useful later on when we want to make sure that the categories in a data visualisation has the correct order.
For additional guidance on the functions available in the
forcats package, see https://forcats.tidyverse.org/.
4.3 Dates and time
To work with dates and time in
R, there are two useful packages. The first is
hms that is good with hours, minutes and seconds. The second is
lubridate that is good with dates. Let us take a closer look at how to work with seconds, minutes and hours by loading the package
This package is useful if you want to easily convert minutes into hours, for example 500 minutes into hours. As
lubridate also have an
hms(), we will use
:: to tell that we are using the function
hms() in the
::hms(min = 500)hms
As we can see, this gives 8 hours and 20 minutes. We can also specify hours, minutes and seconds to get the time as POSIXct.
::hms(hour = 15, min = 90, seconds = 12)hms
For dates, we will first load the package
## ## Attaching package: 'lubridate'
## The following object is masked from 'package:hms': ## ## hms
## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union
This package has several functions that are useful in terms of working with dates. For example, we can use
ymd() if we have text that has the year, month and day.
##  "2019-09-30"
The package can also work with months as text, such as:
mdy("September 30, 2019")
##  "2019-09-30"
The good thing about this is that we can work with the date information. Let us first save the date in an object called
date and use
year() to get the year out of the variable.
# Save September 30, 2019 in object <- ymd("2019-09-30") date # Get year year(date)
##  2019
Similarly, we can get the week number out of the date.
##  39
We can see that this date was in week 39. We can use
wday() to get the number of the day in the week this was.
##  2
If we would rather prefer the name of the day, we can use
abbr as options (the latter option in order to get the full day name).
wday(date, label = TRUE, abbr = FALSE)
##  Monday ## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
If you would like to get the difference between two dates, you can simply subtract one date from the other as in the example below.
ymd("2019-09-30") - ymd("2019-09-01")
## Time difference of 29 days
Some of the relevant functions in the
lubridate package are:
week() (week number),
day() (day of month),
wday() (day week),
qday() (day of quarter), and
yday() (day of year).
Last, if you have information on a date in a different format (i.e. not as used above), you can specify this format. For example, if the data is 30/09/2020, you can specify the format with %d/%m/%Y to turn it into a date.
as.Date("30/09/2020", format = "%d/%m/%Y")
##  "2020-09-30"
Here is a list of some of the most useful formats:
- %d = day as number (0-31)
- %a = abbreviated weekday
- %A = unabbreviated weekday
- %m = month (00-12)
- %b = abbreviated month
- %B = unabbrevidated month
- %y = 2 digit year
- %Y = four digit year