Chapter 4 Manipulating text

We introduced text in the previous chapter. In this chapter, we will show how to manipulate text as strings and factors. We will use the states dataset from the poliscidata package. For more on this data, see Chapter 5.

library("poliscidata")

states <- states

You can use View(states) to get a sense of the 50 observations and the 135 variables.

4.1 Strings

We will use the package stringr. It is part of the tidyverse but can also be loaded on its own. As you should already have installed the tidyverse by now, it is not necessary to install the package again.

library("stringr")

Some of the functions have relatively simple purposes, such as str_to_upper() (which converts all characters in a string to upper case) and str_to_lower() (which converts all characters in a string to lower case).

str_to_upper("Quantitative Politics with R")
## [1] "QUANTITATIVE POLITICS WITH R"
str_to_lower("Quantitative Politics with R")
## [1] "quantitative politics with r"

We can use str_sub() to get a part of the text we are looking at. If we want the first four characters of a string, we can specify start = 1 and end = 4.

str_sub("Quantitative Politics with R", start = 1, end = 4)
## [1] "Quan"

If we would like to get the last four characters, we can simply specify start = -4 as the option.

str_sub("Quantitative Politics with R", start = -4)
## [1] "th R"
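
The two calls above use a single string, but str_sub() also works on a whole vector of strings at once. The sketch below illustrates this with a small made-up vector (state_names is a hypothetical example object, not a variable in the states data).

```r
library("stringr")

# Hypothetical example vector, not taken from the states data
state_names <- c("Alabama", "Alaska", "Arizona")

# First three characters of every element
first_three <- str_sub(state_names, start = 1, end = 3)
first_three
## [1] "Ala" "Ala" "Ari"

# Last three characters of every element (negative positions count from the end)
last_three <- str_sub(state_names, start = -3)
last_three
## [1] "ama" "ska" "ona"
```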

Here, we are going to look at cigarette taxes, namely whether the cigarette taxes in a state are in the low, middle or high category. To do this, we will use the cig_tax12_3 variable in the states data frame.

table(states$cig_tax12_3)

We can see that the names for these categories are LoTax, MidTax and HiTax. With the code below we use str_replace_all() to replace these names with more readable labels, e.g. HiTax becomes High taxes.

states$cig_taxes <- str_replace_all(states$cig_tax12_3, 
                                    c("HiTax" = "High taxes", 
                                      "MidTax" = "Middle taxes",
                                      "LoTax" = "Low taxes"))

table(states$cig_taxes)
## 
##   High taxes    Low taxes Middle taxes 
##           15           17           18
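
To see what the named vector does without loading the data, here is a minimal sketch on a made-up vector (tax_codes and new_labels are hypothetical names used only for illustration). The names of the vector are the patterns to look for and the values are the replacements. The result is a character vector.

```r
library("stringr")

# Hypothetical values mimicking the categories in cig_tax12_3
tax_codes <- c("LoTax", "HiTax", "MidTax", "LoTax")

# Names are the patterns to find, values are the replacements
new_labels <- c("HiTax" = "High taxes",
                "MidTax" = "Middle taxes",
                "LoTax" = "Low taxes")

tax_labels <- str_replace_all(tax_codes, new_labels)
tax_labels
## [1] "Low taxes"    "High taxes"   "Middle taxes" "Low taxes"

# The result is text (character), not a factor
is.character(tax_labels)
## [1] TRUE
```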

For examples of more of the functions available in the stringr package, see the package documentation at https://stringr.tidyverse.org/.

4.2 Factors

The cigarette tax categories we have worked with above are categorical data that we can order. To work with ordered and unordered categories, R has a class called factor that makes these categories easy to work with. For factors, we are going to use the package forcats. This package is also part of the tidyverse.

library("forcats")

We create a new variable, cig_taxes_cat, as a factor variable and then look at the levels we have (and their order).

states$cig_taxes_cat <- factor(states$cig_taxes)

levels(states$cig_taxes_cat)
## [1] "High taxes"   "Low taxes"    "Middle taxes"

As we can see, these levels are in the wrong order (sorted alphabetically). We can use fct_relevel() to specify the order of the categories (from low to high).

states$cig_taxes_cat <- fct_relevel(states$cig_taxes_cat, 
                                    "Low taxes", 
                                    "Middle taxes", 
                                    "High taxes")

levels(states$cig_taxes_cat)
## [1] "Low taxes"    "Middle taxes" "High taxes"
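
As a point of comparison, the same order can also be set when the factor is created, using the levels argument of the base R function factor(). The sketch below uses a small made-up vector (taxes is a hypothetical example object) and checks that both approaches give the same levels.

```r
library("forcats")

# Hypothetical example values, not taken from the states data
taxes <- c("Middle taxes", "Low taxes", "High taxes", "Low taxes")

# Without further instructions, the levels are sorted alphabetically
default_order <- factor(taxes)
levels(default_order)
## [1] "High taxes"   "Low taxes"    "Middle taxes"

# Option 1: reorder an existing factor with fct_relevel()
releveled <- fct_relevel(default_order,
                         "Low taxes",
                         "Middle taxes",
                         "High taxes")

# Option 2: set the order directly when creating the factor
base_order <- factor(taxes,
                     levels = c("Low taxes", "Middle taxes", "High taxes"))

# Both approaches give the same order of levels
identical(levels(releveled), levels(base_order))
## [1] TRUE
```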

This will become useful later on when we want to make sure that the categories in a data visualisation have the correct order.

For additional guidance on the functions available in the forcats package, see https://forcats.tidyverse.org/.

4.3 Dates and time

To work with dates and time in R, there are two useful packages. The first is hms, which is good with hours, minutes and seconds. The second is lubridate, which is good with dates. Let us take a closer look at how to work with seconds, minutes and hours by loading the package hms.

library("hms")

This package is useful if you want to easily convert minutes into hours, for example 500 minutes into hours. As lubridate also has a function called hms(), we will use :: to make clear that we are using the function hms() from the hms package.

hms::hms(minutes = 500)
## 08:20:00

As we can see, this gives 8 hours and 20 minutes. We can also specify hours, minutes and seconds together. Note that the result is an object of class hms (which builds on difftime), not a POSIXct date-time.

hms::hms(hours = 15, minutes = 90, seconds = 12)
## 16:30:12
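
If you want to check what kind of object hms() returns, or use the time in a calculation, the sketch below may help. It assumes that an hms object stores the time as a number of seconds, so that as.numeric() returns seconds (duration is a hypothetical object name used only for illustration).

```r
library("hms")

# 500 minutes stored as an hms object
duration <- hms::hms(minutes = 500)

# The object is of class hms, which builds on difftime
inherits(duration, "hms")
inherits(duration, "difftime")

# It is not a POSIXct date-time
inherits(duration, "POSIXct")

# Convert to a plain number of seconds, and from there to hours
seconds_total <- as.numeric(duration)
hours_total <- seconds_total / 3600
```

Here, seconds_total is 30000 (500 times 60), and dividing by 3600 gives the same time expressed in hours.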

For dates, we will first load the package lubridate.

library("lubridate")
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:hms':
## 
##     hms
## The following object is masked from 'package:base':
## 
##     date

This package has several functions that are useful for working with dates. For example, we can use ymd() if we have text that has the year, month and day.

ymd("2019/09/30")
## [1] "2019-09-30"

The package can also work with months as text, such as:

mdy("September 30, 2019")
## [1] "2019-09-30"
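
The name of each parsing function simply gives the order of year (y), month (m) and day (d) in the text. As a small sketch, the three calls below read the same date written in three different ways (the object names d1, d2 and d3 are made up for this example).

```r
library("lubridate")

# The same date written in three different orders
d1 <- ymd("2019-09-30")   # year, month, day
d2 <- mdy("09/30/2019")   # month, day, year
d3 <- dmy("30.09.2019")   # day, month, year

# All three are stored as objects of class Date
class(d1)
## [1] "Date"

# ... and they are the same date
d1 == d2
## [1] TRUE
d2 == d3
## [1] TRUE
```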

The good thing about this is that we can work with the date information. Let us first save the date in an object called date and use year() to get the year out of the variable.

# Save September 30, 2019 in object
date <- ymd("2019-09-30")

# Get year
year(date)
## [1] 2019

Similarly, we can get the week number out of the date.

week(date)
## [1] 39

We can see that this date was in week 39. We can use wday() to get the number of the day of the week. By default, the week starts on Sunday, so Sunday is 1 and Monday is 2.

wday(date)
## [1] 2

If we would rather have the name of the day, we can use the options label = TRUE and abbr = FALSE (the latter in order to get the full day name instead of an abbreviation).

wday(date, label = TRUE, abbr = FALSE)
## [1] Monday
## Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday

If you would like to get the difference between two dates, you can simply subtract one date from the other as in the example below.

ymd("2019-09-30") - ymd("2019-09-01")
## Time difference of 29 days
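
The result of the subtraction is a difftime object. If you need the difference as a plain number, for example to use it in a calculation, one option is as.numeric() with the units argument. This is a minimal sketch using base R functionality; the object names are made up for the example.

```r
library("lubridate")

start_date <- ymd("2019-09-01")
end_date <- ymd("2019-09-30")

# Subtracting two dates gives a difftime object
gap <- end_date - start_date

# The difference as a plain number of days
gap_days <- as.numeric(gap, units = "days")
gap_days
## [1] 29

# The same difference expressed in weeks (29 divided by 7)
gap_weeks <- as.numeric(gap, units = "weeks")
```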

Some of the relevant functions in the lubridate package are: year() (year), month() (month), week() (week number), day() (day of the month), wday() (day of the week), qday() (day of the quarter), and yday() (day of the year).
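
As a short illustration of the functions not shown above, the sketch below applies them to the same date, September 30, 2019 (stored here under the made-up name example_date).

```r
library("lubridate")

example_date <- ymd("2019-09-30")

# Month number
month(example_date)
## [1] 9

# Day of the month
day(example_date)
## [1] 30

# Day of the quarter (the third quarter starts on July 1)
qday(example_date)
## [1] 92

# Day of the year
yday(example_date)
## [1] 273
```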