6 Loading and tidying Excel data

In the first part of the course, we saw how SPSS data (.sav) can be loaded into R using the haven package. Another common format of data that cannot be loaded into base R is excel (.xlsx). The package required to read Excel data is the readxl package.

If this is the first time using the readxl package, remember to install this to your machine using the install.packages function:

install.packages("readxl")

Once this package has been installed, add it to the list of packages to install at the top of your script file:

pacman::p_load(tidyverse, haven, readxl)

6.1 Loading an Excel sheet into R

As with any file format, we must ensure data are in the correct form before loading them into R, ensuring each column represents a variable, each row represents an observation, and there are no tables or graphics.

Excel files can be a little trickier to manipulate than SPSS and CSV files as they often contain multiple sheets. This is the case for the data that we will be using for this part of the course.

The Excel file that we will be loading contains the Office for Budget Responsibility (OBR) economic and fiscal outlook from 2024. This contains many sheets of data, but for this course we will just be focusing on three:

1.13 Household disposable income
1.14 National living wage
1.17 Housing market

Note

Data can be accessed via the course repository.

Let’s begin with housing market data, stored in the 18th sheet, labelled “1.17”. This sheet can be selected in the read_xlsx function using the sheet argument.

The housing market sheet shows information over different time scales: first by quarters, then years, and then across pairs of years. For this example, we will extract information measured quarterly (rows 4 - 88). The argument range allows us to define the range of cells (by columns and rows) to extract.

Finally, we can see that the column headers are not in an appropriate format for R: they contain spaces, brackets, and are very long! There are two approaches we will consider to overcome this.

The first is to remove the column names completely (by not including them in the range argument and setting col_names = FALSE within the read_xlsx function) and add them manually, using the setNames function.

Setting names manually can take a long time and a lot of typing if there are many variables. An alternative to this manual approach is to include them in the range of the read_xlsx function, and use an R function to ‘clean’ them, making them follow the style guide.

Style tip

The janitor package has been designed to format inputed data to ensure it follows the Tidyverse style guide. The clean_names function can be applied to a data frame or tibble to adapt variable names in this way.

The following code loads the housing market sheet and manually sets the variable names:

list.files(path = "data")

housing_market <- 
  read_xlsx("data/Detailed_forecast_tables_Economy_March_2024.xlsx",
                            sheet = "1.17",range = "B4:J88",
                            col_names = FALSE) |>
  setNames(c("period", "hpi_comp15", "hpi_prev_year",
             "residential_property_transactions", 
             "private_enterprise_housing_starts",
             "private_enterprise_housing_comp", 
             "housing_stock", "net_additions_housing_stock",
             "turnover_rate"))

1: Return file names from the data folder
2: Specify the sheet and range of cells to keep
3: Remove column names (too messy)
4: Add variable names manually

[1] "Detailed_forecast_tables_Economy_March_2024.xlsx"
[2] "generalfs21_EUL.sav"                             
[3] "interviewfs21_EUL.sav"

The following code loads the same data but uses the janitor package.

Warning

Do not run this code without installing and loading the janitor package first. We will not run this during the course, but it is included for future reference.

housing_market_alt <-
  read_xlsx("data/Detailed_forecast_tables_Economy_March_2024.xlsx",
                                # Specify the sheet and range of cells to keep
                                sheet = "1.17",range = "B3:J88") |> 
  clean_names()

6.2 Splitting variables

In the current dataset, the time variable is given as a character and so is not recognised as ordered or temporal by R. To overcome this, we can split the variable to create separate year and quarter variables.

The str_sub function from tidyverse’s stringr package extracts elements based on their position in a string of characters. This can be used to return the first 4 digits to a new year variable, and the final digit to a new quarter variable:

housing_market <- housing_market |>
  mutate(year =  as.numeric(str_sub(period, start = 1, end = 4)),
         quarter = as.numeric(str_sub(period, 
                              start = -1L, end = -1L))) |>
  select(-period)

1: Don’t forget to convert the string to numberic
2: Use - to work from the end of the string
3: Remove the original period variable

Exercise 5

Load in the OBR’s household disposable income data (sheet 1.13). Split the period data into separate year and quarter variables, ensure that all variable names follow Tidyverse’s style guide. Name this object disposable_income.
Load the OBR’s national living wage data (sheet 1.14), keep as an object named living_wage.

Exercise hint

The original variables take the name of the year they represent. However, variable names cannot begin with a number. Therefore, we must add a prefix to the original names. One way to add a prefix to a string is through the paste0 function which combines arguments in the function separated by commas into a single string. For example,

paste0("abc", 1, 10, "_2010")

[1] "abc110_2010"

Although the rename function requires each individual variable to be listed, look at the dplyr cheatsheet to find an alternative that renames variables using a function.

6.3 Transforming data

The disposable income and housing market data are currently considered in what is known as long format, with many rows and fewer variables. The alternative to this format, wide format, can be seen in the living wage data, which has many variables and few (only one!) row. Sometimes we may wish to convert between these variables, either to join them to other datasets (as is the case here), or to carry out an analysis or visualisation that requires a certain format. These conversions are carried out using the pivot_longer or pivot_wider functions.

There are many ways to pivot data within R (see the helpfile ?pivot_longer for a full list of arguments), and the setup of this function tends to differ for every situation. For worked examples and a more detailed explanation of the function’s capabilities, enter vignette("pivot") into the console.

For our data, we will need to convert the wide-format living wage data to a long-format so we are able to join it to the other data. This will create a new dataset with 2 variables: year and living_wage, with a row per year. The year variable will be taken from the wide data names, and the living_wage variable will come from the wide data values:

living_wage_long <- living_wage |> 
  pivot_longer(cols = everything(),
               names_to = "year",
               names_prefix = "year_",
               names_transform = as.numeric,
               values_to = "living_wage")

1: First, select the columns that we wish to pivot (all of them). See the ?tidyr_tidy_select help file for a list of options.
2: Move the old variable names to a new year variable.
3: Remove the prefix from the old variable names.
4: Convert the new year variable to numeric.
5: Take the old cell values and create a new living_wage variable.

Exercise 6

Combine all three OBR datasets (housing market, disposable income and living wage) together to create one complete dataset, obr_data.