Appendix C — Exercise 4 solutions

C.1 Question 1

How many respondents had both weekly rent and mortgage payments given? What are the potential reasons for this?

Solution

We can combine the filter and count functions to answer the first part of this question:

ehs_tidy %>% 
  filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>% 
  count()
# A tibble: 1 × 1
      n
  <int>
1   105

There were 105 respondents with both weekly rent and mortgage payments.

To understand why this is, we could first view this data (or a summary of the data) to see what other characteristics these respondents share:

ehs_tidy %>% 
  filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>% 
  summary()
       id              weighting                    tenure_type 
 Min.   :2.022e+10   Min.   :  305.5   housing association:  0  
 1st Qu.:2.022e+10   1st Qu.:  628.9   local authority    :  0  
 Median :2.022e+10   Median : 1058.4   owner occupied     :105  
 Mean   :2.022e+10   Mean   : 1859.3   private rented     :  0  
 3rd Qu.:2.022e+10   3rd Qu.: 2535.4                            
 Max.   :2.022e+10   Max.   :10568.5                            
                                                                
           region    gross_income            length_residence  weekly_rent    
 South East   :22   Min.   :  9880   two years       :27      Min.   :  8.30  
 East         :19   1st Qu.: 26595   one year        :19      1st Qu.: 49.80  
 London       :19   Median : 39655   3-4 years       :15      Median : 69.69  
 North West   :13   Mean   : 41908   5-9 years       :15      Mean   : 86.23  
 South West   :13   3rd Qu.: 52323   less than 1 year: 9      3rd Qu.:120.00  
 East Midlands: 8   Max.   :100000   10-19 years     : 8      Max.   :219.23  
 (Other)      :11                    (Other)         :12                      
 weekly_mortgage    freehold_leasehold
 Min.   :  0.0231   freehold :29      
 1st Qu.:  0.0231   leasehold:65      
 Median : 60.0000   NA's     :11      
 Mean   : 71.6271                     
 3rd Qu.:103.8462                     
 Max.   :343.8138                     
                                      

All respondents in this group were owner occupiers. Most of the properties were leasehold, suggesting that some of the weekly rent refers to lease payments. Other potential reasons could include shared ownership (which is not given as an option for tenure type), or respondents who lived with renters in the same property.
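As an alternative to reading the full summary() output, the two variables of interest can be tabulated directly with count(). The sketch below uses a small made-up tibble called toy (its values are illustrative and are not taken from ehs_tidy); only the column names are borrowed from the data above.

```r
library(dplyr)

# Hypothetical data for illustration only, not rows from ehs_tidy
toy <- tibble(
  tenure_type        = c("owner occupied", "owner occupied",
                         "private rented", "owner occupied"),
  freehold_leasehold = c("leasehold", "freehold", NA, "leasehold"),
  weekly_rent        = c(50, 20, 150, 70),
  weekly_mortgage    = c(60, 90, NA, 100)
)

# Keep respondents with both payments recorded, then count each
# combination of tenure type and freehold/leasehold status
both_payments <- toy %>%
  filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>%
  count(tenure_type, freehold_leasehold)

both_payments
```

Applied to ehs_tidy, the same two-step pipeline would give one row per combination, which can be easier to scan than the full summary.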

C.2 Question 2

Combine the weekly rent and mortgage variables into a single weekly payment variable.

Solution

Where only one value has been recorded, we want to use this in the new variable. Where both have been recorded, we will need to add the values together to get a weekly total.

There are a few different ways to do this. One is to use an if_else statement inside the mutate function, so that the calculation depends on whether either value is missing:

ehs_tidy_ex4 <- ehs_tidy %>% 
  mutate(weekly_total = if_else(is.na(weekly_rent) |
                                  is.na(weekly_mortgage),
                                coalesce(weekly_rent, weekly_mortgage),
                                weekly_rent + weekly_mortgage))
The three arguments to if_else, in order:
1. If either weekly_rent or weekly_mortgage is missing
2. Then return the non-missing value (or NA if both are missing)
3. Or else (if both are NOT missing), return the sum of these values
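Another way to get the same result, sketched here on a made-up tibble called toy (hypothetical values, not rows from ehs_tidy), is to replace missing values with 0 before adding, while keeping the total missing when both payments are missing. This is only one possible alternative, not necessarily the approach intended by the exercise.

```r
library(dplyr)

# Hypothetical data covering all four missingness patterns
toy <- tibble(
  weekly_rent     = c(100, NA, 50, NA),
  weekly_mortgage = c(NA, 80, 25, NA)
)

toy_total <- toy %>%
  mutate(weekly_total = case_when(
    # Both missing: keep the total missing rather than reporting 0
    is.na(weekly_rent) & is.na(weekly_mortgage) ~ NA_real_,
    # Otherwise treat a missing payment as 0 and add the two together
    TRUE ~ coalesce(weekly_rent, 0) + coalesce(weekly_mortgage, 0)
  ))

toy_total$weekly_total
```

The explicit check for both values being missing matters: without it, respondents with no recorded payment would be given a weekly total of 0 instead of NA.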

C.3 Question 3

Create a summary table containing the mean, median, standard deviation, and the upper and lower quartiles of the weekly payment (rent and mortgage combined) for each region. What, if anything, can you infer about the distribution of this variable based on the table?

Solution

ehs_tidy_ex4 %>% 
  group_by(region) %>%
  summarise(mean_payment = wtd.mean(weekly_total, weights = weighting,
                                    na.rm = TRUE),
            median_payment = wtd.quantile(weekly_total,
                                          weights = weighting,
                                          probs = .5, na.rm = TRUE),
            sd_payment = sqrt(wtd.var(weekly_total, weights = weighting,
                                      na.rm = TRUE)),
            lq_payment = wtd.quantile(weekly_total, weights = weighting,
                                      probs = .25, na.rm = TRUE),
            uq_payment = wtd.quantile(weekly_total, weights = weighting,
                                      probs = .75, na.rm = TRUE)) %>%
  ungroup()
The steps in this code, in order:
1. Calculate summaries per region
2. Return the weighted mean
3. Return the weighted median (the 50th percentile)
4. Return the weighted standard deviation (the square root of the weighted variance)
5. Return the weighted lower quartile (the 25th percentile)
6. Return the weighted upper quartile (the 75th percentile)
7. Don’t forget to ungroup

# A tibble: 9 × 6
  region            mean_payment median_payment sd_payment lq_payment uq_payment
  <fct>                    <dbl>          <dbl>      <dbl>      <dbl>      <dbl>
1 East                      182.          150        154.       104.        208.
2 East Midlands             151.          125.       117.        93         162.
3 London                    283.          254.       223.       135         346.
4 North East                111.           96.9       52.6       81         115.
5 North West                124.          110.        68.7       85.2       138.
6 South East                210.          179.       149.       121.        245.
7 South West                153.          138.        92.0       97         183.
8 West Midlands             142.          122.        88.1       92         150 
9 Yorkshire and th…         120.          107.        76.4       82         133.
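
To illustrate what the weighting does in a summary like this, here is a minimal sketch using base R's weighted.mean() on made-up values (the vectors x and w are hypothetical and are not taken from ehs_tidy). It is shown with base R so that it does not depend on whichever package provides wtd.mean in the solution.

```r
# Hypothetical payments and survey weights
x <- c(100, 200, 300)
w <- c(1, 1, 2)

# Every observation counts equally
unweighted <- mean(x)

# The third observation counts twice as much as each of the others
weighted <- weighted.mean(x, w)

# The same calculation written out by hand
by_hand <- sum(w * x) / sum(w)

c(unweighted = unweighted, weighted = weighted, by_hand = by_hand)
```

Because the largest value carries the largest weight, the weighted mean (225) is higher than the unweighted mean (200).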

In every region the mean is larger than the median, and in most the gap is substantial, indicating that the data are positively skewed rather than normally distributed. If we use the approximate 95% range formula (mean \(\pm\) (2 \(\times\) sd)), the lower limit would be negative for almost all regions. Negative payments do not make sense in this context, confirming that the data are not normally distributed.
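
The approximate 95% range check can be sketched as follows. The means and standard deviations are typed in from the rounded values printed in the table above, so the limits are approximate; the object name region_summary and the columns lower and upper are introduced here for illustration only.

```r
library(dplyr)

# Rounded means and standard deviations copied from the printed table
region_summary <- tibble(
  region = c("East", "East Midlands", "London", "North East",
             "North West", "South East", "South West",
             "West Midlands", "Yorkshire and the Humber"),
  mean_payment = c(182, 151, 283, 111, 124, 210, 153, 142, 120),
  sd_payment   = c(154, 117, 223, 52.6, 68.7, 149, 92.0, 88.1, 76.4)
)

# Approximate 95% range: mean plus or minus two standard deviations
ranges <- region_summary %>%
  mutate(lower = mean_payment - 2 * sd_payment,
         upper = mean_payment + 2 * sd_payment)

# Number of regions where the lower limit is an impossible negative payment
n_negative <- sum(ranges$lower < 0)
n_negative
```

With these rounded inputs, the North East is the only region whose lower limit stays above zero.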

In this case, the median and IQR should be given, not the mean and standard deviation.