ehs_tidy %>%
filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>%
count()# A tibble: 1 × 1
n
<int>
1 105
How many respondents had both weekly rent and mortgage payments given? What are the potential reasons for this?
We can combine the filter and count functions to answer the first part of this question:
ehs_tidy %>%
filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>%
count()# A tibble: 1 × 1
n
<int>
1 105
There were 105 respondents with both weekly rent and mortgage payments.
To understand why this is, we could first view this data (or a summary of the data) to see what other characteristics these respondents share:
ehs_tidy %>%
filter(!is.na(weekly_rent), !is.na(weekly_mortgage)) %>%
summary() id weighting tenure_type
Min. :2.022e+10 Min. : 305.5 housing association: 0
1st Qu.:2.022e+10 1st Qu.: 628.9 local authority : 0
Median :2.022e+10 Median : 1058.4 owner occupied :105
Mean :2.022e+10 Mean : 1859.3 private rented : 0
3rd Qu.:2.022e+10 3rd Qu.: 2535.4
Max. :2.022e+10 Max. :10568.5
region gross_income length_residence weekly_rent
South East :22 Min. : 9880 two years :27 Min. : 8.30
East :19 1st Qu.: 26595 one year :19 1st Qu.: 49.80
London :19 Median : 39655 3-4 years :15 Median : 69.69
North West :13 Mean : 41908 5-9 years :15 Mean : 86.23
South West :13 3rd Qu.: 52323 less than 1 year: 9 3rd Qu.:120.00
East Midlands: 8 Max. :100000 10-19 years : 8 Max. :219.23
(Other) :11 (Other) :12
weekly_mortgage freehold_leasehold
Min. : 0.0231 freehold :29
1st Qu.: 0.0231 leasehold:65
Median : 60.0000 NA's :11
Mean : 71.6271
3rd Qu.:103.8462
Max. :343.8138
All respondents in this group owned and lived in their own home. Most were leasehold properties, suggesting some of the weekly rent refers to lease payments. Other potential reasons could include shared ownership (which is not given as an option for tenure type), or respondents that lived with renters in the same property.
Combine the weekly rent and mortgage variables into a single weekly payment variable.
Where only one value has been recorded, we want to use this in the new variable. Where both have been recorded, we will need to add the values together to get a weekly total.
There are a few different ways to do this. The first is to include an if_else statement in the mutate function, changing how the variable is calculated whether either value is missing or not:
ehs_tidy_ex4 <- ehs_tidy %>%
mutate(weekly_total = if_else(is.na(weekly_rent) |
is.na(weekly_mortgage),
coalesce(weekly_rent, weekly_mortgage),
weekly_rent + weekly_mortgage))weekly_rent or weekly_mortgage are missing
Create a summary table containing the mean, median, standard deviation, and the upper and lower quartiles of the weekly payment (rent and mortgage combined) for each region. What, if anything, can you infer about the distribution of this variable based on the table?
ehs_tidy_ex4 %>%
group_by(region) %>%
summarise(mean_payment = wtd.mean(weekly_total, weights = weighting,
na.rm = TRUE),
median_payment = wtd.quantile(weekly_total,
weights = weighting,
probs = .5, na.rm = TRUE),
sd_payment = sqrt(wtd.var(weekly_total, weights = weighting,
na.rm = TRUE)),
lq_payment = wtd.quantile(weekly_total, weights = weighting,
probs = .25, na.rm = TRUE),
uq_payment = quantile(weekly_total, weights = weighting,
probs = .75, na.rm = TRUE)) %>%
ungroup()ungroup
# A tibble: 9 × 6
region mean_payment median_payment sd_payment lq_payment uq_payment
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 East 182. 150 154. 104. 208.
2 East Midlands 151. 125. 117. 93 162.
3 London 283. 254. 223. 135 346.
4 North East 111. 96.9 52.6 81 115.
5 North West 124. 110. 68.7 85.2 138.
6 South East 210. 179. 149. 121. 245.
7 South West 153. 138. 92.0 97 183.
8 West Midlands 142. 122. 88.1 92 150
9 Yorkshire and th… 120. 107. 76.4 82 133.
There are big differences between most mean and medians across regions, indicating that the data are not normally distributed. If we use the approximate 95% range formula (mean \(\pm\) (2 \(\times\) sd)), we would get negative values for almost all regions. Negative payments do not make sense in this context, confirming that the data are not normally distributed.
In this case, the median and IQR should be give, not the mean and standard deviation.