You have been contracted by a healthcare consulting company to understand the factors on which the pricing of health insurance depends.
The data consists of a data frame with 1338 observations on the following 7 variables:
- price: response variable ($)
- age: quantitative variable
- sex: qualitative variable
- bmi: quantitative variable
- children: quantitative variable
- smoker: qualitative variable
- region: qualitative variable

To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().
pacman::p_load(magrittr, patchwork, ggplot2)
insurance <- read.csv("insurance.csv", header=TRUE)
insurance[c('sex', 'smoker', 'region')] %<>% lapply(factor)
str(insurance)
'data.frame': 1338 obs. of 7 variables:
$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
$ bmi : num 27.9 33.8 33 22.7 28.9 ...
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
$ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
$ price : num 16885 1726 4449 21984 3867 ...
Plot the response, price, against the three quantitative predictors age, bmi, and children, as 3 separate scatter plots. Describe the general trend (direction and form) of each plot.

gp <- ggplot(insurance, aes(y=price))
gscatter <- geom_point(alpha=.6, size=2, stroke=0)
ggrad <- scale_colour_gradient(low="#19a6c4", high="#f94c4c")
greg <- geom_smooth(method = lm, color="#6b0050")
sp1 <- gp+aes(x=age, colour=children)+gscatter+ggrad+greg+
ggtitle("Plot of Price v. Age")
sp2 <- gp+aes(x=bmi, colour=age)+gscatter+ggrad+greg+
ggtitle("Plot of Price v. BMI")
sp3 <- gp+aes(x=children, colour=bmi)+geom_jitter()+ggrad+
greg+ggtitle("Jitter Plot of Price v. Children")
sp1/sp2/sp3

ylog <- scale_y_log10()
sp1 <- sp1+ylog+ggtitle("Log-scaled Plot of Price v. Age")
sp2 <- sp2+ylog+ggtitle("Log-scaled Plot of Price v. BMI")
sp3 <- sp3+ylog+ggtitle("Log-scaled Jitter Plot of Price v. Children")
sp1/sp2/sp3

All three plots indicate a very weak positive relationship between the predictors and the price of insurance. Both age and children appear to have a hard lower price boundary (more noticeable in the log-scaled plots). This suggests there are mandatory minimum added costs associated with these two predictors.
# Age-Price Corr.
insurance %$%
  cor(price, age)
0.2990082
# BMI-Price Corr.
insurance %$%
  cor(price, bmi)
0.198341
# Children-Price Corr.
insurance %$%
  cor(price, children)
0.06799823
# Age*BMI-Price Corr.
insurance %$%
  cor(price, age*bmi)
0.3347539
# BMI*Children-Price Corr.
insurance %$%
  cor(price, bmi*children)
0.1025917
# Children*Age-Price Corr.
insurance %$%
  cor(price, children*age)
0.1319637
# Age-Price Corr. (log response)
insurance %$%
  cor(log(price), age)
0.527834
# BMI-Price Corr. (log response)
insurance %$%
  cor(log(price), bmi)
0.1326694
# Children-Price Corr. (log response)
insurance %$%
  cor(log(price), children)
0.1613363
As observed in the previous part, the correlations between the predictors and the response are very low, with children having an almost negligible correlation, which could be caused by the aforementioned minimum added costs.
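The repeated `%$% cor()` calls above can be collapsed into a single pass over the predictors. A minimal sketch, using a small synthetic stand-in for the data (with the real data, substitute the `insurance` data frame for `dat`):

```r
# Synthetic stand-in for the insurance data; columns mirror the
# quantitative predictors used above. Coefficients are illustrative only.
set.seed(42)
dat <- data.frame(
  age      = sample(18:64, 200, replace = TRUE),
  bmi      = runif(200, 16, 45),
  children = rpois(200, 1)
)
dat$price <- 3000 + 250 * dat$age + 300 * dat$bmi + rnorm(200, sd = 4000)

# One vectorized pass instead of three separate cor() calls
cors <- sapply(dat[c("age", "bmi", "children")], cor, y = dat$price)
print(round(cors, 4))
```

The same pattern extends to the log-response correlations by passing `y = log(dat$price)`.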
Plot the response, price, against the three qualitative predictors sex, smoker, and region using box plots. Based on these box plots, does there appear to be a relationship between these qualitative predictors and the response?

### The `factor` conversion was already implemented when the data was loaded
gviolin <- geom_violin(colour="white", fill="#f94c4c", width=1, alpha=0.3)
gbox <- geom_boxplot(colour="#19a6c4", width=0.25, fill="#19a6c4", size=0.8,
alpha=0.4, notchwidth=1, outlier.fill="#b0c9d9",
outlier.size=1)
bp1 <- gp+aes(x=sex)+gviolin+gbox+ggtitle("Price v. Sex")
bp2 <- gp+aes(x=smoker)+gviolin+gbox+ggtitle("Price v. Smoker")
bp3 <- gp+aes(x=region)+gviolin+gbox+ggtitle("Price v. Region")
(bp1+bp2)/bp3 + plot_annotation(title="Box Plots")
(bp1+ylog+bp2+ylog)/bp3+ylog + plot_annotation(title="Log-scaled Box Plots")

From the box plots alone, the average price appears to differ significantly only across the smoker predictor; we need further metrics to verify this. The densities (violin plots), however, show that region:southeast and sex:male have increased variance in the response compared to their counterparts (more noticeable in the log-scaled box plots). An increase in variance may lead to under-fitting in our models; this should be further examined by omitting region and sex during model creation and comparing the results.
Based on the analysis of predictors in parts b. and c., a simple multiple linear regression based on price~(smoker+age+bmi) should produce a model that explains a large portion of the variance in our response, although the extended analysis suggests log(price)~smoker*(age+bmi) should produce a significantly better model (see the third tab under Q3 b.).
Build a multiple linear regression model, named model1, using the response, price, and all 6 predictors, and then answer the questions that follow:
model1 <- lm(price~., insurance)
summary(model1)
Call:
lm(formula = price ~ ., data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11304.9 -2848.1 -982.1 1393.9 29992.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
age 256.9 11.9 21.587 < 2e-16 ***
sexmale -131.3 332.9 -0.394 0.693348
bmi 339.2 28.6 11.860 < 2e-16 ***
children 475.5 137.8 3.451 0.000577 ***
smokeryes 23848.5 413.1 57.723 < 2e-16 ***
regionnorthwest -353.0 476.3 -0.741 0.458769
regionsoutheast -1035.0 478.7 -2.162 0.030782 *
regionsouthwest -960.0 477.9 -2.009 0.044765 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
Here, model1 with \(R^2=0.7509\) shows that its predictors explain ~75.1% of the variance in the price response.
For \(\hat{y}=\beta_0+\beta_1x_1+...+\beta_nx_n\) regression where \(\hat{y}\) is the estimated response, \(\beta_0\), the intercept is the average value of \(y\) when all predictors \(x_i=0\), and \(\beta_i\) is the average change in \(y\) associated with one unit increase in \(x_i\):
Null Hypothesis: \(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
(i.e., none of the predictors, excluding the intercept, have explanatory power; the model estimates \(\hat{y}\) no better than the intercept-only model.)
Alternative Hypothesis: \(H_A: \text{at least one} \: \beta_i \neq 0, \text{ for } i=1,...,k\)
(i.e., at least one of the coefficients, excluding the intercept, contributes to a statistically better estimate of \(\hat{y}\) than the intercept-only model.)
\(F-\)statistic: \(F(8, 1329)=500.8\) (i.e., the mean square explained by the regression, \(MSR\), divided by the mean squared error of the residuals, \(MSE\).)
\(p-\)value: \(<4.9\mathrm{e}{-419} \approx 0 \quad\) (\(p-\)value is smaller than double precision \(\varepsilon\))
Since \(p-\)value is smaller than \(\alpha=0.05\), we reject the null hypothesis in favor of the alternative hypothesis.
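As a sanity check, the overall \(F-\)test \(p-\)value can be reconstructed directly from the statistic and degrees of freedom reported in the summary (a sketch using the rounded printed values):

```r
# Overall F-test of model1, reconstructed from the reported summary:
# F(8, 1329) = 500.8
f_stat   <- 500.8
df_model <- 8      # numerator df: number of non-intercept coefficients
df_resid <- 1329   # denominator df: n - k - 1 = 1338 - 8 - 1
p_val <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_val < 2.2e-16    # far below any conventional alpha
```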
summary(aov(price~region, insurance))
            Df    Sum Sq   Mean Sq F value Pr(>F)
region 3 1.301e+09 433586560 2.97 0.0309 *
Residuals 1334 1.948e+11 146007093
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since \(p-\)value is smaller than \(\alpha=0.05\), we reject the null hypothesis in favor of the alternative hypothesis: At least one of the qualitative values in region has response values with statistically significant different means than other qualitative values in this predictor.
Build a multiple linear regression model, named model2, using price as the response variable, and all variables except region as the predictors. Conduct a partial \(F-\)test comparing model2 with model1. What is the partial \(F-\)test \(p-\)value? Can we reject the null hypothesis that the regression coefficients for region variables are zero at an \(\alpha\)-level of 0.05?

model2 <- lm(price~.-region, insurance)
anova(model1, model2) %>% print
Analysis of Variance Table
Model 1: price ~ age + sex + bmi + children + smoker + region
Model 2: price ~ (age + sex + bmi + children + smoker + region) - region
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1329 4.8840e+10
2 1332 4.9073e+10 -3 -233431209 2.1173 0.09622 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
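The partial \(F-\)statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares (a sketch using the printed, rounded values, so the result matches the table only up to rounding):

```r
# Partial F-test of model2 (reduced) vs model1 (full), from the ANOVA table
rss_full    <- 4.8840e10   # RSS of model1, residual df = 1329
rss_reduced <- 4.9073e10   # RSS of model2, residual df = 1332
df_diff     <- 1332 - 1329 # the 3 region coefficients that were dropped

f_partial <- ((rss_reduced - rss_full) / df_diff) / (rss_full / 1329)
p_partial <- pf(f_partial, df_diff, 1329, lower.tail = FALSE)
round(c(F = f_partial, p = p_partial), 4)  # ~2.12 and ~0.096
```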
model2 Summary:
summary(model2)
Call:
lm(formula = price ~ . - region, data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11837.2 -2916.7 -994.2 1375.3 29565.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12052.46 951.26 -12.670 < 2e-16 ***
age 257.73 11.90 21.651 < 2e-16 ***
sexmale -128.64 333.36 -0.386 0.699641
bmi 322.36 27.42 11.757 < 2e-16 ***
children 474.41 137.86 3.441 0.000597 ***
smokeryes 23823.39 412.52 57.750 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6070 on 1332 degrees of freedom
Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
F-statistic: 798 on 5 and 1332 DF, p-value: < 2.2e-16
model3

Based on our previous findings, here is a better model that spends fewer coefficients than model1 yet significantly outperforms both previous models by every metric.
model3 <- lm(log(price) ~ smoker*(children+age+bmi), insurance)
summary(model3)
Call:
lm(formula = log(price) ~ smoker * (children + age + bmi), data = insurance)
Residuals:
Min 1Q Median 3Q Max
-0.74310 -0.15339 -0.06160 0.01018 2.26709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0420361 0.0675207 104.294 < 2e-16 ***
smokeryes 1.4242587 0.1482274 9.609 < 2e-16 ***
children 0.1276748 0.0097962 13.033 < 2e-16 ***
age 0.0416997 0.0008536 48.851 < 2e-16 ***
bmi -0.0011534 0.0019886 -0.580 0.562
smokeryes:children -0.1196447 0.0226370 -5.285 1.46e-07 ***
smokeryes:age -0.0327043 0.0019012 -17.202 < 2e-16 ***
smokeryes:bmi 0.0494274 0.0042282 11.690 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3888 on 1330 degrees of freedom
Multiple R-squared: 0.8222, Adjusted R-squared: 0.8212
F-statistic: 878.3 on 7 and 1330 DF, p-value: < 2.2e-16
|  | model1 | model2 | model3 |
|---|---|---|---|
| R2 | 0.75091 | 0.74972 | 0.82215 |
| adj.R2 | 0.74941 | 0.74878 | 0.82122 |
| sigma | 6062.1 | 6069.7 | 0.3888 |
| f.stat | 500.81 | 798.02 | 878.33 |
| p.value | 0 | 0 | 0 |
| logLik | -13548 | -13551 | -630.54 |
| AIC | 27116 | 27116 | 1279.1 |
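The AIC row follows directly from the logLik row, since \(\mathrm{AIC} = -2\log L + 2k\), where \(k\) counts the estimated parameters (the regression coefficients plus the error variance). A quick check against the table, using its rounded log-likelihoods:

```r
# AIC = -2*logLik + 2*k, with k = number of coefficients + 1 (error variance)
aic_model1 <- -2 * (-13548)  + 2 * (9 + 1)  # 9 coefficients incl. intercept
aic_model2 <- -2 * (-13551)  + 2 * (6 + 1)  # 6 coefficients incl. intercept
aic_model3 <- -2 * (-630.54) + 2 * (8 + 1)  # 8 coefficients incl. intercept
c(model1 = aic_model1, model2 = aic_model2, model3 = aic_model3)
```

All three reproduce the table's AIC values (27116, 27116, ~1279.1), which explains why model1 and model2 tie on AIC despite model2's lower likelihood: it saves three parameters.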
Since \(p-\)value is larger than \(\alpha=0.05\), we fail to reject the null hypothesis: Predictor region does not have a statistically significant impact on \(\hat{y}\) in our models.
In part a. we concluded that the response for at least one of the levels in region had a statistically different mean from the rest, but in conjunction with part b. we conclude that the statistically different means produced by one or more levels of region are not enough to improve the model significantly. (i.e., there is a statistically significant marginal relationship between region and the response, but the conditional relationship between region and price, in association with the other predictors, is not statistically significant.) It is worth noting that if we identified the statistically differing region(s) and encoded the entries as region:(unimportant=0, important=1), we would get partial \(F-\)test results more in line with a. (but without any actual improvement to our model, only reducing the DF spent without changing the RSS.)
Note: Please use model1 for all of the following questions.
Interpret the estimated coefficient of sexmale in the context of the problem. Make sure female is the baseline level for sex. Mention any assumptions you make about other predictors clearly when stating the interpretation.

Considering sex:female as the baseline, entries with sexmale = 0 are female; individuals with sexmale = 1 are estimated to pay $131.30 less on average in price, holding all other predictors constant.
If the bmi in model1 is increased by 0.01 and the other predictors are kept constant, what change in the response would be expected?

With a bmi coefficient of 339.2, we'd expect a ~$3.39 increase in price, given all other predictors are kept constant.
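The expected change is simply the coefficient scaled by the increment, since each coefficient is the average change in response per one-unit increase in its predictor:

```r
# Expected change in price for a 0.01 increase in bmi, other predictors fixed
bmi_coef <- 339.2        # bmi coefficient from the model1 summary
delta    <- 0.01 * bmi_coef
delta                    # ~3.39 dollars
```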
Compute 90% and 95% confidence intervals for the coefficient of age in model1. What observations can you make about the width of these intervals?

# 90% Confidence Interval
model1 %>%
  confint("age", level=.9)

|     | 5 %      | 95 %     |
|---|---|---|
| age | 237.2708 | 276.4419 |
# 95% Confidence Interval
model1 %>%
  confint("age", level=.95)

|     | 2.5 %    | 97.5 %   |
|---|---|---|
| age | 233.5138 | 280.1989 |
As a rule, an increase in confidence level always requires a widening of the confidence interval. As expected, this is observed in this case as well.
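These intervals can be reproduced (up to rounding of the summary output) from the point estimate and standard error of age, since a coefficient CI is just estimate \(\pm\: t_{df,\,1-\alpha/2} \cdot SE\):

```r
# CI for the age coefficient from the rounded summary(model1) values
est      <- 256.9   # point estimate
se       <- 11.9    # standard error
df_resid <- 1329    # residual degrees of freedom

ci <- function(level) est + c(-1, 1) * qt(1 - (1 - level) / 2, df_resid) * se
round(ci(0.90), 2)   # ~[237.31, 276.49]
round(ci(0.95), 2)   # ~[233.56, 280.24]
```

The 95% interval is wider only because its \(t\) multiplier is larger (~1.96 vs ~1.65).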
Using model1, estimate the average price for all insurance policies with the same characteristics as the first data point in the sample. What is the 95% confidence interval? Provide an interpretation of your results.

# 95% Confidence Interval
model1 %>%
  predict(insurance[1,-7], level=.95,
          interval='confidence')

| fit      | lwr      | upr      |
|---|---|---|
| 25293.71 | 24143.98 | 26443.44 |
The mean estimated price for any data point identical to the first data point is 25293.71, with 95% confidence that the population mean response for such data points falls within the [24143.98, 26443.44] interval.
Suppose the age value for the first data point is increased to 50, while all other values are kept fixed. Using model1, predict the price of an insurance policy with these characteristics. What is the 95% prediction interval? Provide an interpretation of your results.

# 95% Prediction Interval
insurance[1, 'age'] <- 50   # update only the first row, not the whole column
model1 %>%
  predict(insurance[1,-7], level=.95,
          interval='prediction')

| fit      | lwr      | upr      |
|---|---|---|
| 33256.26 | 21313.29 | 45199.23 |
The predicted price for a data point identical to the first, with the exception of age = 50, is 33256.26, with 95% confidence that any single new such observation would have a response within the [21313.29, 45199.23] range.
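The prediction interval is much wider than the earlier confidence interval because its half-width adds the residual variance that a confidence interval omits: \(\text{PI half-width} = \sqrt{h^2 + (t\,\hat{\sigma})^2}\), where \(h\) is the confidence half-width at the same point. A rough check, backing out the implied \(h\) from the reported interval and the rounded residual standard error:

```r
# PI half-width^2 = CI half-width^2 + (t * sigma)^2; solve for the CI part.
pi_half <- (45199.23 - 21313.29) / 2   # from the reported prediction interval
sigma   <- 6062                        # residual std. error of model1 (rounded)
t_crit  <- qt(0.975, 1329)

ci_half <- sqrt(pi_half^2 - (t_crit * sigma)^2)
ci_half   # implied confidence half-width (~1100), small relative to pi_half
```

Almost all of the prediction interval's width comes from the \(t\hat{\sigma}\) term, i.e., from individual-level noise rather than uncertainty about the mean.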
The raw Rmarkdown (.Rmd) file for this report is available upon request. Please contact me at dan@sadatian.io.