Background

You have been contracted by a healthcare consulting company to understand the factors on which the pricing of health insurance depends.

Data Description

The data consists of a data frame with 1338 observations on the following 7 variables:

  1. price: Response variable ($)
  2. age: Quantitative variable
  3. sex: Qualitative variable
  4. bmi: Quantitative variable
  5. children: Quantitative variable
  6. smoker: Qualitative variable
  7. region: Qualitative variable

Instructions on reading the data

To read the data in R, save the file in your working directory (changing the directory first if it differs from the current R working directory) and read the data using the R function read.csv().

pacman::p_load(magrittr, patchwork, ggplot2)  # load (and install if missing) the required packages
insurance <- read.csv("insurance.csv", header=TRUE)
insurance[c('sex', 'smoker', 'region')] %<>% lapply(factor)  # convert the qualitative variables to factors
str(insurance)
'data.frame':   1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
 $ price   : num  16885 1726 4449 21984 3867 ...

Part 1: Exploratory Data Analysis

  a. Create scatter plots of the response, price, against the three quantitative predictors age, bmi, and children. Describe the general trend (direction and form) of each plot. These should be 3 separate scatter plots.

Scatter Plots
gp <- ggplot(insurance, aes(y=price))                          # shared base plot
gscatter <- geom_point(alpha=.6, size=2, stroke=0)             # common point layer
ggrad <- scale_colour_gradient(low="#19a6c4", high="#f94c4c")  # shared colour gradient
greg <- geom_smooth(method="lm", color="#6b0050")              # linear-fit overlay

sp1 <- gp+aes(x=age, colour=children)+gscatter+ggrad+greg+
  ggtitle("Plot of Price v. Age") 
sp2 <- gp+aes(x=bmi, colour=age)+gscatter+ggrad+greg+
  ggtitle("Plot of Price v. BMI")
sp3 <- gp+aes(x=children, colour=bmi)+geom_jitter()+ggrad+
  greg+ggtitle("Jitter Plot of Price v. Children")
  
sp1/sp2/sp3

Log-Scaled
ylog <- scale_y_log10()

sp1 <- sp1+ylog+ggtitle("Log-scaled Plot of Price v. Age") 
sp2 <- sp2+ylog+ggtitle("Log-scaled Plot of Price v. BMI")
sp3 <- sp3+ylog+ggtitle("Log-scaled Jitter Plot of Price v. Children")
  
sp1/sp2/sp3

All three plots indicate a very weak positive relationship between the predictors and the price of insurance. Both age and children appear to have a hard lower price boundary (more noticeable in the log-scaled plots), which suggests that there are mandatory minimum added costs associated with these two predictors.
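One way to probe the lower-boundary claim is to look at the minimum observed price at each age; if a mandatory minimum cost exists, these minima should rise steadily with age. A minimal sketch, assuming insurance is loaded as above:

# Sketch: minimum observed price at each age; a rising floor supports
# the "mandatory minimum cost" reading of the scatter plots
insurance %$%
  tapply(price, age, min) %>%
  head(10)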

  b. What is the value of the correlation coefficient for each of the above pairs of response and predictor variables? What does it tell you about your comments in part (a)?

Correlations
# Age-Price Corr.
insurance %$%
  cor(price, age)
0.2990082
# BMI-Price Corr.
insurance %$%
  cor(price, bmi)
0.198341
# Children-Price Corr.
insurance %$%
  cor(price, children)
0.06799823
Cross-Correlations
# Age*BMI-Price Corr.
insurance %$%
  cor(price, age*bmi)
0.3347539
# BMI*Children-Price Corr.
insurance %$%
  cor(price, bmi*children)
0.1025917
# Children*Age-Price Corr.
insurance %$%
  cor(price, children*age)
0.1319637
\(\log{y}\:\) Correlations
# Age-Price Corr.
insurance %$%
  cor(log(price), age)
0.527834
# BMI-Price Corr.
insurance %$%
  cor(log(price), bmi)
0.1326694
# Children-Price Corr.
insurance %$%
  cor(log(price), children)
0.1613363

As observed in the previous part, the correlations between the predictors and the response are very low, with children having an almost negligible correlation, which could be caused by the aforementioned minimum added costs. Log-scaling the response raises the correlation with age considerably (0.53), consistent with the log-scaled plots.
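The individual calls above can also be collapsed into a single pass; a minimal sketch using base R's sapply:

# Sketch: all three predictor-price correlations in one call
sapply(insurance[c("age", "bmi", "children")], cor, y=insurance$price)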

  c. Create box plots of the response, price, against the three qualitative predictors sex, smoker, and region. Based on these box plots, does there appear to be a relationship between these qualitative predictors and the response?

Box Plots
### Already implemented `factor` conversion when the data was loaded
gviolin <- geom_violin(colour="white", fill="#f94c4c", width=1, alpha=0.3)
gbox <-  geom_boxplot(colour="#19a6c4", width=0.25, fill="#19a6c4", size=0.8,
                      alpha=0.4, notchwidth=1, outlier.fill="#b0c9d9",
                      outlier.size=1)

bp1 <- gp+aes(x=sex)+gviolin+gbox+ggtitle("Price v. Sex")
bp2 <- gp+aes(x=smoker)+gviolin+gbox+ggtitle("Price v. Smoker")
bp3 <- gp+aes(x=region)+gviolin+gbox+ggtitle("Price v. Region")

(bp1+bp2)/bp3 +plot_annotation(title="Box Plots")

Log-Scaled
(bp1+ylog+bp2+ylog)/bp3+ylog +plot_annotation(title="Log-scaled Box Plots")

From the box plots alone, the average price appears to differ significantly only across the smoker predictor (further metrics are needed to verify this), but the densities (violin plots) show that region:southeast and sex:male have increased variance in the response compared to their counterparts (more noticeable in the log-scaled box plots). An increase in variance may lead to under-fitting in our models; this should be examined further by omitting region and sex during model creation and comparing the results.
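To put numbers on the spread differences seen in the violins, the group standard deviations of the log response can be compared directly; a minimal sketch:

# Sketch: group-wise spread of log(price), quantifying the violin-plot claims
insurance %$% tapply(log(price), region, sd)
insurance %$% tapply(log(price), sex, sd)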

  d. Based on the analysis above, does it make sense to run a multiple linear regression with all of the predictors?

Based on the analysis of predictors in parts b. and c., a simple multiple linear regression based on price~(smoker+age+bmi) should produce a model that explains a large portion of the variance in our response, although the extended analysis suggests that log(price)~smoker*(age+bmi) should produce a significantly better model (see the "A better model3" tab under Part 3 b.).
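As a quick check of this claim ahead of Part 2, the proposed reduced model can be fitted and its \(R^2\) extracted (output omitted here); a minimal sketch:

# Sketch: R-squared of the reduced candidate model proposed above
summary(lm(price ~ smoker + age + bmi, insurance))$r.squared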

Part 2: Fitting the Multiple Linear Regression Model

Build a multiple linear regression model, named model1, using the response, price, and all 6 predictors, and then answer the questions that follow:

  a. Report the coefficient of determination (R-squared) for the model and give a concise interpretation of this value.
model1 <- lm(price~., insurance)
summary(model1)

Call:
lm(formula = price ~ ., data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11304.9  -2848.1   -982.1   1393.9  29992.8 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
age                256.9       11.9  21.587  < 2e-16 ***
sexmale           -131.3      332.9  -0.394 0.693348    
bmi                339.2       28.6  11.860  < 2e-16 ***
children           475.5      137.8   3.451 0.000577 ***
smokeryes        23848.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -353.0      476.3  -0.741 0.458769    
regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Here, model1 with \(R^2=0.7509\) shows that its predictors explain ~75.1% of the variance in the price response.
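For reference, both values can be pulled out of the summary object directly; a minimal sketch:

# Sketch: extract R-squared and adjusted R-squared from the fit
summary(model1)$r.squared      # 0.7509
summary(model1)$adj.r.squared  # 0.7494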

  b. Is the model of any use in predicting price? Using \(\alpha=0.05\), provide the following elements of the test of overall regression of the model: null hypothesis \(H_0\), alternative hypothesis \(H_a\), \(F-\)statistic or \(p-\)value, and conclusion.

For the regression \(\hat{y}=\beta_0+\beta_1x_1+...+\beta_kx_k\), where \(\hat{y}\) is the estimated response, \(\beta_0\), the intercept, is the average value of \(y\) when all predictors \(x_i=0\), and \(\beta_i\) is the average change in \(y\) associated with a one-unit increase in \(x_i\):
Null Hypothesis: \(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
(i.e., none of the predictors carries information about the response; the model fits no better than the intercept-only model.)
Alternative Hypothesis: \(H_A: \text{at least one} \: \beta_i \neq 0, \text{ for } i=1,...,k\)
(i.e., at least one of the coefficients, excluding the intercept, contributes statistically significant information to the estimate \(\hat{y}\).)
\(F-\)statistic: \(F(8, 1329)=500.8\) (i.e., the mean square for regression, \(SSR/k\), divided by the mean squared error, \(SSE/(n-k-1)\).)
\(p-\)value: \(<4.9\mathrm{e}{-324} \approx 0\) (smaller than the smallest positive double-precision number, which is why R reports it only as \(<2.2\mathrm{e}{-16}\).)
Since the \(p-\)value is smaller than \(\alpha=0.05\), we reject the null hypothesis in favor of the alternative hypothesis: the overall regression is significant, and the model is of use in predicting price.
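The underflow claim can be checked directly: the \(p-\)value itself underflows to zero in double precision, but its logarithm is still computable. A minimal sketch:

# Sketch: the overall-F p-value underflows to 0, but its log does not
fstat <- summary(model1)$fstatistic  # named vector: value, numdf, dendf
pf(fstat[1], fstat[2], fstat[3], lower.tail=FALSE)              # 0 (underflow)
pf(fstat[1], fstat[2], fstat[3], lower.tail=FALSE, log.p=TRUE)  # large negative log-probability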

Part 3: Model Comparison

  a. Assuming a marginal relationship between region and price, perform an ANOVA \(F-\)test on the mean insurance prices among the different regions. Using an \(\alpha\)-level of 0.05, can we reject the null hypothesis that the means of the regions are equal? Please interpret.
summary(aov(price~region, insurance))
              Df    Sum Sq   Mean Sq F value Pr(>F)  
region         3 1.301e+09 433586560    2.97 0.0309 *
Residuals   1334 1.948e+11 146007093                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the \(p-\)value (0.0309) is smaller than \(\alpha=0.05\), we reject the null hypothesis in favor of the alternative hypothesis:
At least one level of region has a mean response that is statistically significantly different from the other levels of this predictor.
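A standard follow-up to a significant one-way ANOVA is a post-hoc pairwise comparison to locate which regions differ; a minimal sketch using Tukey's HSD:

# Sketch: pairwise region mean comparisons with family-wise error control
TukeyHSD(aov(price~region, insurance))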

  b. Now, build a second multiple linear regression model, called model2, using price as the response variable, and all variables except region as the predictors. Conduct a partial \(F-\)test comparing model2 with model1. What is the partial-F test p-value? Can we reject the null hypothesis that the regression coefficients for region variables are zero at an \(\alpha\)-level of 0.05?

\(F-\)test
model2 <- lm(price~.-region, insurance)
anova(model1, model2) %>% print
Analysis of Variance Table

Model 1: price ~ age + sex + bmi + children + smoker + region
Model 2: price ~ (age + sex + bmi + children + smoker + region) - region
  Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
1   1329 4.8840e+10                               
2   1332 4.9073e+10 -3 -233431209 2.1173 0.09622 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2 Summary
summary(model2)

Call:
lm(formula = price ~ . - region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11837.2  -2916.7   -994.2   1375.3  29565.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -12052.46     951.26 -12.670  < 2e-16 ***
age            257.73      11.90  21.651  < 2e-16 ***
sexmale       -128.64     333.36  -0.386 0.699641    
bmi            322.36      27.42  11.757  < 2e-16 ***
children       474.41     137.86   3.441 0.000597 ***
smokeryes    23823.39     412.52  57.750  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6070 on 1332 degrees of freedom
Multiple R-squared:  0.7497,    Adjusted R-squared:  0.7488 
F-statistic:   798 on 5 and 1332 DF,  p-value: < 2.2e-16
A better model3

Here is a better model, based on our previous findings, that uses fewer degrees of freedom than model1 yet significantly outperforms both models by every metric.

model3 <- lm(log(price) ~ smoker*(children+age+bmi), insurance)
summary(model3)

Call:
lm(formula = log(price) ~ smoker * (children + age + bmi), data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74310 -0.15339 -0.06160  0.01018  2.26709 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         7.0420361  0.0675207 104.294  < 2e-16 ***
smokeryes           1.4242587  0.1482274   9.609  < 2e-16 ***
children            0.1276748  0.0097962  13.033  < 2e-16 ***
age                 0.0416997  0.0008536  48.851  < 2e-16 ***
bmi                -0.0011534  0.0019886  -0.580    0.562    
smokeryes:children -0.1196447  0.0226370  -5.285 1.46e-07 ***
smokeryes:age      -0.0327043  0.0019012 -17.202  < 2e-16 ***
smokeryes:bmi       0.0494274  0.0042282  11.690  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3888 on 1330 degrees of freedom
Multiple R-squared:  0.8222,    Adjusted R-squared:  0.8212 
F-statistic: 878.3 on 7 and 1330 DF,  p-value: < 2.2e-16
|         |  model1 |  model2 |  model3 |
|---------|---------|---------|---------|
| R2      | 0.75091 | 0.74972 | 0.82215 |
| adj.R2  | 0.74941 | 0.74878 | 0.82122 |
| sigma   |  6062.1 |  6069.7 |  0.3888 |
| f.stat  |  500.81 |  798.02 |  878.33 |
| p.value |       0 |       0 |       0 |
| logLik  |  -13548 |  -13551 | -630.54 |
| AIC     |   27116 |   27116 |  1279.1 |
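A comparison table like the one above can be generated programmatically; a minimal sketch, assuming the broom, dplyr, and purrr packages are available:

# Sketch: tabulate fit metrics for the three models side by side
pacman::p_load(broom, dplyr, purrr)
list(model1=model1, model2=model2, model3=model3) %>%
  map_dfr(glance, .id="model") %>%
  select(model, r.squared, adj.r.squared, sigma, statistic, p.value, logLik, AIC)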

Since the partial \(F-\)test \(p-\)value (0.09622) is larger than \(\alpha=0.05\), we fail to reject the null hypothesis that the regression coefficients for the region variables are zero:
The predictor region does not contribute statistically significant information about \(\hat{y}\) once the other predictors are in the model.
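As a sanity check, the partial \(F-\)statistic and its \(p-\)value can be recomputed by hand from the two residual sums of squares; a minimal sketch:

# Sketch: partial F-test by hand; should reproduce the anova() row above
rss1 <- sum(resid(model1)^2)              # full model,    1329 residual df
rss2 <- sum(resid(model2)^2)              # reduced model, 1332 residual df
f <- ((rss2 - rss1) / 3) / (rss1 / 1329)  # 3 region coefficients dropped
pf(f, 3, 1329, lower.tail=FALSE)          # ~0.096, matching the table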

  c. What can you conclude from 3a and 3b? Do they provide the exact same results?

In part a. we concluded that at least one level of region has a mean response statistically different from the rest, but in conjunction with part b. we conclude that this difference in means is not enough to improve the model significantly. The two tests therefore do not provide the exact same results: there is a statistically significant marginal relationship between region and the response, but the conditional relationship of region with price, given the other predictors, is not statistically significant.
It is worth noting that if we identified the statistically differing region(s) and encoded the entries as region:(unimportant=0, important=1), we would get partial \(F-\)test results more in line with part a. (but without any actual improvement to our model: it would only reduce the degrees of freedom used without meaningfully changing the RSS.)

Note: Please use model1 for all of the following questions.

Part 4: Coefficient Interpretation

  a. Interpret the estimated coefficient of sexmale in the context of the problem. Make sure female is the baseline level for sex. Mention any assumptions you make about other predictors clearly when stating the interpretation.

Considering sex:female as the baseline, entries with sexmale = 0 are female, and individuals with sexmale = 1 are estimated to pay $131.30 less on average in price, assuming all other predictors are held constant. Note, however, that this coefficient is not statistically significant (p = 0.693), so the data provide little evidence of a real difference between the sexes.
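The baseline can be confirmed (and, if needed, forced) explicitly; a minimal sketch:

# Sketch: confirm "female" is the reference level of sex
levels(insurance$sex)                                  # "female" "male"
insurance$sex <- relevel(insurance$sex, ref="female")  # a no-op here; forces the baseline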

  b. If the value of the bmi in model1 is increased by 0.01 and the other predictors are kept constant, what change in the response would be expected?

With a bmi coefficient of 339.2, a 0.01 increase in bmi corresponds to an expected increase of \(339.2 \times 0.01 \approx \$3.39\) in price, given that all other predictors are kept constant.
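The same figure can be read straight off the fitted coefficients; a minimal sketch:

# Sketch: expected price change for a 0.01 increase in bmi
coef(model1)["bmi"] * 0.01  # ~3.39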

Part 5: Confidence and Prediction Intervals

  a. Compute 90% and 95% confidence intervals (CIs) for the parameter associated with age for model1. What observations can you make about the width of these intervals?
# 90% Confidence Interval
model1 %>%
  confint("age", level=.9)
5 % 95 %
age 237.2708 276.4419
# 95% Confidence Interval
model1 %>%
  confint("age", level=.95)
2.5 % 97.5 %
age 233.5138 280.1989

As a rule, an increase in confidence level always requires a widening of the confidence interval. As expected, this is observed here: the 95% interval (width ≈ 46.7) is wider than the 90% interval (width ≈ 39.2).
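The widths themselves confirm this; a minimal sketch:

# Sketch: widths of the two intervals for the age coefficient
diff(confint(model1, "age", level=.90)[1,])  # ~39.2
diff(confint(model1, "age", level=.95)[1,])  # ~46.7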

  b. Using model1, estimate the average price for all insurance policies with the same characteristics as the first data point in the sample. What is the 95% confidence interval? Provide an interpretation of your results.
# 95% Confidence Interval
model1 %>% 
  predict(insurance[1,-7], level=.95,
          interval='confidence')
fit lwr upr
25293.71 24143.98 26443.44

The estimated mean price for all policies with the same characteristics as the first data point is 25293.71, and we are 95% confident that the true mean price for such policies falls within the [24143.98, 26443.44] interval.

  c. Suppose that the age value for the first data point is increased to 50, while all other values are kept fixed. Using model1, predict the price of an insurance policy with these characteristics. What is the 95% prediction interval? Provide an interpretation of your results.
# 95% Prediction Interval
newpoint <- insurance[1,-7]  # first observation, response column dropped
newpoint['age'] <- 50        # change only age; `insurance` itself stays intact
model1 %>% 
  predict(newpoint, level=.95, 
          interval='prediction')
fit lwr upr
33256.26 21313.29 45199.23

The predicted price for a policy identical to the first data point, except with age = 50, is 33256.26, with 95% confidence that the response of any single new such data point would fall within the [21313.29, 45199.23] range. The prediction interval is much wider than the confidence interval in part b. because it accounts for the variability of individual responses around the mean, not just the uncertainty in the estimated mean itself.

The raw Rmarkdown (.Rmd) file for this report is available upon request. Please contact me at dan@sadatian.io.