Linear Regression of Clusters

We have all seen a scatter plot that shows a clear linear relationship. I recently came across financial data that, when plotted, appeared to follow several distinct lines. To illustrate a method that fits multiple lines automatically, without explicit labels, I will use the Insurance Forecast data. Plotting the dollars charged to a patient against the patient's age suggests that there may be three separate linear relationships.

Features beyond age may explain the three groupings. In this case, those features are smoking status and BMI. However, assume your data engineer didn't track them. This is a common occurrence, since tracking every piece of data under the sun can result in an unnecessarily burdensome technology stack.

If you wanted to model the relationship between age and charges, one solution would be a two-step process: 1) identify the latent group label via clustering, and then 2) perform a regression separately on each cluster. A more powerful solution, probabilistic programming, combines these into one step using pymc3.

Model description:

Model: p(y | x) = \pi_1 N(y | \mu_1, \sigma) + \pi_2 N(y | \mu_2, \sigma) + \pi_3 N(y | \mu_3, \sigma) where x is age, y is charges, and:
\mu_k = \alpha_k + \beta_k x
(\pi_1, \pi_2, \pi_3) \sim Dir(a) mixing coefficients, with concentration parameter a (not to be confused with the intercepts \alpha_k)
\alpha_1, \alpha_2, \alpha_3 \sim N(10^4,10^5) intercept
\beta_1, \beta_2, \beta_3 \sim N(10^3, 10^4) slope
\sigma \sim HalfNormal(\sigma=10^4) error
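
To make the model concrete, here is a minimal NumPy sketch of the log-likelihood it implies, reading the model as a mixture in which each patient's charge is drawn around one of the three lines with probability \pi_k. This is my interpretation rather than code from the original fit, and the function name `mixture_loglik` and its argument names are made up for illustration. A sampler such as pymc3 would combine a quantity like this with the priors above.

```python
import numpy as np

def mixture_loglik(x, y, weights, alpha, beta, sigma):
    """Log-likelihood of y given x under a mixture of K linear regressions.

    weights, alpha, beta are length-K sequences (pi_k, alpha_k, beta_k);
    sigma is a single shared noise scale, as in the model description.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.asarray(weights, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)

    # mu[i, k] = alpha_k + beta_k * x_i : the k-th line evaluated at x_i
    mu = alpha[None, :] + beta[None, :] * x[:, None]

    # log N(y_i | mu[i, k], sigma) for every observation and component
    log_norm = (-0.5 * np.log(2.0 * np.pi * sigma**2)
                - 0.5 * ((y[:, None] - mu) / sigma) ** 2)

    # log sum_k pi_k N(...), computed with log-sum-exp for numerical stability
    a = np.log(weights)[None, :] + log_norm
    m = a.max(axis=1, keepdims=True)
    per_point = m[:, 0] + np.log(np.exp(a - m).sum(axis=1))
    return float(per_point.sum())
```

Note that no cluster label appears anywhere: the sum over components marginalizes the unknown label, which is what lets the clustering and regression steps collapse into one.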

Fitting this model, assuming three lines, yields:

Adding labels: As you can see, this model fits three lines, each with its own intercept and slope. In fact, the lines correspond to a clear grouping of patients by smoking status and body mass index.

For the three lines, there are two distinct slopes: $270 per year of age for non-smokers and $284 per year of age for smokers. The intercept, which can be interpreted as a fixed charge per customer, is about $19k for smokers, but this number is a bit deceptive. Looking at the marginal distribution of alpha for smokers (both types: high and low BMI), there are two modes: $10k and $30k. This is because the model mixes these two populations. One can assume $10k is for the lower-BMI group and $30k for the higher-BMI group. The distribution of the population (i.e., \pi) is about 72% non-smokers, with low-BMI and high-BMI smokers at about 14% each.

Finally, one of the most powerful aspects of this model is that we can generate samples from the posterior distribution, since we have a full description of the model's behavior. Below is a plot of the original data (black) along with the generated data (blue).
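
For readers who want to see what generating data from the posterior involves, here is a rough NumPy sketch. It assumes you already have arrays of posterior draws for \pi, \alpha, \beta and \sigma (for example, extracted from a fitted trace). The function `sample_posterior_predictive` and its argument names are hypothetical, not part of pymc3, and this is not necessarily how the plot above was produced.

```python
import numpy as np

def sample_posterior_predictive(x, weights, alpha, beta, sigma, rng=None):
    """Simulate one charge per age in x from posterior draws.

    weights, alpha, beta: arrays of shape (n_draws, K), one row per posterior draw.
    sigma: array of shape (n_draws,).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n_draws, n_comp = weights.shape

    # 1) pick a random posterior draw for each simulated patient
    s = rng.integers(n_draws, size=x.shape[0])

    # 2) pick a line k with probability pi_k (inverse-CDF sampling)
    cdf = np.cumsum(weights[s], axis=1)
    u = rng.random(x.shape[0]) * cdf[:, -1]
    k = np.minimum((u[:, None] >= cdf).sum(axis=1), n_comp - 1)

    # 3) evaluate that line at x and add Gaussian noise with scale sigma
    mu = alpha[s, k] + beta[s, k] * x
    return rng.normal(mu, sigma[s])
```

Calling it with the observed ages as `x` gives one simulated charge per patient, which can then be overlaid on the original scatter plot.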