# Spreadsheet data analysis: four approaches

## Segmentation, clustering, scatter plots, and linear regression

Richard Li

#data #exploration #business intelligence

You’ve spent countless hours collecting data, gathering every piece of information you think might be valuable. Now, you need to sift through all this data to find what’s truly relevant. One of the enduring challenges with data is that most of it isn’t relevant to your immediate needs, yet identifying the irrelevant parts isn’t always straightforward.

In this article, I’ll delve into four basic techniques you can employ in spreadsheet software for effective data exploration: segmentation, clustering, scatter plots, and linear regression. Data exploration is a critical, yet often overlooked, step in making your data truly actionable. By highlighting what data is relevant and what isn’t, data exploration can identify new opportunities and focus your efforts.

The initial phase of data exploration involves defining what we mean by “relevance.” **Relevance is determined by identifying a metric within your dataset that you aim to alter or influence.** This metric serves as the focal point for your analysis, guiding your exploration towards actionable insights.

Fortunately, defining relevance is relatively straightforward. It’s the key metric in your dataset that you’re interested in changing or understanding better. For a business, this might be revenue; for environmental studies, it could be CO2 levels; and in sports analytics, it might be the number of runs scored.

If you’re statistically minded, you’ve probably realized that by framing “relevance” in terms of a metric you want to alter or influence, we’ve simplified the question of data exploration to a regression analysis problem. If so, congratulations! If you’re not familiar with regression analysis, or not comfortable with it, don’t worry.

We’re going to start with the Insurance Customer Lifetime Value dataset, which I had previously analyzed using some more advanced techniques. I think it’s instructive to start with some basic spreadsheet analysis to better understand the advantages and limitations of simple approaches. All things being equal, simple is better, as you’ll have a much better chance of explaining your results to others if you can show them a simple analysis.

The insurance CLTV dataset contains approximately 80,000 insurance customers, some demographic information (e.g., gender, education), and their lifetime value to the business. We’ll walk through an analysis of this data in a spreadsheet to identify what factors drive LTV. In the course of this analysis, we’ll also discover why LTV is not a great metric for insurance customers! The actual spreadsheet is available here.

We’ll start with the basic technique of segmentation. For each gender, we can use the `COUNTIF` function to count the number of Males and Females. For the average LTV, we use `AVERAGEIF` and create a formula like `=AVERAGEIF(CLTV!B:B, "Male", CLTV!M:M)`, which averages the LTV over all rows that match a given gender.

| Gender | Count | Average LTV |
|---|---|---|
| Male | 49942 | 97,826 |
| Female | 38450 | 98,102 |

The results show that men and women have more-or-less the same LTV. The simplicity of the calculation belies its power, which is its explainability. Imagine we had discovered women had a 30% higher LTV than men: that would suggest marketing to women as a plausible strategy for increasing LTV.
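If you later move this analysis out of a spreadsheet, the same segmentation becomes a one-liner in pandas. Here’s a minimal sketch using a hypothetical stand-in for the CLTV sheet (the column names `Gender` and `LTV` and the values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the CLTV sheet: a gender column and an LTV column.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "LTV": [90_000, 95_000, 105_000, 101_000, 98_000],
})

# groupby + agg reproduces COUNTIF / AVERAGEIF in one step:
# a row count and a mean LTV per gender.
summary = df.groupby("Gender")["LTV"].agg(count="count", avg_ltv="mean")
print(summary)
```

One groupby-then-aggregate expression covers every segment at once, instead of one `COUNTIF`/`AVERAGEIF` pair per segment.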

Segmentation works well for categorical data, but what about claim amounts? For numerical values like these, we can start by calculating the Pearson coefficient, which measures how well claim amounts correlate with LTV. (This is something we can’t do easily for non-numerical values.) The formula `=CORREL(CLTV!M2:M80000,CLTV!H2:H80000)` gives us a Pearson coefficient of 0.18, suggesting that there is not a strong correlation. We’ll want to keep this in mind whenever we draw a conclusion.
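The same check translates directly to pandas, where `Series.corr` computes the Pearson coefficient by default, mirroring `CORREL`. The numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical claim-amount / LTV pairs standing in for columns H and M.
df = pd.DataFrame({
    "ClaimAmount": [1000, 2500, 4000, 6000, 9000],
    "LTV": [80_000, 85_000, 99_000, 97_000, 110_000],
})

# Series.corr defaults to the Pearson coefficient.
r = df["ClaimAmount"].corr(df["LTV"])
print(round(r, 2))
```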

We can create segments by clustering customers with similar claim amounts. Let’s group the claim amounts by percentile. This approach ensures that each group has roughly the same number of customers.

| Percentile | Cutoff | Formula |
|---|---|---|
| 25th | 2407 | `=PERCENTILE(CLTV!H:H,0.25)` |
| 50th | 4089 | `=PERCENTILE(CLTV!H:H,0.5)` |
| 75th | 6093 | `=PERCENTILE(CLTV!H:H,0.75)` |

We can then use the `COUNTIF` and `AVERAGEIF` functions again to calculate the count and average LTV of each cluster:

| Claim Amount | Count | Average LTV |
|---|---|---|
| 0 - 2407 | 22092 | 77,547 |
| 2408 - 4089 | 2407 | 93,206 |
| 4090 - 6093 | 4089 | 100,014 |
| 6094+ | 6093 | 102,428 |

As claim amounts increase, so does LTV! This makes sense: if a customer has a claim, the insurance company needs to increase their premium to compensate, thus driving up LTV. At the same time, this neatly illustrates why LTV isn’t a good metric for an insurance company. Given the low Pearson coefficient, though, we shouldn’t get too focused on this result.
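For reference, the percentile-based clustering above can be sketched in pandas with `qcut`, which bins values by quantile so each bucket gets roughly the same number of rows. The claim amounts and LTVs here are hypothetical:

```python
import pandas as pd

# Hypothetical claim amounts and matching LTVs.
claims = pd.Series([500, 1500, 2500, 3500, 4500, 5500, 6500, 7500])
ltv = pd.Series([70_000, 75_000, 90_000, 92_000, 99_000, 101_000, 102_000, 104_000])

# qcut splits at the 25th/50th/75th percentiles, like the PERCENTILE cutoffs.
buckets = pd.qcut(claims, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Per-bucket count and average LTV, like COUNTIF / AVERAGEIF per cluster.
summary = ltv.groupby(buckets).agg(count="count", avg_ltv="mean")
print(summary)
```

Because the bins are quantile-based, each bucket ends up with the same number of customers by construction.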

We’ve used clusters and segments to better understand claim amounts. That said, averages can hide structure in the data. We can create a scatter plot of the first 5,000 rows of data. (You can create a scatter plot of all 80K+ rows, but this will be exceedingly slow in both Excel and Google Sheets.)

The major takeaway is that there are far more claims at the lower end than at the higher end, so the 75th-percentile bucket (which starts at 6,093) encompasses a very wide range of claim amounts, which could skew our conclusions.
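Outside a spreadsheet, sampling the first 5,000 rows and plotting them takes a few lines of matplotlib. This sketch fabricates skewed claim data (a gamma distribution) purely to stand in for the real columns:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display required
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the ~80K-row claim and LTV columns;
# a gamma distribution mimics the "many small claims, few large" shape.
claims = rng.gamma(shape=2.0, scale=2000.0, size=80_000)
ltv = 76_000 + 5.0 * claims + rng.normal(0, 40_000, size=claims.size)

# Plot only the first 5,000 rows, as in the spreadsheet version.
sample_claims, sample_ltv = claims[:5_000], ltv[:5_000]
plt.scatter(sample_claims, sample_ltv, s=2, alpha=0.3)
plt.xlabel("Claim amount")
plt.ylabel("LTV")
plt.savefig("claims_vs_ltv.png")
```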

To show a different scatter plot with more information, we can create an artificial predictor of LTV using this formula: `=(0.4 + (RAND() * 0.4)) * LTV`. This formula creates a strong linear relationship between LTV and the LTV predictor. In fact, if we calculate the Pearson coefficient, we get 0.95!

You can see that as the LTV predictor goes up, LTV goes up.
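The artificial predictor is easy to reproduce outside the spreadsheet, too. This sketch draws hypothetical LTVs and applies the same 40–80% multiplier; the exact coefficient you get depends on the LTV distribution, so expect a strong but not identical correlation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical LTV values; the predictor is 40-80% of LTV with uniform
# noise, the same construction as =(0.4 + (RAND() * 0.4)) * LTV.
ltv = rng.uniform(50_000, 150_000, size=10_000)
predictor = (0.4 + rng.uniform(size=ltv.size) * 0.4) * ltv

# np.corrcoef returns the Pearson correlation matrix; [0, 1] is r.
r = np.corrcoef(predictor, ltv)[0, 1]
print(round(r, 2))
```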

All of the various strategies we’ve employed have been attempts to uncover relationships between our independent variables and our dependent variable, lifetime value. Regression analysis is a statistical modeling technique that uncovers relationships between a dependent variable and one or more independent variables. Modern spreadsheets support linear regression using the `LINEST` function. We can compute a linear regression in Google Sheets of LTV against claim amount with the formula `=LINEST(CLTV!M2:M80000,CLTV!H2:H80000, TRUE, TRUE)`. The output, sadly, is hideously complicated:

| | β1 | β0 |
|---|---|---|
| Coefficient | 4.999658097 | 76295.4678 |
| Standard error | 0.0969918959 | 527.1015201 |
| R^2 / std. error of y | 0.03214735568 | 89254.15577 |
| F statistic / degrees of freedom | 2657.111108 | 79997 |
| Regression SS / residual SS | 21167355703592 | 637280446857528 |

Column β1 refers to the independent variable (claim amount, in our case). The first value, 4.999658097, is the coefficient: for every 1 rupee increase in claim amount, the equation predicts a 5 rupee increase in LTV. The third value in that column is R^2, which indicates how much of the variation in LTV the model explains. The low R^2 value (as a rule of thumb, 0.5 or higher suggests a reasonably good fit) means we should not put too much emphasis on this particular result.
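If the `LINEST` layout feels opaque, the same regression is more readable in NumPy. This sketch generates hypothetical data that loosely follows the fitted line above (slope near 5, intercept near 76,000, heavy noise) and recovers the coefficient, intercept, and R^2:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data built to loosely match the fit above:
# LTV is roughly 5 * claim + 76,000, swamped by noise (hence the low R^2).
claims = rng.uniform(0, 12_000, size=5_000)
ltv = 76_000 + 5.0 * claims + rng.normal(0, 80_000, size=claims.size)

# polyfit with deg=1 returns [slope, intercept]: LINEST's first row.
slope, intercept = np.polyfit(claims, ltv, deg=1)

# For simple linear regression, R^2 is the squared Pearson coefficient.
r_squared = np.corrcoef(claims, ltv)[0, 1] ** 2
print(slope, intercept, r_squared)
```

The slope and intercept land near the generating values, while R^2 stays small because the noise dominates, mirroring the spreadsheet result.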

We’ve explored four techniques—segmentation, clustering, scatter plots, and linear regression—that you can utilize to gain insights from your data.

While these techniques are easily accessible in a spreadsheet, they do have a few important limitations. In particular:

- Each of these techniques assumes that every variable is completely independent of the others, which is unlikely to be the case.
- They assume (implicitly or explicitly) a linear relationship between the independent and dependent variable.
- We don’t have a great way (other than the Pearson coefficient and R^2) to measure the relative impact of each of these independent variables.

Fortunately, these limitations can be overcome with more sophisticated analysis and visualization strategies. You’ll need to move beyond spreadsheets and use data science tools such as R, Python, and Jupyter notebooks to analyze your data. Or, if that’s too much for you, start a free trial of Amorphous Data today.