NoSQL

In most of my experience as a Data Scientist at a large Bank, I have retrieved data from relational databases.  More recently, we have built a new platform, where I have been working with Cassandra for time series data on a Kubernetes cluster.  I happened to watch a 1-hour presentation on NoSQL databases, which furthered my motivation to understand the topic, so I read a textbook on NoSQL by Pramod Sadalage and Martin Fowler.  Below are my notes:
 
  • Polyglot Persistence is the idea of using different data stores in different circumstances.  The term borrows from the term Polyglot Programming referring to multiple computer programming languages within an application.
  • NoSQL is a movement, not a technology.
  • Relational databases are not designed to run on clusters and thus scaling presents a challenge.  Amazon (Dynamo paper) and Google (BigTable paper) were very influential in setting the direction for the resolution.
  • Relational databases often work very well.  Migrating away from this framework should be motivated by a specific objective (e.g. running on clusters).
  • Integration databases are a single source of data for multiple applications.  The alternative paradigm is an Application database, which has a one to one relationship between storage and application.
  • Application databases are the paradigm of NoSQL.  The application “knows” the database structure.  A schemaless db shifts the schema into the application that accesses it.  This type of db is more forgiving for evolving needs.
  • Relational databases are good for analyzing data.  NoSQL databases are not flexible for querying.
  • There are 4 types of NoSQL dbs; the first 3 are called aggregate-oriented data models.
    • Key-Value: Redis, Riak, Dynamo; value is opaque
    • Document: MongoDB, CouchDB; value has structure.
    • Column-family: Cassandra, HBase
    • Graph: Neo4j
  • An aggregate is a collection of related objects that are treated as a single unit.  Aggregates form the boundaries of an ACID operation and are central to running on a cluster (see the sketch after this list).
  • Distribution Model.  Demonstrates the trade-off between consistency and availability.
    • Single-server
    • Sharding: different parts of data onto different servers.  Each server is a single source of a subset of data.
    • Master/slave replication: Replicating data across multiple nodes with one node the authority (master).  Helps read scalability.
    • Peer-to-peer replication: Replicating data across all nodes, no authority.  Helps write scalability.  Scaling is linear because there is no master (Cassandra is an example).
  • Consistency: 
    • Conflicts occur when clients try to write the same data at the same time (write-write) or one client reads inconsistent data during another’s write (read-write).
    • A pessimistic approach locks data to prevent conflicts; an optimistic approach detects conflicts and fixes them.
    • To get good consistency, many nodes should be involved, but that increases latency.
  • CAP Theorem: in the presence of a network partition, you trade off consistency against availability.
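
To make the aggregate idea above concrete, here is a minimal illustration (the order/line-item example is a hypothetical, shown as a Python structure) of data that an aggregate-oriented store would keep as one unit rather than spread across relational tables:

```python
# Hypothetical order aggregate: the order, its line items, and the shipping
# address are stored and retrieved together as one unit.
order_aggregate = {
    "order_id": 99,
    "customer_id": 1,
    "shipping_address": {"city": "Chicago", "street": "100 Main St"},
    "line_items": [
        {"product": "NoSQL Distilled", "quantity": 1, "price": 32.45},
        {"product": "Refactoring",     "quantity": 1, "price": 44.99},
    ],
}
# In a key-value, document, or column-family store, this whole structure lives
# together (typically on one node), which is what makes sharding across a
# cluster natural.
```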

Linear Regression of Clusters

We have all seen a scatter plot that shows a clear linear relationship.  I recently came across financial data that, when plotted, appeared to be described by several distinct lines. To illustrate a method that automatically fits multiple lines without explicit labeling, I will use Insurance Forecast data.  Plotting dollars charged for a patient against age shows that there are possibly three separate linear relationships.

Potentially, there are features beyond age that explain the three groupings.  In this case, these features are smoking status and BMI.  However, assume your data engineer didn’t track those features – a common occurrence, since tracking every piece of data under the sun can result in an unnecessarily burdensome technology stack.

If you wanted to model the relationship between age and charges, one solution would be a two-step process: 1) identify the latent label via clustering and then 2) perform the regression separately on each cluster.  A more powerful solution, probabilistic programming, combines this into one step using pymc3.

Model description:

Model: f = \pi_1 \mu_1 + \pi_2 \mu_2 + \pi_3 \mu_3, where:
\mu_k = \alpha_k + \beta_k x   component lines
(\pi_1, \pi_2, \pi_3) \sim \mathrm{Dirichlet}(a)   mixing coefficients
\alpha_1, \alpha_2, \alpha_3 \sim N(10^4, 10^5)   intercepts
\beta_1, \beta_2, \beta_3 \sim N(10^3, 10^4)   slopes
\sigma \sim \mathrm{HalfNormal}(10^4)   error
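
A minimal pymc3 sketch of this model follows, assuming `x` (ages) and `y` (charges) are already loaded as numpy arrays; the unit Dirichlet concentration and the sampler settings are assumptions:

```python
import numpy as np
import pymc3 as pm

# x: patient ages, y: dollars charged (assumed to be loaded as numpy arrays)

with pm.Model() as model:
    # Mixing weights for the three lines (unit concentration is an assumption)
    pi = pm.Dirichlet("pi", a=np.ones(3))

    # Per-line intercepts and slopes, matching the priors above
    alpha = pm.Normal("alpha", mu=1e4, sigma=1e5, shape=3)
    beta = pm.Normal("beta", mu=1e3, sigma=1e4, shape=3)

    # Shared observation noise
    sigma = pm.HalfNormal("sigma", sigma=1e4)

    # One regression line per component, evaluated at every age
    mu = alpha + beta * x[:, None]  # shape (n_obs, 3)

    # Mixture-of-normals likelihood ties it all together
    charges = pm.NormalMixture("charges", w=pi, mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(2000, tune=2000)
```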

Fitting this model, assuming three lines, yields:

Adding labels: As you can see, this model fits three lines, each with its own intercept and slope.  In fact, there is a clear grouping of patients by smoking status and body mass index.

For the three lines, there are two distinct slopes: $270 per year of age for non-smokers and $284 per year of age for smokers.  The intercept, which can be interpreted as a fixed set of charges for a customer, is about $19k for smokers, but this number is a bit deceptive.  Looking at the marginal distribution of alpha for smokers (both types: high and low BMI), there are two modes: $10k and $30k. This is due to a mixing of these populations by the model. One can assume $10k is for the lower BMI group and $30k for the higher group. The distribution of the population (i.e. \pi) is about 72% non-smokers and an equal 14% each of high and low BMI smokers.

Finally, one of the most powerful aspects of this model is that we can generate samples using the posterior distribution since we have a full description of the model behavior. Below is a plot of the original data (black) along with the generated data (blue).
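
A hedged sketch of that posterior predictive step, reusing the `model` and `trace` objects from the fitting sketch above:

```python
import pymc3 as pm

# Reuse the model context and trace from the fitting sketch above
with model:
    ppc = pm.sample_posterior_predictive(trace)

# ppc["charges"] holds simulated charges; plotting them (blue) against the
# original data (black) gives the comparison described above.
```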

Effect of Sampling on ROC and PR Curves

I was surprised to find that a model can generate the same ROC curve but have drastically different precision/recall depending on the data set the model is applied to.  This is driven by the fact that the ROC curve is insensitive to class imbalance, while precision is not.


I was recently working with an imbalanced data set D, where the positive minority class represented ~5% of the samples.  I downsampled the negative class to create a balanced (50-50 split) data set D’.  I trained the model on D’ and, to get an idea of model performance, generated the ROC curve and the precision/recall using both D and D’.  While the ROC curve was the same for both, the precision dropped from 80% on D’ to 10% on D.  What is the interpretation?
The ROC curve is the True Positive Rate (TPR) plotted against the False Positive Rate (FPR).  The former measures the success rate within the actual positive class, while the latter measures the false alarm rate within the actual negative class.  Thus, if the rate of prediction within each class remains the same for D and D’, the ROC curve will look similar.
On the other hand, precision measures the performance of the positive predictions.  Going from D’ to D increases the number of negative samples.  If the rate of prediction within each class remains the same, giving a similar ROC curve, then the precision will drop because of the substantially larger number of negative samples.  Essentially, the model will be calling many of the new negative samples positive.
Mathematically

  • Assume that FPR and TPR remain the same.
  • TN, TP, FN, FP is the number of True Negatives, True Positives, False Negatives, False Positives.
  • The actual negative class is undersampled by a factor of n when going from D to D’ – this only affects the number of TN and FP.

FPR_{D’} = FP / (FP + TN)
FPR_D = nFP / (nFP + nTN) = FP / (FP + TN) = FPR_{D’}
Precision_{D’} = TP / (TP + FP)
Precision_D = TP / (TP + nFP)
Precision_{D’} / Precision_D = (TP + nFP) / (TP + FP)
Plugging in numbers to see the impact, assume 1) TP ~ FP, so that Precision_{D’} = 50%, and 2) n ~ 5 (the majority negative class is 5x the minority positive class).  Then: Precision_{D’} / Precision_D = (1 + n) / 2 = 3.  The precision on the undersampled data set D’ will be 3x higher than on the full data set D.
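
As a quick numeric sanity check of the ratio above (the counts below are made-up illustrations, assuming TP = FP on the balanced set and n = 5):

```python
# Hypothetical confusion counts on the balanced set D' (TP = FP assumed)
TP, FP, TN = 100, 100, 1000
n = 5  # the full set D has 5x as many negatives

precision_balanced = TP / (TP + FP)          # 0.50 on D'
precision_full = TP / (TP + n * FP)          # ~0.17 on D

fpr_balanced = FP / (FP + TN)
fpr_full = (n * FP) / (n * FP + n * TN)      # identical to fpr_balanced

print(precision_balanced / precision_full)   # (1 + n) / 2 = 3.0
```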

The Idea Maze

I came across a great blog post on the idea maze, based on a lecture by Balaji Srinivasan. I read it through the lens of an employee at a large bank and aspiring “intra-preneur”.

It points out that it is relatively straightforward to have an idea, to identify the point of entry to a complex maze.  The heavy lifting is to think through and develop deep, unique insight into all possible paths from idea to profit.

In order to generate this bird’s eye view, four sources of inspiration are highlighted.  Each source has a distinct representation inside a large bank.

  1. Past history: Identify similar ideas that have been tried before.  It is a beautiful and often discouraging fact that most ideas have been tried, but died on the vine for various reasons (e.g. political, technological, or organizational constraints).  Finding out requires some connection to senior people who have been around.  Sharing your idea and not operating in stealth mode has benefits that compensate for the small risk of having the idea stolen.
  2. Analogy: Identify other businesses within a large bank where the idea has been tried, perhaps in a different format with adaptations for that business’s nuances.  Like buying an old home, a little imagination is required to see how it applies – if it was obvious, there isn’t much value in your contribution.
  3. Theories: Connect with your local academic institutions to find scholars on the problem.  By offering adjustments to the assumptions in their publications to account for business specifics, this source may provide deep insight.
  4. Direct experience: Put yourself in the maze and be aware that you are in one. 

538 Riddler: Ghostbusters

A solution to the riddle on page 71 of The Riddler and also in the weekly column: Will You Be A Ghostbuster Or A World Destroyer?

There is a simple closed-form solution, but I wanted to do a brute-force Monte Carlo simulation. I randomly assigned two spots on the circle for the two ghostbusters (Abe and Betty) and two spots for the ghosts (Dan and Candace). Then I ran multiple simulations and tracked whether the streams crossed or not.
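
Here is a minimal sketch of that simulation; the pairing of Abe with Dan and Betty with Candace is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_streams_cross(n_sims=100_000):
    """Monte Carlo estimate of the chance the two proton streams cross."""
    # Random positions on the circle, expressed as angles in [0, 2*pi)
    abe, betty, dan, candace = rng.uniform(0, 2 * np.pi, size=(4, n_sims))

    # Is point p on the counter-clockwise arc from a to b?
    def on_arc(a, b, p):
        return (p - a) % (2 * np.pi) < (b - a) % (2 * np.pi)

    # The chords (Abe -> Dan) and (Betty -> Candace) cross iff exactly one of
    # Betty/Candace sits on the arc swept from Abe to Dan
    crossed = on_arc(abe, dan, betty) != on_arc(abe, dan, candace)
    return crossed.mean()

print(prob_streams_cross())  # converges to ~0.333
```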

The convergence plot below shows how ~1k sims converge to the solution of a 33% chance that they cross streams.

See the Jupyter Notebook.

Improve your win rate

Despite winning all season, your baseball team can disappoint in the playoffs.  The Oakland Athletics played 162 games and won 60% of them, then lost the 1-game wildcard playoff (again!).  Any single outcome lacks statistical significance.  If you are 60% sure your trade will be a success, that is a phenomenal edge.  However, if you make a single bet, then 4 times out of 10 you will lose.  And remember, you only care about the outcome realized; like the Athletics going home, all that matters is what happened.  Life has a statistical sample size of 1.


So how do you capitalize on your edge?  Make more bets, play more games.  In a 7-game series, the Athletics win about 7 times out of 10.  If you place 41 trades, then about 90% of the time you will make more money than you lose.
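
A quick binomial check of both numbers, assuming independent games/trades with a 60% win probability:

```python
from scipy.stats import binom

p = 0.6  # per-game / per-trade win probability

# Best-of-7 series: probability of winning at least 4 of 7 games
print(binom.sf(3, 7, p))    # ~0.71

# 41 trades: probability of winning more trades than you lose (>= 21 wins)
print(binom.sf(20, 41, p))  # ~0.90
```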

Hierarchical Job Clusters

Introduction

Last week, I gave a guest lecture at the NYU School of Engineering to Financial Engineering students on a Data Science topic.  The lecture covered Unsupervised Learning techniques: PCA, K-means and Hierarchical clustering.  I started the lecture by discussing the breadth of potential career paths in Finance.  On the train ride home after the lecture, it occurred to me that I could use a clustering algorithm to build a map of the roles within a Bank.  I think this would be helpful because there are universes amongst universes inside these large Banks that only a select few gray-haired veterans truly understand.  Wouldn’t it be nice if we could draw a diagram showing that?

The Data

The most obtainable data on jobs that I could think of are the descriptions themselves.  The goal would be to relate the descriptions, a natural language processing (NLP) problem.  However, the first issue was getting the job descriptions to create a first pass/proof of concept.  Using monster.com’s “rss” feed, I was able to get about 15 descriptions for each of 4 separate job searches: ‘financial analyst’, ‘accounting manager’, ‘web developer’, ‘pharmaceutical sales’.

The Code
Natural Language Processing (NLP) is notoriously difficult and I have never used any of the Python NLTK libraries to solve a problem.  Fortunately, bloggers have posted prototype proofs of concept.  In particular, Brandon Rose’s approach for clustering movie descriptions was highly translatable to job description clustering.  Briefly, after vectorizing the job descriptions, a distance was calculated between each pair of vectors.  Then the vectors were clustered using Ward’s linkage function.
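
Below is a minimal sketch of that pipeline, loosely following that approach; the `descriptions` placeholder and the vectorizer thresholds are assumptions, not the exact settings I used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

descriptions = [...]  # placeholder: one raw job-description string per posting

# Vectorize the descriptions into TF-IDF weighted term vectors
tfidf = TfidfVectorizer(max_df=0.8, min_df=2, stop_words="english")
X = tfidf.fit_transform(descriptions)

# Cosine distance between every pair of descriptions
dist = 1 - cosine_similarity(X)

# Ward's linkage expects a condensed distance matrix; then draw the dendrogram
linkage_matrix = ward(squareform(dist, checks=False))
dendrogram(linkage_matrix, orientation="left")
plt.tight_layout()
plt.show()
```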

Much more work is required to take this seriously.  Specifically: 1) more data 2) optimize vectorization thresholds 3) scrub/vet data appropriately 4) pick a good distance measure.

Results
In the dendrogram below (cropped so the text shows properly), the horizontal length of the legs joining two nodes is proportional to their dissimilarity: nodes joined by short legs are highly related.  Of the 4 job searches, web developer stands alone and tends to have job descriptions highly related to each other.  Pharmaceutical sales is also well clustered, although it can be found closely linked to accounting managers in some cases.  Financial analyst, a very vague title, tends to be sprinkled in.

Figure 1: Cropped dendrogram of 15 job descriptions from each of 4 searches: ‘financial analyst’, ‘accounting manager’, ‘web developer’, ‘pharmaceutical sales’.

Harness Your Fitbit Data

I am certain that ownership of a Fitbit is highly correlated with obsessive-compulsive behavior and a love of data, but I only have this Sedaris essay to corroborate that.  When I got my Fitbit Blaze, I thought the pedometer/heart rate monitor/tracker would be an excellent way to generate personal biodata for me to analyze.  I was disappointed by the limited dashboard views they offer.  Although Fitbit claims that “Your data belongs to you…”, you only get a very small view, as pointed out here.  Letting their user base geek out and obsess over their data isn’t high up on their priority list.  To this end, I am sharing an approach to dissecting your outdoor runs.
Data from outdoor runs
In an exceptional case, Fitbit lets the user access the results from a run in an XML file (*.tcx).  During my training for the 2016 Hartford marathon, I wanted a way to view my runs to track progress.  In addition, the data might offer some way to improve.  For example, beginner runners manage pace poorly, either running too fast in the beginning and draining the tank or being too conservative and holding back too much (me).  The tcx file offered two response variables of interest: pace (e.g. minutes per mile) and heart rate (beats per minute).  A short Python script converts the XML file into a dataframe, which allows a couple of interesting views.
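
A hedged sketch of that conversion step is below; the tag names follow the standard .tcx (TrainingCenterDatabase) schema, and the file name is hypothetical:

```python
import pandas as pd
import xml.etree.ElementTree as ET

TCX_NS = {"tcx": "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"}

def tcx_to_dataframe(path):
    """Flatten the Trackpoint records of a .tcx file into a pandas DataFrame."""
    root = ET.parse(path).getroot()
    rows = []
    for tp in root.iter("{{{0}}}Trackpoint".format(TCX_NS["tcx"])):
        rows.append({
            "time": tp.findtext("tcx:Time", namespaces=TCX_NS),
            "lat": tp.findtext("tcx:Position/tcx:LatitudeDegrees", namespaces=TCX_NS),
            "lon": tp.findtext("tcx:Position/tcx:LongitudeDegrees", namespaces=TCX_NS),
            "distance_m": tp.findtext("tcx:DistanceMeters", namespaces=TCX_NS),
            "hr_bpm": tp.findtext("tcx:HeartRateBpm/tcx:Value", namespaces=TCX_NS),
        })
    # Convert numeric columns; the timestamp column is left as a string here
    return pd.DataFrame(rows).apply(pd.to_numeric, errors="ignore")

# df = tcx_to_dataframe("run.tcx")  # hypothetical file name
```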
Central Park Training Run
For people who run the same route, a map view can give insight into where particular spots are pain points.  For many years, I ran the same loop in Central Park (6.1 miles).  Figure 1 shows my heart rate and pace as a function of location in the park.  There is additional info that can be layered on, like change in altitude, which might help explain the slowdowns and increased heart rate (the dark red in the upper right corner is a very steep hill in Central Park).  Also, the graph shows that I am clearly starting too slow!

Figure 1: Heart rate (left) and pace (right) as a function of location in Central Park. An “X” marks the starting point of the run and the arrows give the direction of the run. This graph can be layered on top of a satellite image using the Python basemap package, but at the time of writing, my interpreter chose not to cooperate.

In case you are less interested in the geographical view, a simple plot of heart rate and pace versus time and distance can be informative, as in Figure 2.

Figure 2: Heart rate (left) and pace (right) as a function of time and distance. The red dashed lines show 70%, 80% and 90% of max heart rate (measured as 220-age).

Beer Nutrition

Figure 1: Calories, carbohydrates and ABV % are somewhat linearly related. Data on 70+ beers reveals some useful rule of thumb calculations you can use to make estimates. For example, multiply your ABV % by 30 to estimate the number of calories in your beer.

Introduction
I like beer and I like data, so it is natural that I would love to play with data on beer.  I like the abundance of flavor in beers with higher ABV, but began wondering: at what nutritional cost?  Extra calories, carbs, both?  It’s clear one can pick a Michelob Ultra if they are trying to minimize calories, but let’s face it, compared to an IPA, that’s like a paper fan compared to an AC unit.
Data Description
I found some beer nutrition data online to start the analysis with.  Ultimately, I would like to build a scraper to capture some more data.  There are three attributes I am interested in: Calories, Carbohydrates, and ABV %.
Results
A factor plot is a convenient way to get a quick glance at the relationships between the three attributes (Figure 2).  It shows the linear relationship between them, which is expected because they are intimately related: carbohydrates translate directly into calories.  In fact, Google says that one gram of carbs is ~4 calories.  Figure 1 shows this in more detail. The slope of the regression line implies slightly more calories per carb in beer: ~6.7 calories per gram of carbs.  The fit is skewed by the “skinny” beers with under 10 grams of carbs, where the first few carbs don’t seem to make a caloric difference. This is probably due to engineering trade-offs: the <5g carb beers are probably missing something that gives the 5-10g carb beers some taste.  The intercept is ~69 calories, indicative of some minimum non-carb contribution to the number of calories.
Other interesting relationships are available too because of the linear relationship.  ABV is typically the most available stat on the beer label.  From this, you can estimate calories and carbs using these handy rule-of-thumb calculations:
  • Multiply ABV% by 30 to get total calories.
  • Multiply ABV% by 3 to get total carbohydrates (see the quick check below).
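
A quick check of these rules of thumb for a hypothetical 6.5% ABV IPA:

```python
# Rule-of-thumb estimates for a hypothetical 6.5% ABV IPA
abv = 6.5
calories_est = 30 * abv   # ~195 calories
carbs_est = 3 * abv       # ~19.5 g of carbohydrates
print(calories_est, carbs_est)
```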

Figure 2: Factor plot between ABV %, calories and carbohydrates. Red line is a regression with shaded area showing 95% confidence interval. There is a strong linear relationship between these correlated variables.