Polyglot Persistence is the idea of using different data stores in different circumstances. The term borrows from Polyglot Programming, the use of multiple programming languages within a single application.
NoSQL is a movement, not a technology.
Relational databases are not designed to run on clusters and thus scaling presents a challenge. Amazon (Dynamo paper) and Google (BigTable paper) were very influential in setting the direction for the resolution.
Relational databases often work very well. Migrating away from this framework should be motivated by a specific objective (e.g. running on clusters).
Integration databases are a single source of data for multiple applications. The alternative paradigm is an Application database, which has a one-to-one relationship between storage and application.
Application databases are the paradigm of NoSQL. The application “knows” the database structure. A schemaless db shifts the schema into the application that accesses it. This type of db is more forgiving for evolving needs.
Relational databases are good for analyzing data; NoSQL databases are generally less flexible for ad hoc querying.
There are 4 types of NoSQL dbs; the first 3 are called aggregate oriented data models.
Key-Value: Redis, Riak, Dynamo; value is opaque
Document: MongoDB, Couch; value has structure.
Column-family: Cassandra, HBase
Graph: Neo4j; the fourth type, not aggregate oriented; relationships are first-class.
An aggregate is a collection of related objects that are treated as a single unit. Aggregates form the boundaries of an ACID operation, and they are central to running on a cluster.
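As a sketch of the idea, here is a hypothetical "order" aggregate (the fields are made up): the customer reference, line items, and total travel together as one unit of storage and update, which is what makes it easy to place on a single cluster node.

```python
# A hypothetical "order" aggregate: everything needed to process the order
# is nested inside one object, stored and updated as a unit.
order = {
    "id": "order-1001",
    "customer": {"id": "cust-7", "name": "Ada"},
    "line_items": [
        {"sku": "widget", "qty": 2, "price": 9.99},
        {"sku": "gadget", "qty": 1, "price": 24.50},
    ],
    "total": 44.48,
}

# The aggregate is internally consistent: the total matches its line items.
computed = sum(item["qty"] * item["price"] for item in order["line_items"])
```

In a key-value store the whole object would be the opaque value under `order-1001`; in a document store the nested structure would be queryable.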
Distribution Models. These demonstrate the trade-off between consistency and availability.
Sharding: placing different parts of the data onto different servers. Each server is the single source for a subset of the data.
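A minimal sketch of shard routing, assuming simple hash-based placement (the key format and shard count are illustrative): the same key must always land on the same server, since that server is its single source.

```python
import hashlib

def shard_for(key, n_shards):
    """Route a key to a shard by hashing it, so the same key
    always lands on the same server."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# e.g. all reads and writes for "user:42" go to one shard
shard = shard_for("user:42", 4)
```

Real systems typically use consistent hashing instead of a plain modulus, so that adding a shard does not remap most keys.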
Master/slave replication: Replicating data across multiple nodes with one node the authority (master). Helps read scalability.
Peer-to-peer replication: Replicating data across all nodes, no authority. Helps write scalability. Linear scaling because no master (Cassandra is example).
Conflicts occur when clients try to write the same data at the same time (write-write) or one client reads inconsistent data during another’s write (read-write).
A pessimistic approach locks data to prevent conflicts; an optimistic approach detects conflicts and repairs them.
To get good consistency, many nodes should be involved in each operation, but this increases latency.
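The standard way this trade-off is expressed is the quorum rule, R + W > N (not stated in the notes above, but the usual condition): if the read quorum and write quorum overlap, a read always touches at least one node with the latest write.

```python
def quorum_consistent(n, w, r):
    """A read is guaranteed to see the latest write when the read
    and write quorums overlap: R + W > N."""
    return r + w > n

# Typical N=3 setup: writing to 2 and reading from 2 overlaps on >=1 node.
strong = quorum_consistent(3, 2, 2)
# Writing to 1 and reading from 1 is fast but only eventually consistent.
fast = quorum_consistent(3, 1, 1)
```

Larger quorums mean more nodes per operation and hence more latency, which is exactly the tension described above.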
CAP Theorem: when a network partition occurs, you must trade off consistency against availability.
We have all seen a scatter plot that shows a clear linear relationship. I recently came across financial data that, when plotted, appeared to be described by multiple distinct lines. To illustrate a method that automatically fits multiple lines without explicit labeling, I will use Insurance Forecast data. Plotting the dollars charged for a patient versus age shows that there are possibly three separate linear relationships.
Potentially, there can be features beyond age that explain the three groupings. In this case, these features are smoking status and BMI. However, assume your data engineer didn’t track those features – a common occurrence, as tracking every piece of data under the sun can result in an unnecessarily burdensome technology stack.
If you wanted to model the relationship between age and charges, one solution would be a two-step process: 1) identify the latent labels via clustering, then 2) perform a separate regression on each cluster. A more powerful solution, probabilistic programming, combines this into one step using pymc3.
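Since the insurance data isn't reproduced here, a minimal numpy sketch of the two-step baseline on synthetic data (the group intercepts, slope, and noise level below are made up, not the fitted values from the post). The one-step pymc3 mixture model replaces both steps with a single probabilistic model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: three latent groups, each with its own intercept
# for charges vs. age (numbers are illustrative only).
n = 300
ages = rng.uniform(18, 64, n)
group = rng.integers(0, 3, n)
intercepts = np.array([2_000.0, 20_000.0, 40_000.0])
charges = intercepts[group] + 270.0 * ages + rng.normal(0.0, 500.0, n)

# Step 1: recover the latent labels with a tiny 1-D k-means on charges.
def kmeans_1d(x, k, iters=50):
    centers = np.quantile(x, np.linspace(0.1, 0.9, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean()
    return assign

labels = kmeans_1d(charges, 3)

# Step 2: fit a separate line to each recovered cluster.
fits = []
for j in range(3):
    mask = labels == j
    slope_j, intercept_j = np.polyfit(ages[mask], charges[mask], 1)
    fits.append((intercept_j, slope_j))
fits.sort()  # order clusters by intercept
```

This two-step version works when the groups are well separated, but unlike the pymc3 model it gives no posterior uncertainty over the cluster assignments or the line parameters.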
Fitting this model, assuming three lines, yields:
Adding labels: As you can see, this model fits three lines with their separate intercepts and slopes. In fact, there is a clear grouping of patients: smoking status and body mass index.
For the three lines, there are two distinct slopes: $270 per year of age for non-smokers and $284 per year of age for smokers. The intercept, which can be interpreted as a fixed set of charges for a customer, is about $19k for smokers, but this number is a bit deceptive. Looking at the marginal distribution of alpha for smokers (both types: high & low BMI), there are two modes: $10k and $30k. This is due to a mixing of these populations by the model. One can assume $10k is for the lower BMI group and $30k for the higher group. The distribution of the population is about 72% non-smokers and an equal 14% of high/low BMI smokers.
Finally, one of the most powerful aspects of this model is that we can generate samples using the posterior distribution since we have a full description of the model behavior. Below is a plot of the original data (black) along with the generated data (blue).
I was surprised to find that a model can generate the same ROC curve, but have drastically different precision/recall depending on the data set the model is applied to. This is driven by the fact that ROC is invariant to class balance, while precision/recall is not.
I was recently working with an imbalanced data set D, where the positive minority class represented ~5% of the samples. I down sampled the negative class to create a balanced (50-50 split) data set D’. I trained the model on D’ and in order to get an idea of model performance, I generated the ROC curve and the Precision/Recall using both D and D’. While the ROC curve was the same for both, the precision dropped from 80% on D’ to 10% on D. What is the interpretation?
The ROC curve is the True Positive Rate (TPR) plotted against the False Positive Rate (FPR). The former measures the success rate within the actual positive sample, while the latter measures the rate within the actual negative class. Thus, if the rate of prediction within each class remains the same for D and D’, the ROC curve will look similar.
On the other hand, precision measures performance of the positive predictions. Going from D’ to D increases the number of negative samples. If the rate of prediction within each class remains the same, giving a similar ROC curve, then the precision will drop because of the substantially more negative samples. Essentially, the model will be calling many of the new negative samples as positive.
- Assume that FPR and TPR remain the same
- TN, TP, FN, FP are the numbers of True Negatives, True Positives, False Negatives, False Positives
- The actual negative class is undersampled by a factor n to create D’ – this only affects the number of TN and FP
FPR_D’ = FP / (FP + TN)
FPR_D = nFP / (nFP + nTN) = FP / (FP + TN)
Therefore FPR_D’ = FPR_D.

Precision_D’ = TP / (TP + FP)
Precision_D = TP / (TP + nFP)

Precision_D’ / Precision_D = (TP + nFP) / (TP + FP)
Plugging in numbers to see the impact, assume 1) TP ≈ FP, so that Precision_D’ = 50%, and 2) n ≈ 5 (the majority negative class is 5x the minority positive class). Then: Precision_D’ / Precision_D = (1 + n) / 2 = 3. The precision on the undersampled set D’ will be 3x higher than on the full data set D.
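The arithmetic above can be checked directly. The confusion counts below are hypothetical, chosen so that TP equals FP on the balanced set (giving the 50% precision in the example):

```python
# Hypothetical confusion counts on the balanced set D' (chosen so TP == FP).
TP, FP, TN = 400, 400, 600
n = 5  # the full set D has 5x as many negatives as D'

# FPR: both FP and TN scale by n, so the ratio -- and the ROC curve -- is unchanged.
fpr_balanced = FP / (FP + TN)
fpr_full = (n * FP) / (n * FP + n * TN)

# Precision: only FP scales by n, so precision drops on the full set.
prec_balanced = TP / (TP + FP)   # 0.5
prec_full = TP / (TP + n * FP)

ratio = prec_balanced / prec_full  # (1 + n) / 2 = 3 when TP == FP
```

The model calls the same fraction of negatives positive in both sets, but D simply has far more negatives to miscall.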
It points out that it is relatively straightforward to have an idea, to identify the point of entry to a complex maze. The heavy lifting is to think through and develop deep, unique insight into all possible paths from idea to profit.
In order to generate this bird’s eye view, four sources of inspiration are highlighted. Each source has a distinct representation inside a large bank.
- Past history: Identify similar ideas that have been tried before. It is the beauty and often discouraging fact that most ideas have been tried, but died on the vine for various reasons (eg political, technological, or organizational constraints). Finding out requires some connection to senior people who have been around. Sharing your idea and not operating in stealth mode has benefits that compensate for the small risk of having the idea stolen.
- Analogy: Identify other businesses within a large bank where the idea has been tried, perhaps in a different format with adaptations for that business’s nuances. Like buying an old home, a little imagination is required to see how it applies – if it was obvious, there isn’t much value in your contribution.
- Theories: Connect with your local academic institutions to find scholars on the problem. By offering adjustments to the assumptions in their publications to account for business specifics, this source may provide deep insight.
- Direct experience: Put yourself in the maze and be aware that you are in one.
There is a simple closed form solution, but I wanted to do a brute force Monte Carlo simulation. I randomly assigned two spots on the circle for the two ghostbusters (Abe and Betty) and then assigned two spots for the ghosts (Dan and Candace). Then I ran many simulations and tracked whether the streams crossed or not.
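The simulation can be sketched in a few lines. The names follow the text; the pairing of Abe with Dan and Betty with Candace is an assumption, as is representing positions as fractions of a turn:

```python
import random

def streams_cross_prob(trials=100_000, seed=0):
    rng = random.Random(seed)
    crossed = 0
    for _ in range(trials):
        # Four independent uniform positions on the circle, as fractions of a turn.
        abe, betty, dan, candace = (rng.random() for _ in range(4))
        # The chords (abe, dan) and (betty, candace) intersect iff their endpoints
        # interleave: exactly one of betty/candace lies on the arc from abe to dan.
        lo, hi = sorted((abe, dan))
        crossed += (lo < betty < hi) != (lo < candace < hi)
    return crossed / trials
```

With ~100k trials the estimate settles near 1/3, consistent with the convergence plot below.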
The convergence plot below shows how, after ~1k simulations, the estimate converges to the solution of a 33% chance that they cross streams.
See the Jupyter Notebook.
Despite winning all season, your baseball team can disappoint in the playoffs. The Oakland Athletics played 162 games and won 60% of them, then lost the 1 game wildcard playoff (again!). Even highly probable outcomes fail a meaningful fraction of the time. If you are 60% sure your trade will be a success, that is a phenomenal edge. However, if you make a single bet, then 4 times out of 10, you will lose. And remember, you only experience the outcome that is realized; like the Athletics going home, all that matters is what happened. Life has a statistical sample size of 1.
So how do you capitalize on your edge? Make more bets, play more games. In a 7 game series, the Athletics win 7 times out of 10. If you place 41 trades, then 90% of the time you will make more money than you lose.
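Both figures follow from the binomial distribution; a quick check (the function name is mine):

```python
from math import comb

def prob_win_majority(p, n):
    """Probability of winning more than half of n independent bets,
    each won with probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Best-of-7 series with a 60% edge per game: ~0.71, i.e. "7 times out of 10".
series7 = prob_win_majority(0.6, 7)

# 41 trades with a 60% edge each: ~0.90 chance of winning more than you lose.
trades41 = prob_win_majority(0.6, 41)
```

Repetition converts a per-bet edge into a near-certain aggregate outcome.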
Last week, I gave a guest lecture at NYU School of Engineering to Financial Engineering students on a Data Science topic. The lecture covered Unsupervised Learning techniques: PCA, K-means and Hierarchical clustering. I started the lecture by discussing the breadth of potential career paths in Finance. On the train ride home after the lecture, it occurred to me that I could use a clustering algorithm to build a map of the roles within a Bank. I think this would be helpful to anyone, because there are universes amongst universes inside these large Banks that only a select few gray-haired veterans truly understand. Wouldn’t it be nice if we could draw a diagram showing that?
The most obtainable data on jobs that I could think of are the descriptions themselves. The goal would be to relate the descriptions, a natural language processing (NLP) problem. However, the first issue was getting the job descriptions to create a first pass/proof of concept. Using monster.com’s “rss” feed, I was able to get about 15 descriptions for 4 separate job searches: ‘financial analyst’, ‘accounting manager’, ‘web developer’, and ‘pharmaceutical sales’.
Much more work is required to take this seriously. Specifically: 1) more data 2) optimize vectorization thresholds 3) scrub/vet data appropriately 4) pick a good distance measure.
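As a toy illustration of relating descriptions, here is a bag-of-words cosine similarity on made-up one-line snippets (the descriptions, stopword list, and role names below are invented; a serious pass would use TF-IDF vectorization and a properly chosen distance measure, per the list above):

```python
from collections import Counter
from math import sqrt

STOPWORDS = {"and", "the", "using", "with"}

def tokenize(text):
    return [w.strip(".,").lower() for w in text.split() if w.lower() not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Made-up one-line "descriptions" standing in for the monster.com feed.
docs = {
    "financial analyst": "build financial models and analyze financial statements",
    "accounting manager": "oversee financial reporting and the accounting close",
    "web developer": "build web applications using javascript html css",
}
vecs = {k: Counter(tokenize(v)) for k, v in docs.items()}

# Finance-flavored roles should land closer to each other than to the developer.
sim_finance = cosine(vecs["financial analyst"], vecs["accounting manager"])
sim_dev = cosine(vecs["financial analyst"], vecs["web developer"])
```

Feeding a full pairwise similarity matrix like this into hierarchical clustering is what would produce the role "map."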
In case you are less interested in the geographical view, a simple plot of heart rate and pace versus time and distance can be informative, as in Figure 2.
- Multiply ABV% by 30 to get total calories.
- Multiply ABV% by 3 to get total carbohydrates.
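The two rules of thumb translate directly to code (note the text doesn't specify a serving size or carbohydrate units, so treat both as per-beer estimates):

```python
def beer_estimates(abv_percent):
    """Back-of-the-envelope nutrition for one beer from its ABV%,
    using the two rules of thumb above."""
    return {
        "calories": 30 * abv_percent,  # rule of thumb: ABV% x 30
        "carbs": 3 * abv_percent,      # rule of thumb: ABV% x 3
    }

# A 5% ABV beer comes out to roughly 150 calories and 15 carbs.
est = beer_estimates(5.0)
```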