Hierarchical Job Clusters


Last week, I gave a guest lecture at NYU School of Engineering to Financial Engineering students on a Data Science topic.  The lecture covered Unsupervised Learning techniques: PCA, K-means, and Hierarchical clustering.  I started the lecture by discussing the breadth of potential career paths in Finance.  On the train ride home after the lecture, it occurred to me that I could use a clustering algorithm to build a map of the roles within a bank.  I think this would be helpful because there are universes within universes inside these large banks that only a select few gray-haired veterans truly understand.  Wouldn’t it be nice if we could draw a diagram showing that?

The Data

The most obtainable data on jobs that I could think of are the job descriptions themselves.  The goal would be to relate the descriptions to one another, a natural language processing (NLP) problem.  However, the first issue was getting the job descriptions to create a first pass/proof of concept.  Using monster.com’s “rss” feed, I was able to get about 15 descriptions for each of 4 separate job searches: ‘financial analyst’, ‘accounting manager’, ‘web developer’, and ‘pharmaceutical sales’.

The Code
Natural Language Processing (NLP) is notoriously difficult, and I had never used any of the Python NLTK libraries to solve a problem.  Fortunately, bloggers have posted proof-of-concept prototypes.  In particular, Brandon Rose’s approach for clustering movie descriptions was highly translatable to job description clustering.  Briefly, after vectorizing each job description, pairwise distances were calculated between the vectors.  The vectors were then clustered using Ward’s linkage function.
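The code itself isn’t reproduced in this post, but the pipeline can be sketched in a few lines, assuming TF-IDF vectorization and cosine distance (the choices in Brandon Rose’s tutorial); the `descriptions` list is a stand-in for the scraped job descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import ward

# Stand-in for the scraped job descriptions
descriptions = [
    "build financial models and analyze quarterly earnings",
    "analyze financial statements and build forecasting models",
    "develop web applications in javascript and python",
    "maintain web sites and develop front end applications",
]

# 1) Vectorize each description (TF-IDF down-weights common words)
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(descriptions)

# 2) Pairwise cosine distance between the vectors
dist = cosine_distances(X)

# 3) Cluster with Ward's linkage; ward() expects a condensed distance matrix
linkage_matrix = ward(squareform(dist, checks=False))

# Pass linkage_matrix to scipy.cluster.hierarchy.dendrogram to draw the tree
```

Each row of `linkage_matrix` records one merge (the two clusters joined and their distance), which is exactly what the dendrogram below visualizes.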

Much more work is required to take this seriously.  Specifically: 1) gather more data, 2) optimize the vectorization thresholds, 3) scrub/vet the data appropriately, and 4) pick a good distance measure.

In the dendrogram below (cropped so the text shows properly), the horizontal length of the legs joining two nodes is proportional to their dissimilarity: nodes joined by short legs are highly related.  Of the 4 job searches, web developer stands alone, and its descriptions are highly related to each other.  Pharmaceutical sales is also well clustered, although it is closely linked to accounting manager in some cases.  Financial analyst, a very vague title, tends to be sprinkled in.
Figure 1: Cropped dendrogram of 15 job descriptions from each of 4 searches: ‘financial analyst’, ‘accounting manager’, ‘web developer’, ‘pharmaceutical sales’.

Harness Your Fitbit Data

I am certain that ownership of a Fitbit is highly correlated with obsessive-compulsive behavior and a love of data, but I only have this Sedaris essay to corroborate that.  When I got my Fitbit Blaze, I thought the pedometer/heart rate monitor/tracker would be an excellent way to generate personal biodata for me to analyze.  I was disappointed by the limited dashboard views they offer.  Although Fitbit claims that “Your data belongs to you…”, you only get a very small view, as pointed out here.  Letting their user base geek out and obsess over their data isn’t high on their priority list.  To that end, I am sharing an approach to dissecting your outdoor runs.
Data from outdoor runs
In an exceptional case, Fitbit lets the user access the results from a run as an XML file (*.tcx).  During my training for the 2016 Hartford marathon, I wanted a way to view my runs to track progress.  In addition, the data might offer some way to improve.  For example, beginner runners manage pace poorly, either running too fast in the beginning and draining the tank, or being too conservative and holding back too much (me).  The tcx file offers two response variables of interest: pace (e.g., minutes per mile) and heart rate (beats per minute).  A short Python script converts the XML file into a dataframe, which allows a couple of interesting views.
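The script isn’t shown in the post; below is a minimal sketch of the XML-to-dataframe conversion, assuming the standard Garmin TCX schema that Fitbit exports (one `Trackpoint` element per sample, carrying time, distance, heart rate, and position):

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Namespace used by Garmin/Fitbit TCX exports
NS = "{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}"

def tcx_to_df(source):
    """Parse a .tcx file (path or file-like object) into a tidy dataframe."""
    rows = []
    for tp in ET.parse(source).getroot().iter(NS + "Trackpoint"):
        def text(path):
            el = tp.find(path)
            return el.text if el is not None else None
        rows.append({
            "time": text(NS + "Time"),
            "distance_m": text(NS + "DistanceMeters"),
            "hr_bpm": text(NS + "HeartRateBpm/" + NS + "Value"),
            "lat": text(NS + "Position/" + NS + "LatitudeDegrees"),
            "lon": text(NS + "Position/" + NS + "LongitudeDegrees"),
        })
    df = pd.DataFrame(rows)
    df["time"] = pd.to_datetime(df["time"])
    for col in ["distance_m", "hr_bpm", "lat", "lon"]:
        df[col] = pd.to_numeric(df[col])
    return df
```

Pace then falls out of successive time and distance deltas, and the lat/lon columns drive the map views below.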
Central Park Training Run
For people who run the same route, a map view can give insight into where particular spots are pain points.  For many years, I ran the same loop in Central Park (6.1 miles).  Figure 1 shows my heart rate and pace as a function of location in the park.  There’s additional info that could be layered on, like change in altitude, which might help explain the slowdowns and increased heart rate (the dark red in the upper right corner is a very steep hill in Central Park).  Also, the graph shows that I am clearly starting too slowly!
Figure 1: Heart rate (left) and pace (right) as a function of location in Central Park. An “X” marks the starting point of the run and the arrows give the direction of the run. This graph can be layered on top of a satellite image using the Python basemap package, but at the time of writing, my interpreter chose not to cooperate.

In case you are less interested in the geographical view, a simple plot of heart rate and pace versus time and distance can be informative, as in Figure 2.

Figure 2: Heart rate (left) and pace (right) as a function of time and distance. The red dashed lines show 70%, 80% and 90% of max heart rate (measured as 220-age).

Beer Nutrition

Figure 1: Calories, carbohydrates, and ABV % are roughly linearly related. Data on 70+ beers reveals some useful rule-of-thumb calculations you can use to make estimates. For example, multiply your ABV % by 30 to estimate the number of calories in your beer.
I like beer and I like data, so it is natural that I would love to play with data on beer.  I like the abundance of flavor in beers with higher ABV, but began wondering: at what nutritional cost?  Extra calories, carbs, both?  It’s clear one could pick a Michelob Ultra to minimize calories, but let’s face it, compared to an IPA, that’s like a paper fan compared to an AC unit.
Data Description
I found some beer nutrition data online to start the analysis with.  Ultimately, I would like to build a scraper to capture some more data.  There are three attributes I am interested in: Calories, Carbohydrates, and ABV %.
A factor plot is a convenient way to get a quick glance at the relationships between the three attributes (Figure 2).  It shows the linear relationship between them, which is expected because they are intimately related: carbohydrates translate directly into calories.  In fact, Google says that one gram of carbs is ~4 calories.  Figure 1 shows this in more detail. The slope of the regression line implies slightly more calories per carb in beer: ~6.7 calories per gram of carbs.  The fit is skewed by the “skinny” beers with under 10 grams of carbs, where the first few carbs don’t seem to make a caloric difference. This is probably due to engineering trade-offs: the <5g carb beers are probably missing something that gives the 5-10g carb beers some taste.  The intercept is ~69 calories, indicative of some minimum non-carb contribution to the calorie count.
Other interesting relationships are available too because of the linear relationship.  ABV is typically the most available stat on the beer label.  From this, you can estimate calories and carbs using these handy rule-of-thumb calculations:
  • Multiply ABV% by 30 to estimate total calories.
  • Multiply ABV% by 3 to estimate total carbohydrates (grams).
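As a sanity check, the two rules can be expressed as a tiny helper; the 30x and 3x factors come straight from the fit above, and the example beer is hypothetical:

```python
def beer_estimates(abv_percent):
    """Rule-of-thumb nutrition estimates from ABV% alone."""
    return {
        "calories": 30 * abv_percent,  # ~30 calories per ABV point
        "carbs_g": 3 * abv_percent,    # ~3 g of carbs per ABV point
    }

# e.g., a hypothetical 6.5% ABV IPA
est = beer_estimates(6.5)  # {'calories': 195.0, 'carbs_g': 19.5}
```

So a pint of that IPA runs about the same calories as a bagel, which is the kind of trade-off the label never tells you.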
Figure 2: Factor plot between ABV %, calories and carbohydrates. Red line is a regression with shaded area showing 95% confidence interval. There is a strong linear relationship between these correlated variables.

Child Care Business in NJ

Figure 1: Same data as the bar plot in Figure 2, but in map format. The left side shows that the addressable market metric is proportional to population density (coastal regions have more people and a higher metric). The right side shows that the ratio metric normalizes this, since the colors become a little more random. Note the depressed ratio in counties near NYC, perhaps indicative of economic factors.
Continuing my exploration of graphing and maps, I went to NJ.gov to find publicly available data to play with.  The first dataset I came across was on Licensed Child Care Centers.  Fortunately for society, licensing is a public safety mechanism that creates requirements for all child care centers and requires their premises to get inspection certificates, which are available online.  In addition, there are statistics on each center’s location, capacity, and min/max age. This led me to wonder: if I were to open a business, I’d prefer to choose a place that is underserved (ignoring Hotelling’s law; I am not an economics major, but maybe this business domain is an exception to the law).
The US Census Bureau has 2017 estimates for each county’s population for children under 5.  I calculated two metrics for each county:
  • Addressable Market: using this term very informally, I defined it as the number of children under 5 in the county minus the county’s child care center capacity for children under 5.
  • Ratio of children under 5 to child center capacity.
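In pandas terms, both metrics are one-liners once the Census and licensing data are joined by county; the numbers below are made up purely for illustration:

```python
import pandas as pd

# Illustrative numbers only, not the actual NJ data
df = pd.DataFrame({
    "county": ["Ocean", "Warren"],
    "under5_pop": [30000, 5000],  # Census estimate, children under 5
    "capacity": [900, 100],       # licensed child care center capacity
})

# Addressable market: children under 5 beyond existing capacity
df["addressable_market"] = df["under5_pop"] - df["capacity"]

# Ratio of children under 5 to one child care center spot
df["ratio"] = df["under5_pop"] / df["capacity"]
```

Note the two metrics can disagree: a populous county can have a huge addressable market but an average ratio, which is exactly why both views are worth plotting.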
Figure 2: Addressable market (left) and ratio (right) metrics by county. Average ratio is 35:1, so taller bars in ratio plot represent opportunity.
The left side of Figure 2 shows that Ocean, Middlesex, and Essex each have 20-30k children who wouldn’t have a care center spot in their county if every center operated at capacity (and every child were looking, of course).  Warren, Salem, and Hunterdon have a smaller issue by population size.  While this metric provides an idea of how many children cannot be served, it doesn’t give much insight into opportunity, because the counties on the left side are ordered roughly in proportion to their population.


The ratio of children under 5 to capacity gives a normalized view.  On average, counties have a ratio of 35 children to 1 child care center spot.  In other words, there is capacity for about 3% of children under 5.  The right side of Figure 2 shows 6 counties with a ratio of at least 40:1: Cape May, Salem, Sussex, Ocean, Atlantic, and Warren.


Figures 1 and 2 show the same data in different formats.  In Figure 1, the first metric (left side) shows darker regions along the coast, which reflects the point above that this metric is proportional to population density (which is higher near the coast). Using the normalized metric, the colors become a little more random (right side). The area close to NYC shows lower ratios, perhaps driven by higher demand for child care services from people who commute to NYC.


Several factors determine the viability of a child care center business.  If I were really going to look into business opportunities, I would want to cut the data by socio-economic attributes.


Rush Hour at Penn Station

New Jersey commuters are all aware of the third-world-esque brutality that ensues when the evening monitor at Penn Station posts their train’s track number.  Entrances onto the tracks were designed for significantly fewer people.  The urgency to get on the train brings out the Mike Tyson in the sweetest old lady with a cane; a book bag to the face is a standard tactic to make room.  While being kicked from the back and inhaling the scent of the sweaty gentleman in front of me, it occurred to me: there has to be a better way. Can I predict the track for my train prior to the posting? Then, while I sit comfortably in my seat having a train beer, most passengers would still be scrumming with their NJT compatriots to get onto the platform.  Cue the Dr. Evil laugh.

Get the data

I sensed there was some method to the posting rationale from observing the way people hung out by certain tracks.  I took my data obsession to the next level by writing a Python script, hosted on Amazon Web Services, to scrape data every minute from the posting website.  After collecting data for over a month, I put together some analysis that I hope you find enlightening; more to come.
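The scraper itself isn’t shown here, but the polling loop can be sketched with the actual fetching abstracted into a callable (the real script would fetch and parse the NJT departure board page, whose URL and markup aren’t reproduced):

```python
import csv
import time
from datetime import datetime, timezone

def poll_board(fetch, out_path, interval_sec=60, max_polls=None):
    """Append one timestamped snapshot of the departure board per poll.

    `fetch` is any callable returning the current board as a list of
    (train_number, track) tuples; track may be "" if not yet posted.
    """
    n = 0
    while max_polls is None or n < max_polls:
        ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
        with open(out_path, "a", newline="") as f:
            writer = csv.writer(f)
            for train, track in fetch():
                writer.writerow([ts, train, track])
        n += 1
        if max_polls is None or n < max_polls:
            time.sleep(interval_sec)
```

The key signal is the first minute at which a track number appears for each train, which is what the announce-time analysis below is built on.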

Data Description

There are 4 main train lines that have service to Penn Station: Montclair-Boonton, Morristown, Northeast Corridor, and North Jersey Coast.  In the graphs below, the plots use the same color as the posting site (e.g., cornflower blue for North Jersey Coast).  The focus is on my particular interest: weekday rush hour, defined as Monday through Friday, 4p-8p.

Are the trains on time?

Before predicting track postings, I explore a key part of passenger frustration: track “announce” times.  On average, rush hour trains are announced ~10 minutes prior to planned departure.  If “on time” were measured by having an announced track prior to departure time, only 3% would be late, seemingly a small number.  However, this isn’t a good metric for measuring “on time”.

Assume that 7 minutes are needed from track announcement to actual train departure from Penn.  In this case, 20-25% of rush hour trains have not been announced in time and can be assumed to leave Penn Station late (see dotted line).  Figure 1 below shows that although the average rush hour train is announced 10 minutes before departure, a fair number of trains aren’t announced until much later.  Alas, for one of the trains I take, Montclair-Boonton, 20% of trains remain unannounced until 3 minutes before departure.
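With the scraped data in a dataframe, the 7-minute cutoff translates into a simple calculation; the three trains here are made up, and `announced` is the first time the track appeared on the board (NaT if it never did):

```python
import pandas as pd

# Made-up example rows: scheduled departure vs. first track announcement
df = pd.DataFrame({
    "departure": pd.to_datetime(
        ["2016-06-01 17:10", "2016-06-01 17:25", "2016-06-01 17:40"]),
    "announced": pd.to_datetime(
        ["2016-06-01 17:01", "2016-06-01 17:22", None]),
})

# Minutes of warning the platform crowd actually got
df["lead_min"] = (df["departure"] - df["announced"]).dt.total_seconds() / 60

# Assume 7 minutes are needed between announcement and departure;
# never-announced trains (NaT) count as late as well
CUTOFF = 7
late = df["lead_min"].isna() | (df["lead_min"] < CUTOFF)
late_share = late.mean()
```

Sweeping `CUTOFF` from 0 to ~15 minutes and plotting `late_share` per line is essentially how the curves in Figure 1 are built.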

Figure 1: Percentage of trains that have been announced by minutes prior to the train’s departure time.

Distribution by train line:

Figure 2 shows a histogram and kernel density estimation of announce times by train line.  These show clear bumps closer to the departure time (the black dashed line is departure time), in the 2-7 minute range.  For Montclair-Boonton, the bump right before departure time (the 0th minute) is interesting; I wonder if there is some incentive or penalty around how NJT officials keep their lateness metrics. From a commuter perspective, late announcements are cruel because 1) passengers accumulate while waiting to board, 2) there is an increased sense of urgency, since people arriving as the train is announced feel their train is about to leave and push a little more, and 3) they prevent people who are a minute or two late from getting on a train that is still on the tracks (NJT erases an announced train from the board after its departure time).

Figure 2: Normalized histogram of train lines by minutes prior to departure time.  The solid lines represent kernel density estimations.

Where do you stand?

Where should one stand, literally: does this data allow one to predict which track entrance you should stand at?  To build a model, we make some assumptions about how the trains operate.  Assume each train is assigned a primary track that it should leave from, and if for some reason it cannot leave from that track, there is a secondary track.

First pass: Naive Bayes Classifier, predicting on train number only

Training and testing a model on the same data is a biased way of measuring how well it works, but it’s good for a first pass.  The first test uses a Naive Bayes Classifier on training data consisting of a train number and track number.  The figure below shows a confusion matrix.  The left side is by individual track and shows some structure; for example, trains that go to Track 2 get predicted to Tracks 1, 3, and 4.  Since Penn Station is designed so that consecutive tracks share the same entryway for passengers, tracks can be grouped by entrance (right side).  It looks like one can stand in a particular area (between the doors for Tracks 1-2/3-4) and have a good chance of being among the first aboard.
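The model code isn’t shown in the post; here is a minimal sketch of the train-number-only classifier using scikit-learn’s CategoricalNB, with made-up train numbers and tracks:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder

# Made-up history of (train number -> posted track) observations
trains = np.array([3847, 3847, 3847, 3951, 3951, 3951])
tracks = np.array([3, 3, 4, 1, 1, 2])

# CategoricalNB expects features encoded as 0..k-1 integers
enc = LabelEncoder()
X = enc.fit_transform(trains).reshape(-1, 1)

clf = CategoricalNB().fit(X, tracks)

# Most likely track for a given train, based on its posting history
pred = clf.predict(enc.transform([3847]).reshape(-1, 1))
```

Grouping predicted and actual tracks by shared entrance (1-2, 3-4, …) before computing the confusion matrix is what turns the left panel of Figure 3 into the right panel.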

Accuracy by train line is listed in the table below.  It’s not surprising that the line that announces its track very late cannot be predicted well.  This leads to the next pass at modeling: adding features that account for trains affecting each other, since an occupied track does not allow a train to dock at its primary track.

Figure 3: Confusion matrix demonstrating Naive Bayes Classifier model performance.  Left side is for predicting the exact track.  As a passenger, the interest is to be at the right entrance – the right side shows the model performance by entrance.

Destination         Samples   Accuracy
All                 1787      46.7%
Northeast Corr      762       44.1%
Morristown          412       48.8%
No Jersey Coast     397       54.7%
Montclair-Boonton   172       36.6%

Table 1: Model performance – note that the testing data set is the same as the training data set, so this is overstating the accuracy.