Child Care Business in NJ

Figure 1: Same data as the bar plot in Figure 2, but in map format. The left side shows that the addressable market metric is roughly proportional to population density (coastal counties have more people and a higher metric). The right side shows that the ratio metric normalizes this: the colors become less tied to population. Note the depressed ratio in counties near NYC, perhaps indicative of economic factors.
Continuing my exploration of graphing and maps, I went to find publicly available data to play with.  The first dataset I came across was on Licensed Child Care Centers which, fortunately for society, is a public safety mechanism: all child care centers must meet licensing requirements and have their premises pass inspection, with certificates available online.  In addition, there are statistics on each center’s location, capacity, and min/max age.  This led me to wonder: if I were to open a child care business, I’d prefer to choose a place that is underserved (ignoring Hotelling’s law; I am not an economics major, but maybe this business domain is an exception to the law).
The US Census Bureau has 2017 estimates for each county’s population for children under 5.  I calculated two metrics for each county:
  • Addressable Market: using this term very informally, I defined it as the number of children under 5 in the county minus the county’s licensed child care capacity for children under 5.
  • Ratio of children under 5 to child center capacity.
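The two metrics above are straightforward to compute per county; a minimal sketch with pandas, using illustrative numbers (the real inputs are the Census under-5 estimates and the licensed-center capacity totals, and the column names here are my own):

```python
import pandas as pd

# Illustrative county figures only, not the actual Census/licensing data.
df = pd.DataFrame({
    "county": ["Ocean", "Middlesex", "Warren"],
    "pop_under_5": [35000, 55000, 5000],
    "capacity_under_5": [8000, 30000, 3500],
})

# Addressable market: children under 5 minus the county's licensed capacity.
df["addressable_market"] = df["pop_under_5"] - df["capacity_under_5"]

# Ratio of children under 5 to one licensed spot (35 means 35:1).
df["ratio"] = df["pop_under_5"] / df["capacity_under_5"]
```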
Bar Plot
Figure 2: Addressable market (left) and ratio (right) metrics by county. The average ratio is 35:1, so taller bars in the ratio plot represent opportunity.
The left side of Figure 2 shows that Ocean, Middlesex, and Essex each have 20-30k children who wouldn’t have a care center spot in their county if every center operated at capacity (and every family were looking, of course).  Warren, Salem, and Hunterdon have a smaller gap, in line with their smaller populations.  While this metric gives an idea of how many children cannot be served, it doesn’t give much insight into opportunity, because the counties on the left side are ordered roughly in proportion to their population.


The ratio of children under 5 to capacity gives a normalized view.  On average, counties have a ratio of 35 children to 1 child care center spot.  In other words, 3% of children under 5 use child care centers.  The right hand side of Figure 2 shows six counties with a ratio of 40:1 or higher: Cape May, Salem, Sussex, Ocean, Atlantic, and Warren.


Figures 1 and 2 show the same data in different formats.  In Figure 1, the first metric (left side) shows darker regions along the coast, reflecting the point above that this metric is proportional to population density (which is higher near the coast). With the normalized metric (right side), the colors become less tied to population. The area close to NYC shows lower ratios, perhaps driven by higher demand for child care services from people who commute to NYC.


Many factors determine the viability of a child care center business.  If I were seriously evaluating business opportunities, I would want to cut the data by socio-economic attributes.


Rush Hour at Penn Station

New Jersey commuters are all aware of the third-world-esque brutality that ensues when the evening monitor at Penn Station posts their train’s track number.  Entrances onto the tracks were designed for significantly fewer people.  The urgency to get on the train brings out the Mike Tyson in the sweetest old lady with a cane; a book bag to the face is a standard tactic to make room.  While being kicked from the back and inhaling the scent of the sweaty gentleman in front of me, it occurred to me: there has to be a better way. Can I predict the track for my train prior to the posting? Then, while I sit comfortably in my seat having a train beer, most passengers would still be scrumming with their NJT compatriots to get onto the platform.  Cue the Dr. Evil laugh.

Get the data

I sensed there was some method to the postings from observing the way people hung out by certain tracks.  I took my data obsession to the next level by writing a Python script, hosted on Amazon Web Services, to scrape data from the posting website every minute.  After collecting data for over a month, I put together some analysis that I hope you find enlightening; more to come.
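The scraper itself can be quite small; a sketch of the polling loop, where the URL and the row-parsing regex are placeholders (the real departure-board page has its own markup that the parser must match):

```python
import re
import time
import urllib.request
from datetime import datetime

# Placeholders: the real board URL and row markup differ.
DEPARTURE_URL = "https://example.com/penn-station-departures"
ROW_RE = re.compile(r"<td>(\d+)</td><td>([A-Z0-9]*)</td>")

def parse_board(html: str) -> dict:
    """Map each train number to its posted track ('' if not yet announced)."""
    return {train: track for train, track in ROW_RE.findall(html)}

def poll(interval_s: int = 60) -> None:
    """Fetch the board once a minute and append timestamped postings."""
    while True:
        with urllib.request.urlopen(DEPARTURE_URL, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        stamp = datetime.now().isoformat(timespec="seconds")
        with open("postings.log", "a") as f:
            for train, track in parse_board(html).items():
                f.write(f"{stamp},{train},{track}\n")
        time.sleep(interval_s)
```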

Data Description

There are 4 main train lines that have service to Penn Station: Montclair-Boonton, Morristown Line, Northeast Corridor, and North Jersey Coast.  In the graphs below, the plots use the same color as the posting site (e.g., cornflower blue for North Jersey Coast).  The focus is on my particular interest: weekday rush hour, defined as Monday through Friday, 4-8 p.m.

Are the trains on time?

Before predicting track postings, I explore a key part of passenger frustration: track announcement times.  On average, rush hour trains are announced ~10 minutes prior to planned departure.  If “on time” were measured as having an announced track prior to the departure time, only 3% of trains would be late, a seemingly small number.  However, this isn’t a good way to measure “on time”.

Assume that NJT allows 7 minutes from track announcement to actual departure from Penn.  In that case, 20-25% of rush hour trains have not been announced in time and can be assumed to leave Penn Station late (see dotted line).  Figure 1 below shows that although the average rush hour train is announced 10 minutes before departure, a fair number of trains remain unannounced until much later.  Alas, on one of the lines I take, Montclair-Boonton, 20% of trains remain unannounced until 3 minutes before departure.
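The curve in Figure 1 is an empirical CDF of announcement lead times; a minimal sketch, using made-up lead times rather than the scraped data:

```python
import pandas as pd

# Hypothetical announce lead times in minutes before scheduled departure
# (0 means announced exactly at the scheduled time).
lead_times = pd.Series([12, 10, 9, 7, 6, 5, 3, 2, 1, 0])

def pct_announced_by(minutes_before: float, leads: pd.Series) -> float:
    """Share of trains already announced `minutes_before` minutes out."""
    return float((leads >= minutes_before).mean())

# With a 7-minute boarding buffer, trains unannounced at t-7 are
# effectively late leaving Penn Station.
late_share = 1 - pct_announced_by(7, lead_times)
```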

Figure 1: Percentage of trains that have been announced by minutes prior to the train’s departure time.

Distribution by train line:

Figure 2 shows a histogram and kernel density estimation of announce times, by train line.  These show clear bumps closer to the departure time (the black dashed line), in the 2-7 minute range.  For Montclair-Boonton, the bump right before departure time (the 0th minute) is interesting – I wonder if there is some incentive or penalty in how NJT officials keep their lateness metrics. From a commuter’s perspective, late announcements are cruel because:
  • passengers accumulate while waiting to board;
  • the sense of urgency increases – people arriving as the train is announced feel their train is about to leave and push a little harder;
  • they prevent people who are a minute or two late from boarding a train that is still on the tracks (NJT erases an announced train from the board after its departure time).

Figure 2: Normalized histogram of train lines by minutes prior to departure time.  The solid lines represent kernel density estimations.
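The per-line bumps in Figure 2 can also be summarized numerically; a sketch with hypothetical lead times (the line names are real, the numbers are not):

```python
import pandas as pd

# Hypothetical lead times (minutes before departure) for two lines.
obs = pd.DataFrame({
    "line": ["Montclair-Boonton"] * 4 + ["Northeast Corridor"] * 4,
    "lead_min": [0, 1, 3, 12, 8, 10, 11, 14],
})

# Median lead time and the share announced in the last two minutes, per line.
summary = obs.groupby("line")["lead_min"].agg(
    median="median",
    last_minute=lambda s: (s <= 2).mean(),
)
```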

Where do you stand?

Where should one stand, literally: does this data allow one to predict which track entrance to wait at?  To build a model, we make some assumptions about how the trains operate.  Assume each train is assigned a primary track that it should leave from, and if for some reason it cannot leave from that track, it is reassigned to a secondary track.

First pass: Naive Bayes Classifier, predicting on train number only

Training and testing a model on the same data is a biased way of assessing how well it works.  However, it’s good for a first pass.  The first test uses a Naive Bayes Classifier on training data consisting of a train number and a track number.  The figure below shows a confusion matrix.  The left side is by individual track and shows some structure, for example: trains that go to Track 2 get predicted to Tracks 1, 3, and 4.  Since Penn Station is designed in such a way that consecutive tracks share the same entryway for passengers, tracks can be grouped by entrance (right side).  It looks like one can stand in a particular area (between the doors for Tracks 1-2 and 3-4) and have a good chance of being among the first ones in.
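A minimal version of this first pass, with scikit-learn and made-up (train number, track) observations; with only one categorical feature, Naive Bayes amounts to predicting the track most often seen for each train number (with Laplace smoothing):

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical (train number, posted track) observations, not the real data.
trains = ["3857", "3857", "3857", "6649", "6649", "3873"]
tracks = ["4", "4", "3", "2", "2", "1"]

# Encode the single categorical feature (train number) and the label (track).
train_enc = LabelEncoder().fit(trains)
track_enc = LabelEncoder().fit(tracks)
X = train_enc.transform(trains).reshape(-1, 1)
y = track_enc.transform(tracks)

# Fit and predict on the same data, as in the first pass above.
clf = CategoricalNB().fit(X, y)
preds = track_enc.inverse_transform(clf.predict(X))
```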

The accuracy by train line is listed in the table below.  Not surprisingly, the line with the latest announcements, Montclair-Boonton, cannot be predicted well.  This leads to the next pass at modeling: adding features that capture how trains affect one another, since an occupied track prevents another train from docking at its primary track.

Figure 3: Confusion matrix demonstrating Naive Bayes Classifier model performance.  Left side is for predicting the exact track.  As a passenger, the interest is to be at the right entrance – the right side shows the model performance by entrance.

Destination          Samples   Accuracy
All                  1787      46.7%
Northeast Corridor   762       44.1%
Morristown           412       48.8%
North Jersey Coast   397       54.7%
Montclair-Boonton    172       36.6%

Table 1: Model performance – note that the testing data set is the same as the training data set, so this is overstating the accuracy.