Teaching Your Computer To Read With Machine Learning

Jupyter Notebook and prototype can be found in the project repository on GitHub.

The Problem

In today’s instant-gratification paradigm, product documentation is being replaced by YouTube videos and support calls. The willingness to understand is becoming a lost art. Technical products that require lengthy documentation are struggling to educate their customers, which results in frustration on both ends. I propose that we can supplement this documentation with a well-trained chatbot interface, which would increase customer understanding by getting users directly to the information they are looking for in a conversational way. Getting a chatbot to find documentation is simple enough; getting it to generalize well on new data is a challenge worth researching.

I envision scanning a document using my smartphone, asking its digital assistant questions about the document, and having it reply as an expert on the subject. This requires an enormous amount of computational power and the clever use of natural language processing. To start our experiment, let’s first limit the success criteria to reading in Wikipedia articles and telling us if the article supports a true or false statement provided by us, the user.

There are several parts to this chatbot. First, it has to be trained on natural language; then it reads in a document; and last, it reads the user’s input. From there it works backwards, using its understanding of language to search the read-in document for similarities to the user input. If the user input is similar enough to parts of the article, the chatbot returns true. For testing purposes, we need an article whose content is not too subjective, so that there is little debate over whether the user’s input is actually true or false.

The Rules of Chess

I have chosen the “Rules of Chess” Wikipedia article for this purpose. Once the article is read in, we can investigate how the chatbot will gain meaning from it. Processing begins with tokenization, which cuts the article up into pieces called tokens; once the stop words are removed, the relationships between the remaining tokens can be measured by various algorithms.
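A rough sketch of that read-in and preprocessing step using NLTK is shown below; the local filename is a placeholder, and the tokenizer model and stop-word list need a one-time download.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and the English stop-word list.
nltk.download("punkt")
nltk.download("stopwords")

# Placeholder: a local text copy of the "Rules of Chess" article.
article_text = open("rules_of_chess.txt").read()

stop_words = set(stopwords.words("english"))
tokens = [
    token.lower()
    for token in word_tokenize(article_text)
    if token.isalpha() and token.lower() not in stop_words
]

print(tokens[:20])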

Word tokens as a word cloud generated from Wikipedia’s “Rules of Chess” article and processed for NLP.

Developing our Model

One such approach is the n-gram. An n-gram is a token made of one or more consecutive words, and counting n-grams tells us how many times each one appears in the article. Based on the word cloud, if we give the n-gram counter the word “move” it will return an integer for how many times that word appears. Tokens don’t have to be a single word; they can be two words, three words, up to “n” words. My thought was that we could assign higher weights to tokens with more words in them, because they occur less often and tend to represent a relevant statement.
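As an illustration, counting n-grams needs nothing more than the Python standard library once we have the token list from the preprocessing step; the toy token list below is only for demonstration.

from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Toy token list standing in for the processed article.
tokens = ["king", "moves", "one", "square", "king", "moves", "diagonally"]

print(ngram_counts(tokens, 1)["moves"])       # 2: how often the single word appears
print(ngram_counts(tokens, 2)["king moves"])  # 2: longer n-grams are rarer and could be weighted higher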

This approach works well for a basic search engine, but what about finding meaning in the words? That is where Gensim’s Word2Vec model comes in. Word2Vec calculates a vector for each word based on the other words that appear around it, and this is the part of our strategy that requires a lot of data and computing power.

Instead of training the model from scratch, we will use a pre-trained model from Google. This model was trained on over 100 billion words and will likely give us better results; unfortunately, there is no equivalent shortcut for the n-gram model. Once the model is equipped with data, we need to find the right parameters. To avoid burning up my notebook, we run a random search on a Compute Engine virtual machine through Google’s Cloud Console. Even after cannibalizing training data for faster performance, it took nearly 17 hours to test 100 iterations of random parameters. Loading the pre-trained vectors is sketched just below, and the results of the random search follow, with an accuracy score just barely touching 80%.
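A minimal Gensim sketch for loading Google’s pre-trained vectors is shown here; the roughly 3.6 GB binary file has to be downloaded separately, and the filename below is the conventional one.

from gensim.models import KeyedVectors

# Pre-trained Google News vectors: about 3 million words and phrases, 300 dimensions each.
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(word_vectors.similarity("king", "queen"))    # cosine similarity between two words
print(word_vectors.most_similar("chess", topn=5))  # nearest neighbors in the vector space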

gate   weight_mod         window   epochs   vector_scope   vector_weight
3      1.74327329492608   13       9        12             14.1647849325647

Top model parameters by accuracy on a full ×100 random search.

The gate is the threshold parameter that decides whether the user input is classified as true or false. After several tests, I found that a gate of 20 tends to give better results. This is likely an effect of the over-fitting caused by the lack of computational resources and the data cannibalization. Nevertheless, I used a gate value of 20 in the prototype.
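To make the role of the gate concrete, here is a simplified sketch of the decision step. The scoring function, which sums each input token’s best word-vector similarity against the article tokens, is an illustration of the idea rather than the prototype’s exact formula.

def relevance_score(input_tokens, article_tokens, word_vectors):
    """Sum the best word-vector similarity each input token achieves against the article."""
    article_vocab = [t for t in set(article_tokens) if t in word_vectors]
    score = 0.0
    for token in input_tokens:
        if token in word_vectors:
            score += max(
                (word_vectors.similarity(token, a) for a in article_vocab),
                default=0.0,
            )
    return score

def is_supported(input_tokens, article_tokens, word_vectors, gate=20):
    """Return True when the relevance score clears the gate threshold."""
    return relevance_score(input_tokens, article_tokens, word_vectors) > gate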

The testing data was designed in three sections. The first was made of the read-in “Rules of Chess” article itself, split into sentence tokens; these inputs should all be predicted as true. The second was a random Wikipedia article and is expected to be predicted as false. The last was a handwritten set of queries labelled true or false accordingly. Keep in mind that true or false here refers to relevance to the article, not absolute truth. For example, we can ask the chatbot whether the earth is round, but if we read in the article about chess it should return false. Below is the transcript from that interaction with the prototype.

-> Hello!
-> I am the prototype chatbot interface for Wikipedia documents.
-> What topic would you like to discuss?

<- Rules of Chess

-> Please allow me a moment to train on rules of chess.
-> Okay I've got it, tell me a fact and I'll tell you if the article I read supports your fact.

<- The earth is round

-> That is false.

However, if we continue the conversation to query information about the rules of chess, we will be provided with expected results.

<- In the game of chess, there are 2 kings, 2 queens, and 1 chessboard.

-> That is true.

Here the chatbot returns true. It achieves this because the user input contains enough relevant information that also appears in the read-in article to score our fact above the gate parameter. Unfortunately, one limitation of this algorithm is its understanding of quantity: we can change the numbers in our input to anything we want and produce the same result.

<- In the game of chess, there are 16 kings, 16 queens, and 1 chessboard.    

-> That is true.

The deciding factor for this input was the word chessboard. If we increase the number of chessboards and add the “s”, the chatbot returns false. The number before the “chessboard” token carries almost no meaning to the chatbot. As for the 16 kings and queens, the 16 only adds to the relevance because each player has 16 pieces and that number appears elsewhere in the article. We could exploit this limitation by repeating top words in the same input.

<- move rules move rules move rules

-> That is true.

This is where we could debate whether this user input should be classified as true or false. Should we discount it because it’s not grammatically correct, or is its relevance to the read article enough to count as a true statement? That is a question for whoever puts this model into production, and the answer would likely be specific to the use case.

What Have We Learned?

Which brings us back to the purpose of this model’s development. Is it reasonable to deploy this across a general platform such as a digital assistant? Likely not. Providing a binary relevance score could be an essential building block of an impressive chatbot, but it would require specific scripting and constant training on new user input, and that new input would need to be classified by hand to raise performance to acceptable scores. Over time, and with a deep analysis of user data, this chatbot could provide respite from support calls on specific subjects. But we are a long way from the instant transfer of knowledge that we set out to accomplish.

Given the complications introduced by natural language processing, it seems that using these techniques for general-purpose machine understanding is not yet practical in a user environment. This is likely why we see chatbots providing scripted responses and digital assistants providing answers without understanding: it simply takes too much power to create machine understanding in a reasonable amount of time. The narrower the scope of the chatbot, the better. At least for the time being, we are still going to need to do our own research and build our own understanding of product documentation. However, enterprise-level chatbots designed for a specific understanding are becoming a proven concept and, with the right resources, will prove to be an invaluable asset in the customer service field.

Predicting Airline Flight Delays

Imagine yourself on the last day of work before the big family trip. You have coordinated with your manager to get extra time off, shopped for weeks to get the very best travel deals, and your phone gives you a notification that it’s time to check in for your flight. You choose preferred seating so that the kids don’t have to crawl on top of you to go to the bathroom, and you arrange for friends to pick you up. The next morning you pile everyone into the car, rush through security, and make it to your gate just in time to find out that the aircraft was rerouted due to a storm, leaving the airline scrambling for another aircraft. All the planning, preparation, and preferred seating are gone.

Wouldn’t it have been nice to know how likely this could happen when you checked in?

In this analysis we look at ways to provide the technology to do just that. The data we will be using is historical and contains information from over 450,000 flights in the United States during January 2017. To view the detailed analysis and the code showing how I arrived at these conclusions, check out the Jupyter Notebook on GitHub.

The first thing to look at is which factors contribute to flight delays. We all know a severe storm halts travel plans, and our tools are pretty good at predicting those. What we are concerned with are the little ripples in airline traffic that create the smaller, surprise delays. Factors recorded in our data set include departure time, taxi out, taxi in, arrival time, cancellations, diversions, distance, weather delays, and security delays, just to name a few.

Once we isolated which pieces of data to use, we could start identifying and visualizing correlations. Logically, we expect departure time and arrival time to be strongly correlated, along with distance and air time. What was most interesting is the shape of departure delays plotted against late-arriving aircraft, which shows that not all late departures result in a late arrival.
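A quick way to check those relationships is a pandas correlation matrix over the delay-related columns. The filename and column names below follow the BTS on-time performance convention and may differ slightly in the data.world export.

import pandas as pd

# Placeholder filename for the January 2017 on-time performance data.
flights = pd.read_csv("2017_jan_ontime_flights.csv")

delay_cols = ["DEP_DELAY", "ARR_DELAY", "TAXI_OUT", "TAXI_IN",
              "AIR_TIME", "DISTANCE", "LATE_AIRCRAFT_DELAY"]

# Pairwise Pearson correlations between the delay-related features.
print(flights[delay_cols].corr())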

Scatter plot of late aircraft delays versus departure delays. What is interesting is that there is an arc to the shape, and late departures carry a linear ceiling on late aircraft delays.

We now have an idea of how the features in the data work together. My hypothesis is that late-arriving flights cause a reverse ripple effect, measured over time, in late departures at the destination airport. Logically this makes sense: if a Boeing 737 is booked to leave Chicago at 8:00 AM Central, arrive in Miami at 11:32 AM Eastern, and is delayed by security for 20 minutes, the subsequent flight that aircraft has been booked for out of Miami will be affected by that delay. I suspect the arc in the scatter plot above is created by the various countermeasures the airlines and ATC employ to reduce the reverse ripple effect. If an early aircraft takes the flight that the late one could not make, the delays become reduced and hard to track. This is where measuring a local delay ratio comes into play.

The delay ratio is calculated by summing all the flights that have been delayed at an origin airport and dividing by the total number of flights departing that origin. The trick is narrowing the scope by location and time; doing so produces a meaningful measurement that does not generalize too much.
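In pandas, that calculation is a groupby over origin airport and day. The column names are again assumed to follow the BTS convention, and a flight is counted as delayed here when its departure delay exceeds 15 minutes.

import pandas as pd

flights = pd.read_csv("2017_jan_ontime_flights.csv")  # placeholder filename

# A flight counts as delayed when it departs more than 15 minutes late.
flights["delayed"] = flights["DEP_DELAY"] > 15

# Delay ratio = delayed departures / total departures, per origin airport per day.
delay_ratio = (
    flights.groupby(["ORIGIN", "FL_DATE"])["delayed"]
    .mean()
    .rename("delay_ratio")
    .reset_index()
)

print(delay_ratio[delay_ratio["ORIGIN"] == "ORD"].head())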

Reverse ripple shown from ORD, LAX, and DEN. Other delay ripples can be seen; however, the blue line shows a ripple that has been clearly captured.

The line plot above shows the delay-ratio trends across three airports. There is a clear reverse ripple originating in O’Hare, sweeping through Los Angeles, then appearing as a smaller peak in Denver. This means we could have predicted the number of delayed flights in Denver from the number of delayed flights on aircraft targeting Denver as their final destination. This is just one example of delay ratios leading to delays at other airports.

For the scope of this analysis, we will look at the top one hundred airports in the US, categorized by the nine census regions of the United States. Below is the trend found in the New England region.

New England delay ratios calculated by the top 100 airports in the US.
Middle Atlantic ratios calculated by the top 100 airports in the US.
South Atlantic delay ratios by top 100 airports in the US.

When comparing the Eastern delay-ratio trends, we can see that the delays are similar between the regions. One explanation for the spike in delays near the beginning of the month is the severe winter storm originating in Philadelphia on January 7th. The trend shows those delays splashing across to the other regions.

Now we can use machine learning to predict the delay ratios by region. Several machine learning models were attempted. Some are noted in my Jupyter Notebook. The algorithm that performed best on the test data was support vector regression.
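A minimal scikit-learn version of that modeling step is sketched below. The feature matrix here is synthetic stand-in data; in the notebook, the features are the regional delay ratios described above.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Synthetic stand-in data: 31 days of January, 3 upstream delay-ratio features.
rng = np.random.default_rng(0)
X = rng.random((31, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, 31)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = SVR(kernel="rbf", C=1.0, epsilon=0.01)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, predictions))
print("MAE:", mean_absolute_error(y_test, predictions))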

SVR model predicting flight delays against the test data by region.

Using the US regions as test data, all regions performed with a mean squared error and mean absolute error below 0.1. Data modeling parameters and metrics are recorded in the GitHub repository.

Let’s narrow our scope and test our model against the Middle Atlantic flight delays.

SVR prediction on Middle Atlantic flights.

Now we can use our model to provide a delay-ratio forecast to customers at check-in. The scope of our data was daily delays over the course of a month, but the same model could be applied over hours as well. By watching the ripples created when delays occur in one area, we could follow those aircraft and give passengers an estimate of what to expect when they arrive at their gate, leaving them time to prepare for possible delays. ATC and airlines could use monthly forecasts, like the one performed here, to gain insight into which aircraft should be prepared, and where, in case of a delay; a delayed flight could then be taken on the held aircraft, reducing the ripples through the day. Airlines could also use the hourly delay forecast to incentivize passengers to download their mobile apps, while providing only a static forecast at check-in.

The current data used to train the model was limited to one month. During this month, there was a winter storm causing significant delays originating in the Southeast, and this storm could be throwing the model off when training for non-weather-related flight delays. The single month of data is also an unacceptable constraint for production. For more accurate predictions, we would want to use years of data to capture how seasons affect flight delays.

Sources

2017-Jan-OnTimeFlightData-USA

  • Data.world

Top 100 airports of 2017

  • worldairports.com

Top 100 airports in US

  • fi-aeroweb.com

Census Region Division of the US

  • US Census

Storm Prediction Center

  • spc.noaa.gov

US COVID-19 Data Analysis

Project analysis of the New York Times COVID-19 data.

Project Notebook

The data is puzzling.

The COVID-19 pandemic has become a point of high controversy in the US. Opinions have been muddled by differing classification practices and an oversaturation of data. Some sources claim the entire response to COVID-19 is a political hoax, and Fox News suggests that the deaths reported in the US are inflated; yet the entire world seems to have gone into lockdown. Here we will attempt to gain insight into the integrity of the COVID-19 data graciously provided by the New York Times. We will also reference the 2019 US Census data for perspective.

2019 US Census data by state, ranked by population. The COVID-19 infections graph begins at the yellow line, and the COVID-19 deaths graph at the red line. The yellow and red arrows are included to make explicit that the following graphs are not measured on the same x-axis; the red line does not appear on the population graph.

We can see from the vertical yellow line that infections are measured at a relatively low number compared to the entire population. Be aware that the vertical yellow line is set by the state with the highest number of infections in the US, which we will see clearly in the plot below. According to our data, the average infection rate by population is 1.3%. Another point that must be made is that we are not measuring time: the number of people recorded as infected is not the same as the number of people who are currently infectious. Although these numbers are large, this data could support the idea that the CDC’s preventative measures are working, or that the virus is not being properly measured; perhaps a mix of both.

COVID-19 confirmed and probable infections in the US by state from the New York Times data set.

The infections plot carries a similar distribution to its parent populations plot. At a glance, we could say that a higher population contributes to a higher number of people infected, until we get to Washington state. Washington appears to have less COVID-19 activity than other states with similar populations. If we check the states with the most aggressive countermeasures to COVID-19, Washington ranks in 5th place; its measures include a four-step, data-driven plan with excellent feature definitions. Unfortunately, our New York Times data set does not show evidence that these aggressive measures have reduced COVID-19 activity compared to the other nine states on the list.

COVID-19 confirmed and probable deaths in the US by state from the New York Times data set.

When viewing the deaths chart, we would expect the number of deaths reported to be strongly correlated with the number of infections, but this is clearly not the case. Alaska reports deaths per infections of 0.5%, while Connecticut has a reported death rate of 8.83%. Either something is causing COVID-19 survival rates to vary widely from state to state, or the reporting is inconsistent; there are arguments for both. The data says the average death rate in the US is nearly 3%, which is especially troubling compared to Germany’s 0.4%. The argument in that article is that Germany has tested more of its population and found more asymptomatic infections, resulting in a higher infection rate and a lower death rate. Fox News reports that people who die with COVID-19 are being recorded as victims of COVID-19 regardless of other factors contributing to death. Without an international standard for logging these deaths, the comparison above carries little weight.

To dive deeper into the data, below is a table of the infection rate and death rate for each state, along with some summary statistics.

Table of New York Times data set COVID-19 infection and death ratios, where infection rate = infections / population and death rate = deaths / infections. Ordered by descending population size.

The most interesting thing about this table is the range of death rates compared to the range of infection rates. If we assume the behavior of the COVID-19 virus is constant, this range should show little variation. Instead, the data shows a spread of 8.29 percentage points in the death rate depending on which state you are in. This could be evidence that hospitals in states showing above a 5% death rate are unable to provide the same treatment as other states, or that the data is not being reported in the same manner.
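As a sketch of how the two ratios in this table can be reproduced: the NYT repository’s us-states.csv holds cumulative cases and deaths per state per day, while the census filename and its columns below are assumptions.

import pandas as pd

# Cumulative confirmed and probable cases and deaths per state, from the NYT repository.
states = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
)
latest = states.sort_values("date").groupby("state").last()

# Placeholder census file with columns: state, population.
population = pd.read_csv("census_2019_state_population.csv").set_index("state")

summary = latest.join(population)
summary["infection_rate"] = summary["cases"] / summary["population"]  # infections / population
summary["death_rate"] = summary["deaths"] / summary["cases"]          # deaths / infections

print(summary[["infection_rate", "death_rate"]].describe())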

So what states are getting the best COVID-19 metrics?

Both Alaska and Hawaii have infection rates and death rates lower than 1%. This could be an indicator that isolation is the best preventative measure to COVID-19.

The story that we get from this data is difficult to take. It shows that COVID-19 is more deadly depending on where you are in the US and isolation lowers the chances of infection and death. It’s a tough story to wrap our heads around because infection and death should be dependent features. The range brings up many questions worth investigating.

Sources

Kaggle Covid-19 Dataset – Scientific community call to action

  • Kaggle

New York Times COVID-19 data

  • GitHub

Coronavirus myths: Don’t believe these fake reports about the deadly virus

  • Cnet

2019 Census Data

  • Census.gov

Cases & Deaths by County

  • CDC

Coronavirus hype biggest political hoax in history

  • Washington Times

Fox News shares that the deaths reported in the US are inflated

  • Fox News

government is classifying all deaths of patients with coronavirus as ‘COVID-19’ deaths, regardless of cause

  • Fox News

Safe Start Washington

  • Washington State

10 States with the Most Aggressive Response to Covid

  • US News

Why is Covid-19 death rate so low in Germany?

  • CNN

Our ongoing list of how countries are reopening, and which ones remain under lockdown

  • Business Insider

CDC’s ‘best estimate’ is 40 percent COVID-19 infections are asymptomatic

  • Fox News

Population density in the U.S. by federal states including the District of Columbia in 2019

  • Statista
