How sure are you?

It's the 7th inning.

Your team is winning 5-3.

Your top bullpen arm is on the mound.

Their top of the order is up.

How confident are you?

This is a loaded question. Your team is up 2 runs with 9 outs to go, so you probably like your chances, but how much do you like them?

In reality, it depends.

What if the top of their order is in a combined 5-for-50 slump? Then you love your chances! What if the guy leading off always kills your team? Then you get incredibly worried if he gets on base! What if your top bullpen arm lost the strike zone 2 weeks ago and can't find it (Carl Edwards Jr....)? I'll let the Cubs fans answer that one.

I've spent some time working on quantifying the answer to these questions. MLB Advanced Media and Baseball Reference make it incredibly easy to scrape their websites and mine data. To start answering the confidence question, I pulled data on every at-bat of the 2016 and 2017 MLB seasons into R with some simple HTML scraping of their sites.

Once the data was pulled in, I did some manipulation so that it read more like a box score you would see online.

Something like this:

People forget the Cubs won the 2016 World Series.

Continuing with the manipulation, I used regex to extract the team names from the gameday links and attach the home team name and away team name, since team name isn't readily available for whatever reason.
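The original work was done in R, but the idea of pulling team names out of a gameday link can be sketched in a few lines of Python. This is only an illustration: it assumes the link contains an id shaped like `gid_YYYY_MM_DD_<away>mlb_<home>mlb_<game number>` with the away team listed first, and the function name `parse_gameday_link` is made up for this example.

```python
import re

# Assumed id layout (not confirmed by the post):
#   gid_YYYY_MM_DD_<away code><league>_<home code><league>_<game number>
GID_PATTERN = re.compile(
    r"gid_(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})"
    r"_(?P<away>[a-z]{3})[a-z]{3}"
    r"_(?P<home>[a-z]{3})[a-z]{3}"
    r"_(?P<game_number>\d+)"
)


def parse_gameday_link(link):
    """Return the date and away/home team codes found in a gameday link.

    Returns None when the link does not contain a recognizable game id.
    """
    match = GID_PATTERN.search(link)
    if match is None:
        return None
    return {
        "date": f"{match['year']}-{match['month']}-{match['day']}",
        "away_team": match["away"],
        "home_team": match["home"],
        "game_number": int(match["game_number"]),
    }


# Hypothetical id used only to exercise the pattern.
print(parse_gameday_link("gid_2016_11_02_chnmlb_clemlb_1"))
```

The three-letter codes would still need a lookup table to turn them into full team names.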

Afterwards I grabbed the winner of every game for the last two years and calculated the probability that the home team or the away team won given any combination of inning, inning side (top or bottom of the inning), home team score, and away team score. After that, I wrote up a quick function and bam! Results and predictions!
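A rough Python sketch of that lookup is below (the real version lives in R). It counts, for each game state, how often the home team went on to win. The row layout, the function names, and the toy rows are all invented for illustration and are not the scraped data.

```python
from collections import defaultdict


def build_win_table(rows):
    """Map each (inning, side, home_score, away_score) state to the share of home wins.

    Each row is (inning, side, home_score, away_score, home_won).
    """
    counts = defaultdict(lambda: [0, 0])  # state -> [home wins, observations]
    for inning, side, home_score, away_score, home_won in rows:
        state = (inning, side, home_score, away_score)
        counts[state][0] += int(home_won)
        counts[state][1] += 1
    return {state: wins / total for state, (wins, total) in counts.items()}


def home_win_probability(table, inning, side, home_score, away_score):
    """Look up the home win share for a state, or None if it was never observed."""
    return table.get((inning, side, home_score, away_score))


# Toy rows made up for this example, one row per game per state.
toy_rows = [
    (7, "top", 5, 3, True),
    (7, "top", 5, 3, True),
    (7, "top", 5, 3, True),
    (7, "top", 5, 3, False),
    (9, "top", 0, 0, True),
    (9, "top", 0, 0, False),
]

table = build_win_table(toy_rows)
print(home_win_probability(table, 7, "top", 5, 3))  # 0.75
```

One design point worth noting: with at-bat level data, the same state shows up many times within a single game, so it probably makes sense to keep one row per game per state before counting. Otherwise long innings get extra weight.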

Play with it yourself here: Luke's Win Probability App

(This is the first app I've written using Shiny and R, so I know it's not the most aesthetic...yet)

The default has the game in the top of the 9th, with the score tied 0-0. The home team goes on to win this game 75% of the time.

In the example below, after the top of the 7th, the home team is losing 2-1. At this point in the game the high-powered bullpens are in play and the home team wins 36% of the time.

Additionally, I've started building models (logistic regression, k-means, support vector machines) to see how accurately the winner can be predicted. I'm planning on updating the app to compare the different models and see which is most accurate.

So far, my logistic regression modeling has come out to 76% accuracy. I haven't seen any meaningful differences by team, which I think is very interesting. Wild to think that once you account for the inning, all that matters is the score itself! Such a simple concept, but tough to wrap our minds around. Essentially, over the course of a long season, the worst teams convert a given game situation into wins at about the same rate that good teams do.
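For readers who want to see what "logistic regression accuracy" means mechanically, here is a minimal Python sketch using plain gradient descent. It is not the R model behind the 76% figure: the two features (home lead and inning scaled to 0-1), the hyperparameters, and the toy data are assumptions made for this example, and accuracy here is measured on the training rows only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted probability that the home team wins for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def accuracy(weights, bias, features, labels):
    """Share of rows where the predicted winner (threshold 0.5) matches the label."""
    correct = sum(
        (predict_proba(weights, bias, x) >= 0.5) == bool(y)
        for x, y in zip(features, labels)
    )
    return correct / len(labels)


# Toy data invented for illustration: [home lead, inning / 9] -> 1 if home won.
toy_features = [
    [2, 7 / 9], [-1, 7 / 9], [1, 3 / 9], [-3, 8 / 9],
    [0, 9 / 9], [-2, 5 / 9], [3, 6 / 9], [-1, 2 / 9],
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 1]

weights, bias = fit_logistic(toy_features, toy_labels)
print(accuracy(weights, bias, toy_features, toy_labels))
```

A fair accuracy number would come from games the model never saw during fitting, for example by fitting on one season and scoring on the other.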

Data is on GitHub: Data for Win Prob Project

Check out the app! I built this sucker:

