Machine Learning for Baseball ⚾ (my story)

Nikolai Yakovenko
Jan 30, 2017 · 9 min read


A year after I left Google in late 2008 — I traveled around for a bit, got involved in a messy startup, etc — I was looking for a new project and figured that this was my one chance to explore baseball statistics.

Like Nate Silver before me and others since, I started by downloading previous years’ player stats and building projections for each player’s performance in the following year. I won’t bore you with the details, as what I did was pretty generic. If you’re looking to do something like that (download baseball data and try to project future performance), there’s a decent book on the topic, although sadly in R, not Python. Baseball is one of the better ways to get into data science, if you’ve got a passion for the ‘ole ball game 🌭.
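If you want to try that exercise yourself, a minimal pandas sketch of the loop looks something like this. It assumes you’ve downloaded the Lahman database’s Pitching.csv (the column names follow the Lahman schema; the local path is my placeholder):

```python
import pandas as pd

# Season-level pitching stats from the Lahman database.
pitching = pd.read_csv("Pitching.csv")
pitching = pitching[pitching["BFP"] > 0]  # need batters faced for rate stats

# Strikeouts per batter faced: a more stable rate than ERA.
pitching["k_rate"] = pitching["SO"] / pitching["BFP"]

# Lagged feature: how well does last year's rate predict this year's?
pitching = pitching.sort_values(["playerID", "yearID"])
pitching["k_rate_prev"] = pitching.groupby("playerID")["k_rate"].shift(1)

sample = pitching.dropna(subset=["k_rate_prev"])
print(sample[["k_rate_prev", "k_rate"]].corr())
```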

Actually let me get into that for a second: baseball is a great way to learn data science.

My first projection system was ridiculous. Or rather, I over-fit the data. I knew that the tree-based approach I took had flaws, but I wrote it up anyway and sent it to the head of operations at the Chicago White Sox, through a friend of a friend. I never heard back, and reading my own work a few weeks later, I realized how horrible it had been.

It was comically bad, to be honest. I had trees branching on the number of pitcher wins in the previous year, as well as pitcher age (my projections were exclusively for pitchers, as this is where others like Nate Silver also started).

I learned a valuable lesson: what bad generalization actually looks like. Moreover, even after 3.5 years as an engineer on Google’s Quality team (coding for the core ranking algorithm), I had never learned these intuitive statistical principles.
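The failure mode is easy to reproduce on synthetic data of about the size I was working with; this is an illustration, not my original code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1500                                 # roughly the number of pitcher seasons
X = rng.normal(size=(n, 2))              # stand-ins for wins and age
y = 0.5 * X[:, 0] + rng.normal(size=n)   # weak signal, lots of noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor()           # no depth limit: it memorizes the data
tree.fit(X_tr, y_tr)
print("train R^2:", tree.score(X_tr, y_tr))  # ~1.0, a perfect-looking fit
print("test  R^2:", tree.score(X_te, y_te))  # near zero or negative
```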

Limited by the data you have

Baseball presents data-constrained problems. There were only about 1,500 pitcher seasons in my dataset (2002–2009), and you couldn’t get more data. Sure, you could go back to older seasons, but the game you’d be looking at wouldn’t be the same.

MLB strikeouts by batter over the years, source: Eno Sarris, SportsOnEarth

Pretty quickly, I learned to appreciate why baseball research frowns upon fitting the data with anything other than a linear system. This linear-orthodox approach is also prevalent among quants on Wall Street (although they will sometimes fit a linear model to an underlying non-linear feature).

Marcel the Monkey, source: Friends Wikia

I also realized why projection systems are kind of pointless — they can’t significantly outperform a simple three-year regression of previous performance, as Tom Tango fastidiously points out with his Marcel model (a reference to Ross’s pet monkey on Friends).
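Tango publishes the full Marcel recipe; a stripped-down sketch of its flavor is below. The real version also weights by playing time and applies an age adjustment, both omitted here, and the regression amount is an arbitrary placeholder:

```python
def marcel_like(rates, league_mean, weights=(5, 4, 3), regression=2.0):
    """Weight the last three seasons (most recent first) and regress
    toward the league mean."""
    num = sum(w * r for w, r in zip(weights, rates))
    den = sum(w for w, _ in zip(weights, rates))
    return (num + regression * league_mean) / (den + regression)

# e.g. a pitcher's K/9 over his last three seasons, league mean of 6.5
print(marcel_like([8.1, 7.6, 7.0], league_mean=6.5))  # ~7.5
```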

Now what?

If you can’t beat baseline projections based on player age and the three previous years’ regressed performance, what else is there to do? Plenty. I could not avoid seeing Bill James come up with a new statistical insight almost every week on his website.

I don’t remember why, but after months of doing vanilla pitcher projections (not based on trees this time, and with a proper validation set), I thought I’d study how pitchers’ physical characteristics can help explain their on-field performance. In other words, I started looking at stats beyond the box score:

  • injury data [disabled list (DL) records]
  • pitch data [how many fastballs, curveballs, etc. does a pitcher throw, and how hard is his fastball]
  • physical characteristics, like age, height and weight

All of this research, good and bad, can be found on my Ivan Bezdomny Baseball blog, which I maintained from 2009 to 2010.

Insight from injuries

The injury data didn’t get me very far, although I did gain some common-sense insights into injury types:

  • elbow injuries are not as bad long-term as shoulder injuries
  • surgeries aren’t great, and labrum or rotator cuff surgeries tend to end careers
  • pitchers with Tommy John surgery miss a year, then tend to come back about as good as new

This research was enough to make me understand that after rotator cuff surgery, Brandon Webb was probably done as a pitcher, despite almost winning three Cy Young awards in 2006–2008 and still being a young man. At the same time, I understood why Billy Beane and the Oakland A’s signed Ben Sheets for a large annual salary after he missed the whole 2009 season with elbow surgery.

You are what you throw

The pitch data was much more interesting. I focused on two ideas:

Strikeout rates are much more stable than overall pitcher performance (ERA, etc.), and I’d read on Bill James’s site that teams care mostly about strikeouts for younger pitchers, since that is the determining factor between those who make it in the major leagues and those who don’t, a few soft-tossing finesse lefties like Kirk Rueter aside.

Strikeouts by fastball velocity, AL vs NL

Most of the variance between pitcher strikeouts at the MLB level can be explained by two variables: fastball velocity and “is he a lefty” (it helps). Oh, and pitchers with the same physical characteristics get consistently more strikeouts in the National League than in the American League, so I urged Brian Cashman (GM of my beloved Yankees) to stop signing high-strikeout National League pitchers.
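That two-variable story fits on one screen. A sketch with statsmodels, where the file and column names are hypothetical stand-ins for whatever pitch dataset you use:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical pitcher-season table with velocity and handedness columns.
df = pd.read_csv("pitcher_seasons.csv")

X = pd.DataFrame({
    "velocity": df["avg_fastball_mph"],
    "is_lefty": (df["throws"] == "L").astype(int),
})
X = sm.add_constant(X)

model = sm.OLS(df["k_rate"], X).fit()
print(model.summary())  # both coefficients should come out positive
```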

I made a few other predictions:

  • Aroldis Chapman was a great signing by the Reds for 6 years, $30M. All we knew about Cuban defector Chapman was that he was a lefty and threw close to 100mph: that projected to one of the best strikeout pitchers in the league… as a median (50th percentile) projection. Chapman turned out to be one of the highest strikeout rate pitchers in baseball history.
  • I didn’t understand the Diamondbacks trading a young Max Scherzer for a slightly more established Edwin Jackson. Jackson had good stuff, but had consistently under-performed the strikeout projection implied by his physical tools, whereas Scherzer was already a high-strikeout pitcher, with the potential to improve. Jackson went on to have a solid career, pitching until he was 32 with a couple of 2.0 WAR seasons. Meanwhile Scherzer won the 2013 AL Cy Young, the 2016 NL Cy Young, made four All-Star teams, and signed a $210M contract.
  • Joba Chamberlain (remember him?) was lights out as a reliever. Initially I was all for moving him from the bullpen to a starter role: young guy, more innings, thus more value (even at a lower ERA). But once I saw that strikeout rates are super-linearly correlated with fastball velocity, I saw why you don’t make a 97mph thrower into a starter. Joba’s average fastball dropped to 95mph as a reliever, then to 92.5mph as a starter. Few pitchers can maintain high-90s velocity in longer outings, and so we’ve seen more and more relievers used by MLB teams over the past seven years. The hardest throwers can be more valuable in fewer innings, if they can maintain those high-90s fastball speeds.
  • Some teams, particularly the St. Louis Cardinals under legendary pitching coach Dave Duncan, prefer developing specific types of pitchers. You can see this by bucketing pitchers into their nearest K-means cluster and naming those clusters “pitcher types” (see the sketch after this list). Dave Duncan hates change-ups, so he teaches his pitchers to throw more curveballs, perhaps as their change-of-pace pitch. This may contribute to the Cardinals giving up fewer fly balls, and thus fewer home runs, than their opposition.
  • After a year and a half of declining fastball velocity (from an average fastball of 94mph during his Cy Young winning years, down to 91.3mph), Tim Lincecum was done as an elite pitcher by 2010.
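The “pitcher types” idea from the Cardinals bullet takes only a few lines with scikit-learn. A minimal sketch, where the file and column names are hypothetical stand-ins for whatever pitch-mix dataset you have:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical table: one row per pitcher-season, with pitch-mix fractions.
mix = pd.read_csv("pitch_mix.csv")
features = mix[["fastball_pct", "curve_pct", "slider_pct", "change_pct"]]

# Bucket each pitcher into his nearest cluster: these are the "pitcher types."
km = KMeans(n_clusters=6, n_init=10, random_state=0)
mix["pitcher_type"] = km.fit_predict(features)

# A coaching philosophy shows up as a skewed distribution over types.
print(mix.groupby(["team", "pitcher_type"]).size().unstack(fill_value=0))
```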

What is the point, other than that I made a few lucky picks and helped explain what more informed baseball researchers surely already knew?

At the time, I did not think that anyone was reading my blog. I’m sure Blogspot had traffic counts that I could have found, but this wasn’t as easy to see in 2010 as it is on Medium today. I didn’t have a “share on Facebook” plugin, and my blog didn’t get too many comments.

As it turns out, people were reading. In particular, I was contacted by Keith Woolner, principal data scientist of the Cleveland Indians, and someone whose work I had known and loved from his days at Baseball Prospectus. We spoke about a role with the Indians, which ultimately did not work out, in part because my background as an ex-Google New Yorker who had been doing baseball analysis for about a year while playing high-stakes professional poker did not make it obvious that I’d really be all that committed to a data science job in Cleveland. But we had a great conversation, and I had several such chats with representatives of other MLB teams.

Not much later, I gave up the blog. I had gotten started on a new idea, which ended up becoming a startup project and a story for another time.

A couple of years later, my blog posts started getting cited by Kyle Boddy, founder of the Driveline Baseball gym in Seattle. Kyle trains young pitchers to maximize their physical potential, in a scientific and sustainable manner.

Scientific, sport-specific training for ⚾

Before his business blew up to the point where he can’t accept most of the pitchers who apply to his program (several major leaguers and many professional pitchers have come through those doors), he used my articles to explain to youngsters and their families how much throwing the fastball a mile or two faster would help their pro and college prospects.

I put hundreds of hours into this blog and the research for it. Was it worth it? Absolutely.

I experienced the embarrassment of submitting a really bad model to important people; I learned statistical principles; I learned how to research a new area and find interesting problems that others hadn’t fully explored. I practiced writing for a public audience, which is something I’ve recently started doing again with this blog.

That would have been enough, but I also got to meet some awesome people as a direct result of my writing. I love baseball, but it’s unlikely that I’ll ever get to work in the industry. Getting to know some nerdy people who were able to make baseball their life… is this a great country or what! 🇺🇸

Update/Addendum

This story is about my experience messing around with baseball data, and learning a bit of data science along the way. My methods weren’t cutting edge then, and that was 2009–2010.

If you have suggestions for better resources for getting into baseball analysis in 2017, please post a comment below. I’d love to add a link here, so others might stumble upon them.

On a related note, since I work on applied deep learning, people often ask me about getting into this hot field. My advice is simple: start a public GitHub. If you’re fortunate enough to have free time and are looking into either deep learning or data science, there are wonderful open source projects that you can download and run. Fork a project and improve it. Or better yet, apply it to a new problem and post that solution to your own GitHub. Stars and a polished README help, but just being active on the platform is a good start. It shows initiative, and it might get you noticed.

Deep sabermetrics ⚾?

Since someone will ask: can you use deep learning to study baseball? Probably, although deep neural nets aren’t well suited to problems with limited-size datasets.

I did do one neural net baseball project about a year ago, although I’m not in a position to share the code, as it was for a friend who wanted to keep the work private. But I will describe it in broad strokes below.

As Tom Tango and others have shown, individual event stats for a team (singles, doubles, home runs, stolen bases, strikeouts, etc) can be combined with a “linear weights” model to predict how many runs should have been scored. This formula is more or less invariant over time, at least at the MLB level, since the rules remain the same, even as strikeout and home run rates can change year to year, and decade to decade.
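Fitting such a linear-weights model yourself is just least squares over team-season event counts. A sketch, with hypothetical column names standing in for the usual counting stats:

```python
import numpy as np
import pandas as pd

# Hypothetical team-season table of event counts and runs scored.
teams = pd.read_csv("team_seasons.csv")
events = ["singles", "doubles", "triples", "home_runs",
          "walks", "stolen_bases", "strikeouts", "other_outs"]

X = teams[events].to_numpy(dtype=float)
y = teams["runs"].to_numpy(dtype=float)

# One least-squares coefficient per event type: the "linear weights."
w, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, weight in zip(events, w):
    print(f"{name:>13}: {weight:+.3f} runs per event")
```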

Intuitively, though, the “best fit” formula shouldn’t be quite linear. In a high on-base environment, strikeouts are more costly, while in a high-strikeout environment, home runs might have a little more value.

In short, I was able to fit the runs-scored data significantly better with a simple fully connected multi-layer neural network, where the inputs were individual event counts and the output was a softmax over runs scored. The gain over linear weights was not huge, but it was significantly better on a holdout set of games. It also had the nice bonus of producing a probability distribution over runs scored. I used a pretty basic setup, with ReLU activations, 50% dropout, and a fixed learning-rate decay.
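Since I can’t share the original code, here is a minimal PyTorch sketch of that kind of architecture. The layer sizes, input count, and run cap are my guesses, not the original hyperparameters:

```python
import torch
import torch.nn as nn

N_EVENTS = 10   # event counts per team-game (singles, doubles, ...)
MAX_RUNS = 20   # scores above this get clipped into the last bucket

model = nn.Sequential(
    nn.Linear(N_EVENTS, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),            # the 50% dropout mentioned above
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, MAX_RUNS + 1),  # logits; CrossEntropyLoss applies the softmax
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# "fixed decay": e.g. halve the learning rate every 20 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
```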

Once you have tens of thousands of examples of anything, you can think about training deep nets, although more data is better. Until then, statistical methods will have to suffice!
