Machine Learning for Baseball ⚾ (my story)

Limited by the data you have

Baseball presents data-constrained problems. There were only about 1,500 pitcher seasons in my dataset (2002–2009), and you couldn’t get more data. Sure, you could go back to older seasons, but the game you’d be looking at wouldn’t be the same.

MLB strikeouts by batter over the years, source: Eno Sarris, SportsOnEarth
Marcel the Monkey, source: Friends Wikia

Now what?

If you can’t beat baseline projections based on player age and the three previous years’ regressed performance, what else is there to do? Plenty. I could not avoid seeing Bill James come up with a new statistical insight almost every week on his website.

  • injury data [disabled list (DL) records]
  • pitch data [how many fastballs, curveballs, etc. a pitcher throws, and how hard his fastball is]
  • physical characteristics, like age, height and weight
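
For reference, the kind of baseline mentioned above (player age plus the three previous years’ regressed performance) can be sketched roughly as follows. This is only an illustration, not the exact system I was comparing against: the function name, the 3/2/1 season weights, the 400 "phantom" batters faced used for regression, and the age adjustment are all assumed values chosen for the example.

```python
def marcel_style_k_rate(seasons, league_rate, age,
                        weights=(3, 2, 1),   # assumed recency weights
                        regress_bf=400,      # assumed regression strength
                        peak_age=29,         # assumed peak age
                        age_step=0.003):     # assumed per-year adjustment
    """Project a strikeout rate (K per batter faced).

    seasons: list of (strikeouts, batters_faced), most recent season first.
    """
    weighted_k = 0.0
    weighted_bf = 0.0
    for w, (k, bf) in zip(weights, seasons):
        weighted_k += w * k
        weighted_bf += w * bf

    # Regress toward the league average by mixing in "phantom" batters
    # faced who strike out at the league rate.
    weighted_k += regress_bf * league_rate
    weighted_bf += regress_bf
    rate = weighted_k / weighted_bf

    # Crude age adjustment: nudge up before the assumed peak, down after it.
    return rate * (1 + (peak_age - age) * age_step)


# Made-up pitcher: three seasons of (strikeouts, batters faced).
projection = marcel_style_k_rate(
    [(240, 800), (216, 720), (180, 600)], league_rate=0.20, age=26
)
print(f"projected K rate: {projection:.3f}")
```

The point of the regression term is that a pitcher with few batters faced gets pulled strongly toward the league rate, while a pitcher with three full seasons mostly keeps his own rate.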

Insight from injuries

The injury data didn’t get me very far, although I did gain some common-sense insights into injury types.

  • elbow injuries are not as bad long-term as shoulder injuries
  • surgeries aren’t great, and labrum or rotator cuff surgeries tend to end careers
  • pitchers with Tommy John surgery miss a year, then tend to come back about as good as new

You are what you throw

The pitch data was much more interesting. I focused on two ideas

Strikeouts by fastball velocity, AL vs NL
  • Aroldis Chapman was a great signing by the Reds for 6 years, $30M. All we knew about Cuban defector Chapman was that he’s a lefty and throws close to 100mph: that projected to one of the best strikeout pitchers in the league… at a 50% projection. Chapman turned out to be one of the highest strikeout rate pitchers in baseball history.
  • I didn’t understand the Diamondbacks trading a young Max Scherzer for a slightly more established Edwin Jackson. Jackson had good stuff, but had consistently under-performed the strikeout projection implied by his physical tools, whereas Scherzer was already a high-strikeout pitcher with the potential to improve. Jackson went on to have a solid career, pitching until he was 32 with a couple of 2.0 WAR seasons. Meanwhile Scherzer won the 2013 AL Cy Young, the 2016 NL Cy Young, made four All-Star teams, and signed a $210M contract.
  • Joba Chamberlain: remember him? He was lights out as a reliever. Initially I was all for moving him from the bullpen to a starter role: young guy, more innings, thus more value (even at a lower ERA). But once I saw that strikeout rates rise super-linearly with fastball velocity, I saw why you don’t make a 97mph thrower into a starter. Joba’s average fastball dropped to 95mph as a reliever, then to 92.5mph as a starter. Few pitchers can maintain high-90s velocity in longer outings, and so we’ve seen more and more relievers get used by MLB teams over the past seven years. The hardest throwers can be more valuable in fewer innings, if they can maintain those high-90s fastball speeds.
  • Some teams, particularly the St. Louis Cardinals under legendary pitching coach Dave Duncan, prefer developing specific types of pitchers. You can see this by bucketing pitchers into their nearest K-means cluster, and naming those clusters “pitcher types.” Dave Duncan hates change-ups, so he teaches his pitchers to throw more curveballs, perhaps as their change-of-pace pitch. This may contribute to the Cardinals giving up fewer fly balls, and thus fewer home runs, than their opposition.
  • After a year and a half of declining fastball velocity (from an average fastball of 94mph during his Cy Young winning years, down to 91.3mph), Tim Lincecum was done as an elite pitcher by 2010.
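
The “pitcher types” idea above (bucketing pitchers into their nearest K-means cluster) can be sketched in a few lines. This is a minimal, standard-library version of Lloyd's algorithm, not the code I ran at the time; the pitcher names and pitch-mix numbers are made up for illustration, and the choice of two clusters is an assumption.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns the final list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                # New centroid is the per-dimension mean of the members.
                updated.append(tuple(sum(dim) / len(members)
                                     for dim in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical pitch mixes: (fastball share, curveball share, change-up share).
pitch_mix = {
    "pitcher_a": (0.70, 0.20, 0.10),
    "pitcher_b": (0.68, 0.22, 0.10),
    "pitcher_c": (0.55, 0.05, 0.40),
    "pitcher_d": (0.57, 0.08, 0.35),
}

centroids = kmeans(list(pitch_mix.values()), k=2)
pitcher_types = {name: nearest(mix, centroids)
                 for name, mix in pitch_mix.items()}
print(pitcher_types)
```

With real data you would want more features (velocity, for one), more clusters, and a look at each centroid before giving a “type” a name. Counting how many of a team's pitchers land in each cluster is then a simple way to spot an organizational preference.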
Scientific, sport-specific training for ⚾


This story is about my experience messing around with baseball data, and learning a bit of data science along the way. My methods weren’t cutting edge then, and that was 2009–2010.

Deep sabermetrics ⚾?

Since someone will ask: can you use deep learning to study baseball? Probably, although deep neural nets aren’t well suited to problems with small datasets.



Nikolai Yakovenko


AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami