So in a previous post, where I compiled all of the StatCast data available on players released by MLBAM video demos, I promised I would try to find a good way to display and explain the tables I built and how I incorporated those into the Lahman database.
The Lahman database uses a playerID as a key to connect a majority of the the tables. I kept this same playerID in the new tables that I constructed to house the play-by-play data gathered from StatCast. I chose to create four tables. One for the batting element of each play, one for the fielding element, one for base running, and one for pitching. All four tables are connected by a playID. I am defining a play to be anything other than your typical ball or strike calls, including foul balls out of play, where nothing is affected besides the count. Depending on the type of play that occurs during the game, there might be a playID referenced in all four tables. A minimum of two tables will be used on each play of a game.
Example 1: Pitcher throws a pitch to the batter. The batter hits a fly ball to the left fielder. The left fielder catches it for the first out of the inning. Everything that StatCast could record about that play would be stored within the Play Index tables. In this case, the data would be stored in all tables except for base running (PlayIndexR).
The playID is determined by the month, day and year of the game, along with the two teams playing (teamID) and the number of the play in that game.
Example: The tenth play of the game on August 14, 2014 where the Boston Red Sox are playing the New York Yankees.
08 14 2014 010 BOS NYA = 08142014010BOSNYA
D M Yr # Team Team
The playerID's allow for each play to be unique for every individual player involved. This in turn allows for us to count statistics for each player to use for other purposes like finding season total numbers. The playID connects each part of the play. Without the playID it would be more difficult to query a full play. However, without the playerID, you couldn't use counting or rate statistics to calculate things such as a players career average and totals from certain StatCast data.
As you look at the list some fields will be self explanatory and others will require more clarification. Remember, each table collects a specific players data that was involved in a play. Going back to the example from above. All of the StatCast information regarding the batter who hit the fly ball to the left fielder will be stored in the Batting Play Index table (PlayIndexB). For that same play, all of the data relating to the catch by the left fielder will be stored in the Fielding Play Index table (PlayIndexF) and the pitchers information will funnel into the Pitching Play Index table (PlayIndexP). Since there was no base runner, the Base running Play Index table contains nothing for that play. All of this data is unique to each player; the batter, left fielder, and pitcher. All of their data is, however, connected by the playID which remains the same.
Browse the spreadsheet and think of ways in which you could make this more functional. By no means is this a one-all. In a later post I will break down each field in more detail.