Batting Stats 7: wOBA Corrected

I think I found one of the culprits: Intentional Walks.  Our two guys, Sink and McDonough, have IBB’s in that season well over the average.  I queried the database to find their IBB’s as well as league average for those seasons:

SELECT b.player_id
  , concat(p.first_name, " ", p.last_name) as player 
  , l.abbr
  , b.year
  , b.league_id as league 
  , b.ibb
  , x.lg_avg_ibb
  , b.ibb-x.lg_avg_ibb as diff
FROM (  	
	SELECT b.year
	 , b.league_id
	 , round(avg(b.ibb),0) as lg_avg_ibb
	FROM CalcBatting b
	INNER JOIN players p ON b.player_id-p.player_id
	WHERE b.ab>200 AND p.position<>1 
	GROUP BY b.year, b.league_id
	) as x
  INNER JOIN CalcBatting b ON x.year=b.year AND x.league_id=b.league_id
  INNER JOIN players p ON b.player_id=p.player_id
  INNER JOIN leagues l ON b.league_id=l.league_id
 WHERE b.player_id IN (2574, 472) AND b.year=2015;

You’ll see that both were well above average:

Sink is about 1 standard deviation above average while McDonough is off the charts at almost 6.

I took a look at the 10 highest IBB variances to see if we can say with any certainty that we’re handling IBB’s incorrectly:

Looking at this, I think it’s pretty clear that if nothing else, IBB is at least contributing to the differences in wOBA by weighting it too highly.

It turns out that it’s not necessarily the weighting, but rather for formula for wOBA itself.  I removed IBB from the formula and recalculated wOBA for these 10 player-seasons:

Every single one of them improved.  Not yet perfect, but now that I have the outliers at least within the medium range, I feel comfortable moving on for real.  So, here’s the revised script for CalcBatting with the corrected formula for wOBA:

#Calculated batting stats for OOTP
    DROP TABLE IF EXISTS CalcBatting;
    CREATE TABLE IF NOT EXISTS CalcBatting AS

    SELECT b.year
    , b.league_id
    , b.player_id
    , b.stint #We can eventually move this down the list
    , b.split_id #We can eventually remove
    , b.team_id #We can eventually move this down the list
    , l.abbr as Lg
    , t.abbr as Team
    , b.g
    , b.ab
    , @PA := b.ab+b.bb+b.sh+b.sf+b.hp AS PA
    , b.r 
    , b.h
    , b.d
    , b.t
    , b.hr
    , b.rbi
    , b.sb
    , b.cs
    , b.bb
    , b.k
    , b.ibb
    , b.hp
    , b.sh
    , b.sf
    , b.gdp
    , b.ci
    , @BA := round(b.h/b.ab,3) AS ba
    , round(b.k/@PA,3) as krate
    , round((b.bb)/@PA,3) as bbrate
    , @OBP := round((b.h + b.bb + b.hp)/(@PA-b.sh-b.ci),3) AS obp
    , round(100*(@OBP/r.woba),0) as OBPplus
    , @SLG := round((b.h+b.d+2*b.t+3*b.hr)/b.ab,3) as slg
    , round(@OBP+@SLG,3) as ops
    , round(@SLG-@BA,3) as iso
    , round((b.h-b.hr)/(b.ab-b.k-b.hr+b.sf),3) as babip
    , round((r.wobaBB*(b.bb-b.ibb) + r.wobaHB*b.hp + r.woba1B*(b.h-b.d-b.t-b.hr) +
       r.woba2B*b.d + r.woba3B*b.t + r.wobaHR*b.hr)
       /(b.ab+b.bb-b.ibb+b.sf+b.hp),3) as woba
    /* NOT yet modified for OOTP and MySQL
    , round((([woba]-r.lgwoba)/r.wobascale)*[PA],1) as wRAA
    , round(((([woba]-r.lgwoba)/r.wobascale)+(l.totr/l.totpa))*[PA],0) as wRC
    , ((([wRAA]/[PA] + l.RperPA) + (l.RperPA - t.bpf*l.RperPA))/(lb.wRC/lb.PA))*100 AS [wRC+]
    */
    FROM 
      players_career_batting_stats b 
      INNER JOIN leagues l ON b.league_id=l.league_id 
      INNER JOIN teams t ON b.team_id=t.team_id
      INNER JOIN tblRunValues2 r ON b.year=r.year AND b.league_id=r.league_id
    WHERE b.ab<>0 AND b.split_id=1
    ORDER BY b.player_id, b.year

Batting Stats 5: wOBA

Here it is- wOBA.  I’ve been building up to this for a while and I’m excited to get going.  For the full details on wOBA, check out Fangraphs‘ page.  The short version is that wOBA takes On Base Percentage and weights each of the components based on the run expectancy for each of the component events in each league year.  It’s a very good measure of the offensive contribution of a player in a specific league setting.  It does not normalize the data across years and ballparks – we will get to some stats that do that later on.  But it does acknowledge that a double is worth more than a walk – and it quantifies how much more.

The formula takes the woba weights that we calculated in Run Table 2 and multiplies them by the number of walks, singles, doubles, etc. that a player accumulated during a season:

, round((r.wobaBB*b.bb + r.wobaHB*b.hp + r.woba1B*(b.h-b.d-b.t-b.hr) +
       r.woba2B*b.d + r.woba3B*b.t + r.wobaHR*b.hr)
       /(b.ab+b.bb-b.ibb+b.sf+b.hp),3) as woba

I’ve already plugged this formula into the table and I’m getting results.  Before I start comparing my results to the game and pulling my hair out, I am going to establish success criteria.  I mentioned in the Run Environment posts that our formulae contain constants that I don’t fully understand.  Moreover, I don’t know if the game uses the same constants.  So, to expect a perfect match between my database and the game may be unreasonable.

Let’s think about what a reasonable margin of difference would be.  Any difference less than 5 points is negligible.  I can’t really make a distinction between two players with wOBAs of .320 and .325.  They’re essentially equal.  Can I say the same thing about 20 points?  No.  There’s clearly a difference between .320 and .340.  What about 10?  Iffy.  If I want to give myself as much wiggle room as possible without losing faith in my metrics, I can’t go higher than 10.  Let’s set 10 as the absolute limit and see what we get.   I am going to take a sample of 35 player years and see how close I get:

So, that’s 21 where the difference is less than 10 points.  7 where the difference is between 10 and 20 points, and 7 where the difference is greater than 20.  Not perfect – 40% outside of the range I set for myself.  However, still better than I feared, and actually better than it looks.  For example, Dan Wasielewski’s 2012 season is off by 20 points.  But that 2012 season only had 6 plate appearances.  I am totally OK having wacky numbers for very small sample sizes.  The greatest difference (.029), similarly came from a season with barely over 100 plate appearances.

Without digging in to how to determine my own constants and without being able to pick Markus’s or Matt’s brains on how they derive these formulae, I think I’m pretty OK with my stats.  Until I decide to relentlessly pursue perfection, that is…

Batting Stats 4: BABIP

We’ll look at one last easy stat before the hard stuff.

BABIP is a very interesting stat because it acts as a kind of sanity check on all of the other stats a player produces.  It shows Batting Average on Balls in Play and average in most professional leagues is around .300.  A player with a career average well above that generally hits the ball hard.  A player with a career BABIP well below that mark often makes weak contact.  That’s an OK measure in itself, but it serves a much better purpose. If, during the course of a season, a player is putting up exceptional numbers, checking his BABIP against his career average will give you a sense of whether he’s getting lucky and likely to regress or if he’s turned a corner and upped his game.

The formula for BABIP is pretty straightforward:
(H - HR) / (AB - K - HR + SF)

The whole Megillah up to this point, then, is:

#Calculated batting stats for OOTP
    DROP TABLE IF EXISTS CalcBatting;
    CREATE TABLE IF NOT EXISTS CalcBatting AS

    SELECT b.year
    , b.league_id
    , b.player_id
    , b.stint #We can eventually move this down the list
    , b.split_id #We can eventually remove
    , b.team_id #We can eventually move this down the list
    , l.abbr as Lg
    , t.abbr as Team
    , b.g
    , b.ab
    , @PA := b.ab+b.bb+b.sh+b.sf+b.hp AS PA
    , b.r 
    , b.h
    , b.d
    , b.t
    , b.hr
    , b.rbi
    , b.sb
    , b.cs
    , b.bb
    , b.k
    , b.ibb
    , b.hp
    , b.sh
    , b.sf
    , b.gdp
    , b.ci
    , @BA := round(b.h/b.ab,3) AS ba
    , round(b.k/@PA,3) as krate
    , round((b.bb)/@PA,3) as bbrate
    , @OBP := round((b.h + b.bb + b.hp)/(@PA-b.sh-b.ci),3) AS obp
    , round(100*(@OBP/r.woba),0) as OBPplus
    , @SLG := round((b.h+b.d+2*b.t+3*b.hr)/b.ab,3) as slg
    , round(@OBP+@SLG,3) as ops
    , round(@SLG-@BA,3) as iso
    , round((b.h-b.hr)/(b.ab-b.k-b.hr+b.sf),3) as babip
    /* NOT yet modified for OOTP and MySQL
    , round((r.wobaBB*nz(b.bb,0) + r.wobahb*nz(b.hbp,0) + r.woba1b*(b.h-b.[2b]-b.[3b]-b.hr) + r.woba2b*b.[2b] + r.woba3b*b.[3b] + r.wobahr*b.hr)/(b.ab+nz(b.bb,0)-nz(b.ibb,0)+nz(b.sf,0)+nz(b.hbp,0)),3) as woba
    , round((([woba]-r.lgwoba)/r.wobascale)*[PA],1) as wRAA
    , round(((([woba]-r.lgwoba)/r.wobascale)+(l.totr/l.totpa))*[PA],0) as wRC
    , ((([wRAA]/[PA] + l.RperPA) + (l.RperPA - t.bpf*l.RperPA))/(lb.wRC/lb.PA))*100 AS [wRC+]
    */
    FROM 
      players_career_batting_stats b 
      INNER JOIN leagues l ON b.league_id=l.league_id 
      INNER JOIN teams t ON b.team_id=t.team_id
      INNER JOIN tblRunValues2 r ON b.year=r.year AND b.league_id=r.league_id
    WHERE b.ab<>0 AND b.split_id=1
    ORDER BY b.player_id, b.year

 

The Tools I’m Using

Just some thoughts about the tools and services I am using for my project.

Up until the middle of the batting stats table, I had been running MySQL locally on my late 2012 MacBook Pro 15inch Retina, maxed out with the 2.6 i7 and 16GB of RAM.  It’s still a great machine.

For a while, I had been using MySQL Workbench as my interface.  It’s not bad at all and, for a free tool, it’s great.  It does have some shortcomings in terms of ease of use and importing data.  For example, when using it to connect to a remote database, it insists on treating it like a localhost.  Honestly, after a few hours of live chats with tech support trying to make it work, I gave up on it.

I switched to Querious and I couldn’t be happier.  Not cheap, but it’s a joy to work with.  My favorite feature?  That would have to be the ability to easily import from csv.  Being able to do it at all is pretty fantastic, but being able to clean and manipulate the data easily before import is amazing.  On top of that, the import speeds are blazing fast.

Querious has 4 main views: Content, Structure, Query, and Server.

Content allows you to view the content of tables and views just by clicking on an object in the sidebar.  From within that view, you can filter the results shown by clicking on the “show all’ button.  This pulls down a list of all the columns in the table and allows you to write a quick WHERE clause to filter the data.  A real time saver when you need to do a quick sanity check.

Clicking on the database itself rather than a table gives you a view of every table and view along with data size, index size, auto_increment count, and row count.  My only quibble with this is that the row count does not update automatically.  You have to refresh for each table to see a value.

Structure shows you the structure of the database and the tables.  Clicking on the database will show you a diagram of all the tables, with lines showing foreign keys and relationships between tables.  Clicking on a  table will show you all of the columns, their types, any constraints.  You can switch tabs to see primary keys, indexes, and foreign keys.  You can also add, change, or delete these from this view.  Other tabs show you what triggers exist for tables and allow you to create or change them.  Finally, you can see table info and options – along with an exportable CREATE statement.

The Query view gives you a SQL editor with all the usual bells and whistles: context sensitive highlighting, an intelligent (much more so than Workbench) auto complete feature.  There’s also a context sensitive table reference panel that senses which table you’re referring to and lists all of the columns.  Double clicking on a column will insert the name at the cursor.  You can also switch to a different table if you’re thinking quicker than the AI.

You can also open or create SQL files in a separate window that has all of the same features except for the table reference panel.    The look and feel of both the editors are great, but for reasons mentioned below, I am still doing most of my editing in a different program.

My only complaint at all about Querious is that it seems to be frightened of large sql files.  My 10MB Test Data Load script has never failed to hang the program (beachball) necessitating a force-quit.  Even loading a 1000 line script hangs the program for at least a minute.  That’s loading, not executing.  The way around the test data load is to import the data via csv.  It handles that like a pro.

Still, as mentioned above, I do most of my editing in a separate program.  I use Brackets because the editing environment is almost as good as Querious and the file management is far better.  Perhaps because it was made with web design in mind, it makes it much easier to manage a collection of files, keeping them straight, organized, and easily accessible.  Querious allows you to save sql files and even allows you to put them in folders and favorites, but I can’t seem to get comfortable with the function.  With Brackets, I know where my files are and can flip back and forth with ease.

My remote database is hosted by DreamHost.  I have a number of domains hosted with them and really like them.  Their pricing is fair and have a number of goodies that make setting up portfolios and blogs, as well as provisioning databases, simple.  Their customer service is great.  If you’re interested in setting up hosting with them, ask me for a referral!

Batting Stats 2: Correction

I’ve been shown the light!  Matt Arnold, friendly neighborhood OOTP dev, reminded me that the calculation for OBP isn’t quite what I had remembered.  In fact, the denominator isn’t Plate Appearances.  Rather, it’s PA minus Sacrifice Hits minus Catcher Interference.  Having made that correction, OBP now ties with the game.  Our new code is:

#Calculated batting stats for OOTP
    DROP TABLE IF EXISTS CalcBatting;
    CREATE TABLE IF NOT EXISTS CalcBatting AS

    SELECT b.year
    , b.league_id
    , b.player_id
    , b.stint #We can eventually move this down the list
    , b.split_id #We can eventually remove
    , b.team_id #We can eventually move this down the list
    , l.abbr as Lg
    , t.abbr as Team
    , b.g
    , b.ab
    , @PA := b.ab+b.bb+b.sh+b.sf+b.hp AS PA
    , b.r 
    , b.h
    , b.d
    , b.t
    , b.hr
    , b.rbi
    , b.sb
    , b.cs
    , b.bb
    , b.k
    , b.ibb
    , b.hp
    , b.sh
    , b.sf
    , b.gdp
    , b.ci
    , round(b.h/b.ab,3) AS ba
    , round(b.k/@PA,3) as krate
    , round((b.bb)/@PA,3) as bbrate
    , @OBP := round((b.h + b.bb + b.hp)/(@PA-b.sh-b.ci),3) AS obp
    , round(100*(@OBP/r.woba),0) as 'OBP+'
    /* NOT YET CONVERTED TO OOTP AND MYSQL
    , round((b.h+[b.2b]+2*b.[3b]+3*b.hr)/b.ab,3) as slg
    , [obp]+[slg] as ops
    , [slg]-[avg] as iso
    , round((r.wobaBB*nz(b.bb,0) + r.wobahb*nz(b.hbp,0) + r.woba1b*(b.h-b.[2b]-b.[3b]-b.hr) + r.woba2b*b.[2b] + r.woba3b*b.[3b] + r.wobahr*b.hr)/(b.ab+nz(b.bb,0)-nz(b.ibb,0)+nz(b.sf,0)+nz(b.hbp,0)),3) as woba
    , round((([woba]-r.lgwoba)/r.wobascale)*[PA],1) as wRAA
    , round(((([woba]-r.lgwoba)/r.wobascale)+(l.totr/l.totpa))*[PA],0) as wRC
    , ((([wRAA]/[PA] + l.RperPA) + (l.RperPA - t.bpf*l.RperPA))/(lb.wRC/lb.PA))*100 AS [wRC+]
    */
    FROM 
      players_career_batting_stats b 
      INNER JOIN leagues l ON b.league_id=l.league_id 
      INNER JOIN teams t ON b.team_id=t.team_id
      INNER JOIN tblRunValues2 r ON b.year=r.year AND b.league_id=r.league_id
    WHERE b.ab<>0 AND b.split_id=1
    ORDER BY b.player_id, b.year

Batting Stats 2: Easy Calculated Stats

I’ll be adding to the scripts started yesterday working towards a complete script for all batting stats.  I’ll post the relevant bits of the code I’m working on in the main body and post the complete script as it stands beneath the fold.

I’ll be tackling the easy calculated stats in this post, starting with batting average.  Easy and not worth a discussion.

Next are K% and BB%.  These stats measure the percentage a player’s plate appearances end in a strikeout or walk, respectively.

I made an interesting discovery while coding these easy stats.  As PA is a calculated stat, I was not looking forward to repeating the definition of PA (AB+BB+IBB+HP+SF+SH). A minor annoyance, but for a reason that promises larger annoyances down the road.  I thought about how, in Wyers’s Run Value tables, we used variables that were defined and used in expressions in the same statement.  I figured it was worth a try.  I created a view with PA defined as @PA := AB+BB+IBB+HP+SF+SH AS PA and used @PA as the denominator in the expressions. I got an error telling me I couldn’t do that. Figures. But why? It worked for Wyers – so what was the difference? The only one I could think of was that he used his variables in creating a table and I used mine creating a view. So, I switched from a view to a table. It worked! Here are the relevant lines of code:

DROP TABLE IF EXISTS CalcBatting;
    CREATE TABLE IF NOT EXISTS CalcBatting AS

   SELECT b.year
    , b.league_id
    , b.player_id
    , b.stint #We can eventually move this down the list
    , b.split_id #We can eventually remove
    , b.team_id #We can eventually move this down the list
    , l.abbr as Lg
    , t.abbr as Team
    , b.g
    , b.ab
    , @PA := b.ab+b.bb+b.sh+b.sf+b.hp AS PA
    , round(b.h/b.ab,3) AS ba
    , round(b.k/@PA,3) as krate
    , round((b.bb+b.ibb)/@PA,3) as bbrate
   FROM 
      players_career_batting_stats b 
      INNER JOIN leagues l ON      b.league_id=l.league_id 
      INNER JOIN teams t ON b.team_id=t.team_id
   WHERE b.ab<>0 AND b.split_id=1
   ORDER BY b.player_id, b.year

 

On Base Percentage is next and, since I’ll be using that stat again in this statement, I decided to define it as a variable as well: @OBP := round((b.h + b.bb + b.hp)/@PA,3) AS obp. This worked nicely, giving me identical results to when done without a variable.

Next up is OBP+.  Not a very often used stat, and not present in OOTP, but I like it.  It presents a player’s OBP in relation to the rest of the league average which is set to 100.  So, a player with a 103 OBP+ is 3% higher than league average.  wOBA is always relative to what the current league OBP is, so for any nuanced understanding of a player’s wOBA you need to have an understanding of what the current league OBP is.  Don’t get me wrong, wOBA is one of the first offensive stats I look at, but for a quick glance at the rate a player gets on base vs the rest of the world, OBP+ is great.

The formula for OBP+ is: 100 * (On Base Percentage / League On Base Percentage).

Remember, we calculated League wOBA in the Intermediate Run Values Table and referenced it in the table RunValues2, so we can call that up pretty easily. We’ll have to add another join to our statement, and that will give us:

DROP TABLE IF EXISTS CalcBatting;
    CREATE TABLE IF NOT EXISTS CalcBatting AS

    SELECT b.year
    , b.league_id
    , b.player_id
    , b.stint #We can eventually move this down the list
    , b.split_id #We can eventually remove
    , b.team_id #We can eventually move this down the list
    , l.abbr as Lg
    , t.abbr as Team
    , b.g
    , b.ab
    , @PA := b.ab+b.bb+b.sh+b.sf+b.hp AS PA
    , round(b.h/b.ab,3) AS ba
    , round(b.k/@PA,3) as krate
    , round((b.bb)/@PA,3) as bbrate
    , @OBP := round((b.h + b.bb + b.hp)/@PA,3) AS obp
    , round(100*(@OBP/r.woba),0) as 'OBP+'
FROM 
      players_career_batting_stats b 
      INNER JOIN leagues l ON b.league_id=l.league_id 
      INNER JOIN teams t ON b.team_id=t.team_id
      INNER JOIN tblRunValues2 r ON b.year=r.year AND b.league_id=r.league_id
    WHERE b.ab<>0 AND b.split_id=1
    ORDER BY b.player_id, b.year

I’m going to stop here for now as I’ve hit a little snag.  My calculations for OBP aren’t matching up 100% against the ones that the game gives us.  Only off a couple points for full-season samples but much more with smaller ones.  I’d like to at least understand what’s happening, if not correct it, before moving on much further.  I posted about it here, and maybe the community can help me figure out the issue.  Spot checking a few players in both my database and the game, evaluating it in Excel has shown me that my database is calculating the formula correctly.  So, maybe the game is using a different formula?

Continue reading “Batting Stats 2: Easy Calculated Stats”

Batting Stats 1: The Easy Stuff

The goal here is to create a table (or view) for all offensive stats for each player-year-stint-split.  Let’s go back and parse that briefly:  Stats are collected for each player who accumulates them.  Each player gets his own row.  For each year that a player accumulates stats, a new row of data is created for that player.  For each team that a player plays in a given year (stint), a new row of data is created for that player.  Stats are accumulated and placed into three splits for each player-year-stint: Overall, vs. Left, and vs. Right.

I will most likely create views for each split, but for the moment I am going to focus only on Overall (split_id=1).

As this view is where I have gotten tripped up in the past, I’m going to take it slowly here, checking against game data periodically.  So, easy bit first: Let’s get the counting stats and the more traditional rate stats out of the way:

#Calculated batting stats for OOTP
    CREATE OR REPLACE VIEW CalcBatting AS

    SELECT b.year
    , b.league_id
    , b.player_id
    , b.stint #We can eventually move this down the list
    , b.split_id #We can eventually remove this
    , b.team_id #We can eventually move this down the list
    , l.abbr as Lg
    , t.abbr as Team
    , b.g
    , b.ab
    , b.ab+b.bb+b.sh+b.sf+b.hp as PA
    , b.r 
    , b.h
    , b.d
    , b.t
    , b.hr
    , b.rbi
    , b.sb
    , b.cs
    , b.bb
    , b.k
    , b.ibb
    , b.hp
    , b.sh
    , b.sf
    , b.gdp
        FROM 
      players_career_batting_stats b 
      INNER JOIN leagues l ON b.league_id=l.league_id 
      INNER JOIN teams t ON b.team_id=t.team_id
    WHERE b.ab<>0 AND b.split_id=1
    ORDER BY b.player_id, b.year

We’ll pick a random player from in-game and make sure we line up.  Let’s pick player #14 from 2015 and compare to OOTP.  Not expecting any problems on the easy stuff.  Still, just to be sure.

Output from calcbatting
Frank Garcia

So far so good.  Next post will finish up with the standard stats and some of the easier advanced stats.

Run Environment 6: Run Values 2

The last table for our Run Environment set is almost exactly the same as the one I presented before with two differences: we’re returning the runMinus and RunPlus from the intermediate table, and we’re using a variable to define wOBAscale- making the whole thing much more elegant and easy to understand.

# The last Run Values table, returning fields from RunValues1A, defining wOBAscale
# as a variable, and giving us the weighted factors by league year for batting events
DROP TABLE IF EXISTS tblRunValues2;
CREATE TABLE tblRunValues2 AS
SELECT year
, league_id
, RperOut 
, runBB 
, runHB 
, run1B 
, run2B 
, run3B 
, runHR 
, runSB 
, runCS 
, runMinus 
, runPlus 
, wOBA 
, @ws := 1/(runPlus+runMinus) AS wOBAscale 
, (runBB+runMinus)*@ws AS wobaBB 
, (runHB+runMinus)*@ws AS wobaHB 
, (run1B+runMinus)*@ws AS woba1B 
, (run2B+runMinus)*@ws AS woba2B 
, (run3B+runMinus)*@ws AS woba3B 
, (runHR+runMinus)*@ws AS wobaHR 
, runSB*@ws AS wobaSB 
, runCS*@ws AS wobaCS 
FROM tblRunValues1A;

This is the table that we’ll refer to when calculating wOBA for individual players, factoring wobaXX against their counting stats XX.  It’ll make more sense when we actually do it.

This is almost all of the data we need to complete our batting stats.  What remains is park factor.  I had been hoping that we could use the factors listed on the parks table.  Unfortunately, OOTP’s resident MySQL Whisperer says it’s not true. I’m going to set that aside for today- while I figure out how to calculate park factors – and use an incremental approach to developing the batting stats view.  I figure that will give me one, maybe two, posts to get a good method for that.

Run Environment 5: An Intermediate Table

In addition to using variables, Wyers also does the other thing I had thought about using: an intermediate table to store the runMinus, runPlus, and wOBAScale values.  On the one hand, this makes sense as we will only have to write out the calculations once and then refer to them in this table when we need them.  On the other hand, if we managed to use variables for defining the run values in the previous table, we should be able to do so with these values as well.

Also, in this table, we define league wOBA.  It makes sense that the weighted On Base Average for the entire league is just the entire league’s On Base Percentage.  If we’re looking at an average for the entire league, there is nothing to weigh it against.  We’ll see this idea again when we look at wRC+.

A few minor adjustments to the code to allow for OOTP’s naming conventions and one final difference:  I’m taking the batting stats from the league_history_batting table rather than the players_ file.  It’s already there and there are far fewer rows to sort through.  And given that we’re not filtering out any stats, it’s silly not to use it.  And, because we’re using it, we can use the column for singles (s) rather than deriving singles from H-2b-3b-HR.

#Creating an intermediate table so as not to have to write out formulae for rumMinus etc.
DROP TABLE IF EXISTS tblRunValues1A;
CREATE TABLE IF NOT EXISTS tblRunValues1A AS
SELECT r.year
, r.league_id
, r.RperOut 
, r.runBB 
, r.runHB 
, r.run1B 
, r.run2B 
, r.run3B 
, r.runHR 
, r.runSB 
, r.runCS 
, SUM(runBB*(BB-IBB) + runHB * HP + run1B * s + run2B * d 
   + run3B * t + 1.4 * HR + runSB * SB - runCS * CS) 
   / SUM(AB - H + SF) AS runMinus 

, SUM(runBB * (BB-IBB) + runHB * HP + run1B * s + run2B * d 
   + run3B * t + 1.4 * HR + runSB * SB - runCS * CS) 
   / SUM(BB-IBB + HP + H) AS runPlus

, SUM(H+BB-IBB+HP) / SUM(AB+BB-IBB+HP+SF) AS wOBA 
 
FROM tblRunValues r
INNER JOIN league_history_batting_stats 
   ON r.year=league_history_batting_stats.year 
   AND r.league_id=league_history_batting_stats.league_id

 
GROUP BY 
r.year
, r.league_id
, r.RperOut 
, r.runBB 
, r.runHB 
, r.run1B 
, r.run2B 
, r.run3B 
, r.runHR 
, r.runSB 
, r.runCS 
 
ORDER BY 
r.year DESC;

Run Environment 4: Run Values A Different Way

My attempt in the last post resulted in a view of linear weights that, while it seemed to work, was very unwieldy.  It was so because for each of the linear weights, I had to spell out fully the calculations for runPlus, runMinus, and wobaScale.

I’ve been aware of this code from Colin Wyers for a while now, but wanted to try my hand at translating Tango’s views into MySQL on my own.  I’ve done that.  Now, I’m going to try to pick apart how Wyers uses variables to simplify the process and configure this back to my OOTP tables.  Probably best to have that link open in another tab while I walk through this.

I’ll be breaking this across several posts.

We’re going to ignore the first statement that creates a primary position (PrimPos) table.  As mentioned before, this table would serve to eliminate stats created by non-pitchers pitching.  My feeling is that they are part of the run environment and should be included and, besides, this is a rarer event in OOTP than it is in real life.

The second thing to note is that Wyers is creating a table rather than a view.  This is actually a good idea.  Views save disk space but have slower performance than tables.  That’s because views are essentially stored queries that are run every time they are accessed.  Tables are created once and saved as a data object in their own right.  Since space is not a concern for me and all of the advanced stats will require complex queries, I think that opting for performance here is the right thing to do.

Next, I am going to leave in place my view vLeagueRunsPerOut – the foundational view/table.  They’re too similar for me to make any changes.  Besides, I like that I thought to total up league Plate Appearances in mine.  It comes in handy later.

Minor changes to the RunValues table give me this:

DROP TABLE IF EXISTS tblRunValues;
CREATE TABLE IF NOT EXISTS tblRunValues
AS SELECT year
, league_id
, RperOut 
, @rb := RperOut+0.14 AS runBB 
, @rb+0.025 AS runHB 
, @rs := @rb+0.155 AS run1B 
, @rd := @rs+0.3 AS run2B 
, @rd+0.27 AS run3B 
, 1.4 AS runHR 
, 0.2 AS runSB 
, 2*RperOut+0.075 AS runCS 
FROM vLeagueRunsPerOut;

 

Two things are happening here that I didn’t think we could do.  First, declaring variables with just the ‘@’ and the ‘:=’ operator.  Murach, whose book I’ve been using to learn MySQL, doesn’t mention this as an option.  He does usually point out shortcuts that some devs take before advising against using them yourself.  This squares with the other web searches I’ve done regarding variables in MySQL.  But, it seems to work, so…

Second, I could have sworn that I had read that you can’t rely on how MySQL will evaluate variables when used like this.  That is, there’s not a guarantee that it will evaluate ‘@rb’ first and use that value to evaluate ‘@rs’.  It could decide to do ‘@rs’ first and force ‘@rb’ to null.  That didn’t happen here, and I am a little confused. (NOTE: I emailed Wyers about this and will update if he gets back to me.)

Bottom line, though, is that this worked.  Comparing the results from my original view without variables to Wyers’ table with them shows identical results if we disregard all of the trailing 0’s:

Oirignal View created without variables
Results from modified RunValues table with variables

OK then.  On to the next one.