The Gap Between Public and Private Information

This post was written by Adam Guttridge and David Ogren, the co-founders of NEIFI Analytics, an outfit which consults for Major League teams. Guttridge began his MLB career in 2005 as an intern with the Colorado Rockies, and most recently worked as Manager of Baseball Research and Development for the Milwaukee Brewers until the summer of 2015, when he helped launch NEIFI. As part of their current project, they tweet from @NEIFIco, and maintain a blog at their site as well.

Analysts in the public space often assume a very deferential position. Surely, they may say, teams are doing similar work with far more information, using far more sophisticated tools, and know vastly more than those working in the public sphere.

We’d venture that the true size of that gap is far, far smaller than is often suspected. Injury information? Of course teams have far greater detail. But as regards primary questions like “who has pitched better?” or “how should one separate batted ball skill from variance?” — in terms of the salient data, there simply has not been a remarkable gap between what’s available to teams and what’s available to the public.

At least, perhaps until recently.

Third-party companies are supplying a wealth of data which previously didn’t exist. The most publicized forms of that have been Trackman and Statcast. The key word here is data, as opposed to new analysis. Data is the manna from which new analysis may come, and new types or sources of data expand the curve under which we can operate. That’s a fundamentally good thing.

There’s a wave of companies providing something different from Statcast and Trackman. While Statcast and Trackman are generally providing a more granular form of information which we already have (i.e. more detailed accounts of hitting, fielding, or pitching), others are aiming to provide information in spaces where it hasn’t yet been available. A startup named DeCervo is using brain-scan technology to map the relationship between cognition and athletic performance. Wearable-tech companies like Motus and Zepp aim to provide detailed, data-centric information in the form of bat speed, a pitcher’s arm path, and more. Biometric solutions like Kitman Labs are competing to capture and provide biometric data to teams as well.

The solutions which provide more granular data (Trackman, Statcast, and also ever-evolving developments from Baseball Info Solutions) are of perhaps unknown significance. They offer a massive volume of data, but it’s an open question as to whether it yet offers significant actionable information, whether it has value as a predictive/evaluative tool rather than merely a descriptive one.

PITCHF/x was supposed to revolutionize our understanding of pitching. It may or may not have, but it doesn’t appear to have meaningfully altered our evaluation of pitchers. The simple truth is, it’s probably much better to gauge the effectiveness of a pitcher by the performance of the batters against him than it is via the separation between his breaking ball and fastball. Not because those must be exclusive items, of course. The point is merely that if one knew absolutely nothing about the slider movement or fastball-changeup velocity disparity of any pitcher in MLB, it would have little or no effect on one’s ability to assess or project the effectiveness of those pitchers. That is the question most salient to an organization’s success, more so than using pitch profiles to optimize batter-pitcher matchups, or questions of similar scope.

That’s not to say, at all, that PITCHF/x brought no benefit. Just that the most tangible evaluative gains probably came from such basic elements as providing a reliably consistent backboard for all pitchers’ velocities. Otherwise, the gains from PITCHF/x seem to have been largest with pitchers themselves (and their coaches), rather than within front offices. In the plot twist which most proves the point, by far the largest evaluative gains from vastly expanding the information we have about pitching came in regard to catchers and their framing abilities.

Which is all to say that by making information more granular, it’s entirely possible to gain 10,000% more data and only 1% more evaluative power. In much more detailed and eloquent terms, Russell Carleton described this with regard to Statcast.

Adding staff to manage this new data, just so that it may begin to be handled and analyzed, is the single largest driver of growth in front offices today. That’s with the hope, more than the expectation, that these expanded data sources will deliver substantial evaluative power.

It may well be a while before teams even really know. First, there are technological issues at times, as there were with PITCHF/x in the earlier years. Mostly, though, there are certain things that simply can’t or won’t be known until we have data covering a larger time period. At what rate do sliders tend to develop, compared to other pitches? Can hitters consistently generate power despite lower bat speed? Can outfielder routes be taught and learned, or is that ability fairly innate? Those questions won’t be effectively answered without multiple seasons’ worth of this information, if not many seasons’ worth. Not to torture the comparison, but the wonderful revelations of catcher framing ability came five years after the advent of the PITCHF/x data.

There’s an important lesson hidden here. As teams and the sabermetric public are on the hunt for new insights, there’s a natural assumption to make: that the next answers lie within the information we can’t yet see. If only we knew the spin on the slider, we might understand the strikeouts. There are two issues with this approach. On one hand, it presumes that the new level of detail will contain exactly those details which reveal further truth: for example, that the significant elements which determine strikeouts are contained within the particular information Trackman is providing, and not within other areas not captured. On the other side of that coin is the simple and fundamental truth that the most valuable insights in sabermetrics have come not from new data sources, but by re-imagining elements of the performance record which already existed in sufficient detail.

Voros McCracken’s defense-independent pitching observations forever changed the way pitching is evaluated, on a remarkable scale. That was entirely an execution of novel theory, not more granular data. Win Probability Added and Leverage Index, which require nothing more than simple play-by-play data, have given us a new framework through which to understand the value of relievers. Just a reminder that massive scale isn’t required for progress. Work showing Johnny Cueto’s baserunner control in 2012 shone a light on a skill which is, for some pitchers, worth a handful of runs per year—seems small, but truly a huge difference. These revolutions, amongst the bulk of the sabermetric progress we’ve seen, have not come from big data and advanced math. They’ve come from challenging old assumptions, and most importantly, knowing the right questions to ask.
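To make concrete how little data such a framework can require, here is a minimal sketch of FIP, a later and widely used metric in the defense-independent family. It is not McCracken’s original DIPS formula, only an illustration of the same idea: evaluate a pitcher on home runs, walks, hit batters, and strikeouts alone. The constant of 3.10 is an assumed, illustrative value (in practice it is set each season to put FIP on the league ERA scale), and the stat line below is hypothetical.

```python
def fip(hr: int, bb: int, hbp: int, k: int, ip: float,
        constant: float = 3.10) -> float:
    """Fielding Independent Pitching from basic box-score totals.

    Uses only outcomes the defense does not touch: home runs (hr),
    walks (bb), hit batters (hbp), and strikeouts (k), per inning
    pitched (ip). `constant` is an assumed, illustrative value; the
    real one is recalculated for each league season.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line, not a real pitcher.
example = fip(hr=20, bb=50, hbp=5, k=200, ip=200.0)
print(round(example, 3))
```

Nothing here needs pitch tracking or batted-ball measurement. Every input has been part of the ordinary performance record for decades.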

The point is, there are still plenty of discoveries yet to be unearthed in the information already available to us. The relationship between pitching and fielding in run prevention is nowhere near a settled science. Aging curves are nowhere near a settled science. Batted ball variance is nowhere near a settled science. Player projection is nowhere near a settled science. It would be as dangerous as it would be flatly incorrect to assume that the answers to such issues and more rely upon adding a deeper level of detail to our dataset. These are questions of theory. These are questions of baseball, not of statistics or technology. These are the areas that determine an organization’s ability to evaluate players.

The increased detail of the existing information will present new opportunities. It’s an exciting time to be in sabermetrics for that reason among others. It is not, however, the only place, or even likely the primary place, from which further evaluative power will come. The gains yet unearthed (or unearthed by only some parties, privately) in terms of baseball theory far outweigh the gains available from more granular data, by an enormous margin. Sabermetrics, in either the public or private space, would be imprudent to primarily rely upon further detail to provide further wisdom.






Dave is the Managing Editor of FanGraphs.


szielinski
Member
4 months 6 days ago

DIPs can be considered a paradigm shift and the initial expression of a new research program. These are uncommon events. They reflect an attempt to solve puzzles which commonsense cannot manage. Tests and evidence follow in the wake of the new research program. Eventually, the research program generates new puzzles it cannot manage. A new research program may take its place.

I can’t even guess what would integrate the different kinds of research generating data today.

Perhaps the Pentagon would like to invest billions in this research!

boogshine
Member
4 months 6 days ago

I wonder if we can’t trace all of sabermetrics back to Sparky Lyle’s fictional traveling secretary in The Year I Owned The Yankees (1990)? Stats took the Yankees to the World Series. I loved that book.

Paul22
Member
4 months 6 days ago

So where is all this Statcast data? mlb.com has a taste, but not much to sink your teeth into, especially on the defensive side. Unlike what happened with PITCHF/x, there doesn’t appear to be any publicly accessible source of the tracking data.

Still waiting for defensive game logs and splits, without which the advanced fielding data is less than optimum, and even the old school fielding data does not have this (although one can get from retrosheet if you have the skills, tools and time).

Pitch f/x data, which is readily available, has helped quantify the expansion of the strike zone and is a great tool for trend analysis, showing a pitcher’s evolution or decline. It’s been far more useful than I expected, perhaps because unlike the above data, it’s more readily available to analysts and laymen.

Baseball4ever
Member
4 months 6 days ago

“There’s an important lesson hidden here. As teams and the sabermetric public are on the hunt for new insights, there’s a natural assumption to make: that the next answers lie within the information we can’t yet see. On the other side of that coin is the simple and fundamental truth that the most valuable insights in sabermetrics have come not from new data sources, but by re-imagining elements of the performance record which already existed in sufficient detail.”

While I couldn’t agree more, my point would be when people share new ideas with others maybe it would be best to keep an open mind and not disparage the people that are willing to share those ideas.

Sometimes the snark and arrogance on this site, especially in the comments section is just too much for me, regardless of whether I am or am not the one being disparaged.

People also have to realize that since it is a “data intensive endeavor” to flesh out ideas, those ideas cannot be fully expressed in a couple of sentences of comments. Some of us let the reams of spreadsheets, databases and our formulas do the talking for us, because that is where we live to do our analysis.

RedsManRick
Member
4 months 6 days ago

In my experience, the bulk of the snark is reserved for shaming those people who dismiss new ideas out-of-hand when those new ideas conflict with an existing belief. Usually the existing belief isn’t supported by analysis (be that statistical or otherwise), rather just asserted as obviously true.

I agree with the principle of humility and openness, but at the same time, not all ideas are created equal. To pretend otherwise, out of a false humility or deference, would be counterproductive.

Noah Baron
Member
4 months 6 days ago

The gap is in the details.

Like incorporating fielding, stolen base prevention, and pitch f(x) data into pitcher evaluation.

Like using exit velocity, batted ball distance and pull/oppo rates to look at hitting in a more granular manner (rather than just looking at observed hits and using a basic park factor).

Like evaluating defense based on acceleration time, route efficiency, and average maximum speed instead of UZR/DRS, which while effective don’t eliminate the issue of team positioning.

There is plenty of public data on the first two things. The main issue in my opinion has been a lack of analytical will/resources (as analysts get swooped up by teams and FanGraphs/BP writers focus on day to day operations). To be fair, that problem is much worse at Baseball Prospectus than it is here.

Hopefully the issues with evaluating defense will be solved when the Statcast and field f(x) data becomes public, although there really isn’t much of an incentive for smart teams using (and benefiting from) the data to allow public analysts to spoil their trade secrets.

Joe Blow
Member
4 months 6 days ago

The gap between public vs. private info is how someone as annoying as Rany Jazayerli can be so wrong for so many years, yet still somehow feel like his annual prediction of doom was prescient until the last two years.

He continually said he was only wrong because he didn’t know the private stuff. He also doesn’t know anything about baseball.
