I realize I'm mostly preaching to the choir here, but if you'll indulge me for a bit, the conversations about Ruben Amaro's job status and the two camps of advocates have me thinking.
In regression analysis, one of the key statistics that we use to assess the quality of the model is the coefficient of determination, or more simply, R-squared (R2). It's a fundamentally simple concept. R2 is a number between 0 and 1 that tells us the proportion of variation in the outcome that is explained by the regression model. In layman's terms, we understand that our data and our knowledge are limited - each regression uses some set of independent variables to predict some outcome/dependent variable, but we can't possibly include every single variable needed to fully explain that outcome variable - there are too many idiosyncrasies and too much randomness.
For example, let's say that we want to predict how well a set of children will do in school (let's say using GPA). The first thing to note is that GPA is not an especially good measure of "how well a child does in school" because the measurement itself is based on a host of factors completely beyond the student's control, like grade inflation, the quality of the school, etc. Nevertheless, what "independent variables" can help us predict those outcomes? How about the education level of the child's parents? So let's use parents' years of schooling to predict child GPA. We would expect that the more educated the parents are, the higher the child's GPA will be - and indeed, we find this time and again. However, there is no shortage of children whose parents are not well educated yet who excel in school, and similarly, many children of very educated parents struggle. There is a consistent correlation, but it's far from a perfect one. In other words, our R2 for this study would be very low. We might be able to explain about 2% of the variation in children's GPA through parents' education. There are many, many other variables that likely affect GPA, like the child's intelligence, social status, school quality, parental involvement, and I could go on and on.
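To make the GPA example concrete, here is a minimal sketch of how R2 falls out of a one-variable regression. Everything in it is an assumption for illustration: the data are randomly generated rather than taken from any real study, and the names (r_squared, parent_years, gpa) and the numbers used to generate the data are made up. It only shows that a real relationship buried in a lot of individual noise produces a low R2.

```python
import random


def r_squared(x, y):
    """R-squared of a least-squares line with one predictor, fit to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Ordinary least-squares slope and intercept.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variation left unexplained by the line vs. total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: a small true effect of parents' schooling on GPA,
# swamped by individual-level noise (the "idiosyncrasies").
random.seed(0)
parent_years = [random.randint(8, 20) for _ in range(500)]
gpa = [2.5 + 0.03 * years + random.gauss(0, 0.6) for years in parent_years]

r2 = r_squared(parent_years, gpa)
print(f"R-squared: {r2:.3f}")
```

The effect of parents' schooling is built into the fake data on purpose, yet the regression still explains only a small slice of the variation, because the noise term is so much larger than the effect.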
With each variable we add to the equation, we are able to explain more and more of the variation. But note a couple of things - those problems I pointed out with the measurement of GPA? They exist for each of the other measures too - all measurements of social phenomena are flawed. But perhaps even more importantly, with all those variables, the R2 never gets very close to 1. Think of all the possible variables that we could include that should be correlated with GPA, and we'll still be lucky to explain 50% of the variation. Why?
Because people are individuals and while we all share some commonalities, we also have our own idiosyncrasies. Some students have bad days/bad years, some are more motivated, some function well in a school environment while others would be better off learning more by doing, etc.
Which brings me to baseball. Sabermetric analysis is incredibly useful and hopefully the Phillies front office has finally realized that. But even at its best, the analysis is only ever going to be able to capture an incomplete proportion of the variation in team wins. Why? Because teams are made up of people, dealing with their own idiosyncratic issues - athletic ability, attitude, bad days, divorces, sick relatives, new swings, new grips, social groups, levels of happiness, team cohesion, etc. All of these things have some relationship with winning, and large-scale correlational analysis is unlikely ever to uncover those relationships, both because the concepts themselves are so hard to measure and because they are so idiosyncratic. A divorce might make one player more aggressive and play better, while it might make another despondent and distracted.
This is where scouting, the qualitative analysis of baseball, comes in. But like qualitative research, scouting is interpretive, and much, much more difficult than data analysis. Predicting broad trends is, while not easy, relatively straightforward. Predicting individual trends is next to impossible. That scouts get it right sometimes is impressive, because they are being asked to explain the other 50% of the variation in team wins without any of the tools available to the data analysts in figuring out their 50%.
Given these challenges - data analysis can only explain so much, and scouting is incredibly difficult - smart teams should use both. Despite the calls for Ruben Amaro's head, I'm hopeful that the organization has begun to recognize the utility of using mixed methods to manage an inherently complex endeavor.