How To Use R For Sports Stats, Part 1: The Absolute Basics

by Brice Russ

July 27, 2015

If you’ve spent a sufficient amount of time messing around with sports statistics, there’s a good chance the following two things have happened, in order:

You probably started off with Excel, because Excel does a lot of stuff pretty easily and everyone has Microsoft Office.
At some point, you mentioned to someone that you use Excel to do statistical analysis and got a response along the lines of, “Oh, that’s cool, but you should really be using R.”

Politeness issues aside, they might well be right.

R is a programming language and software platform commonly used, particularly in research and academia, for data analysis and visualization. Because it’s a programming language, the learning curve is a bit steeper than it is for something like Excel–but if you dig into it, you’ll find that R makes it possible to do a wider variety of tasks more quickly. If you’re interested in finding interesting insights with just a few lines of code, if you want to easily work with large sets of data, or if you’re interested in using most any statistical test known to man, you should take a look at R.

Also, R is totally free, both as in “open-source” and as in “costs no money”. So that’s nice.

In this series, we’ll learn the basics of working in R with the goal of exploring sports data—baseball, in particular. I’m going to presume that you have no background whatsoever in coding or programming, but to keep things moving, I’ll try not to get too bogged down in the details (like how “=” does something different from “==”) unless absolutely necessary. This guide was made using R on Windows 7, but most everything should be the same on whatever OS you use.

Okay, let’s do this.

Getting Started

You can download R from https://cran.rstudio.com/.

You’ll have to click on a few links (you want the ‘base’ install) and actually install R, but once that’s done you should have a screen that looks like:

The “R console” is where your code is soon going to run–but first, we need some data. Let’s take FanGraphs’ standard dashboard data for qualifying MLB batters in 2013 and 2014. Save it as something short, like “FGdat.csv”. (If you have a custom FG dashboard or just want to take a shortcut, you can just download the data we’ll be using here.)

In R, we’ll be focusing mostly on functions (that look like, say, function(arg1, arg2)), which are what actually do things, and naming the output of these functions so we can refer back to it later. For example, a line of R code might look like this:

fgdata = read.csv("FGdat.csv")

The function here is the read.csv(), which basically means “read this CSV file into R”, and the argument inside is the file that we want to read. The left part (fgdata =) is us saying that we want to take the data we’re reading and name it “fgdata”.

This is, in fact, the first line we want to run in R to load our data, so type/paste it in and hit Enter to execute it.

(You may get an error like cannot open file ‘FGdat.csv’: No such file or directory; if you do, you likely need to change the directory that R is trying to read files from. Go to “File” -> “Change dir”, and change the working directory to the folder you saved the CSV in, or just move the CSV to the folder R has listed as the working directory.)

If you didn’t get an error and R simply moves on to the next line, you should be good to go!

Basic Stats

The head() function returns the first 6 rows of data; since our data set is named “fgdata”, we can try this out with the line of code:

> head(fgdata)

And to get a basic overview of the entire data set, there’s the summary() function:

> summary(fgdata)

See! Already, data on 20 variables in the blink of an eye.

“1st Qu.” and “3rd Qu.” are the first and third quartiles; the mean, median, minimum and maximum should be self-explanatory. So we can see that the average player in this data set had roughly a .270 average with 17 dingers and 10 steals in 146 games–not far from Alex Gordon’s 2014, basically.

Want to compare how the 2013 and 2014 stats stack up? R makes it pretty easy to pick out subsets of data. It’s called, reasonably, the “subset” function, and all you need to include is the data set you’re taking a subset of and the criteria the subset data should conform to.

Since we have “Season” as a field in the table, we just need to say “Season == “2013”” to get the 2013 players and “Season == “2014”” to get the 2014 players. We’ll name these new data sets ‘fg13’ and ‘fg14’:

> fg13 = subset(fgdata, Season == "2013")
> fg14 = subset(fgdata, Season == "2014")

A quick check should confirm that, yes, the data did subset correctly:

> summary(fg13)

and now we can do some basic statistical comparisons, like comparing the mean BABIPs between 2013 and 2014. (To single out a specific column in a data set, use the $ symbol.)

> mean(fg13$BABIP)
> mean(fg14$BABIP)

You can do whatever basic statistical tests you like–sd() for the standard deviation, et cetera–and pull out different subsets of the data based on whatever criteria you like. So “HR > 20” for all players who hit more than 20 home runs, or “Player == “Mike Trout”” to get data for all players named Mike Trout:

> fgtrout = subset(fgdata, Name == "Mike Trout")
> fgtrout

Lastly, it’s not too common to need to reorder your data in R, but if you do, you can do so with the order() function. This line sorts the data by wRC+, ascending order:

> fgdata = fgdata[order(fgdata$wRC.),]

then returns the top 10 rows:

> head(fgdata, n = 10)

You can sort in descending order by placing a minus sign before the column:

> fgdata = fgdata[order(-fgdata$wRC.),]

And, as you’ve probably noticed, most of these functions can be tweaked or expanded depending on the different arguments you use–adding “n = 10” to head(), for example, to view 10 rows instead of 6. One of the more fascinating and infuriating things about R is that pretty much every function is like that–but at least they’re all documented!

And, of course, you can access the documentation through a function. Use help() (help(head), help(summary), etc.) and a page will pop up with the arguments, and more additional details than you probably ever wanted.

Wrap-up

One final note: typing code directly into the console is fine, but it gets a bit annoying if you want to write more than a line or two. Instead, you can create a new window within R to load, edit and run scripts. In Windows, use “Ctrl+N” to open a new script window. Type some code; to run it, highlight the lines you want to run and hit “Ctrl+R”.

You can also use these windows to save your R script in R files–as I’ve done here for all the code used in this article. Feel free to download and start tinkering.

So those are the basics of R; not enough to really show its potential, but enough to start experimenting and exploring as you wish. For Part 2, we’ll start some data plotting and correlation tests, and in Part 3 we’ll try to recreate some basic baseball projection models. I actually haven’t done this before in R, so it should be interesting. Stay tuned!

(Thanks to Jim Hohmann for helping test this article.)

TechGraphs News Roundup: 7/24/2015

Independent Baseball’s Newest Umpire Isn’t Human

Brice lives in the Washington, DC area, where he does communications for linguistics and space exploration organizations. Brice has previously written for Ars Technica, Discovery News and the Winston-Salem Journal. He's on Twitter at @KilroyWasHere.

25 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Mustbunique

8 years ago

So there was another Techgraphs article not too long ago asking what the readers would like to see more of. I would like to re-submit my answer, because I didn’t know it until I saw it. “This article and more like it,” would now be my submission. Great stuff Brice, I plan on installing R on my comp ASAP and seeing how much better than excel it can be.

Brice Russmember

8 years ago

Reply to Mustbunique

Thanks for the kind words! That survey was actually one of the inspirations for this post, since “analysis in R” was one of the more common answers. Glad to hear it’s proving interesting so far.

Bradley Woodrummember

8 years ago

This is great. Thanks for this!

Kris

8 years ago

R is my favourite. You can connect it to MySQL really easily, and unlike a lot of the productivity software, it doesn’t make you wanna blow your brains out. I can’t recall if you guys have covered the python implementation of pfx scraping, but using python / mysql / and R basically allows you to query whatever you need. Toss in the Lehman DB and you can have a blast.

Just about every test you can imagine is either installed by default, available as a package, or scripted somewhere online.

If you have statistical knowledge and basic computer knowledge, you can probably knock out the learning curve for MYSQL and R in a weekend.

scott

8 years ago

As a scientist in my professional career and a baseball stats geek in my free time these TechGraphs “how to” articles on R, Tableau, and Excel stuff have been simply phenomenal. I have not only learned things for Baseball data manipulation but also have learned some things that I can use in my actual work. One of the “excel wizardry” articles had an excellent discussion on using INDEX-MATCH instead of VLOOKUP in the comments that has been life-changing for me.

Keep up the great work TechGraphs team!

DSMok1

8 years ago

Reply to scott

Just read about INDEX-MATCH vs. VLOOKUP based on your comment….

Why didn’t I know about this!?! I’ll be using that from now on.

Joe W

8 years ago

Loved this article and can’t wait for the next parts!

Kyle

8 years ago

Great article! Been looking for a good introduction to R and MySQL and using baseball is a great way to use it for learning by keeping me interested with a topic I actually care about. I look forward to the continuation of these articles

Eric

8 years ago

For your next installment, I’d say *please* encourage your audience to use RStudio instead of the R console. Being able to see all your code, re-run bits of it without typing again, having your code and your results in two different panes — it really lowers the barrier of frustration for new users. Add in having R help in its own pane, and being able to inspect dataframes in an Excel-like window, and I can’t imagine starting anyone off in R just staring at that blank console window.

Oh, and that said, great article! What a gentle but substantive introduction to R.

Brice Russmember

8 years ago

Reply to Eric

Thanks! You’re absolutely right–I didn’t even think to recommend an IDE. I’ll make a note for the next segment.

8 years ago

I tried to get the ball rolling myself in learning R, but without a lot of spare time I’d much rather have a tutorial like this.

I’ve bookmarked this article, and look forward to the rest of the series!

Steve

8 years ago

The inputs “fg13 = subset(fgdata, Season == “2013”)” and “fg14 = subset(fgdata, Season == “2014”)” are returning the error message “Error in eval(expr, envir, enclos) : object ‘Season’ not found”.

The other commands are working properly.

Brice Russmember

8 years ago

Reply to Steve

That’s odd; that’s the error it should give if your data set doesn’t have a column named “Season”. I’ll assume that you’re using the CSV file that’s linked in the post, and that the other subset commands (like the Mike Trout one) are working fine.

Try the command colnames(fgdata), which returns the exact text of the column names; if your “Season” column is spelled weird or something, use whatever text it gives you (without the quotes, of course). If that doesn’t fix it, you can just call the row explicitly by replacing “Season” with “fgdata[,1]”, where the number is the column number that Season is.

And if *that* doesn’t fix it, send me the results you’re getting for the commands in the last paragraph, because this is interesting.

Steven

8 years ago

Thanks Brice,

The column heading was spelled weird, colnames(fgdata) showed “ï..Season”. I deleted the cell and retyped “Season” into the FGdat.csv file and it fixed the column heading.

Looking forward to the rest of the series!

Brice Russmember

8 years ago

Reply to Steven

Huh, that’s odd. Glad it got worked out!

Simon McPherson

8 years ago

Hi, Nice piece.

I’m excited about exploiting the data for DFS.

In the article you mention a query to find all players with HR>10. What is the code for that? I tried working it out but failed. I’m a bad student.

Regards
Simon

Simon McPherson

8 years ago

Reply to Simon McPherson

Oops, worked it out.

subset (fgdata, HR>10)

Andy

8 years ago

“You may get an error like cannot open file ‘FGdat.csv’: No such file or directory; if you do, you likely need to change the directory that R is trying to read files from. Go to “File” -> “Change dir”, and change the working directory to the folder you saved the CSV in, or just move the CSV to the folder R has listed as the working directory.”

Sorry, I’m lost already. How do I “go to” “File” -> “Change dir”? Not by typing that, apparently. Or where do I find the folder R has listed as the working directory?

I get the message: Warning message:
In file(file, “rt”) :
cannot open file ‘FGdat.csv’: No such file or directory

Is it “rt”, or “file, “rt””? If so, where is that?

Brice Russmember

8 years ago

Reply to Andy

Sorry! If you’re on Windows, it’ll be in the File menu at the top of the R program. (You may need to click on the console window first, if you have a script or something else open.)

Here’s a more detailed guide on changing the R working directory–you can change it with a command, too, if you prefer:

https://sites.google.com/site/manabusakamoto/home/r-tutorials/r-tutorial-4

Andy

8 years ago

Reply to Brice Russ

Thanks, Brice. I changed the directory (I have a Mac, and it’s under Misc), but have a problem downloading the FG data. There is supposed to be a way of downloading it into Excel using the Data/External Data tab, but I have no “new web entry” choice or any other choice that can be used. I can download the data by highlighting it and dragging it into Excel, but then the data are in multiple sheets, and only one sheet at a time can be saved as a .csv file.

In the meantime, I’m just using the FGdat(1).csv file that you linked to in your article, but for future reference, is there a way of downloading the FG leaderboards directly into a text or .csv file? I don’t seem to be able to get the data in a useful format into Excel. Or if you can show me a way to do that…

Brice Russmember

8 years ago

Reply to Andy

Yep; the link (“Export Data”) is above the top-right corner of the data, below the share buttons and above the ‘XXX items in X pages’ text. That should let you download the data as a CSV.

Andy

8 years ago

P.S. – I have the 2004 version of Excel (which I don’t really understand, because I just bought MS Office a couple of months ago). There is supposed to be a tab used for downloading data into Excel in either the 2003 or 2007 versions (different names for the tabs, one of them is new web entry, I forget what the other is), but neither tab is in my version.

Andy

8 years ago

OK, I now can get the data dragged into Excel saved as a .csv file. I must have done something wrong before. But I still can’t import all the data using a tab in Excel.

Andy

8 years ago

Just want to add I went through the rest of this article using the FGdat(1) file, and everything was very clear and easy to replicate.

Great tutorial, and on to Part II.

Kevin

8 years ago

I’ve always been told to use “assign” instead of “equals”…

So it’d be fgdata <- read.csv("FGdat.csv")

Google has a good style guide for R:

https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml

BAL	CHW	LAA
BOS	CLE	OAK
NYY	DET	SEA
TBR	KCR	TEX
TOR	MIN	HOU

ATL	CHC*	ARI
MIA	CIN	COL
WSN	MIL	LAD
NYM*	PIT	SDP*
PHI	STL	SFG