I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format.
http://www.sports-reference.com/cfb/players/ryan-aplin-1.html
I cannot find an aggregate of all players so I need to go page by page and pull out the bottom row of each passing scoring Rushing & receiving etc. html table
Each player is catagorized by their last name with links to each alphabet going here.
http://www.sports-reference.com/cfb/players/
For instance, each player with the last name A is found here.
http://www.sports-reference.com/cfb/players/a-index.html
This is my first time really getting into data scraping so I tried to find similar questions with answers. The closest answer I found was this question
I believe I could use something very similar where I switch page number with the collected player's name. However, I'm not sure how to change it to look for player name instead of page number.
Samuel L. Ventura also gave a talk about data scraping for NFL data recently that can be found here.
EDIT:
Ben was really helpful and provided some great code. The first part works really well, however when I attempt to run the second part I run into this.
> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it!
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) :
unexpected '<' occurred on line 1
use a command like
x <- edit()
to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
deparse may be incomplete
This could be an error due to not having sufficient memory (only 4gb on computer I am currently using). Although I do not understand the error
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
Looking through my other datasets my players really only go back to 2007. If there would be some way to pull just people from 2007 onwards that may help shrink the data. If I had a list of people whose names I wanted to pull could I just replace the lnk in
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
with only the players that I need?
Here's how you can easily get all the data in all the tables on all the player pages...
First make a list of the URLs for all the players' pages...
require(RCurl); require(XML)
n <- length(letters)
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
print(i) # keep track of what the function is up to
# get all html on each page of the a-z index pages
inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
# scrape URLs for each player from each index page
lnk <- unname(xpathSApply(inx_page, "//a/@href"))
# skip first 63 and last 10 links as they are constant on each page
lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
# only keep links that go to players (exclude schools)
lnk <- lnk[grep("players", lnk)]
# now we have a list of all the URLs to all the players on that index page
# but the URLs are incomplete, so let's complete them so we can use them from
# anywhere
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
Now we have a vector of some 67,000 URLs (seems like a lot of players, can that be right?), so:
Second, scrape all the tables at each URL to get their data, like so:
# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = (length(links)))
for(i in 1:length(links)){
print(i)
# error handling - skips to next URL if it gets an error
result <- try(
all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
); if(class(result) == "try-error") next;
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
The result looks like this (this is just a snippet of the output):
all_tables
$`neli-aasa`
$`neli-aasa`$defense
Year School Conf Class Pos Solo Ast Tot Loss Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007 Utah MWC FR DL 2 1 3 0.0 0.0 0 0 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 4 4 8 2.5 1.5 0 0 0 1 0 0 0 0
$`neli-aasa`$kick_ret
Year School Conf Class Pos Ret Yds Avg TD Ret Yds Avg TD
1 *2007 Utah MWC FR DL 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 2 24 12.0 0 0 0 0
$`neli-aasa`$receiving
Year School Conf Class Pos Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
1 *2007 Utah MWC FR DL 1 41 41.0 0 0 0 0 1 41 41.0 0
2 *2010 Utah MWC SR DL 0 0 0 0 0 0 0 0 0
Finally, let's say we just want to look at the passing tables...
# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)
And we end up with a data frame that is ready for further analyses (also just a snippet)...
Year School Conf Class Pos Cmp Att Pct Yds Y/A AY/A TD Int Rate
james-aaron 1978 Air Force Ind QB 28 56 50.0 316 5.6 3.6 1 3 92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA JR QB 100 182 54.9 1135 6.2 6.0 5 3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA SR QB 77 148 52.0 828 5.6 4.3 4 6 99.8