PowRful

Download

To this point, I've discussed many of the fundamentals of R at length. While solidifying this foundation will ultimately help you navigate more advanced material, it can be tedious. If you have experience using Excel or other programming languages, you may even be asking yourself: "What can I do with R that I can't already do using Excel or other tools?"

At this juncture, I think it's worth providing a short demonstration of what can be done with R. I'll be working through a few examples using the raw Play-By-Play data from Corsica. My hope is that by the end of this lesson, I will have kindled your desire to continue learning R and advance to more interesting topics.

Some of the code in this demonstration will appear unfamiliar to you – that's okay! Upon completion of the curriculum, you should be able to revisit this chapter and understand it in more detail. Until then, I'll try to explain the process as best I can while saving most of the details for future lectures.

If you haven't already, download the file attached above. It contains all the code to follow. You should install each of the packages listed in the dependencies:

> install.packages("dplyr")
> install.packages("ggplot2")
> install.packages("glmnet")
> install.packages("doMC")
> install.packages("Kmisc")

Next, load the RData file stored remotely on the Corsica server:

# Load data
load(url("http://fenwicka.com/shiny/powrful.RData"))

It holds pbp, a data frame containing the complete Play-By-Play from the 2016-17 NHL season to this point. It is easily the largest object you've worked with thus far in this course and, depending on your background, it could be the first time you've dealt with a table containing over one million rows. Not to worry, though. Properly vectorized R code shouldn't have too much difficulty parsing such large quantities of data. If you View() the data frame, you can begin to get a feel for its structure.

Without spending too much time on it, it's worth going over the general form of the Play-By-Play. General information about each game is supplied by the Game.ID, Date, Season and Season.Type columns, among others. For each game, the recorded events are listed in chronological order. These events include player substitutions, stoppages, shots, hits and more. Additional information on the events and the game state is given by the remaining columns.
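For instance, you can preview a handful of these columns with dplyr's select():

# Preview key columns
pbp %>%
  select(Game.ID, Date, Season, Event, ev.team) %>%
  head()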

The Play-By-Play is the starting point for everything on Corsica and for the majority of my analysis. We can begin by gleaning something relatively simple from this raw data, like every team's 5-on-5 Corsi percentage. The first step is to subset the PBP into 5-on-5 events only:

# Subset 5v5
pbp %>%
  filter(Strength.State == "5v5") ->
  pbp_5

%>% is the pipe operator, supplied by the dplyr package. It's used to pass forward an object to be used as the first argument of the following function. This is a concept I'll cover in greater detail in a future lesson. For now, think of the command x %>% function() as equivalent to function(x). The filter() function, from dplyr, is similar to subset().
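For example, these two commands produce the same result:

# Equivalent ways to view the first rows of pbp_5
head(pbp_5)
pbp_5 %>% head()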

Now that we've whittled down the PBP to exclude non-5v5 events, we can proceed to compile team Corsi by counting shot attempts for and against each team. The Event types that belong to the shot attempt family are "GOAL", "SHOT", "MISS" and "BLOCK". In each case, the ev.team column indicates the team whose player attempted the shot (an important distinction when considering blocked shots).
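If you'd like to verify which Event types appear in the subset and how often, a quick tabulation does the trick:

# Tabulate event types
table(pbp_5$Event)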

The following code makes use of yet more functionality provided by the invaluable dplyr package. Used in tandem, the group_by() and summarise() functions will split a data frame into groups and summarize each group according to specifications. In our case, we'll group shot attempts by team (first home, then away) and count the total number for and against for each group. Additionally, we'll gather other useful information like Games Played, Time On Ice and Goals For and Against. After binding these home and away team summaries, we'll summarise() the lot and store the tidy data frame to a new object. First, here's a minimal toy illustration of the group_by() and summarise() pattern, using made-up data:
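# Toy example: games played and goals per team, from made-up data
data.frame(Team = c("TOR", "TOR", "MTL"),
           Goals = c(1, 2, 3)
           ) %>%
  group_by(Team) %>%
  summarise(GP = n(),
            GF = sum(Goals)
            )

Applied to the full PBP, the same pattern compiles team Corsi: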

# Compile shots
bind_rows(pbp_5 %>%
            group_by(Home.Team) %>%
            rename(Team = Home.Team) %>%
            summarise(Venue = "Home",
                      GP = length(unique(Game.ID)),
                      TOI = sum(na.omit(Event.Length))/60,
                      CF = sum(Event %in% c("GOAL", "SHOT", "MISS", "BLOCK") & ev.team == Team),
                      CA = sum(Event %in% c("GOAL", "SHOT", "MISS", "BLOCK") & ev.team == Away.Team),
                      GF = sum(Event == "GOAL" & ev.team == Team),
                      GA = sum(Event == "GOAL" & ev.team == Away.Team)
                      ),

          pbp_5 %>%
            group_by(Away.Team) %>%
            rename(Team = Away.Team) %>%
            summarise(Venue = "Away",
                      GP = length(unique(Game.ID)),
                      TOI = sum(na.omit(Event.Length))/60,
                      CF = sum(Event %in% c("GOAL", "SHOT", "MISS", "BLOCK") & ev.team == Team),
                      CA = sum(Event %in% c("GOAL", "SHOT", "MISS", "BLOCK") & ev.team == Home.Team),
                      GF = sum(Event == "GOAL" & ev.team == Team),
                      GA = sum(Event == "GOAL" & ev.team == Home.Team)
                      )
          ) %>%
  data.frame() %>%
  group_by(Team) %>%
  summarise(GP = sum(GP),
            TOI = sum(TOI),
            CF = sum(CF),
            CA = sum(CA),
            GF = sum(GF),
            GA = sum(GA)
            ) %>%
  mutate(Corsi = CF/(CF + CA),
         Goals = GF/(GF + GA)
         ) %>%
  data.frame() ->
  team_corsi

Now, we can list the top and bottom five teams ranked by Corsi percentage (scales::percent() formats the proportions; the scales package is installed alongside ggplot2):

# Top 5 teams
team_corsi %>%
  arrange(desc(Corsi)) %>%
  mutate(Corsi = scales::percent(Corsi)) %>%
  select(Team, GP, CF, CA, Corsi) %>%
  slice(1:5)

# Bottom 5 teams
team_corsi %>%
  arrange(Corsi) %>%
  mutate(Corsi = scales::percent(Corsi)) %>%
  select(Team, GP, CF, CA, Corsi) %>%
  slice(1:5)

Alternatively, we can visualize the data with a plot. A suitable choice is a scatter plot, with Corsi For and Against rates per 60 minutes on the x- and y-axes respectively. The first step is to compute these rates using the data we've stored in team_corsi:

## Scatter plot
# Compute rates
team_corsi %>%
  mutate(CF60 = CF/TOI*60,
         CA60 = CA/TOI*60,
         GF60 = GF/TOI*60,
         GA60 = GA/TOI*60
         ) ->
  newdata

The ggplot2 package offers a wide array of plotting tools beyond the base R options. With it, we can generate a quick and easy scatter plot with labels. Note the scale_y_reverse() call: because fewer shot attempts against is better, flipping the y-axis places the strongest possession teams toward the top-right of the chart:

ggplot(newdata, 
       aes(x = CF60, y = CA60)
       ) +
  geom_point(size = 3,
             alpha = 0.8,
             colour = "darkblue"
             ) +
  geom_text(aes(label = Team),
            vjust = -1,
            size = 3,
            fontface = "bold"
            ) +
  theme_bw() +
  scale_y_reverse()
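If you'd like to keep the chart, ggplot2's ggsave() function will write the most recent plot to disk (the filename and dimensions here are just examples):

# Save the most recent plot to a file of your choosing
ggsave("corsi_scatter.png", width = 8, height = 6)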

Finally, we may want to evaluate the correlation between the rate of teams' shot attempts and the rate of goals. Visualizing a data set is always a good place to start when tackling a regression problem. To this end, we can use the basic plotting capabilities offered by R:

## Correlation
# Plot data
plot(newdata$CF60, newdata$GF60)

Despite the limited number of data points, we can observe a slight positive trend. We'll use the lm() function to perform a linear regression, then print out a summary of the fit:

# Linear regression
fit1 <- lm(data = newdata,
           formula = GF60 ~ CF60
           )

summary(fit1)
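To overlay the fitted line on the base scatter plot, you can pass the model object directly to abline():

# Redraw the scatter plot and add the fitted regression line
plot(newdata$CF60, newdata$GF60)
abline(fit1)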

Next, we'll take on something a little more challenging. Using the PBP, I'll develop a very rudimentary player rating to approximate shooter and goalie talent.

The basis will be a regularized logistic regression model, where the response variable is the result of an unblocked shot attempt – whether or not it became a goal. The output of a logistic regression model is interpretable as a probability, since the sigmoid function always produces values between 0 and 1. Thus, a logistic regression model will approximate the probability of a binary outcome as a function of its explanatory variables.
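To make this concrete, the sigmoid function is simple to write yourself, and R's built-in plogis() is equivalent:

# The sigmoid function maps any real number into (0, 1)
sigmoid <- function(z) {
  1/(1 + exp(-z))
}

sigmoid(c(-5, 0, 5))
plogis(c(-5, 0, 5))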

In simple terms, we'll fit a logistic regression model to the binary response (goal or no goal?) and a collection of independent variables containing information about the shot attempt that we expect might help predict the outcome. The player ratings will arise from including which players are involved – in the act of shooting, or tending the goal – as explanatory variables in the regression.

Again, I'll start with some data preparation. The idea is to normalize fields like Strength.State and Score.Cat, which are recorded from the home team's perspective, to the shooting team's perspective:

# Prepare data
bind_rows(pbp %>%
            filter(Event %in% c("GOAL", "SHOT", "MISS"),
                   Strength.State %in% c("3v4", "3v5", "4v5", "5v5", "4v4", "3v3", "5v3", "5v4", "4v3"),
                   ev.team == Away.Team
                   ) %>%
            mutate(Goalie = Home.Goalie,
                   Venue = "Away",
                   Score.Cat = -Score.Cat,
                   Strength.State = str_rev(Strength.State)
                   ),

          pbp %>%
            filter(Event %in% c("GOAL", "SHOT", "MISS"),
                   Strength.State %in% c("3v4", "3v5", "4v5", "5v5", "4v4", "3v3", "5v3", "5v4", "4v3"),
                   ev.team == Home.Team
                   ) %>%
            mutate(Goalie = Away.Goalie,
                   Venue = "Home"
                   )
          ) %>%
  data.frame() %>%
  rename(Shooter = p1) %>%
  select(Event,
         Shooter,
         Goalie,
         xG,
         Strength.State,
         Venue,
         Score.Cat
         ) %>%
  mutate(is.Goal = 1*(Event == "GOAL")) %>%
  data.frame() ->
  fulldata

The remaining mise en place is required to format the data in a way that is agreeable to the glmnet package, which we'll use to perform the logistic regression. glmnet expects two matrices containing the response and input variables. It's less important that you understand this step in detail, but feel free to examine the matrices before proceeding:

# Functions
do_shooter <- function(shooter, data) {...}
do_goalie <- function(goalie, data) {...}
do_strength <- function(strength, data) {...}
do_score <- function(score, data) {...}

# ... (additional preparation code omitted here; see the attached file)

ymat <- as.matrix(fulldata$is.Goal)

unique(fulldata$Shooter) %>%
  lapply(do_shooter,
         data = fulldata
         ) %>%
  unlist() %>%
  matrix(ncol = length(unique(fulldata$Shooter)),
         byrow = FALSE
         ) ->
  shootermat

unique(fulldata$Goalie) %>%
  lapply(do_goalie,
         data = fulldata
         ) %>%
  unlist() %>%
  matrix(ncol = length(unique(fulldata$Goalie)),
         byrow = FALSE
         ) ->
  goaliemat

unique(fulldata$Strength.State) %>%
  lapply(do_strength,
         data = fulldata
         ) %>%
  unlist() %>%
  matrix(ncol = length(unique(fulldata$Strength.State)),
         byrow = FALSE
         ) ->
  strengthmat

unique(fulldata$Score.Cat) %>%
  lapply(do_score,
         data = fulldata
         ) %>%
  unlist() %>%
  matrix(ncol = length(unique(fulldata$Score.Cat)),
         byrow = FALSE
         ) ->
  scoremat

xmat <- cbind(shootermat,
              goaliemat,
              strengthmat,
              scoremat,
              fulldata$xG,
              as.numeric(as.factor(fulldata$Venue))
              )

The object xmat contains a set of boolean indicators for the involvement of players and for the various possible game states. For example, for each possible shooter, there is an indicator variable with the value 1 if the player assumes the role of shooter in a given observation and 0 otherwise.
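That description also suggests what the elided helper functions are doing. As a rough sketch – an assumption on my part, not the actual Corsica code – do_shooter() could be written as follows, with the other helpers following the same pattern:

# Hypothetical sketch: a 0/1 indicator of whether the given player
# is the shooter in each observation
do_shooter <- function(shooter, data) {
  1*(data$Shooter == shooter)
}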

The glmnet package offers a comprehensive collection of regularized regression methods. We'll apply the ridge penalty, cross-validating on 4 folds to optimize the penalty term lambda:

# Regression
registerDoMC(cores = 4)
cv <- cv.glmnet(xmat,
                ymat,
                family = "binomial",
                nfolds = 4,
                nlambda = 100,
                alpha = 0,
                parallel = TRUE
                )

Note: This may take several minutes to run. In order to speed up the process, I use parallel computing on 4 cores via doMC (which relies on forking and is not available on Windows). If your computer does not support this, omit the registerDoMC() command and set parallel = FALSE.
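Once it completes, the cross-validated optimum of the penalty term is stored in the fitted object:

# Optimal lambda selected by cross-validation
cv$lambda.min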

Next, we construct the player ratings by decanting the coefficients produced in the regression. Exponentiating each player coefficient yields an approximate multiplier on the baseline odds of any unblocked shot becoming a goal, independent of the influence of all other variables. (A shooter rating of 1.20, for example, would imply that the player's involvement multiplies the odds of a goal by roughly 1.2.) By including xG, we isolate pure shooting talent and goalie ability from the goal likelihood expected given the circumstances of each shot.

ratings <- as.numeric(exp(coef(cv, s = "lambda.min")))

variables <- c("Intercept",
               unique(fulldata$Shooter),
               unique(fulldata$Goalie),
               unique(fulldata$Strength.State),
               unique(fulldata$Score.Cat),
               "xG",
               "Venue"
               )

table <- data.frame(variables, ratings)

shooters <- filter(table, variables %in% fulldata$Shooter)
goalies <- filter(table, variables %in% fulldata$Goalie)
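From here, a natural next step (one not part of the original script) is to rank the players. Because each rating multiplies the odds of a goal, higher values are better for shooters and lower values are better for goalies:

# Ten highest-rated shooters
shooters %>%
  arrange(desc(ratings)) %>%
  slice(1:10)

# Ten lowest-rated (i.e. best) goalies
goalies %>%
  arrange(ratings) %>%
  slice(1:10)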

And thus ends my demonstration, through which I hope you've gained an interest in studying and perfecting the art of R. I'll reiterate that it's normal not to have completely understood every part of this lesson. Luckily, many of these concepts will be expanded upon in the next section!
