Since 1994, each MLB league (National and American) has been divided into three divisions: East, Central, and West.

In this format, the team with the best record in each division is the Division Champion and earns a playoff berth. However, sending only the three Division Champions to the postseason would leave an odd number of teams in each league, so a fourth team was needed. That is how the Wild Card (WC) was born in MLB: the team with the best record among the non-champions in each league also makes the postseason.

Since 2012, two teams in each league qualify as Wild Cards. After the regular season ends they meet in a one-game playoff, and the winner advances to the Division Series.

Regardless of the playoff format, the 30 MLB teams can be ranked from 1 to 30 based on their winning percentage.

This post visualizes the overall rank of each team in each season from 1994 to 2016. The Lahman database has the information needed to analyze this regular-season ranking, and it is available as an R package, so that package is used for this analysis.

Loading needed packages

# Loading packages

library(Lahman)  # Data source
library(dplyr)   # To ease the data manipulation

The following variables are needed to compute a winning percentage (WLP) for each team:

  • Team name / ID;
  • Season;
  • Games won;
  • Games lost.

Data preparation

This information is available in the Teams data frame of the Lahman R package. As the analysis considers only the Wild Card Era, only data from 1994 onward is kept:

# getting the data from the Lahman package: 1994 to 2015

Teams_data <- as_tibble(Teams) %>%   # as_tibble() replaces the deprecated tbl_df()
  select(yearID, name, teamIDBR, W, L, DivWin, WCWin, WSWin) %>%
  filter(yearID >= 1994)

Let’s add a variable with the winning percentage, WLP, and order the tibble by yearID and WLP (in descending order).

Teams_data <- mutate(Teams_data,
        WLP = W/(W+L)) %>%
  arrange(yearID, desc(WLP))
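As a quick check of the formula, here is a minimal sketch that applies the same `mutate()` and `arrange()` steps to the two extreme records quoted later in this post (Seattle 2001 and Detroit 2003). The `records` tibble is built by hand for illustration only; it is not part of the analysis.

```r
library(dplyr)

# Two records quoted later in this post, typed in by hand as a worked example
records <- tibble(
  Team   = c("Detroit Tigers", "Seattle Mariners"),
  yearID = c(2003L, 2001L),
  W      = c(43L, 116L),
  L      = c(119L, 46L)
)

# Same steps as above: WLP = W / (W + L), then order by season and descending WLP
records <- records %>%
  mutate(WLP = W / (W + L)) %>%
  arrange(yearID, desc(WLP))

records
```

For Seattle this is 116 / (116 + 46) = 116 / 162, and for Detroit 43 / (43 + 119) = 43 / 162.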

Now that Teams_data is ordered, let’s add a new variable with the overall rank (called OverallRank) of each team in each regular season, assigning 1 to the team with the best WLP, 2 to the team with the second-best WLP, and so on.

Teams_data <- mutate(Teams_data,
         OverallRank = ave(WLP, yearID, FUN = seq_along))
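Because `seq_along` simply numbers the rows of each season in their sorted order, two teams with the same WLP get different ranks depending on which row comes first. The sketch below, on made-up standings (the `toy` tibble and its team names are hypothetical), shows a dplyr alternative and how `min_rank()` would treat ties instead. It is an illustration of the idea, not the method used in this post.

```r
library(dplyr)

# Hypothetical standings: teams B and C finish the 2000 season tied at 90-72
toy <- tibble(
  yearID = c(2000L, 2000L, 2000L, 2001L, 2001L),
  team   = c("A", "B", "C", "A", "B"),
  W      = c(95L, 90L, 90L, 80L, 100L),
  L      = c(67L, 72L, 72L, 82L, 62L)
) %>%
  mutate(WLP = W / (W + L))

ranked <- toy %>%
  group_by(yearID) %>%
  arrange(desc(WLP), .by_group = TRUE) %>%
  mutate(
    RankSeq = row_number(),        # 1, 2, 3, ... within the season, like seq_along
    RankTie = min_rank(desc(WLP))  # tied teams share the same rank
  ) %>%
  ungroup()

ranked
```

With `row_number()` the tied teams are ranked 2 and 3; with `min_rank()` both are ranked 2.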

At the time of writing, the Lahman R package (version 5.0-0) did not include the 2016 data, so I built an Excel file with the 2016 season’s information from Baseball-Reference and then bound it to the Teams_data tibble.

# get 2016 data from baseball-reference
library(readxl)  # To read Excel files into R 

T2016 <- tbl_df(read_excel("C:/Users/1328/Documents/R projects/darh78.github.io/data/T2016.xlsx"))

# binding both tibbles
Teams_data <- rbind(Teams_data, T2016)
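Binding only works cleanly when the spreadsheet has the same columns as Teams_data (including WLP and OverallRank). A minimal sketch of that check, using two small hand-made tibbles (`older` and `newer` are hypothetical stand-ins, not the real data) and `bind_rows()` as an alternative to `rbind()`:

```r
library(dplyr)

# Hypothetical stand-ins for Teams_data and the 2016 spreadsheet
older <- tibble(yearID = 2015L, name = "Team A", W = 90L, L = 72L)
newer <- tibble(yearID = 2016,  name = "Team B", W = 85,  L = 77)

# Stop early if the two tables do not have the same set of column names
stopifnot(setequal(names(older), names(newer)))

# bind_rows() matches columns by name; integer and double columns combine to double
combined <- bind_rows(older, newer)

combined
```

Numbers read from Excel typically arrive as doubles, which is one reason the column classes are converted later in the post.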

During this period (1994-2016), four MLB franchises changed their names at least once, so I standardized each team ID to the ID its franchise used in the 2016 season.

Teams_data <- mutate(Teams_data,
                  FranchID = case_when(
                    teamIDBR %in% c("ANA", "CAL", "LAA") ~ "LAA", # Los Angeles Angels
                    teamIDBR %in% c("FLA", "MIA") ~ "MIA",        # Miami Marlins
                    teamIDBR %in% c("MON", "WAS", "WSN") ~ "WSN", # Washington Nationals
                    teamIDBR %in% c("TBD", "TBR") ~ "TBR",        # Tampa Bay Rays
                    TRUE ~ teamIDBR                               # all other teams keep their ID
                  ))

Data cleaning

Additionally, to better prepare Teams_data for the analysis, let’s modify some of the classes of the variables and give better names to some of them:

Teams_data$WLP <- as.numeric(Teams_data$WLP)
Teams_data$yearID <- as.integer(Teams_data$yearID)
Teams_data$W <- as.integer(Teams_data$W)
Teams_data$L <- as.integer(Teams_data$L)
Teams_data$OverallRank <- as.integer(Teams_data$OverallRank)
Teams_data <- Teams_data %>% rename(Season = yearID, Team = name)

The result is a tibble that looks like this:

Teams_data %>% slice(c(1, 100, 200, 300, 400, 500))
## # A tibble: 6 × 11
##   Season                 Team teamIDBR     W     L DivWin WCWin WSWin
##    <int>                <chr>    <chr> <int> <int>  <chr> <chr> <chr>
## 1   1994       Montreal Expos      MON    74    40   <NA>  <NA>  <NA>
## 2   1997    Milwaukee Brewers      MIL    78    83      N     N     N
## 3   2000       Montreal Expos      MON    67    95      N     N     N
## 4   2004      Minnesota Twins      MIN    92    70      Y     N     N
## 5   2007  St. Louis Cardinals      STL    78    84      N     N     N
## 6   2010 Arizona Diamondbacks      ARI    65    97      N     N     N
## # ... with 3 more variables: WLP <dbl>, OverallRank <int>, FranchID <chr>

Exploratory Data Analysis

Before analyzing or plotting anything, let’s see a summary of the variables in the tibble Teams_data:

summary(Teams_data)
##      Season         Team             teamIDBR               W         
##  Min.   :1994   Length:682         Length:682         Min.   : 43.00  
##  1st Qu.:1999   Class :character   Class :character   1st Qu.: 71.00  
##  Median :2005   Mode  :character   Mode  :character   Median : 80.00  
##  Mean   :2005                                         Mean   : 79.61  
##  3rd Qu.:2011                                         3rd Qu.: 89.00  
##  Max.   :2016                                         Max.   :116.00  
##        L             DivWin             WCWin              WSWin          
##  Min.   : 40.00   Length:682         Length:682         Length:682        
##  1st Qu.: 71.00   Class :character   Class :character   Class :character  
##  Median : 79.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 79.61                                                           
##  3rd Qu.: 89.00                                                           
##  Max.   :119.00                                                           
##       WLP          OverallRank      FranchID        
##  Min.   :0.2654   Min.   : 1.00   Length:682        
##  1st Qu.:0.4506   1st Qu.: 8.00   Class :character  
##  Median :0.5000   Median :15.00   Mode  :character  
##  Mean   :0.5000   Mean   :15.34                     
##  3rd Qu.:0.5556   3rd Qu.:23.00                     
##  Max.   :0.7160   Max.   :30.00

From this summary we can see that the most wins by a team in a single season is 116, while the most losses is 119. Let’s check which teams those were:

knitr::kable(Teams_data %>%
  filter(W == 116 | L == 119) %>%
  select(Season, Team, W, L, WLP, OverallRank))
| Season|Team             |   W|   L|       WLP| OverallRank|
|------:|:----------------|---:|---:|---------:|-----------:|
|   2001|Seattle Mariners | 116|  46| 0.7160494|           1|
|   2003|Detroit Tigers   |  43| 119| 0.2654321|          30|

These two records also correspond to the maximum and minimum WLP registered in this period.

Now, let’s visualize the overall rank of each team, based on its winning percentage, between 1994 and 2016.

library(ggplot2) # To visualize results
library(ggthemes)# To format vizes

Linegraph <- ggplot(Teams_data, aes(x = Season, y = OverallRank)) +
  geom_line(color = "cadetblue3", size = .8) +
  scale_y_reverse(breaks = c(1,30)) +
  facet_wrap(~ FranchID, ncol = 5) +
  labs(title = "Overall rank of MLB teams in regular season",
       subtitle = "based on WLP in the Wild Card Era (since 1994)",
       caption = "Data from Lahman R package 5.0-0")+
  theme_tufte() +
  theme(axis.ticks = element_blank(),
        panel.grid.major.y = element_line(colour = "gray86", linetype = "dotted", size = 0.1),
        panel.grid.minor.y = element_blank(),
        strip.text.x = element_text(size = 10, family = "serif", face = "bold", colour = "black", angle = 0),
        axis.text.x=element_text(angle = 90, hjust = 0, vjust = 1, size = 7),
        axis.text.y=element_text(angle = 0, hjust = 1, vjust = 0.5, size = 6)) +
  scale_x_continuous(breaks = c(1994, 1998, 2002, 2006, 2010, 2014))

Linegraph

This sparkline-type visualization shows some interesting things:

  • Arizona and Tampa Bay have records only from 1998, when they joined MLB;
  • The most consistent team over the whole period is the New York Yankees;
  • Baltimore, Kansas City, Pittsburgh and Washington had long stretches of very bad years;
  • Most of the teams have ups and downs in the overall rank / WLP.

Now, let’s see which teams clinched the postseason, and which of them became World Champions, relative to their overall rank in the regular season. First, let’s add a new variable, called clinch, with this information.

Teams_data <- mutate(Teams_data,
                     clinch = ifelse((DivWin == "Y" | WCWin == "Y") & WSWin == "N", "Clinched Playoff",
                                      ifelse(WSWin == "Y", "World Champion", NA)))
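To see how the three flags map to a label, here is a small sketch on made-up rows (the `toy` tibble is hypothetical), written with `case_when()` instead of nested `ifelse()`. It gives the same labels as the code above for the cases shown, including a 1994-style row where all flags are NA, but it is an alternative formulation rather than the code used for the plots.

```r
library(dplyr)

# Hypothetical rows covering each case, including a row with missing flags
toy <- tibble(
  Team   = c("A", "B", "C", "D", "E"),
  DivWin = c("Y", "N", "Y", "N", NA),
  WCWin  = c("N", "Y", "N", "N", NA),
  WSWin  = c("N", "N", "Y", "N", NA)
)

toy <- toy %>%
  mutate(clinch = case_when(
    WSWin == "Y"                 ~ "World Champion",   # checked first, so champions are not double-labeled
    DivWin == "Y" | WCWin == "Y" ~ "Clinched Playoff", # division or Wild Card winner
    TRUE                         ~ NA_character_       # missed the postseason, or flags are missing
  ))

toy
```

Teams with an NA label are simply not drawn by `geom_point()`, which is why only playoff teams get a marker.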

And let’s add those results to the previous sparkline plot.

Linegraph_ps <- ggplot(Teams_data, aes(x = Season, y = OverallRank)) +
  geom_line(color = "cadetblue3", size = .8) +
  geom_point(aes(shape = clinch, color = clinch, fill = clinch)) +
  scale_color_manual(name = "Team's performance",
                    breaks = c("Clinched Playoff", "World Champion"),  
                    values = c("darkblue", "red3"),
                    labels = c("Clinched Playoff", "World Champion")) +
  scale_shape_manual(name = "Team's performance",
                     breaks = c("Clinched Playoff", "World Champion"),
                     values = c(21, 18),
                     labels = c("Clinched Playoff", "World Champion")) +
  scale_fill_manual(name = "Team's performance",
                     breaks = c("Clinched Playoff", "World Champion"),
                     values = c("white", "red3"),
                     labels = c("Clinched Playoff", "World Champion")) +
  scale_y_reverse(breaks = c(1,30)) +
  facet_wrap(~ FranchID, ncol = 5) +
  labs(title = "Overall rank of MLB teams in regular season",
       subtitle = "based on WLP in the Wild Card Era (since 1994)",
       caption = "Data from Lahman R package 5.0-0")+
  theme_tufte() +
  theme(axis.ticks = element_blank(),
        panel.grid.major.y = element_line(colour = "gray86", linetype = "dotted", size = 0.1),
        panel.grid.minor.y = element_blank(),
        strip.text.x = element_text(size = 10, family = "serif", face = "bold", colour = "black", angle = 0),
        axis.text.x=element_text(angle = 90, hjust = -2, vjust = 1, size = 7),
        axis.text.y=element_text(angle = 0, hjust = 1, vjust = 0.5, size = 6)) +
  scale_x_continuous(breaks = c(1994, 1998, 2002, 2006, 2010, 2014))

Linegraph_ps

Note: There was no postseason in 1994, as the season ended early because of the players' strike.

Final words

This plot shows that, over these years, the Atlanta Braves and the New York Yankees had very consistent seasons, with a top overall WLP year after year that allowed them to clinch a playoff berth many times.

Over this period, every team has reached the postseason at least twice. The (Florida) Marlins, Milwaukee Brewers, and Toronto Blue Jays are the three teams with the fewest postseason appearances (only two each); nevertheless, the Marlins are the only one of them with two World Series championships, a 100% success rate: both times they reached the postseason they also reached, and won, the Fall Classic.
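Counts like these can be read off the clinch variable. A minimal sketch, assuming a table shaped like Teams_data; the `toy` tibble and the franchise IDs "AAA" and "BBB" are made up for illustration:

```r
library(dplyr)

# Hypothetical franchise/season rows; clinch is NA when a team missed the postseason
toy <- tibble(
  FranchID = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  Season   = c(2001L, 2002L, 2003L, 2001L, 2002L, 2003L),
  clinch   = c("Clinched Playoff", NA, "World Champion", NA, NA, "Clinched Playoff")
)

appearances <- toy %>%
  group_by(FranchID) %>%
  summarise(
    Postseasons = sum(!is.na(clinch)),                          # seasons with any postseason label
    Titles      = sum(clinch == "World Champion", na.rm = TRUE) # seasons ending in a championship
  ) %>%
  arrange(Postseasons)  # fewest appearances first

appearances
```

Running the same summary on the real tibble would list the franchises with the fewest postseason appearances at the top.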

On the other hand, the St. Louis Cardinals have not had a very consistent overall WLP over these seasons, yet they have clinched a playoff berth many times.

Indeed, in 2006, although they won the NL Central Division, they had the worst WLP of the teams going to the postseason (they were 13th in the overall rank). The White Sox (6th in the overall rank), the Los Angeles Angels (7th), the Blue Jays (10th), the Red Sox (11th) and the Phillies (12th) could not make the postseason even though they had a higher WLP than the Cardinals.
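The comparison can be expressed as a filter on OverallRank and clinch. The sketch below uses only the six teams and ranks mentioned in the paragraph above, typed in by hand as `season_2006`; it is not the full 2006 standings, and the real query would run against Teams_data for Season == 2006.

```r
library(dplyr)

# Only the teams named above, with the ranks quoted in the text (partial 2006 data)
season_2006 <- tibble(
  Team        = c("White Sox", "Angels", "Blue Jays", "Red Sox", "Phillies", "Cardinals"),
  OverallRank = c(6L, 7L, 10L, 11L, 12L, 13L),
  clinch      = c(NA, NA, NA, NA, NA, "World Champion")
)

# Lowest-ranked team that still reached the postseason
worst_qualifier <- season_2006 %>%
  filter(!is.na(clinch)) %>%
  slice_max(OverallRank, n = 1)

# Teams ranked above that team that nevertheless missed the postseason
missed <- season_2006 %>%
  filter(is.na(clinch), OverallRank < worst_qualifier$OverallRank)

missed
```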

The curious feat is that St. Louis went on to win the NL Championship Series, faced the Detroit Tigers in the World Series, and won the Commissioner’s Trophy. This makes the Cardinals the only team in the Wild Card Era to win the World Series after posting the worst regular-season WLP among that year’s playoff teams (WLP = 0.516).