Since 1994, each MLB league (National and American) has been divided into three divisions: East, Central, and West.
In this format, the team with the best record in each division is the Division Champion and earns the right to play in the playoffs. However, sending only the three Division Champions to the postseason would leave an odd number of teams in each league, so a fourth team was needed. This is how the Wild Card (WC) was born in MLB: it allows the team with the best record among the non-champions in each league to make the postseason.
Since 2012, the Wild Card team has been decided in an extra round after the end of the regular season: two teams in each league qualify as WC and play a one-game playoff, and the winner advances to the Division Series.
Regardless of this playoff format, the MLB teams can be ranked from 1 to 30 based on their winning percentage.
This post visualizes the overall rank of each team in each season from 1994 to 2016. The Lahman database has the information needed to analyze this regular-season ranking, and it is available as an R package, so that package is used for this analysis.
Loading needed packages
# Loading packages
library(Lahman) # Data source
library(dplyr) # To ease the data manipulation
The following variables are needed to compute the win-loss percentage (WLP) of each team:
- Team name / ID;
- Season;
- Games won;
- Games lost.
Data preparation
This information is available in the Teams data frame of the Lahman R package. Since the analysis covers just the Wild Card Era, only data from 1994 onward is taken:
# Getting the data from the Lahman package: seasons 1994 to 2015
# (as_tibble() replaces the deprecated tbl_df())
Teams_data <- as_tibble(Teams) %>%
  select(yearID, name, teamIDBR, W, L, DivWin, WCWin, WSWin) %>%
  filter(yearID >= 1994)
Let’s add a variable with the winning percentage, WLP, and order the tibble by yearID and WLP (in descending order).
Teams_data <- mutate(Teams_data,
                     WLP = W / (W + L)) %>%
  arrange(yearID, desc(WLP))
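As a quick check of the WLP formula, here is a minimal sketch on a two-row toy data frame. The wins and losses are those of the 2001 Seattle Mariners and the 2003 Detroit Tigers as reported in this post's own summary table; the object name `toy` is only illustrative.

```r
library(dplyr)

# Wins and losses taken from the table shown later in this post
toy <- data.frame(
  Team = c("Detroit Tigers", "Seattle Mariners"),
  W    = c(43L, 116L),
  L    = c(119L, 46L),
  stringsAsFactors = FALSE
)

# Same two steps as above: compute WLP, then sort from best to worst
toy <- toy %>%
  mutate(WLP = W / (W + L)) %>%
  arrange(desc(WLP))

toy
```

The Mariners row ends up first, with a WLP of 116 / 162, and the Tigers row second, with 43 / 162.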
Now that Teams_data is ordered, let’s add a new variable with the overall rank (called OverallRank) of each team in each regular season, assigning 1 to the team with the best WLP, 2 to the team with the second-best WLP, and so on.
Teams_data <- mutate(Teams_data,
                     OverallRank = ave(WLP, yearID, FUN = seq_along))
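Because seq_along simply numbers the rows of each season, two teams with exactly the same WLP get different ranks depending on their row order. The sketch below is an alternative, not what this post uses: it compares dplyr's row_number() (same behavior as above) with min_rank() (tied teams share a rank). The toy data and team IDs are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two seasons, with a tie in the first one
toy <- data.frame(
  yearID   = c(2001L, 2001L, 2001L, 2002L, 2002L),
  teamIDBR = c("AAA", "BBB", "CCC", "AAA", "BBB"),
  WLP      = c(0.600, 0.550, 0.550, 0.500, 0.620),
  stringsAsFactors = FALSE
)

ranked <- toy %>%
  group_by(yearID) %>%
  mutate(
    RankSeq = row_number(desc(WLP)), # ties broken by row order
    RankTie = min_rank(desc(WLP))    # tied teams share the same rank
  ) %>%
  ungroup()

ranked
```

With min_rank(), the two teams tied at 0.550 both get rank 2, and no team in that season gets rank 3. Which behavior is preferable depends on whether a strict 1-to-30 ordering is needed for the plot.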
At the time of writing, the Lahman R package (version 5.0-0) did not include the 2016 data, so I built an Excel file with the 2016 season’s information from Baseball-Reference and then bound this 2016 data to the Teams_data tibble.
# Get the 2016 data from Baseball-Reference
library(readxl) # To read Excel files into R
# read_excel() already returns a tibble, so no extra conversion is needed
T2016 <- read_excel("C:/Users/1328/Documents/R projects/darh78.github.io/data/T2016.xlsx")
# Binding both tibbles
Teams_data <- rbind(Teams_data, T2016)
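Binding by rows only works cleanly when the hand-built 2016 table has the same columns as the main one. A minimal sketch of a check that could be run before binding is shown below; the two small data frames are hypothetical stand-ins for Teams_data and T2016.

```r
library(dplyr)

# Hypothetical stand-ins for the main table and the 2016 table
old_seasons <- data.frame(yearID = 2015L, teamIDBR = "AAA", W = 90L, L = 72L,
                          stringsAsFactors = FALSE)
new_season  <- data.frame(yearID = 2016L, teamIDBR = "AAA", L = 80L, W = 82L,
                          stringsAsFactors = FALSE)

# Columns present in one table but missing from the other (both should be empty)
missing_in_new <- setdiff(names(old_seasons), names(new_season))
missing_in_old <- setdiff(names(new_season), names(old_seasons))
stopifnot(length(missing_in_new) == 0, length(missing_in_old) == 0)

# bind_rows() matches columns by name, so column order does not matter
combined <- bind_rows(old_seasons, new_season)
combined
```

If a column were missing from one table, bind_rows() would fill it with NA rather than fail, which is why the explicit check comes first.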
During this period (1994-2016), four MLB franchises changed their names at least once, so I standardized each team ID to the ID that its franchise used in the 2016 season.
Teams_data <- mutate(Teams_data,
  FranchID = ifelse(teamIDBR == "ANA" | teamIDBR == "CAL" | teamIDBR == "LAA", "LAA", # Los Angeles Angels
             ifelse(teamIDBR == "FLA" | teamIDBR == "MIA", "MIA",                     # Miami Marlins
             ifelse(teamIDBR == "MON" | teamIDBR == "WAS" | teamIDBR == "WSN", "WSN", # Washington Nationals
             ifelse(teamIDBR == "TBD" | teamIDBR == "TBR", "TBR",                     # Tampa Bay Rays
                    teamIDBR)))))
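The same mapping can also be written without nesting. The sketch below uses dplyr's case_when() together with %in% on a small toy data frame; the team IDs are the ones listed above, while the object name `ids` is only illustrative.

```r
library(dplyr)

# Toy data frame with a few of the Baseball-Reference IDs used in this post
ids <- data.frame(teamIDBR = c("CAL", "ANA", "FLA", "MON", "TBD", "NYY"),
                  stringsAsFactors = FALSE)

ids <- ids %>%
  mutate(FranchID = case_when(
    teamIDBR %in% c("ANA", "CAL", "LAA") ~ "LAA", # Los Angeles Angels
    teamIDBR %in% c("FLA", "MIA")        ~ "MIA", # Miami Marlins
    teamIDBR %in% c("MON", "WAS", "WSN") ~ "WSN", # Washington Nationals
    teamIDBR %in% c("TBD", "TBR")        ~ "TBR", # Tampa Bay Rays
    TRUE                                 ~ teamIDBR # everyone else keeps its ID
  ))

ids
```

Each condition is checked in order and the first match wins, so adding another franchise only takes one more line.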
Data cleaning
Additionally, to better prepare Teams_data for the analysis, let’s modify the classes of some of the variables and give better names to some of them:
Teams_data$WLP <- as.numeric(Teams_data$WLP)
Teams_data$yearID <- as.integer(Teams_data$yearID)
Teams_data$W <- as.integer(Teams_data$W)
Teams_data$L <- as.integer(Teams_data$L)
Teams_data$OverallRank <- as.integer(Teams_data$OverallRank)
Teams_data <- Teams_data %>% rename(Season = yearID, Team = name)
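For reference, the integer conversions can be grouped into one step. This is a minimal sketch assuming dplyr 1.0.0 or later, where across() is available; the toy rows are hypothetical except for the 74-40 record of the 1994 Montreal Expos shown in the preview below.

```r
library(dplyr)

# Toy data with numeric (double) columns, as they might arrive from Excel
toy <- data.frame(
  yearID = c(1994, 1995),
  name   = c("Montreal Expos", "Team B"), # "Team B" is a made-up row
  W      = c(74, 80),
  L      = c(40, 64),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(across(c(yearID, W, L), as.integer)) %>% # convert several columns at once
  rename(Season = yearID, Team = name)

str(toy)
```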
This results in a tibble with the following preview:
Teams_data %>% slice(c(1, 100, 200, 300, 400, 500))
## # A tibble: 6 × 11
##   Season                 Team teamIDBR     W     L DivWin WCWin WSWin
##    <int>                <chr>    <chr> <int> <int>  <chr> <chr> <chr>
## 1   1994       Montreal Expos      MON    74    40   <NA>  <NA>  <NA>
## 2   1997    Milwaukee Brewers      MIL    78    83      N     N     N
## 3   2000       Montreal Expos      MON    67    95      N     N     N
## 4   2004      Minnesota Twins      MIN    92    70      Y     N     N
## 5   2007  St. Louis Cardinals      STL    78    84      N     N     N
## 6   2010 Arizona Diamondbacks      ARI    65    97      N     N     N
## # ... with 3 more variables: WLP <dbl>, OverallRank <int>, FranchID <chr>
Exploratory Data Analysis
Before analyzing or plotting anything, let’s see a summary of the variables in the Teams_data tibble:
summary(Teams_data)
##      Season         Team             teamIDBR               W
##  Min.   :1994   Length:682         Length:682         Min.   : 43.00
##  1st Qu.:1999   Class :character   Class :character   1st Qu.: 71.00
##  Median :2005   Mode  :character   Mode  :character   Median : 80.00
##  Mean   :2005                                         Mean   : 79.61
##  3rd Qu.:2011                                         3rd Qu.: 89.00
##  Max.   :2016                                         Max.   :116.00
##        L             DivWin             WCWin              WSWin
##  Min.   : 40.00   Length:682         Length:682         Length:682
##  1st Qu.: 71.00   Class :character   Class :character   Class :character
##  Median : 79.00   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 79.61
##  3rd Qu.: 89.00
##  Max.   :119.00
##       WLP          OverallRank      FranchID
##  Min.   :0.2654   Min.   : 1.00   Length:682
##  1st Qu.:0.4506   1st Qu.: 8.00   Class :character
##  Median :0.5000   Median :15.00   Mode  :character
##  Mean   :0.5000   Mean   :15.34
##  3rd Qu.:0.5556   3rd Qu.:23.00
##  Max.   :0.7160   Max.   :30.00
From this summary we can see that the maximum number of wins by a team in a single season is 116, while the maximum number of losses is 119. Let’s check which teams those were:
knitr::kable(Teams_data %>%
               filter(W == 116 | L == 119) %>%
               select(Season, Team, W, L, WLP, OverallRank))
| Season | Team | W | L | WLP | OverallRank |
|---|---|---|---|---|---|
| 2001 | Seattle Mariners | 116 | 46 | 0.7160494 | 1 |
| 2003 | Detroit Tigers | 43 | 119 | 0.2654321 | 30 |
These two records also correspond to the maximum and minimum WLP registered in this period.
Now, let’s visualize the overall rank of each team, based on its winning percentage, between 1994 and 2016.
library(ggplot2)  # To visualize results
library(ggthemes) # To format vizes
Linegraph <- ggplot(Teams_data, aes(x = Season, y = OverallRank)) +
  geom_line(color = "cadetblue3", size = .8) +
  scale_y_reverse(breaks = c(1, 30)) +
  facet_wrap(~ FranchID, ncol = 5) +
  labs(title = "Overall rank of MLB teams in regular season",
       subtitle = "based on WLP in the Wild Card Era (since 1994)",
       caption = "Data from Lahman R package 5.0-0") +
  theme_tufte() +
  theme(axis.ticks = element_blank(),
        panel.grid.major.y = element_line(colour = "gray86", linetype = "dotted", size = 0.1),
        panel.grid.minor.y = element_blank(),
        strip.text.x = element_text(size = 10, family = "serif", face = "bold", colour = "black", angle = 0),
        axis.text.x = element_text(angle = 90, hjust = 0, vjust = 1, size = 7),
        axis.text.y = element_text(angle = 0, hjust = 1, vjust = 0.5, size = 6)) +
  scale_x_continuous(breaks = c(1994, 1998, 2002, 2006, 2010, 2014))
Linegraph
This sparkline-type visualization shows some interesting things:
- Arizona and Tampa Bay have records only from 1998, when they joined MLB;
- The most consistent team over the whole period is the New York Yankees;
- Baltimore, Kansas City, Pittsburgh and Washington had very bad years for a long time;
- Most of the teams have ups and downs in the overall rank / WLP.
Now, let’s see which teams clinched the postseason and which of them became World Champions, based on their overall rank during the regular season. First, let’s add a new variable, called clinch, with this information.
Teams_data <- mutate(Teams_data,
                     clinch = ifelse((DivWin == "Y" | WCWin == "Y") & WSWin == "N", "Clinched Playoff",
                              ifelse(WSWin == "Y", "World Champion", NA)))
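To see how this labelling behaves on the different kinds of rows, including the 1994 rows where the three flags are NA, here is a minimal sketch on hypothetical toy data. It uses case_when() as an alternative to the nested ifelse() above and is meant to give the same labels for these rows.

```r
library(dplyr)

# Hypothetical rows: a 1994 team (no postseason), a champion,
# a division winner that lost, and a team that missed the playoffs
toy <- data.frame(
  Season = c(1994L, 2006L, 2006L, 2006L),
  DivWin = c(NA, "Y", "Y", "N"),
  WCWin  = c(NA, "N", "N", "N"),
  WSWin  = c(NA, "Y", "N", "N"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(clinch = case_when(
    WSWin == "Y"                 ~ "World Champion",   # checked first
    DivWin == "Y" | WCWin == "Y" ~ "Clinched Playoff", # qualified, no title
    TRUE                         ~ NA_character_       # missed out, or flags are NA
  ))

toy
```

Rows labelled NA are simply not drawn as points in the plot that follows.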
And let’s add those results to the previous sparkline plot.
Linegraph_ps <- ggplot(Teams_data, aes(x = Season, y = OverallRank)) +
  geom_line(color = "cadetblue3", size = .8) +
  geom_point(aes(shape = clinch, color = clinch, fill = clinch)) +
  scale_color_manual(name = "Team's performance",
                     breaks = c("Clinched Playoff", "World Champion"),
                     values = c("darkblue", "red3"),
                     labels = c("Clinched Playoff", "World Champion")) +
  scale_shape_manual(name = "Team's performance",
                     breaks = c("Clinched Playoff", "World Champion"),
                     values = c(21, 18),
                     labels = c("Clinched Playoff", "World Champion")) +
  scale_fill_manual(name = "Team's performance",
                    breaks = c("Clinched Playoff", "World Champion"),
                    values = c("white", "red3"),
                    labels = c("Clinched Playoff", "World Champion")) +
  scale_y_reverse(breaks = c(1, 30)) +
  facet_wrap(~ FranchID, ncol = 5) +
  labs(title = "Overall rank of MLB teams in regular season",
       subtitle = "based on WLP in the Wild Card Era (since 1994)",
       caption = "Data from Lahman R package 5.0-0") +
  theme_tufte() +
  theme(axis.ticks = element_blank(),
        panel.grid.major.y = element_line(colour = "gray86", linetype = "dotted", size = 0.1),
        panel.grid.minor.y = element_blank(),
        strip.text.x = element_text(size = 10, family = "serif", face = "bold", colour = "black", angle = 0),
        axis.text.x = element_text(angle = 90, hjust = -2, vjust = 1, size = 7),
        axis.text.y = element_text(angle = 0, hjust = 1, vjust = 0.5, size = 6)) +
  scale_x_continuous(breaks = c(1994, 1998, 2002, 2006, 2010, 2014))
Linegraph_ps
Note: There was no Postseason in 1994.
Final words
This plot shows that, in past years, the Atlanta Braves and the New York Yankees had very consistent seasons, with a top overall WLP season after season, which allowed them to clinch the playoffs several times.
Over this period, all teams have clinched the postseason at least twice. The (Florida) Marlins, Milwaukee Brewers, and Toronto Blue Jays are the three teams with the fewest postseason appearances (only two each); nevertheless, the Marlins are the only team among them with two World Series championships, which means 100% efficiency whenever they reach the postseason and the Fall Classic.
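Counts like these can be obtained from the clinch variable with a grouped summary. The following is a minimal sketch on hypothetical toy data (the franchise IDs "AAA" and "BBB" are made up); on the real tibble the same pipeline would be applied to Teams_data.

```r
library(dplyr)

# Hypothetical toy data: three seasons for each of two franchises
toy <- data.frame(
  FranchID = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  clinch   = c("Clinched Playoff", NA, "World Champion",
               NA, NA, "Clinched Playoff"),
  stringsAsFactors = FALSE
)

appearances <- toy %>%
  group_by(FranchID) %>%
  summarise(
    Postseasons = sum(!is.na(clinch)),                         # any postseason berth
    Titles      = sum(clinch == "World Champion", na.rm = TRUE) # World Series wins
  ) %>%
  arrange(Postseasons) # fewest appearances first

appearances
```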
On the other hand, the St. Louis Cardinals have not had a very consistent overall WLP over these seasons, yet they have clinched the playoffs several times.
Indeed, in 2006, although they won the NL Central Division, they had the worst WLP of the teams going to the postseason (they were 13th in the overall rank). The White Sox (6th in the overall rank), Los Angeles Angels (7th), Blue Jays (10th), Red Sox (11th) and Phillies (12th) could not make the postseason even though they had a higher WLP than the Cardinals.
The curious feat is that St. Louis went on to win the National League pennant, played the World Series against the Detroit Tigers, and won the Commissioner’s Trophy. This makes the Cardinals the only team in the Wild Card Era to win the World Series after having the worst regular-season WLP among the postseason teams (WLP = 0.516).
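A claim of this kind can be checked by looking, season by season, for the postseason team with the worst overall rank and then keeping the cases where that team was also the champion. The sketch below runs on toy data: only the rank of 13 for the 2006 champion comes from this post, while the team labels and all other values are hypothetical.

```r
library(dplyr)

# Toy data: "Team B" stands in for the 2006 champion (overall rank 13);
# every other row is made up for illustration
toy <- data.frame(
  Season      = c(2006L, 2006L, 2006L, 2007L, 2007L),
  Team        = c("Team A", "Team B", "Team C", "Team D", "Team E"),
  OverallRank = c(1L, 13L, 6L, 2L, 9L),
  clinch      = c("Clinched Playoff", "World Champion", NA,
                  "World Champion", "Clinched Playoff"),
  stringsAsFactors = FALSE
)

# For each season, keep the postseason team with the worst overall rank
worst_qualifier <- toy %>%
  filter(!is.na(clinch)) %>%
  group_by(Season) %>%
  filter(OverallRank == max(OverallRank)) %>%
  ungroup()

# Seasons in which that worst-ranked qualifier also won the World Series
worst_champions <- filter(worst_qualifier, clinch == "World Champion")
worst_champions
```

In this toy example only the 2006 row survives the last filter; on the real tibble, a single surviving row would support the statement above.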