Bnad Final Project
Created By:
➢ Gabby Lambert
➢ Orlando Acosta
➢ Jack Rahe
➢ Molly Flannery
➢ Aidan Ross
Company:
➢ ClubCorp
Group Number:
➢ #42
Table of Contents
Introduction
Data Exploration
Statistical Analysis
  Overview of the Raw Data
  Choosing the Best Model
  Linear Regression Model
Relevance of Findings
Recommendations
Conclusion
Appendix
  Table of Figures
  Section 1: Regression Output for Predicting Attendance
  Section 2: Residual Plots for Common Violations
  Section 3: Correlation Matrix for Common Violations
Introduction
ClubCorp assigned our team the task of finding the most profitable college football stadiums. We
were provided with raw data that contained components such as game information, stadium
location, and the weather of that specific game. Our team utilized that data to create variables
and different models that could help predict total attendance at a college football game. We
decided to use attendance as a dependent variable since having more people in the stadium
would mean more tickets sold and thus more profit. Some of the original variables that we used from
the raw data included minimum temperature, current wins, current losses, and tailgating. We
then created the following 10 variables from the data we were provided, where (d) marks a dummy
variable: rain (d), power five (d), away rank (d), home rank (d), big four network (d), year,
tv (d), home score, away score, and cold days (d).
The model that we found to be the most accurate was a linear regression with an R² of 77%
and a standard error (Se) of 11,999 people. Based on the visualizations that we created in Tableau
and our statistical analysis, we have several recommendations. First, we
recommend operating at Beaver Stadium (Penn State), Bryant-Denny Stadium (Alabama), and
Kyle Field (Texas A&M). If those options are not available, we suggest a stadium located
in the SEC that allows tailgating. This document explains those recommendations
in more detail. Specifically, this report covers the research
team's data exploration, statistical analysis, relevance of findings, and recommendations.
Data Exploration
To better understand and visualize the data, we created multiple visuals, which can be viewed
through this link to Tableau. In total, five dashboards were created. The first dashboard was a dot
map of average attendance at a college football game. We sized the dots by average attendance,
so larger dots indicate higher attendance, and we highlighted the top five stadiums in red to
identify which stadiums would be the most profitable. We found that Beaver Stadium (Penn
State), Bryant-Denny Stadium (Alabama), Kyle Field (Texas A&M), Oklahoma Memorial
Stadium (Oklahoma), and Memorial Stadium (Nebraska) had the highest average attendance.
The second dashboard illustrated average attendance based on wins and losses. The overall trend
that we discovered from this illustration was that average attendance rose steadily as a team's
current wins increased and fell steadily as its current losses increased.
In our third dashboard we created a line graph of the average attendance for
each of the power five conferences. We found that the SEC had the highest average
attendance at 79,180 people. The Big Ten had the second highest average attendance at 70,590
people. The Big 12 had the third highest average attendance at 52,689 people. The ACC ranked fourth
with 52,441 people and the Pac-12 was last with 50,470 people.
The fourth dashboard represented the average attendance per game during different times of the
day. The categories that were created for the times of day included: afternoon, early, and late.
This visualization helped display that the early games had the highest average attendance
followed by afternoon games and then late games.
The last dashboard that we created was a scatterplot that displayed the average attendance per
game when it snowed or rained. From this visualization we were able to see that more people
attended the games when it rained a little, whereas fewer people attended the games when it
snowed.
Statistical Analysis
Overview of the Raw Data
When our group first took on this project for ClubCorp, we were given raw Excel data that came
with information on the game, the stadium, and the weather. Under the game tab, the raw data
included the game ID, date, home rank, time, opponent, opponent rank, result, current
wins, current losses, conference, whether there was a new coach, and whether the game was
televised. Under the stadium tab, the raw data included the game ID, site, attendance, stadium
capacity, stadium fill rate, and whether tailgating was allowed. The last tab was for weather, and
the original data included the game ID, rain in inches, snow in inches, maximum temperature, and
minimum temperature.
First, our team started cleaning up the data in the stadium tab. Under the “Site” column we
noticed that the stadium name, the city, and the state name were all combined in the same
cell. To fix this issue we created separate columns for the site name, the city, and the state. We
used the LEN (length), SEARCH, and RIGHT commands to separate the three fields.
Some sites had blank cells for their city and state, so we solved this issue by
filtering their respective columns and manually entering the city and state. When some city and
state cells contained unnecessary words, we used the TRIM command to remove them. In preparation
for future regression tests, we created a dummy variable column for tailgating that returned a one
if tailgating was allowed and a zero if it was not allowed.
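As an illustration only, the same cleanup can be sketched outside Excel. The sketch below assumes a hypothetical "Stadium, City, State" layout for the Site cell (the report does not state the actual delimiter), and the function names `split_site` and `tailgating_dummy` are ours, not part of the original workbook.

```python
def split_site(site: str) -> tuple[str, str, str]:
    """Split a combined site string into (stadium, city, state).

    Assumes a hypothetical "Stadium, City, State" layout; the raw file's
    actual delimiter is not stated in the report.
    """
    # Split from the right and strip extra spaces from each piece.
    parts = [part.strip() for part in site.rsplit(",", 2)]
    # Pad missing pieces with blanks so they can be filled in manually.
    parts += [""] * (3 - len(parts))
    stadium, city, state = parts
    return stadium, city, state


def tailgating_dummy(allowed: bool) -> int:
    """Return 1 if tailgating is allowed and 0 otherwise."""
    return 1 if allowed else 0


print(split_site("Beaver Stadium, University Park, PA"))
print(split_site("Kyle Field"))
print(tailgating_dummy(True))
```

Rows that come back with a blank city or state correspond to the cells that had to be entered manually.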
After cleaning up the stadium data we shifted to the weather tab. The raw data here was
well organized, so we made no changes to the original data. However, we did create additional
columns to be used as future variables. The first two columns that we created were rain and
snow dummy variables that returned a one if that respective weather condition occurred and a zero
if it did not. We then created two additional dummy variable columns for hot days and cold days.
Each returned a one if that specific game was hot or cold, respectively, and a zero otherwise.
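A minimal sketch of these weather dummies follows. The report does not state the cutoffs used for hot and cold days or which temperature column each was based on, so the thresholds (`hot_above`, `cold_below`) and the choice of maximum temperature for hot days and minimum temperature for cold days are placeholder assumptions.

```python
def weather_dummies(rain_in, snow_in, max_temp, min_temp,
                    hot_above=90.0, cold_below=32.0):
    """Return the four weather dummy variables for one game.

    hot_above and cold_below are hypothetical cutoffs, not values
    taken from the report.
    """
    return {
        "rain": 1 if rain_in > 0 else 0,          # any measurable rain
        "snow": 1 if snow_in > 0 else 0,          # any measurable snow
        "hot_day": 1 if max_temp > hot_above else 0,
        "cold_day": 1 if min_temp < cold_below else 0,
    }


print(weather_dummies(rain_in=0.2, snow_in=0.0, max_temp=75, min_temp=50))
```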
The last tab we worked on was the game tab. We started this cleanup by creating a time category
column. We used an IF command that returned “Early” if the game time was less than or equal to
12:00:00 PM, “Afternoon” if the game time was later than 12:00:00 PM but earlier than 6:00:00 PM,
and “Late” if the game time was later than 6:00:00 PM. Next, we used the game's date to create
separate columns for the day, month, and year. The Text to Columns command was used
to separate the date into its own columns. Using the filter option, we deleted one game
in January and one game in April because they were outliers. Another issue we noticed in the
data was that the game result and the score were placed in the same cell. To fix this issue, we
used the Text to Columns command once more, which allowed us to create a home score
column and an away score column. Once complete, we used the filter tool on the result column,
the home score column, and the away score column. We deleted four games that had a
result other than a win or a loss. We then deleted eight games that had blanks or dates
in the home and away score columns.
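The time category rule described above can be sketched as a small function. The rule as written leaves a kickoff at exactly 6:00 PM unassigned; this sketch assumes such a game falls into “Late”, which is what a nested IF with a final else branch would do.

```python
from datetime import time


def time_category(kickoff: time) -> str:
    """Classify a kickoff time as Early, Afternoon, or Late."""
    if kickoff <= time(12, 0):
        return "Early"
    if kickoff < time(18, 0):
        return "Afternoon"
    # Assumption: a kickoff at exactly 6:00 PM is treated as Late.
    return "Late"


for kickoff in (time(12, 0), time(15, 30), time(19, 0)):
    print(kickoff, time_category(kickoff))
```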
Furthermore, our team created dummy variables for early games, afternoon games, if the home
team was ranked, if the away team was ranked, if the game was televised, if the game was
televised on a big four network (ABC, FOX, CBS, ESPN), if the team won, if there was a new
coach, if the conference was the SEC, and if the conference was power five. A one was returned
if the listed condition was true and a zero if it was not. The home score and the
away score were used as continuous variables.
➢ Se: On average, our prediction of total attendance is off by about 11,999 people.
➢ The following variables had p-values smaller than the 10% significance level and were
therefore found to be significant:
▪ Rain (d), Power 5 (d), Away Rank (d), Home Rank (d), Big 4 Network (d), Tailgating
(d), TV (d), Cold Days (d), Year, Home Score, Away Score, Current Wins, Current
Losses, Min Temp
➢ The following variables had p-values greater than the 10% significance level and were
therefore found to be insignificant:
▪ Month, Day, Early (d), Afternoon (d), Win, New Coach (d), SEC Conference (d),
Hot Days (d)
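The 10% rule used above can be illustrated with the p-values reported in the regression output in the Appendix. The p-values of the dropped variables are not reproduced in this report, so the sketch covers only the 14 retained variables; the helper name `split_by_significance` is ours.

```python
# P-values copied from the regression output in the Appendix.
P_VALUES = {
    "Rain (d)": 2.51594e-06,
    "Minimum Temperature": 1.26004e-05,
    "Power 5 (d)": 0.0,        # reported as 0 in the output
    "Away Rank (d)": 8.196e-11,
    "BIG 4 Networks (d)": 7.84085e-28,
    "Tailgating (d)": 0.0,     # reported as 0 in the output
    "Year": 7.59581e-17,
    "Home Rank (d)": 3.04377e-31,
    "TV (d)": 2.91885e-51,
    "Home Score": 0.075355689,
    "Away Score": 8.18061e-10,
    "Current Wins": 3.28605e-20,
    "Current Losses": 7.73524e-23,
    "Cold Days (d)": 0.007757211,
}


def split_by_significance(p_values, alpha):
    """Return (significant, insignificant) variable names at level alpha."""
    significant = [name for name, p in p_values.items() if p < alpha]
    insignificant = [name for name, p in p_values.items() if p >= alpha]
    return significant, insignificant


sig_10, insig_10 = split_by_significance(P_VALUES, 0.10)
sig_05, insig_05 = split_by_significance(P_VALUES, 0.05)
print("Insignificant at 10%:", insig_10)
print("Insignificant at 5%:", insig_05)
```

Note that Home Score passes the 10% threshold but would not pass a stricter 5% threshold, so its effect is the least certain of the retained variables.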
Common Violations:
➢ Non-Linear Pattern:
o No violations were discovered for non-linear patterns.
➢ Heteroskedasticity:
o The current wins and current losses variables did appear to have changing
variability. The residuals were widely spread on the left side of the residual plots
and narrowed toward the right (a funnel shape). The standard errors for these
coefficient estimates may therefore be biased and should be interpreted with
caution.
➢ Endogeneity:
o There were no violations with endogeneity.
➢ Multicollinearity:
o No violations were present for multicollinearity.
➢ See the following figures for the residual plots and correlation matrix:
o Figure 2 – Year
o Figure 3 – Home Score
o Figure 4 – Away Score
o Figure 5 – Current Wins
o Figure 6 – Current Losses
o Figure 7 – Minimum Temperature
o Figure 8 – Correlation Matrix
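The funnel shape described under heteroskedasticity can also be checked numerically. The sketch below is a rough variance-ratio check in the spirit of a Goldfeld-Quandt comparison, not the method used in this report: it compares the residual variance for low and high values of a predictor. The data are synthetic and purely illustrative, not the project's residuals.

```python
from statistics import variance


def variance_ratio(x, residuals):
    """Ratio of residual variance in the lower half of x to the upper half.

    A ratio far from 1 suggests the spread of the residuals changes with x.
    """
    pairs = sorted(zip(x, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return variance(low) / variance(high)


# Synthetic residuals whose spread shrinks as x grows (illustration only).
x = [w for w in range(12) for _ in range(2)]
residuals = [sign * (12 - w) * 1000.0 for w in range(12) for sign in (1, -1)]

ratio = variance_ratio(x, residuals)
print(f"Variance ratio (low / high): {ratio:.2f}")
```

A ratio well above 1, as in this synthetic case, matches a funnel that narrows from left to right.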
Coefficient Interpretations:
➢ Current Wins: As current wins increase by one win, total attendance increases by 726
people, on average and all else constant.
➢ Current Losses: As current losses increase by one loss, total attendance decreases by 847
people, on average and all else constant.
➢ Rain (d): Games with rain receive 1,616 fewer attendees than games without rain, on
average and all else constant.
➢ Power Five (d): Games in a power five conference receive 19,371 more
attendees than games that are not in a power five conference, on average and all
else constant.
➢ Away Rank (d): Games that have a ranked away team receive 2,924 more attendees than
games that have no ranked away team, on average and all else constant.
➢ Big Four Network (d): Games that are televised on a big four network (ABC, FOX,
CBS, ESPN) receive 5,228 more attendees than games that are not on a big four network,
on average and all else constant.
➢ Tailgating (d): Games that allow tailgating receive 29,631 more attendees than games
that do not allow tailgating, on average and all else constant.
➢ Year: As the year increases by one year, total attendance decreases by 274 people, on
average and all else constant.
➢ Home Rank (d): Games that have a ranked home team receive 5,458 more attendees than
games that do not have a ranked home team, on average and all else constant.
➢ TV (d): Games that are televised receive 6,780 more attendees than games that are not
televised, on average and all else constant.
➢ Home Score: As the home score increases by one point, total attendance increases by 19
people, on average and all else constant.
➢ Away Score: As the away score increases by one point, total attendance decreases by 73
people, on average and all else constant.
➢ Minimum Temperature: As the minimum temperature increases by one degree, total
attendance increases by 75 people, on average and all else constant.
➢ Cold Days (d): Games that are on cold days receive 1,368 fewer attendees than games that are
not on cold days, on average and all else constant.
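To show how these interpretations combine into a prediction, the sketch below plugs the coefficients from the regression output in the Appendix into the linear equation. The example game and the dictionary keys are hypothetical inputs chosen for illustration, not a game from the data.

```python
# Coefficients copied from the regression output in the Appendix.
INTERCEPT = 569317.16
COEFFICIENTS = {
    "rain": -1616.01,
    "min_temp": 75.02,
    "power_five": 19370.81,
    "away_rank": 2924.33,
    "big_four_network": 5227.78,
    "tailgating": 29630.65,
    "year": -274.05,
    "home_rank": 5458.43,
    "tv": 6779.83,
    "home_score": 19.47,
    "away_score": -73.00,
    "current_wins": 725.98,
    "current_losses": -847.11,
    "cold_day": -1368.26,
}


def predict_attendance(game):
    """Predicted attendance = intercept + sum of coefficient * value."""
    return INTERCEPT + sum(COEFFICIENTS[name] * game[name] for name in COEFFICIENTS)


# Hypothetical game used only to illustrate the calculation.
game = {
    "rain": 0, "min_temp": 55, "power_five": 1, "away_rank": 0,
    "big_four_network": 1, "tailgating": 0, "year": 2015, "home_rank": 1,
    "tv": 1, "home_score": 31, "away_score": 17, "current_wins": 5,
    "current_losses": 2, "cold_day": 0,
}
with_tailgating = dict(game, tailgating=1)

difference = predict_attendance(with_tailgating) - predict_attendance(game)
print(f"Predicted attendance without tailgating: {predict_attendance(game):,.0f}")
print(f"Added attendance from tailgating: {difference:,.0f}")
```

Switching only the tailgating dummy from zero to one changes the prediction by exactly the tailgating coefficient, which is what "all else constant" means in the interpretations above.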
Relevance of Findings
An important discovery that we made was that tailgating had the largest effect on attendance:
29,631 more people attended games with tailgating than games without
it. Power five conference membership was also very important, as
19,371 more people attended those games versus games that were not in a power five
conference. Those are the two biggest relationships that we would like ClubCorp to know.
When looking over the data, we also found it surprising that average attendance fell by 274
people with each additional year. We had expected attendance to rise each year because we
believe football is still a growing sport. A variable that was not significant but should be
discussed is the SEC conference dummy variable. Although the SEC is the most viewed conference in
college football, the variable did not have a large enough effect relative to all the other
conferences to be significant.
There were multiple important findings in Tableau as well. The biggest finding was that
Beaver Stadium (Penn State), Bryant-Denny Stadium (Alabama), and Kyle Field
(Texas A&M) had the highest average attendance for college football games. This could indicate
that these stadiums are more profitable. Another important discovery was that more fans attended
games if the team had more current wins.
Recommendations
Based on the regressions we ran and the data visualizations in Tableau, we have several
recommendations. Based on our findings in Tableau, we first
recommend that ClubCorp operate clubs at Beaver Stadium (Penn State), Bryant-Denny
Stadium (Alabama), and Kyle Field (Texas A&M), since these stadiums had the highest average
attendance. If these stadiums are not a viable option, we would also recommend operating at
stadiums in the SEC, as they had the highest average attendance at
79,180 people.
From the linear regression that we ran, we discovered two very important variables:
tailgating and being part of a power five conference. We recommend operating at stadiums that
allow tailgating, as tailgating is associated with 29,631 more attendees on average, all else
constant. We previously stated that the SEC is the conference with the highest average attendance,
but if the SEC isn’t the right fit, we would like ClubCorp to operate at stadiums in the other
power five conferences: the ACC, Big Ten, Big 12, and Pac-12.
Future data that we would like to acquire to improve our model includes all-time wins, whether
the football team made the playoffs the previous year, and the total population of the team's
home state. After this data is gathered, we would like to run ANOVA tests as well as descriptive
statistics. We recommend gathering this data to improve the model for the following reasons. Higher
all-time wins could mean more fans because that college football team has more history.
Making the playoffs the previous year could generate more fans because it signals that the team has
high potential. The total population of the state provides a cap on total attendance at stadiums,
which is also important to know.
Conclusion
In this report we analyzed data on college football games, the stadiums where they were played,
and the weather at those games. We recommend that ClubCorp operate at Beaver Stadium,
Bryant-Denny Stadium, or Kyle Field. If those are not viable options, then we recommend shifting to
stadiums in the SEC that allow tailgating. It is important to note that our regression model had an
R² of 77% and a Se of 11,999 people. If you would like to discuss future models, please send
us a quick email by May 10, 2023. You can reach the research team at
[email protected].
Appendix
Table of Figures
➢ Figure 4 - Away Score Variable
Section 1: Regression Output for Predicting Attendance
ANOVA
              df      SS             MS             F              Significance F
Regression    14      3.15517E+12    2.2537E+11     1565.354925    0
Residual      6350    9.14231E+11    143973444.7
Total         6364    4.0694E+12

Variable               Coefficients    Standard Error    t Stat          P-value        Lower 95%       Upper 95%
Intercept              569,317.16      65626.37042       8.675127987     5.18399E-18    440667.3185     697967.0068
Rain (d)               -1,616.01       343.0197068       -4.71112762     2.51594E-06    -2288.444058    -943.5751722
Minimum Temperature    75.02           17.16490714       4.370385139     1.26004E-05    41.3682415      108.6662686
Power 5 (d)            19,370.81       346.2156459       55.95014198     0              18692.11498     20049.5141
Away Rank (d)          2,924.33        449.3418795       6.508027337     8.196E-11      2043.467436     3805.191035
BIG 4 Networks (d)     5,227.78        475.8102718       10.9871136      7.84085E-28    4295.032724     6160.530296
Tailgating (d)         29,630.65       430.8060284       68.77955848     0              28786.12315     30475.17369
Year                   -274.05         32.77932727       -8.360574255    7.59581E-17    -338.3125488    -209.7954506
Home Rank (d)          5,458.43        466.9915661       11.6885021      3.04377E-31    4542.970756     6373.893046
TV (d)                 6,779.83        446.1233952       15.19721908     2.91885E-51    5905.282492     7654.38746
Home Score             19.47           10.94464188       1.778584803     0.075355689    -1.989219715    40.92116718
Away Score             -73.00          11.86816545       -6.150756345    8.18061E-10    -96.26380539    -49.73258248
Current Wins           725.98          78.56565926       9.240387097     3.28605E-20    571.961885      879.9923232
Current Losses         -847.11         85.77574893       -9.875838026    7.73524E-23    -1015.256832    -678.9579737
Cold Days (d)          -1,368.26       513.7458038       -2.663306737    0.007757211    -2375.377897    -361.1474237
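As a quick consistency check, the R², Se, and F statistic quoted in this report can be recomputed from the ANOVA values above. This is a sketch using the rounded figures as printed, so small differences from the spreadsheet values are expected.

```python
from math import sqrt

# Values copied from the ANOVA table above (rounded as printed).
SS_REGRESSION = 3.15517e12
SS_TOTAL = 4.0694e12
MS_REGRESSION = 2.2537e11
MS_RESIDUAL = 143973444.7

r_squared = SS_REGRESSION / SS_TOTAL        # share of variation explained
standard_error = sqrt(MS_RESIDUAL)          # Se, in attendees
f_statistic = MS_REGRESSION / MS_RESIDUAL   # overall model F test

print(f"R^2 = {r_squared:.3f}")
print(f"Se  = {standard_error:,.0f}")
print(f"F   = {f_statistic:,.1f}")
```

The results agree with the 77% R² and the Se of 11,999 people reported in the body of the document.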
Figure 2 – Year Variable: [residual plot; x-axis: Year, y-axis: Residuals]
Figure 3 – Home Score Variable: [residual plot; x-axis: Home Score, y-axis: Residuals]
Figure 4 – Away Score Variable: [residual plot; x-axis: Away Score, y-axis: Residuals]
Figure 5 – Current Wins Variable: [residual plot; x-axis: Current Wins, y-axis: Residuals]
Figure 6 – Current Losses Variable: [residual plot; x-axis: Current Losses, y-axis: Residuals]
Figure 7 – Minimum Temperature Variable: [residual plot; x-axis: Minimum Temperature, y-axis: Residuals]