Project Draft

Introduction

Tennis is one of the simplest yet most competitive individual sports. The result depends on just two players rather than on a team, as in most competitive sports. Because few outside factors are involved, the sport lends itself to analysis. Our group's main question is whether the age of a player has an impact on their chance of winning a match. We will also look at whether handedness has an impact on the chance of winning a match. We think that players in the 28-32 range will perform the best, as that is considered the athletic peak for most male athletes. We also think that left-handed players might have a slight advantage because most players are used to facing right-handed opponents. Finally, we will look at the height of players, as being taller might be an advantage.

Background

The main data set that we will be using is the list of ATP matches from 2022. The ATP, or Association of Tennis Professionals, is the main governing body for men's tennis. The ATP records the statistics for every match along with the statistics of every player on the tour, which makes for accurate data collection. The data set comes from the ATP and was downloaded from GitHub. Many variables are recorded for each match, but the main ones we will look at are the winner's and loser's ages, their handedness, their height, and the tournament being played. The one unusual circumstance that we will take into account when analyzing the data is the playing surface. Tennis is unusual in having three different playing surfaces (clay, hard, and grass), and players tend to do better on their preferred surface. Given the large sample size on all three surfaces, this should not affect the results we are looking for, but it is worth keeping in mind because some players do favor one surface over the others. With all this in mind, we will now calculate the winning percentage of players based on age, height, and handedness and interpret the results.

## # A tibble: 2,917 ×
## #   6
##    tourney_name
##    <chr>       
##  1 Atp Cup     
##  2 Atp Cup     
##  3 Atp Cup     
##  4 Atp Cup     
##  5 Atp Cup     
##  6 Atp Cup     
##  7 Atp Cup     
##  8 Atp Cup     
##  9 Atp Cup     
## 10 Atp Cup     
## # ℹ 2,907 more rows
## # ℹ 5 more
## #   variables:
## #   surface <chr>,
## #   winner_id <dbl>,
## #   winner_age <dbl>,
## #   winner_ht <dbl>, …

## # A tibble: 2,917 ×
## #   6
##    tourney_name
##    <chr>       
##  1 Atp Cup     
##  2 Atp Cup     
##  3 Atp Cup     
##  4 Atp Cup     
##  5 Atp Cup     
##  6 Atp Cup     
##  7 Atp Cup     
##  8 Atp Cup     
##  9 Atp Cup     
## 10 Atp Cup     
## # ℹ 2,907 more rows
## # ℹ 5 more
## #   variables:
## #   surface <chr>,
## #   loser_id <dbl>,
## #   loser_age <dbl>,
## #   loser_ht <dbl>, …

## # A tibble: 2,847 × 8
##    tourney_name surface winner_id winner_age winner_ht winner_hand height_group 
##    <chr>        <chr>       <dbl>      <dbl>     <dbl> <chr>       <chr>        
##  1 Atp Cup      Hard       200000       21.4       193 R           185 to 194 Cm
##  2 Atp Cup      Hard       133430       22.7       185 L           185 to 194 Cm
##  3 Atp Cup      Hard       105138       33.7       183 R           175 to 184 Cm
##  4 Atp Cup      Hard       105807       30.4       188 R           185 to 194 Cm
##  5 Atp Cup      Hard       106421       25.8       198 R           195 to 204 Cm
##  6 Atp Cup      Hard       133430       22.7       185 L           185 to 194 Cm
##  7 Atp Cup      Hard       134770       23         183 R           175 to 184 Cm
##  8 Atp Cup      Hard       105936       29.8       185 R           185 to 194 Cm
##  9 Atp Cup      Hard       106426       25.5       185 R           185 to 194 Cm
## 10 Atp Cup      Hard       105936       29.8       185 R           185 to 194 Cm
## # ℹ 2,837 more rows
## # ℹ 1 more variable: age_group <chr>

## # A tibble: 2,767 × 8
##    tourney_name surface loser_id loser_age loser_ht loser_hand height_group 
##    <chr>        <chr>      <dbl>     <dbl>    <dbl> <chr>      <chr>        
##  1 Atp Cup      Hard      105138      33.7      183 R          175 to 184 Cm
##  2 Atp Cup      Hard      105807      30.4      188 R          185 to 194 Cm
##  3 Atp Cup      Hard      128034      24.8      196 R          195 to 204 Cm
##  4 Atp Cup      Hard      200000      21.4      193 R          185 to 194 Cm
##  5 Atp Cup      Hard      126128      24.4      185 R          185 to 194 Cm
##  6 Atp Cup      Hard      105583      31.5      180 R          175 to 184 Cm
##  7 Atp Cup      Hard      126340      24.7      185 R          185 to 194 Cm
##  8 Atp Cup      Hard      105583      31.5      180 R          175 to 184 Cm
##  9 Atp Cup      Hard      126214      24.5      188 L          185 to 194 Cm
## 10 Atp Cup      Hard      105583      31.5      180 R          175 to 184 Cm
## # ℹ 2,757 more rows
## # ℹ 1 more variable: age_group <chr>

These data frames are our starting point. They show the key stats for both winners and losers that we need for further interpretation.
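The height_group column above can be built by binning heights into 10 cm ranges. The following is a minimal sketch, assuming bins that start at 165 cm; the break points, the labels, and the helper name bin_height are illustrative assumptions, not the code used to produce the tables.

```r
# Hypothetical helper: bin heights (cm) into 10 cm groups labelled like the
# height_group column shown above. The break points are an assumption.
height_breaks <- c(165, 175, 185, 195, 205, 215)
height_labels <- c("165 to 174 Cm", "175 to 184 Cm", "185 to 194 Cm",
                   "195 to 204 Cm", "205 to 214 Cm")

bin_height <- function(ht) {
  # right = FALSE makes each interval [lower, upper),
  # so a height of 185 falls in "185 to 194 Cm"
  as.character(cut(ht, breaks = height_breaks,
                   labels = height_labels, right = FALSE))
}

# Heights taken from rows 1, 2, 3 and 5 of the winners table
bin_height(c(193, 185, 183, 198))
```

Heights outside the assumed range come back as NA, so the breaks would need to be widened if the data contain shorter or taller players.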

A scatter plot of age and winning percentage

## # A tibble: 200 × 4
##    winner_age  wins total win_percent
##         <dbl> <int> <int>       <dbl>
##  1       17.7     1  2917    0.000343
##  2       18.2     2  2917    0.000686
##  3       18.4     2  2917    0.000686
##  4       18.7     8  2917    0.00274 
##  5       18.8    12  2917    0.00411 
##  6       18.9    17  2917    0.00583 
##  7       19      11  2917    0.00377 
##  8       19.1     6  2917    0.00206 
##  9       19.2    12  2917    0.00411 
## 10       19.3    11  2917    0.00377 
## # ℹ 190 more rows

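# Winning probability of the younger player by age difference
library(dplyr)

# Flag matches in which the winner was the younger player
matches <- matches %>% 
  mutate(winner_younger = winner_age < loser_age)

# For each absolute age gap, compute the share of matches won by the younger player
win_age_prob <- matches %>%
  filter(!is.na(winner_age) & !is.na(loser_age)) %>%
  mutate(age_diff = abs(winner_age - loser_age)) %>%
  group_by(age_diff) %>%
  summarise(wins = sum(winner_younger),
            total = n(),
            win_prob = wins / total)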

Players around 25 years old appear to do better than both older and younger players. There is a sharp drop-off after about age 35, which makes sense because most players retire around that age. We initially predicted the peak would fall around 28, but it appears to be closer to 25. The increase over other ages is pronounced, but it spans only a few years. Note that win_percent in the table above is each age's share of all 2,917 match wins, so it also reflects how many players of each age are competing.
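Because the win_percent column in the table above divides by all 2,917 matches, one alternative is a per-age win rate: wins at a given age divided by matches played at that age. The sketch below uses made-up toy data and base R only; the data frame names toy, appearances, and win_rate are illustrative and do not appear in our analysis.

```r
# Toy data only: ages of the winner and loser in six made-up matches.
toy <- data.frame(
  winner_age = c(24, 25, 25, 31, 25, 36),
  loser_age  = c(31, 24, 36, 25, 31, 24)
)

# Stack winners and losers so each row is one player appearance in a match.
appearances <- data.frame(
  age = c(toy$winner_age, toy$loser_age),
  won = c(rep(TRUE, nrow(toy)), rep(FALSE, nrow(toy)))
)

# Win rate per age = wins at that age / matches played at that age.
win_rate <- aggregate(won ~ age, data = appearances, FUN = mean)
win_rate
```

With real data the ages would likely need rounding or grouping first (for example with the age_group column), since exact ages such as 21.4 leave very few matches per value.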

Hypothesis Test: H0: There is no significant relationship between age difference and winning probability (β = 0). Ha: There is a significant relationship between age difference and winning probability (β ≠ 0).

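Regression Model: We can use a linear regression model in which the response variable Yi is the probability of winning a match (win_prob, the share of matches won by the younger player) and the explanatory variable Xi is the absolute difference in age between the players (age_diff): Yi = β0 + β1Xi + εi. The slope β1 is the β in the hypotheses above.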

age_lm = lm(win_prob ~ age_diff, data = win_age_prob)
print(age_lm)

## 
## Call:
## lm(formula = win_prob ~ age_diff, data = win_age_prob)
## 
## Coefficients:
## (Intercept)     age_diff  
##     0.45188      0.01097

cf = coef(age_lm)
cf

## (Intercept)    age_diff 
##  0.45187598  0.01097492

We use a significance level of 0.05. In addition, the range of plausible values for the slope can be estimated using the confidence interval for β. If the p-value is less than 0.05 and the 95% confidence interval for β does not contain 0, we can reject the null hypothesis; a positive slope estimate would then indicate a positive relationship.

summary(age_lm)

## 
## Call:
## lm(formula = win_prob ~ age_diff, data = win_age_prob)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71308 -0.16428  0.02679  0.20467  0.53825 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.451876   0.031304  14.435  < 2e-16 ***
## age_diff    0.010975   0.003379   3.248  0.00127 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3109 on 372 degrees of freedom
## Multiple R-squared:  0.02758,    Adjusted R-squared:  0.02497 
## F-statistic: 10.55 on 1 and 372 DF,  p-value: 0.001267

confint(age_lm, level = 0.95)

##                   2.5 %     97.5 %
## (Intercept) 0.390320792 0.51343117
## age_diff    0.004331248 0.01761859

Based on the summary of the linear regression model, the p-value for the slope coefficient is 0.00127, which is less than the significance level of 0.05. In addition, the 95% confidence interval for β ranges from 0.0043 to 0.0176. Since this interval does not contain 0, we can reject the null hypothesis. There is a positive relationship between age difference and winning probability in tennis matches. Because win_prob is the share of matches won by the younger player, this means that as the absolute age difference increases, the younger player's winning probability increases. The R-squared of 0.028 shows, however, that age difference explains only a small part of the variation in winning probability.
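The decision rule used here (p-value below the significance level and a confidence interval that excludes 0) can also be checked in code. The following is a rough sketch on simulated data, not the ATP data; the helper name slope_is_significant and the simulated variables are assumptions made for illustration.

```r
# Hypothetical helper: apply the decision rule to one slope of a linear model.
slope_is_significant <- function(model, term, alpha = 0.05) {
  # p-value for the chosen term from the coefficient table
  p_value <- summary(model)$coefficients[term, "Pr(>|t|)"]
  # confidence interval at level 1 - alpha (a 1 x 2 matrix)
  ci <- confint(model, term, level = 1 - alpha)
  ci_excludes_zero <- ci[1, 1] > 0 || ci[1, 2] < 0
  list(p_value = p_value, ci = ci,
       reject_h0 = p_value < alpha && ci_excludes_zero)
}

# Simulated data for illustration only: a true slope of 0.01 plus noise.
set.seed(1)
sim <- data.frame(x = 1:50)
sim$y <- 0.45 + 0.01 * sim$x + rnorm(50, sd = 0.05)
sim_lm <- lm(y ~ x, data = sim)

slope_is_significant(sim_lm, "x")$reject_h0
```

For a single slope in a linear model the two conditions agree: the p-value is below alpha exactly when the matching confidence interval excludes 0, so checking both is a consistency check rather than two separate tests.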

A scatter plot of winning probability by the difference in height

In this graph there are a few extreme outliers in height difference, but for the most part it does seem that taller players do better when playing a shorter player. The exception is when the difference is only a few centimeters, where the shorter player seems to do well, which is interesting. When there is a large difference in height, the taller player appears to have an advantage, with winning percentages above the expected 50%. This matches our prediction, as taller players can usually cover the court more quickly and generate power more easily, both of which are major advantages in a sport like tennis. When we fit a linear regression to the data set, the slope of the line is about 0.007. Since heights are recorded in centimeters, this means that every centimeter of height you have on your opponent adds about 0.7 percentage points to your chance of winning. What is interesting is that the y-intercept is only 0.42, meaning that if you are the same height as your opponent you would on average have only a 0.42 chance of winning. We would expect this to be 50%; the low intercept appears to be driven by an outlying point, and without it the intercept would likely be at or near 0.5.

height_lm = lm(win_prob ~ height_diff, data = win_prob)
print(height_lm)

## 
## Call:
## lm(formula = win_prob ~ height_diff, data = win_prob)
## 
## Coefficients:
## (Intercept)  height_diff  
##    0.425746     0.006934
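To make the slope and intercept concrete, the fitted line can be evaluated at a few height differences. This is a small worked sketch using the coefficients printed above; it assumes height_diff is measured in centimeters (the height columns are in cm), and the chosen differences are arbitrary examples.

```r
# Coefficients copied from the fitted model output above.
intercept <- 0.425746
slope     <- 0.006934   # change in win_prob per 1 cm of height difference

# Predicted winning probability at a few example height differences (cm).
height_diff <- c(0, 5, 10, 20)
predicted <- intercept + slope * height_diff
data.frame(height_diff, predicted)
```

Under this line a height advantage of roughly 10 cm is needed before the predicted probability reaches about 0.5, which reflects the low intercept rather than a real disadvantage at equal height.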

Hypothesis test: H0: There is no significant relationship between height difference and winning probability (β = 0). Ha: There is a significant relationship between height difference and winning probability (β ≠ 0).

## (Intercept) height_diff 
##  0.42574642  0.00693433

We use a significance level of 0.05. In addition, the range of plausible values for the slope can be estimated using the confidence interval for β. If the p-value is less than 0.05 and the 95% confidence interval for β does not contain 0, we can reject the null hypothesis; a positive slope estimate would then indicate a positive relationship.

summary(height_lm)

## 
## Call:
## lm(formula = win_prob ~ height_diff, data = win_prob)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65458 -0.03417  0.03097  0.08033  0.32462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.425746   0.070038   6.079 4.06e-06 ***
## height_diff 0.006934   0.003905   1.776   0.0896 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2008 on 22 degrees of freedom
## Multiple R-squared:  0.1253, Adjusted R-squared:  0.08559 
## F-statistic: 3.153 on 1 and 22 DF,  p-value: 0.08964

confint(height_lm, level = 0.95)

##                    2.5 %     97.5 %
## (Intercept)  0.280495509 0.57099732
## height_diff -0.001164905 0.01503356

Based on the summary of the linear regression model, there is a non-significant relationship between the height difference between players and the probability of winning a match (β = 0.0069, p-value = 0.0896). The confidence interval for β suggests that plausible values range from -0.0012 to 0.0150. Since this contains 0, we cannot reject the null hypothesis. This means that we cannot say for certain whether or not there is an advantage for being taller than your opponent.

A bar graph of Winning percentages with relation to which hand the player uses

It actually seems that left-handed players do worse than right-handed players, which is surprising at first but makes some sense on reflection. There is no clear advantage to being left-handed in tennis, unlike in a sport such as baseball, where being left-handed has a clear advantage. It is therefore possible that more left-handed athletes take up sports in which they have a bigger advantage. Coaching may also be harder for left-handed players, since coaching someone of the opposite hand is more difficult, and this could hold them back. The 8% gap is the largest difference we have observed so far, which suggests that being right-handed might be an advantage.

# 95% confidence interval (normal approximation) for the left-handed winning percentage
se_left = sqrt(win_percent_left * (1 - win_percent_left) / total_left_players)
ci_left = win_percent_left + c(-1, 1) * 1.96 * se_left
print(ci_left)

## [1] 0.1086931 0.1573337

After constructing a confidence interval to estimate the true proportion of matches won by left-handed players, we are 95% confident that this proportion is between 0.109 and 0.157. Compared with the 21% of left-handed players across all the matches, this suggests that left-handed players do perform worse than right-handed players.
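The interval above uses the normal approximation p ± 1.96 · sqrt(p(1 − p)/n). As a hedged sketch, the same formula can be wrapped in a small function and cross-checked against the Wilson score interval that base R's prop.test returns when the continuity correction is turned off. The counts below are made up for illustration and are not taken from the ATP data; wald_ci is a hypothetical helper name.

```r
# Hypothetical helper: normal-approximation (Wald) interval for a proportion.
wald_ci <- function(p_hat, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 for a 95% interval
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Made-up counts for illustration only (not taken from the ATP data).
wins_left <- 100
n_left <- 750

wald_ci(wins_left / n_left, n_left)

# Wilson score interval from base R as a cross-check (no continuity correction).
prop.test(wins_left, n_left, correct = FALSE)$conf.int
```

With a sample of several hundred matches the two intervals should be close; larger differences would be expected only for small samples or proportions near 0 or 1.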

Final Conclusion:

In conclusion, age does appear to play a role in winning percentage. We predicted a slightly older prime, but we were only off by a few years from where a clear spike in winning percentage occurs. We expected a spike, but its size was a little surprising. What was really surprising is that left-handed players have a lower winning percentage, and by a good margin. The gap for handedness appears larger than the effects of the other two variables we looked at. We did not expect handedness to matter more than a stat like height, but it seems to have a large impact. For height, the graph does seem to show an advantage for taller players, but mainly when the height advantage is very large. We were unable to reject the null hypothesis for height, so we could not confirm our original hypothesis. Further analysis would be needed to determine whether height has an impact.