Okay, so yesterday I was messing around, trying to get a handle on some tennis data. The task? Figuring out who’d win between Noskova and Townsend. Sounds simple, right? Wrong.

First, I grabbed some match history data. Found a decent CSV file online – you know, the usual stats: aces, faults, first serve percentage, all that jazz. Started by loading it into pandas in my Jupyter Notebook. Standard stuff.
Imported pandas, read the CSV:
import pandas as pd
df = *_csv('tennis_*')
Next up, cleaning the data. This is always the worst part. Missing values everywhere! Some rows were just plain garbage. So, I started dropping rows with too many NaNs. Used dropna()
, specifying a threshold. Got rid of a bunch of useless rows.
Cleaned up the NaNs:
df_cleaned = *(thresh=10) # Keep rows with at least 10 non-NaN values
Then, I had to filter the data down to just the matches involving Noskova and Townsend. I made two dataframes, one for each player, filtering the original dataframe based on player names in the ‘player1’ and ‘player2’ columns. It was a bit messy, with a lot of .loc
calls, but it got the job done.

Filtered for Noskova and Townsend matches:
noskova_df = df_cleaned[((df_cleaned['player1'] == 'Noskova') (df_cleaned['player2'] == 'Noskova'))]
townsend_df = df_cleaned[((df_cleaned['player1'] == 'Townsend') (df_cleaned['player2'] == 'Townsend'))]
After that, I was trying to build some simple features. I wanted to compare their recent performance. So I calculated the average first serve percentage, win rate, and ace count for each player over their last few matches. I used a sliding window approach, grouping by player and calculating rolling averages. This part involved a lot of trial and error with groupby()
and rolling()
.
Calculated rolling averages for key stats:
noskova_stats = noskova_*('player1')[['first_serve_percentage', 'win_rate', 'aces']].rolling(window=5).mean()
townsend_stats = townsend_*('player1')[['first_serve_percentage', 'win_rate', 'aces']].rolling(window=5).mean()
Finally, to actually “predict” the winner, I just compared these averages. If Noskova’s average first serve percentage was higher, and her win rate was also higher, then I’d predict Noskova to win. Super basic, I know. I just wanted to see if I could get a semi-reasonable prediction.

Compared the stats and made a “prediction”:
if noskova_stats['first_serve_percentage'].mean() > townsend_stats['first_serve_percentage'].mean() and noskova_stats['win_rate'].mean() > townsend_stats['win_rate'].mean():
print("Prediction: Noskova wins!")
else:
print("Prediction: Townsend wins!")
The result? Well, it was a coin flip, honestly. Sometimes the “model” predicted correctly, sometimes it didn’t. This wasn’t about building a perfect prediction model; it was about getting my hands dirty with the data and seeing how different stats relate to each other. It also reminded me how much work goes into cleaning and prepping data before you can even start doing anything interesting with it.
Ultimately, it was a fun little project, even if the prediction wasn’t super accurate. Definitely going to try something more sophisticated next time!