CSGO team analysis with Python and Pandas part 02

Thibaut Cordenier
9 min readFeb 27, 2020

With IEM Katowice’s final showdown coming in this week end, I thought this could be a nice opportunity to examine one of the key components of the esports scene : Counter Strike Global Offensive (CS:GO).

In this second part we will clean the dataset we created in the first part of this series of articles dedicated to the professionnal teams of CS:GO. You can find the whole series on my github.

Preparing the datasets

Concatening the csv extracted from Liquipedia

There are several ways to get data from liquipedia.net/counterstrike’s API. You can send requests from a Python script (see Part 01 of this series) or use their “search engine” that does the same queries. Though you can only download one 500 rows max CSV at a time.

In our case, that means we will have to concatenate 3 of them since liquipedia.net counts 1013 CS:GO teams.

Depending on when you get these data, the overall figures might evolve drastically — it’s logical, since liquipedia.net updates its informations as soon as they happen.

To directly download the 3 csvs I use you can follow this link, it will lead you to the dedicated Liquipedia Counter Strike page.

# if you did not place your datasets in the same directory as your ipynb or py file modify the next line to match their actual location
path = os.getcwd() + r"\Data"

dfs = []
for file in os.listdir(path):
directories = path + "\\" + file #change the "\\" according to you system if you're not on Windows
df = pd.read_csv(directories)
dfs.append(df)

CSGO_Teams = pd.concat(dfs, ignore_index=True)
duplicateRowsDF = CSGO_Teams[CSGO_Teams.duplicated(keep='first')]
nbrDuplicates = len(duplicateRowsDF)
CSGO_Teams = CSGO_Teams.drop_duplicates()
print("\nConcatening over:")

#sanity check
if nbrDuplicates != 0:
print("There were",nbrDuplicates, "duplicates that were deleted from the dataframe.")
else:
print("\tNo duplicated rows detected.")
print("\tNbr of rows in the dataframe :",CSGO_Teams.shape[0])
print("\tNbr columns in the dataframe :",CSGO_Teams.shape[1])
CSGO_Teams.sample(3)
A piece of some of the 26 columns of this dataset.

Adding other statistical informations

We’ll add some other informations about the medals each team has won. Since this didn’t show up with the API we will use a Pandas feature that is great for parsing html tables : pd.read_html()

Stats = pd.read_html("https://liquipedia.net/counterstrike/Statistics/Total")
dfs_stats = []
for i in range(len(Stats)):
df = Stats[i]
dfs_stats.append(df)

df_medals = pd.concat(dfs_stats, ignore_index=True)
df_medals.drop_duplicates(inplace=True)
df_medals.drop("Earnings",axis=1,inplace=True)
df_medals.rename(columns={'Organization':'Has name',
'Unnamed: 1':'Gold_medals',
'Unnamed: 2':'Silver_medals',
'Unnamed: 3':'Bronze_medals'},
inplace=True)

df_medals["Total_Medals"] = df_medals.Gold_medals + df_medals.Silver_medals + df_medals.Bronze_medals

A quick df.shape will tell us that the dataframe has 600 rows and 5 columns. If we sort it by the “Total_Medals” column we’ll get the dataframe shown below :

df_medals.sort_values("Total_Medals",ascending=False).head()

Checking for potential multiple entries

Now let’s see if some teams appear more than once in our dataframes, whatever the reason.

CSGO_Teams[CSGO_Teams["Has name"].duplicated()]
df_medals[df_medals["Has name"].duplicated()]

Only three entries are duplicated in these datasets : 1 in CSGO_Teams and 3 in df_medals. This needs to be investigated since it will biais our analysis after our incoming merge.

But before droping these rows by their indexes let’s take a look at them.

CSGO_Teams[CSGO_Teams["Has name"].str.contains("Team X")]

Actually there were two different teams that used the same name “Team X”, one was from Sweeden and the other one from Danemark. We’ll need to specify that in the cells.

CSGO_Teams.loc[811, 'Has name'] = "Team X (Swedish team)"
CSGO_Teams.loc[812, 'Has name'] = "Team X (Danish team)"

Now let’s take a look at the duplicate entries in df_medals :

df_medals[df_medals["Has name"].str.contains("Counter Logic Gaming")]

df_medals[df_medals["Has name"] == "MIBR"]

df_medals[df_medals["Has name"] == "OverGaming"]

The double entries in df_medals are a bit more challenging to handle. According to Liquiepedia there are also two teams that share the name “Counter Logic Gaming”. They are from the same organization, but the one still active has a “Red” addendum to its name that is missing in the dataset — hence the double entry…

Therefore, according to Liquiepedia, at index 40 is “Counter Logic Gaming” and the one at index 67 is “Counter Logic Gaming Red”.

df_medals.loc[67, 'Has name'] = "Counter Logic Gaming Red"

Regarding MIBR team, there are actually two different entries on Liquiepdia : one where all letters are capital wich is a CS:GO team and the other where just first letter is in capital, who was focused in Counter-Strike and Counter-Strike Source.

Since our study is about CS:GO and because the “Mibr” entry is, therefore, not in our CSGO_Teams dataset we will get rid of this one in the df_medals dataset (index 54).

df_medals.drop([55],axis=0,inplace=True)df_medals[df_medals["Has name"] == "MIBR"]

For OverGaming, since the first row has no medal, we will simple delete it.

df_medals.drop([314],axis=0,inplace=True)df_medals[df_medals["Has name"].str.contains("OverGaming")]

Merging the two dataframes

Now it’s time to combine these two dataframes with a left join and our dataset will be ready for some cleaning. We’ll name it CSGO_Merged

CSGO_Merged = pd.merge(CSGO_Teams,df_medals,how="left",on="Has name",indicator=True)

CSGO_Merged.head()

Data cleaning

Before starting to analyse this dataset we still need to make some actions are requested :

  • Changing column names
  • Remove the “unnamed: 0” column
  • Changing “Is active” results from booleans to Active/Inactive
  • Cleaning the “Was created” and “Was disbanded” to only keep the year
  • Extracting the sponsors names in the “Has sponsor” column
  • Rearanging columns order

Since the first 3 are quite straightforward I’ll condense them below and take more time detailing the process for the others :

#changing columns'names
CSGO_Merged.rename(columns={'Has name': "team",
'Is active':"status",
'Was created':"creation_date",
'Was disbanded':"disbanding_date",
'Has earnings':"total_gains",
'Has location': "country",
'Has region': "region",
'Has sponsor':"sponsors",
'Has earnings in 2012': "2012_gains",
'Has earnings in 2013': "2013_gains",
'Has earnings in 2014': "2014_gains",
'Has earnings in 2015': "2015_gains",
'Has earnings in 2016': "2016_gains",
'Has earnings in 2017': "2017_gains",
'Has earnings in 2018': "2018_gains",
'Has earnings in 2019': "2019_gains",
'Has site' : "website",
'Has twitter': "twitter_account",
'Has twitch stream': "twitch_account",
'Has instagram': "instagram_account",
'Has facebook' : "facebook_account",
'Has youtube channel':"youtube_channel",
'Has faceit profile': "faceit_profile",
'Has vk': "VKontakte_account",
'Has esea id': "esea_id",
'Gold_medals':"gold_medals",
'Silver_medals': "silver_medals",
'Bronze_medals': "bronze_medals",
'Total_Medals':"total_medals"},
inplace=True)
#droping unnecessary columns
CSGO_Merged.drop("Unnamed: 0",axis=1,inplace=True)
#Changing "Is active" results from booleans to Active/Inactive
CSGO_Merged.status = CSGO_Merged.status.map({False: 'Inactive', True: 'Active'})

Cleaning the “Was created” and “Was disbanded” to only keep the year

First, let’s find out how many teams are active/inactive and how many empty values are in these columns

CSGO_Merged.status.value_counts(dropna=False)
#### results
#### Inactive 703
#### Active 310
CSGO_Merged.creation_date.isnull().sum()
CSGO_Merged.creation_date.isnull().mean()*100
#### results
#### 234
#### 23.099703849950643
CSGO_Merged.disbanding_date.isnull().sum()
CSGO_Merged.disbanding_date.isnull().mean()*100
#### results
#### 310
#### 30.60217176702863

If the 310 null values in the “disbanding_date” columns correspond to the 310 teams that are still active, the 234 null values in the “creation_date”column are simply missing : we have no data regarding them. That’s 23% of the total dataframe.

In addition, there are three rows (index 466, 605 and 720) not considered “null” where the creation dates is unclear with patterns like “20??-??” or “????-??-??”. They will be turned into NaN once we will use regular expressions to extract our points of interest :

- the year of creation of the team
- the year the team started playing or switched to CS:GO
- the year of their disbanding - if any

Then all these columns will be converted into datetime objects.

Switch2CSGO = CSGO_Merged.creation_date.str.extract(".*(\d{4}).*$")
CSGO_Merged.creation_date = CSGO_Merged.creation_date.str.extract("(\d{4})")
CSGO_Merged.disbanding_date = CSGO_Merged.disbanding_date.str.extract("(\d{4})")
CSGO_Merged.insert(3,"starts_playing_CSGO", Switch2CSGO)CSGO_Merged.creation_date = pd.to_datetime(CSGO_Merged.creation_date)
CSGO_Merged.creation_date = CSGO_Merged.creation_date.dt.to_period('Y')

CSGO_Merged.disbanding_date = pd.to_datetime(CSGO_Merged.disbanding_date)
CSGO_Merged.disbanding_date = CSGO_Merged.disbanding_date.dt.to_period('Y')

CSGO_Merged.starts_playing_CSGO = pd.to_datetime(CSGO_Merged.starts_playing_CSGO,errors="coerce")
CSGO_Merged.starts_playing_CSGO = CSGO_Merged.starts_playing_CSGO.dt.to_period('Y')

All of these lines give us a much nicer dataframe :

CSGO_Merged.head()

Extracting the sponsors names in the “Has sponsor” column

The data in sponsor column is a far more complex challenge than what we had so far. Not only there are several brands per rows, they are “encrypted” in html like format which make a basic cleaning or extraction with str.split(expand=True) impossible if we want to be exhaustive.

Hopefully there is nothing regex and pd.str.extractall() can solve.

First let’s examine some rows.

CSGO_Merged.sponsors.loc[[1,3,5,22,85,720]].to_frame()

For each sponsor’s name there’s an url which is an interesting piece of information that we want to keep. But if a team has more than one sponsor they are all in one row, which is problematic.

Moreover, if the main pattern is [URL1 brand1] br tag [URL2 brand2] and so on and so forth up to 18 times for a japonese team, with a lot of noticeable variations.

The presence of NaNs — where teams have no sponsors — will also hinder our sponsors extraction since pd.str.extractall(), unlike pd.str.extract(), doesn’t return them.

To get rid of that we need to transform our CSGO_Merged into a multi-index dataframe in order to be able to merge to what pd.str.extractall() will return once we’ll pass some regex patterns as arguments.

CSGO_MI = CSGO_Merged.copy()CSGO_MI.set_index(['team', 'country'], inplace=True)CSGO_MI.head()

Giving names to your regex capture groups allows pd.str.extractall() to use them as names for the columns it returns. To name a capture group put some word inside (?P<…>).

The pattern here is long, but not as complicated as it might look.

pattern = r"(?P<url>[\w\./-]+)(?P<ToBdroped1>/?\s)(?P<brand>\b.*?)\](?P<ToBdroped2><br\s/>\[)?"Extract_df = CSGO_MI["sponsors"].str.extractall(pattern)
Extract_df.info()

From just this extract we can already assume that 2211 sponsor partnerships have been made between CS:GO teams and brands. Here’s a view of our dataframe :

Now we’ll go back to a single index in order to be able to merge it later.

Extract_df.reset_index(inplace=True)
Extract_df.drop(["match","ToBdroped1","ToBdroped2"],axis=1,inplace=True)
Extract_df.head(15)

Now let’s merge it the CSGO_MI dataframe :

CSGO_MI = pd.merge(CSGO_MI,Extract_df,how="left",on=["team","country"]).drop("sponsors",axis=1)
CSGO_MI.head()

Now we can find how much sponsors each team have with pd.crosstab().

Nbr_Sponsor = pd.crosstab(CSGO_MI["team"],CSGO_MI["brand"],margins=True).iloc[:,-1].to_frame()# Nbr_Sponsor = Nbr_Sponsor.rename({"All":"nbr_sponsors"},axis=1)
Nbr_Sponsor = Nbr_Sponsor.rename({"All":"nbr_sponsors"},axis=1).reset_index()
# Nbr_Sponsor = Nbr_Sponsor.reset_index()
Nbr_Sponsor.sample(7)
Number of sponsors for 7 random teams from the “Nbr_Sponsor” dataframe.

Since the CSGO_MI dataframe is a bit big with all that brand/sponsors information — that goes as up as 18 sponsors for one japonese team — we’ll add the Nbr_Sponsor dataframe to the CSGO_Merged dataframe.

So that we will have the option : whether we want only the number of sponsors for each team, or that + the name of all the sponsors.

CSGO_MI = pd.merge(CSGO_MI,Nbr_Sponsor,how="left",on="team")
CSGO_MI.nbr_sponsors = CSGO_MI.nbr_sponsors.fillna(0).astype('int64')
CSGO_Merged = pd.merge(CSGO_Merged,Nbr_Sponsor,how="left",on="team")
CSGO_Merged.nbr_sponsors = CSGO_Merged.nbr_sponsors.fillna(0).astype('int64')

Col_List = ["nbr_sponsors","country","total_medals"]
Other_cols = [i for i in CSGO_Merged if i not in Col_List]
Start = Other_cols[0:7]
End = Other_cols[7:]
CSGO_Merged = CSGO_Merged[Start + Col_List + End]

Rearanging columns order

First we’ll create a list of the column(s) we want to move. Then we’ll create another list with the rest of the columns. And we’ll specify where we want to place or columns.

Col_List = ["nbr_sponsors","brand","url"]    
Other_cols = [i for i in CSGO_MI if i not in Col_List]
Start = Other_cols[0:7]
End = Other_cols[7:]
CSGO_MI = CSGO_MI[Start + Col_List + End]

Next steps

In the next part we’ll start to analyse this dataset. But for now, as an appetizer, let’s find out how the top 10 CS:GO teams has performed in terms of gains, and how much medals and sponsorships they have.

CSGO_Merged.groupby(["team","nbr_sponsors"]).agg({"total_gains":sum,'total_medals':(sum)}).sort_values(["total_gains","total_medals"],ascending=False).head(10)

And here are the 10 brands that has sponsored the most CS:GO teams :

CSGO_MI.groupby('brand').agg({'team':(np.size)}).sort_values("team",ascending=False).nlargest(10,"team")

--

--

Thibaut Cordenier

I love UX, I love Data and use them both to make things better. Reach me on Twitter : @T_Cordenier