Project 1: Predicting a Pokemon Card's HP from its Features¶

By John Li

This project covers most of the Pokemon cards that exist for the first-generation Pokemon. In total there are 151 first-generation Pokemon and about 4,574 combined Pokemon cards across all of them.

My original idea was to categorize each card by its Pokemon, but that would have been far more intensive than what we learned in class, so I opted to predict the HP of the Pokemon card instead.

Scraping the Data¶

The most time-consuming aspect of this project was gathering the data. The features that I scraped were:

  • Pokemon Name
  • Stage
  • HP
  • Type
  • Attack Type(s)
  • Attack Name <- There was a bug in my code and all Attack Names were lost during the scraping
  • Attack Damage
  • Attack Text
  • Weakness Type
  • Resistance Type
  • Retreat cost
  • Rarity


These features are present on all Pokemon cards, which makes them well suited for predicting the HP of a card.

I got all the data from pkmncards.com. The site has all the features in plain text, so I did not have to do any image recognition. For more information on the scraping process, see the final markdown block below.

Project Plan¶

  • Import and clean data
  • Separate and categorize columns
  • Feature scale the data
  • Train the model
  • Evaluate

The model that I plan to use is linear regression with regularization, since predicting a continuous number, the HP, is a regression problem. I will use Ridge regression because I think most of the features I scraped will be useful for predicting the HP of the Pokemon.
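Concretely, Ridge regression minimizes the ordinary least-squares error plus an L2 penalty on the coefficient vector, with alpha controlling how strongly large coefficients are punished:

$$\min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$$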

Importing all Necessary Libraries and Packages¶

In [ ]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_absolute_error, r2_score

Data Preparation and Preprocessing¶

Load Data¶

In [2]:
df = pd.read_csv('./data/pokemon_data.csv')
df = df.drop(columns=['attack_1_name', 'attack_2_name', 'attack_3_name', 'attack_4_name', 'attack_5_name'])  # Dropping attack name due to bug.

print(df.shape)
df.head()
(4558, 23)
Out[2]:
pokemon_name hp type stage rarity weakness resistance retreat_cost attack_1_cost attack_1_damage ... attack_2_text attack_3_cost attack_3_damage attack_3_text attack_4_cost attack_4_damage attack_4_text attack_5_cost attack_5_damage attack_5_text
0 Bulbasaur 40 HP Grass Basic Common Fire No Resistance 1 Grass Grass 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Bulbasaur 40 HP Grass Basic Common Fire No Resistance 1 Grass Grass 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Erika’s Bulbasaur 50 HP Grass Basic Uncommon Fire No Resistance 1 Grass 10 ... Flip a coin. If heads, you may search your dec... NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Bulbasaur 40 HP Grass Basic Common Fire No Resistance 1 Grass Grass 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Bulbasaur 50 HP Grass Basic Common Fire No Resistance 1 Colorless 10 ... Flip a coin. If heads, the Defending Pokémon i... NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 23 columns

Process and Clean Data¶

We can see from the structure of our data that there is a lot of cleaning up to do. Each card has up to 5 attack slots, and most of those slots have the value NaN. Since some Pokemon have multiple attacks, we can combine them and find the total, average, and max of the attack cost and attack damage.

We can also combine the attack text and use TfidfVectorizer from sklearn to turn it into features.

Since the HP column is not numeric, we can easily convert that too.

In [3]:
text_cols = [col for col in df.columns if 'text' in col]
damage_cols = [col for col in df.columns if 'damage' in col]
attack_cols = [col for col in df.columns if 'attack' in col] # To be dropped later
attack_cost_cols = [col for col in df.columns if 'attack' in col and 'cost' in col]

df[text_cols] = df[text_cols].fillna('')
df['all_attack_text'] = df[text_cols].apply(lambda row: ' '.join(row), axis=1)

for col in damage_cols:
  df[col] = pd.to_numeric(df[col], errors='coerce')
df['num_attacks'] = df[damage_cols].notna().sum(axis=1)
df['max_damage'] = df[damage_cols].max(axis=1)
df['total_damage'] = df[damage_cols].sum(axis=1)
df['avg_damage'] = df[damage_cols].mean(axis=1)

for col in attack_cost_cols:
  df[col] = df[col].fillna('').apply(lambda x: len(str(x).split()))
df['max_energy_cost'] = df[attack_cost_cols].max(axis=1)
df['total_energy_cost'] = df[attack_cost_cols].sum(axis=1)

denominator = df['num_attacks'].clip(lower=1) # Prevent division by zero
df['avg_energy_cost'] = df[attack_cost_cols].sum(axis=1) / denominator

df['hp'] = df['hp'].replace(' HP', '', regex=True)
df['hp'] = pd.to_numeric(df['hp'], errors='coerce')

df.fillna(0, inplace=True)

df = df.drop(columns=attack_cols)

print(df.shape)
df.head()
(4558, 16)
Out[3]:
pokemon_name hp type stage rarity weakness resistance retreat_cost all_attack_text num_attacks max_damage total_damage avg_damage max_energy_cost total_energy_cost avg_energy_cost
0 Bulbasaur 40 Grass Basic Common Fire No Resistance 1 Unless all damage from this attack is prevente... 1 20.0 20.0 20.0 2 2 2.0
1 Bulbasaur 40 Grass Basic Common Fire No Resistance 1 Unless all damage from this attack is prevente... 1 20.0 20.0 20.0 2 2 2.0
2 Erika’s Bulbasaur 50 Grass Basic Uncommon Fire No Resistance 1 The Defending Pokémon is now Asleep. Flip a co... 1 10.0 10.0 10.0 2 3 3.0
3 Bulbasaur 40 Grass Basic Common Fire No Resistance 1 Unless all damage from this attack is prevente... 1 20.0 20.0 20.0 2 2 2.0
4 Bulbasaur 50 Grass Basic Common Fire No Resistance 1 Flip a coin. If heads, the Defending Pokémon ... 2 10.0 20.0 10.0 2 3 1.5

Identify and Separate Column Types¶

Now that our data frame is cleaned up, we can move on to converting the data into numeric form.

I first split the features into categorical, numerical, or plain text. For the categorical data we can use get_dummies from pandas, which splits each column into binary columns, one per category.

We mentioned above that we can use TfidfVectorizer to extract features from all_attack_text. This gives us a numeric matrix over the most common words, where each value is between 0 and 1 and reflects how heavily that word is weighted for that row.
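For reference, with its default settings TfidfVectorizer weights each term by its raw count in a row times a smoothed inverse document frequency, then normalizes each row to unit length (which is why the values stay between 0 and 1):

$$\text{tf-idf}(t, d) = \text{tf}(t, d)\cdot\left(\ln\frac{1 + n}{1 + \text{df}(t)} + 1\right)$$

where n is the number of cards and df(t) is the number of cards whose attack text contains the term t.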

In [4]:
y = df['hp']
X = df.drop(columns=['hp'])

categorical_features = ['type', 'stage', 'weakness', 'resistance', 'rarity']
numerical_features = ['retreat_cost', 'num_attacks', 'max_damage', 'total_damage', 'avg_damage', 'max_energy_cost', 'total_energy_cost', 'avg_energy_cost']
text_features = 'all_attack_text'

X_categorical_encoded = pd.get_dummies(X[categorical_features], drop_first=True)
X_base_features = pd.concat([X[numerical_features], X_categorical_encoded], axis=1)

tfidf = TfidfVectorizer(max_features=500) # Limit to top 500 features to reduce dimensionality
X_text_features = tfidf.fit_transform(X[text_features])
X_text_df = pd.DataFrame(X_text_features.toarray(), columns=tfidf.get_feature_names_out())

X_processed = pd.concat([X_base_features, X_text_df], axis=1)

print(X_processed.shape)
X_processed.head()
(4558, 581)
Out[4]:
retreat_cost num_attacks max_damage total_damage avg_damage max_energy_cost total_energy_cost avg_energy_cost type_Darkness type_Dragon ... without working works would you your yours zapdos zone zubat
0 1 1 20.0 20.0 20.0 2 2 2.0 False False ... 0.0 0.0 0.0 0.0 0.166934 0.000000 0.0 0.0 0.0 0.0
1 1 1 20.0 20.0 20.0 2 2 2.0 False False ... 0.0 0.0 0.0 0.0 0.166934 0.000000 0.0 0.0 0.0 0.0
2 1 1 10.0 10.0 10.0 2 3 3.0 False False ... 0.0 0.0 0.0 0.0 0.130478 0.326134 0.0 0.0 0.0 0.0
3 1 1 20.0 20.0 20.0 2 2 2.0 False False ... 0.0 0.0 0.0 0.0 0.166934 0.000000 0.0 0.0 0.0 0.0
4 1 2 10.0 20.0 10.0 2 3 1.5 False False ... 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0

5 rows × 581 columns

Split the Data for Training and Testing¶

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
  X_processed,
  y,
  test_size=0.2,
  random_state=1234
)

Feature Scaling¶

Since our columns are on very different scales, we have to scale them so that the model does not treat retreat_cost (which is usually 1 or 2) as less important than attack damage (which is anywhere from 10 to 200+) simply because of the difference in magnitude.
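One note on the scaler below: with with_mean=False, StandardScaler only divides each feature by its training-set standard deviation rather than fully standardizing it,

$$x' = \frac{x}{\sigma_{\text{train}}} \quad \text{instead of} \quad x' = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}$$

which is the usual choice when TF-IDF features are involved, since centering would destroy the sparsity of a sparse matrix (here the matrix is already dense, so this is mostly a convention).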

In [6]:
scaler = StandardScaler(with_mean=False)
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Model Training¶

As discussed before, we are using linear regression with regularization, so we will use Ridge from sklearn as our model. I set alpha=1.0 as a default value since I do not yet know how much regularization we will need.

In [7]:
model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)
Out[7]:
Ridge()

What the Model Learned¶

Looking at the features that are most important to the model, we see that high rarity, high damage, and high retreat_cost raise the predicted HP, while early evolution stages (Basic, Stage 1) lower it.

In [8]:
coefficients = pd.Series(model.coef_, index=X_processed.columns)
print(coefficients.sort_values(ascending=False))
rarity_Ultra Rare     13.797590
max_damage            11.294669
rarity_Double Rare     9.829171
stage_VMAX             9.712907
retreat_cost           9.504890
                        ...    
if                    -5.751410
10                    -6.288372
stage_Stage 1         -8.787541
turn                  -9.902094
stage_Basic          -12.576607
Length: 581, dtype: float64

Model Evaluation¶

Here we predict the HP of our test group and evaluate the accuracy of our model. With a mean absolute error of 14.13 HP, I would say our model is quite accurate: since Pokemon card HP only increases in increments of 10, our error is between one and two increments.

We can also look at the R-squared value, which is 0.91; this means that 91% of the variability in HP values can be explained by the features in this model.
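For reference, the two metrics reported below are computed as

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert \qquad\qquad R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

where $y_i$ are the true HP values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean HP of the test set.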

In [9]:
predictions = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("--- Model Evaluation Results ---")
print(f"Mean Absolute Error (MAE): {mae} HP")
print(f"R-squared (R^2): {r2}")
--- Model Evaluation Results ---
Mean Absolute Error (MAE): 14.125096947960028 HP
R-squared (R^2): 0.9058198006025279

Visualize¶

In [10]:
plt.style.use('ggplot')

plt.figure(figsize=(8, 6))
plt.scatter(y_test, predictions, alpha=0.7, edgecolors='w', s=100)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)

plt.xlabel("True HP")
plt.ylabel("Predicted HP")
plt.title("Model Performance: True vs. Predicted HP")
plt.show()
[Figure: scatter plot of true vs. predicted HP with a y = x reference line]

Finding the Best Alpha For Our Model¶

We know that a high alpha value, or strong regularization, creates a simpler model, while a low alpha value, or weak regularization, creates a more complex model. We can take a set of alpha values from 0.001 to 1000 and run the model for each one, then compare the R^2 values across all of them.

In [11]:
alpha_range = np.logspace(-3, 3, 20)

train_scores = []
test_scores = []

for alpha in alpha_range:
  model = Ridge(alpha=alpha)
  model.fit(X_train_scaled, y_train)
  
  y_train_pred = model.predict(X_train_scaled)
  y_test_pred = model.predict(X_test_scaled)
  
  train_scores.append(r2_score(y_train, y_train_pred))
  test_scores.append(r2_score(y_test, y_test_pred))

print("Best Alpha:", alpha_range[test_scores.index(max(test_scores))], "with R^2:", max(test_scores))
Best Alpha: 112.88378916846884 with R^2: 0.9084990186539924

We see that the best alpha value is 112.88, which gives us an R^2 of 0.908. This is higher than the R^2 of our initial pass with an alpha value of 1.0, which was 0.906.

In [12]:
plt.style.use('ggplot')
plt.figure(figsize=(10, 6))

plt.plot(alpha_range, train_scores, 'o-', color='blue', label='Training Score')
plt.plot(alpha_range, test_scores, 'o-', color='green', label='Testing Score (Validation)')

plt.xscale('log')

plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('R-squared Score')
plt.title('Validation Curve for Ridge Regression')
plt.legend()
plt.grid(True)
plt.show()
[Figure: validation curve of training and testing R-squared scores across the alpha range]

Future Considerations¶

For iterations on this project, I would definitely consider a few things:

  • Make sure all data is scraped accurately. This would mean re-scraping and making sure the attack_name column is actually captured.
  • Consider using Lasso regression as validation and to see which features the Lasso technique would find useless (a rough sketch of this follows below).
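As a starting point for that second item, here is a minimal sketch, not something I ran for this project, that fits Lasso on the same scaled training data and counts how many coefficients it shrinks exactly to zero; the alpha value is an arbitrary placeholder that would need tuning.

from sklearn.linear_model import Lasso

# Fit Lasso on the same scaled features; alpha=0.1 is only a placeholder and would need tuning.
lasso = Lasso(alpha=0.1, max_iter=10000)
lasso.fit(X_train_scaled, y_train)

# Coefficients that Lasso shrinks exactly to zero point at features it considers useless.
lasso_coefficients = pd.Series(lasso.coef_, index=X_processed.columns)
zeroed_features = lasso_coefficients[lasso_coefficients == 0].index.tolist()

print(f"Lasso zeroed out {len(zeroed_features)} of {len(lasso_coefficients)} features")
print(f"Test R^2 with Lasso: {r2_score(y_test, lasso.predict(X_test_scaled)):.3f}")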

Scraping Process (Pains)¶

There are many popular libraries for scraping, but by far the most popular is Playwright. Built on a Node.js core, Playwright drives a real browser and gives the developer easy parsing and navigation tools. Playwright has a Python wrapper, but I did not use it directly; instead, I used Camoufox, which is built on top of Playwright with a custom browser that has many stealth tools built in.

I realized only during development that Camoufox makes it harder to change the browser's context, which makes it harder to create asynchronous workers. After trying and failing a few times, I gave up and stuck with a single worker. This kept development simple, but it also severely limited the speed at which I was able to scrape (a minimal sketch of this single-worker loop is at the end of this section). This wasn't the biggest issue, however:

  • (1) The biggest issue was that the proxy service I had used before, Decodo, required me to go through a verification process that includes an image of my passport and a selfie. I was not comfortable with my information being given to a 3rd party, so I opted out.
  • (2) Since I could not use proxies to scrape the website, I had to rely on my home internet, and out of fear of getting suddenly blacklisted by the website, I set a 5+ second delay on every request.
  • (3) The five-second delay, on top of the scraping process itself and a built-in timeout for features it could not find or recognize, resulted in a total scraping time of around 14 hours.

You can see the starting and ending log entries and their timestamps below:

log
2025-09-22 03:53:52,723 - main - INFO - Starting main scraping process
2025-09-22 03:53:52,726 - main - INFO - Added 4588 URLs to the queue
2025-09-22 03:53:58,924 - main - INFO - Processing URL: https://pkmncards.com/card/bulbasaur-base-set-bs-44/
...
...
...
2025-09-22 17:44:47,009 - main - INFO - Processing URL: https://pkmncards.com/card/mew-ex-paldean-fates-paf-232/
2025-09-22 17:45:17,628 - main - INFO - Successfully scraped and wrote data for: Mew ex
2025-09-22 17:45:17,823 - main - INFO - Scraping process finished.

The extremely long time it took to scrape all this data forced me to keep the dataset even though my bot was not able to scrape the attack names.
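To round out the write-up, here is a minimal sketch of the kind of single-worker loop described above. It uses the plain Playwright sync API rather than Camoufox, and the CSS selector, output path, and columns are illustrative placeholders rather than the ones from my actual scraper.

import csv
import time

from playwright.sync_api import sync_playwright

urls = [
    "https://pkmncards.com/card/bulbasaur-base-set-bs-44/",
    # ... the rest of the card URLs in the queue
]

with sync_playwright() as p:
    browser = p.firefox.launch(headless=True)
    page = browser.new_page()
    with open("./data/pokemon_data_raw.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for url in urls:
            page.goto(url)
            # pkmncards.com shows the card as plain text, so a text selector is enough;
            # ".card-text" is a placeholder for the real selectors parsed field by field.
            card_text = page.inner_text(".card-text")
            writer.writerow([url, card_text])
            time.sleep(5)  # 5+ second delay so my home IP does not get blacklisted
    browser.close()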