2026 USAAIO Round 1 Sample problems, Problem 13

Problem 13

In this problem, we study the Breast Cancer dataset. This is a binary classification task, and all features are numeric.

You can access the training dataset by running the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

url = "https://huggingface.co/datasets/usaaio-official/breast_cancer_train/raw/main/breast_cancer_train.csv"
df = pd.read_csv(url)

In this training dataset, X contains 30 input features, and y contains the binary
target labels.
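As a reference for how X and y might be separated from the loaded frame, here is a minimal sketch; the label column name "target" is an assumption based on the dataset layout, and the tiny frame below is a stand-in for `df = pd.read_csv(url)`:

```python
import pandas as pd

# Tiny stand-in frame; in the competition you would use df = pd.read_csv(url).
df = pd.DataFrame({
    "mean_radius": [14.1, 20.6, 12.4],
    "mean_texture": [19.3, 25.0, 15.7],
    "target": [1, 0, 1],
})

# Separate the 30 numeric features from the binary label
# (the column name "target" is an assumption here).
X = df.drop("target", axis=1)
y = df["target"]
print(X.shape, y.shape)  # → (3, 2) (3,)
```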

We also have a hidden test dataset that you cannot access during the competition.

You must submit a single Jupyter notebook (.ipynb) containing your complete
solution, including (but not limited to):

  1. Data preprocessing (if needed)
  2. Model construction
  3. Model training
  4. Inference logic

Inference requirements
For inference, you must define a function with the following signature:

def my_prediction(X_test):
    ###INSERT YOUR CODE HERE###
    return y_pred

In this function:

  1. X_test is a pandas DataFrame containing all input features from the hidden
    test set.
  2. y_pred must be a pandas Series containing your predicted labels.

After the competition, we will execute all code in your submitted notebook from top to bottom. During evaluation, we will load the hidden test features as X_test and call your function: my_prediction(X_test). Your predictions y_pred will be evaluated using the macro-averaged F1 score (F1-macro).
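For local validation, the same metric can be computed with scikit-learn's `f1_score`. A small sketch with made-up labels (macro averaging gives each class equal weight, so the minority class counts as much as the majority class):

```python
import pandas as pd
from sklearn.metrics import f1_score

# Made-up ground truth and predictions, just to show the call.
y_true = pd.Series([1, 0, 1, 1, 0])
y_pred = pd.Series([1, 0, 0, 1, 0])

# Per-class F1 is 0.8 for both classes here, so the macro average is 0.8.
score = f1_score(y_true, y_pred, average="macro")
print(score)  # → 0.8
```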

Model constraints
In your solution, you must use k-Nearest Neighbors (kNN) as part of your classification approach. However, this does not mean you must apply kNN directly to the raw training and test data. You may apply any data preprocessing, feature engineering, or pipeline you find appropriate. You may use any module from scikit-learn. Do not use deep neural networks to solve this problem.
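One common shape for such a solution is a scikit-learn pipeline that scales the features before kNN (kNN distances are otherwise dominated by large-scale features). A sketch on synthetic stand-in data; the real run would use the X and y from the CSV:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 30-feature binary dataset.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Scaling + kNN in one estimator: cross-validation then scales each
# training fold independently, avoiding leakage into the validation fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```

Wrapping preprocessing in the pipeline also means a single fitted object can be reused inside `my_prediction`.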

Notebook requirements
In your submitted notebook, please ensure the following:

  1. Your code includes sufficient comments to make it readable.
  2. At the end of the notebook, include a text cell summarizing:
    • Your overall approach.
    • The intuition behind your design choices.
    • Any alternative approaches or models you considered but did not pursue.

Your report does not need to be long, but it should be clear, concise, and well-reasoned.

Grading criteria:
Your submission will be evaluated based on:

  1. Whether the entire notebook runs successfully from start to finish.
  2. Performance on the hidden test dataset.
  3. The quality of your reasoning and problem-solving approach.
    We do not evaluate code style or quality.

My code currently gets validation macro-F1 scores of 0.964 and 0.976 for the two models I created. If you have code that could result in a greater macro-F1 score, please post it here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
url_part1 = "https://huggingface.co/datasets/usaaio-official/"
url_part2 = "2026_USAAIO_samples/raw/main/"
url_part3 = "2026_USAAIO_samples_breast_cancer_train.csv"
url = url_part1 + url_part2 + url_part3
df = pd.read_csv(url)
X = df.drop("target", axis=1)
y = df["target"]


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
# Import the tools needed for splitting, scaling, kNN, and scoring
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=2026, stratify=y
)
# Split the data into training and validation sets, stratified on the label
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
# Standardize features so kNN distances aren't dominated by large-scale columns

knn1 = KNeighborsClassifier(
    n_neighbors=5,
    weights="distance",
    metric="euclidean"
)
# Set up the first kNN classifier (distance-weighted, Euclidean metric)
knn1.fit(X_train_scaled, y_train)

y_val_pred = knn1.predict(X_val_scaled)
f1_1 = f1_score(y_val, y_val_pred, average="macro")

print("Model 1 macro-F1:", f1_1)


knn2 = KNeighborsClassifier(
    n_neighbors=11,
    weights="distance",
    metric="manhattan"
)

knn2.fit(X_train_scaled, y_train)

y_val_pred2 = knn2.predict(X_val_scaled)
f1_2 = f1_score(y_val, y_val_pred2, average="macro")

print("Model 2 macro-F1:", f1_2)

best_model = knn1 if f1_1 >= f1_2 else knn2
print("Using model:", "knn1" if f1_1 >= f1_2 else "knn2")

X_scaled_full = scaler.fit_transform(X)
best_model.fit(X_scaled_full, y)
def my_prediction(X_test):
    # Scale the hidden test features with the scaler fitted on the full
    # training data, then predict with the better of the two models.
    X_test_scaled = scaler.transform(X_test)
    y_pred = pd.Series(best_model.predict(X_test_scaled), index=X_test.index)
    return y_pred
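Since you asked for a higher score: rather than hand-picking two configurations, one option is to grid-search the kNN hyperparameters inside a scaling pipeline and let cross-validated macro-F1 choose among them. A sketch on synthetic stand-in data; the real run would substitute the X and y loaded from the CSV:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 30-feature dataset; replace with the real X, y.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 11, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

# 5-fold CV over all 20 combinations, scored by macro-F1; refit=True (the
# default) retrains the best pipeline on all of X, y afterwards.
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The refitted `search.best_estimator_` could then serve directly as the model inside `my_prediction`, since the pipeline carries its own scaler.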