Problem 13
In this problem, we study the Breast Cancer dataset. This is a binary classification task, and all features are numeric.
You can access the training dataset by running the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
url = "https://huggingface.co/datasets/usaaio-official/breast_cancer_train/raw/main/breast_cancer_train.csv"
df = pd.read_csv(url)
In this training dataset, X contains 30 input features, and y contains the binary
target labels.
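The loading code above produces a single DataFrame df, while the text refers to X and y. A minimal sketch of the split follows, assuming the labels sit in one column of df. The column name "target" and the helper split_features_target are hypothetical, so inspect df.columns for the real name. A tiny synthetic frame stands in for the downloaded data:

```python
import pandas as pd

def split_features_target(df: pd.DataFrame, target_col: str):
    """Split a DataFrame into a feature matrix X and a label Series y."""
    if target_col not in df.columns:
        raise KeyError(f"column {target_col!r} not found; available: {list(df.columns)}")
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y

# Synthetic stand-in for the real training frame (NOT the competition data).
demo_df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [0.5, 0.1, 0.9],
    "target": [0, 1, 0],  # hypothetical label column name
})
X_demo, y_demo = split_features_target(demo_df, target_col="target")
```

Raising a KeyError with the available column names makes a wrong guess about the label column fail early and visibly.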
We also have a hidden test dataset that you cannot access during the competition.
You must submit a single Jupyter notebook (.ipynb) containing your complete
solution, including (but not limited to):
- Data preprocessing (if needed)
- Model construction
- Model training
- Inference logic
Inference requirements
For inference, you must define a function with the following signature:
def my_prediction(X_test):
    ###INSERT YOUR CODE HERE###
    return y_pred
In this function:
- X_test is a pandas DataFrame containing all input features from the hidden test set.
- y_pred must be a pandas Series containing your predicted labels.
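One possible shape for this function is sketched below. It is an illustration, not a reference solution: scikit-learn's bundled breast cancer dataset stands in for the provided CSV, and the names model, X_train and y_train, as well as the choice of n_neighbors=5, are assumptions. The model is fitted once at notebook level so that my_prediction only performs inference.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training data: scikit-learn's bundled breast cancer dataset.
# In the real notebook, X_train / y_train would come from the provided CSV.
data = load_breast_cancer(as_frame=True)
X_train, y_train = data.data, data.target

# Fit once at notebook level; kNN is distance-based, so features are scaled first.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

def my_prediction(X_test):
    # Reorder columns to match training; assumes the hidden test set
    # uses the same feature names as the training data.
    X_aligned = X_test[X_train.columns]
    # Return a pandas Series that keeps the row index of X_test.
    y_pred = pd.Series(model.predict(X_aligned), index=X_test.index, name="y_pred")
    return y_pred

# Smoke test on a few training rows (not a performance estimate).
demo_pred = my_prediction(X_train.head(10))
```

Keeping the index of X_test on the returned Series avoids silent misalignment if the evaluator compares predictions to labels by index.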
After the competition, we will execute all code in your submitted notebook from top to bottom. During evaluation, we will load the hidden test features as X_test and call your function:
my_prediction(X_test)
Your predictions y_pred will be evaluated using the macro-averaged F1 score (F1-macro).
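For reference, F1-macro is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of how many samples each has. The sketch below computes it by hand with NumPy on a toy example. The helper name f1_macro is illustrative, and the organizers' exact evaluation code is not given here.

```python
import numpy as np

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); taken as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Class 0: F1 = 2/3, class 1: F1 = 4/5, so the macro average is 11/15.
score = f1_macro([0, 0, 1, 1], [0, 1, 1, 1])
```

In scikit-learn the same quantity is available as sklearn.metrics.f1_score(y_true, y_pred, average="macro").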
Model constraints
In your solution, you must use k-Nearest Neighbors (kNN) as part of your classification approach. However, this does not mean that you must apply kNN directly to the raw training and test data. You may apply any data preprocessing, feature engineering, or pipeline you find appropriate. You may use any module from scikit-learn. Do not use deep neural networks to solve this problem.
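As one illustration of what "kNN as part of a pipeline" can look like, the sketch below scales the features and then tunes a kNN classifier with cross-validation, scored by F1-macro. It again uses scikit-learn's bundled breast cancer dataset as a stand-in for the competition CSV. The step names, parameter grid and fold count are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; replace with the X and y derived from the provided CSV.
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so that each CV fold is scaled
# using statistics from its own training part only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative search space over the kNN step's hyperparameters.
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 11],
    "knn__weights": ["uniform", "distance"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)
search.fit(X, y)

best_params = search.best_params_  # chosen hyperparameters
best_cv_f1 = search.best_score_    # mean cross-validated F1-macro
```

After fitting, search.best_estimator_ is the pipeline refitted on all the data with the selected hyperparameters, which is the object an inference function would call predict on.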
Notebook requirements
In your submitted notebook, please ensure the following:
- Your code includes sufficient comments to make it readable.
- At the end of the notebook, include a text cell summarizing:
• Your overall approach.
• The intuition behind your design choices.
• Any alternative approaches or models you considered but did not pursue.
Your report does not need to be long, but it should be clear, concise, and well-reasoned.
Grading criteria
Your submission will be evaluated based on:
- Whether the entire notebook runs successfully from start to finish.
- Performance on the hidden test dataset.
- The quality of your reasoning and problem-solving approach.
We do not evaluate code style or quality.