2026 USAAIO Round 1 Sample problems, Problem 9

Problem 9.

In this problem, we study the Bank Marketing dataset. You can load the dataset by using the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
url_part1 = "https://huggingface.co/datasets/usaaio-official/"
url_part2 = "2026_USAAIO_samples/raw/main/"
url_part3 = "2026_USAAIO_samples_bank.csv"
url = url_part1 + url_part2 + url_part3
df = pd.read_csv(url, sep=";")

Do the following tasks.

Part 9.1.

Print the data type of df.

Part 9.2.

Print the shape of df.

Part 9.3.

Print all column names.

Part 9.4.

For each column, print its name and the data type of that column.

Part 9.5.

Convert every column whose data type is “object” to data type “category”.

Part 9.6.

Print the first 10 rows.

Part 9.7.

Some entries have the value “unknown”. Suppose you interpret “unknown” as a missing value. Count the number of missing values in each column.

Part 9.8.

In this part, you do not need to consider the last column.
For each column that is numeric, normalize values in this column between 0 and 1. After normalization, print out the following statistics of this column: max value, min value, mean value, standard deviation. In your solution, you are not allowed to directly use any existing normalization function. That is, you need to do this task from scratch.

Part 9.9.

In this part, you do not need to consider the last column.

For each column that is categorical, do one-hot encoding. Below is an example. Suppose one categorical column has name “ABC”. All possible values are “X”, “Y”, “Z”. Then after one-hot encoding, you should create three new columns called “ABC_X”, “ABC_Y”, and “ABC_Z”.
In your solution, you are not allowed to directly use any existing one-hot encoding function. That is, you need to do this task from scratch.

Part 9.10.

Column “y” is the target. How many target values are “yes” and how many are “no”?

Part 9.11.

Consider those whose “marital” is “married” and “age” is odd. Among these people, how many target values are “yes” and how many are “no”?

Part 9.12.

Put the ages into 10 bins of equal width. For each bin, compute the subscription ratio, that is, the number of people in the bin whose “y” is “yes” divided by the total number of people in that bin.
Plot the ratio versus the age bin.

Part 9.13.

Define X to contain all features (every column except “y”) and y to contain only column “y”.

Part 9.14.

Split the dataset into training and test sets, with 80% of the data used for training.
Split randomly, using 2026 as the random seed.

Part 9.1

print(type(df))

Part 9.2

print(df.shape)

Part 9.3

print(df.columns)

Part 9.4

print(df.dtypes)

Part 9.5

for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].astype('category')
# Check that the conversion took effect
print(df.dtypes)

Part 9.6

print(df.head(10))

Part 9.7

print(df.isin([np.nan, None, "unknown"]).sum())
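
An equivalent way to count the “unknown” entries is to compare the whole table against the string and sum the resulting booleans per column. The sketch below uses a small made-up table (the names toy and missing_counts, and the values in it, are illustrative and not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy table, not the real Bank Marketing data.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services"],
    "education": ["unknown", "unknown", "tertiary"],
    "age": [30, 41, 52],
})

# Element-wise comparison gives a boolean table; summing counts True per column.
missing_counts = (toy == "unknown").sum()
print(missing_counts)
```

Numeric columns never equal the string, so they are reported with a count of 0.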

Part 9.8

# Numeric columns, excluding the last column (the target "y")
numeric_cols = df.iloc[:, :-1].select_dtypes(include="number").columns

for col in numeric_cols:
    values = df[col].values.astype(float)
    min_val = values.min()
    max_val = values.max()

    normalized = (values - min_val) / (max_val - min_val)

    print(col)
    print(" max:", normalized.max())
    print(" min:", normalized.min())
    print(" mean:", normalized.mean())
    print(" std:", normalized.std())
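
The loop above applies min-max scaling, x' = (x - min) / (max - min). As a small check on made-up values, the sketch below wraps the same formula in a helper. The function name min_max_normalize and the guard for a constant column are assumptions added for illustration; they are not part of the original solution.

```python
import numpy as np

def min_max_normalize(values):
    """Rescale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    min_val = values.min()
    span = values.max() - min_val
    if span == 0:
        # Assumption: map a constant column to all zeros to avoid dividing by zero.
        return np.zeros_like(values)
    return (values - min_val) / span

# Hypothetical ages, not taken from the dataset.
toy = min_max_normalize([18, 30, 42, 90])
print(toy.min(), toy.max(), toy.mean(), toy.std())
```

Note that NumPy's std() computes the population standard deviation (ddof=0) by default, whereas pandas' Series.std() uses ddof=1, so the two can report slightly different values.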

Part 9.9

# Categorical columns, excluding the last column (the target "y")
categorical_cols = df.iloc[:, :-1].select_dtypes(include="category").columns

df_encoded = df.copy()

for col in categorical_cols:
    for val in df[col].cat.categories:
        df_encoded[f"{col}_{val}"] = (df[col] == val).astype(int)
    df_encoded.drop(columns=col, inplace=True)
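
To see the encoding on the “ABC” example from the problem statement, the sketch below applies the same comparison-based approach to a four-row toy column. The data and the names toy and encoded are made up for illustration.

```python
import pandas as pd

# Toy column "ABC" with possible values "X", "Y", "Z" (hypothetical rows).
toy = pd.DataFrame({"ABC": ["X", "Z", "Y", "X"]})

encoded = toy.copy()
for val in sorted(toy["ABC"].unique()):
    # 1 where the row has this value, 0 otherwise
    encoded[f"ABC_{val}"] = (toy["ABC"] == val).astype(int)
encoded = encoded.drop(columns="ABC")

print(encoded)
```

Each row ends up with exactly one 1 across “ABC_X”, “ABC_Y”, and “ABC_Z”, which is a quick way to sanity-check the result.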

Part 9.10

print(df["y"].value_counts())

Part 9.11

subset = df[(df["marital"] == "married") & (df["age"] % 2 == 1)]
print(subset["y"].value_counts())

Part 9.12

bins = pd.cut(df["age"], bins=10)

ratios = []
bin_centers = []

for interval in bins.cat.categories:
    group = df[bins == interval]
    if len(group) > 0:
        ratio = (group["y"] == "yes").mean()
        ratios.append(ratio)
        bin_centers.append(interval.mid)

plt.plot(bin_centers, ratios, marker='o')
plt.xlabel("Age bin")
plt.ylabel("Subscription ratio")
plt.title("Subscription Ratio vs Age")
plt.show()
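
The per-bin ratio can also be computed with a groupby instead of an explicit loop. The sketch below shows this on a tiny made-up sample with 4 bins rather than 10; the data and the names ages, subscribed, edges, bin_ids, and ratio_per_bin are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not taken from the dataset.
ages = pd.Series([20, 25, 31, 38, 44, 52, 60])
subscribed = pd.Series(["no", "yes", "no", "no", "yes", "yes", "yes"])

# Equal-width bin edges from the minimum to the maximum age: 20, 30, 40, 50, 60.
edges = np.linspace(ages.min(), ages.max(), 5)

# labels=False returns the integer bin index of each age;
# include_lowest=True keeps the minimum age inside the first bin.
bin_ids = pd.cut(ages, bins=edges, include_lowest=True, labels=False)

# Mean of a boolean series is the fraction of "yes" in each bin.
ratio_per_bin = (subscribed == "yes").groupby(bin_ids).mean()
print(ratio_per_bin)
```

Bins that contain no people simply do not appear in the groupby result, which plays the same role as the len(group) > 0 check in the loop above.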

Part 9.13

X = df.drop(columns="y")
y = df["y"]

Part 9.14

np.random.seed(2026)
indices = np.random.permutation(len(df))

train_size = int(0.8 * len(df))
train_idx = indices[:train_size]
test_idx = indices[train_size:]

X_train = X.iloc[train_idx]
X_test = X.iloc[test_idx]
y_train = y.iloc[train_idx]
y_test = y.iloc[test_idx]
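
The same split can be wrapped in a reusable helper and checked on toy data. The sketch below is one possible variant: the function name train_test_split_manual is hypothetical, and it uses NumPy's Generator API (np.random.default_rng) instead of np.random.seed, so for the same seed it produces a different permutation than the solution above.

```python
import numpy as np
import pandas as pd

def train_test_split_manual(X, y, train_frac=0.8, seed=2026):
    """Randomly split X and y row-wise; train_frac of the rows go to training."""
    rng = np.random.default_rng(seed)      # local generator, no global state
    indices = rng.permutation(len(X))      # shuffled row positions
    train_size = int(train_frac * len(X))
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

# Hypothetical toy data with 10 rows.
toy_X = pd.DataFrame({"age": range(10)})
toy_y = pd.Series(["no", "yes"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split_manual(toy_X, toy_y)
print(len(X_tr), len(X_te))
```

Useful sanity checks after any split are that the training and test indices do not overlap and that together they cover every row.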