Problem 3 (100 points)
Before starting this problem, make sure to run the following code first without any change:
# DO NOT CHANGE
import numpy as np
import pandas as pd
import copy
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
np.random.seed(2025)
""" END OF THIS PART """
\color{red}{\text{WARNING !!!}}
- Beyond importing libraries/modules/classes/functions in the preceding cell, you are NOT allowed to import anything else for the following purposes:
  - As a part of your final solution. For instance, if a problem asks you to build a model without using sklearn but you use it, then you will not earn points.
  - Temporarily importing something to help you get a solution. For instance, if a problem asks you to manually compute eigenvalues but you temporarily use np.linalg.eig to get an answer and then delete your code, then you violate the rule.
  Rule of thumb: Each part is designed to test you on something specific. Do not attempt to find a shortcut to circumvent the rule.
- All coding tasks shall run on CPUs, not GPUs.
Part 1 (5 points, coding task)
We study the dataset USAAIO_2025_round1_prob3_train.csv provided in this contest.
The dataset can be found here:
url = "https://drive.google.com/file/d/125YsFPS2nCNRvYyy1tgnD8RhYIUglLX9/view?usp=sharing"
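The link above points to a Google Drive sharing page rather than to the raw CSV, so passing it directly to pd.read_csv would likely not yield the data. The sketch below shows one possible workaround using only pandas and string operations. It assumes the file is publicly shared and that Drive's "uc?export=download&id=" URL form serves it; drive_share_to_direct is a hypothetical helper name, not part of the contest material. Downloading the file manually and reading it from a local path is an equally valid route.

```python
import pandas as pd

url = "https://drive.google.com/file/d/125YsFPS2nCNRvYyy1tgnD8RhYIUglLX9/view?usp=sharing"


def drive_share_to_direct(share_url):
    """Hypothetical helper: convert a '/file/d/<id>/view' link to a direct-download link."""
    # The file id is the path segment that follows '/d/'.
    file_id = share_url.split("/d/")[1].split("/")[0]
    # Assumption: this URL form serves the raw file for publicly shared files.
    return "https://drive.google.com/uc?export=download&id=" + file_id


direct_url = drive_share_to_direct(url)
print(direct_url)

# Requires network access, so it is left commented out in this sketch:
# df_1 = pd.read_csv(direct_url)
```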
Do the following tasks in this part.
- Load USAAIO_2025_round1_prob3_train.csv into a pandas DataFrame object called df_1.
- Print the first 10 rows.
- Define a function called data_summary that
  - Takes a DataFrame object as an input.
  - Prints the shape of the DataFrame.
  - Prints the data type for each column.
  - Prints the count of missing values for each column.
  - Delivers no output.
- After defining the above function, call it by feeding df_1 to it.
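The tasks above could be approached along the following lines. This is a minimal sketch, not a reference solution: it uses a small stand-in DataFrame called demo with made-up columns so that it runs without the contest CSV, and it reads "delivers no output" as "returns nothing". In the actual task, df_1 loaded from USAAIO_2025_round1_prob3_train.csv would take the place of demo.

```python
import numpy as np
import pandas as pd


def data_summary(df):
    """Print the shape, per-column dtypes, and per-column missing counts; return nothing."""
    print("Shape:", df.shape)
    print("Data types:")
    print(df.dtypes)
    print("Missing values per column:")
    print(df.isna().sum())


# Stand-in data with hypothetical columns, used only to make the sketch runnable.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# head(10) returns at most the first 10 rows (here, all 3).
print(demo.head(10))

# The function only prints, so the call evaluates to None.
result = data_summary(demo)
```

Printing df.dtypes and df.isna().sum() directly keeps the per-column information aligned by column name, which is usually easier to scan than a hand-built loop.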