2025 USA-NA-AIO Round 1, Problem 3, Part 6

Part 6 (5 points, coding and conceptual reasoning task)

In df_5, columns Sex and Embarked are categorical data.

Do the following tasks to process these categorical data.

  1. To do logistic regression on this dataset, we need to do one hot encoding on these two columns. Explain why.

  2. Do one hot encoding on these two columns. Set drop_first = True and dtype = np.int8. Save the new dataframe object as df_6.

  3. Explain what drop_first = True means and why we do so.

  4. Print the first five rows of df_5 and df_6.

  5. Print the shapes of df_5 and df_6.

### WRITE YOUR SOLUTION HERE ###

# Question 1
"""
Answer:

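Logistic regression computes a weighted sum of the input features, so every feature must be numerical. Sex and Embarked hold category labels, so they must be converted to numbers first.

One hot encoding is used instead of integer labels (e.g. C = 0, Q = 1, S = 2) because integer labels would impose an order and spacing on categories that have none.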

"""

# Question 2
# Answer: (put your code here)

df_6 = pd.get_dummies(df_5, columns=['Sex', 'Embarked'], drop_first = True, dtype = np.int8)

# Question 3
"""
Answer:

Suppose a categorical variable takes value k chosen from K categories, indexed as 0, 1, ..., K-1.

By setting drop_first = True, it is replaced by a vector of length K-1 (instead of length K).

If k = 0, then in this vector, all entries are 0.

If k is not 0, then in this vector, the (k-1)th entry (entry indices start from 0 and end with K-2) is 1 and all other entries are 0.

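Setting drop_first = True avoids multicollinearity. Without it, the K indicator columns of a variable always sum to 1, which duplicates the intercept term, so the features are perfectly collinear and the regression coefficients are not uniquely determined.

No information is lost by dropping one column: the dropped category is the one represented by all zeros.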

"""

# Question 4
# Answer: (put your code here)

print(df_5.head())
print(df_6.head())

# Question 5
# Answer: (put your code here)

print(df_5.shape)
print(df_6.shape)
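
As a supplementary check of the answer to Question 3, the sketch below compares drop_first = False and drop_first = True on a small made-up table. The dataframe `toy`, its values, and the helper `design_matrix` are assumptions introduced only for illustration; they are not part of the original dataset or solution. It shows that, once an intercept column is added, the full encoding is rank deficient while the reduced encoding is not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_5 (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "Q", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 27.0],
})

# Encode both categorical columns with and without dropping the first category.
full = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=False, dtype=np.int8)
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=np.int8)

print(list(full.columns))
# ['Age', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(reduced.columns))
# ['Age', 'Sex_male', 'Embarked_Q', 'Embarked_S']


def design_matrix(frame):
    """Hypothetical helper: intercept column of ones followed by the dummy columns."""
    dummies = frame.drop(columns=["Age"]).to_numpy(dtype=float)
    return np.column_stack([np.ones(len(frame)), dummies])


X_full = design_matrix(full)        # 6 columns: intercept + 2 + 3 dummies
X_reduced = design_matrix(reduced)  # 4 columns: intercept + 1 + 2 dummies

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# Fewer independent columns than total columns means perfect collinearity.
print(X_full.shape[1], rank_full)        # 6 4
print(X_reduced.shape[1], rank_reduced)  # 4 4
```

In the full encoding, Sex_female + Sex_male and Embarked_C + Embarked_Q + Embarked_S each equal the intercept column, which accounts for the two missing ranks.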

""" END OF THIS PART """