Asked by riq

Your school is collecting data for a “Healthy Habits” project. The dataset includes Hours Slept, Water Drank (cups), and Steps Walked for each student yesterday.

When reviewing the data, you notice several missing values and one value that looks clearly incorrect.

[Image: Screenshot 2025-11-25 at 12.32.03 PM.png]

Hint:

48 hours of sleep in one night is impossible → likely a data entry error
All other numbers are realistic and within expected ranges
Which option is the BEST way to clean this dataset before analysis?



Group of answer choices

Remove the 48-hour outlier and replace missing values with the average of each column’s realistic values

Delete every row with a missing value or unrealistic value

Replace all missing values with 0 and keep the 48-hour value

Replace missing values with 10 and change 48 to 8 because that seems “reasonable”


Answered by GPT-5 mini AI
"Remove the 48-hour outlier and replace missing values with the average of each column’s realistic values."

Rationale: Sleeping 48 hours in one night is impossible, so the value should be treated as a data entry error: remove it, or correct it only if you can confirm the true value. Replacing missing values with the column mean, computed from the realistic values only, keeps every row in the dataset without introducing arbitrary zeros or guesses. Deleting every row with a problem would discard the valid measurements in those rows, and the other two options insert unjustified, biased values. (If a column's distribution is skewed, consider using its median instead of the mean.)
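
The cleaning steps described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows below are hypothetical (the actual values are in the screenshot and are not reproduced here), the column names `hours_slept`, `water_cups`, and `steps` are made up for the example, and the 24-hour plausibility limit is an assumed rule, not something given in the question.

```python
from statistics import mean

# Hypothetical data: None marks a missing value. These are NOT the
# values from the screenshot, only stand-ins for illustration.
rows = [
    {"hours_slept": 8.0, "water_cups": 6.0, "steps": 7000.0},
    {"hours_slept": None, "water_cups": 8.0, "steps": 9000.0},
    {"hours_slept": 48.0, "water_cups": None, "steps": 5000.0},
    {"hours_slept": 7.0, "water_cups": 4.0, "steps": None},
]

# Assumed plausibility limit: nobody sleeps more than 24 hours in a day.
MAX_HOURS_SLEPT = 24.0


def clean(rows):
    """Return a cleaned copy: impossible sleep values removed, gaps mean-filled."""
    cleaned = [dict(r) for r in rows]  # copy so the raw data stays untouched

    # Step 1: treat the impossible value as missing instead of keeping it.
    for r in cleaned:
        hours = r["hours_slept"]
        if hours is not None and hours > MAX_HOURS_SLEPT:
            r["hours_slept"] = None

    # Step 2: fill each gap with the mean of that column's remaining values.
    for col in cleaned[0]:
        observed = [r[col] for r in cleaned if r[col] is not None]
        col_mean = mean(observed)
        for r in cleaned:
            if r[col] is None:
                r[col] = col_mean

    return cleaned


cleaned = clean(rows)
for r in cleaned:
    print(r)
```

Because the outlier is removed before the means are computed, it cannot inflate the average used to fill the gaps, and no row is dropped. Swapping `mean` for `statistics.median` gives the median-based variant mentioned for skewed columns.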