An airline company wishes to build a predictor for whether or not a passenger will show up on time for a flight (a yes/no answer), based on predictive features such as: (1) how many times the passenger missed their flight in the past, (2) the time that the ticket was purchased and (3) the predicted amount of traffic for that day.

Let [mathjaxinline]\mathbf X[/mathjaxinline] be the vector of predictive features and let [mathjaxinline]Y[/mathjaxinline] be the desired feature we wish to predict for future passengers.

Remark: In this problem, we consider reasons for which a more generalized version of regression – as opposed to simple linear regression – might be more appropriate.

In the same setting as the previous problem (and in the context of the discussion of the solution), which of the following are true statements about [mathjaxinline]\mu[/mathjaxinline] and the pair [mathjaxinline](\mathbf X,Y)[/mathjaxinline]? Choose all that apply.

The range of values of [mathjaxinline]Y[/mathjaxinline] is bounded.

The range of values of [mathjaxinline]\mu[/mathjaxinline] is strictly positive.

Based on the range of values of [mathjaxinline]Y[/mathjaxinline], it is harder to assume that the noise is Gaussian.

Mathematically, linear regression is impossible to compute for Yes/No responses.

Mathematically, linear regression is impossible to compute for integer-valued features (e.g. [mathjaxinline]X_1[/mathjaxinline] = the number of missed flights).

1 answer

The correct statements are:

The range of values of [mathjaxinline]Y[/mathjaxinline] is bounded.

Based on the range of values of [mathjaxinline]Y[/mathjaxinline], it is harder to assume that the noise is Gaussian.

Explanation:

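1. The range of values of [mathjaxinline]Y[/mathjaxinline] is bounded: This statement is true. [mathjaxinline]Y[/mathjaxinline] is a yes/no answer, so once it is encoded numerically (say 1 for "yes" and 0 for "no") it takes only the two values in [mathjaxinline]\{0,1\}[/mathjaxinline].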

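2. The range of values of [mathjaxinline]\mu[/mathjaxinline] is strictly positive: This statement is not guaranteed. Assuming [mathjaxinline]\mu[/mathjaxinline] denotes the regression function and [mathjaxinline]Y[/mathjaxinline] is encoded as 0/1, we have [mathjaxinline]\mu(\mathbf x) = \mathbb E[Y \mid \mathbf X = \mathbf x] = \mathbb P(Y = 1 \mid \mathbf X = \mathbf x)[/mathjaxinline], which lies in [mathjaxinline][0,1][/mathjaxinline]. It is nonnegative, but it equals 0 for any feature value at which the passenger never shows up on time. The more important point is that [mathjaxinline]\mu[/mathjaxinline] is confined to [mathjaxinline][0,1][/mathjaxinline], whereas a linear function of [mathjaxinline]\mathbf x[/mathjaxinline] can take any real value.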

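3. Based on the range of values of [mathjaxinline]Y[/mathjaxinline], it is harder to assume that the noise is Gaussian: This statement is true. Since [mathjaxinline]Y[/mathjaxinline] takes only the values 0 and 1, the noise [mathjaxinline]Y - \mu(\mathbf X)[/mathjaxinline] can take only two values for a given [mathjaxinline]\mathbf X[/mathjaxinline]. A Gaussian random variable is continuous and unbounded, so the Gaussian noise assumption of the usual linear regression model cannot hold here.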

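4. Mathematically, linear regression is impossible to compute for Yes/No responses: This statement is false. Once the responses are encoded as 0/1, the least squares estimator can be computed exactly as for any other numerical response. The issue is not computation but modelling: the fitted values can fall outside [mathjaxinline][0,1][/mathjaxinline], and the Gaussian noise assumption fails. This is why a generalized model such as logistic regression is usually more appropriate.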

5. Mathematically, linear regression is impossible to compute for integer-valued features: This statement is false. Least squares places no continuity requirement on the features. It only needs the columns of the design matrix to be linearly independent, so integer-valued features such as [mathjaxinline]X_1[/mathjaxinline] (the number of missed flights) pose no difficulty.
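To illustrate the last two points, here is a minimal sketch using NumPy. The data are made up for illustration and are not part of the problem: `missed` is a hypothetical integer-valued feature (the number of past missed flights) and `on_time` is a hypothetical 0/1 response. Least squares runs without any trouble on such data, but the fitted line is not constrained to stay in the interval from 0 to 1.

```python
import numpy as np

# Hypothetical toy data (not from the problem statement):
# number of past missed flights, and whether the passenger
# showed up on time (1 = yes, 0 = no).
missed = np.array([0, 0, 1, 1, 2, 3, 4, 5], dtype=float)
on_time = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

# Design matrix with an intercept column and the integer-valued feature.
design = np.column_stack([np.ones_like(missed), missed])

# Ordinary least squares. It is well defined here because the two
# columns of the design matrix are linearly independent.
beta, *_ = np.linalg.lstsq(design, on_time, rcond=None)
fitted = design @ beta

print("intercept, slope:", beta)
print("fitted values:", fitted)
print("smallest fitted value:", fitted.min())
```

On this toy data the smallest fitted value is negative, so it cannot be read as a probability. That is the modelling problem a generalized linear model addresses, for example by passing the linear predictor through a link function whose inverse maps into the interval from 0 to 1.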