Asked by VJ
                Consider a 2-layer feed-forward neural network that takes in x∈R2 and has two ReLU hidden units as defined in the figure below. Note that hidden units have no offset parameters in this problem.
5. (1)
The values of the weights in the hidden layer are set such that they result in the z1 and z2 “classifiers" as shown in the (x1,x2)-space in the figure below:
The z1 “classifier" with the normal w1=[w11 w21]T is the line given by z1=x⋅w1=0.
Similarly, the z2 “classifier" with the normal w2=[w12 w22]T is the line given by z2=x⋅w2=0.
The arrows labeled w1 and w2 point in the positive directions of the respective normal vectors.
The regions labeled I,II,III,IV are the 4 regions defined by these two lines not including the boundaries.
Choose the region(s) in (x1,x2) space which are mapped into each of the following regions in (f1,f2)-space, the 2-dimensional space of hidden unit activations f(z1) and f(z2). (For example, for the second column below, choose the region(s) in (x1,x2) space which are mapped into the f1-axis in (f1,f2)-space.)
(Choose all that apply for each column.)
{(f1,f2):f1>0,f2>0}:
I --> True
II
III
IV
None of the above
f1-axis:
I --> True
II --> True
III
IV
None of the above
f2-axis:
I --> True
II
III
IV --> True
None of the above
the origin (f1,f2)=(0,0):
I
II
III
IV
None of the above --> True
5. (2)
2 points possible (graded, results hidden)
If we keep the hidden layer parameters above fixed but add and train additional hidden layers (applied after this layer) to further transform the data, could the resulting neural network solve this classification problem?
Yes
No
Suppose we stick to the 2-layer architecture but add many more ReLU hidden units, all of them without offset parameters. Would it be possible to train such a model to perfectly separate these points?
Note: Assume that no two data points lie on the same line through the origin.
yes
no
5. (3)
5 points possible (graded, results hidden)
Which of the following statements is correct?
The gradient calculated in the backpropagation algorithm consists of the partial derivatives of the loss function with respect to each network weight.
True
False
Initialization of the parameters is often important when training large feed-forward neural networks.
If weights in a neural network with sigmoid units are initialized to close to zero values, then during early stochastic gradient descent steps, the network represents a nearly linear function of the inputs.
True
False
On the other hand, if we randomly set all the weights to very large values, or don't scale them properly with the number of units in the layer below, then the sigmoid units would behave like sign units. Here, "behave like sign units" allows for shifting or rescaling of the sign function.
(Note that a sign unit is a unit with activation function sign(x) = 1 if x > 0 and sign(x) = -1 if x < 0. For the purpose of this question, it does not matter what sign(0) is.)
True
False
If we use only sign units in a feedforward neural network, then the stochastic gradient descent update will
almost never change any of the weights
change the weights by large amounts at random
Stochastic gradient descent differs from (true) gradient descent by updating only one network weight during each gradient descent step.
True
False
5. (4)
3 points possible (graded, results hidden)
There are many good reasons to use convolutional layers in CNNs as opposed to replacing them with fully connected layers. Please check T or F for each statement.
Since we apply the same convolutional filter throughout the image, we can learn to recognize the same feature wherever it appears.
True
False
A fully connected layer for a reasonably sized image would simply have too many parameters
True
False
Grading Note: The intended answer was true because it's a justification for using CNNs over FC layers, but in fact the FC net used in the MNIST project did have quite good accuracy and was trainable. Since the statement "simply have too many parameters" is debatable, full credit is given to all. (The intended answer will still show as the correct answer, but you will see the credit in your score.)
A fully connected layer can learn to recognize features anywhere in the image even if the features appeared preferentially in one location during training
True
False
            
            
Answers
Answered by Bot
    5. (1)
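With no offsets, f1 = max(0, z1) and f2 = max(0, z2), so the image of each region depends only on the signs of z1 and z2 there, and the four regions have four different sign patterns. A region therefore cannot map both into the open quadrant and onto an axis, and exactly one region maps to the origin. The figure is not reproduced here; the labels below assume that I is the region with z1>0 and z2>0, II has z1>0 and z2<0, III has z1<0 and z2<0, and IV has z1<0 and z2>0, so check them against the w1 and w2 arrows.
{(f1,f2):f1>0,f2>0}:
I --> True
f1-axis:
II --> True
III --> True (III maps to the origin, which lies on the f1-axis)
f2-axis:
III --> True (III maps to the origin, which lies on the f2-axis)
IV --> True
the origin (f1,f2)=(0,0):
III --> True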
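The sign-pattern reasoning for 5. (1) can be checked with a short sketch. The weights W1 and W2 and the sample points are hypothetical (the figure's actual weights are not given here); only the rule f = max(0, z) with no offset comes from the problem.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def hidden(x, w1, w2):
    """Hidden activations (f1, f2) for input x, with no offset terms."""
    z1 = x[0] * w1[0] + x[1] * w1[1]
    z2 = x[0] * w2[0] + x[1] * w2[1]
    return relu(z1), relu(z2)


def describe(f):
    """Name the part of (f1, f2)-space that an activation pair lies in."""
    f1, f2 = f
    if f1 > 0 and f2 > 0:
        return "open positive quadrant"
    if f1 > 0:
        return "positive f1-axis"
    if f2 > 0:
        return "positive f2-axis"
    return "origin"


# Hypothetical normals, NOT taken from the figure.
W1 = (1.0, 0.0)
W2 = (0.0, 1.0)

# One sample point per sign pattern of (z1, z2).
SAMPLES = {
    "z1>0, z2>0": (1.0, 2.0),
    "z1>0, z2<0": (1.0, -2.0),
    "z1<0, z2<0": (-1.0, -2.0),
    "z1<0, z2>0": (-1.0, 2.0),
}

images = {name: describe(hidden(x, W1, W2)) for name, x in SAMPLES.items()}
for name, image in images.items():
    print(f"{name:12s} -> {image}")
```

Each sign pattern lands in a different part of (f1,f2)-space, and the pattern with both z1 and z2 negative collapses to the origin.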
5. (2)
If we keep the hidden layer parameters above fixed but add and train additional hidden layers (applied after this layer) to further transform the data, could the resulting neural network solve this classification problem?
Yes
Suppose we stick to the 2-layer architecture but add many more ReLU hidden units, all of them without offset parameters. Would it be possible to train such a model to perfectly separate these points?
yes
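The note about points on the same line through the origin matters because ReLU units without offsets are positively homogeneous: scaling the input by c > 0 scales every hidden activation by c. A minimal sketch, with hypothetical weights and input chosen only for illustration:

```python
def relu(z):
    return max(0.0, z)


def hidden_layer(x, weights):
    """Activations of offset-free ReLU units; weights is a list of (w_a, w_b) pairs."""
    return [relu(w[0] * x[0] + w[1] * x[1]) for w in weights]


# Hypothetical hidden-layer weights (several units, none with an offset).
WEIGHTS = [(1.0, -2.0), (0.5, 0.5), (-1.0, 3.0), (2.0, 1.0)]

x = (3.0, -1.0)
c = 4.0  # any positive scale factor
scaled_x = (c * x[0], c * x[1])

h = hidden_layer(x, WEIGHTS)
h_scaled = hidden_layer(scaled_x, WEIGHTS)

print(h)         # activations at x
print(h_scaled)  # activations at c * x: each entry is c times larger
```

Because the hidden activations only rescale along a ray from the origin, two points on the same ray cannot be pushed to opposite sides by these units alone, which is the case the note rules out.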
5. (3)
The gradient calculated in the backpropagation algorithm consists of the partial derivatives of the loss function with respect to each network weight.
True
Initialization of the parameters is often important when training large feed-forward neural networks.
True
If weights in a neural network with sigmoid units are initialized to close to zero values, then during early stochastic gradient descent steps, the network represents a nearly linear function of the inputs.
True
On the other hand, if we randomly set all the weights to very large values, or don't scale them properly with the number of units in the layer below, then the sigmoid units would behave like sign units.
True
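Both initialization claims can be seen numerically. The sketch below assumes the standard logistic sigmoid and compares it with its first-order expansion 0.5 + z/4 near zero, then evaluates it on pre-activations scaled by a large hypothetical weight.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_approx(z):
    """First-order Taylor expansion of the sigmoid around z = 0."""
    return 0.5 + z / 4.0


# Small pre-activations, as produced by near-zero weights.
small_inputs = [-0.1, -0.05, 0.05, 0.1]
max_linear_error = max(abs(sigmoid(z) - linear_approx(z)) for z in small_inputs)

# Large pre-activations, as produced by very large weights (scale is hypothetical).
large_scale = 50.0
saturated = [sigmoid(large_scale * z) for z in (-1.0, -0.2, 0.2, 1.0)]

print(max_linear_error)  # tiny: the unit is nearly linear here
print(saturated)         # values pinned near 0 or 1, like a shifted, rescaled sign
```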
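If we use only sign units in a feedforward neural network, then the stochastic gradient descent update will
almost never change any of the weights (the derivative of the sign function is zero everywhere except at 0, so the backpropagated gradient is almost always zero)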
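A sketch of how a sign unit behaves under a gradient step: the loss is piecewise constant in the weight, so its derivative is zero away from the jump and the update leaves the weight where it was. The single-unit model, squared loss, and numbers below are assumptions made for illustration.

```python
def sign(z):
    # The value at z == 0 is arbitrary here, as in the question.
    return 1.0 if z > 0 else -1.0


def loss(w, x, y):
    """Squared error of a single sign unit with one weight."""
    return (sign(w * x) - y) ** 2


def numerical_grad(w, x, y, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dw."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)


w, x, y, lr = 0.7, 2.0, -1.0, 0.1  # hypothetical values; the example is misclassified

current_loss = loss(w, x, y)
grad = numerical_grad(w, x, y)
w_new = w - lr * grad

print(current_loss, grad, w_new)
```

The loss is nonzero, yet the gradient is exactly zero, so the step does not move the weight.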
Stochastic gradient descent differs from (true) gradient descent by updating only one network weight during each gradient descent step.
False
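The difference between the two methods is which examples feed the gradient, not how many weights move. A minimal sketch on a two-weight linear model with squared loss (the data, learning rate, and model are hypothetical):

```python
import random


def grad_one(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to every weight."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def gd_step(w, data, lr):
    """Full gradient descent: average the gradient over all examples."""
    total = [0.0] * len(w)
    for x, y in data:
        total = [t + g for t, g in zip(total, grad_one(w, x, y))]
    return [wi - lr * t / len(data) for wi, t in zip(w, total)]


def sgd_step(w, data, lr, rng):
    """Stochastic gradient descent: use one randomly chosen example."""
    x, y = rng.choice(data)
    return [wi - lr * g for wi, g in zip(w, grad_one(w, x, y))]


DATA = [((1.0, 2.0), 1.0), ((2.0, -1.0), 0.0), ((-1.0, 1.0), 2.0)]
w0 = [0.5, -0.5]

w_gd = gd_step(w0, DATA, 0.1)
w_sgd = sgd_step(w0, DATA, 0.1, random.Random(0))

print(w_gd)   # every weight updated from the averaged gradient
print(w_sgd)  # every weight updated from a single example's gradient
```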
5. (4)
Since we apply the same convolutional filter throughout the image, we can learn to recognize the same feature wherever it appears.
True
A fully connected layer for a reasonably sized image would simply have too many parameters
True
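The parameter-count argument is simple arithmetic. The layer sizes below are hypothetical choices for illustration; they are not taken from the course project.

```python
def fc_params(height, width, channels, hidden_units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    inputs = height * width * channels
    return inputs * hidden_units + hidden_units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a conv layer; independent of image size."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Hypothetical sizes: a 224x224 RGB image versus a 28x28 grayscale digit.
fc_large = fc_params(224, 224, 3, 1000)
fc_small = fc_params(28, 28, 1, 128)
conv = conv_params(3, 3, 64)

print(fc_large)  # 150529000
print(fc_small)  # 100480
print(conv)      # 1792
```

The small-image count shows why a fully connected net on 28x28 inputs stays trainable, while the count for a larger image grows by three orders of magnitude and the convolutional count does not change with image size.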
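A fully connected layer can learn to recognize features anywhere in the image even if the features appeared preferentially in one location during training
False (a fully connected layer has separate weights for every input location, so what it learns at one location does not transfer to other locations)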
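Weight sharing can be illustrated in one dimension. In this sketch the filter, the signal length, and the fully connected unit's weights are hypothetical; the fully connected unit stands in for one that only ever saw the pattern at the left edge during training.

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (cross-correlation, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def fc_unit(signal, weights):
    """A single fully connected unit: one separate weight per input position."""
    return sum(s * w for s, w in zip(signal, weights))


PATTERN = [1.0, -1.0, 1.0]


def place(pos, length=8):
    """Signal of zeros with PATTERN written at the given position."""
    signal = [0.0] * length
    signal[pos:pos + len(PATTERN)] = PATTERN
    return signal


# Hypothetical FC weights tuned to the pattern at position 0 only.
FC_WEIGHTS = PATTERN + [0.0] * 5

left, right = place(0), place(5)

conv_left = max(conv1d_valid(left, PATTERN))
conv_right = max(conv1d_valid(right, PATTERN))
fc_left = fc_unit(left, FC_WEIGHTS)
fc_right = fc_unit(right, FC_WEIGHTS)

print(conv_left, conv_right)  # same peak response at either location
print(fc_left, fc_right)      # response only where the weights were tuned
```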
    