I'm currently working on a project to see which data imputation method works best with a dataset I have.
I have the complete dataset.
Dependent variable (the value to impute): Yield of the crop
Independent variables (predictors): Year, Season, Production per hectare
So I'm planning to apply data imputation methods such as Multiple Linear Regression, KNN, and Polynomial Interpolation.
My method is to randomly remove some of the Yield values (the test set), train each technique on the remaining rows, impute the removed values, and compare the imputed values with the original Yield values.
Then I plan to select the data imputation method which works best for this dataset.
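To make the procedure concrete, here is a minimal sketch of what I mean. It runs on synthetic data, and the column names (`ProductionPerHectare`, the 0/1 encoding of `Season`) and the choice of RMSE as the score are assumptions for illustration, not my real setup. Polynomial interpolation is left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real dataset; all column names are hypothetical.
n = 200
df = pd.DataFrame({
    "Year": rng.integers(2000, 2020, size=n),
    "Season": rng.integers(0, 2, size=n),  # season encoded as 0/1 (assumption)
    "ProductionPerHectare": rng.uniform(1.0, 5.0, size=n),
})
df["Yield"] = (2.0 * df["ProductionPerHectare"]
               + 0.5 * df["Season"]
               + rng.normal(0.0, 0.1, size=n))

features = ["Year", "Season", "ProductionPerHectare"]

# Hide the Yield of 20% of the rows; the true values are kept only for scoring.
hidden_idx = rng.choice(n, size=n // 5, replace=False)
is_hidden = np.zeros(n, dtype=bool)
is_hidden[hidden_idx] = True
known, hidden = df[~is_hidden], df[is_hidden]

# Candidate imputation models; KNN is scaled so Year does not dominate distances.
candidates = {
    "linear_regression": LinearRegression(),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

# Fit each model on the known rows and score its imputations on the hidden rows.
rmse = {}
for name, model in candidates.items():
    model.fit(known[features], known["Yield"])
    imputed = model.predict(hidden[features])
    rmse[name] = float(np.sqrt(mean_squared_error(hidden["Yield"], imputed)))

best_method = min(rmse, key=rmse.get)
print(rmse, best_method)
```

Repeating this over several random masks would presumably give a steadier comparison than a single split.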
Assume this procedure is done in Python (Google Colab environment).
So far I've coded up to the point where the model is trained using an 80:20 train:test split.
I've computed the linear regression coefficients, and the Yield values predicted by the model have already been inserted into my test dataset.
Since I need graphical and statistical evidence of the efficiency and accuracy of each model, how am I supposed to impute Yield values for the whole dataset and compare them with the original Yield values?
Do I have to manually write out the equation of the linear model, substitute the predictor values (Year, Season, Production per hectare), compute the Yield values, and then compare them with the original Yield values?
Is there any code that automatically adds a column with the Yield values derived from the linear regression model for the whole dataset? Any method that gives estimated values for all the Yield values in the dataset would do.
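What I have in mind is something like the sketch below. It assumes the model is scikit-learn's `LinearRegression`, whose `predict` method applies the fitted coefficients to any rows passed to it. The data is synthetic, and the column names (`ProductionPerHectare`, `Yield_predicted`, `Residual`) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for the real dataset; all column names are hypothetical.
n = 200
df = pd.DataFrame({
    "Year": rng.integers(2000, 2020, size=n),
    "Season": rng.integers(0, 2, size=n),  # season encoded as 0/1 (assumption)
    "ProductionPerHectare": rng.uniform(1.0, 5.0, size=n),
})
df["Yield"] = (2.0 * df["ProductionPerHectare"]
               + 0.5 * df["Season"]
               + rng.normal(0.0, 0.1, size=n))

features = ["Year", "Season", "ProductionPerHectare"]

# 80:20 split, then fit on the training rows only.
train, test = train_test_split(df, test_size=0.2, random_state=0)
model = LinearRegression().fit(train[features], train["Yield"])

# predict() evaluates the fitted equation for every row, so no manual
# substitution of coefficients is needed.
df["Yield_predicted"] = model.predict(df[features])
df["Residual"] = df["Yield"] - df["Yield_predicted"]

# Statistical evidence over the whole dataset.
mae = mean_absolute_error(df["Yield"], df["Yield_predicted"])
r2 = r2_score(df["Yield"], df["Yield_predicted"])
print(f"MAE={mae:.3f}  R2={r2:.3f}")
```

For the graphical side, a scatter plot of `Yield` against `Yield_predicted` (for example `df.plot.scatter(x="Yield", y="Yield_predicted")`) might be enough. One caveat I'm aware of: scores computed over the whole dataset include the training rows, so they would likely look better than scores on the held-out 20% alone.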