13. Missing Data
DEALING WITH THE PROBLEM OF MISSING DATA
Activity
As you can see in Data.csv, there are two missing values.
One is in the Age column for a Spain observation, and the other is in the Salary column for a Germany observation.
The first idea is to remove the rows (observations) that contain missing data.
But if the dataset contains crucial information, removing an observation could be quite dangerous.
Another idea is to take the mean of the column and replace the missing value with that mean.
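The two ideas above can be compared side by side in a minimal sketch. It uses pandas and a small made-up table that stands in for Data.csv; the rows and values are assumptions for illustration, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv: one missing Age (Spain) and one missing Salary (Germany).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
})

# Idea 1: drop every row that has a missing value. Two of the four observations are lost.
dropped = df.dropna()

# Idea 2: replace each missing value with the mean of its own column.
numeric_cols = ["Age", "Salary"]
filled = df.copy()
filled[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print(len(dropped), "rows left after dropping")
print(filled)
```

Dropping keeps only complete rows, while filling keeps every observation at the cost of inserting an estimated value.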
Dealing with missing Data in Python
sklearn contains important libraries for preprocessing any dataset.
Its Imputer class will allow us to take care of any missing data.
Press Ctrl + i in Windows to see the documentation associated with the function.
You can also go to Help and type in the name of the object to get the documentation.
For the first parameter, we pass missing_values = "NaN", which tells the imputer which entries count as missing.
For the second parameter, strategy, we choose to replace each missing value with the mean.
axis will be equal to 0, as we want to take the mean of the column rather than of the row.
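To see what axis = 0 means in practice, here is a small NumPy sketch. The array is made up for illustration; np.nanmean is used only to show column-wise versus row-wise means while skipping missing entries, and it is not part of the imputer itself.

```python
import numpy as np

# Hypothetical 3x2 array: the columns stand for Age and Salary.
values = np.array([
    [44.0, 72000.0],
    [np.nan, 48000.0],
    [30.0, np.nan],
])

# axis=0 collapses the rows, giving one mean per column (NaN entries are skipped).
column_means = np.nanmean(values, axis=0)

# axis=1 collapses the columns, giving one mean per row, which mixes Age with Salary.
row_means = np.nanmean(values, axis=1)

print(column_means)
print(row_means)
```

The column means are the sensible replacement values here, because each column holds a single kind of measurement.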
To fit the imputer to X, we call its fit method on the columns that contain missing data, X[:, 1:3].
In that slice, 1 is the lower-bound column, which is included, and 3 is the upper-bound column, which is excluded, so columns 1 and 2 are selected.
The transform method then replaces the missing data with the mean of the column.
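The fit and transform steps can be sketched as follows. One caveat: the Imputer class described here was removed from scikit-learn in version 0.22, so this sketch uses its replacement, sklearn.impute.SimpleImputer, which always works column by column and therefore has no axis parameter, and which takes np.nan rather than the string "NaN". The matrix X is a made-up stand-in for the features loaded from Data.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix X: column 0 is Country, column 1 is Age, column 2 is Salary.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", np.nan, 48000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", 38.0, 61000.0],
], dtype=object)

# Mark NaN as the missing value and replace it with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Fit on columns 1 and 2 only (index 1 included, index 3 excluded).
imputer = imputer.fit(X[:, 1:3])

# Replace the missing entries and write the result back into X.
X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)
```

With an older scikit-learn release that still ships Imputer, the same steps apply with the parameters described in the text.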
Now select the block of code and press Ctrl + Enter to run it. If the console echoes the code without an error, it has run properly.
Type X in the console, and you should now see the complete data with no missing values.
For some datasets, you can change the strategy from the mean to the median when replacing the missing values.
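One common reason to prefer the median is a column with extreme values, which pull the mean far from the typical observation. The sketch below illustrates this with made-up salaries and SimpleImputer (the current scikit-learn replacement for Imputer); the numbers are assumptions chosen to make the difference visible.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salaries: three typical values, one extreme value, one missing entry.
salaries = np.array([[48000.0], [52000.0], [54000.0], [900000.0], [np.nan]])

# Same data, two strategies for the missing entry in the last row.
mean_filled = SimpleImputer(strategy="mean").fit_transform(salaries)
median_filled = SimpleImputer(strategy="median").fit_transform(salaries)

print("mean fill:  ", mean_filled[-1, 0])
print("median fill:", median_filled[-1, 0])
```

Here the mean fill is dominated by the single extreme salary, while the median fill stays close to the typical values.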
Dealing with missing Data in R
We check each value in the Age column of the dataset: if it is missing, we replace it with the average of the column; if not, we retain the value.
Press Ctrl + Enter to run it and check that every value in Age is now present.
The dataset in the Data tab should now show the average age in place of the missing value in Age.
Now let's do the same for Salary.
Run it, and the dataset should now have no missing values.