13. Missing Data

DEALING WITH THE PROBLEM OF MISSING DATA

Activity

  • As you can see in Data.csv, two values are missing.

  • One missing value is in the Age column for Spain, and the other is in the Salary column for Germany.

  • The first idea is to remove the rows (observations) that contain missing data.

  • But if the dataset contains crucial information, removing an observation could be quite dangerous.

  • Another idea is to replace each missing value with the mean of its column.
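
The two ideas above can be compared side by side in a minimal pandas sketch. The DataFrame below is a made-up stand-in for Data.csv (the values are hypothetical, not taken from the real file), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv: made-up values, not the real file.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
})

# Idea 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Idea 2: replace each missing value with the mean of its own column.
numeric_cols = ["Age", "Salary"]
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print("rows before:", len(df), "rows after dropping:", len(dropped))
print(filled)
```

In this toy frame, dropping rows discards half of the observations, while mean replacement keeps all four rows.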

Dealing with missing Data in Python

    # Taking care of missing data
    # Note: the Imputer class is only available in scikit-learn versions before 0.22
    from sklearn.preprocessing import Imputer
  • sklearn (scikit-learn) contains important libraries for preprocessing any dataset

  • The Imputer class lets us take care of missing data

    imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
  • Press Ctrl + I on Windows to see the documentation associated with the function

  • You can also go to Help and type in the object name to get the documentation

  • For the first argument, we pass missing_values = "NaN"

  • For the second argument, strategy = "mean" replaces each missing value with the mean

  • axis is set to 0 because we want the mean of each column.

  • To fit the imputer to X, we call its fit method on the columns that contain missing data, selected with the column slice 1:3

  • In that slice, 1 is the lower bound (included) and 3 is the upper bound (excluded), so columns 1 and 2 are selected

  • The transform method then replaces the missing data with the mean of the column

  • Now select the block of code and run it; if the console shows no error, the code has run properly

  • Type X in the console, and you should now see a complete dataset

  • For some datasets, you can change the strategy from the mean to the median to replace the missing values.
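
The fit and transform steps described above can be sketched as follows. This sketch assumes a recent scikit-learn, where the `Imputer` class has been removed and `SimpleImputer` from `sklearn.impute` takes its place; `SimpleImputer` always works column by column, so it has no `axis` argument, and the missing-value marker is `np.nan` rather than the string "NaN". The matrix X is a made-up stand-in for the features of Data.csv, not the real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix (made-up values): column 0 is the country,
# columns 1 and 2 are Age and Salary.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", np.nan, 48000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", 38.0, 61000.0],
], dtype=object)

# Slice 1:3 selects columns 1 and 2 (lower bound included, upper bound excluded).
numeric = X[:, 1:3].astype(float)

# Fit learns the mean of each column; transform fills the gaps with those means.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)
X[:, 1:3] = imputer.transform(numeric)
print(X)

# Same idea with the median instead of the mean.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_median = median_imputer.fit_transform(numeric)
print(X_median)
```

The median variant is less sensitive to extreme values, which is one reason it can suit some datasets better than the mean.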

Dealing with missing Data in R

  • We check each value in the Age column: if it is missing, we replace it with the average of the column; if not, we keep the value

  • Press Ctrl + Enter to run it and check that all values in Age are now present

  • The dataset in the Data tab should now show the average age in place of the missing values in Age

  • Now let's do the same for Salary

  • Run it, and the dataset should now have no missing values
