13. Missing Data

DEALING WITH THE PROBLEM OF MISSING DATA

Activity

As you can see in Data.csv, two values are missing.
One is the Age of an observation from Spain, and the other is the Salary of an observation from Germany.
The first idea is to remove the rows (observations) that contain missing data.
But if the dataset contains crucial information, removing an observation could be quite dangerous.
Another idea is to take the mean of each column and replace the missing value in that column with the mean.
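
To make the two ideas concrete, here is a minimal sketch in plain Python. The rows are made-up values, not the contents of Data.csv, and the names `rows`, `complete_rows` and `imputed_rows` are only for illustration.

```python
from statistics import mean

# Hypothetical (age, salary) rows; None marks a missing value.
rows = [
    (44, 72000),
    (27, None),
    (None, 54000),
    (37, 60000),
]

# Idea 1: drop every row that has a missing value.
complete_rows = [r for r in rows if None not in r]

# Idea 2: replace each missing value with the mean of its column,
# where the mean is computed from the observed values only.
columns = list(zip(*rows))
column_means = [mean(v for v in col if v is not None) for col in columns]
imputed_rows = [
    tuple(m if v is None else v for v, m in zip(row, column_means))
    for row in rows
]

print(len(complete_rows))  # rows left after dropping
print(imputed_rows)        # same number of rows, gaps filled
```

Dropping rows here throws away half of the observations, while mean replacement keeps all of them.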

Dealing with missing Data in Python

    # Taking care of missing Data
    from sklearn.preprocessing import Imputer

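The sklearn.preprocessing module contains tools to preprocess any dataset.
The Imputer class will allow us to take care of any missing data (it was removed in scikit-learn 0.22; newer versions provide SimpleImputer in sklearn.impute instead).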

    imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

Press Ctrl + I on Windows with the cursor on the function, and you can see the documentation associated with it.
You can also go to Help and type in the object name to get the documentation.
For the first argument, we pass missing_values = "NaN", the placeholder that marks a missing value.
For the second argument, strategy = "mean" replaces each missing value with the mean.
axis is set to 0 because we want the mean of the column (axis = 1 would take the mean of the row).
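
On scikit-learn 0.22 or later, where Imputer is no longer available, roughly the same setup can be sketched with SimpleImputer. This is not the code from the lecture: the array below is made up, and the example assumes numpy and a recent scikit-learn are installed. SimpleImputer has no axis argument because it always works column by column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Age and Salary columns; np.nan marks a missing value.
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [37.0, 60000.0],
])

# Replace each np.nan with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```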

    imputer = imputer.fit(X[:, 1:3])

The fit method fits the imputer to X, using only the columns that contain missing data.
In 1:3, 1 is the lower bound column, which is included, and 3 is the upper bound column, which is excluded, so columns 1 and 2 (Age and Salary) are selected.
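
The slicing rule is easy to check on a small array. This is only a sketch with a made-up matrix, assuming numpy is installed; the values are chosen so that each one shows its row and column.

```python
import numpy as np

# Hypothetical matrix: 2 rows, 4 columns (column indices 0, 1, 2, 3).
X = np.array([
    [10, 11, 12, 13],
    [20, 21, 22, 23],
])

# ":" keeps every row; "1:3" keeps columns 1 and 2 and stops before 3.
selected = X[:, 1:3]

print(selected.shape)
print(selected)
```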

    X[:, 1:3] = imputer.transform(X[:, 1:3])

The transform method replaces the missing data with the mean of the column.
Now select the block of code and press Ctrl + Enter to run it. If the console shows the code without an error, it has run properly.
Type X in the console, and you should now see the complete data.
For some datasets, you can change the strategy from the mean to the median to replace the missing values.
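
One case where the median may be the better choice is a column with an extreme value. The sketch below uses made-up salaries and plain Python to compare the two fill values; the variable names are only for illustration.

```python
from statistics import mean, median

# Hypothetical salaries with one extreme value; None marks the missing entry.
salaries = [48000, 52000, 54000, 58000, 400000, None]
observed = [s for s in salaries if s is not None]

mean_fill = mean(observed)      # pulled upward by the 400000 value
median_fill = median(observed)  # middle of the sorted observed values

filled_with_mean = [mean_fill if s is None else s for s in salaries]
filled_with_median = [median_fill if s is None else s for s in salaries]

print(mean_fill, median_fill)
```

Here the mean is far above most of the observed salaries, while the median stays close to a typical one.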

Dealing with missing Data in R

    # Taking care of missing data
    dataset$Age = ifelse(is.na(dataset$Age),
                         ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Age)

If a value in the Age column is missing, we replace it with the average of the column; if not, we keep the original value.
Press Ctrl + Enter to run it and check that all values in Age are there now.
The dataset in the Data tab should now show the average age in place of the missing value in Age.

Now let's do the same for Salary

    dataset$Salary = ifelse(is.na(dataset$Salary),
                            ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                            dataset$Salary)

Run it, and the dataset should now have no missing values.
