13. Missing Data
DEALING WITH THE PROBLEM OF MISSING DATA
Activity
- As you can see in Data.csv, there are two missing values.
- One is in the Age column for Spain, and the other is in the Salary column for Germany.
- The first idea is to remove the observations (rows) that contain missing data.
- But if the dataset contains crucial information, removing an observation can be quite dangerous.
- Another idea is to replace each missing value with the mean of its column.
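The two ideas above can be compared in a minimal plain-Python sketch. The rows and values below are made up for illustration and are not the contents of Data.csv; `None` is assumed to mark a missing value, and `column_mean` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical rows (country, age, salary); None marks a missing value.
rows = [
    ("France", 44, 72000),
    ("Spain", None, 52000),
    ("Germany", 40, None),
    ("France", 35, 58000),
]

# Idea 1: drop every observation that has a missing value.
complete_rows = [r for r in rows if None not in r]

# Idea 2: replace each missing value with the mean of its column.
def column_mean(rows, idx):
    """Mean of the non-missing values in column idx (hypothetical helper)."""
    return mean(r[idx] for r in rows if r[idx] is not None)

age_mean = column_mean(rows, 1)
salary_mean = column_mean(rows, 2)
filled_rows = [
    (country,
     age if age is not None else age_mean,
     salary if salary is not None else salary_mean)
    for country, age, salary in rows
]

print(len(complete_rows), "rows survive dropping")
print(len(filled_rows), "rows survive mean replacement")
```

Dropping rows halves this tiny dataset, while mean replacement keeps every observation, which is why the rest of this section uses the second idea.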
Dealing with missing Data in Python
    # Taking care of missing Data
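    from sklearn.preprocessing import Imputer
- sklearn (scikit-learn) contains important libraries for preprocessing any dataset.
- Imputer will allow us to take care of any missing data.
- Imputer exists only in scikit-learn versions before 0.22; newer versions provide SimpleImputer in sklearn.impute instead.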
    imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
- Press Ctrl + I in Windows to see the documentation associated with the function.
- You can also go to Help and type in the object's name to get the documentation.
- For the first parameter, we set missing_values = "NaN", the placeholder that marks missing entries.
- For the second parameter, strategy = "mean" replaces each missing value with the mean.
- axis is set to 0 because we want to take the mean of the column.
    imputer = imputer.fit(X[:, 1:3])
- The fit method fits the imputer to the columns of X that contain missing data.
- In 1:3, 1 is the lower bound column index (included) and 3 is the upper bound column index (excluded), so the columns at indices 1 and 2 are selected.
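The slice bounds can be checked on a small NumPy array. This is only an illustrative sketch: the array below is hypothetical, and each value simply encodes its own column index so the selection is easy to read.

```python
import numpy as np

# Hypothetical 3 x 4 array; every value equals its column index.
X_demo = np.array([
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 3],
])

# All rows, columns 1 and 2. The upper bound 3 is excluded.
selected = X_demo[:, 1:3]

print(selected.shape)
print(selected[0])
```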
    X[:, 1:3] = imputer.transform(X[:, 1:3])
- The transform method replaces the missing data with the mean of the column.
- Now select the block of code and run it. If output appears on the console without an error, the code has run properly.
- Enter X in the console, and you should now see the complete data with no missing values.
- For some datasets, you can change the strategy from the mean to the median to replace the missing values.
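On scikit-learn 0.22 and later, the same fit and transform steps can be sketched with SimpleImputer from sklearn.impute. This is a minimal sketch under stated assumptions, not the code used in the walkthrough above: the array `X_num` and its values are hypothetical stand-ins for the two numeric columns, and SimpleImputer takes no axis argument because it always works column by column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary); np.nan marks a missing value.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Mean strategy: each missing value becomes the mean of its column.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_mean = mean_imputer.fit_transform(X_num)

# Median strategy: same call, only the strategy string changes.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_median = median_imputer.fit_transform(X_num)

print(X_mean)
print(X_median)
```

The median is often the safer choice when a column contains extreme values, since a single outlier shifts the mean far more than the median.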
Dealing with missing Data in R
    # Taking care of missing data
    dataset$Age = ifelse(is.na(dataset$Age),
                         ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Age)
- For each value in Age, we check whether it is missing. If it is, we replace it with the average of the Age column; if not, we keep the original value.
- Press Ctrl + Enter to run it and check that all values in Age are now present.
- In the Data tab, the dataset should now show the average age in place of the missing value in Age.
    dataset$Salary = ifelse(is.na(dataset$Salary),
                            ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                            dataset$Salary)
- Now let's do the same for Salary.
- Run it, and the dataset should now have no missing values.
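For readers following the Python track, the conditional replacement used in the R code (keep the value if present, otherwise use the column mean) can be sketched with NumPy. The `age` array and its values are hypothetical, and this is only an analogue of the ifelse pattern, not a translation of the full R workflow.

```python
import numpy as np

# Hypothetical Age column with one missing entry (np.nan).
age = np.array([44.0, 27.0, np.nan, 38.0])

# np.isnan plays the role of is.na, np.nanmean the role of mean(x, na.rm = TRUE),
# and np.where the role of ifelse: pick the mean where the value is missing.
age_filled = np.where(np.isnan(age), np.nanmean(age), age)

print(age_filled)
```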
