Machine Learning A-Z™: Hands-On Python & R In Data

13. Missing Data

DEALING WITH THE PROBLEM OF MISSING DATA

Activity

  • As you can see in Data.csv, two values are missing.

  • One is in the Age column for Spain, and the other is in the Salary column for Germany.

  • The first idea is to remove the rows (observations) that contain missing data.

  • But if the dataset contains crucial information, removing an observation can be quite dangerous.

  • Another idea is to take the mean of each column and replace the missing data in that column with this mean.
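The two ideas above can be compared on a small example. This is a minimal sketch using NumPy only; the array `data` is a hypothetical stand-in with made-up Age and Salary values, not the actual contents of Data.csv.

```python
import numpy as np

# Hypothetical Age and Salary values with two gaps (np.nan); not the real Data.csv
data = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 52000.0],
    [40.0, 61000.0],
])

# Idea 1: drop every row that has at least one missing value
complete_rows = data[~np.isnan(data).any(axis=1)]

# Idea 2: replace each missing value with the mean of its column
column_means = np.nanmean(data, axis=0)  # column means, ignoring NaN
filled = np.where(np.isnan(data), column_means, data)

print(complete_rows.shape)  # rows that survive when incomplete rows are dropped
print(filled)               # same shape as data, with the gaps filled
```

With this toy data, dropping rows discards two of the five observations, while mean replacement keeps all five.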

Dealing with missing Data in Python

    # Taking care of missing Data
    from sklearn.preprocessing import Imputer
  • The sklearn.preprocessing module contains important tools to preprocess any dataset

  • Imputer will allow us to take care of any missing data

    imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
  • Press Ctrl + I on Windows to see the documentation associated with the function

  • You can also go to help and just type in the object to get the documentation

  • For the first argument, we pass missing_values = "NaN", the placeholder that marks a missing value

  • For the second argument, the strategy is "mean", which replaces each missing value with the mean

  • axis is equal to 0, as we want to take the mean of each column rather than of each row.

    imputer = imputer.fit(X[:, 1:3])
  • To fit the imputer to X, we use the fit method, as shown above

  • In 1:3, 1 is the lower bound, which is included, and 3 is the upper bound, which is excluded, so the columns at index 1 and 2 are selected
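The slice bounds can be checked on a tiny array. In this sketch, `X_demo` is a hypothetical matrix whose values equal their column index, so the selected columns are easy to read off.

```python
import numpy as np

# Hypothetical 2 x 4 matrix standing in for X; each value equals its column index
X_demo = np.array([[0, 1, 2, 3],
                   [0, 1, 2, 3]])

# All rows, columns 1 and 2: the lower bound 1 is included, the upper bound 3 is not
selected = X_demo[:, 1:3]
print(selected)
```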

    X[:, 1:3] = imputer.transform(X[:, 1:3])
  • The transform method replaces the missing data with the mean of the column

  • Now select the block of code and press Ctrl + Enter to run it. If the console shows the code without an error, it has run properly

  • Type X in the console, and you should now see the complete data with no missing values

  • For some datasets, you can change the strategy from the mean to the median to replace the missing values.
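The Imputer class used above was removed in scikit-learn 0.22. On newer versions, SimpleImputer from sklearn.impute is the replacement. It always works column by column, so it has no axis argument. The sketch below assumes a recent scikit-learn and uses a hypothetical array `X_num` in place of X[:, 1:3]. It also shows the median strategy mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X[:, 1:3] (made-up Age and Salary values)
X_num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 52000.0],
    [40.0, 61000.0],
])

# Mean strategy: fit learns the column means, transform fills the gaps
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X_num)
X_mean = imputer.transform(X_num)

# Median strategy: same steps, done in one call with fit_transform
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_median = median_imputer.fit_transform(X_num)

print(imputer.statistics_)  # the per-column values learned by fit
print(X_mean)
print(X_median)
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed columns.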

Dealing with missing Data in R

    # Taking care of missing data
    dataset$Age = ifelse(is.na(dataset$Age),
                         ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Age)
  • ifelse checks each value in Age: if it is missing, it is replaced with the average of the Age column; otherwise the original value is kept

  • Press Ctrl + Enter to run it and check that every value in Age is now present

  • The dataset in the Data tab should now show the average age in place of the missing value in Age

  • Now let's do the same for Salary

      dataset$Salary = ifelse(is.na(dataset$Salary), 
                       ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                       dataset$Salary)
  • Run it and the dataset should have no missing values now


Last updated 6 years ago