Machine Learning A-Z™: Hands-On Python & R In Data

14. Categorical Data

HOW TO ENCODE CATEGORICAL DATA

Activity

  • Looking at Data.csv, we see two categorical variables: Country and Purchased

  • Their values are text categories rather than numbers

  • Since machine learning models are based on mathematical equations, we need to encode this text as numbers
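One way to spot the categorical columns programmatically is to ask pandas for the non-numeric ones. This is only a sketch: the small DataFrame below is a hypothetical stand-in for Data.csv, and its values and the Age column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv (values and the Age column are made up)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Purchased": ["No", "Yes", "No"],
})

# Columns that are not numeric are the candidates for encoding
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```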

Encoding categorical data in Python

    # Encoding categorical data
    from sklearn.preprocessing import LabelEncoder
    labelEncoder_X = LabelEncoder()
    X[:, 0] = labelEncoder_X.fit_transform(X[:, 0])
  • We import the LabelEncoder class from sklearn.preprocessing and create an object (instance) of it

  • We then fit and transform the Country column

  • All of the countries are now encoded as numbers

  • We assign the result of labelEncoder_X.fit_transform(X[:, 0]) back to X[:, 0], so the Country column holds the encoded numbers instead of text

  • If you print X in the IPython console, the Country column now contains only numbers

  • The problem is that these numbers imply an order between the countries that does not exist

  • So we have to prevent the machine learning equations from thinking that Germany is greater than France, and Spain is greater than Germany

  • To prevent this, we are going to use dummy variables
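The ordering problem is easy to see on a small example. The sketch below uses a made-up list of countries (not the actual contents of Data.csv) and shows that LabelEncoder assigns codes in sorted order of the labels, which is why Spain ends up "greater" than Germany.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up sample of the Country column
countries = np.array(["France", "Spain", "Germany", "Spain"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the labels in sorted order; a label's position is its code
print(encoder.classes_)  # ['France' 'Germany' 'Spain']
print(codes)             # [0 2 1 2]
```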

Dummy Encoding in Python

  • Instead of having one column for Country, we are going to have three columns, one for each country category

  • So if the country is France, the France column is going to be one and the other two columns zero, and likewise for the other countries
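To make the idea concrete, the dummy columns can be built by hand with NumPy. This is a sketch of the concept on made-up values, not the method the course uses.

```python
import numpy as np

# Made-up sample of the Country column
countries = np.array(["France", "Spain", "Germany", "Spain"])

# One dummy column per distinct category (np.unique returns them sorted)
categories = np.unique(countries)

# Compare every row against every category: 1 where they match, else 0
dummies = (countries[:, None] == categories[None, :]).astype(int)
print(categories)
print(dummies)
```

Each row contains exactly one 1, in the column of that row's country.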

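    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22,
    # so this line only runs on older versions
    onehotencoder = OneHotEncoder(categorical_features = [0])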
  • We are going to use the OneHotEncoder class to do this, so take a look at its documentation

  • Look at the categorical_features parameter: it specifies which columns hold categorical data, which here is column 0

    X = onehotencoder.fit_transform(X).toarray()
  • This line encodes column 0 of X

  • Press Ctrl+Enter to check that the code runs, then print X in the IPython console

  • If you look at X in the variable explorer, you can now see the Country values encoded as 0s and 1s

  • If the values are hard to read, change the number format to %.0f so that no decimals are shown

  • Now if you open X, you can see that the Country column has been replaced by 3 columns
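On recent scikit-learn versions the categorical_features argument no longer exists. One common alternative is to wrap OneHotEncoder in a ColumnTransformer. The following is a minimal sketch under that assumption; the small X array is made up, and the transformer name "country" is an arbitrary label chosen for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up feature matrix: column 0 is the country, column 1 a numeric feature
X = np.array(
    [["France", 44.0], ["Spain", 27.0], ["Germany", 30.0], ["Spain", 38.0]],
    dtype=object,
)

# One-hot encode column 0 and pass the remaining columns through unchanged;
# sparse_threshold=0 asks for a dense array instead of a sparse matrix
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)
print(X_encoded)
```

The three dummy columns come first (France, Germany, Spain), followed by the untouched numeric column.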

    labelEncoder_Y = LabelEncoder()
    Y = labelEncoder_Y.fit_transform(Y)
  • Now we do the same thing for Y, the Purchased column

  • Since there are only 2 categories, a single column of 0s and 1s is enough and we do not need OneHotEncoder to create more columns
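A two-category column maps cleanly onto 0 and 1, and the fitted encoder can translate the numbers back into the original labels. The sketch below assumes the two labels are "No" and "Yes"; the values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of the Purchased column (labels are an assumption)
purchased = ["No", "Yes", "No", "Yes"]

encoder_y = LabelEncoder()
y = encoder_y.fit_transform(purchased)

# inverse_transform recovers the original text labels from the codes
recovered = encoder_y.inverse_transform(y)
print(y)
print(recovered)
```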

Encoding categorical data in R

    # Encoding categorical data
    dataset$Country = factor(dataset$Country, levels = c('France', 'Spain', 'Germany'), labels = c(1, 2, 3))
  • We use the factor function to turn the Country column into a factor; c() creates a vector in R, and here we pass it a vector of the 3 country names as levels

  • The labels are the numbers associated with each country, in the same order as the levels

  • Execute the code and see how the Country column of the dataset changes

  • Do the same for the Purchased column, and both categorical columns, Country and Purchased, will show the encoded values.
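For Python learners, the explicit level-to-label mapping used in the R code can be roughly mirrored with pandas.Categorical. This is a sketch on made-up values, not an exact equivalent of an R factor.

```python
import pandas as pd

# Made-up sample of the Country column
country = pd.Series(["France", "Spain", "Germany", "Spain"])

# Same level order as in the R call: France, Spain, Germany
levels = ["France", "Spain", "Germany"]

# .codes gives 0-based positions in `levels`; add 1 to match labels 1, 2, 3
codes = pd.Categorical(country, categories=levels).codes + 1
print(codes.tolist())
```

Unlike LabelEncoder, which sorts the labels, this keeps whatever order you list in levels.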


Last updated 6 years ago