Machine Learning A-Z™: Hands-On Python & R In Data

16. Splitting the Dataset into the Training set and Test set

MAKING THE MACHINE LEARNING MODELS

Activity

  • We are going to split the dataset into a training set and a test set.

  • Machine learning means the machine learns to do something from a data set.

  • We build our machine learning model on one data set (the training set), but then we have to test it on a new set (the test set) that is slightly different from the data on which the model was built.

  • The performance on the test set should not be very different from the performance on the training set. If it is, the model has memorized the training data instead of learning patterns that generalize.
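To make the idea of holding out part of the data concrete, here is a minimal sketch of what an 80/20 split does conceptually. The function name `simple_train_test_split` and the toy `rows` list are hypothetical and exist only for illustration; this is not how scikit-learn or caTools implement their splits.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle the row indices with a fixed seed, then hold out a fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # fixed seed makes the split reproducible
    n_test = round(len(rows) * test_size)      # number of held-out observations
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

rows = list(range(10))                         # hypothetical data set of 10 observations
train, test = simple_train_test_split(rows)
print(len(train), len(test))                   # 8 2
```

Every observation ends up in exactly one of the two sets, so the model never sees the test rows while it is being built.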

Splitting the Datasets in Python

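    # Splitting the dataset into the Training set and Test set
    # train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
    from sklearn.model_selection import train_test_split
    # test_size = 0.2 holds out 20% of the rows; random_state = 0 makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)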
  • The better the model learns the correlations in the training set, the better it will predict the test set.
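One way to check this is to fit a model on the training set only and then compare its error on both sets. The sketch below does that with a hand-written least-squares line on synthetic data. The data, the helper names `fit_line` and `mse`, and the choice of mean squared error are all assumptions made for illustration, not part of the course material.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: y = 3 + 2x plus Gaussian noise (hypothetical, for illustration only)
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# The points were generated in random order, so slicing gives a random 80/20 split
X_train, X_test = xs[:80], xs[80:]
y_train, y_test = ys[:80], ys[80:]

# Fit on the training set only, then evaluate on both sets
intercept, slope = fit_line(X_train, y_train)
train_mse = mse(X_train, y_train, intercept, slope)
test_mse = mse(X_test, y_test, intercept, slope)
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

When the two errors are of similar size, the model has learned the underlying relationship rather than the particular training rows.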

Splitting the Datasets in R

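    # Splitting the data set into the Training set and Test set
    # install.packages('caTools')
    library(caTools)  # provides sample.split
    set.seed(123)     # fixed seed so the split is reproducible
    # split is TRUE for training rows (80%) and FALSE for test rows (20%)
    split = sample.split(dataset$Purchased, SplitRatio = 0.8)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)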
  • Once you have installed the caTools package, you can comment out the install.packages line because you do not need to install it again.

  • You can check whether caTools is installed by looking in the Packages tab on the right-hand side of RStudio.

  • Select (tick) the caTools package there, or call library(caTools), before running the code.

  • Now you can view training_set and test_set, which should each show their respective number of observations.
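According to the caTools documentation, `sample.split` keeps the ratio of the labels in the split vector roughly the same in both sets. The following Python sketch mimics that behaviour to show where the observation counts come from. The function `stratified_split` and the toy `purchased` labels are hypothetical, and this is a simplified illustration rather than the caTools implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, split_ratio=0.8, seed=123):
    """Return one boolean per row: True = training row, False = test row."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)              # group row indices by label
    is_train = [False] * len(labels)
    for idx in by_label.values():
        rng.shuffle(idx)
        n_train = round(len(idx) * split_ratio)
        for i in idx[:n_train]:                # keep split_ratio of each label for training
            is_train[i] = True
    return is_train

purchased = ["No"] * 5 + ["Yes"] * 5           # hypothetical label column
mask = stratified_split(purchased)
training_set = [p for p, keep in zip(purchased, mask) if keep]
test_set = [p for p, keep in zip(purchased, mask) if not keep]
print(len(training_set), len(test_set))        # 8 2
print(Counter(training_set))                   # 4 "No" and 4 "Yes"
```

With ten rows and a ratio of 0.8, eight rows land in the training set and two in the test set, with both labels represented in each.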


Last updated 6 years ago