16. Splitting the Dataset into the Training set and Test set

MAKING THE MACHINE LEARNING MODELS

Activity

We are going to split the two data sets into training and test set.
Machine learning means the machine is going to learn to do something from a data set.
We are going to build our machine learning models on a data set but then we have to test on a new set which is going to be slightly different from the data set on which we build th machine learning model
Test the performance on the test set shouldnt be that different from the performance from the training set

Splitting the Datasets in Python

    # Splitting the dataset into the Training set and Test set
    import sklearn.cross_validation import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

The better we learn the correlationship, the better we are at predicting the test set learned correctly from the training set.

Splitting the Datasets in R

    # Splitting the data set into the Training set and Test set
    # install.packages('caTools')
    set.seed(123)
    split = sample.split(datasets$Purchased, SplitRatio = 0.8)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)

Once you installed the caTools packages once, you can comment it because you do need to install it again
You can check if caTools library in the Packages on the right hand side
Select caTools package before running the code
Now, you can view the training_set and the test_set which should show the respective number of observations

Previous15. WARNING - Update Next17. Feature Scaling

Last updated 7 years ago