16. Splitting the Dataset into the Training set and Test set



  • We are going to split the dataset into a training set and a test set.

  • Machine learning means the machine learns to do something from a dataset.

  • We build our machine learning model on one part of the data (the training set), then test it on a separate part (the test set) that the model has not seen and that differs slightly from the data it was built on.

  • Performance on the test set should not be much different from performance on the training set. A large gap suggests the model memorized the training data instead of learning patterns that generalize.
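
To make the idea of a split concrete, here is a minimal sketch in plain Python that shuffles some rows and holds out a fraction of them as the test set. The helper name `manual_split` and the toy `rows` list are hypothetical and only illustrate the principle; they are not part of any library.

```python
import random


def manual_split(rows, test_size=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)       # seeded so the split is repeatable
    shuffled = rows[:]              # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))              # hypothetical dataset of 10 observations
training_set, test_set = manual_split(rows)
print(len(training_set), len(test_set))  # 8 2
```

Every row lands in exactly one of the two sets, so the model is never tested on an observation it was trained on.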

Splitting the Dataset in Python

    # Splitting the dataset into the Training set and Test set
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
  • The better the model learns the correlations in the training set, the better it can predict the test set.
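
As a rough illustration of this point, the sketch below splits a small dataset, fits a model on the training set only, and scores it on both sets. The synthetic data and the choice of `LinearRegression` are assumptions made for the example; they do not come from the course dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): y depends linearly on x, plus some noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# Hold out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set only, then score on both sets.
model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
print(f"R^2 on training set: {train_r2:.3f}")
print(f"R^2 on test set:     {test_r2:.3f}")
```

When the model has picked up the real relationship, the two scores should be close; a test score far below the training score is a sign of overfitting.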

Splitting the Dataset in R

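    # Splitting the dataset into the Training set and Test set
    # install.packages('caTools')
    library(caTools)  # load the package that provides sample.split
    set.seed(123)     # fix the seed so the split is reproducible (any integer works)
    split = sample.split(dataset$Purchased, SplitRatio = 0.8)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)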
  • Once you have installed the caTools package, you can comment out the install.packages line because you do not need to install it again

  • You can check whether caTools is installed by looking for it in the Packages tab on the right-hand side

  • Load caTools before running the code, either by ticking its checkbox in the Packages tab or by calling library(caTools)

  • Now you can view training_set and test_set; with SplitRatio = 0.8 they should hold roughly 80% and 20% of the observations, respectively
