# 16. Splitting the Dataset into the Training set and Test set

## MAKING THE MACHINE LEARNING MODELS

### Activity

* We are going to split the dataset into a training set and a test set.
* Machine learning means the machine learns to do something from a dataset.
* We build our machine learning models on one part of the data (the training set), then test them on a separate part (the test set) that is slightly different from the data the model was built on.
* The performance on the test set shouldn't be very different from the performance on the training set.
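
To make the idea of holding out a test set concrete, here is a minimal sketch in plain Python. The helper name `simple_train_test_split` and the toy `rows` list are made up for illustration; this is not how any particular library implements the split, only the general idea of shuffling the observations and setting a fraction aside.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = list(rows)                   # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle, so the split is repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]   # (training rows, test rows)

rows = list(range(10))   # stand-in for 10 observations
train_rows, test_rows = simple_train_test_split(rows, test_size=0.2, seed=0)

print(len(train_rows), len(test_rows))   # 8 2
```

Every observation ends up in exactly one of the two sets, so the model never sees the test rows while it is being built.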

### Splitting the Datasets in Python

```
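# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)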
```

* The better the model learns the correlations in the training set, the better it will be at predicting the test set.
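
The Python split can be tried end to end with a small self-contained sketch. The arrays `X` and `y` here are made-up stand-ins (10 observations, 2 features, a binary label), not the course dataset, and `sklearn.model_selection` is the module where current scikit-learn releases provide `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 10 observations, 2 features, binary label
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Same call as in the notes: 20% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)   # (8,) (2,)
```

With `test_size = 0.2`, 8 of the 10 rows land in the training set and 2 in the test set, and the rows of `X` and `y` stay aligned.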

### Splitting the Datasets in R

```
# Splitting the data set into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
# TRUE marks rows for the training set (80%), FALSE marks rows for the test set (20%)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
```

* Once you have installed the caTools package, you can comment out the `install.packages` line because you do not need to install it again
* You can check whether the caTools library is listed under Packages on the right-hand side
* Select the caTools package before running the code so that it is loaded
* Now you can view training\_set and test\_set, which should show their respective numbers of observations
