{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment Analysis\n",
    "\n",
    "## Using XGBoost in SageMaker\n",
    "\n",
    "_Deep Learning Nanodegree Program | Deployment_\n",
    "\n",
    "---\n",
    "\n",
    "As our first example of using Amazon's SageMaker service we will construct a random tree model to predict the sentiment of a movie review. You may have seen a version of this example in a pervious lesson although it would have been done using the sklearn package. Instead, we will be using the XGBoost package as it is provided to us by Amazon.\n",
    "\n",
    "## Instructions\n",
    "\n",
    "Some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this notebook. You will not need to modify the included code beyond what is requested. Sections that begin with '**TODO**' in the header indicate that you need to complete or implement some portion within them. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `# TODO: ...` comment. Please be sure to read the instructions carefully!\n",
    "\n",
    "In addition to implementing code, there will be questions for you to answer which relate to the task and your implementation. Each section where you will answer a question is preceded by a '**Question:**' header. Carefully read each question and provide your answer below the '**Answer:**' header by editing the Markdown cell.\n",
    "\n",
    "> **Note**: Code and Markdown cells can be executed using the **Shift+Enter** keyboard shortcut. In addition, a cell can be edited by typically clicking it (double-click for Markdown cells) or by pressing **Enter** while it is highlighted."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Downloading the data\n",
    "\n",
    "The dataset we are going to use is very popular among researchers in Natural Language Processing, usually referred to as the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/). It consists of movie reviews from the website [imdb.com](http://www.imdb.com/), each labeled as either '**pos**itive', if the reviewer enjoyed the film, or '**neg**ative' otherwise.\n",
    "\n",
    "> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.\n",
    "\n",
    "We begin by using some Jupyter Notebook magic to download and extract the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2018-10-12 04:31:50--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n",
      "Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10\n",
      "Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 84125825 (80M) [application/x-gzip]\n",
      "Saving to: ‘../data/aclImdb_v1.tar.gz’\n",
      "\n",
      "../data/aclImdb_v1. 100%[===================>]  80.23M  24.7MB/s    in 4.0s    \n",
      "\n",
      "2018-10-12 04:31:54 (20.3 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%mkdir ../data\n",
    "!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n",
    "!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Preparing the data\n",
    "\n",
    "The data we have downloaded is split into various files, each of which contains a single review. It will be much easier going forward if we combine these individual files into two large files, one for training and one for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import glob\n",
    "\n",
    "def read_imdb_data(data_dir='../data/aclImdb'):\n",
    "    data = {}\n",
    "    labels = {}\n",
    "    \n",
    "    for data_type in ['train', 'test']:\n",
    "        data[data_type] = {}\n",
    "        labels[data_type] = {}\n",
    "        \n",
    "        for sentiment in ['pos', 'neg']:\n",
    "            data[data_type][sentiment] = []\n",
    "            labels[data_type][sentiment] = []\n",
    "            \n",
    "            path = os.path.join(data_dir, data_type, sentiment, '*.txt')\n",
    "            files = glob.glob(path)\n",
    "            \n",
    "            for f in files:\n",
    "                with open(f) as review:\n",
    "                    data[data_type][sentiment].append(review.read())\n",
    "                    # Here we represent a positive review by '1' and a negative review by '0'\n",
    "                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)\n",
    "                    \n",
    "            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \\\n",
    "                    \"{}/{} data size does not match labels size\".format(data_type, sentiment)\n",
    "                \n",
    "    return data, labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg\n"
     ]
    }
   ],
   "source": [
    "data, labels = read_imdb_data()\n",
    "print(\"IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg\".format(\n",
    "            len(data['train']['pos']), len(data['train']['neg']),\n",
    "            len(data['test']['pos']), len(data['test']['neg'])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils import shuffle\n",
    "\n",
    "def prepare_imdb_data(data, labels):\n",
    "    \"\"\"Prepare training and test sets from IMDb movie reviews.\"\"\"\n",
    "    \n",
    "    #Combine positive and negative reviews and labels\n",
    "    data_train = data['train']['pos'] + data['train']['neg']\n",
    "    data_test = data['test']['pos'] + data['test']['neg']\n",
    "    labels_train = labels['train']['pos'] + labels['train']['neg']\n",
    "    labels_test = labels['test']['pos'] + labels['test']['neg']\n",
    "    \n",
    "    #Shuffle reviews and corresponding labels within training and test sets\n",
    "    data_train, labels_train = shuffle(data_train, labels_train)\n",
    "    data_test, labels_test = shuffle(data_test, labels_test)\n",
    "    \n",
    "    # Return a unified training data, test data, training labels, test labets\n",
    "    return data_train, data_test, labels_train, labels_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IMDb reviews (combined): train = 25000, test = 25000\n"
     ]
    }
   ],
   "source": [
    "train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)\n",
    "print(\"IMDb reviews (combined): train = {}, test = {}\".format(len(train_X), len(test_X)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\"The Leap Years\" is a movie adapted from an e-novella by Singapore writer Catherine Lim, which became the first Singapore novel/novella to be sold over the internet. The film had a tortuous post-production schedule: shot in early 2005, it was slated for release at the end of 2005, but only turn up eventually 3 years later, on the 29th February 2008, a leap year.<br /><br />Before I say anything, I must first admit I\\'m no fan of the romance genre, so I may be a little biased against this film - I watched it merely because it was a Singapore production, and that it\\'s available for borrowing at my neighborhood library. Here\\'s my two cents on the movie.<br /><br />Let\\'s just start by saying that other than Qi Yu-wu\\'s KS and Wong Li-Lin, everybody here of note seems to be a Eurasian. The love interest is a Eurasian (Ananda Everingham), and Wong\\'s trio of buddies are all, er-hem, Eurasians. Does this film perpetuate the stereotype that falling in love and associating with Eurasians are more \"in\" than the common Chinese (or whatever Asian race you are?) I don\\'t know, it sure seems that way. Also, everyone in the movie speaks in some mystical \"anglified\" accent which doesn\\'t exist anywhere, certainly not in Singapore. It\\'s the kind of \"semi-perfect English\" that authorities would like us speak, but which doesn\\'t exist anywhere outside, say, the MTV Channel. The effect is that the dialog of the movie sounds forced and stilted, not helped by the lack of true-blue Singaporeans in the cast.<br /><br />The scriptwriter seems to be trying too hard to string one-liners after one-liners. After twenty minutes, the \"wit\" of the movie starts to pall and the film starts serving up its usual plate of clichés. <br /><br />I guess I didn\\'t enjoy the movie because the entire premise of sustaining a love affair over 16 long years seems unbelievable. <br /><br />There are other incredulities in the film. I can\\'t for one believe that KS (played by Qi Yu-wu) would fall for one of Wong\\'s girlfriends. And the scene where the bridegroom says, \"Go, before I change my mind,\" has been used in a hundred East Asian (Korean, Chinese, Hong Kong, Taiwanese etc) TV serials...<br /><br />So 4 stars for this film. The production value is fair, and Wong Li-lin tries her best, but she\\'s not helped by the script. Joan Chen has a 15-minute bit-part in the movie as the older Wong and is perhaps the best actress of the lot, but, hey, her role is just cameo.<br /><br />If you come across \"The Leap Years\" in the rental or library, you may want to pop it in the DVD player for curiosity\\'s sake, but otherwise, for people who don\\'t exactly enjoy the romance genre, you can decide whether or not to give it a miss.'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_X[100]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Processing the data\n",
    "\n",
    "Now that we have our training and testing datasets merged and ready to use, we need to start processing the raw data into something that will be useable by our machine learning algorithm. To begin with, we remove any html formatting that may appear in the reviews and perform some standard natural language processing in order to homogenize the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /home/ec2-user/nltk_data...\n",
      "[nltk_data]   Unzipping corpora/stopwords.zip.\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "nltk.download(\"stopwords\")\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem.porter import *\n",
    "stemmer = PorterStemmer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def review_to_words(review):\n",
    "    text = BeautifulSoup(review, \"html.parser\").get_text() # Remove HTML tags\n",
    "    text = re.sub(r\"[^a-zA-Z0-9]\", \" \", text.lower()) # Convert to lower case\n",
    "    words = text.split() # Split string into words\n",
    "    words = [w for w in words if w not in stopwords.words(\"english\")] # Remove stopwords\n",
    "    words = [PorterStemmer().stem(w) for w in words] # stem\n",
    "    \n",
    "    return words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "\n",
    "cache_dir = os.path.join(\"../cache\", \"sentiment_analysis\")  # where to store cache files\n",
    "os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists\n",
    "\n",
    "def preprocess_data(data_train, data_test, labels_train, labels_test,\n",
    "                    cache_dir=cache_dir, cache_file=\"preprocessed_data.pkl\"):\n",
    "    \"\"\"Convert each review to words; read from cache if available.\"\"\"\n",
    "\n",
    "    # If cache_file is not None, try to read from it first\n",
    "    cache_data = None\n",
    "    if cache_file is not None:\n",
    "        try:\n",
    "            with open(os.path.join(cache_dir, cache_file), \"rb\") as f:\n",
    "                cache_data = pickle.load(f)\n",
    "            print(\"Read preprocessed data from cache file:\", cache_file)\n",
    "        except:\n",
    "            pass  # unable to read from cache, but that's okay\n",
    "    \n",
    "    # If cache is missing, then do the heavy lifting\n",
    "    if cache_data is None:\n",
    "        # Preprocess training and test data to obtain words for each review\n",
    "        #words_train = list(map(review_to_words, data_train))\n",
    "        #words_test = list(map(review_to_words, data_test))\n",
    "        words_train = [review_to_words(review) for review in data_train]\n",
    "        words_test = [review_to_words(review) for review in data_test]\n",
    "        \n",
    "        # Write to cache file for future runs\n",
    "        if cache_file is not None:\n",
    "            cache_data = dict(words_train=words_train, words_test=words_test,\n",
    "                              labels_train=labels_train, labels_test=labels_test)\n",
    "            with open(os.path.join(cache_dir, cache_file), \"wb\") as f:\n",
    "                pickle.dump(cache_data, f)\n",
    "            print(\"Wrote preprocessed data to cache file:\", cache_file)\n",
    "    else:\n",
    "        # Unpack data loaded from cache file\n",
    "        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],\n",
    "                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])\n",
    "    \n",
    "    return words_train, words_test, labels_train, labels_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote preprocessed data to cache file: preprocessed_data.pkl\n"
     ]
    }
   ],
   "source": [
    "# Preprocess data\n",
    "train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract Bag-of-Words features\n",
    "\n",
    "For the model we will be implementing, rather than using the reviews directly, we are going to transform each review into a Bag-of-Words feature representation. Keep in mind that 'in the wild' we will only have access to the training set so our transformer can only use the training set to construct a representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.externals import joblib\n",
    "# joblib is an enhanced version of pickle that is more efficient for storing NumPy arrays\n",
    "\n",
    "def extract_BoW_features(words_train, words_test, vocabulary_size=5000,\n",
    "                         cache_dir=cache_dir, cache_file=\"bow_features.pkl\"):\n",
    "    \"\"\"Extract Bag-of-Words for a given set of documents, already preprocessed into words.\"\"\"\n",
    "    \n",
    "    # If cache_file is not None, try to read from it first\n",
    "    cache_data = None\n",
    "    if cache_file is not None:\n",
    "        try:\n",
    "            with open(os.path.join(cache_dir, cache_file), \"rb\") as f:\n",
    "                cache_data = joblib.load(f)\n",
    "            print(\"Read features from cache file:\", cache_file)\n",
    "        except:\n",
    "            pass  # unable to read from cache, but that's okay\n",
    "    \n",
    "    # If cache is missing, then do the heavy lifting\n",
    "    if cache_data is None:\n",
    "        # Fit a vectorizer to training documents and use it to transform them\n",
    "        # NOTE: Training documents have already been preprocessed and tokenized into words;\n",
    "        #       pass in dummy functions to skip those steps, e.g. preprocessor=lambda x: x\n",
    "        vectorizer = CountVectorizer(max_features=vocabulary_size,\n",
    "                preprocessor=lambda x: x, tokenizer=lambda x: x)  # already preprocessed\n",
    "        features_train = vectorizer.fit_transform(words_train).toarray()\n",
    "\n",
    "        # Apply the same vectorizer to transform the test documents (ignore unknown words)\n",
    "        features_test = vectorizer.transform(words_test).toarray()\n",
    "        \n",
    "        # NOTE: Remember to convert the features using .toarray() for a compact representation\n",
    "        \n",
    "        # Write to cache file for future runs (store vocabulary as well)\n",
    "        if cache_file is not None:\n",
    "            vocabulary = vectorizer.vocabulary_\n",
    "            cache_data = dict(features_train=features_train, features_test=features_test,\n",
    "                             vocabulary=vocabulary)\n",
    "            with open(os.path.join(cache_dir, cache_file), \"wb\") as f:\n",
    "                joblib.dump(cache_data, f)\n",
    "            print(\"Wrote features to cache file:\", cache_file)\n",
    "    else:\n",
    "        # Unpack data loaded from cache file\n",
    "        features_train, features_test, vocabulary = (cache_data['features_train'],\n",
    "                cache_data['features_test'], cache_data['vocabulary'])\n",
    "    \n",
    "    # Return both the extracted features as well as the vocabulary\n",
    "    return features_train, features_test, vocabulary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote features to cache file: bow_features.pkl\n"
     ]
    }
   ],
   "source": [
    "# Extract Bag of Words features for both training and test datasets\n",
    "train_X, test_X, vocabulary = extract_BoW_features(train_X, test_X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Classification using XGBoost\n",
    "\n",
    "Now that we have created the feature representation of our training (and testing) data, it is time to start setting up and using the XGBoost classifier provided by SageMaker.\n",
    "\n",
    "### (TODO) Writing the dataset\n",
    "\n",
    "The XGBoost classifier that we will be using requires the dataset to be written to a file and stored using Amazon S3. To do this, we will start by splitting the training dataset into two parts, the data we will train the model with and a validation set. Then, we will write those datasets to a file and upload the files to S3. In addition, we will write the test set input to a file and upload the file to S3. This is so that we can use SageMakers Batch Transform functionality to test our model once we've fit it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# TODO: Split the train_X and train_y arrays into the DataFrames val_X, train_X and val_y, train_y. Make sure that\n",
    "#       val_X and val_y contain 10 000 entires while train_X and train_y contain the remaining 15 000 entries.\n",
    "val_X = pd.DataFrame(train_X[:10000])\n",
    "train_X = pd.DataFrame(train_X[10000:])\n",
    "\n",
    "val_y = pd.DataFrame(train_y[:10000])\n",
    "train_y = pd.DataFrame(train_y[10000:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The documentation for the XGBoost algorithm in SageMaker requires that the saved datasets should contain no headers or index and that for the training and validation data, the label should occur first for each sample.\n",
    "\n",
    "For more information about this and other algorithms, the SageMaker developer documentation can be found on __[Amazon's website.](https://docs.aws.amazon.com/sagemaker/latest/dg/)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First we make sure that the local directory in which we'd like to store the training and validation csv files exists.\n",
    "data_dir = '../data/xgboost'\n",
    "if not os.path.exists(data_dir):\n",
    "    os.makedirs(data_dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First, save the test data to test.csv in the data_dir directory. Note that we do not save the associated ground truth\n",
    "# labels, instead we will use them later to compare with our model output.\n",
    "\n",
    "pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)\n",
    "\n",
    "# TODO: Save the training and validation data to train.csv and validation.csv in the data_dir directory.\n",
    "#       Make sure that the files you create are in the correct format.\n",
    "pd.concat([val_y, val_X], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)\n",
    "pd.concat([train_y, train_X], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To save a bit of memory we can set text_X, train_X, val_X, train_y and val_y to None.\n",
    "\n",
    "test_X = train_X = val_X = train_y = val_y = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (TODO) Uploading Training / Validation files to S3\n",
    "\n",
    "Amazon's S3 service allows us to store files that can be access by both the built-in training models such as the XGBoost model we will be using as well as custom models such as the one we will see a little later.\n",
    "\n",
    "For this, and most other tasks we will be doing using SageMaker, there are two methods we could use. The first is to use the low level functionality of SageMaker which requires knowing each of the objects involved in the SageMaker environment. The second is to use the high level functionality in which certain choices have been made on the user's behalf. The low level approach benefits from allowing the user a great deal of flexibility while the high level approach makes development much quicker. For our purposes we will opt to use the high level approach although using the low-level approach is certainly an option.\n",
    "\n",
    "Recall the method `upload_data()` which is a member of object representing our current SageMaker session. What this method does is upload the data to the default bucket (which is created if it does not exist) into the path described by the key_prefix variable. To see this for yourself, once you have uploaded the data files, go to the S3 console and look to see where the files have been uploaded.\n",
    "\n",
    "For additional resources, see the __[SageMaker API documentation](http://sagemaker.readthedocs.io/en/latest/)__ and in addition the __[SageMaker Developer Guide.](https://docs.aws.amazon.com/sagemaker/latest/dg/)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-440180731255\n"
     ]
    }
   ],
   "source": [
    "import sagemaker\n",
    "\n",
    "session = sagemaker.Session() # Store the current SageMaker session\n",
    "\n",
    "# S3 prefix (which folder will we use)\n",
    "prefix = 'sentiment-xgboost'\n",
    "\n",
    "# TODO: Upload the test.csv, train.csv and validation.csv files which are contained in data_dir to S3 using sess.upload_data().\n",
    "test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)\n",
    "val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)\n",
    "train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (TODO) Creating the XGBoost model\n",
    "\n",
    "Now that the data has been uploaded it is time to create the XGBoost model. To begin with, we need to do some setup. At this point it is worth discussing what a model is in SageMaker. It is easiest to think of a model of comprising three different objects in the SageMaker ecosystem, which interact with one another.\n",
    "\n",
    "- Model Artifacts\n",
    "- Training Code (Container)\n",
    "- Inference Code (Container)\n",
    "\n",
    "The Model Artifacts are what you might think of as the actual model itself. For example, if you were building a neural network, the model artifacts would be the weights of the various layers. In our case, for an XGBoost model, the artifacts are the actual trees that are created during training.\n",
    "\n",
    "The other two objects, the training code and the inference code are then used the manipulate the training artifacts. More precisely, the training code uses the training data that is provided and creates the model artifacts, while the inference code uses the model artifacts to make predictions on new data.\n",
    "\n",
    "The way that SageMaker runs the training and inference code is by making use of Docker containers. For now, think of a container as being a way of packaging code up so that dependencies aren't an issue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker import get_execution_role\n",
    "\n",
    "# Our current execution role is require when creating the model as the training\n",
    "# and inference code will need to access the model artifacts.\n",
    "role = get_execution_role()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.\n",
    "# As a matter of convenience, the training and inference code both use the same container.\n",
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "\n",
    "container = get_image_uri(session.boto_region_name, 'xgboost')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: Create a SageMaker estimator using the container location determined in the previous cell.\n",
    "#       It is recommended that you use a single training instance of type ml.m4.xlarge. It is also\n",
    "#       recommended that you use 's3://{}/{}/output'.format(session.default_bucket(), prefix) as the\n",
    "#       output path.\n",
    "\n",
    "xgb = None\n",
    "\n",
    "\n",
    "# TODO: Set the XGBoost hyperparameters in the xgb object. Don't forget that in this case we have a binary\n",
    "#       label so we should be using the 'binary:logistic' objective.\n",
    "xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use\n",
    "                                    role,                                    # What is our current IAM Role\n",
    "                                    train_instance_count=1,                  # How many compute instances\n",
    "                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances\n",
    "                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),\n",
    "                                    sagemaker_session=session)\n",
    "\n",
    "xgb.set_hyperparameters(max_depth=5,\n",
    "                        eta=0.2,\n",
    "                        gamma=4,\n",
    "                        min_child_weight=6,\n",
    "                        subsample=0.8,\n",
    "                        silent=0,\n",
    "                        objective='binary:logistic',\n",
    "                        early_stopping_rounds=10,\n",
    "                        num_round=500)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit the XGBoost model\n",
    "\n",
    "Now that our model has been set up we simply need to attach the training and validation datasets and then ask SageMaker to set up the computation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')\n",
    "s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:sagemaker:Creating training-job with name: xgboost-2018-10-12-05-41-22-383\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2018-10-12 05:41:22 Starting - Starting the training job...\n",
      "Launching requested ML instances......\n",
      "Preparing the instances for training.........\n",
      "2018-10-12 05:43:56 Downloading - Downloading input data\n",
      "2018-10-12 05:44:21 Training - Training image download completed. Training in progress..\n",
      "\u001b[31mArguments: train\u001b[0m\n",
      "\u001b[31m[2018-10-12:05:44:23:INFO] Running standalone xgboost training.\u001b[0m\n",
      "\u001b[31m[2018-10-12:05:44:23:INFO] File size need to be processed in the node: 238.47mb. Available memory size in the node: 8566.75mb\u001b[0m\n",
      "\u001b[31m[2018-10-12:05:44:23:INFO] Determined delimiter of CSV input is ','\u001b[0m\n",
      "\u001b[31m[05:44:23] S3DistributionType set as FullyReplicated\u001b[0m\n",
      "\u001b[31m[05:44:26] 15000x5000 matrix with 75000000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,\u001b[0m\n",
      "\u001b[31m[2018-10-12:05:44:26:INFO] Determined delimiter of CSV input is ','\u001b[0m\n",
      "\u001b[31m[05:44:26] S3DistributionType set as FullyReplicated\u001b[0m\n",
      "\u001b[31m[05:44:27] 10000x5000 matrix with 50000000 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,\u001b[0m\n",
      "\u001b[31m[05:44:31] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 38 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[0]#011train-error:0.2964#011validation-error:0.3013\u001b[0m\n",
      "\u001b[31mMultiple eval metrics have been passed: 'validation-error' will be used for early stopping.\n",
      "\u001b[0m\n",
      "\u001b[31mWill train until validation-error hasn't improved in 10 rounds.\u001b[0m\n",
      "\u001b[31m[05:44:32] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 34 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[1]#011train-error:0.281#011validation-error:0.2829\u001b[0m\n",
      "\u001b[31m[05:44:33] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[2]#011train-error:0.273867#011validation-error:0.277\u001b[0m\n",
      "\u001b[31m[05:44:35] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[3]#011train-error:0.269#011validation-error:0.2739\u001b[0m\n",
      "\u001b[31m[05:44:36] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 44 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[4]#011train-error:0.253867#011validation-error:0.2556\u001b[0m\n",
      "\u001b[31m[05:44:37] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[5]#011train-error:0.2554#011validation-error:0.2605\u001b[0m\n",
      "\u001b[31m[05:44:39] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 36 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[6]#011train-error:0.248267#011validation-error:0.2553\u001b[0m\n",
      "\u001b[31m[05:44:40] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[7]#011train-error:0.239933#011validation-error:0.247\u001b[0m\n",
      "\u001b[31m[05:44:41] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 42 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[8]#011train-error:0.226733#011validation-error:0.2323\u001b[0m\n",
      "\u001b[31m[05:44:43] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[9]#011train-error:0.2222#011validation-error:0.2298\u001b[0m\n",
      "\u001b[31m[05:44:44] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 34 extra nodes, 0 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[10]#011train-error:0.218667#011validation-error:0.2275\u001b[0m\n",
      "\u001b[31m[05:44:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[11]#011train-error:0.214333#011validation-error:0.2253\u001b[0m\n",
      "\u001b[31m[05:44:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[12]#011train-error:0.2104#011validation-error:0.2208\u001b[0m\n",
      "\u001b[31m[05:44:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 40 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[13]#011train-error:0.206467#011validation-error:0.2156\u001b[0m\n",
      "\u001b[31m[05:44:49] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[14]#011train-error:0.203267#011validation-error:0.2129\u001b[0m\n",
      "\u001b[31m[05:44:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[15]#011train-error:0.199067#011validation-error:0.2109\u001b[0m\n",
      "\u001b[31m[05:44:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[16]#011train-error:0.197067#011validation-error:0.2093\u001b[0m\n",
      "\u001b[31m[05:44:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[17]#011train-error:0.191867#011validation-error:0.2067\u001b[0m\n",
      "\u001b[31m[05:44:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[18]#011train-error:0.189667#011validation-error:0.205\u001b[0m\n",
      "\u001b[31m[05:44:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[19]#011train-error:0.185867#011validation-error:0.2011\u001b[0m\n",
      "\u001b[31m[05:44:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[20]#011train-error:0.1832#011validation-error:0.1991\u001b[0m\n",
      "\u001b[31m[05:44:59] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 34 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[21]#011train-error:0.181467#011validation-error:0.1969\u001b[0m\n",
      "\u001b[31m[05:45:00] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[22]#011train-error:0.178067#011validation-error:0.1956\u001b[0m\n",
      "\u001b[31m[05:45:01] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[23]#011train-error:0.1758#011validation-error:0.1926\u001b[0m\n",
      "\u001b[31m[05:45:03] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[24]#011train-error:0.174333#011validation-error:0.1907\u001b[0m\n",
      "\u001b[31m[05:45:04] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[25]#011train-error:0.170667#011validation-error:0.1877\u001b[0m\n",
      "\u001b[31m[05:45:05] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[26]#011train-error:0.168#011validation-error:0.1873\u001b[0m\n",
      "\u001b[31m[05:45:07] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[27]#011train-error:0.165867#011validation-error:0.1856\u001b[0m\n",
      "\u001b[31m[05:45:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[28]#011train-error:0.165533#011validation-error:0.1841\u001b[0m\n",
      "\u001b[31m[05:45:09] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[29]#011train-error:0.164733#011validation-error:0.1841\u001b[0m\n",
      "\u001b[31m[05:45:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[30]#011train-error:0.1626#011validation-error:0.1827\u001b[0m\n",
      "\u001b[31m[05:45:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[31]#011train-error:0.161067#011validation-error:0.1826\u001b[0m\n",
      "\u001b[31m[05:45:13] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[32]#011train-error:0.1592#011validation-error:0.1803\u001b[0m\n",
      "\u001b[31m[05:45:15] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 34 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[33]#011train-error:0.158933#011validation-error:0.1802\u001b[0m\n",
      "\u001b[31m[05:45:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[34]#011train-error:0.157667#011validation-error:0.1794\u001b[0m\n",
      "\u001b[31m[05:45:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[35]#011train-error:0.156867#011validation-error:0.1793\u001b[0m\n",
      "\u001b[31m[05:45:19] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 34 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[36]#011train-error:0.153467#011validation-error:0.1787\u001b[0m\n",
      "\u001b[31m[05:45:20] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[37]#011train-error:0.1512#011validation-error:0.1784\u001b[0m\n",
      "\u001b[31m[05:45:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[38]#011train-error:0.151#011validation-error:0.1776\u001b[0m\n",
      "\u001b[31m[05:45:23] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[39]#011train-error:0.1488#011validation-error:0.1764\u001b[0m\n",
      "\u001b[31m[05:45:24] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[40]#011train-error:0.147867#011validation-error:0.1763\u001b[0m\n",
      "\u001b[31m[05:45:26] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[41]#011train-error:0.147467#011validation-error:0.1757\u001b[0m\n",
      "\u001b[31m[05:45:27] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[42]#011train-error:0.146933#011validation-error:0.1738\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31m[05:45:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[43]#011train-error:0.145467#011validation-error:0.1733\u001b[0m\n",
      "\u001b[31m[05:45:30] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[44]#011train-error:0.145333#011validation-error:0.1731\u001b[0m\n",
      "\u001b[31m[05:45:31] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[45]#011train-error:0.143067#011validation-error:0.1726\u001b[0m\n",
      "\u001b[31m[05:45:32] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[46]#011train-error:0.142733#011validation-error:0.1723\u001b[0m\n",
      "\u001b[31m[05:45:34] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[47]#011train-error:0.141933#011validation-error:0.172\u001b[0m\n",
      "\u001b[31m[05:45:35] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[48]#011train-error:0.140933#011validation-error:0.1717\u001b[0m\n",
      "\u001b[31m[05:45:36] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[49]#011train-error:0.1398#011validation-error:0.1689\u001b[0m\n",
      "\u001b[31m[05:45:38] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[50]#011train-error:0.139867#011validation-error:0.1684\u001b[0m\n",
      "\u001b[31m[05:45:39] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[51]#011train-error:0.137867#011validation-error:0.1675\u001b[0m\n",
      "\u001b[31m[05:45:40] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[52]#011train-error:0.1366#011validation-error:0.1671\u001b[0m\n",
      "\u001b[31m[05:45:42] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[53]#011train-error:0.136333#011validation-error:0.1664\u001b[0m\n",
      "\u001b[31m[05:45:43] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[54]#011train-error:0.135667#011validation-error:0.1659\u001b[0m\n",
      "\u001b[31m[05:45:44] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[55]#011train-error:0.135467#011validation-error:0.1661\u001b[0m\n",
      "\u001b[31m[05:45:46] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[56]#011train-error:0.1352#011validation-error:0.1652\u001b[0m\n",
      "\u001b[31m[05:45:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 18 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[57]#011train-error:0.1344#011validation-error:0.1642\u001b[0m\n",
      "\u001b[31m[05:45:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[58]#011train-error:0.134267#011validation-error:0.1644\u001b[0m\n",
      "\u001b[31m[05:45:50] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[59]#011train-error:0.133733#011validation-error:0.1653\u001b[0m\n",
      "\u001b[31m[05:45:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[60]#011train-error:0.132533#011validation-error:0.1639\u001b[0m\n",
      "\u001b[31m[05:45:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[61]#011train-error:0.1324#011validation-error:0.1627\u001b[0m\n",
      "\u001b[31m[05:45:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[62]#011train-error:0.130533#011validation-error:0.1637\u001b[0m\n",
      "\u001b[31m[05:45:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[63]#011train-error:0.129667#011validation-error:0.1624\u001b[0m\n",
      "\u001b[31m[05:45:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[64]#011train-error:0.129067#011validation-error:0.1613\u001b[0m\n",
      "\u001b[31m[05:45:58] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[65]#011train-error:0.127867#011validation-error:0.1608\u001b[0m\n",
      "\u001b[31m[05:45:59] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[66]#011train-error:0.1264#011validation-error:0.1591\u001b[0m\n",
      "\u001b[31m[05:46:00] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[67]#011train-error:0.125667#011validation-error:0.1592\u001b[0m\n",
      "\u001b[31m[05:46:02] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[68]#011train-error:0.124267#011validation-error:0.1598\u001b[0m\n",
      "\u001b[31m[05:46:03] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[69]#011train-error:0.1238#011validation-error:0.1606\u001b[0m\n",
      "\u001b[31m[05:46:04] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[70]#011train-error:0.1236#011validation-error:0.1595\u001b[0m\n",
      "\u001b[31m[05:46:05] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[71]#011train-error:0.122733#011validation-error:0.1591\u001b[0m\n",
      "\u001b[31m[05:46:07] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[72]#011train-error:0.122867#011validation-error:0.1578\u001b[0m\n",
      "\u001b[31m[05:46:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[73]#011train-error:0.122133#011validation-error:0.1586\u001b[0m\n",
      "\u001b[31m[05:46:10] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 18 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[74]#011train-error:0.121067#011validation-error:0.158\u001b[0m\n",
      "\u001b[31m[05:46:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[75]#011train-error:0.120467#011validation-error:0.1588\u001b[0m\n",
      "\u001b[31m[05:46:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[76]#011train-error:0.120667#011validation-error:0.1587\u001b[0m\n",
      "\u001b[31m[05:46:13] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[77]#011train-error:0.120067#011validation-error:0.1582\u001b[0m\n",
      "\u001b[31m[05:46:15] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[78]#011train-error:0.119067#011validation-error:0.1579\u001b[0m\n",
      "\u001b[31m[05:46:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[79]#011train-error:0.118467#011validation-error:0.1572\u001b[0m\n",
      "\u001b[31m[05:46:18] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[80]#011train-error:0.118667#011validation-error:0.1568\u001b[0m\n",
      "\u001b[31m[05:46:19] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[81]#011train-error:0.118733#011validation-error:0.1556\u001b[0m\n",
      "\u001b[31m[05:46:20] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 18 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[82]#011train-error:0.118533#011validation-error:0.1554\u001b[0m\n",
      "\u001b[31m[05:46:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[83]#011train-error:0.117733#011validation-error:0.1551\u001b[0m\n",
      "\u001b[31m[05:46:23] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[84]#011train-error:0.117333#011validation-error:0.1556\u001b[0m\n",
      "\u001b[31m[05:46:24] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[85]#011train-error:0.116267#011validation-error:0.1539\u001b[0m\n",
      "\u001b[31m[05:46:26] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[86]#011train-error:0.1156#011validation-error:0.1542\u001b[0m\n",
      "\u001b[31m[05:46:27] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[87]#011train-error:0.114867#011validation-error:0.1522\u001b[0m\n",
      "\u001b[31m[05:46:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[88]#011train-error:0.113867#011validation-error:0.1528\u001b[0m\n",
      "\u001b[31m[05:46:30] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[89]#011train-error:0.113467#011validation-error:0.1523\u001b[0m\n",
      "\u001b[31m[05:46:31] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[90]#011train-error:0.113#011validation-error:0.1527\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31m[05:46:32] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[91]#011train-error:0.112667#011validation-error:0.1519\u001b[0m\n",
      "\u001b[31m[05:46:34] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[92]#011train-error:0.1126#011validation-error:0.1513\u001b[0m\n",
      "\u001b[31m[05:46:35] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[93]#011train-error:0.112067#011validation-error:0.1516\u001b[0m\n",
      "\u001b[31m[05:46:36] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[94]#011train-error:0.112133#011validation-error:0.1509\u001b[0m\n",
      "\u001b[31m[05:46:37] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[95]#011train-error:0.110933#011validation-error:0.1504\u001b[0m\n",
      "\u001b[31m[05:46:39] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[96]#011train-error:0.110333#011validation-error:0.1504\u001b[0m\n",
      "\u001b[31m[05:46:40] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[97]#011train-error:0.1092#011validation-error:0.1497\u001b[0m\n",
      "\u001b[31m[05:46:41] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[98]#011train-error:0.109267#011validation-error:0.1489\u001b[0m\n",
      "\u001b[31m[05:46:43] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[99]#011train-error:0.109#011validation-error:0.1494\u001b[0m\n",
      "\u001b[31m[05:46:44] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[100]#011train-error:0.1086#011validation-error:0.1488\u001b[0m\n",
      "\u001b[31m[05:46:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[101]#011train-error:0.108067#011validation-error:0.1493\u001b[0m\n",
      "\u001b[31m[05:46:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[102]#011train-error:0.107267#011validation-error:0.1496\u001b[0m\n",
      "\u001b[31m[05:46:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[103]#011train-error:0.107067#011validation-error:0.1486\u001b[0m\n",
      "\u001b[31m[05:46:49] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[104]#011train-error:0.1062#011validation-error:0.1486\u001b[0m\n",
      "\u001b[31m[05:46:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[105]#011train-error:0.106067#011validation-error:0.148\u001b[0m\n",
      "\u001b[31m[05:46:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[106]#011train-error:0.1052#011validation-error:0.1481\u001b[0m\n",
      "\u001b[31m[05:46:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[107]#011train-error:0.1054#011validation-error:0.148\u001b[0m\n",
      "\u001b[31m[05:46:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[108]#011train-error:0.105467#011validation-error:0.1481\u001b[0m\n",
      "\u001b[31m[05:46:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[109]#011train-error:0.1044#011validation-error:0.1481\u001b[0m\n",
      "\u001b[31m[05:46:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[110]#011train-error:0.1044#011validation-error:0.1482\u001b[0m\n",
      "\u001b[31m[05:46:59] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[111]#011train-error:0.103933#011validation-error:0.1479\u001b[0m\n",
      "\u001b[31m[05:47:00] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[112]#011train-error:0.1042#011validation-error:0.1477\u001b[0m\n",
      "\u001b[31m[05:47:01] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[113]#011train-error:0.103467#011validation-error:0.1479\u001b[0m\n",
      "\u001b[31m[05:47:03] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[114]#011train-error:0.103#011validation-error:0.1474\u001b[0m\n",
      "\u001b[31m[05:47:04] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[115]#011train-error:0.102933#011validation-error:0.1475\u001b[0m\n",
      "\u001b[31m[05:47:05] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[116]#011train-error:0.1026#011validation-error:0.147\u001b[0m\n",
      "\u001b[31m[05:47:07] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[117]#011train-error:0.102133#011validation-error:0.1471\u001b[0m\n",
      "\u001b[31m[05:47:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[118]#011train-error:0.102133#011validation-error:0.1467\u001b[0m\n",
      "\u001b[31m[05:47:09] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 22 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[119]#011train-error:0.101267#011validation-error:0.1468\u001b[0m\n",
      "\u001b[31m[05:47:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[120]#011train-error:0.1004#011validation-error:0.1467\u001b[0m\n",
      "\u001b[31m[05:47:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[121]#011train-error:0.100067#011validation-error:0.1466\u001b[0m\n",
      "\u001b[31m[05:47:13] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[122]#011train-error:0.100133#011validation-error:0.1468\u001b[0m\n",
      "\u001b[31m[05:47:15] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[123]#011train-error:0.0994#011validation-error:0.1478\u001b[0m\n",
      "\u001b[31m[05:47:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[124]#011train-error:0.099267#011validation-error:0.1474\u001b[0m\n",
      "\u001b[31m[05:47:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[125]#011train-error:0.098333#011validation-error:0.1474\u001b[0m\n",
      "\u001b[31m[05:47:19] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[126]#011train-error:0.098133#011validation-error:0.1468\u001b[0m\n",
      "\u001b[31m[05:47:20] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[127]#011train-error:0.0978#011validation-error:0.147\u001b[0m\n",
      "\u001b[31m[05:47:21] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[128]#011train-error:0.097467#011validation-error:0.147\u001b[0m\n",
      "\u001b[31m[05:47:23] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[129]#011train-error:0.0974#011validation-error:0.1466\u001b[0m\n",
      "\u001b[31m[05:47:24] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[130]#011train-error:0.0966#011validation-error:0.1452\u001b[0m\n",
      "\u001b[31m[05:47:25] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[131]#011train-error:0.096#011validation-error:0.1453\u001b[0m\n",
      "\u001b[31m[05:47:27] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[132]#011train-error:0.096133#011validation-error:0.1459\u001b[0m\n",
      "\u001b[31m[05:47:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[133]#011train-error:0.096#011validation-error:0.1455\u001b[0m\n",
      "\u001b[31m[05:47:29] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[134]#011train-error:0.0958#011validation-error:0.145\u001b[0m\n",
      "\u001b[31m[05:47:31] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[135]#011train-error:0.0954#011validation-error:0.1438\u001b[0m\n",
      "\u001b[31m[05:47:32] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[136]#011train-error:0.095533#011validation-error:0.144\u001b[0m\n",
      "\u001b[31m[05:47:33] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[137]#011train-error:0.095267#011validation-error:0.1443\u001b[0m\n",
      "\u001b[31m[05:47:35] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[138]#011train-error:0.094733#011validation-error:0.1436\u001b[0m\n",
      "\u001b[31m[05:47:36] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[139]#011train-error:0.094533#011validation-error:0.144\u001b[0m\n",
      "\u001b[31m[05:47:37] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[140]#011train-error:0.093733#011validation-error:0.1433\u001b[0m\n",
      "\u001b[31m[05:47:39] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[141]#011train-error:0.0934#011validation-error:0.1433\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[31m[05:47:40] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[142]#011train-error:0.092867#011validation-error:0.1437\u001b[0m\n",
      "\u001b[31m[05:47:41] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[143]#011train-error:0.093067#011validation-error:0.1435\u001b[0m\n",
      "\u001b[31m[05:47:43] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[144]#011train-error:0.091067#011validation-error:0.1427\u001b[0m\n",
      "\u001b[31m[05:47:44] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 18 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[145]#011train-error:0.090467#011validation-error:0.143\u001b[0m\n",
      "\u001b[31m[05:47:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[146]#011train-error:0.091267#011validation-error:0.1425\u001b[0m\n",
      "\u001b[31m[05:47:47] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[147]#011train-error:0.0914#011validation-error:0.1423\u001b[0m\n",
      "\u001b[31m[05:47:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[148]#011train-error:0.0912#011validation-error:0.1424\u001b[0m\n",
      "\u001b[31m[05:47:49] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[149]#011train-error:0.0908#011validation-error:0.1424\u001b[0m\n",
      "\u001b[31m[05:47:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[150]#011train-error:0.090667#011validation-error:0.142\u001b[0m\n",
      "\u001b[31m[05:47:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[151]#011train-error:0.090533#011validation-error:0.1417\u001b[0m\n",
      "\u001b[31m[05:47:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[152]#011train-error:0.090333#011validation-error:0.1412\u001b[0m\n",
      "\u001b[31m[05:47:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[153]#011train-error:0.09#011validation-error:0.1413\u001b[0m\n",
      "\u001b[31m[05:47:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[154]#011train-error:0.089467#011validation-error:0.1413\u001b[0m\n",
      "\u001b[31m[05:47:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[155]#011train-error:0.089067#011validation-error:0.1411\u001b[0m\n",
      "\u001b[31m[05:47:59] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[156]#011train-error:0.088667#011validation-error:0.1412\u001b[0m\n",
      "\u001b[31m[05:48:00] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[157]#011train-error:0.088533#011validation-error:0.1416\u001b[0m\n",
      "\u001b[31m[05:48:01] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 14 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[158]#011train-error:0.088133#011validation-error:0.1413\u001b[0m\n",
      "\u001b[31m[05:48:03] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[159]#011train-error:0.0876#011validation-error:0.1414\u001b[0m\n",
      "\u001b[31m[05:48:04] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[160]#011train-error:0.086933#011validation-error:0.1411\u001b[0m\n",
      "\u001b[31m[05:48:05] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[161]#011train-error:0.086867#011validation-error:0.1416\u001b[0m\n",
      "\u001b[31m[05:48:07] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[162]#011train-error:0.087067#011validation-error:0.1411\u001b[0m\n",
      "\u001b[31m[05:48:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 0 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[163]#011train-error:0.086467#011validation-error:0.1406\u001b[0m\n",
      "\u001b[31m[05:48:09] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[164]#011train-error:0.086133#011validation-error:0.1402\u001b[0m\n",
      "\u001b[31m[05:48:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[165]#011train-error:0.086#011validation-error:0.1399\u001b[0m\n",
      "\u001b[31m[05:48:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[166]#011train-error:0.086#011validation-error:0.1403\u001b[0m\n",
      "\u001b[31m[05:48:13] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 16 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[167]#011train-error:0.085267#011validation-error:0.1402\u001b[0m\n",
      "\u001b[31m[05:48:15] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[168]#011train-error:0.085467#011validation-error:0.1404\u001b[0m\n",
      "\u001b[31m[05:48:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[169]#011train-error:0.085#011validation-error:0.1408\u001b[0m\n",
      "\u001b[31m[05:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 12 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[170]#011train-error:0.084733#011validation-error:0.141\u001b[0m\n",
      "\u001b[31m[05:48:19] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 6 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[171]#011train-error:0.083867#011validation-error:0.141\u001b[0m\n",
      "\u001b[31m[05:48:20] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 4 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[172]#011train-error:0.083867#011validation-error:0.1409\u001b[0m\n",
      "\u001b[31m[05:48:21] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 8 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[173]#011train-error:0.083#011validation-error:0.1403\u001b[0m\n",
      "\u001b[31m[05:48:23] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 2 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[174]#011train-error:0.082533#011validation-error:0.1403\u001b[0m\n",
      "\u001b[31m[05:48:24] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 10 pruned nodes, max_depth=5\u001b[0m\n",
      "\u001b[31m[175]#011train-error:0.082667#011validation-error:0.1401\u001b[0m\n",
      "\u001b[31mStopping. Best iteration:\u001b[0m\n",
      "\u001b[31m[165]#011train-error:0.086#011validation-error:0.1399\n",
      "\u001b[0m\n",
      "\n",
      "2018-10-12 05:49:28 Uploading - Uploading generated training model\n",
      "2018-10-12 05:49:34 Completed - Training job completed\n",
      "Billable seconds: 338\n"
     ]
    }
   ],
   "source": [
    "xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (TODO) Testing the model\n",
    "\n",
    "Now that we've fit our XGBoost model, it's time to see how well it performs. To do this we will use SageMakers Batch Transform functionality. Batch Transform is a convenient way to perform inference on a large dataset in a way that is not realtime. That is, we don't necessarily need to use our model's results immediately and instead we can peform inference on a large number of samples. An example of this in industry might be peforming an end of month report. This method of inference can also be useful to us as it means to can perform inference on our entire test set. \n",
    "\n",
    "To perform a Batch Transformation we need to first create a transformer objects from our trained estimator object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:sagemaker:Creating model with name: xgboost-2018-10-12-05-41-22-383\n"
     ]
    }
   ],
   "source": [
    "# TODO: Create a transformer object from the trained model. Using an instance count of 1 and an instance type of ml.m4.xlarge\n",
    "#       should be more than enough.\n",
    "xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we actually perform the transform job. When doing so we need to make sure to specify the type of data we are sending so that it is serialized correctly in the background. In our case we are providing our model with csv data so we specify `text/csv`. Also, if the test data that we have provided is too large to process all at once then we need to specify how the data file should be split up. Since each line is a single entry in our data set we tell SageMaker that it can split the input on each line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:sagemaker:Creating transform job with name: xgboost-2018-10-12-05-50-53-078\n"
     ]
    }
   ],
   "source": [
    "# TODO: Start the transform job. Make sure to specify the content type and the split type of the test data.\n",
    "xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently the transform job is running but it is doing so in the background. Since we wish to wait until the transform job is done and we would like a bit of feedback we can run the `wait()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "........................................!\n"
     ]
    }
   ],
   "source": [
    "xgb_transformer.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the transform job has executed and the result, the estimated sentiment of each review, has been saved on S3. Since we would rather work on this file locally we can perform a bit of notebook magic to copy the file to the `data_dir`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Completed 256.0 KiB/370.3 KiB (1.2 MiB/s) with 1 file(s) remaining\r",
      "Completed 370.3 KiB/370.3 KiB (1.7 MiB/s) with 1 file(s) remaining\r",
      "download: s3://sagemaker-us-east-1-440180731255/xgboost-2018-10-12-05-50-53-078/test.csv.out to ../data/xgboost/test.csv.out\r\n"
     ]
    }
   ],
   "source": [
    "!aws s3 cp --recursive $xgb_transformer.output_path $data_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last step is now to read in the output from our model, convert the output to something a little more usable, in this case we want the sentiment to be either `1` (positive) or `0` (negative), and then compare to the ground truth labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)\n",
    "predictions = [round(num) for num in predictions.squeeze().values]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.86072"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "accuracy_score(test_y, predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optional: Clean up\n",
    "\n",
    "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First we will remove all of the files contained in the data_dir directory\n",
    "!rm $data_dir/*\n",
    "\n",
    "# And then we delete the directory itself\n",
    "!rmdir $data_dir\n",
    "\n",
    "# Similarly we will remove the files in the cache_dir directory and the directory itself\n",
    "!rm $cache_dir/*\n",
    "!rmdir $cache_dir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}