{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Distributed Q-Learning and SARSA\n",
    "\n",
    "The goal of this assignment is to implement both single-core and distributed versions of two temporal difference (TD) reinforcement learning algorithms, Q-Learning and SARSA. In particuar, Q-Learning and SARSA will be run in a Markov Decision Process environment in order to compute policies that optimize expected infinite horizon discounted cummulative reward.\n",
    "\n",
    "The relevant content about Q-Learning and SARSA are in the following course notes from CS533.\n",
    "\n",
    "https://oregonstate.instructure.com/courses/1719746/files/74944693/download?wrap=1\n",
    "\n",
    "\n",
    "## Recap of Q-Learning and SARSA \n",
    "\n",
    "Q-Learning and SARSA appear to be extremely similar algorithms at the code level. In particular, their update formulas are as follows:\n",
    "\n",
    "\\begin{equation}\n",
    "\\mbox{SARSA:  } Q(s,a) \\leftarrow Q(s,a) + \\alpha\\cdot \\left(r + \\beta Q(s',a') - Q(s,a)\\right) \\\\\n",
    "\\mbox{Q-Learning:  } Q(s,a) \\leftarrow Q(s,a) + \\alpha\\cdot \\left(r + \\beta \\max_a' Q(s',a') - Q(s,a)\\right) \n",
    "\\end{equation}\n",
    "where $s'$ and $r$ are the next state and reward after taking action $a$ in state $s$. Further, for SARSA, $a'$ is the action that was taken by the agent in $s'$. \n",
    "\n",
    "While these updates a are similar, the algorithms are different in a fundamental way. SARSA is an on-policy algorithm, which aims to learn the value of the policy that is being used to collect data (we call this the behavior policy). In the case of policy optimization, this behavior policy is an exploration-exploitation policy. This is why $a'$ in the SARSA algorithm is based on the action actually taken by the behavior policy. \n",
    "\n",
    "Rather, Q-Learning is an off-policy algorithm, which can learn from any set of experience tuples, regardless of how they were collected. In particular, notice that Q-Learning is not sensitive to what action (if any) was taken in $s'$, while SARSA must be given the action that was actually taken in $s'$. Even if the exploration policy were completely random, Q-Learning will converge in the limit to an optimal policy under standard assumptions on the learning rate decay. \n",
    "\n",
    "The off-policy versus on-policy difference can lead the algorithms to sometimes learn different policies in practice. The assignment below will give you the opportunity to implement both SARSA and Q-Learning and to observe and explain such differences. After first implementing a single-core version of the algorithms, you will then implement a distributed version of Q-Learning. Note that the off-policy nature of Q-Learning makes it more directly compatible with distributed implementations compared to SARSA, since the experience need not come from a particular behavior policy. \n",
    "\n",
    "To complete the exercise, you need to finish all parts of this notebook. You also need to turn in a report---the notebook exercises will indicate what should be included in the report.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You will need to uncomment the following pip commands if the libraries need to be installed. \n",
    "# You may get some errors related to readchar, but they should not break the project.\n",
    "\n",
    "#!pip3 install --user numpy\n",
    "#!pip3 install --user gym\n",
    "#!pip3 install --user tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import ray\n",
    "import time\n",
    "import math\n",
    "from copy import deepcopy\n",
    "import matplotlib.pyplot as plt\n",
    "from random import randint, choice, uniform\n",
    "import random\n",
    "%matplotlib inline\n",
    "import pickle\n",
    "import numpy as np\n",
    "import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "## FrozenLake  \n",
    "<span style=\"color:red\">(The game code and transition probability is different from assignment 2)</span>  \n",
    "We will use the FrozenLake environment as the MDP environment for this experiment. This is a type of gridworld environment, whose size (number of states) can be controlled by adjusting the grid dimensions. The environment is intended to model the process of navigating a frozen lake, while avoiding falling into holes with the objective of reaching a goal location. \n",
    "\n",
    "The environment is defined as follows:\n",
    "\n",
    "- The environment is a rectangular grid of states/cells. There are four different types of cells as indicated by the following cell labels: \n",
    "\n",
    "    - S labels the starting/initial cell, always in the top left corner\n",
    "    \n",
    "    - F labels frozen cells that are safe to step on\n",
    "\n",
    "    - H labels holes and if the agent enters a hole cell there is a pentalty of -1000 and the episode ends\n",
    "\n",
    "    - G labels the goal cell and when reached gives a reward of 1000 and the episode ends\n",
    "\n",
    "- There are four possible actions (Left, Right, Down, Up). \n",
    "\n",
    "- The transition function moves the agent in the expected direction with 0.97 probability, and there is a 0.03 probability of transitioning to one of the other randomly selected directions. \n",
    "\n",
    "- There is a reward of -1 for each action taken by the agent, which is intended to encourage the agent to reach the goal as fast as possible. \n",
    "\n",
    "- Episodes end whenever the agent falls in a hole or reaches the goal. An end-of-episode is modeled by transitioning to a zero-reward terminal state (all actions lead to that state). \n",
    "\n",
    "The version we are using for assignment 3 is a modified version of the environment at the following location.   \n",
    "  \n",
    "Source: https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py  \n",
    "\n",
    "Execute the following cell to initialize the MDP environments. (You do not need to change the code in this part.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from contextlib import closing\n",
    "\n",
    "from six import StringIO, b\n",
    "\n",
    "from gym import utils\n",
    "from gym.envs.toy_text import discrete\n",
    "\n",
    "LEFT = 0\n",
    "DOWN = 1\n",
    "RIGHT = 2\n",
    "UP = 3\n",
    "np.set_printoptions(threshold=sys.maxsize, linewidth=sys.maxsize, precision = 2)\n",
    "#TransitionProb = [0.7, 0.1, 0.1, 0.1]\n",
    "#TransitionProb = [1, 0, 0, 0]\n",
    "TransitionProb = [0.97, 0.01, 0.01, 0.01]\n",
    "def generate_row(length, h_prob):\n",
    "    row = np.random.choice(2, length, p=[1.0 - h_prob, h_prob])\n",
    "    row = ''.join(list(map(lambda z: 'F' if z == 0 else 'H', row)))\n",
    "    return row\n",
    "\n",
    "\n",
    "def generate_map(shape):\n",
    "    \"\"\"\n",
    "\n",
    "    :param shape: Width x Height\n",
    "    :return: List of text based map\n",
    "    \"\"\"\n",
    "    h_prob = 0.1\n",
    "    grid_map = []\n",
    "\n",
    "    for h in range(shape[1]):\n",
    "\n",
    "        if h == 0:\n",
    "            row = 'SF'\n",
    "            row += generate_row(shape[0] - 2, h_prob)\n",
    "        elif h == 1:\n",
    "            row = 'FF'\n",
    "            row += generate_row(shape[0] - 2, h_prob)\n",
    "\n",
    "        elif h == shape[1] - 1:\n",
    "            row = generate_row(shape[0] - 2, h_prob)\n",
    "            row += 'FG'\n",
    "        elif h == shape[1] - 2:\n",
    "            row = generate_row(shape[0] - 2, h_prob)\n",
    "            row += 'FF'\n",
    "        else:\n",
    "            row = generate_row(shape[0], h_prob)\n",
    "\n",
    "        grid_map.append(row)\n",
    "        del row\n",
    "\n",
    "    return grid_map\n",
    "\n",
    "MAPS = {\n",
    "    \n",
    "    \"4x4\": [\n",
    "        \"SFFF\",\n",
    "        \"FHFH\",\n",
    "        \"FFFF\",\n",
    "        \"HFFG\"\n",
    "    ],\n",
    "    \"8x8\": [\n",
    "        \"SFFFFFFF\",\n",
    "        \"FFFFFFFF\",\n",
    "        \"FFFHFFFF\",\n",
    "        \"FFFFFHFF\",\n",
    "        \"FFFHFFFF\",\n",
    "        \"FHHFFFHF\",\n",
    "        \"FHFFHFHF\",\n",
    "        \"FFFHFFFG\"\n",
    "    ],\n",
    "    \"Dangerous Hallway\": [\n",
    "        \"SFFFFFFF\",\n",
    "        \"FFFFFFFF\",\n",
    "        \"HFHHHFFF\",\n",
    "        \"HFHHHFFF\",\n",
    "        \"HFHHHFFF\",\n",
    "        \"HFHHHFFF\",\n",
    "        \"HFFFFFFF\",\n",
    "        \"FGFFFFFF\"\n",
    "    ],\n",
    "    \"16x16\": [\n",
    "        \"SFFFFFFFFHFFFFHF\",\n",
    "        \"FFFFFFFFFFFFFHFF\",\n",
    "        \"FFFHFFFFHFFFFFFF\",\n",
    "        \"FFFFFFFFHFFFFFFF\",\n",
    "        \"FFFFFFFFFFFFFFFF\",\n",
    "        \"FFHHFFFFFFFHFFFH\",\n",
    "        \"FFFFFFFFFFFFFFFF\",\n",
    "        \"FFFFFHFFFFFFHFFF\",\n",
    "        \"FFFFFHFFFFFFFFFH\",\n",
    "        \"FFFFFFFHFFFFFFFF\",\n",
    "        \"FFFFFFFFFFFFHFFF\",\n",
    "        \"FFFFFFHFFFFFFFFF\",\n",
    "        \"FFFFFFFFHFFFFFFF\",\n",
    "        \"FFFFFFFFFHFFFFHF\",\n",
    "        \"FFFFFFFFFFHFFFFF\",\n",
    "        \"FFFHFFFFFFFFFFFG\",\n",
    "    ],\n",
    "    \n",
    "    \"32x32\": [\n",
    "        'SFFFFFFFFFFFFFFFFFFFFFFFFFHFFFFF',\n",
    "        'FFFFFFFFHFFFFFFFFFFFFFFFFFHFFFFF',\n",
    "        'FFFHFFFFFFFFHFFHFFFFFFFFFFFFFFFF',\n",
    "        'FFFFFFFFFFFFFFHFHHFHFHFFFFFHFFFH',\n",
    "        'FFFFHFFFFFFFFFFFFFFFHFHFFFFFFFHF',\n",
    "        'FFFFFHFFFFFFFFFFHFFFFFFFFFFHFFFF',\n",
    "        'FFHHFFFFHFFFFFFFFFFFFFFFFFFFFFFF',\n",
    "        'FFFHFFFFFFFFFFHFFFHFHFFFFFFFFHFF',\n",
    "        'FFFFHFFFFFFHFFFFHFHFFFFFFFFFFFFH',\n",
    "        'FFFFHHFHFFFFHFFFFFFFFFFFFFFFFFFF',\n",
    "        'FHFFFFFFFFFFHFFFFFFFFFFFHHFFFHFH',\n",
    "        'FFFHFFFHFFFFFFFFFFFFFFFFFFFFHFFF',\n",
    "        'FFFHFHFFFFFFFFHFFFFFFFFFFFFHFFHF',\n",
    "        'FFFFFFFFFFFFFFFFHFFFFFFFHFFFFFFF',\n",
    "        'FFFFFFHFFFFFFFFHHFFFFFFFHFFFFFFF',\n",
    "        'FFHFFFFFFFFFHFFFFFFFFFFHFFFFFFFF',\n",
    "        'FFFHFFFFFFFFFHFFFFHFFFFFFHFFFFFF',\n",
    "        'FFFFFFFFFFFFFFFFFFFFFFFFFFHFFFFF',\n",
    "        'FFFFFFFFHFFFFFFFHFFFFFFFFFFFFFFH',\n",
    "        'FFHFFFFFFFFFFFFFFFHFFFFFFFFFFFFF',\n",
    "        'FFFFFFFHFFFFFFFFFFFFFFFFFFFFFFFF',\n",
    "        'FFFFFFFFFFFFFFFHFFFFHFFFFFFFHFFF',\n",
    "        'FFHFFFFHFFFFFFFFFHFFFFFFFFFFFHFH',\n",
    "        'FFFFFFFFFFHFFFFHFFFFFFFFFFFFFFFF',\n",
    "        'FFFFFFFFFFFFFFFFFHHFFHHHFFFHFFFF',\n",
    "        'FFFFFFFFFFFFFFHFFFFHFFFFFFFHFFFF',\n",
    "        'FFFFFFFHFFFFFFFFFFFFFFFFFFFFFFFF',\n",
    "        'FFFFFHFFFFFFFFFFFFFFFFHFFHFFFFFF',\n",
    "        'FFFFFFFHFFFFFFFFFHFFFFFFFFFFFFFF',\n",
    "        'FFFFFFFFFFFFFFFFFFFFFFFFHFFFFFFF',\n",
    "        'FFFFFFFFFFFFFFFFFFFFFFFFHFFFFFFF',\n",
    "        'FFFFFFFFFFFFFFFHFFFFFFFFHFFFFFFG',\n",
    "    ]\n",
    "}\n",
    "\n",
    "\n",
    "def generate_random_map(size=8, p=0.8):\n",
    "    \"\"\"Generates a random valid map (one that has a path from start to goal)\n",
    "    :param size: size of each side of the grid\n",
    "    :param p: probability that a tile is frozen\n",
    "    \"\"\"\n",
    "    valid = False\n",
    "\n",
    "    # BFS to check that it's a valid path.\n",
    "    def is_valid(arr, r=0, c=0):\n",
    "        if arr[r][c] == 'G':\n",
    "            return True\n",
    "\n",
    "        tmp = arr[r][c]\n",
    "        arr[r][c] = \"#\"\n",
    "\n",
    "        # Recursively check in all four directions.\n",
    "        directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]\n",
    "        for x, y in directions:\n",
    "            r_new = r + x\n",
    "            c_new = c + y\n",
    "            if r_new < 0 or r_new >= size or c_new < 0 or c_new >= size:\n",
    "                continue\n",
    "\n",
    "            if arr[r_new][c_new] not in '#H':\n",
    "                if is_valid(arr, r_new, c_new):\n",
    "                    arr[r][c] = tmp\n",
    "                    return True\n",
    "\n",
    "        arr[r][c] = tmp\n",
    "        return False\n",
    "\n",
    "    while not valid:\n",
    "        p = min(1, p)\n",
    "        res = np.random.choice(['F', 'H'], (size, size), p=[p, 1-p])\n",
    "        res[0][0] = 'S'\n",
    "        res[-1][-1] = 'G'\n",
    "        valid = is_valid(res)\n",
    "    return [\"\".join(x) for x in res]\n",
    "\n",
    "\n",
    "class FrozenLakeEnv(discrete.DiscreteEnv):\n",
    "    \"\"\"\n",
    "    Winter is here. You and your friends were tossing around a frisbee at the park\n",
    "    when you made a wild throw that left the frisbee out in the middle of the lake.\n",
    "    The water is mostly frozen, but there are a few holes where the ice has melted.\n",
    "    If you step into one of those holes, you'll fall into the freezing water.\n",
    "    At this time, there's an international frisbee shortage, so it's absolutely imperative that\n",
    "    you navigate across the lake and retrieve the disc.\n",
    "    However, the ice is slippery, so you won't always move in the direction you intend.\n",
    "    The surface is described using a grid like the following\n",
    "\n",
    "        SFFF\n",
    "        FHFH\n",
    "        FFFH\n",
    "        HFFG\n",
    "\n",
    "    S : starting point, safe\n",
    "    F : frozen surface, safe\n",
    "    H : hole, fall to your doom\n",
    "    G : goal, where the frisbee is located\n",
    "\n",
    "    The episode ends when you reach the goal or fall in a hole.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    metadata = {'render.modes': ['human', 'ansi']}\n",
    "\n",
    "    def __init__(self, desc=None, map_name=\"4x4\"):\n",
    "        if desc is None and map_name is None:\n",
    "            desc = generate_random_map()\n",
    "        elif desc is None:\n",
    "            desc = MAPS[map_name]\n",
    "        self.desc = desc = np.asarray(desc,dtype='c')\n",
    "        self.nrow, self.ncol = nrow, ncol = desc.shape\n",
    "        self.reward_range = (0, 1)\n",
    "\n",
    "        nA = 4\n",
    "        nS = nrow * ncol\n",
    "        self.nS = nS\n",
    "        \n",
    "        isd = np.array(desc == b'S').astype('float64').ravel()\n",
    "        isd /= isd.sum()\n",
    "\n",
    "        rew_hole = -1000\n",
    "        rew_goal = 1000\n",
    "        rew_step = -1\n",
    "        \n",
    "        exit = nrow * ncol\n",
    "        P = {s : {a : [] for a in range(nA)} for s in range(nS + 1)}\n",
    "        \n",
    "        def to_s(row, col):\n",
    "            return row*ncol + col\n",
    "        \n",
    "        def inc(row, col, a):\n",
    "            if a == LEFT:\n",
    "                col = max(col-1,0)\n",
    "            elif a == DOWN:\n",
    "                row = min(row+1,nrow-1)\n",
    "            elif a == RIGHT:\n",
    "                col = min(col+1,ncol-1)\n",
    "            elif a == UP:\n",
    "                row = max(row-1,0)\n",
    "            return (row, col)\n",
    "\n",
    "        for row in range(nrow):\n",
    "            for col in range(ncol):\n",
    "                s = to_s(row, col)\n",
    "                for a in range(4):\n",
    "                    li = P[s][a]\n",
    "                    letter = desc[row, col]\n",
    "                    if letter in b'H':\n",
    "                        li.append((1.0, exit, -1000, True))\n",
    "                    elif letter in b'G':\n",
    "                        li.append((1.0, exit, 1000, True))\n",
    "                    else:\n",
    "                        for b, p in zip([a, (a+1)%4, (a+2)%4, (a+3)%4], TransitionProb):\n",
    "                            newrow, newcol = inc(row, col, b)\n",
    "                            newstate = to_s(newrow, newcol)\n",
    "                            newletter = desc[newrow, newcol]\n",
    "                            rew = rew_step\n",
    "                            li.append((p, newstate, rew, False))\n",
    "\n",
    "        super(FrozenLakeEnv, self).__init__(nS, nA, P, isd)\n",
    "\n",
    "    def render(self, mode='human'):\n",
    "        outfile = StringIO() if mode == 'ansi' else sys.stdout\n",
    "\n",
    "        row, col = self.s // self.ncol, self.s % self.ncol\n",
    "        desc = self.desc.tolist()\n",
    "        desc = [[c.decode('utf-8') for c in line] for line in desc]\n",
    "        if self.s < self.nS:\n",
    "            desc[row][col] = utils.colorize(desc[row][col], \"red\", highlight=True)\n",
    "        else:\n",
    "            outfile.write(\"exit\\n\")\n",
    "        if self.lastaction is not None:\n",
    "            outfile.write(\"  ({})\\n\".format([\"Left\",\"Down\",\"Right\",\"Up\"][self.lastaction]))\n",
    "        else:\n",
    "            outfile.write(\"\\n\")\n",
    "        outfile.write(\"\\n\".join(''.join(line) for line in desc)+\"\\n\")\n",
    "\n",
    "        if mode != 'human':\n",
    "            with closing(outfile):\n",
    "                return outfile.getvalue()\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "***\n",
    "## Initializations\n",
    "\n",
    "Run the following cell to initilize maps of different sizes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#map_4 = (MAPS[\"4x4\"], 4)\n",
    "#map_8 = (MAPS[\"8x8\"], 8)\n",
    "#map_32 = (MAPS[\"32x32\"], 32)\n",
    "#map_50 = (generate_map((50,50)), 50)\n",
    "#map_110 = (generate_map((110,110)), 110)\n",
    "\n",
    "map_DH = (MAPS[\"Dangerous Hallway\"], 8)\n",
    "map_16 = (MAPS[\"16x16\"], 16)\n",
    "\n",
    "MAP = map_DH\n",
    "map_size = MAP[1]\n",
    "run_time = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Helper Function\n",
    "The following will be used to plot the performance of learning and display the learned value function. Note that both SARSA and Q-Learning produce Q-function tables, which store a value for each state-action pair. Given a Q-function $Q(s,a)$, the corresponding value function is given by $V(s)=\\max_a Q(s,a)$. This relation is used to visualize the value function learned for the environment. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_result(total_rewards, learning_num, legend):\n",
    "    print(\"\\nLearning Performance:\\n\")\n",
    "    episodes = []\n",
    "    for i in range(len(total_rewards)):\n",
    "        episodes.append(i * learning_num + 1)\n",
    "        \n",
    "    plt.figure(num = 1)\n",
    "    fig, ax = plt.subplots()\n",
    "    plt.plot(episodes, total_rewards)\n",
    "    plt.title('performance')\n",
    "    plt.legend(legend)\n",
    "    plt.xlabel(\"Episodes\")\n",
    "    plt.ylabel(\"total rewards\")\n",
    "    plt.show()\n",
    "    \n",
    "def plot_image(q_table, MAP):\n",
    "    \n",
    "    best_value = np.max(q_table, axis = 1)[:-1].reshape((map_size,map_size))\n",
    "    best_policy = np.argmax(q_table, axis = 1)[:-1].reshape((map_size,map_size))\n",
    "    \n",
    "    print(\"\\n\\nBest Q-value and Policy:\\n\")\n",
    "    fig, ax = plt.subplots()\n",
    "    im = ax.imshow(best_value)\n",
    "\n",
    "    for i in range(best_value.shape[0]):\n",
    "        for j in range(best_value.shape[1]):\n",
    "            if MAP[i][j] in 'GH':\n",
    "                arrow = MAP[i][j]\n",
    "            elif best_policy[i, j] == 0:\n",
    "                arrow = '<'\n",
    "            elif best_policy[i, j] == 1:\n",
    "                arrow = 'v'\n",
    "            elif best_policy[i, j] == 2:\n",
    "                arrow = '>'\n",
    "            elif best_policy[i, j] == 3:\n",
    "                arrow = '^'\n",
    "            if MAP[i][j] in 'S':\n",
    "                arrow = 'S ' + arrow\n",
    "            text = ax.text(j, i, arrow,\n",
    "                           ha = \"center\", va = \"center\", color = \"black\")\n",
    "            \n",
    "    cbar = ax.figure.colorbar(im, ax = ax)\n",
    "    \n",
    "    fig.tight_layout()\n",
    "    plt.show() \n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simulator\n",
    "\n",
    "We now create a simulator corresponding to the environment that we want to train and test the agent in. The simulator always maintains a current state, which changes using the following functions:  \n",
    "- simulator.reset(): Reset the simulator so that the current state is the initial state.   \n",
    "- simulator.step(action): take the specified action in the current state, which returns the tuple (next_state, reward, done, info) and sets the internal current stae to be next_state. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simulator = FrozenLakeEnv(desc = MAP[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "# Non-Distributed RL Agents\n",
    "\n",
    "In this part, you will implement Q-learning and SARSA on two maps: Dangerous Hall and the Size 16 map. \n",
    "\n",
    "$\\epsilon$-greedy exploration will be used as the exploration strategy with a fixed value of $\\epsilon$. \n",
    "\n",
    "- Suggested learning episodes ---  Dangerous Hall map = 30000 and Size 16 map = 100,000\n",
    "\n",
    "A typical way to evaluate the learning performance of an RL agent is as follows. The agent will learning using an explore-exploit policy for \"test_interval\" number of episodes. After each \"test_interval\" learning episodes, exploration is turned off and the performance of the greedy policy is evaluated. This is done by defining the policy as $\\pi(s)=\\arg\\max_a Q(s,a)$ and then running the policy for a specified number of episodes and averaging the results. \n",
    "\n",
    "- Evaluation --- by default testing is done every \"test_interval\" episodes, but testing can be turned off by setting the do_test argument to false. \n",
    "  \n",
    "In this part of the assignment you need to do the following. \n",
    "\n",
    "- Use both SARSA and Q-Learning to learn policies for each of the two maps using learning rates of 0.001 and 0.1 (using the default value of $\\epsilon$). \n",
    "\n",
    "- Use the better of the two learning rates to learning policies using SARSA and Q-Learning on each of the two maps using $\\epsilon$ values of 0.3, 0.05.\n",
    "\n",
    "### Your final report for the asignment should include the following. \n",
    "\n",
    "1. Provide the learning curves for the above experiments. Clearly label the curves by the parameters used. \n",
    "\n",
    "2. Did you observe differences for SARSA when using the two different learning rates? If there were significant differences, what were they and how can you explain them?  \n",
    "\n",
    "3. Repeat (2) for Q-Learning.\n",
    "\n",
    "4. Did you observe differences for SARSA when using different values of $\\epsilon$? If there were significant differences, what were they and how do you explain them?\n",
    "\n",
    "5. Repeat (4) for Q-Learning. \n",
    "\n",
    "6. For the map \"Dangerous Hallway\" did you observe differences in the policies learned by SARSA and Q-Learning for the two values of epsilon (there should be differences between Q-learning and SARSA for at least one value)? If you observed a difference, give your best explanation for why Q-learning and SARSA found different solutions.\n",
    "\n",
    "\n",
    "***\n",
    "## TDagent\n",
    "The parent calss of Q-learning and SARSA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class agent():\n",
    "    def __init__(self, epsilon, learning_rate, learning_episodes, map_size, \n",
    "                 test_interval = 100, action_space = 4, beta = 0.999, do_test = True):\n",
    "        self.beta = beta\n",
    "        self.epsilon = epsilon\n",
    "        self.test_interval = test_interval\n",
    "        self.batch_num = learning_episodes // test_interval\n",
    "        self.action_space = action_space\n",
    "        self.state_space = map_size * map_size + 1\n",
    "        self.q_table = np.zeros((self.state_space, self.action_space))\n",
    "        self.learning_rate = learning_rate\n",
    "        self.do_test = do_test\n",
    "        \n",
    "    def explore_or_exploit_policy(self, curr_state):\n",
    "        #INSERT YOUR CODE HERE\n",
    "        \n",
    "    def greedy_policy(self, curr_state):\n",
    "        #INSERT YOUR CODE HERE\n",
    "    \n",
    "    def learn_and_evaluate(self):\n",
    "        total_rewards = []\n",
    "        for i in tqdm.tqdm(range(self.batch_num), desc=\"Test Number\"):\n",
    "            #INSERT YOUR CODE HERE\n",
    "        \n",
    "        return total_rewards, self.q_table\n",
    "    \n",
    "    def evaluate(self, policy_func, trials = 100, max_steps = 1000):\n",
    "        \n",
    "        total_reward = 0\n",
    "        for _ in range(trials):\n",
    "            simulator.reset()\n",
    "            done = False\n",
    "            steps = 0\n",
    "            observation, reward, done, info = simulator.step(policy_func(0))\n",
    "            total_reward += (self.beta ** steps) * reward\n",
    "            steps += 1\n",
    "            while not done and steps < 1000:\n",
    "                observation, reward, done, info = simulator.step(policy_func(observation))\n",
    "                total_reward += (self.beta ** steps) * reward\n",
    "                steps += 1\n",
    "        \n",
    "        return total_reward / trials\n",
    "    \n",
    "    def learn(self):\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q-learning Agent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class QLAgent(agent):\n",
    "    def __init__(self, epsilon, learning_rate, learning_episodes, map_size,\n",
    "                 test_interval = 100, action_space = 4, beta = 0.999, do_test = True):\n",
    "        super().__init__(epsilon, learning_rate, learning_episodes, map_size, \n",
    "                         test_interval = test_interval, action_space = action_space, beta = beta, do_test = do_test)\n",
    "        self.agent_name = \"Q-learning Agent\"\n",
    "        \n",
    "    def learn(self):\n",
    "        for _ in range(self.test_interval):\n",
    "            #INSERT YOUR CODE HERE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simulator.reset()\n",
    "\n",
    "#INSERT YOUR CODE FOR INIT PARAMS HERE\n",
    "\n",
    "ql_agent = QLAgent(epsilon, learning_rate, learning_episodes, map_size,\n",
    "                   test_interval = test_interval, do_test = do_test)\n",
    "start_time = time.time()\n",
    "total_rewards, q_table = ql_agent.learn_and_evaluate()\n",
    "run_time['Q-learning agent'] = time.time() - start_time\n",
    "print(\"Learning time:\\n\")\n",
    "print(run_time['Q-learning agent'])\n",
    "if do_test:\n",
    "    plot_result(total_rewards ,ql_agent.test_interval, [ql_agent.agent_name])\n",
    "plot_image(q_table, MAP[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SARSA Agent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SARSAagent(agent):\n",
    "    def __init__(self, epsilon, learning_rate, learning_episodes, map_size, \n",
    "                 test_interval = 100, action_space = 4, beta = 0.999, do_test = True):\n",
    "        super().__init__(epsilon, learning_rate, learning_episodes, map_size,\n",
    "                         test_interval = test_interval, action_space = action_space, beta = beta, do_test = do_test)\n",
    "        self.agent_name = \"SARSA Agent\"\n",
    "        \n",
    "    def learn(self):\n",
    "        for _ in range(self.test_interval):\n",
    "            #INSERT YOUR CODE HERE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simulator.reset()\n",
    "#INSERT YOUR CODE FOR INIT PARAMS HERE\n",
    "\n",
    "sarsa_agent = SARSAagent(epsilon, learning_rate, learning_episodes, map_size,\n",
    "                         test_interval = test_interval, do_test = do_test)\n",
    "start_time = time.time()\n",
    "total_rewards, q_table = sarsa_agent.learn_and_evaluate()\n",
    "run_time['SARSA agent'] = time.time() - start_time\n",
    "print(\"Learning time:\\n\")\n",
    "print(run_time['SARSA agent'])\n",
    "if do_test:\n",
    "    plot_result(total_rewards, sarsa_agent.test_interval, [sarsa_agent.agent_name])\n",
    "plot_image(q_table, MAP[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "# Distibuted Q-Learning\n",
    "***\n",
    "## init Ray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ray.shutdown()\n",
    "ray.init(include_webui=False, ignore_reinit_error=True, redis_max_memory=500000000, object_store_memory=5000000000, temp_dir = '~/ray_tmp')d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distibuted Q-learning Agent \n",
    "\n",
    "Below you will implement a distributed Q-Learning algorithm, where both the learning and evaluation/testing is done in a distributed way. The goal is to demonstrate a speedup in learning (in terms of wall-clock time) by creating multiple workers to collect data and do the evaluation. We have provided a schema for how to implement this. \n",
    "\n",
    "- Main Agent: Setup all the wokrers and server and turn them on.  \n",
    "- Collector: There is a simulator inside each collector. Their job is collecting exprience from the simulator, and sending the experience the Q-table server.  \n",
    "- Evaluator: There is a simulator inside each evaluator. Their job is taking a Q-table from server, then testing the performace of the greedy policy and sending the results back to the server.  \n",
    "- Server: It maintains the Q-table. It Takes expriences from collector and performs the appropriate learning updates. It also responds to the evaluator periodically when it is time to perform tests/evaluations.   \n",
    "  \n",
    "We have provided the Evaluator code, so you don't need to implement that yourself. However, understanding that code should be helpful for your implementation of other parts. \n",
    "\n",
    "After you complete the coding for this part, you need to peform evaluations on the two maps using the following number of workers: (collector workers, evaluator workers) = (2,4), (4,4), (8,4).  You can use $\\epsilon = 0.1$ and the best of the learning rates that you considered. \n",
    "\n",
    "```python\n",
    "\n",
    "\"\"\"\n",
    "\n",
    " +---------------+\n",
    " |               |\n",
    " |     Agent     |\n",
    " |               | initializes all the wokrers and server.\n",
    " |               |\n",
    " +---------------+\n",
    "                                             \n",
    "                                             \n",
    "            +-----------+         +-----------+       +-----------+         +-----------+  \n",
    "            |           |         |           |       |           |         |           |       \n",
    "            |  Worker   |         |  Worker   |       |  Worker   |         |  Worker   |       \n",
    "            |(Collector)|--------▷|(Collector)|------▷|(Collector)|--------▷|(Collector)|               \n",
    "            |           |         |           |       |           |         |           |       \n",
    "            +-----------+         +-----------+       +-----------+         +-----------+       \n",
    "                  |                     |                   |                     |            \n",
    "                  |                     |                   |                     |             \n",
    "                  +-- batch Exprience --+---------+---------+-- batch Exprience --+             \n",
    "                                                  |                                             \n",
    "                                                  |\n",
    "                                                  ▽\n",
    "                                          +----------------+                                            \n",
    "                                          |                |                                                     \n",
    "                                          |  QL Server     |  \n",
    "                                          |   (learner)    |                   \n",
    "                                          +----------------+                   \n",
    "                                                   |\n",
    "                                                   |\n",
    "                   +-------- Q-tbale -------+---------+---------+-- Q-tbale -------+ \n",
    "                   |                     |                   |                     |              \n",
    "                   |                     |                   |                     | \n",
    "                   ▽                     ▽                   ▽                     ▽\n",
    "            +-----------+         +-----------+       +-----------+         +-----------+  \n",
    "            |           |         |           |       |           |         |           |       \n",
    "            |  Worker   |         |  Worker   |       |  Worker   |         |  Worker   |       \n",
    "            |(Evaluator)|         |(Evaluator)|       |(Evaluator)|         |(Evaluator)|               \n",
    "            |           |         |           |       |           |         |           |       \n",
    "            +-----------+         +-----------+       +-----------+         +-----------+                                                                                                                      \n",
    "\"\"\"\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@ray.remote    \n",
    "class QLAgent_server(agent):\n",
    "    def __init__(self, epsilon, learning_rate, learning_episodes, map_size,\n",
    "                 test_interval = 100, batch_size = 100, action_space = 4, beta = 0.999, do_test = True):\n",
    "        super().__init__(epsilon, learning_rate, learning_episodes, map_size, \n",
    "                         test_interval = test_interval, action_space = action_space, beta = beta, do_test = do_test)\n",
    "        self.collector_done = False\n",
    "        self.evaluator_done = False\n",
    "        self.learning_episodes = learning_episodes\n",
    "        self.episode = 0\n",
    "        self.reuslts = []\n",
    "        self.batch_size = batch_size\n",
    "        self.privous_q_tables = []\n",
    "        self.results = [0] * (self.batch_num + 1)\n",
    "        self.reuslt_count = 0\n",
    "        \n",
    "    def learn(self, experiences):\n",
    "        #INSERT YOUR CODE HERE\n",
    "        \n",
    "        if self.do_test:\n",
    "            if self.episode // self.test_interval + 1 > len(self.privous_q_tables):\n",
    "                self.privous_q_tables.append(self.q_table)\n",
    "        return self.get_q_table()\n",
    "        \n",
    "    def get_q_table(self):\n",
    "        if self.episode >= self.learning_episodes:\n",
    "            self.collector_done = True\n",
    "            \n",
    "        return self.q_table, self.collector_done\n",
    "    \n",
    "    # evalutor\n",
    "    def add_result(self, result, num):\n",
    "        self.results[num] = result\n",
    "    \n",
    "    def get_reuslts(self):\n",
    "        return self.results, self.q_table\n",
    "    \n",
    "    def ask_evaluation(self):\n",
    "        if len(self.privous_q_tables) > self.reuslt_count:\n",
    "            num = self.reuslt_count\n",
    "            evluation_q_table = self.privous_q_tables[num]\n",
    "            self.reuslt_count += 1\n",
    "            return evluation_q_table, False, num\n",
    "        else:\n",
    "            if self.episode >= self.learning_episodes:\n",
    "                self.evaluator_done = True\n",
    "            return [], self.evaluator_done, None\n",
    "    \n",
    "@ray.remote    \n",
    "def collecting_worker(server, simulator, epsilon, action_space = 4, batch_size = 100):\n",
    "    def greedy_policy(curr_state):\n",
    "        #INSERT YOUR CODE HERE\n",
    "\n",
    "    def explore_or_exploit_policy(curr_state):\n",
    "        #INSERT YOUR CODE HERE\n",
    "        \n",
    "    q_table, learn_done = ray.get(server.get_q_table.remote())\n",
    "    while True:\n",
    "        if learn_done:\n",
    "            break\n",
    "        #INSERT YOUR CODE HERE\n",
    "        \n",
    "@ray.remote\n",
    "def evaluation_worker(server, trials = 100, action_space = 4, beta = 0.999):\n",
    "    def greedy_policy(curr_state):\n",
    "        #INSERT YOUR CODE HERE\n",
    "    while True:\n",
    "        q_table, done, num = ray.get(server.ask_evaluation.remote())\n",
    "        if done:\n",
    "            break\n",
    "        if len(q_table) == 0:\n",
    "            continue\n",
    "        total_reward = 0\n",
    "        for _ in range(trials):\n",
    "            simulator.reset()\n",
    "            done = False\n",
    "            steps = 0\n",
    "            observation, reward, done, info = simulator.step(greedy_policy(0))\n",
    "            total_reward += (beta ** steps) * reward\n",
    "            steps += 1\n",
    "            while not done:\n",
    "                observation, reward, done, info = simulator.step(greedy_policy(observation))\n",
    "                total_reward += (beta ** steps) * reward\n",
    "                steps += 1\n",
    "        server.add_result.remote(total_reward / trials, num)\n",
    "\n",
    "class distributed_QL_agent():\n",
    "    def __init__(self, epsilon, learning_rate, learning_episodes, map_size, \n",
    "                 cw_num = 4, ew_num = 4, test_interval = 100, batch_size = 100,\n",
    "                 action_space = 4, beta = 0.999, do_test = True):\n",
    "        \n",
    "        self.server = QLAgent_server.remote(epsilon, learning_rate, learning_episodes, map_size, \n",
    "                                               test_interval = test_interval, batch_size = batch_size, do_test = do_test)\n",
    "        self.workers_id = []\n",
    "        self.epsilon = epsilon\n",
    "        self.batch_size = batch_size\n",
    "        self.cw_num = cw_num\n",
    "        self.ew_num = ew_num\n",
    "        self.agent_name = \"Distributed Q-learning\"\n",
    "        self.do_test = do_test\n",
    "        \n",
    "    def learn_and_evaluate(self):\n",
    "        workers_id = []\n",
    "        \n",
    "        #INSERT YOUR CODE HERE\n",
    "\n",
    "        ray.wait(workers_id, len(workers_id))\n",
    "        return ray.get(self.server.get_reuslts.remote())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "simulator.reset()\n",
    "#INSERT YOUR CODE FOR INIT PARAMS HERE\n",
    "\n",
    "start_time = time.time()\n",
    "distributed_ql_agent = distributed_QL_agent(epsilon, learning_rate, learning_episodes, map_size, \n",
    "                                            test_interval = test_interval, batch_size = batch_size,\n",
    "                                            cw_num = 8, ew_num = 4, do_test = do_test)\n",
    "total_rewards, q_table = distributed_ql_agent.learn_and_evaluate()\n",
    "run_time['Distributed Q-learning agent'] = time.time() - start_time\n",
    "print(\"Learning time:\\n\")\n",
    "print(run_time['Distributed Q-learning agent'])\n",
    "if do_test:\n",
    "    plot_result(total_rewards, test_interval, [distributed_ql_agent.agent_name])\n",
    "plot_image(q_table, MAP[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison of different approachs\n",
    "\n",
    "Run the following cell to compare the running time of different approaches. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from copy import deepcopy\n",
    "temp_dict = deepcopy(run_time)\n",
    "print(\"All:\")\n",
    "for _ in range(len(temp_dict)):\n",
    "    min_v = float('inf')\n",
    "    for k, v in temp_dict.items():\n",
    "        if v is None:\n",
    "            continue\n",
    "        if v < min_v:\n",
    "            min_v = v\n",
    "            name = k\n",
    "    temp_dict[name] = float('inf')\n",
    "    print(name + \": \" + str(min_v))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Include in your report the following: \n",
    "\n",
    "7. Show the value functions learned by the distributed methods for the best policies learned with $\\epsilon$ equal to 0.3 and compare to those of the single-core method. Run the algorithm for the recommended number of episodes for each of the maps. Did the approaches produce similar results? \n",
    "\n",
    "8. Provide and compare the timing results for the single-core and distributed experiments, including the time to do the evaluations during learning. Describe the trends you observe as the number of workers increases. \n",
    "\n",
    "9. Provide and compare the timing results for the single-core and distributed experiments with the evaluation procedure turned off. That is, here you only want to provide timing results for the learning process without any interleaved evaluation. Compare the results with (8). "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}