
Using the Q-Network for real-world applications
Maintaining a table for a small number of states is feasible, but in real-world problems the number of states can be enormous or even continuous. We therefore need a solution that takes the state information as input and outputs the Q-values for all actions without consulting a Q-table. This is where a neural network acts as a function approximator: it is trained on data consisting of state information and the corresponding Q-values for all actions, and it can then predict the Q-values for any new state it is given. A neural network used in this way, to predict Q-values instead of looking them up in a Q-table, is called a Q-network.
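As a rough sketch of this difference (illustrative only, not part of the FrozenLake code that follows), a Q-table lookup simply indexes a stored row, whereas a Q-network computes its output from the state representation:
import numpy as np
n_states, n_actions = 16, 4
state = 4
# Q-table: the Q-values for a state are stored explicitly and simply looked up
q_table = np.zeros((n_states, n_actions))
q_values_from_table = q_table[state]
# Q-network (reduced to a single linear layer here for illustration):
# the Q-values are computed from the state representation, so the same
# weights generalize to states that were never stored in any table
state_vector = np.zeros((1, n_states))
state_vector[0, state] = 1.0                  # one-hot representation of the state
W = np.random.randn(n_states, n_actions) * 0.01
q_values_from_network = state_vector.dot(W)   # shape (1, 4): one Q-value per action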
Here, for the FrozenLake-v0 environment, let's use a single-layer neural network that takes the state information as input, represented as a one-hot encoded vector of shape 1 x number of states (here, 1 x 16), and outputs a vector of shape 1 x number of actions (here, 1 x 4). The output vector contains the Q-values for all the actions:
# considering there are 16 states numbered from state 0 to state 15, state number 4 will be
# represented as the following one-hot encoded vector
input_state = [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
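In the listing that follows, this one-hot vector is built by slicing a row out of a NumPy identity matrix, since row s of a 16 x 16 identity matrix is exactly the one-hot encoding of state s:
import numpy as np
n_states = 16
s = 4
input_state = np.identity(n_states)[s:s+1]  # shape (1, 16), with a 1 at index 4
# this is the same trick used later in the feed_dict of the Q-network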
With the option of adding more hidden layers and using different activation functions, a Q-network has clear advantages over a Q-table. Unlike a Q-table, in a Q-network the Q-values are updated by minimizing a loss through backpropagation. The loss function is given by:
loss = Σ (Q_target − Q_predicted)²
where Q_target = r + γ · max_a' Q(s', a') is the target Q-value for the action taken (r being the immediate reward and γ the discount factor), and Q_predicted is the Q-value currently output by the network for that state and action.
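To see what this loss looks like numerically, here is a small worked example with made-up Q-values; the discount factor of 0.5 matches the listing below, and only the entry for the action actually taken differs between the target and the prediction:
import numpy as np
r, y = -1.0, 0.5                                # illustrative reward and discount factor
q_pred = np.array([[0.2, -0.4, 0.1, 0.0]])      # predicted Q-values at the current state s
q_pred_new = np.array([[0.3, 0.5, -0.2, 0.1]])  # predicted Q-values at the next state s_
a = 2                                           # action actually taken
targetQ = q_pred.copy()
targetQ[0, a] = r + y * np.max(q_pred_new)      # -1 + 0.5 * 0.5 = -0.75
loss = np.sum(np.square(targetQ - q_pred))      # (-0.75 - 0.1)^2 = 0.7225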
Let's implement this in Python and see how a basic Q-network algorithm makes an agent learn to navigate across this frozen lake of 16 grids, from the start to the goal, without falling into a hole:
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import tensorflow as tf
import random
# Load the Environment
env = gym.make('FrozenLake-v0')
# Q - Network Implementation
## Creating Neural Network
tf.reset_default_graph()
# tensors for inputs, weights, biases, Qtarget
inputs = tf.placeholder(shape=[None,env.observation_space.n],dtype=tf.float32)
W = tf.get_variable(name="W",dtype=tf.float32,shape=[env.observation_space.n,env.action_space.n],initializer=tf.contrib.layers.xavier_initializer())
b = tf.Variable(tf.zeros(shape=[env.action_space.n]),dtype=tf.float32)
qpred = tf.add(tf.matmul(inputs,W),b)
apred = tf.argmax(qpred,1)
qtar = tf.placeholder(shape=[1,env.action_space.n],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(qtar-qpred))
train = tf.train.AdamOptimizer(learning_rate=0.001)
minimizer = train.minimize(loss)
## Training the neural network
init = tf.global_variables_initializer() #initializing tensor variables
#initializing parameters
y = 0.5 #discount factor
e = 0.3 #epsilon value for epsilon-greedy task
episodes = 10000 #total number of episodes
with tf.Session() as sess:
    sess.run(init)
    for i in range(episodes):
        s = env.reset() #resetting the environment at the start of each episode
        r_total = 0 #to calculate the sum of rewards in the current episode
        while(True):
            #running the Q-network created above
            a_pred,q_pred = sess.run([apred,qpred],feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
            #a_pred is the action predicted by the neural network
            #q_pred contains the Q-values of the actions at the current state 's'
            if np.random.uniform(low=0,high=1) < e: #performing epsilon-greedy here
                a_pred[0] = env.action_space.sample()
                #exploring a different action by randomly assigning it as the next action
            s_,r,t,_ = env.step(a_pred[0]) #action taken and new state 's_' is encountered with a feedback reward 'r'
            if r==0:
                if t==True:
                    r=-5 #if it is a hole, make the reward more negative
                else:
                    r=-1 #if the block is fine/frozen, give a slight negative reward to optimize the path
            if r==1:
                r=5 #good positive reward for reaching the goal state
            q_pred_new = sess.run(qpred,feed_dict={inputs:np.identity(env.observation_space.n)[s_:s_+1]})
            #q_pred_new contains the Q-values of the actions at the new state
            #update the Q-target value for the action taken
            targetQ = q_pred
            max_qpredn = np.max(q_pred_new)
            targetQ[0,a_pred[0]] = r + y*max_qpredn
            #this gives our targetQ
            #train the neural network to minimize the loss
            _ = sess.run(minimizer,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1],qtar:targetQ})
            s=s_
            if t==True:
                break

    #learning ends with the end of the above loop over several episodes
    #let's check how much our agent has learned
    print("Output after learning")
    print()
    s = env.reset()
    env.render()
    while(True):
        a = sess.run(apred,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
        s_,r,t,_ = env.step(a[0])
        print("===============")
        env.render()
        s = s_
        if t==True:
            break
-----------------------------------------------------------------------------------------------
<<OUTPUT>>
Output after learning
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Left)
SFFF
FHFH
FFFH
HFFG
===============
(Up)
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Up)
SFFF
FHFH
FFFH
HFFG
There is a stability cost associated with both Q-learning and Q-networks. There will be cases where, with a given set of hyperparameters, the Q-values do not converge, yet with the same hyperparameters convergence is sometimes witnessed. This is because of the inherent instability of these learning approaches. To tackle this, a better initial policy should be defined (here, taking the maximum Q-value of a given state) if the state space is small. Moreover, the hyperparameters, especially the learning rate, the discount factor, and the epsilon value, play an important role, so these values must be initialized properly.
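One common stabilizing tweak, not used in the listing above, is to decay the epsilon value over the episodes so that the agent explores heavily at first and then increasingly exploits its learned Q-values. A minimal sketch of such a schedule (values chosen only for illustration) looks like this:
# illustrative epsilon-decay schedule (not part of the listing above)
e, e_min, decay = 0.3, 0.01, 0.999
for episode in range(10000):
    # ... run one episode with epsilon-greedy exploration using the current 'e' ...
    e = max(e_min, e * decay)  # shrink epsilon a little after every episode
print(e)                       # epsilon has decayed from 0.3 down to the 0.01 floor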
Q-networks provide more flexibility than Q-learning as the state space grows. Using a deep neural network in a Q-network may lead to better learning and performance. When it comes to playing Atari games with deep Q-networks, there are many further tweaks, which we will discuss in the coming chapters.