
Using the Q-Network for real-world applications
Maintaining a table for a small number of states is feasible, but in real-world problems the number of states can be enormous or even continuous. We therefore need a solution that takes the state information as input and outputs the Q-values for all actions without consulting a Q-table. This is where a neural network acts as a function approximator: it is trained on data consisting of state information and the corresponding Q-values for all actions, and it can then predict the Q-values for any new state it is given. A neural network used in this way, to predict Q-values instead of looking them up in a Q-table, is called a Q-network.
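As a rough sketch of this difference (illustrative only, not part of the FrozenLake code that follows), a Q-table lookup simply indexes a stored row, whereas a Q-network computes its output from the state representation:
import numpy as np
n_states, n_actions = 16, 4
state = 4
# Q-table: the Q-values for a state are stored explicitly and simply looked up
q_table = np.zeros((n_states, n_actions))
q_values_from_table = q_table[state]
# Q-network (reduced to a single linear layer here for illustration):
# the Q-values are computed from the state representation, so the same
# weights generalize to states that were never stored in any table
state_vector = np.zeros((1, n_states))
state_vector[0, state] = 1.0                  # one-hot representation of the state
W = np.random.randn(n_states, n_actions) * 0.01
q_values_from_network = state_vector.dot(W)   # shape (1, 4): one Q-value per action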
Here, for the FrozenLake-v0 environment, let's use a single-layer neural network that takes the state information as input, represented as a one-hot encoded vector of shape 1 x number of states (here, 1 x 16), and outputs a vector of shape 1 x number of actions (here, 1 x 4). The output vector contains the Q-values for all the actions:
# considering there are 16 states numbered from state 0 to state 15, state number 4 will be
# represented as the following one-hot encoded vector
input_state = [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
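In the listing that follows, this one-hot vector is built by slicing a row out of a NumPy identity matrix, since row s of a 16 x 16 identity matrix is exactly the one-hot encoding of state s:
import numpy as np
n_states = 16
s = 4
input_state = np.identity(n_states)[s:s+1]  # shape (1, 16), with a 1 at index 4
# this is the same trick used later in the feed_dict of the Q-network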
With the option of adding more hidden layers and using different activation functions, a Q-network has clear advantages over a Q-table. Unlike a Q-table, in a Q-network the Q-values are updated by minimizing a loss through backpropagation. The loss function is given by:
loss = Σ (Q_target − Q_predicted)²
where Q_target = r + γ · max_a' Q(s', a') is the target Q-value for the action taken (r being the immediate reward and γ the discount factor), and Q_predicted is the Q-value currently output by the network for that state and action.
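To see what this loss looks like numerically, here is a small worked example with made-up Q-values; the discount factor of 0.5 matches the listing below, and only the entry for the action actually taken differs between the target and the prediction:
import numpy as np
r, y = -1.0, 0.5                                # illustrative reward and discount factor
q_pred = np.array([[0.2, -0.4, 0.1, 0.0]])      # predicted Q-values at the current state s
q_pred_new = np.array([[0.3, 0.5, -0.2, 0.1]])  # predicted Q-values at the next state s_
a = 2                                           # action actually taken
targetQ = q_pred.copy()
targetQ[0, a] = r + y * np.max(q_pred_new)      # -1 + 0.5 * 0.5 = -0.75
loss = np.sum(np.square(targetQ - q_pred))      # (-0.75 - 0.1)^2 = 0.7225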
Let's implement this in Python and see how a basic Q-network algorithm makes an agent learn to navigate across this frozen lake of 16 grids, from the start to the goal, without falling into a hole:
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import tensorflow as tf
import random
# Load the Environment
env = gym.make('FrozenLake-v0')
# Q - Network Implementation
## Creating Neural Network
tf.reset_default_graph()
# tensors for inputs, weights, biases, Qtarget
inputs = tf.placeholder(shape=[None,env.observation_space.n],dtype=tf.float32)
W = tf.get_variable(name="W",dtype=tf.float32,shape=[env.observation_space.n,env.action_space.n],initializer=tf.contrib.layers.xavier_initializer())
b = tf.Variable(tf.zeros(shape=[env.action_space.n]),dtype=tf.float32)
qpred = tf.add(tf.matmul(inputs,W),b)
apred = tf.argmax(qpred,1)
qtar = tf.placeholder(shape=[1,env.action_space.n],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(qtar-qpred))
train = tf.train.AdamOptimizer(learning_rate=0.001)
minimizer = train.minimize(loss)
## Training the neural network
init = tf.global_variables_initializer() #initializing tensor variables
#initializing parameters
y = 0.5 #discount factor
e = 0.3 #epsilon value for epsilon-greedy task
episodes = 10000 #total number of episodes
with tf.Session() as sess:
    sess.run(init)
    for i in range(episodes):
        s = env.reset() #resetting the environment at the start of each episode
        r_total = 0 #to calculate the sum of rewards in the current episode
        while(True):
            #running the Q-network created above
            a_pred,q_pred = sess.run([apred,qpred],feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
            #a_pred is the action predicted by the neural network
            #q_pred contains the Q-values of the actions at the current state 's'
            if np.random.uniform(low=0,high=1) < e: #performing epsilon-greedy here
                a_pred[0] = env.action_space.sample()
                #exploring a different action by randomly assigning it as the next action
            s_,r,t,_ = env.step(a_pred[0]) #action taken and new state 's_' is encountered with a feedback reward 'r'
            if r==0:
                if t==True:
                    r=-5 #if it is a hole, make the reward more negative
                else:
                    r=-1 #if the block is fine/frozen, give a slight negative reward to optimize the path
            if r==1:
                r=5 #good positive reward for reaching the goal state
            q_pred_new = sess.run(qpred,feed_dict={inputs:np.identity(env.observation_space.n)[s_:s_+1]})
            #q_pred_new contains the Q-values of the actions at the new state
            #update the Q-target value for the action taken
            targetQ = q_pred
            max_qpredn = np.max(q_pred_new)
            targetQ[0,a_pred[0]] = r + y*max_qpredn
            #this gives our targetQ
            #train the neural network to minimize the loss
            _ = sess.run(minimizer,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1],qtar:targetQ})
            s=s_
            if t==True:
                break

    #learning ends with the end of the above loop over several episodes
    #let's check how much our agent has learned
    print("Output after learning")
    print()
    s = env.reset()
    env.render()
    while(True):
        a = sess.run(apred,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
        s_,r,t,_ = env.step(a[0])
        print("===============")
        env.render()
        s = s_
        if t==True:
            break
-----------------------------------------------------------------------------------------------
<<OUTPUT>>
Output after learning
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Left)
SFFF
FHFH
FFFH
HFFG
===============
(Up)
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Up)
SFFF
FHFH
FFFH
HFFG
There is a stability cost associated with both Q-learning and Q-networks. There will be cases where, with a given set of hyperparameters, the Q-values do not converge, yet with the same hyperparameters convergence is sometimes witnessed. This is because of the inherent instability of these learning approaches. To tackle this, a better initial policy should be defined (here, taking the maximum Q-value of a given state) if the state space is small. Moreover, the hyperparameters, especially the learning rate, the discount factor, and the epsilon value, play an important role, so these values must be initialized properly.
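One common stabilizing tweak, not used in the listing above, is to decay the epsilon value over the episodes so that the agent explores heavily at first and then increasingly exploits its learned Q-values. A minimal sketch of such a schedule (values chosen only for illustration) looks like this:
# illustrative epsilon-decay schedule (not part of the listing above)
e, e_min, decay = 0.3, 0.01, 0.999
for episode in range(10000):
    # ... run one episode with epsilon-greedy exploration using the current 'e' ...
    e = max(e_min, e * decay)  # shrink epsilon a little after every episode
print(e)                       # epsilon has decayed from 0.3 down to the 0.01 floor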
Q-networks provide more flexibility than Q-learning as the state space grows. Using a deep neural network in a Q-network may lead to better learning and performance. When it comes to playing Atari games with deep Q-networks, there are many further tweaks, which we will discuss in the coming chapters.