
The Epsilon-Greedy approach
Epsilon-Greedy is a widely used solution to the explore-exploit dilemma. Exploration is all about trying out new options through experimentation, in order to discover their values, while exploitation is all about sticking with the options we already know, repeating them and refining our estimates of their values.
The Epsilon-Greedy approach is very simple to understand and easy to implement:
epsilon = 0.05 or 0.1   #any small value between 0 and 1
#epsilon is the probability of exploration
p = random number between 0 and 1
if p < epsilon :
    pull a random action
else:
    pull the current best action
Eventually, after several iterations, the agent discovers the best action at each state, because it gets the chance both to explore new, random actions and to exploit the actions it already knows and refine them.
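Before moving to the frozen lake, here is a minimal sketch of the same idea on a toy three-armed bandit. The reward probabilities and the choice of epsilon = 0.1 are made up purely for illustration:
# Epsilon-Greedy on a toy 3-armed bandit (illustrative values only)
import numpy as np

true_reward_prob = [0.3, 0.5, 0.8]   #hypothetical success probability of each arm
n_actions = len(true_reward_prob)
Q = np.zeros(n_actions)              #estimated value of each action
counts = np.zeros(n_actions)         #number of times each action was pulled
epsilon = 0.1                        #probability of exploration

for step in range(10000):
    if np.random.uniform(low=0, high=1) < epsilon:
        a = np.random.randint(n_actions)   #explore: pull a random action
    else:
        a = np.argmax(Q)                   #exploit: pull the current best action
    r = 1.0 if np.random.uniform(low=0, high=1) < true_reward_prob[a] else 0.0
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]         #incremental mean of observed rewards

print(Q)   #estimates should end up close to [0.3, 0.5, 0.8], so argmax picks arm 2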
Let's try to implement a basic Q-learning algorithm to make an agent learn how to navigate across this frozen lake of 16 grid cells, from the start to the goal, without falling into a hole. At each step the agent picks an action with the epsilon-greedy policy and updates the Q-table with the Q-learning rule, Q(s, a) = Q(s, a) + lr * (r + y * max(Q(s', a')) - Q(s, a)):
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time
#Load the environment
env = gym.make('FrozenLake-v0')
s = env.reset()
print("initial state : ",s)
print()
env.render()
print()
print(env.action_space) #number of actions
print(env.observation_space) #number of states
print()
print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()
#Epsilon-Greedy approach for Exploration and Exploitation of the state-action spaces
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    #print(p)
    if p > epsilon:
        return np.argmax(Q[s,:])  #say here, initial policy = for each state, consider the action having the highest Q-value
    else:
        return env.action_space.sample()
# Q-Learning Implementation
#Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])
#set hyperparameters
lr = 0.5 #learning rate
y = 0.9 #discount factor gamma
eps = 100000 #total number of episodes
for i in range(eps):
    s = env.reset()
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)
        s_,r,t,_ = env.step(a)
        if (r==0):
            if t==True:
                r = -5  #to give negative rewards when holes turn up
                Q[s_] = np.ones(env.action_space.n)*r  #in terminal state Q value equals the reward
            else:
                r = -1  #to give negative rewards to avoid long routes
        if (r==1):
            r = 100
            Q[s_] = np.ones(env.action_space.n)*r  #in terminal state Q value equals the reward
        Q[s,a] = Q[s,a] + lr * (r + y*np.max(Q[s_,:]) - Q[s,a])  #Q-learning update: bootstrap on the best action at the next state
        s = s_
        if (t == True) :
            break
print("Q-table")
print(Q)
print()
print("Output after learning")
print()
#learning ends with the end of the loop over the episodes above
#let's check how much our agent has learned
s = env.reset()
env.render()
while(True):
    a = np.argmax(Q[s])
    s_,r,t,_ = env.step(a)
    print("===============")
    env.render()
    s = s_
    if(t==True) :
        break
----------------------------------------------------------------------------------------------
<<OUTPUT>>
initial state : 0
SFFF
FHFH
FFFH
HFFG
Discrete(4)
Discrete(16)
Number of actions : 4
Number of states : 16
Q-table
[[ -9.85448046  -7.4657981   -9.59584501 -10.        ]
 [ -9.53200011  -9.54250775  -9.10115662 -10.        ]
 [ -9.65308982  -9.51359977  -7.52052469 -10.        ]
 [ -9.69762313  -9.5540111   -9.56571455 -10.        ]
 [ -9.82319854  -4.83823005  -9.56441915  -9.74234959]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.6554905   -9.44717167  -7.35077759  -9.77885057]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.66012445  -4.28223592  -9.48312882  -9.76812285]
 [ -9.59664264   9.60799515  -4.48137699  -9.61956668]
 [ -9.71057124  -5.6863911   -2.04563412  -9.75341962]
 [ -5.          -5.          -5.          -5.        ]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.54737964  22.84803205  18.17841481  -9.45516929]
 [ -9.69494035  34.16859049  72.04055782  40.62254838]
 [100.         100.         100.         100.        ]]
Output after learning
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
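Since FrozenLake-v0 is slippery by default, a single greedy rollout like the one above says little on its own. As an optional extra check (the figure of 1,000 evaluation episodes below is an arbitrary choice for this sketch), the greedy policy can be run many times to measure how often the agent actually reaches the goal:
# Evaluate the learned greedy policy over many episodes
n_eval = 1000          #arbitrary number of evaluation episodes for this sketch
successes = 0
for _ in range(n_eval):
    s = env.reset()
    while True:
        a = np.argmax(Q[s])        #always act greedily during evaluation
        s, r, t, _ = env.step(a)
        if t:
            if r == 1:             #the environment gives reward 1 only at the goal
                successes += 1
            break
print("Success rate : ", successes/n_eval)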