
The Epsilon-Greedy approach
Epsilon-Greedy is a widely used solution to the explore-exploit dilemma. Exploration is all about trying out new options through experimentation, in order to discover their values, while exploitation is all about sticking with the options we already know, repeating them and refining our estimates of their values.
The Epsilon-Greedy approach is very simple to understand and easy to implement:
epsilon = 0.05 or 0.1   #any small value between 0 and 1
#epsilon is the probability of exploration
p = random number between 0 and 1
if p < epsilon :
    pull a random action
else:
    pull the current best action
Eventually, after several iterations, the agent discovers the best action at each state, because it gets the chance both to explore new, random actions and to exploit the actions it already knows and refine them.
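Before moving to the frozen lake, here is a minimal sketch of the same idea on a toy three-armed bandit. The reward probabilities and the choice of epsilon = 0.1 are made up purely for illustration:
# Epsilon-Greedy on a toy 3-armed bandit (illustrative values only)
import numpy as np

true_reward_prob = [0.3, 0.5, 0.8]   #hypothetical success probability of each arm
n_actions = len(true_reward_prob)
Q = np.zeros(n_actions)              #estimated value of each action
counts = np.zeros(n_actions)         #number of times each action was pulled
epsilon = 0.1                        #probability of exploration

for step in range(10000):
    if np.random.uniform(low=0, high=1) < epsilon:
        a = np.random.randint(n_actions)   #explore: pull a random action
    else:
        a = np.argmax(Q)                   #exploit: pull the current best action
    r = 1.0 if np.random.uniform(low=0, high=1) < true_reward_prob[a] else 0.0
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]         #incremental mean of observed rewards

print(Q)   #estimates should end up close to [0.3, 0.5, 0.8], so argmax picks arm 2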
Let's try to implement a basic Q-learning algorithm to make an agent learn how to navigate across this frozen lake of 16 grid cells, from the start to the goal, without falling into a hole. At each step the agent picks an action with the epsilon-greedy policy and updates the Q-table with the Q-learning rule, Q(s, a) = Q(s, a) + lr * (r + y * max(Q(s', a')) - Q(s, a)):
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time
#Load the environment
env = gym.make('FrozenLake-v0')
s = env.reset()
print("initial state : ",s)
print()
env.render()
print()
print(env.action_space) #number of actions
print(env.observation_space) #number of states
print()
print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()
#Epsilon-Greedy approach for Exploration and Exploitation of the state-action spaces
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    #print(p)
    if p > epsilon:
        return np.argmax(Q[s,:])  #say here, initial policy = for each state, consider the action having the highest Q-value
    else:
        return env.action_space.sample()
# Q-Learning Implementation
#Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])
#set hyperparameters
lr = 0.5 #learning rate
y = 0.9 #discount factor gamma
eps = 100000 #total number of episodes
for i in range(eps):
    s = env.reset()
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)
        s_,r,t,_ = env.step(a)
        if (r==0):
            if t==True:
                r = -5  #to give negative rewards when holes turn up
                Q[s_] = np.ones(env.action_space.n)*r  #in terminal state Q value equals the reward
            else:
                r = -1  #to give negative rewards to avoid long routes
        if (r==1):
            r = 100
            Q[s_] = np.ones(env.action_space.n)*r  #in terminal state Q value equals the reward
        Q[s,a] = Q[s,a] + lr * (r + y*np.max(Q[s_,:]) - Q[s,a])  #Q-learning update: bootstrap on the best action at the next state
        s = s_
        if (t == True) :
            break
print("Q-table")
print(Q)
print()
print("Output after learning")
print()
#learning ends with the end of the loop over the episodes above
#let's check how much our agent has learned
s = env.reset()
env.render()
while(True):
    a = np.argmax(Q[s])
    s_,r,t,_ = env.step(a)
    print("===============")
    env.render()
    s = s_
    if(t==True) :
        break
----------------------------------------------------------------------------------------------
<<OUTPUT>>
initial state : 0
SFFF
FHFH
FFFH
HFFG
Discrete(4)
Discrete(16)
Number of actions : 4
Number of states : 16
Q-table
[[ -9.85448046  -7.4657981   -9.59584501 -10.        ]
 [ -9.53200011  -9.54250775  -9.10115662 -10.        ]
 [ -9.65308982  -9.51359977  -7.52052469 -10.        ]
 [ -9.69762313  -9.5540111   -9.56571455 -10.        ]
 [ -9.82319854  -4.83823005  -9.56441915  -9.74234959]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.6554905   -9.44717167  -7.35077759  -9.77885057]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.66012445  -4.28223592  -9.48312882  -9.76812285]
 [ -9.59664264   9.60799515  -4.48137699  -9.61956668]
 [ -9.71057124  -5.6863911   -2.04563412  -9.75341962]
 [ -5.          -5.          -5.          -5.        ]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.54737964  22.84803205  18.17841481  -9.45516929]
 [ -9.69494035  34.16859049  72.04055782  40.62254838]
 [100.         100.         100.         100.        ]]
Output after learning
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
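Since FrozenLake-v0 is slippery by default, a single greedy rollout like the one above says little on its own. As an optional extra check (the figure of 1,000 evaluation episodes below is an arbitrary choice for this sketch), the greedy policy can be run many times to measure how often the agent actually reaches the goal:
# Evaluate the learned greedy policy over many episodes
n_eval = 1000          #arbitrary number of evaluation episodes for this sketch
successes = 0
for _ in range(n_eval):
    s = env.reset()
    while True:
        a = np.argmax(Q[s])        #always act greedily during evaluation
        s, r, t, _ = env.step(a)
        if t:
            if r == 1:             #the environment gives reward 1 only at the goal
                successes += 1
            break
print("Success rate : ", successes/n_eval)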