Least-squares temporal difference learning based on an extreme learning machine

Emilio Soria-Olivas, José M. Martínez-Martínez, Pablo Escandell-Montero, Juan Gómez-Sanchis, José D. Martín-Guerrero

subject

Mathematical optimization; Artificial neural network; Artificial Intelligence; Cognitive Neuroscience; Bellman equation; Reinforcement learning; State space; Markov decision process; Temporal difference learning; Computer Science Applications; Mathematics; Extreme learning machine; Curse of dimensionality

description

Abstract: Reinforcement learning (RL) is a general class of algorithms for solving decision-making problems, which are usually modeled using the Markov decision process (MDP) framework. RL can find exact solutions only when the MDP state space is discrete and small enough. Because many real-world problems are described by continuous variables, approximation is essential in practical applications of RL. This paper focuses on learning the value function of a fixed policy in continuous MDPs, an important subproblem of several RL algorithms. We propose a least-squares temporal difference (LSTD) algorithm based on the extreme learning machine. LSTD is typically combined with local function approximators, which scale poorly with the problem dimensionality. Our approach allows value functions to be approximated using single-hidden-layer feedforward networks (SLFNs), a type of artificial neural network widely used in many fields. Owing to the global nature of SLFNs, the proposed approach is better suited than traditional methods to high-dimensional problems. The method was evaluated empirically on a set of MDPs whose dimensionality ranges from 1 to 6. For comparison, the experiments were replicated using a standard LSTD algorithm combined with Gaussian radial basis functions. The experimental results suggest that, although both methods can accurately approximate value functions, the proposed approach requires considerably fewer resources for the same degree of accuracy.
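The abstract does not give the algorithm in detail. What follows is a minimal Python/NumPy sketch of the general idea, not the authors' implementation. It assumes a standard LSTD solve placed on top of a randomly initialised, untrained sigmoid hidden layer, which is the extreme-learning-machine ingredient. From sampled transitions (s_t, r_t, s_{t+1}) under the fixed policy, LSTD estimates the output weights as theta = A^{-1} b, where A = sum_t phi(s_t) (phi(s_t) - gamma phi(s_{t+1}))^T and b = sum_t phi(s_t) r_t. Here phi(s) are the hidden-layer outputs of the SLFN. Several details are hypothetical choices made for illustration: the standard-normal initialisation, the small ridge term `reg`, the toy MDP, and every function name.

```python
import numpy as np


def elm_features(states, W, bias):
    """Hidden-layer outputs of an ELM-style SLFN: sigmoid(states @ W + bias).

    W and bias are drawn once at random and never trained (the ELM idea).
    """
    return 1.0 / (1.0 + np.exp(-(states @ W + bias)))


def lstd_elm(s, r, s_next, n_hidden=20, gamma=0.9, reg=1e-6, seed=0):
    """Hypothetical sketch: LSTD policy evaluation with random ELM features.

    s, s_next: arrays of shape (n_samples, state_dim); r: shape (n_samples,).
    Returns (W, bias, theta) so that V(x) = elm_features(x, W, bias) @ theta.
    """
    rng = np.random.default_rng(seed)
    dim = s.shape[1]
    # Random, fixed input weights and biases (assumption: standard normal).
    W = rng.standard_normal((dim, n_hidden))
    bias = rng.standard_normal(n_hidden)

    phi = elm_features(s, W, bias)
    phi_next = elm_features(s_next, W, bias)
    n = s.shape[0]

    # LSTD system A theta = b_vec, averaged over samples:
    # A = mean phi(s) (phi(s) - gamma phi(s'))^T,  b_vec = mean phi(s) r.
    A = phi.T @ (phi - gamma * phi_next) / n
    b_vec = phi.T @ r / n
    # Small ridge term (assumption) keeps the solve stable because random
    # sigmoid features tend to be nearly collinear.
    theta = np.linalg.solve(A + reg * np.eye(n_hidden), b_vec)
    return W, bias, theta


def value(states, W, bias, theta):
    """Approximate value V(x) = phi(x) @ theta for an array of states."""
    return elm_features(states, W, bias) @ theta


# Toy 1-D MDP (hypothetical): s ~ U(0, 1), reward r = s, and the next state
# s' ~ U(0, 1) is drawn independently of s under the fixed policy.
rng = np.random.default_rng(1)
gamma = 0.9
n = 5000
s = rng.uniform(0.0, 1.0, size=(n, 1))
s_next = rng.uniform(0.0, 1.0, size=(n, 1))
r = s[:, 0]

W, bias, theta = lstd_elm(s, r, s_next, n_hidden=20, gamma=gamma)
grid = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
v_hat = value(grid, W, bias, theta)
# Exact value for this toy MDP: V(s) = s + gamma * 0.5 / (1 - gamma).
v_true = grid[:, 0] + gamma * 0.5 / (1.0 - gamma)
print("max abs error:", np.max(np.abs(v_hat - v_true)))
```

In this toy problem the next state does not depend on the current one. Solving the Bellman equation therefore gives V(s) = s + gamma/(2(1 - gamma)), which equals s + 4.5 for gamma = 0.9, so the LSTD estimate can be checked against a known answer. Only the output weights theta are fitted, by one linear solve, and the hidden layer stays random. That split is what lets a global SLFN approximator plug into LSTD in place of hand-placed local basis functions such as Gaussian RBFs.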

https://doi.org/10.1016/j.neucom.2013.11.040