A Survey of Offline- and Online-Learning-Based Algorithms for Multirotor UAVs

Offline RL has been extensively applied to multirotor UAVs. The literature review identifies 42 papers that focus on 13 different tasks, including trajectory tracking, landing, navigation, formation control, flight control, and hovering.

3.3.1. Value-Function-Based Algorithms

Value-function-based methods use the state-value and action-value functions presented in (1) and (2), respectively:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \quad \text{for all } s \in \mathcal{S} \tag{1}$$

$$q_\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s,\, A_t = a\right] \tag{2}$$

where $v_\pi(s)$ denotes the value function for policy $\pi$ at state $s$, while $q_\pi(s,a)$ represents the action-value function for policy $\pi$ at state $s$ and action $a$. $\mathbb{E}_\pi[\cdot]$ denotes the expected value under policy $\pi$. The sum $\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the discounted sum of future rewards starting from time $t$ in state $s$, i.e., the expected discounted return, and $\gamma$ is the discount rate, $0 \le \gamma \le 1$. $S_t$ and $A_t$ represent the state and action at time $t$, respectively [12,76,77].

The discount rate $\gamma$ plays a critical role in calculating the present value of future rewards. If $\gamma < 1$, the infinite sum has a finite value whenever the reward sequence $\{R_k\}$ is bounded. If $\gamma = 0$, the agent ignores future rewards and considers only the immediate reward ($\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = 0^0 R_{t+1} + 0^1 R_{t+2} + \dots = R_{t+1}$), so it learns to choose the action $A_t$ that maximizes $R_{t+1}$. As $\gamma$ approaches 1, future rewards carry more weight in the expected discounted return; that is, the agent behaves in a more farsighted manner. For example, in [12] the chosen discount factor is close to 1. Li et al. [57] and Hu and Wang [53] chose a value of 0.9, and Castro et al. [66] and Panetsos et al. [60] chose a discount factor of 0.99.
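To make the role of $\gamma$ concrete, the short sketch below (illustrative only; the reward sequence is made up) computes the discounted return for a few discount factors and shows that $\gamma = 0$ keeps only the immediate reward, while values near 1 weight future rewards heavily:

```python
# Illustrative only: how the discount factor gamma weights a fixed reward sequence.
# The reward values below are made up for demonstration.

def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]   # hypothetical rewards R_{t+1}, ..., R_{t+5}

for gamma in (0.0, 0.9, 0.99):
    print(f"gamma={gamma:4.2f} -> G_t = {discounted_return(rewards, gamma):.3f}")
# gamma=0.00 keeps only the immediate reward R_{t+1};
# values closer to 1 weight future rewards more heavily (farsighted behavior).
```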
Value-function-based algorithms comprise several families of methods, such as dynamic programming (DP), Monte Carlo (MC), and temporal-difference (TD) methods. The two most popular DP methods, policy iteration and value iteration, rely on policy evaluation and policy improvement. The MC method is also based on policy evaluation and policy improvement but, unlike DP, it uses an alternative policy evaluation process: policy evaluation in DP employs a bootstrapping technique, while MC applies sampling and average-return techniques. The TD method combines DP and MC, applying both sampling and bootstrapping [12,76,78].
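To make the distinction concrete, the standard textbook update targets for the state-value estimate are contrasted below (a generic summary, not specific to any of the cited UAV papers): DP bootstraps with a one-step expectation over a known model, MC uses the full sampled return, and TD(0) combines a sampled reward with a bootstrapped estimate.

```latex
% Standard update rules for estimating v_pi (textbook forms)
\begin{align*}
\text{DP:}\quad    & V(s) \leftarrow \sum_{a}\pi(a\mid s)\sum_{s',r} p(s',r\mid s,a)\,\bigl[r + \gamma V(s')\bigr] \\
\text{MC:}\quad    & V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[G_t - V(S_t)\bigr],
                     \qquad G_t = \textstyle\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1} \\
\text{TD(0):}\quad & V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr]
\end{align*}
```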
TD has been extensively applied in control algorithms for multirotors executing diverse tasks such as landing, navigation, obstacle avoidance, path planning, and trajectory optimization. Q-learning and Deep Q-Networks (DQNs) stand out as the most commonly employed RL algorithms within the TD framework, particularly among value-function-based algorithms; see Guerra et al. [50], Pham et al. [39], and Polvara et al. [32].
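Many of the works cited below build on the tabular Q-learning update, the canonical off-policy TD control rule. The sketch below is a generic illustration, not the implementation of any specific paper; the action set, learning rate, and exploration rate are placeholder values.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (off-policy TD control).
# States, actions, and hyperparameters here are placeholders, not those of any cited paper.
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
ACTIONS = ["north", "south", "east", "west"]           # e.g., grid moving directions
Q = defaultdict(float)                                  # Q[(state, action)] -> value

def choose_action(state):
    """Epsilon-greedy action selection over the discrete action set."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def td_update(state, action, reward, next_state):
    """Q-learning TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```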
Xu et al. [16] applied an end-to-end control scheme that includes a DNN and a double DQN algorithm for quadrotor landing on a stable platform. The output of the underlying deep reinforcement learning (DRL) model is the quadrotor velocity in x and y, while the velocity in the z-direction is not controlled but considered fixed, which simplifies the problem. In tests, the improved DQN method produced good results for autonomous landing.
Imanberdiyev et al. [31] used a model-based RL algorithm on a quadrotor to create an efficient path to reach a destination by considering its battery life. The agent uses three states: the position in x and y and the battery level. The agent follows one of eight possible actions, moving on the x-y plane and learning the moving direction. The direction action is converted to trajectory commands that are executed using position control. Unlike model-free RL methods, model-based RL methods can learn a sufficiently accurate environment model from a limited number of actions. However, model-based RL algorithms are often unsuitable for real-time systems, since the planning and model-learning components are computationally expensive. In [31], a parallel architecture called TEXPLORE is used instead, so actions can be taken quickly based on the current policy, without waiting for the planning and model updates to finish. Simulation results illustrate that the approach is able to learn and perform well after a few iterations and to execute actions in real time. Its performance was compared with that of Q-learning algorithms. Over 150 episodes, while there was no significant change in the average reward of Q-learning, the average reward of TEXPLORE increased dramatically after the 25th episode. TEXPLORE obtained significantly more reward than Q-learning in each episode.
In [46], the path design problem for a cellular-connected UAV was addressed with the goal of reducing the mission completion time. A new RL-based UAV path planning algorithm was derived, and TD was applied to directly learn the state-value function. A linear function approximation with tile coding was added to the algorithm. Function approximation has two advantages over table-based RL: it learns a parameter vector with a lower dimension than the state vector, instead of storing and updating the value function for every state, and it allows for generalization. Tile coding is used to build the feature vector. The parameter vector is updated to minimize its mean squared error based on a stochastic semi-gradient method with linear approximation for each state–reward–next-state transition observed by the agent. It is shown that TD with tile coding overcomes the problems posed by cellular networks in a complex 2 km × 2 km urban environment with high-rise buildings. Also, the accumulated rewards from the TD and TD-with-tile-coding learning algorithms are almost identical, but tile coding provides faster convergence. When tested, the UAV reached the desired location without running into the coverage holes of the cellular network.
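As an illustration of the semi-gradient TD(0) update with a linear, tile-coded value function, here is a minimal sketch; the one-dimensional tile coder, step sizes, and state scaling are assumptions for clarity and do not reproduce the 2-D map tiling of [46]:

```python
import numpy as np

# Sketch of linear value-function approximation with semi-gradient TD(0).
# The simple 1-D tile coder below is illustrative; the cited work tiles a 2-D urban map.
NUM_TILINGS, TILES_PER_TILING = 8, 10
ALPHA, GAMMA = 0.1 / NUM_TILINGS, 0.99
w = np.zeros(NUM_TILINGS * TILES_PER_TILING)            # parameter vector, not a value table

def features(x):
    """Binary tile-coded feature vector for a scalar state x in [0, 1)."""
    phi = np.zeros_like(w)
    for t in range(NUM_TILINGS):
        offset = t / (NUM_TILINGS * TILES_PER_TILING)   # each tiling is slightly shifted
        idx = int((x + offset) * TILES_PER_TILING) % TILES_PER_TILING
        phi[t * TILES_PER_TILING + idx] = 1.0
    return phi

def td0_update(x, reward, x_next):
    """Semi-gradient TD(0): w += alpha * (r + gamma*v(x') - v(x)) * grad_w v(x)."""
    phi, phi_next = features(x), features(x_next)
    td_error = reward + GAMMA * w @ phi_next - w @ phi
    w[:] += ALPHA * td_error * phi
```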
The approach discussed in [32] uses two DQNs. One is utilized for landmark detection, and the other is used to control the UAV’s vertical descent. A hierarchy of sub-policies is applied to the DQNs to reach decisions during the different navigation phases. The DQNs can autonomously decide on the next state. However, the hierarchy decreases the sophistication of the task decision. The algorithm’s performance was compared with that of an augmented reality (AR) tracker algorithm and of human pilots. The proposed algorithm is faster than a human pilot when landing on a marked pad and more robust than the AR tracker in finding the marker. The complete details of [32] can be found in [79].
Ye et al. [61] developed a DRL-based control algorithm to navigate a UAV swarm around an unexplored environment under partial observations. This is accomplished using GAT-based FANET (GAT-FANET), which is a combination of the flying ad hoc network (FANET) and the graph attention network (GAT). Partial observations lead to a loss of information. Thus, a network architecture named Deep Recurrent Graph Network (DRGN) was developed and combined with GAT-FANET to collect environment spatial information and use previous information from memory via a gated recurrent unit (GRU). A maximum-entropy RL algorithm, called the soft deep recurrent graph network (SDRGN), was developed, which is a multi-agent deep RL algorithm. It learns a DRGN-based stochastic policy with a soft Bellman function. The performance of the DRGN (a deterministic model) and of the SDRGN were compared with that of DQN, multi-actor attention critic (MAAC), CommNet, and graph convolutional RL (DGN). In a partially observable environment, the stochastic policy approach is more robust than the deterministic one. Also, GAT-FANET provides an advantage because of its memory unit. When the number of UAVs increases, more information is obtained from the GAT-FANET, and this reduces the dependency on the memory unit. Results [61] show that policies based on GAT-FANET provide better coverage performance than other policies. It is observed that graph-based communication improves performance in cooperative exploration and path planning, too. The SDRGN algorithm has lower energy consumption than DRGN, but DQN has the lowest energy consumption among the compared DRL methods. SDRGN and DRGN performance increases linearly with the number of UAVs. SDRGN shows better performance than the other DRL methods, which verifies that it has better transferability. Consequently, overall, SDRGN has better performance, scalability, transferability, robustness, and interpretability than the other DRL methods.
Abo et al. [59] solved the problem of UAV landing on a dynamic platform by taking advantage of Q-learning. Two types of adaptive multi-level quantization (AMLQ) were used: AMLQ 4A with four actions and AMLQ 5A with five actions; they were then compared with a PID controller. The PID position magnitude errors in x and y were higher than the corresponding AMLQ errors, while the oscillation in the AMLQ models was higher than in the PID controller. The developed AMLQ reduces the error on the targeted landing platform. This solution provides faster training and allows for knowledge representation without the need for a DNN.
Path planning is effectively used in several areas that include precision agriculture. Castro et al. [66] worked on adaptive path planning using DRL to inspect insect traps on olive trees. The proposed path planning algorithm includes two parts: the rapidly exploring random tree (RRT) algorithm and a DQN algorithm. The former searches for path options, while the latter performs optimized route planning, integrating environmental changes in real time; however, the training process of the DQN is completed offline. Simulation runs were performed in an area of 300 m² with 10 dynamic objects; the UAV was provided with a safe route, determined by the proposed approach, and it arrived at the insect traps to take their picture.
Shurrab et al. [68] studied the target localization problem and proposed a Q-learning-based data-driven method in which a DQN algorithm helps overcome dimensionality challenges. Data measurements from the previous and current steps, the previous action, and the direction of the nearest boundary of the UAV compose the state space. The action space includes the UAV linear velocity and the yaw angle that determines the flight direction. This approach was compared with the traditional uniform search method and the gradient descent-based ML technique; it returned better results in terms of localization and traveled distance.
Guerra et al. [50] emphasized detection and mapping for trajectory optimization. For detection, the aim is to minimize wrong detections, and for mapping, the aim is to minimize the uncertainty related to estimating the unknown environment map. The proposed MDP-based RL algorithm, inspired by Q-learning, consists of state and control estimations. The states are the UAV position (which depends on the actions), a binary parameter that indicates the presence or absence of a signal source in the environment, and the states of each cell. The action space includes the control signal to move the UAV from one cell to another in the grid (environment) map. Numerical results show that this technique provides a high probability of target detection and improves the map exploration capabilities.
Pham et al. [39] handled the UAV navigation problem using Q-learning. The navigation problem was formulated using a discretized state space within a bounded environment. The algorithm learns the action, which is the UAV moving direction in the described environment. The state space includes the distance between the UAV and the target position, and the distance to the nearest obstacle in the north, south, west, or east direction. UAV navigation following the shortest path was demonstrated.
Kulkarni et al. [52] also used Q-learning for navigation purposes. The objective is to determine the location of a victim by using an RF signal emitted from a smart device. The transmitted signal reaches the agent, and according to the received signal strength (RSS), the agent learns to choose one of eight directions separated by 45 degrees on the x-y plane. For mapping, a grid system is utilized, and each state label is correlated to a particular RSS value (two adjacent grids in the map have different RSS values). Each location on the map has a unique state. The ϵ -greedy approach provides an action to the UAV, and each episode or iteration is completed when the RSS value of the grid is determined to be greater than −21 dBm: this value means that the distance from the victim is less than 2 m. The proposed approach was tested for different starting positions on different floor plans, demonstrating that the UAV successfully reached the victim’s position.
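A minimal sketch of the action discretization and stopping rule described above follows; the step length and the way the RSS measurement is obtained are assumptions, while the eight 45-degree headings and the −21 dBm threshold come from the description:

```python
import math

# Sketch of the eight-direction action set and RSS-based stopping rule described above.
# The step length and the rss_dbm measurement source are placeholders.
STEP_M = 1.0
HEADINGS_DEG = [0, 45, 90, 135, 180, 225, 270, 315]     # eight x-y plane directions

def apply_action(position_xy, action_idx):
    """Move one grid step in the chosen heading."""
    theta = math.radians(HEADINGS_DEG[action_idx])
    x, y = position_xy
    return (x + STEP_M * math.cos(theta), y + STEP_M * math.sin(theta))

def episode_done(rss_dbm):
    """Terminate when the received signal strength indicates the victim is within ~2 m."""
    return rss_dbm > -21.0
```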
Choi et al. [33] trained a multirotor UAV by mimicking the control performance of an expert pilot. A pilot collects data from several actual flights. Then, a hidden Markov model (HMM) and dynamic time warping (DTW) are utilized to create the trajectory. Inverse RL is used to learn the hidden reward function and use it to design a controller for trajectory following. Simulations and experiments showed successful results.
Wu et al. [44] worked on the general task of ‘object finding’, for example, in rescue missions. A DQN algorithm is used for trajectory planning. Two refinements are introduced: a breakout technique for the ‘loop storm’ effect, in which the agent repeats the same state sequence of the MDP because continued actions incur no punishment, and an ‘odor’ effect, which addresses the reward not reaching its highest value as the agent gets closer to the target and thereby increases the convergence speed of the training process. It is shown that the loop storm breakout technique and the odor effect reduce the training time.
In Kersandt et al. [38], a DNN is trained with a DRL algorithm to control a fully autonomous quadcopter that is equipped with a stereo-vision camera to avoid obstacles. Three different DRL algorithms, DQN, double DQN (DDQN), and Dueling DDQN, are applied to the system. The average performance of each algorithm with respect to rewards is 33, 120, and 116, respectively; they are all below human performance. The results of applying DDQN and Dueling techniques show that the quadrotor reaches the target with 80% success.
Liu et al. [40] and Zhao et al. [49] used RL for formation control. In [40], a value-function-based RL control algorithm was applied to leader–follower quadrotors to tackle the attitude synchronization problem. The output of each quadrotor is synchronized with the output of the leader quadrotor by the designed control system. In [49], the aim was to solve the model-free robust optimal formation control problem by utilizing off-policy value-function-based algorithms. The algorithms are trained for robust optimal position and attitude controllers by using the input and output data of the quadrotors. The theoretical analysis and simulation results matched, and the robust formation control method worked effectively.
Vankadari et al. [37] and Lee et al. [36] worked on the landing task using a Least-Square Policy Iteration (LSPI) algorithm that is considered a form of approximate dynamic programming (ADP). Srivastava et al. [43] and Li et al. [57] applied an LSPI algorithm in multirotor UAVs for target tracking and trajectory planning, respectively. ADP is used to solve problems with large state or action spaces. ADP approximates the value function or policy with function approximation techniques, since storing values for every state–action pair is not practical.
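At the core of LSPI is the LSTDQ step, which fits the weights of a linear action-value approximation from a batch of samples and is followed by greedy policy improvement. The sketch below is a generic textbook form, not the formulation used in [37,43,57]; the basis function phi and the sample format are assumptions.

```python
import numpy as np

# Sketch of the LSTDQ step at the core of LSPI: fit linear weights for Q(s,a) ~ w . phi(s,a)
# from a batch of samples, then act greedily. The basis function phi is a placeholder.
GAMMA = 0.95

def lstdq(samples, phi, policy, dim, reg=1e-6):
    """samples: iterable of (s, a, r, s_next); policy(s) gives the action of the evaluated policy."""
    A = reg * np.eye(dim)
    b = np.zeros(dim)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - GAMMA * f_next)
        b += r * f
    return np.linalg.solve(A, b)                         # weight vector for the evaluated policy

def greedy_policy(w, phi, actions):
    """Policy improvement: pick the action with the highest approximated Q-value."""
    return lambda s: max(actions, key=lambda a: w @ phi(s, a))
```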
In [37], an LSPI algorithm was used to study the landing problem. An RL algorithm estimates quadrotor control velocities from instantaneous position and velocity errors. The value function of a given policy is determined by solving the Bellman equation, as applied to a linear system. In the RL algorithm, the LSPI method approximates the value function by parameterizing it with basis functions instead of computing an optimal value function exactly. The RL algorithm converges quickly and learns how to minimize the tracking error for a given set point. Different waypoints are used to train the algorithm for landing. The method can also be used effectively in noisy environments. Simulation and real-environment results demonstrate the applicability of the approach.
The research in [63] provided a low-level control approach for a quadrotor by implementing a structured online learning-based algorithm (SOL) [80] to fly and hold a hovering position at a desired altitude. The learning procedure consists of two stages: the quadrotor is first flown with almost equal pulse-width modulation (PWM) values for each rotor, and these values are collected to create an initial model; then, closed-loop learning is applied using the initial model. Before applying closed-loop learning, three pre-run flights are completed, with 634 samples collected in 68 s of flying. The state samples are obtained at each time step of the control loop, and the system model is then updated using a recursive least-squares (RLS) algorithm. After the model is updated, the value function (needed to find the control value for the next step) is updated, and the quadrotor is controlled autonomously. This online learning control approach successfully reaches the desired position and keeps the quadrotor hovering.
In [36], a trained NN was adopted for guidance in a simulation environment. A quadrotor with a PID controller and an onboard ground-looking camera was used. The camera provides the pixel deviation of the targeted landing platform from an image frame, and a laser rangefinder provides the altitude information. During training, the NN learns how to control the UAV attitude. In simulation studies, the UAV reached the proposed landing location. In experiments, the AI pilot was turned off below an altitude of 1.5 m, but the AI pilot could land at the targeted location using a vision sensor. The trajectories were not smooth because the landing location in the image was not accurately determined due to oscillations, because image processing errors occurred in the actor NN, because signal transmission created a total delay of 200 ms, and because of disturbances in real-world environments.
Three target tracking approaches that deserve attention are image-based visual servoing (IBVS), position-based visual servoing (PBVS), and direct visual servoing. In Kanellakis and Nikolakopoulos [81], IBVS was found to be the most effective approach for target tracking since it directly tackles the control problem in the image space; it is also more robust to camera calibration and depth estimation errors. Srivastava et al. [43] tracked a maneuvering target using only vision-based feedback (IBVS). However, tracking is difficult when using only monocular vision without depth measurements. This deficiency is overcome with an RL technique in which optimal control policies for tracking the target quadrotor are learned by LSPI. Two different basis functions (with and without a velocity basis) and four types of reward functions (an exponential-only reward, a quadratic reward without velocity control, a quadratic reward with velocity control, and an asymmetric reward) are described in [43]. The basis function with a velocity basis shows better performance than the one without.
In [57], the objective is to solve the problem of cable-suspended load transportation utilizing three quadrotors. The trajectory planning method is based on a value-function approximation algorithm with the aim to reach the final position as fast as possible while keeping the load stable. This method includes two processes: trajectory planning and tracking. The trajectory planning process consists of parameter learning and trajectory generation. Training and learning help determine the parameter vector of the approximate value function (parameter learning part). In the trajectory generation phase, the value function is approximated by using the learned parameters from the former stage, and the flight trajectory is determined via a greedy strategy. The effectiveness of the load trajectory and the physical effect on the quadrotor flight were checked based on the trajectory tracking process. The quadrotors are independent; in the trajectory tracking phase, positions and attitudes are controlled with a hierarchical control scheme using PID controllers (transmitting the position of the load to the controller of the quadrotor). The results show that the actual value function is successfully estimated. Also, the value function confirms that the proposed algorithm works effectively.
Xia et al. [64] use an RL control method for autonomous landing on a dynamic target. Unlike other studies, position and orientation constraints for safe and accurate landing are described. Adaptive learning and a cascaded dynamic estimator are utilized to create a robust RL control algorithm. In the adaptive learning part, the critic network weight is formulated and calculated in an adaptive way. Also, the stability of the closed-loop system is analyzed.
In summary, this part of the survey demonstrates how value-function-based RL algorithms are implemented and tested for UAV navigation and control. In detail, Xu et al. [16] integrated a DNN with a double DQN algorithm for quadrotor landing, yielding improved results. Imanberdiyev et al. [31] employed model-based RL for efficient path planning, outperforming traditional methods. Zeng et al. [46] introduced an RL-based algorithm with tile coding for UAV path planning, showcasing enhanced convergence. Polvara et al. [32] proposed a hierarchical RL approach using dual DQNs for UAV landing, exhibiting robust performance. Ye et al. [61] developed a DRL-based control algorithm for UAV swarm navigation, highlighting the efficacy of graph-based communication. Abo et al. [59] studied UAV landing on dynamic platforms using Q-learning with adaptive quantization, achieving improved accuracy. Other reviewed studies explored RL algorithms for path planning, trajectory optimization, target tracking, and formation control, illustrating the versatility and efficacy of RL techniques. Overall, value-function-based RL methods offer powerful tools for UAV navigation and control, enabling efficient decision-making and adaptation to dynamic environments. These methods continue to be refined for a wide range of UAV tasks, promising advancements in the underlying UAV technology.

3.3.2. Policy-Search-Based Algorithms

Value-function-based methods estimate the value of each of the agent’s possible actions and choose the action with the best value. In policy-search-based methods, the probability distribution over all available actions plays the key role, and the agent’s action at each time step is drawn from it. A comparison of value-function-based and policy-search-based algorithms is provided in Table 3.
Kooi and Babuška [54] developed an approach using deep RL to land a quadrotor on an inclined surface autonomously. Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor–Critic (SAC) algorithms were applied to solve this problem. The TD3 and SAC algorithms successfully trained the set-point tracking policy network, but the PPO algorithm was trained in a shorter time and produced a better final policy. The trained policies can be deployed in real time.
Hu and Wang [53] utilized an advanced PPO RL algorithm to find the optimal stochastic control strategy for quadrotor speed. During training, actor and critic NNs are used. They share the same nine-dimensional state vector (Euler angles, Euler angle derivatives, and errors between expected and current velocities after integration on the x, y, and z axes). An integral compensator is applied to both NNs to improve speed-tracking accuracy and robustness. The learning approach includes online and offline components. In the offline learning phase, a flight control strategy is learned using a simplified quadrotor model, and this strategy is continuously optimized in the online learning phase. In offline learning, the critic NN evaluates the current action to determine an advantage value, with a higher learning rate chosen to improve the evaluation. In online learning, the actor NN is composed of four trainable policy sub-networks. The state vector is used as input to the four sub-networks; their outputs are the mean and variance of four corresponding Gaussian distributions, each normalized to [0, 1]. The parameters of the four policy sub-networks are also copied into old policy networks that are untrainable, so the old policy sub-network parameters remain fixed. The four policy sub-networks in the actor NN are trained to produce new actions in the next batch. When the new actions are applied to the quadrotor, the new states are recorded in a buffer. After the integration and compensation process, a batch of state vectors is used as input to the critic NN. The critic NN outputs the batch of advantage values, which is used to evaluate the quality of the actions taken to reach these states. The parameters of the critic NN are updated by minimizing the advantage value per batch. The policy network is updated per batch using the action vectors taken from the old policy network, the state vectors from the buffer, and the advantage values from the critic NN.
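For reference, the clipped surrogate objective that PPO maximizes is sketched below in a minimal form; the batch arrays and clip coefficient are placeholders, and the integral compensator of [53] is not modeled.

```python
import numpy as np

# Sketch of PPO's clipped surrogate objective (maximized w.r.t. the new policy's parameters).
# Inputs are batches produced by rolling out the old policy; all arrays are placeholders.
CLIP_EPS = 0.2

def ppo_clip_objective(logp_new, logp_old, advantages):
    """L_clip = mean(min(r*A, clip(r, 1-eps, 1+eps)*A)) with r = pi_new(a|s) / pi_old(a|s)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```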
In [53], the PPO and the PPO-IC algorithms were compared with the offline PPO one and a well-tuned PID controller. The average linear velocity steady-state error of the PPO-IC approaches zero faster, and it is smaller than that of PPO. The average accumulated reward of the PPO-IC reaches a higher value. The PPO-IC converges closer to the targeted velocity on the x-, y-, and z-axes than PPO. PPO-IC velocity errors on the x-, y-, and z-axes are much smaller compared to the PPO errors. The Euler angle errors are also smaller in the PPO-IC algorithm. In the offline learning phase, the nominal quadrotor weight is increased by 10% in each step until it reaches 150% of the nominal weight. The performance of the well-tuned PID controller and the proposed method were compared. When the quadrotor weight increased, the velocity error along the z-axis increased, too, but the PPO-IC algorithm demonstrated stable behavior without fluctuations in speed tracking. Moreover, 12 experiments were conducted when the nominal 0.2 m radius of the quadrotor was increased from 50% to 550%, that is, from 0.1 m to 1.1 m. PID and PPO-IC performed similarly when the radius was between 0.2 and 0.4 m. For higher values, the PID performance decreased, even when convergence to the desired value was observed. However, the PID controller could not control the quadrotor: it became unstable when the radius increased to more than 1 m. On the contrary, changes in the radius value slightly affected the performance of the PPO-IC algorithm.
Kahn et al. [34] and Bhan et al. [56] worked on failure avoidance and on compensating for failures occurring during flights. In [34], a Policy Learning using Adaptive Trajectory Optimization (PLATO) algorithm, a continuous, reset-free RL algorithm, was developed. In PLATO, complex control policies are trained with supervised learning using model predictive control (MPC) to observe the environment. Partially trained and unsafe policies are not utilized in the action decision. During training, catastrophic failures are minimized by taking advantage of MPC’s robustness, since it is not necessary to run the learned NN policy during training time. It was shown that the adaptive MPC achieved good long-horizon performance of the resulting policy. In [56], accommodation and recovery from faults occurring in an octocopter were achieved using a combination of parameter estimation, RL, and model-based control. Fault-related parameters are estimated using an Unscented Kalman Filter (UKF) or a Particle Filter (PF). These fault-related parameters are given as inputs to a DRL agent, and the action NN in the DRL provides a new set of control parameters. In this way, the PID controller is updated when the control performance is affected by the parameter(s) correlated with faults.
In [42], a DRL technique was applied to a hexacopter to learn stable hovering in a state–action environment. The DRL used for training is a model-free, on-policy, actor–critic-based algorithm called Trust Region Policy Optimization (TRPO). Two NNs are used as nonlinear function approximators. Experiments showed that such a learning approach achieved successful results and facilitated controller design.
Yoo et al. [18] combined RL and deterministic controllers to control a quadrotor. Five different methods, the original probabilistic inference for learning control (PILCO), PD-RL with high gain, PD-RL with low gain, LQR-RL, and LQR-RL with model uncertainty, were compared via simulations for when the quadrotor tracks a circular reference trajectory. The high-gain PD-RL approaches the reference trajectory quickly. The low-gain PD-RL behaves less aggressively and reference trajectory tracking is delayed. The convergence rates of the PD-RL and LQR-RL methods are better. The performance is also better when compared to the original PILCO. The main advantages of combining a deterministic controller with PILCO are simplicity and rapid learning convergence.
In [41], errors in the pitch and roll angles were minimized to provide stability during hovering. A user-designed objective function uses simulated trajectories to choose the best action and also minimizes the cost of each state. The performance of this controller is worse than that of a typical quadrotor controller. However, the proposed controller achieved hovering for up to 6 s after training on 3 min of data.
Thus, this part of the survey focuses on policy-search-based methods. In detail, Kooi and Babuška [54] employed deep RL algorithms (PPO, TD3, and SAC) to autonomously land quadrotors on inclined surfaces, with PPO demonstrating superior performance. Hu and Wang [53] utilized an advanced PPO algorithm to optimize stochastic control strategies for quadrotor speed, outperforming traditional PID controllers. Kahn et al. [34] and Bhan et al. [56] addressed failure avoidance and fault compensation in UAV flights using RL and model-based control techniques. Other techniques integrate RL with deterministic controllers, enhancing trajectory tracking and stability during flight maneuvers. All reviewed methods collectively showcase the effectiveness of policy-search-based RL techniques in overcoming challenges in UAV control, from autonomous landing to fault tolerance and trajectory tracking.

3.3.3. Actor–Critic Algorithms

Actor–critic algorithms combine value-function-based and policy-search-based methods. The actor refers to the policy-search component and chooses the actions in the environment; the critic refers to the value-function component and evaluates the actor using the value function.
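A minimal one-step actor–critic sketch with linear approximators illustrates this interplay: the critic’s TD error evaluates the action just taken and drives both the critic and the actor updates. The feature vectors, learning rates, and softmax policy parameterization are assumptions, not taken from any cited work.

```python
import numpy as np

# Minimal one-step actor-critic sketch with linear approximators.
# The critic's TD error evaluates the actor's action and drives both updates.
ALPHA_W, ALPHA_THETA, GAMMA = 0.01, 0.001, 0.99
NUM_ACTIONS, FEAT_DIM = 4, 8
w = np.zeros(FEAT_DIM)                       # critic weights: V(s) ~ w . phi(s)
theta = np.zeros((NUM_ACTIONS, FEAT_DIM))    # actor weights: softmax policy over actions

def policy_probs(phi):
    """Softmax policy over action preferences theta_a . phi."""
    prefs = theta @ phi
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def actor_critic_step(phi, action, reward, phi_next, done):
    """One transition: critic TD update + actor policy-gradient update."""
    v, v_next = w @ phi, 0.0 if done else w @ phi_next
    td_error = reward + GAMMA * v_next - v           # critic's evaluation of the action
    w[:] += ALPHA_W * td_error * phi                 # critic moves V(s) toward the TD target
    probs = policy_probs(phi)
    grad_logp = -np.outer(probs, phi)                # d log pi(a|s) / d theta for softmax policy
    grad_logp[action] += phi
    theta[:] += ALPHA_THETA * td_error * grad_logp   # actor ascends the policy gradient
```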

In [58], three different RL algorithms, DDPG, TD3, and SAC, were applied to study multirotor landing. Using the DDPG method did not result in successful landings. The TD3 and SAC methods successfully completed the landing task. However, TD3 required a longer training period and landing was not as smooth, most likely because of noise present in the algorithm.
Rodriguez et al. [17] studied landing on a dynamic/moving platform using DDPG. Slow and fast scenarios were tried in 150 test episodes. During the slow scenario, the moving platform (the moving platform trajectory was periodic) velocity was 0.4 m/s, and during the fast scenario, it was set to 1.2 m/s. The success rates were 90% and 78%, respectively. Using a constant velocity on the z-axis resulted in landing failure on the moving platform. This problem may be overcome by using the velocity on the z-axis as a state, but this makes the training process more complicated, and learning the landing process becomes more challenging.
Rubi et al. [47] solved the quadrotor path-following problem using a deep deterministic policy gradient (DDPG) reinforcement learning algorithm. A lemniscate and one lap of a spiral path were used to compare agents with different state configurations in DDPG and in an adaptive Nonlinear Guidance Law (NLGL) algorithm. The agent has only two states: distance error and angle error. According to the results, the adaptive NLGL has a lower distance error than the two-state agent, but its distance error is significantly greater than that of the agent with the future states on the lemniscate path.
Rubi et al. [55] also used three different approaches to solve the path-following problem with DDPG. The first agent utilizes only instantaneous information, the second uses a structure in which the agent anticipates the upcoming curve, and the third agent computes the optimal speed according to the shape of the path. The lemniscate and spiral paths were used to test the three agents. The lemniscate path was used in both the training and test phases. The agents were evaluated in tests, with the assumption that the third agent is also limited by a maximum velocity of 1 m/s. For the lemniscate path, the agents were first tested with ground-truth measurements. The second agent showed the best performance with respect to cross-track error. When the agents were tested with the sensor model, the third agent showed slightly better performance in terms of cross-track errors. Then, all agents were tested on the spiral path. When the performance of the agents was compared in simulations with ground-truth measurements and with sensor models, the third agent (with a maximum velocity of 1 m/s) showed the best performance in terms of position error. In all tests, the third agent (without a maximum velocity limitation) completed the tracks faster.
Wang et al. [45] handled the UAV navigation problem in a large-scale environment using DRL. Two policy gradient theorems within the actor–critic framework are derived to solve the problem, which is formulated as a partially observable Markov decision process (POMDP). As opposed to conventional navigation methods, raw sensor measurements are utilized in DRL; control signals are the output of the navigation algorithm. Stochastic and deterministic policy gradients for POMDP are applied to the RL algorithm. The stochastic policy requires samples from both the state and action spaces. The deterministic policy requires only samples from the state space. Therefore, the RL algorithm with a deterministic policy is faster (and preferred): it is called a fast recurrent deterministic policy gradient algorithm (Fast-RDPG). For comparisons, four different large-scale complex environments were built with random-height buildings to test the DDPG, RDPG, and Fast-RDPG. The success rate of the Fast-RDPG was significantly higher in all environments. Fast-RDPG had the lowest crash rate in one environment. DDPG provided the best performance with respect to the average crash rate in all environments. Fast-RDPG had a much lower crash rate than RDPG. However, Fast-RDPG provided a much lower stray rate than the other algorithms in all environments.
Li et al. [75] developed a new DRL-based flight resource allocation framework (DeFRA) for a typical UAV-assisted wireless sensor network used for smart farming (crop growth condition). DeFRA reduces the overall data packet loss in a continuous action space. A DDPG is used in DeFRA, and DeFRA learns to determine the instantaneous heading and speed of the quadrotor and to choose the ground device to collect data from the field. The time-varying airborne channels and energy arrivals at ground devices cause variations in the network dynamics. The network dynamics are estimated by a newly developed state characterization layer based on LSTM in DeFRA. An MDP simultaneously handles the control of the quadrotor’s maneuver and the communication schedule according to decision parameters (time-varying energy harvesting, packet arrival, and channel fading). The state space comprises the battery level, the data buffer length of all ground devices, the battery level and location of the UAV, the channel gain between the UAV and the ground devices, and the time-span parameter of the ground device. The UAV’s current battery level depends on the battery level of the UAV in the previous time step, harvested energy, and energy consumption. The quadrotor is required to keep its battery level equal to or higher than the battery level threshold. The performance was compared with two DRL-based policies, DDPG-based Movement Control (DDPG-MC) and DQNs-based Flight Resource Allocation Policy (DQN-FRA), and with two non-learning heuristics, Channel-Aware Waypoint Selection (CAWS) and Planned Trajectory Random Scheduling (PTRS). DeFRA provides lower packet loss than other methods. The relation between the packet loss rate and the number of ground devices was investigated according to all methods. The DRL-based methods outperformed CAWS and PTRS. For up to 150 ground devices, DeFRA and DDPG-MC showed similar performance and were better than the other methods, but after increasing the number of ground devices to 300, DeFRA provided better performance than DDPG-MC.
Pi et al. [48] created a low-level quadrotor control algorithm to hover at a fixed point and to track a circular trajectory using a model-free RL algorithm. A combination of on-policy and off-policy methods is used to train the agent. The standard policy gradient method determines the update direction within the parameter space, while the TRPO and PPO algorithms are designed to identify an appropriate update size. For updating the policy, however, the proposed model establishes new updating criteria that extend beyond the parameter space, concentrating on local improvement. The NN output provides the thrust of each rotor. The simulator was created in Python using the dynamic model of Stevens et al. [82]. The effects of the rotation matrix and quaternion representations were investigated in the learning process. The model with the quaternion may converge more slowly during training than the model with the rotation matrix; however, both models showed similar performance when tested.
Ma et al. [65] developed a DRL-based algorithm for trajectory tracking under wind disturbance. The agent learns to determine the rotation speed of each rotor of a hexacopter. A DDPG algorithm is used, but in addition to the existing DDPG algorithm, a policy relief (PR) method based on an epsilon-greedy exploration technique and a significance weighting (SW) method are integrated into the DDPG framework. The former improves the agent’s exploration skills and its adaptation to environmental changes. The latter helps the agent update its parameters in a dynamic environment. In training, the implementation of the PR and SW methods in the DDPG algorithm provides better exploration performance and faster convergence of the learning process, respectively, even in a dynamic environment. This method reaches a higher average reward and has a lower position error compared to DDPG, DDPG with PR, and DDPG with SW. Also, this algorithm provides higher control accuracy than the cascaded active disturbance rejection control algorithm in terms of position, velocity, acceleration, and attitude errors.
Hwangbo et al. [35] proposed a method to increase UAV stabilization. An NN trained with RL improves UAV stability. Monte Carlo samples produced by on-policy trajectories are used for the value function. The value network is used to guide policy training, and the policy network controls the quadrotor; both are updated in every iteration. A new analytical measure of the distance between the current action distribution and the new policy is used for policy optimization. The resulting policy network produces an accurate reaction to a step response and stabilizes the quadrotor even under extreme situations. The algorithm shows better performance than DDPG and Trust Region Policy Optimization with a generalized advantage estimator (TRPO-gae) in terms of computation time.
Mitakidis et al. [67] also studied the target tracking problem. A CNN-based target detection algorithm is used on an octocopter platform to track a UGV. DDPG-RL is applied in a hierarchical controller (instead of a position controller). The CNN learns to detect the UGV, and the DDPG-RL algorithm learns to determine the roll, pitch, and yaw actions in the outer loop of the controller. These actions, taken from the NN output, are normalized to the range [−1, 1] and then scaled to the range of acceptable values: the roll and pitch actions span between −3 and 3 degrees, while the yaw action ranges between −5 and 5 degrees. An experiment was conducted with a low-altitude octocopter and with manual control of the UGV. Fluctuations were observed in the distance error due to the aggressive maneuvers of the UGV, but overall, the results are good.
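A small sketch of this output scaling step follows; the use of clipping to enforce the normalized range is an assumption, while the ±3-degree roll/pitch and ±5-degree yaw limits come from the description above.

```python
import numpy as np

# Sketch of mapping a policy network's normalized outputs in [-1, 1] to the
# attitude command ranges described above (roll/pitch: +/-3 deg, yaw: +/-5 deg).
ACTION_LIMITS_DEG = np.array([3.0, 3.0, 5.0])            # roll, pitch, yaw limits

def scale_actions(normalized_actions):
    """Clip to [-1, 1] (e.g., a tanh output) and scale to physical command ranges."""
    a = np.clip(np.asarray(normalized_actions, dtype=float), -1.0, 1.0)
    return a * ACTION_LIMITS_DEG

print(scale_actions([0.5, -1.0, 0.2]))                   # -> [ 1.5 -3.   1. ]
```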
Li et al. [51] studied the target tracking problem in uncertain environments. The proposed approach consists of a TD3 algorithm and meta-learning. The resulting algorithm is named meta twin delayed deep deterministic policy gradient (meta-TD3). TD3 learns to control the linear acceleration of the UAV and the angular velocity of the heading angle. The state space includes the position of the quadrotor in the x-y plane, the heading angle, the linear velocity, the angle between the motion direction and the straight line between the UAV and the target, and the Euclidean distance between the UAV and the target. Meta-learning overcomes the multi-task learning challenge. Tasks are trajectories of the ground vehicle that is followed. A replay buffer is built for the task experience. When the agent interacts with the environment, the state, the action, the reward value, and the next-step state that correspond to the task are saved into the replay buffer. The method provides a significant improvement in the convergence value and rate. Meta-TD3 adapts to the different movements of the ground vehicle faster than the TD3 and DDPG algorithms and tracks the target more effectively.
Panetsos et al. [60] offer a solution to the payload transportation challenge using a DRL approach. An attitude PID controller is used in the inner loop of the cascaded controller structure, while a position controller in the outer loop is replaced with a TD3-based DRL algorithm. The DRL algorithm in the outer loop learns to create the reference Euler angles, the roll and pitch, and the reference translational velocity of the octocopter on the z-axis. The method controls the system successfully to reach the desired waypoints.
Wang and Ye [62] developed consciousness-driven reinforcement learning (CdRL) for trajectory tracking control. The CdRL learning mechanism consists of online attention learning and consciousness-driven actor–critic learning. The former selects the best action. The latter increases the learning efficiency based on the cooperation of all subliminal actors. Two different attention-learning methods are utilized for online attention learning: short-term attention learning and long-term attention learning. The aim of the former is to select the best action; the latter selects the best action to sustain the system’s stability. The long- and short-term attention arrays are combined to decide which actor should be given more attention. This learning algorithm was compared with Q-learning: the position error of the proposed algorithm was lower than that of Q-learning, and the same was observed for the velocity error. The method was also slightly better than Q-learning in terms of attitude error. The UAV was successfully controlled to track the desired trajectory by the CdRL algorithm.
Xu et al. [83] created a benchmark using PPO, SAC, DDPG, and DQN algorithms for single-agent tasks and multi-agent PPO (MAPPO), heterogeneous-agent PPO (HAPPO), multi-agent DDPG (MADDPG), and QMIX algorithms for multi-agent tasks with different drone systems. Single-agent tasks include hovering, trajectory tracking, and flythrough. Multi-agent tasks cover hover, trajectory tracking, flythrough, and formation. To increase the task variation, the payload, inverse pendulum, and transportation challenges are integrated into the single- and multi-agent tasks. The learning performance differs based on specific tasks.
Thus, this subsection has summarized applications of actor–critic RL to UAV control. Studies like [58] tested algorithms such as DDPG, TD3, and SAC for multirotor landing, with SAC and TD3 performing well. Rubi et al. [47,55] examined quadrotor path following, while Wang et al. [45] focused on UAV navigation in complex environments, favoring Fast-RDPG. Li et al. [75] introduced DeFRA for UAV-assisted networks, outperforming heuristics. Pi et al. [48] addressed low-level quadrotor control, and Ma et al. [65] developed a DDPG-based algorithm for trajectory tracking under wind. Various other studies explored target tracking, payload transportation, and consciousness-driven RL for trajectory control, demonstrating RL’s effectiveness in diverse UAV applications. Xu et al. [83] established a benchmark for UAV tasks, evaluating the performance of different RL algorithms.
