To overcome the difficulties of the sampling task, namely its many operational steps, numerous constraints, and harsh working environment [21], the Chang'E-5 sampling manipulator adopts a 4-degree-of-freedom "shoulder 2 + elbow 1 + wrist 1" configuration [2]. The link coordinate frames of the manipulator in the deployed state are defined in Fig. 4, and its D-H parameters are listed in Table 1.
Table 1. The D-H parameters of the manipulator

i | θi/(°) | αi−1/(°) | ai−1/mm | di/mm
1 | θ1     | 90       | 0       | 101.0
2 | θ2     | 0        | 0       | 85.5
3 | θ3     | 0        | 1970    | 96.0
4 | θ4     | 0        | 1770    | 93.0

The forward kinematics of the lunar surface sampling manipulator describes the mapping from the manipulator's joint space to the Cartesian space of its end effector. Using the algebraic method, the link homogeneous transformation matrix is
$$ {}_{i-1}^{i}{\boldsymbol{T}} = \left[ {\begin{array}{*{20}{c}} {\cos{\theta _i}}&{ -\sin{\theta _i}}&0&{{a_{i-1}}} \\ {\sin{\theta _i}\cos{\alpha _{i-1}}}&{\cos{\theta _i}\cos{\alpha _{i-1}}}&{ -\sin{\alpha _{i-1}}}&{ -\sin{\alpha _{i-1}}{d_i}} \\ {\sin{\theta _i}\sin{\alpha _{i-1}}}&{\cos{\theta _i}\sin{\alpha _{i-1}}}&{\cos{\alpha _{i-1}}}&{\cos{\alpha _{i-1}}{d_i}} \\ 0&0&0&1 \end{array}} \right] $$ (1)

Substituting each row of the D-H parameters in Table 1 into the homogeneous transformation matrix gives:
$$ {}_0^1{\boldsymbol{T}} = \left[ {\begin{array}{*{20}{c}} {\cos{\theta _1}}&{ -\sin{\theta _1}}&0&0 \\ 0&0&{ -1}&{ -{d_1}} \\ {\sin{\theta _1}}&{\cos{\theta _1}}&0&0 \\ 0&0&0&1 \end{array}} \right] $$ (2)

$$ {}_1^2{\boldsymbol{T}} = \left[ {\begin{array}{*{20}{c}} {\cos{\theta _2}}&{ -\sin{\theta _2}}&0&0 \\ {\sin{\theta _2}}&{\cos{\theta _2}}&0&0 \\ 0&0&1&{{d_2}} \\ 0&0&0&1 \end{array}} \right] $$ (3)

$$ {}_2^3{\boldsymbol{T}} = \left[ {\begin{array}{*{20}{c}} {\cos{\theta _3}}&{ -\sin{\theta _3}}&0&{{a_2}} \\ {\sin{\theta _3}}&{\cos{\theta _3}}&0&0 \\ 0&0&1&{{d_3}} \\ 0&0&0&1 \end{array}} \right] $$ (4)

$$ {}_3^4{\boldsymbol{T}} = \left[ {\begin{array}{*{20}{c}} {\cos{\theta _4}}&{ -\sin{\theta _4}}&0&{{a_3}} \\ {\sin{\theta _4}}&{\cos{\theta _4}}&0&0 \\ 0&0&1&{{d_4}} \\ 0&0&0&1 \end{array}} \right] $$ (5)
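The end-effector pose in the base frame follows from chaining these link transforms, ${}_0^4{\boldsymbol{T}} = {}_0^1{\boldsymbol{T}}\,{}_1^2{\boldsymbol{T}}\,{}_2^3{\boldsymbol{T}}\,{}_3^4{\boldsymbol{T}}$. As a minimal illustration (not flight code), the Python sketch below builds each modified D-H transform of Eq. (1) from Table 1 and multiplies out the chain; the joint angles in the usage line are arbitrary placeholder values.

```python
import numpy as np

def dh_transform(theta, alpha_prev, a_prev, d):
    """Modified D-H (Craig) link transform of Eq. (1); angles in radians."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha_prev), np.sin(alpha_prev)
    return np.array([
        [ct,      -st,      0.0,  a_prev],
        [st * ca,  ct * ca, -sa,  -sa * d],
        [st * sa,  ct * sa,  ca,   ca * d],
        [0.0,      0.0,      0.0,  1.0],
    ])

# D-H parameters from Table 1: (alpha_{i-1} [rad], a_{i-1} [mm], d_i [mm])
DH = [
    (np.deg2rad(90), 0.0,    101.0),  # link 1
    (0.0,            0.0,     85.5),  # link 2
    (0.0,            1970.0,  96.0),  # link 3
    (0.0,            1770.0,  93.0),  # link 4
]

def forward_kinematics(q):
    """Chain 0_4 T = 0_1 T  1_2 T  2_3 T  3_4 T for joint angles q (rad)."""
    T = np.eye(4)
    for theta, (alpha, a, d) in zip(q, DH):
        T = T @ dh_transform(theta, alpha, a, d)
    return T

# Example: end-effector position (mm) in the base frame for placeholder angles
T04 = forward_kinematics(np.deg2rad([30.0, -45.0, 60.0, 10.0]))
print(T04[:3, 3])
```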
As the evaluation criterion of the deep reinforcement learning algorithm, the reward function determines the update direction of the sampling manipulator's motion policy. This paper takes safety, rapidity, and reachability as the constraints of the reward function and designs a multi-constraint reward function. With the reward function as the evaluation criterion, the space manipulator adjusts the network weights through interaction with the environment so that the objective function reaches its optimum. The flow of the multi-constraint reward function is shown in Fig. 5; a code sketch of the reward logic follows the step listing below.
Input: current state St, goal coordinates Cg, the safety evaluation function f1(θ), the time-efficiency evaluation function f2(θ), and the reachability evaluation function f3(θ);
Output: the reward Rt for the current state and the episode flag Done (Done = True means the episode has ended; Done = False means it has not);
Step 1 (state initialization): read the joint angles and end-effector pose of the sampling manipulator from the current state St;
Step 2 (motion policy selection): the manipulator randomly selects a motion direction and travel distance from the policy library and moves from the current position pt to the next position pt+1;
Step 3 (collision detection): compute the distances from each link and the end effector of the sampling manipulator to the obstacles; if the result of f1(θ) indicates a collision, apply a negative reward of −100, i.e. Rt = Rt − 100;
Step 4 (approach reward): according to the result of f2(θ), if the new position pt+1 is closer to the goal point than the previous position pt, update the reward as Rt = Rt + 0.1 × ||pt+1 − pt||;
Step 5 (goal-arrival check): if the manipulator reaches the goal point, increment the arrival counter, i.e. On_goal = On_goal + 1;
Step 6 (episode-end check): to eliminate jitter at the goal point, the episode ends only after the sampling manipulator has reached the goal point and held it for a sustained period, i.e. when On_goal > 50; then set Done = True and apply a positive reward of +100, i.e. Rt = Rt + 100;
Step 7 (output): output the reward Rt for the current state and the episode flag Done.
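As a concrete reading of Steps 3-6, the following Python sketch computes the reward for one transition. The arrival tolerance goal_tol, the reset of the On_goal counter when the arm leaves the goal, and the boolean collides flag (standing in for the output of f1(θ)) are illustrative assumptions not specified in the paper.

```python
import numpy as np

GOAL_HOLD_STEPS = 50  # On_goal threshold from Step 6

def multi_constraint_reward(p_prev, p_next, goal, collides, on_goal_count,
                            goal_tol=10.0):
    """Return (reward, done, on_goal_count) for one transition.

    p_prev, p_next, goal: end-effector positions (mm); collides: result of
    the collision check f1; goal_tol is an assumed arrival tolerance.
    """
    reward = 0.0
    done = False

    # Step 3: collision penalty
    if collides:
        reward -= 100.0

    # Step 4: approach reward, proportional to the distance moved
    if np.linalg.norm(p_next - goal) < np.linalg.norm(p_prev - goal):
        reward += 0.1 * np.linalg.norm(p_next - p_prev)

    # Steps 5-6: require the arm to hold the goal point to filter out jitter
    if np.linalg.norm(p_next - goal) < goal_tol:
        on_goal_count += 1
    else:
        on_goal_count = 0  # assumption: leaving the goal resets the counter
    if on_goal_count > GOAL_HOLD_STEPS:
        reward += 100.0
        done = True

    return reward, done, on_goal_count
```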
Based on the kinematic model of the sampling manipulator established above, this paper applies DQN to manipulator path planning. The reward function determines the direction of the policy update, and the network weights are updated through training. The path-planning algorithm is shown in Fig. 6; a minimal training-loop sketch follows the step listing below.
Step 1 (initialization): randomly initialize the weights ω of the deep Q-network, assign values to the hyperparameters α and λ, and clear the replay buffer M;
Step 2 (episode-end check): check whether the episode counter has reached the maximum number of episodes; if so, end training; otherwise continue with Steps 3-10;
Step 3 (step-count check): check whether the step counter has reached the maximum number of steps; if so, end the current episode; otherwise continue with Steps 4-10;
Step 4 (action selection): select and execute an action according to the current state; to explore potentially better policies, actions are selected with the ε-greedy policy [22];
Step 5 (reward computation): call the multi-constraint reward function to obtain the reward Rt;
Step 6 (state update): update the state from St to St+1;
Step 7 (buffer update): store the latest experience in the replay buffer M;
Step 8 (random sampling): randomly sample n experience tuples from the replay buffer M;
Step 9 (Q-network update): update the Q-network with the sampled experience, Q(St, At) ← Q(St, At) + α[Rt+1 + λ maxa Q(St+1, a) − Q(St, At)];
Step 10 (weight update): backpropagate the loss of the Q-network training to update the network weights.
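The sketch below assembles Steps 1-10 into a minimal DQN training step in PyTorch. The network architecture, state/action dimensions, and hyperparameter values are placeholder assumptions, not the mission's actual design; a single Q-network is used, matching the step listing above (production DQN implementations usually add a separate target network).

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

state_dim, action_dim = 10, 6          # placeholder sizes, not from the paper
q_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                      nn.Linear(128, action_dim))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)         # replay buffer M (Step 7)
gamma, epsilon, n_batch = 0.99, 0.1, 64  # gamma plays the role of lambda above

def select_action(state):
    """Step 4: epsilon-greedy action selection [22]."""
    if random.random() < epsilon:
        return random.randrange(action_dim)
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q.argmax())

def train_step():
    """Steps 8-10: sample n transitions, form the TD target, backpropagate."""
    if len(buffer) < n_batch:
        return
    s, a, r, s2, done = map(np.array, zip(*random.sample(buffer, n_batch)))
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    mask = torch.as_tensor(1.0 - done, dtype=torch.float32)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(St, At)
    with torch.no_grad():                                      # TD target
        target = r + gamma * mask * q_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                # Step 10
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Outer loop (Steps 1-3, 5-7), assuming a hypothetical environment `env`
# exposing reset() -> state and step(action) -> (state, reward, done):
# for episode in range(max_episodes):
#     state, done, t = env.reset(), False, 0
#     while not done and t < max_steps:
#         action = select_action(state)
#         next_state, reward, done = env.step(action)
#         buffer.append((state, action, reward, next_state, done))
#         train_step()
#         state, t = next_state, t + 1
```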
Path Planning of Lunar Surface Sampling Manipulator for Chang'E-5 Mission
Abstract: Aiming at the precise-control problem of the sampling manipulator in the Chang'E-5 lunar surface sampling mission, a path planning method based on deep reinforcement learning is proposed. By designing a multi-constraint reward function for the deep reinforcement learning algorithm, a motion path satisfying the three constraints of safety, rapidity, and reachability is planned, and precise control of the sampling manipulator is realized. On the premise of ensuring mission safety, the method shortens the ground-space interaction time, and the manipulator control is smooth and stable. On-orbit experimental results show that the method has high accuracy and robustness, and it can serve as a reference for on-orbit teleoperated sampling tasks in subsequent deep space exploration missions.

Highlights
● A path planning method for the lunar surface sampling manipulator based on deep reinforcement learning is proposed.
● The control problem of a slender, flexible manipulator is solved.
● The deep reinforcement learning control method has high accuracy and robustness.
● The method improves the efficiency of on-orbit mission implementation.
[1] WANG Q, HOU J, LIU R, et al. Review of China's first lunar surface sampling and return mission[J]. Aerospace China, 2021(3): 34-39. doi: 10.3969/j.issn.1002-7742.2021.03.007
[2] MA R Q, JIANG Q S, LIU B, et al. Design and verification of a lunar sampling manipulator system[J]. Journal of Astronautics, 2018, 39(12): 5-12.
[3] TANG L, LIANG C C, WANG Y B, et al. Research on flexible compensation control strategy for planetary surface sampling manipulator[J]. Journal of Mechanical Engineering, 2017, 53(11): 97-103. doi: 10.3901/JME.2017.11.097
[4] NAKANISHI H, YOSHIDA K. Impedance control for free-flying space robots: basic equations and applications[C]//International Conference on Intelligent Robots and Systems. [S.l.]: IEEE, 2006.
[5] SCHIELE A, HIRZINGER G. A new generation of ergonomic exoskeletons: the high-performance X-Arm-2 for space robotics telepresence[C]//International Conference on Intelligent Robots and Systems. [S.l.]: IEEE, 2011.
[6] NANOS K, PAPADOPOULOS E. On the use of free-floating space robots in the presence of angular momentum[J]. Intelligent Service Robotics, 2011, 4(1): 3-15. doi: 10.1007/s11370-010-0083-2
[7] SUTTON R S, BARTO A G. Introduction to reinforcement learning[M]. Cambridge: MIT Press, 1998.
[8] MAEDA Y, WATANABE T, MORIYAMA Y. View-based programming with reinforcement learning for robotic manipulation[C]//IEEE International Symposium on Assembly and Manufacturing. [S.l.]: IEEE, 2011.
[9] PARK J J, KIM J H, SONG J B. Path planning for a robot manipulator based on probabilistic roadmap and reinforcement learning[J]. International Journal of Control, Automation and Systems, 2007, 5(6): 674-680.
[10] LANGE S, RIEDMILLER M, VOIGTLANDER A. Autonomous reinforcement learning on raw visual input data in a real world application[C]//International Joint Conference on Neural Networks. [S.l.]: IEEE, 2012.
[11] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436. doi: 10.1038/nature14539
[12] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates, Inc., 2012.
[13] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39(6): 1137-1149.
[14] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing Atari with deep reinforcement learning[J/OL]. (2021-10-09). https://arxiv.org/abs/1312.5602.
[15] OSTAFEW C J, SCHOELLIG A P, BARFOOT T D. Learning-based nonlinear model predictive control to improve vision-based mobile robot path-tracking in challenging outdoor environments[C]//IEEE International Conference on Robotics and Automation. [S.l.]: IEEE, 2016.
[16] LEI T, MING L. A robot exploration strategy based on Q-learning network[C]//IEEE International Conference on Real-Time Computing and Robotics. [S.l.]: IEEE, 2016.
[17] ZHANG F Y, LEITNER J, MILFORD M, et al. Towards vision-based deep reinforcement learning for robotic motion control[C]//Proceedings of the Australasian Conference on Robotics and Automation (ACRA). [S.l.]: IEEE, 2015.
[18] HASSELT H V, GUEZ A, SILVER D. Deep reinforcement learning with double Q-learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Phoenix, Arizona, USA: AAAI, 2016.
[19] SCHAUL T, QUAN J, ANTONOGLOU I, et al. Prioritized experience replay[EB/OL]. (2015-11-18). https://www.semanticscholar.org/paper/Prioritized-Experience-Replay-Schaul-Quan/c6170fa90d3b2efede5a2e1660cb23e1c824f2ca?p2df.
[20] WANG Z, SCHAUL T, HESSEL M, et al. Dueling network architectures for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR, 2015.
[21] PEI Z Y, REN J J, PENG J, et al. Overall scheme trade-off design of Chang'E-5 mission[J]. Journal of Deep Space Exploration, 2021, 8(3): 215-226.
[22] GOMES E R, KOWALCZYK R. Dynamic analysis of multiagent Q-learning with ε-greedy exploration[C]//International Conference on Machine Learning. [S.l.]: ACM, 2009.