An Autonomous Planning Method for Deep Space Exploration Tasks Using Reinforcement Learning with Dynamic Rewards
Abstract: To address the multi-system parallelism and multiple-constraint requirements of autonomous mission planning for deep space probes, this paper proposes a method for constructing an autonomous mission planning model based on reinforcement learning with dynamic rewards. An interactive environment for the deep space probe agent is established; a policy network and a loss function integrating resource constraints, time constraints, and timing constraints are constructed; and a dynamic reward mechanism is proposed to improve the traditional policy gradient learning method. Simulation results show that the method achieves autonomous mission planning, with a markedly higher planning success rate and planning efficiency than the static-reward policy gradient algorithm. Moreover, planning can start from any state without changing the model structure, which improves adaptability to uncertain planning tasks. The method provides a new solution for autonomous mission planning and decision-making of deep space probes.

Highlights
● A reinforcement learning interactive environment for deep space probe agents is built.
● The traditional policy gradient reinforcement learning method is improved by constructing a loss function that integrates resource, time, and timing constraints for deep space probe mission planning.
● A dynamic reward mechanism is proposed.
● A deep space exploration task planning model with a random initial state is presented.
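The combination of a policy gradient update with a dynamic reward can be illustrated with a minimal sketch. Everything below is hypothetical: the state/action sizes, the toy reward, and the `dynamic_reward` schedule are placeholders, not the paper's actual network or reward rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular softmax policy: 4 states x 3 actions (sizes are illustrative).
N_STATES, N_ACTIONS = 4, 3
theta = np.zeros((N_STATES, N_ACTIONS))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_reward(base_reward, step, episode):
    # Hypothetical dynamic-reward schedule: the base reward is re-weighted
    # by training progress and step depth (the paper's actual rule differs).
    return base_reward * (1.0 + 0.01 * episode) / (1.0 + 0.1 * step)

def run_episode(episode, horizon=5):
    s, traj = 0, []
    for t in range(horizon):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        base_r = 1.0 if a == s % N_ACTIONS else -0.1   # toy reward signal
        traj.append((s, a, dynamic_reward(base_r, t, episode)))
        s = (s + a) % N_STATES
    return traj

def reinforce_update(traj, lr=0.05, gamma=0.9):
    G = 0.0
    for s, a, r in reversed(traj):          # accumulate discounted return
        G = r + gamma * G
        grad = -softmax(theta[s])           # d log pi(a|s) / d theta[s]
        grad[a] += 1.0
        theta[s] += lr * G * grad           # gradient-ascent step

for ep in range(200):
    reinforce_update(run_episode(ep))
```

The only difference from a static-reward REINFORCE loop is that the stored reward passes through `dynamic_reward` before the return is accumulated; a constraint-aware loss would additionally penalize trajectories that violate resource, time, or timing constraints.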
Table 1 Experimental environment

| Item | Configuration |
| --- | --- |
| Operating system | Windows 10, 64-bit |
| Programming language | Python 3.8.8 |
| CPU | Intel(R) Core(TM) i5-10210U CPU @ 1.60 GHz (2.11 GHz) |
| Memory | 16 GB |
| Reinforcement learning framework | PARL[24] |
Table 2 Photographing case of the deep space probe used
| Subsystem | State | Executable actions |
| --- | --- | --- |
| Power | Dormant | Start up |
| Power | Running | Send below-safety-threshold signal; send above-safety-threshold signal |
| Power | Charging | Retract solar panels |
| Photographing | Standby | Receive attitude-adjustment signal; read target coordinates; open lens; close lens |
| Photographing | Ready | Take photograph |
| Photographing | Photographing | Write to storage |
| Photographing | Saving | Send success signal; send failure signal |
| Photographing | Returning | Clear cache and delete image |
| Attitude adjustment | Ready | Turn toward target |
| Attitude adjustment | Running | Send pointing-complete signal |
| Attitude adjustment | Waiting | Receive task-complete signal |
| Attitude adjustment | Pointing complete | Send ready signal |
| Communication | Standby | Write to cache; open antenna |
| Communication | Ready | Establish connection |
| Communication | Sending | Send connection-abnormal signal; send transmission-complete signal |
| Communication | Disconnected | Receive task-failure signal |
| Communication | Blocked | Receive new-task signal |
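The subsystem/state/action structure of Table 2 can be encoded as a simple lookup that an agent environment might use to mask illegal actions. The identifiers below are translated and abbreviated, and the mapping is a hypothetical illustration rather than the paper's exact model:

```python
# Hypothetical encoding of subsystem states and the actions legal in each
# state, in the spirit of Table 2 (names translated, mapping illustrative).
TRANSITIONS = {
    ("power", "dormant"): ["start_up"],
    ("power", "running"): ["send_below_threshold", "send_above_threshold"],
    ("power", "charging"): ["retract_solar_panels"],
    ("camera", "standby"): ["receive_attitude_signal", "open_lens", "close_lens"],
    ("camera", "ready"): ["take_photo"],
    ("attitude", "ready"): ["turn_to_target"],
    ("comm", "ready"): ["establish_connection"],
}

def legal_actions(subsystem: str, state: str) -> list[str]:
    """Return the actions executable for a subsystem in a given state."""
    return TRANSITIONS.get((subsystem, state), [])

print(legal_actions("camera", "ready"))  # ['take_photo']
```

Restricting the policy's action choice to `legal_actions(...)` at each step is one way the timing and sequencing constraints of the table can enter the planning environment.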
Table 3 Accuracy results of 10 experiments
| Experiment | Group 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.91 | 0.86 | 0.85 | 0.87 | 0.91 | 0.87 | 0.92 | 0.93 | 0.92 | 0.90 | 0.89 |
| 2 | 0.90 | 0.92 | 0.82 | 0.89 | 0.88 | 0.91 | 0.90 | 0.91 | 0.90 | 0.86 | 0.88 |
| 3 | 0.90 | 0.92 | 0.82 | 0.89 | 0.92 | 0.91 | 0.90 | 0.91 | 0.93 | 0.89 | 0.89 |
| 4 | 0.91 | 0.87 | 0.88 | 0.89 | 0.88 | 0.91 | 0.94 | 0.92 | 0.93 | 0.90 | 0.90 |
| 5 | 0.91 | 0.87 | 0.88 | 0.92 | 0.87 | 0.91 | 0.90 | 0.89 | 0.90 | 0.90 | 0.89 |
| 6 | 0.82 | 0.86 | 0.86 | 0.93 | 0.93 | 0.94 | 0.91 | 0.94 | 0.94 | 0.95 | 0.90 |
| 7 | 0.91 | 0.90 | 0.91 | 0.93 | 0.92 | 0.89 | 0.90 | 0.91 | 0.92 | 0.89 | 0.90 |
| 8 | 0.95 | 0.89 | 0.91 | 0.94 | 0.92 | 0.93 | 0.93 | 0.92 | 0.89 | 0.91 | 0.91 |
| 9 | 0.91 | 0.92 | 0.91 | 0.90 | 0.92 | 0.90 | 0.91 | 0.92 | 0.89 | 0.92 | 0.91 |
| 10 | 0.91 | 0.89 | 0.83 | 0.93 | 0.89 | 0.89 | 0.94 | 0.88 | 0.94 | 0.92 | 0.90 |
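The per-experiment means in Table 3 are plain averages of the ten group accuracies, rounded to two decimals. For example, for experiment 1:

```python
# Accuracy values of experiment 1 from Table 3 (groups 1-10).
exp1 = [0.91, 0.86, 0.85, 0.87, 0.91, 0.87, 0.92, 0.93, 0.92, 0.90]
mean = round(sum(exp1) / len(exp1), 2)
print(mean)  # 0.89
```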
Table 4 Single planning time of 10 experiments
Unit: s

| Experiment | Group 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.109 | 0.111 | 0.122 | 0.121 | 0.117 | 0.113 | 0.112 | 0.132 | 0.110 | 0.114 | 0.1161 |
| 2 | 0.116 | 0.112 | 0.115 | 0.110 | 0.107 | 0.112 | 0.109 | 0.108 | 0.107 | 0.112 | 0.1108 |
| 3 | 0.112 | 0.111 | 0.111 | 0.112 | 0.110 | 0.111 | 0.109 | 0.113 | 0.109 | 0.113 | 0.1111 |
| 4 | 0.110 | 0.112 | 0.113 | 0.113 | 0.113 | 0.106 | 0.107 | 0.106 | 0.111 | 0.111 | 0.1102 |
| 5 | 0.112 | 0.113 | 0.114 | 0.116 | 0.116 | 0.119 | 0.114 | 0.121 | 0.116 | 0.120 | 0.1161 |
| 6 | 0.116 | 0.111 | 0.113 | 0.110 | 0.112 | 0.107 | 0.110 | 0.108 | 0.112 | 0.112 | 0.1111 |
| 7 | 0.113 | 0.113 | 0.112 | 0.111 | 0.113 | 0.110 | 0.114 | 0.108 | 0.112 | 0.113 | 0.1119 |
| 8 | 0.110 | 0.109 | 0.111 | 0.114 | 0.110 | 0.110 | 0.112 | 0.110 | 0.106 | 0.110 | 0.1102 |
| 9 | 0.109 | 0.107 | 0.111 | 0.110 | 0.112 | 0.112 | 0.109 | 0.109 | 0.112 | 0.110 | 0.1101 |
| 10 | 0.112 | 0.111 | 0.111 | 0.111 | 0.110 | 0.112 | 0.115 | 0.112 | 0.112 | 0.112 | 0.1118 |
[1] CUI P Y. Deep space exploration: strategic height of space expansion[J]. People's Forum: Academic Frontier, 2017(5): 13-18. doi:10.16619/j.cnki.rmltxsqy.2017.05.002
[2] YU D Y, ZHANG X W, ZHANG M, et al. Current status and prospects of small celestial body sampling and exploration technology[J]. Spacecraft Engineering, 2020, 29(2): 1-10. doi:10.3969/j.issn.1673-8748.2020.02.001
[3] ZHAO F Y, XU R, CUI P Y. Heuristic mission planning method for deep space probes[J]. Journal of Astronautics, 2015, 36(5): 496-503.
[4] JIANG X, XU R, ZHU S Y. Research on constraint-satisfiable deep space mission planning methods[J]. Journal of Deep Space Exploration, 2018, 5(3): 262-268.
[5] JIANG X, XU R, CHEN L J. Study on extensive constraint filtering method for dynamic constraint planning of deep space detectors[J]. Journal of Deep Space Exploration, 2019, 6(6): 586-594.
[6] JIN H, XU R, ZHU S Y, et al. Timeline transfer landmark heuristic planning method for deep space detectors[J]. Journal of Astronautics, 2021, 42(7): 862-872. doi:10.3873/j.issn.1000-1328.2021.07.006
[7] ZHAO Y T, XU R, LI C Y, et al. Mission planning method for deep space probes based on dynamic agent interaction graph[J]. Journal of Deep Space Exploration, 2021, 8(5): 519-527.
[8] WANG X H, LI S. Research on constraint simplification and mission planning methods for deep space detectors[J]. Journal of Astronautics, 2016, 37(7): 768-774.
[9] FENG X E, LI Y Q, YANG C, et al. Architecture design and autonomous mission planning for autonomous deep space exploration spacecraft[J]. Control Theory & Applications, 2019, 36(12): 2035-2041.
[10] WANG X, ZHAO Q J, XU R. Modeling of deep space probe mission planning based on knowledge graph[J]. Journal of Deep Space Exploration, 2021, 8(3): 315-323.
[11] LI Y Q, XU M Q, WANG R X. Fuzzy uncertainty factors in spacecraft observation rescheduling and their processing[J]. Journal of Astronautics, 2009, 30(3): 1106-1111. doi:10.3873/j.issn.1000-1328.2009.03.045
[12] HE D L, FENG X E, LEI M J, et al. Real-number genetic encoding multi-satellite mission planning algorithm for deep space missions[J]. Control Theory & Applications, 2019, 36(12): 2055-2064.
[13] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[J]. IEEE Transactions on Neural Networks, 1998, 9(5): 1054.
[14] SHI J J, ZHANG J, LUO Y Z, et al. Space station task replanning method based on deep reinforcement learning[J]. Manned Spaceflight, 2020, 26(4): 469-476. doi:10.3969/j.issn.1674-5825.2020.04.008
[15] WU G W, CUI B J, QU Y B, et al. Satellite real-time guidance mission planning method and system based on deep reinforcement learning: CN111950873A[P]. 2020.
[16] GUO L J. Study on planning strategy of hopping asteroid detector based on deep reinforcement learning[D]. Harbin: Harbin Institute of Technology, 2019.
[17] FURFARO R, LINARES R. Deep learning for autonomous lunar landing[C]//Proceedings of AAS/AIAA Astrodynamics Specialist Conference. [S.l.]: AIAA, 2018.
[18] VAN HECKE K, DE CROON G C H E, HENNES D, et al. Self-supervised learning as an enabling technology for future space exploration robots: ISS experiments on monocular distance learning[J]. Acta Astronautica, 2017.
[19] XU R, LI C Y, ZHU S Y, et al. Progress in autonomous planning technology for deep space explorers[J]. Journal of Deep Space Exploration, 2021, 8(2): 111-123.
[20] LIU Z R, JIANG S H. A review of path planning for mobile robots based on reinforcement learning[J]. Manufacturing Automation, 2019, 41(3): 90-92.
[21] YU S P, HAN X C, YUAN Z M, et al. Dynamic scheduling method of high-speed trains based on policy gradient reinforcement learning[J]. Control and Decision, 2022(9): 2407-2417.
[22] ZHANG M, ZHANG Q, LIU W T, et al. A train intelligent control method based on policy gradient reinforcement learning[J]. Journal of the China Railway Society, 2020, 42(1): 69-75. doi:10.3969/j.issn.1001-8360.2020.01.010
[23] ZHOU F Y, JIN L P, DONG J. A review of convolutional neural networks[J]. Chinese Journal of Computers, 2017, 40(6): 1229-1251. doi:10.11897/SP.J.1016.2017.01229
[24] LI G Y, LV X P, ZHANG X. Traffic signal control based on reinforcement learning and deep learning applications[J]. Artificial Intelligence, 2020(3): 84-9.