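Official journal of the China Society for Industrial and Applied Mathematics
Supervised by: Ministry of Education of the People's Republic of China
Sponsored by: Xi'an Jiaotong University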
ISSN 1005-3085  CN 61-1269/O1

Chinese Journal of Engineering Mathematics (工程数学学报)


Distributed Variable Selection: MCP Regularization

WANG Ge-hua, WANG Pu-yu, ZHANG Hai

  1. School of Mathematics, Northwest University, Xi'an 710069
  • Received: 2019-02-27 Accepted: 2019-06-13 Online: 2021-06-15 Published: 2021-08-15
  • Supported by:
    The National Natural Science Foundation of China (11571011).

Abstract: In the digital age, massive high-dimensional data sets are being collected across disciplines and fields. Turning such data into a form that can be stored, analyzed, and used to inform practical problems is a major challenge. Distributed storage emerged in response to the storage problem: the data set is partitioned across different machines without duplication. This raises a second problem, namely how to design machine learning algorithms suited to data stored in this way. Regularization methods are effective tools for processing and analyzing massive high-dimensional data, but in their standard form they apply only to data held on a single machine. Because non-convex regularization performs well for variable selection and feature extraction, we combine distributed storage with non-convex regularization and study non-convex regularization methods based on distributed computing, addressing both the storage and the analysis of massive high-dimensional data. This paper studies variable selection when the data are stored in a distributed manner. We store the data across multiple computers that can communicate with one another and propose a distributed MCP (minimax concave penalty) method. The resulting algorithm, based on ADMM (the alternating direction method of multipliers), exchanges information only between adjacent computers and performs variable selection over the full data set; we also give a convergence analysis for it. The variables selected by the distributed method are the same as those selected by the non-distributed method. Finally, experimental results show that the proposed method is suitable for processing data held in distributed storage.

Key words: distributed, sparse, MCP, ADMM
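
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a least-squares loss scaled by 1/(2n) and uses global consensus ADMM, in which every node's estimate is averaged each round, in place of the neighbor-to-neighbor communication scheme described in the paper. The function names (`mcp_penalty`, `mcp_prox`, `consensus_admm_mcp`), the parameter values, and the synthetic noise-free data are all illustrative assumptions.

```python
import numpy as np


def mcp_penalty(t, lam, gamma):
    """Elementwise MCP: lam*|t| - t^2/(2*gamma) if |t| <= gamma*lam, else gamma*lam^2/2."""
    a = np.abs(np.asarray(t, dtype=float))
    return np.where(a <= gamma * lam, lam * a - a**2 / (2.0 * gamma), 0.5 * gamma * lam**2)


def mcp_prox(v, lam, gamma, rho=1.0):
    """Elementwise argmin_z MCP(z) + (rho/2)*(z - v)^2 (firm thresholding).

    Requires gamma*rho > 1 so that the scalar subproblem is strongly convex.
    """
    if gamma * rho <= 1.0:
        raise ValueError("need gamma * rho > 1")
    v = np.asarray(v, dtype=float)
    soft = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
    shrunk = soft / (1.0 - 1.0 / (gamma * rho))
    # Inputs beyond gamma*lam are left unshrunk; smaller ones are thresholded and rescaled.
    return np.where(np.abs(v) <= gamma * lam, shrunk, v)


def consensus_admm_mcp(blocks, lam, gamma, rho=1.0, iters=1000):
    """Consensus ADMM (scaled form) for
        sum_i ||A_i x_i - b_i||^2 / (2n) + sum_j MCP(z_j)   subject to   x_i = z,
    where blocks = [(A_i, b_i), ...] are the per-node data and n is the total row count.
    """
    n = sum(A.shape[0] for A, _ in blocks)
    p = blocks[0][0].shape[1]
    N = len(blocks)
    # Each node only needs its own A_i, b_i; raw data are never pooled.
    inv = [np.linalg.inv(A.T @ A / n + rho * np.eye(p)) for A, _ in blocks]
    Atb = [A.T @ b / n for A, b in blocks]
    x = np.zeros((N, p))  # local estimates
    u = np.zeros((N, p))  # scaled dual variables
    z = np.zeros(p)       # shared (consensus) estimate
    for _ in range(iters):
        for i in range(N):
            # Local ridge-like solve on node i.
            x[i] = inv[i] @ (Atb[i] + rho * (z - u[i]))
        # Averaging step followed by the MCP proximal map with penalty N*rho.
        z = mcp_prox((x + u).mean(axis=0), lam, gamma, rho=N * rho)
        u += x - z
    return z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_true = np.zeros(10)
    x_true[[0, 3, 7]] = [2.0, -1.5, 1.0]
    # Four hypothetical nodes, each holding 50 noise-free observations.
    blocks = []
    for _ in range(4):
        A = rng.standard_normal((50, 10))
        blocks.append((A, A @ x_true))
    z = consensus_admm_mcp(blocks, lam=0.1, gamma=3.0)
    print("selected variables:", np.flatnonzero(z))
```

Passing a single block that stacks all rows runs the same routine in a non-distributed setting, which gives a simple way to compare the selected variables between the two settings on a given data set.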

CLC number: