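Journal of the China Society for Industrial and Applied Mathematics
Supervised by: Ministry of Education of the People's Republic of China
Sponsored by: Xi'an Jiaotong University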
ISSN 1005-3085  CN 61-1269/O1

Chinese Journal of Engineering Mathematics ›› 2021, Vol. 38 ›› Issue (6): 750-762. doi: 10.3969/j.issn.1005-3085.2021.06.001

基于机器学习的文本半自动类别标注方法

宫衍圣1,   蔡科平2,   王志强3,   李鑫鑫4,   靖稳峰4   

  1. 中铁第一勘察设计院集团有限公司,西安 710043 
    2. 西安工业大学,西安 710021 
    3. 国网浙江省电力公司信息与通信分公司,杭州 310007 
    4. 西安交通大学数学与统计学院,西安 710049
  • 出版日期:2021-12-15 发布日期:2022-02-15
  • 基金资助:
    中国铁建股份有限公司 2018 年度科技重大专项 (18-A02);西安市科技计划项目 (20180916CX5JC6).

Semi-automatic Text Category Labelling Method Based on Machine Learning

GONG Yansheng1,   CAI Keping2,   WANG Zhiqiang3,   LI Xinxin4,   JING Wenfeng4   

  1. China Railway First Survey and Design Institute Group Co., Ltd., Xi'an 710043
    2. Xi'an Technological University, Xi'an 710021
    3. State Grid Zhejiang Electric Power Corporation Information & Telecommunication Branch, Hangzhou 310007
    4. School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049
  • Online:2021-12-15 Published:2022-02-15
  • Supported by:
    China Railway Construction Corporation 2018 Major Science and Technology Special Project (18-A02); the Science and Technology Planning Project of Xi'an City (20180916CX5JC6).

摘要:

在文本分类问题中,人工标注方式需要耗费大量人力和财力,且需要熟悉所研究领域的专业人员才能进行文本标注。为了提高文本类数据标注的效率,提出了一种半自动化论文类别标注方法。首先使用 Word2vec 与 TF-IDF 相结合的方式得到论文的向量表示;接着使用 K-means 算法进行文本聚类;然后通过 $L_1$-LR 二分类模型构建 $K$ 个分类模型;对每个二分类模型选取其权重绝对值较大系数对应的单词作为主题词,最后根据主题词确定每一类别的标签。实验表明,所提出的论文类别半自动标注方法大大提高了文本标注的工作效率。

关键词: 半自动类别标注, 机器学习, 文本聚类, $L_1$-LR 分类模型

Abstract:

In text classification, manual labelling is costly in both labour and money, and it requires professionals familiar with the research field. To improve the efficiency of labelling text data, a semi-automatic method for labelling paper categories is proposed. First, vector representations of paper abstracts are obtained by combining Word2vec with TF-IDF. Next, the K-means algorithm is used to cluster the texts. Then $K$ binary classifiers are built with the $L_1$-LR model, and for each classifier the words whose coefficients have large absolute values are selected as topic words. Finally, the label of each category is determined from its topic words. Experiments show that the proposed semi-automatic method greatly improves the efficiency of text labelling.

Key words: semi-automatic category labelling, machine learning, text clustering, $L_1$-LR binary classification model
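
The pipeline outlined in the abstract (vectorise, cluster, fit one $L_1$-LR classifier per cluster, read off the large-weight words) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: plain TF-IDF vectors stand in for the Word2vec and TF-IDF combination, the $L_1$-LR models are fitted with a simple proximal gradient loop, K-means uses a deterministic farthest-first initialisation, and the function names (`tfidf`, `kmeans`, `l1_logreg`, `topic_words`), the hyperparameters, and the six-document corpus are all hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows) of L2-normalised TF-IDF vectors for tokenised docs."""
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)
    rows = []
    for d in docs:
        tf = Counter(d)
        v = [tf[w] / len(d) * (math.log((1 + n) / (1 + df[w])) + 1) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        rows.append([x / norm for x in v])
    return vocab, rows

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(X, k, iters=20):
    """Lloyd's algorithm; farthest-first initialisation is an assumption made here."""
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda x: min(dist2(x, c) for c in centers)))
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(x, centers[j])) for x in X]
        for j in range(k):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def l1_logreg(X, y, lam=0.01, lr=1.0, epochs=500):
    """Binary logistic regression with an L1 penalty, fitted by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        p = [1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b))) for x in X]
        grad = [sum((pi - yi) * x[j] for pi, yi, x in zip(p, y, X)) / n for j in range(d)]
        b -= lr * sum(pi - yi for pi, yi in zip(p, y)) / n
        new_w = []
        for wj, gj in zip(w, grad):
            z = wj - lr * gj
            # Soft-thresholding: weights smaller than lr * lam become exactly zero.
            new_w.append(math.copysign(max(abs(z) - lr * lam, 0.0), z))
        w = new_w
    return w

def topic_words(docs, k, top=3):
    """Cluster docs, then rank words per cluster by absolute L1-LR weight."""
    vocab, X = tfidf(docs)
    labels = kmeans(X, k)
    topics = {}
    for j in range(k):
        w = l1_logreg(X, [1 if l == j else 0 for l in labels])
        ranked = sorted(range(len(vocab)), key=lambda i: -abs(w[i]))[:top]
        topics[j] = [(vocab[i], w[i]) for i in ranked if w[i] != 0.0]
    return labels, topics

# Made-up toy corpus: three "railway" texts and three "power grid" texts.
docs = [s.split() for s in [
    "railway tunnel survey design",
    "railway bridge design survey",
    "tunnel bridge railway construction",
    "power grid load forecasting",
    "grid voltage power dispatch",
    "power load dispatch grid",
]]
labels, topics = topic_words(docs, k=2)
print(labels)
for cluster, words in topics.items():
    print(cluster, words)
```

Each returned pair keeps the signed weight: a positive weight marks a word that points towards the cluster, a negative one a word that points away from it. A human annotator would then inspect the listed words for each cluster and assign the category label, which is the manual half of the semi-automatic procedure.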

CLC Number: