
Is the Web of Science Classification Reasonable? Research on a Classification Prediction Method Based on Gradient Saliency Text Feature Extraction

Abstract: Web of Science is one of the most important databases for obtaining academic information, and it has a complex subject classification system. The rationality and accuracy of this database are of great significance for retrieving academic resources and for promoting research within disciplines. This study selects a dataset from the "multidisciplinary category" of the Web of Science database. Starting from a derivation based on maximum likelihood theory, and combining it with the interpretability theory of gradient saliency in large models, the study mines the distributional features of text, quantifies category features, and measures inter-category similarity. On this basis it proposes a method for text extraction and classification prediction. The method is used to re-predict single-category labels in the Web of Science database, which improves data quality by raising the accuracy of text classification labels. Experiments further show that the method can also predict multi-category labels effectively, providing a basis for decisions on document classification. By quantifying category features and computing category similarity, the study identifies why predicted labels repeatedly fall within a few specific sets of categories. The method can guide document classification, assess whether the category divisions of a database are reasonable, and, by analyzing the papers a journal has published, determine how well those papers match the journal's actual category.
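As a rough illustration of the gradient saliency and category similarity ideas named in the abstract, and not the authors' implementation, the following Python sketch swaps the large model for a toy linear softmax classifier over bag-of-words counts. The weight matrix `W`, the sample documents, and all function names are hypothetical and chosen only for this example. The sketch assumes saliency is the absolute value of input times the gradient of the class log-likelihood, and that a category is summarized by the mean saliency of its documents.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def log_prob_grad(W, x, c):
    """Gradient of log p(c | x) with respect to x for logits z = W @ x."""
    p = softmax(W @ x)
    return W[c] - p @ W

def saliency(W, x, c):
    """Input-times-gradient saliency: |x * d log p(c | x) / dx| per token."""
    return np.abs(x * log_prob_grad(W, x, c))

def category_profile(W, docs, c):
    """Mean saliency vector over the documents assigned to category c."""
    return np.mean([saliency(W, x, c) for x in docs], axis=0)

def cosine(a, b):
    """Cosine similarity, with a small constant to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical toy model: 3 categories, vocabulary of 4 tokens.
W = np.array([
    [2.0, 0.0, -1.0, 0.0],
    [0.0, 2.0, 0.0, -1.0],
    [1.8, 0.2, -0.8, 0.0],
])

# Hypothetical bag-of-words counts for documents in each category.
docs_by_category = {
    0: [np.array([2.0, 0.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0, 0.0])],
    1: [np.array([0.0, 2.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0, 1.0])],
    2: [np.array([2.0, 1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0, 1.0])],
}

profiles = {c: category_profile(W, docs, c) for c, docs in docs_by_category.items()}

# Pairwise similarity between category profiles.
for a in profiles:
    for b in profiles:
        if a < b:
            print(f"similarity({a}, {b}) = {cosine(profiles[a], profiles[b]):.3f}")
```

Under these assumptions, a pair of categories whose saliency profiles have high cosine similarity would be a candidate explanation for predicted labels that keep landing in the same small set of categories.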

Version history

[V1] 2024-09-09 22:40:09 PSSXiv:202409.00539V1