DIGITAL LIBRARY ARCHIVE
A study on the classification of research topics based on COVID-19 academic research using Topic modeling
So-yeon Yoo (School of Business, Hanyang University)
Gyoo-gun Lim (School of Business, Hanyang University)
Vol. 28, No. 1, Page: 155 ~ 174
Keywords
COVID-19, Topic Modeling, LDA(Latent Dirichlet Allocation), Word2vec, Keyword Extraction
Abstract
From January 2020 to October 2021, more than 500,000 academic studies related to COVID-19 (the disease caused by severe acute respiratory syndrome coronavirus 2) were published. The rapid growth in the number of COVID-19 papers places time and technical constraints on healthcare professionals and policy makers who need to find important research quickly. This study therefore proposes a method for extracting useful information from the text of an extensive literature using the LDA and Word2vec algorithms. Papers related to a search keyword were extracted from the COVID-19 papers, and their detailed topics were identified. The data come from the CORD-19 data set on Kaggle, a free academic resource prepared by major research groups and the White House to respond to the COVID-19 pandemic and updated weekly. The research method has two main parts. First, 41,062 articles were obtained by filtering and pre-processing the abstracts of 47,110 academic papers that include full text. The number of COVID-19 publications per year was analyzed through exploratory data analysis in Python, and the top 10 journals with active research were identified. LDA and Word2vec were then used to derive COVID-19 research topics, and similarity was measured after analyzing related words. Second, papers containing 'vaccine' and 'treatment' were extracted from among the topics derived from all papers, yielding 4,555 papers related to 'vaccine' and 5,971 papers related to 'treatment'. For each set of collected papers, detailed topics were analyzed using LDA and Word2vec, and clustering after PCA dimension reduction was applied, with the t-SNE algorithm used to visualize groups of papers with similar themes.
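The filtering and keyword-extraction steps described above can be illustrated with a minimal sketch. It is not the authors' code: the record fields (`title`, `abstract`, `year`) and the four sample records are hypothetical stand-ins for CORD-19 metadata rows, and a plain case-insensitive substring match is assumed for the keyword test.

```python
from collections import Counter

# Hypothetical records standing in for CORD-19 metadata rows.
# The field names and contents are assumptions made for illustration only.
papers = [
    {"title": "A", "abstract": "Vaccine efficacy against SARS-CoV-2 variants.", "year": 2021},
    {"title": "B", "abstract": "Antiviral treatment outcomes in hospitalized patients.", "year": 2020},
    {"title": "C", "abstract": "Mobility patterns during lockdown.", "year": 2020},
    {"title": "D", "abstract": "", "year": 2021},
]


def filter_by_keyword(records, keyword):
    """Return the records whose abstract contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in records if needle in r["abstract"].lower()]


# Drop records without an abstract, as a basic pre-processing filter.
usable = [p for p in papers if p["abstract"].strip()]

# Build one sub-corpus per search keyword; each could then be topic-modeled separately.
vaccine_papers = filter_by_keyword(usable, "vaccine")
treatment_papers = filter_by_keyword(usable, "treatment")

# Count publications per year, a small piece of exploratory data analysis.
papers_per_year = Counter(p["year"] for p in usable)
```

Each keyword-specific list can then be passed to a topic model on its own, which is the step where the study reports that finer-grained topics appear.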
A noteworthy result of this study is that topics that did not emerge from topic modeling over all COVID-19 papers did emerge from the topic modeling results for each research topic. For example, topic modeling of the papers related to 'vaccine' extracted a new topic, Topic 05 'neutralizing antibodies'. A neutralizing antibody is an antibody that protects cells from infection when a virus enters the body, and it is said to play an important role in the production of therapeutic agents and in vaccine development. Likewise, extracting topics from the papers related to 'treatment' revealed a new topic, Topic 05 'cytokine'. A cytokine storm occurs when the body's immune cells, rather than defending against attack, attack normal cells. Hidden topics that could not be found across the full set of papers were uncovered by classifying the papers by keyword and running topic modeling on each group. In this study, we proposed a method of extracting topics from a large body of literature using the LDA algorithm and extracting similar words using Skip-gram, the Word2vec model that predicts surrounding words from a center word. Combining the LDA model with the Word2vec model was intended to improve performance by capturing both the relationship between documents and LDA topics and the relationships captured by Word2vec. In addition, we presented a clustering method based on PCA dimension reduction that uses the t-SNE technique to group documents with similar themes, giving an intuitive, structured organization of the documents. In a situation where researchers working to overcome COVID-19 cannot keep up with the rapid publication of related academic papers, we hope this approach will save healthcare professionals and policy makers valuable time and effort and help them gain new insights quickly. It is also expected to serve as basic data for researchers exploring new research directions.
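To make the Skip-gram idea concrete, the sketch below generates the (center word, context word) training pairs that a Skip-gram model learns from. It shows only pair generation, not a full Word2vec implementation; the function name `skipgram_pairs`, the window size, and the sample tokens are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs: each word is paired with its neighbors
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical tokenized fragment of an abstract.
tokens = ["neutralizing", "antibodies", "protect", "cells"]
pairs = skipgram_pairs(tokens, window=1)
```

A Skip-gram model is trained to predict the second element of each pair from the first, so words that occur in similar contexts end up with similar vectors, which is what makes related-word extraction possible.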
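Detailed Information in Korean (translated into English)
A Study on the Classification of Research Topics Based on COVID-19 Academic Research Using Topic Modeling
So-yeon Yoo (School of Business, Hanyang University)
Gyoo-gun Lim (School of Business, Hanyang University)
Keywords
COVID-19, Topic Modeling, LDA (Latent Dirichlet Allocation), Word2vec, Keyword Extraction
Abstract
From January 2020 to October 2021, more than 500,000 academic studies related to COVID-19 (coronavirus-2, a fatal respiratory syndrome) were published. As the number of COVID-19 papers grows rapidly, healthcare professionals and policy makers face time and technical constraints in finding important research quickly. This study therefore presents a way to extract useful information from the text of an extensive literature using the LDA and Word2vec algorithms. Papers related to a search keyword were extracted from the COVID-19 papers, and their detailed topics were identified. The data come from the CORD-19 data set on Kaggle, a free academic resource prepared by major research groups and the White House to respond to the COVID-19 pandemic and updated weekly. The research method has two main parts. First, LDA topic modeling and Word2vec related-word analysis are performed on the abstracts of 47,110 academic papers, and then 4,555 papers related to 'vaccine' and 5,791 papers related to 'treatment' are extracted from the derived topics. Second, for the extracted papers, LDA and PCA dimension reduction followed by the t-SNE technique are used to cluster papers with similar topics and visualize them as scatter plots. Hidden topics that could not be found across the full set of papers were identified as detailed topics by classifying the literature by keyword and then performing topic modeling. One goal of this study is to present a way to classify literature on specific information by entering keywords over a large body of literature. A further goal is to propose a way for healthcare professionals and policy makers to save valuable time and effort and obtain information quickly. The study is expected to serve as basic data that helps discover COVID-19 topics in the abstracts of academic papers and explore new research directions for COVID-19.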
Cite this article
JIIS(APA) Style
Yoo, S.-y., & Lim, G.-g. (2022). A study on the classification of research topics based on COVID-19 academic research using Topic modeling. Journal of Intelligence and Information Systems, 28(1), 155-174.

IEEE Style
So-yeon Yoo, and Gyoo-gun Lim, "A study on the classification of research topics based on COVID-19 academic research using Topic modeling", Journal of Intelligence and Information Systems, vol. 28, no. 1, pp. 155~174, 2022.

ACM Style
Yoo, S.-y., & Lim, G.-g., 2022. A study on the classification of research topics based on COVID-19 academic research using Topic modeling. Journal of Intelligence and Information Systems. 28, 1, 155--174.
Export Formats : BiBTeX, EndNote
@article{Yoo:JIIS:2022:867,
author = {Yoo, So-yeon and Lim, Gyoo-gun},
title = {A study on the classification of research topics based on COVID-19 academic research using Topic modeling},
journal = {Journal of Intelligence and Information Systems},
issue_date = {March 2022},
volume = {28},
number = {1},
month = Mar,
year = {2022},
issn = {2288-4866},
pages = {155--174},
url = {},
doi = {},
publisher = {Korea Intelligent Information System Society},
address = {Seoul, Republic of Korea},
keywords = { COVID-19, Topic Modeling, LDA(Latent Dirichlet Allocation), Word2vec and Keyword Extraction },
}
%0 Journal Article
%1 867
%A So-yeon Yoo
%A Gyoo-gun Lim
%T A study on the classification of research topics based on COVID-19 academic research using Topic modeling
%J Journal of Intelligence and Information Systems
%@ 2288-4866
%V 28
%N 1
%P 155-174
%D 2022
%R
%I Korea Intelligent Information System Society