Efficient Topic Modeling by Mapping Global and Local Topics
Choi Hochang (Graduate School of Business IT, Kookmin University)
Kim Namgyu (School of MIS, Kookmin University)
Vol. 23, No. 3, Page: 69 ~ 94
Divide and Conquer, Big Data, Text Mining, Topic Modeling
Recently, growing demand for big data analysis has been driving the vigorous development of related technologies and tools. In addition, advances in IT and the increasing penetration of smart devices are producing large amounts of data. As a result, data analysis technology is rapidly becoming popular, and attempts to gain insights through data analysis continue to increase. This suggests that big data analysis will become more important across industries for the foreseeable future. Big data analysis has generally been performed by a small number of experts and delivered to those who request it. However, growing interest in big data analysis has stimulated computer programming education and the development of many data analysis programs. Accordingly, the entry barriers to big data analysis are gradually falling and data analysis technology is spreading. As a result, big data analysis is expected to be performed increasingly by the people who need the analysis themselves.
Along with this, interest in various kinds of unstructured data is continually increasing, and text data in particular is attracting a great deal of attention. The emergence of new web-based platforms and techniques has led to the mass production of text data and to active attempts to analyze it, and the results of text analysis are being used in various fields. Text mining is a concept that embraces various theories and techniques for text analysis. Among the many text mining techniques used for various research purposes, topic modeling is one of the most widely used and studied. Topic modeling is a technique that extracts the major issues from a large number of documents, identifies the documents that correspond to each issue, and provides the identified documents as a cluster. It is regarded as a very useful technique because it reflects the semantic elements of the documents.
Traditional topic modeling is based on the distribution of key terms across the entire document set. Thus, the entire document set must be analyzed at once to identify the topic of each document. This requirement makes the analysis slow when topic modeling is applied to a large number of documents. It also creates a scalability problem: processing time increases exponentially as the number of documents to be analyzed grows. The problem is particularly noticeable when the documents are distributed across multiple systems or regions. To overcome these problems, a divide and conquer approach can be applied to topic modeling. This means dividing a large number of documents into sub-units and deriving topics by running topic modeling on each unit separately. This method makes topic modeling on a large number of documents possible with limited system resources, and it can improve the processing speed of topic modeling. It can also significantly reduce analysis time and cost, because documents can be analyzed where they are stored without first being combined in one place.
However, despite its many advantages, this method has two major problems. First, the relationship between the local topics derived from each unit and the global topics derived from the entire document set is unclear. That is, the local topics of each document can be identified, but its global topics cannot. Second, a method for measuring the accuracy of such an approach needs to be established. That is to say, assuming that the global topics are the ideal answer, the deviation of the local topics from the global topics needs to be measured. Because of these difficulties, this approach has not been studied sufficiently compared with other topic modeling research.
In this paper, we propose a topic modeling approach that addresses the above two problems. First, we divide the entire document cluster (global set) into sub-clusters (local sets), and generate a reduced entire document cluster (RGS, reduced global set) that consists of representative documents extracted from each local set. We address the first problem by mapping RGS topics to local topics. Along with this, we verify the accuracy of the proposed methodology by checking whether documents are assigned to the same topic in the results for the global set and the local sets. Using 24,000 news articles, we conduct experiments to evaluate the practical applicability of the proposed methodology. In addition, through an additional experiment, we confirm that the proposed methodology can provide results similar to those of topic modeling on the entire document set. We also propose a reasonable method for comparing the results of the two approaches.
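The mapping and comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each document has already been given a single topic label by some topic model, that a local topic is mapped to the RGS topic most of its representative documents received (a simple majority vote), and that agreement with the global result is measured by pairwise co-assignment (the Rand index), which is a standard stand-in rather than the measure used in the paper. The function names and the example labels are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def map_local_to_rgs_topics(local_topic_of, rgs_topic_of):
    """Map each local topic to an RGS topic by majority vote.

    local_topic_of: dict doc_id -> topic label from a local-set model
    rgs_topic_of:   dict doc_id -> topic label from the RGS model
    Only documents present in both dicts (the representatives) vote.
    Ties go to the RGS topic encountered first (Counter ordering).
    """
    votes = defaultdict(Counter)
    for doc_id, rgs_topic in rgs_topic_of.items():
        if doc_id in local_topic_of:
            votes[local_topic_of[doc_id]][rgs_topic] += 1
    return {local: counts.most_common(1)[0][0] for local, counts in votes.items()}


def pairwise_agreement(topic_a, topic_b):
    """Rand index over the documents that both labelings share.

    A document pair counts as agreeing when both labelings put the two
    documents in the same topic, or both put them in different topics.
    Label names never need to match between the two labelings.
    """
    shared = sorted(set(topic_a) & set(topic_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared documents")
    pairs = list(combinations(shared, 2))
    agree = sum(
        (topic_a[x] == topic_a[y]) == (topic_b[x] == topic_b[y]) for x, y in pairs
    )
    return agree / len(pairs)


# Hypothetical labels: d1, d3, d5 are the representatives placed in the RGS.
local_topic_of = {"d1": "L0", "d2": "L0", "d3": "L1", "d4": "L1", "d5": "L0"}
rgs_topic_of = {"d1": "R2", "d3": "R0", "d5": "R2"}
global_topic_of = {"d1": "G1", "d2": "G1", "d3": "G0", "d4": "G1", "d5": "G1"}

mapping = map_local_to_rgs_topics(local_topic_of, rgs_topic_of)
# Every document inherits the RGS topic of its local topic (None if unmapped).
mapped_topic_of = {d: mapping.get(t) for d, t in local_topic_of.items()}

print(mapping)
print(pairwise_agreement(mapped_topic_of, global_topic_of))
```

Because the agreement score compares pairs of documents rather than label names, it can be computed even though the global topics and the RGS topics come from different models with unrelated labels.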
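Korean title (translated): An Efficient Topic Modeling Method through Local Mapping of Global Topics
Choi Hochang (Graduate School of Business IT, Kookmin University)
Kim Namgyu (School of Management Information Systems, College of Business Administration, Kookmin University)
Divide and Conquer Approach, Big Data, Text Mining, Topic Modeling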
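Recently, with the continuous increase in demand for big data analysis, related techniques and tools have advanced rapidly, and big data analysis is accordingly shifting from being monopolized by a small number of experts to being carried out by individual users themselves. In addition, interest is growing in ways to use unstructured data that was difficult to analyze with traditional methods, and research on topic modeling, which derives topics from massive amounts of text, is particularly active.
Because traditional topic modeling is based on the distribution of key terms across the entire document set, identifying the topic of each document requires a batch analysis of all documents. As a result, topic modeling of large document collections takes a long time, and this problem is more severe when the documents to be analyzed are stored across multiple systems or regions. To overcome this, one can consider dividing a large number of documents into sub-clusters and deriving topics by analyzing each cluster. In this case, however, the local topics derived from each cluster differ from the global topics derived from the entire document set, so the correspondence between each document and the global topics cannot be identified.
Therefore, this study proposes a method that divides the entire document set into sub-clusters, extracts representative documents from each sub-cluster to form a reduced global document set, and, using the representative documents as intermediaries, derives the components of the global topics from the local topics obtained from the sub-clusters. We also evaluated the practical applicability of the proposed methodology through experiments on 24,000 news articles, and compared the topic analysis results of the divide and conquer approach under the proposed methodology with those of batch processing over the entire document set.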
Cite this article
JIIS Style
Choi, H., and N. Kim, "Efficient Topic Modeling by Mapping Global and Local Topics", Journal of Intelligence and Information Systems, Vol. 23, No. 3 (2017), 69~94.

IEEE Style
Choi Hochang and Kim Namgyu, "Efficient Topic Modeling by Mapping Global and Local Topics", Journal of Intelligence and Information Systems, vol. 23, no. 3, pp. 69~94, 2017.

ACM Style
Choi, H., and Kim, N., 2017. Efficient Topic Modeling by Mapping Global and Local Topics. Journal of Intelligence and Information Systems. 23, 3, 69--94.
Export Formats: BibTeX, EndNote

@article{Hochang:JIIS:2017:698,
author = {Choi, Hochang and Kim, Namgyu},
title = {Efficient Topic Modeling by Mapping Global and Local Topics},
journal = {Journal of Intelligence and Information Systems},
issue_date = {September 2017},
volume = {23},
number = {3},
month = Sep,
year = {2017},
issn = {2288-4866},
pages = {69--94},
url = { },
doi = {10.13088/jiis.2017.23.3.069},
publisher = {Korea Intelligent Information System Society},
address = {Seoul, Republic of Korea},
keywords = {Divide and Conquer, Big Data, Text Mining, Topic Modeling}
}
%0 Journal Article
%1 698
%A Choi Hochang
%A Kim Namgyu
%T Efficient Topic Modeling by Mapping Global and Local Topics
%J Journal of Intelligence and Information Systems
%@ 2288-4866
%V 23
%N 3
%P 69-94
%D 2017
%R 10.13088/jiis.2017.23.3.069
%I Korea Intelligent Information System Society