DIGITAL LIBRARY ARCHIVE
Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents
Jongin Park (Graduate School of Business IT, Kookmin University)
Namgyu Kim (Graduate School of Business IT, Kookmin University)
Vol. 25, No. 3, Page: 19 ~ 41
Keywords
Document Embedding, Multi-Vector Document Embedding, Word Embedding, Text Mining
Abstract
With demand for text data analysis growing rapidly, research and investment in text mining are being actively pursued not only in academia but also across industries. Text mining generally proceeds in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are performed according to the purpose of the analysis. Until recently, text mining studies focused on applications of the second step. However, with the finding that the text structuring process substantially influences the quality of analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data is represented as vectors.
Unlike structured data, which can be used directly in a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form the computer can process. Mapping arbitrary objects into a space of a given dimension while preserving their algebraic properties, in order to structure text data, is called "embedding". Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as demand for document embedding has increased rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into a single vector, is the most widely used.
However, traditional document embedding methods, represented by doc2Vec, generate a vector for each document using all the words in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single vector, which makes it difficult to represent a complex document with multiple subjects accurately. In this paper, we propose a new multi-vector document embedding method to overcome these limitations.
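The interference among subjects in a single vector can be illustrated with a toy calculation. The sketch below is not doc2Vec: it assumes, purely for illustration, that a document vector is the mean of its word vectors, and the 2-D vectors are made-up values chosen so that two subjects lie on different axes.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical word vectors for a document covering two unrelated subjects.
topic_a = [[1.0, 0.0], [0.9, 0.1]]
topic_b = [[0.0, 1.0], [0.1, 0.9]]

# Single-vector representation (assumption: plain mean pooling over all words).
single = mean_vector(topic_a + topic_b)

# Per-subject representations, one vector per subject.
centroid_a = mean_vector(topic_a)
centroid_b = mean_vector(topic_b)

sim_single_a = cosine(single, centroid_a)
sim_single_b = cosine(single, centroid_b)
print(single, round(sim_single_a, 2), round(sim_single_b, 2))
```

With these toy values the pooled vector is [0.5, 0.5], and its cosine similarity to each subject centroid is roughly 0.74, so it sits between the two subjects and represents neither of them closely.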
This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after keywords are extracted by some other analysis method. However, since keyword extraction is not the core subject of the proposed method, we describe how the method is applied to documents with predefined keywords.
The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. First, all text in a document is tokenized, and each token is represented as an N-dimensional real-valued vector through word embedding. Then, to overcome the limitation that traditional document embedding is affected by miscellaneous words as well as core words, the vectors corresponding to each document's keywords are extracted to form a set of keyword vectors for that document.
Next, clustering is performed on the set of keywords of each document to identify the multiple subjects the document contains. Finally, a multi-vector is generated from the keyword vectors that make up each cluster.
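The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm or the rule for turning a cluster into a vector, so the sketch assumes a greedy cosine-threshold clustering and a per-cluster mean. The `EMBEDDINGS` table, the `threshold` parameter, and all function names are hypothetical stand-ins (a real pipeline would train word embeddings on the corpus).

```python
import math

# Hypothetical pre-trained word embeddings (toy 3-D values, not from the paper).
EMBEDDINGS = {
    "embedding": [0.9, 0.1, 0.0],
    "vector": [0.8, 0.2, 0.0],
    "clustering": [0.1, 0.9, 0.0],
    "topic": [0.2, 0.8, 0.1],
    "the": [0.3, 0.3, 0.3],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_keywords(keyword_vectors, threshold):
    """Assumed clustering rule: join the first cluster whose centroid is
    at least `threshold` similar, otherwise start a new cluster."""
    clusters = []
    for word, vec in keyword_vectors.items():
        for cluster in clusters:
            centroid = mean_vector([v for _, v in cluster])
            if cosine(vec, centroid) >= threshold:
                cluster.append((word, vec))
                break
        else:
            clusters.append([(word, vec)])
    return clusters


def multi_vector_embedding(body, keywords, threshold=0.8):
    tokens = body.lower().split()  # (1) Parsing
    word_vectors = {t: EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS}  # (2) Word Embedding
    keyword_vectors = {k: word_vectors[k] for k in keywords if k in word_vectors}  # (3) Keyword Vector Extraction
    clusters = cluster_keywords(keyword_vectors, threshold)  # (4) Keyword Clustering
    # (5) Multiple-Vector Generation (assumption: one mean vector per cluster)
    return [mean_vector([v for _, v in cluster]) for cluster in clusters]


doc_vectors = multi_vector_embedding(
    "the embedding vector the clustering topic",
    ["embedding", "vector", "clustering", "topic"],
)
print(doc_vectors)
```

In this toy run the four keywords fall into two clusters, so the document is represented by two vectors, and the non-keyword token "the" does not contribute to either of them.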
Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects within each vector. With the proposed multi-vector method, we confirmed that complex documents can be vectorized more accurately by eliminating this interference.
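복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론 (Multi-Vector Document Embedding Methodology through Semantic Decomposition of Complex Documents)
박종인 (Jongin Park, Graduate School of Business IT, Kookmin University)
김남규 (Namgyu Kim, Graduate School of Business IT, Kookmin University)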
Keywords
문서 임베딩 (Document Embedding), 다중 벡터 문서 임베딩 (Multi-Vector Document Embedding), 단어 임베딩 (Word Embedding), 텍스트 마이닝 (Text Mining)
Abstract
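To support diverse analyses of text data, research on methods for structuring unstructured text data has recently been active. Because traditional document embedding methods, represented by doc2Vec, build a vector from every word in a document, the document vector is influenced not only by core words but also by peripheral words. In addition, because traditional document embedding methods represent each document as a single vector, they have difficulty accurately mapping complex documents that combine multiple topics. In this paper, we propose a new multi-vector document embedding methodology to overcome these two limitations. Specifically, the proposed methodology vectorizes a document using only its core words rather than all of its words, and decomposes the multiple topics contained in a document so that a single document is represented as a set of vectors. Through experiments on a total of 3,147 papers collected from KISS, we observed vector distortion when a complex document is represented as a single vector, and confirmed that the proposed methodology, which semantically decomposes a complex document and represents it as multiple vectors, corrects this distortion and embeds each document more accurately.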
Cite this article
JIIS Style
Park, J., and N. Kim, "Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents", Journal of Intelligence and Information Systems, Vol. 25, No. 3 (2019), 19~41.

IEEE Style
Jongin Park, and Namgyu Kim, "Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents", Journal of Intelligence and Information Systems, vol. 25, no. 3, pp. 19~41, 2019.

ACM Style
Park, J., and Kim, N., 2019. Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents. Journal of Intelligence and Information Systems. 25, 3, 19--41.
Export Formats: BibTeX, EndNote

@article{Park:JIIS:2019:780,
author = {Park, Jongin and Kim, Namgyu},
title = {Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents},
journal = {Journal of Intelligence and Information Systems},
issue_date = {September 2019},
volume = {25},
number = {3},
month = Sep,
year = {2019},
issn = {2288-4866},
pages = {19--41},
url = {},
doi = {},
publisher = {Korea Intelligent Information System Society},
address = {Seoul, Republic of Korea},
keywords = {Document Embedding, Multi-Vector Document Embedding, Word Embedding, Text Mining},
}
%0 Journal Article
%1 780
%A Jongin Park
%A Namgyu Kim
%T Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents
%J Journal of Intelligence and Information Systems
%@ 2288-4866
%V 25
%N 3
%P 19-41
%D 2019
%R
%I Korea Intelligent Information System Society