DIGITAL LIBRARY ARCHIVE
A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification
Jae-Seong Lee (University of Science & Technology)
Seung-Pyo Jun (Div. of Data Analysis, Korea Institute of Science & Technology Information/ University of Science & Technology)
Hyoung Sun Yoo (Div. of Data Analysis, Korea Institute of Science & Technology Information/ University of Science & Technology)
Vol. 24, No. 3, Page: 221 ~ 241
10.13088/jiis.2018.24.3.221
Keywords
Automatic Document Classification, Korea Standard Industry Classification, Text mining, Vector space model, Natural language processing
Abstract
As we enter the knowledge society, information is increasingly emphasized as a new form of capital. Classifying information is also becoming more important for efficiently managing digital information, which is being produced at an exponential rate. In this study, we aim to automatically classify and provide tailored information that can support companies' decisions on technology commercialization. To that end, we propose a method for classifying information based on the Korea Standard Industry Classification (KSIC), which indicates the business characteristics of enterprises. Document classification has largely been studied using machine learning, but there is not enough training data categorized by KSIC.
Therefore, this study applies a method based on calculating the similarity between documents. Specifically, we propose a method and a model that collect the explanatory text of each KSIC code, use the vector space model to calculate its similarity to the document being classified, and present the most appropriate KSIC code. To verify the methodology, we collected IPC data, classified it by KSIC, and compared the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. The highest agreement was obtained when the LT method, a variant of the TF-IDF formula, was applied. In that case, the agreement of the first-ranked KSIC match was 53%, and the cumulative agreement through the fifth rank was 76%. These results confirm that the technology, industry, and market information that SMEs need can be classified by KSIC in a more quantitative and objective way. In addition, the methods and results of this study may serve as basic data that support experts' qualitative judgment when creating concordance tables between heterogeneous classification systems.
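The similarity-based ranking and the rank-based agreement figures described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not define the LT formula, so the sketch assumes it means a log-scaled term frequency multiplied by an inverse document frequency. Whitespace tokenization, the function names, and the toy code labels and descriptions are all hypothetical; real KSIC explanatory texts in Korean would likely need morphological analysis.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase whitespace tokens.
    return text.lower().split()


def build_idf(token_lists):
    # idf(t) = log(N / df(t)), computed over the code descriptions only.
    n_docs = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def weight(tokens, idf):
    # Assumed "LT" weighting: (1 + log tf) * idf. Unknown terms are dropped.
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_codes(document, code_descriptions, top_k=5):
    # Return the top_k code labels, most similar description first.
    codes = list(code_descriptions)
    token_lists = [tokenize(code_descriptions[c]) for c in codes]
    idf = build_idf(token_lists)
    vectors = [weight(tokens, idf) for tokens in token_lists]
    query = weight(tokenize(document), idf)
    scored = [(cosine(query, vec), code) for vec, code in zip(vectors, codes)]
    scored.sort(key=lambda pair: -pair[0])
    return [code for _, code in scored[:top_k]]


def cumulative_match(ranked_by_doc, reference_by_doc, k):
    # Share of documents whose top-k ranked codes include a reference code.
    hits = sum(
        1
        for doc_id, ranked in ranked_by_doc.items()
        if reference_by_doc[doc_id] & set(ranked[:k])
    )
    return hits / len(ranked_by_doc)


# Toy data, invented for illustration only (not actual KSIC or IPC text).
TOY_DESCRIPTIONS = {
    "code_food": "manufacture of food products and beverages",
    "code_metal": "manufacture of basic metal products and casting",
    "code_software": "software publishing and computer programming services",
}
TOY_RANKING = rank_codes("casting of metal alloys", TOY_DESCRIPTIONS, top_k=3)
```

Calling `cumulative_match` with `k=1` and `k=5` over all classified documents would yield figures analogous to the first-rank and fifth-rank cumulative agreement reported above, given a reference mapping such as a concordance table.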
Detailed Information in Korean
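A Study on an Automatic Classification Model for Documents Based on the Korean Standard Industrial Classification (한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구)
Jae-Seong Lee (이재성) (University of Science & Technology)
Seung-Pyo Jun (전승표) (Div. of Data Analysis, Korea Institute of Science & Technology Information / Dept. of Science and Technology Management and Policy, University of Science & Technology)
Hyoung Sun Yoo (유형선) (Div. of Data Analysis, Korea Institute of Science & Technology Information / Dept. of Science and Technology Management and Policy, University of Science & Technology)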
Keywords
Automatic document classification, Korean Standard Industrial Classification, text mining, vector space model, natural language processing
Abstract
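As we enter the knowledge society, the importance of information as a new form of capital is being emphasized. The importance of information classification is also growing for the efficient management of digital information, which is produced exponentially. To automatically classify and provide tailored information that can support companies' technology commercialization decisions, this study proposes a method for classifying information based on the Korean Standard Industrial Classification (hereafter "KSIC"), which represents the business characteristics of a company. Methods for classifying information or documents have mostly been studied on the basis of machine learning, but because there is not enough training data classified by KSIC, this study applies an approach that calculates the similarity between documents. Specifically, we present a method and a model that collect the explanatory text for each KSIC code, use the vector space model to calculate its similarity to the document being classified, and suggest the most suitable KSIC code. We then collected IPC data, classified it by KSIC, and verified the methodology by comparing the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. The verification showed the highest agreement when the LT method, a variant of the TF-IDF formula, was applied: for the IPC explanatory texts, the agreement of the first-ranked KSIC match was 53%, and the cumulative agreement through the fifth rank was 76%. This confirms that KSIC classification of the technology, industry, and market information that small and medium-sized enterprises need can be carried out more quantitatively and objectively. In addition, the method and results of this study are expected to serve as basic data that support experts' qualitative judgment when creating concordance tables between heterogeneous classification systems.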
Cite this article
JIIS Style
Lee, J.-S., S.-P. Jun, and H. S. Yoo, "A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification", Journal of Intelligence and Information Systems, Vol. 24, No. 3 (2018), 221~241.

IEEE Style
Jae-Seong Lee, Seung-Pyo Jun, and Hyoung Sun Yoo, "A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification", Journal of Intelligence and Information Systems, vol. 24, no. 3, pp. 221~241, 2018.

ACM Style
Lee, J.-S., Jun, S.-P., and Yoo, H. S., 2018. A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification. Journal of Intelligence and Information Systems. 24, 3, 221--241.
Export Formats: BibTeX, EndNote

@article{Lee:JIIS:2018:745,
author = {Lee, Jae-Seong and Jun, Seung-Pyo and Yoo, Hyoung Sun},
title = {A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification},
journal = {Journal of Intelligence and Information Systems},
issue_date = {September 2018},
volume = {24},
number = {3},
month = Sep,
year = {2018},
issn = {2288-4866},
pages = {221--241},
url = {http://dx.doi.org/10.13088/jiis.2018.24.3.221},
doi = {10.13088/jiis.2018.24.3.221},
publisher = {Korea Intelligent Information System Society},
address = {Seoul, Republic of Korea},
keywords = {Automatic Document Classification, Korea Standard Industry Classification, Text mining, Vector space model, Natural language processing},
}
%0 Journal Article
%1 745
%A Jae-Seong Lee
%A Seung-Pyo Jun
%A Hyoung Sun Yoo
%T A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification
%J Journal of Intelligence and Information Systems
%@ 2288-4866
%V 24
%N 3
%P 221-241
%D 2018
%R 10.13088/jiis.2018.24.3.221
%I Korea Intelligent Information System Society