Some principles of weighting methods based on word frequencies for automatic indexing

Bin Umino

doi:10.46895/lis.26.67

Library and Information Science

Library and Information Science ISSN: 2435-8495

Mita Society for Library and Information Science

c/o Keio University, 2-15-45 Mita, Minato-ku, Tokyo 108-8345, Japan
https://mslis.jp/ E-mail: mita-slis@ml.keio.jp

Library and Information Science 26: 67-88 (1988)

Original Article

Some principles of weighting methods based on word frequencies for automatic indexing

Bin Umino

Graduate School of Education, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan

Received: January 21, 1989

Published: March 25, 1989


Characteristics of the occurrence frequency of words in natural language texts have been used as an indicator for the selection of significant words in automatic indexing. This paper describes some general principles common to term weighting methods which use occurrence frequency measures.

For this purpose, nearly sixty weighting formulas were collected from documents published over the past thirty years. Their theoretical characteristics were then analyzed and compared with each other. As a result, these formulas were classified into the following five categories: 1) absolute frequency measures, 2) two kinds of relative frequency measures, 3) word dispersion measures, 4) the 2-Poisson model proposed by Harter, and 5) measures based on information theory similar to that proposed by Shannon.

Various mathematical relations peculiar to the formulas of each category were found. These relations were well explained by a model consisting of two word sets, one of which is subsumed by the other; that is, the significance of a word depends on how unevenly its occurrences are concentrated in the subsumed word set.
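
As a rough illustration of the two-set model described above, the following Python sketch compares a word's relative frequency in a subsumed word set (for example, one document) with its relative frequency in the subsuming set (for example, the whole collection). It is not one of the formulas analyzed in the paper. The function names `relative_frequency` and `concentration_ratio`, the ratio itself, and the toy data are assumptions made only for illustration.

```python
from collections import Counter


def relative_frequency(word, counts):
    """Occurrences of `word` divided by the total number of tokens."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def concentration_ratio(word, subset_tokens, superset_tokens):
    """Hypothetical weight (an assumption, not the paper's formula):
    relative frequency in the subsumed set divided by relative
    frequency in the subsuming set."""
    sub = Counter(subset_tokens)
    sup = Counter(superset_tokens)
    base = relative_frequency(word, sup)
    return relative_frequency(word, sub) / base if base else 0.0


# Toy data invented for illustration: one document inside a small collection.
document = "index term index weight term".split()
collection = document + "the of the term and the of and the of".split()

# Weight every distinct word of the document against the collection.
weights = {w: concentration_ratio(w, document, collection) for w in set(document)}
```

In this toy data, "index" receives a higher weight than "term", because every occurrence of "index" falls inside the document while "term" also occurs elsewhere in the collection.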
