Some principles of weighting methods based on word frequencies for automatic indexing
Doctoral Program, Graduate School of Education, University of Tokyo ◇ Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan
Characteristics of the occurrence frequency of words in natural language texts have been used as an indicator for the selection of significant words in automatic indexing. This paper describes some general principles common to term weighting methods which use occurrence frequency measures.
For this purpose, nearly sixty weighting formulas were collected from documents published over the past thirty years, and their theoretical characteristics were analyzed and compared. As a result, the formulas were classified into the following five categories: 1) absolute frequency measures, 2) two kinds of relative frequency measures, 3) word dispersion measures, 4) the 2-Poisson model proposed by Harter, and 5) information-theoretic measures similar to the one proposed by Shannon.
Various mathematical relations peculiar to the formulas of each category were found. These relations were well explained by a model consisting of two kinds of word sets, one of which is subsumed by the other; that is, the significance of a word depends on the degree to which its occurrences are concentrated in the subsumed word set.
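The two-set model can be illustrated with a minimal sketch. It is not one of the formulas surveyed in the paper: the function name `skew_toward_subset`, the toy texts, and the choice of a simple ratio of relative frequencies are all assumptions made for illustration. The subsumed set is taken to be one document and the subsuming set the whole collection.

```python
from collections import Counter


def skew_toward_subset(word, subset_tokens, superset_tokens):
    """Hypothetical significance score (an illustrative assumption).

    Returns the word's relative frequency in the subsumed set divided by
    its relative frequency in the subsuming set. A value above 1 means
    the word's occurrences are concentrated in the subsumed set.
    """
    if not subset_tokens or not superset_tokens:
        return 0.0
    sup_count = Counter(superset_tokens)[word]
    if sup_count == 0:
        return 0.0
    rel_sub = Counter(subset_tokens)[word] / len(subset_tokens)
    rel_sup = sup_count / len(superset_tokens)
    return rel_sub / rel_sup


# Toy data: the document is the first half of the collection.
collection = "the cat sat on the mat the dog sat on the log".split()
document = collection[:6]

scores = {w: skew_toward_subset(w, document, collection)
          for w in ("cat", "the", "dog")}
```

Under these assumptions, "cat" occurs only in the document and scores above 1, "the" is spread evenly and scores about 1, and "dog" never occurs in the document and scores 0.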
© 1988 Mita Society for Library and Information Science