大規模文献集合に対して階層的クラスタ分析法を適用するための単連結アルゴリズムA single-link method algorithm for clustering large document collections
駿河台大学文化情報学部Surugadai University ◇ 〒357-0046 埼玉県飯能市阿須698番地 ◇ Azu 698, Hanno, Saitama 357-8555, Japan
駿河台大学文化情報学部Surugadai University ◇ 〒357-0046 埼玉県飯能市阿須698番地 ◇ Azu 698, Hanno, Saitama 357-8555, Japan
In the 1960s and 1970s, techniques for clustering a set of documents, in order to improve the effectiveness or efficiency of information retrieval systems, have been widely explored. Similar attempts have recently been made by many researchers to allow the visualisation of search results, to provide browsing based search modes or to enhance performance in searching very large collections. The purpose of this paper is to develop an algorithm for hierarchical clustering that can work for very large document collections. The algorithm is based on a combination of two ideas proposed by other researchers to save time and space in the process of hierarchical clustering; (1) the use of an inverted file for reducing the number of document pairs for which a similarity degree is calculated, and (2) a procedure for constructing a dendrogram based on single-link method from similarity data recorded on disk and not the main memory. In this paper, the algorithm is experimentally applied to a document set consisting of about 10,000 bibliographic records, and the processing time is analyzed empirically. In addition, the effects of removing words frequently appearing in documents are examined. As a result, we find that removing such words enable us to greatly reduce the processing time without significant change in .the resulting set of clusters. Finally, an empirical comparison between the single-link method and the single-pass algorithm (leader-follower algorithm) is attempted.
© 2002 三田図書館・情報学会© 2002 Mita Society for Library and Information Science
This page was created on 2021-01-19T10:19:31.733+09:00
This page was last modified on
このサイトは(株)国際文献社によって運用されています。