Library and Information Science

Library and Information Science ISSN: 2435-8495
三田図書館・情報学会 Mita Society for Library and Information Science
〒108‒8345 東京都港区三田2‒15‒45 慶應義塾大学文学部図書館・情報学専攻内 c/o Keio University, 2-15-45 Mita, Minato-ku, Tokyo 108-8345, Japan
Library and Information Science 47: 27-38 (2002)

短報Brief Communication

大規模文献集合に対して階層的クラスタ分析法を適用するための単連結アルゴリズムA single-link method algorithm for clustering large document collections

駿河台大学文化情報学部Surugadai University ◇ 〒357-0046 埼玉県飯能市阿須698番地 ◇ Azu 698, Hanno, Saitama 357-8555, Japan

受付日:2003年1月20日Received: January 20, 2003
受理日:2003年3月14日Accepted: March 14, 2003
発行日:2003年12月1日Published: December 1, 2003

In the 1960s and 1970s, techniques for clustering a set of documents, in order to improve the effectiveness or efficiency of information retrieval systems, have been widely explored. Similar attempts have recently been made by many researchers to allow the visualisation of search results, to provide browsing based search modes or to enhance performance in searching very large collections. The purpose of this paper is to develop an algorithm for hierarchical clustering that can work for very large document collections. The algorithm is based on a combination of two ideas proposed by other researchers to save time and space in the process of hierarchical clustering; (1) the use of an inverted file for reducing the number of document pairs for which a similarity degree is calculated, and (2) a procedure for constructing a dendrogram based on single-link method from similarity data recorded on disk and not the main memory. In this paper, the algorithm is experimentally applied to a document set consisting of about 10,000 bibliographic records, and the processing time is analyzed empirically. In addition, the effects of removing words frequently appearing in documents are examined. As a result, we find that removing such words enable us to greatly reduce the processing time without significant change in .the resulting set of clusters. Finally, an empirical comparison between the single-link method and the single-pass algorithm (leader-follower algorithm) is attempted.

This page was created on 2021-01-19T10:19:31.733+09:00
This page was last modified on