Duplicate Removal in Information System Dissemination.

Tak W. Yan, Hector Garcia-Molina: Duplicate Removal in Information System Dissemination. VLDB 1995: 66-77
  author    = {Tak W. Yan and
               Hector Garcia-Molina},
  editor    = {Umeshwar Dayal and
               Peter M. D. Gray and
               Shojiro Nishio},
  title     = {Duplicate Removal in Information System Dissemination},
  booktitle = {VLDB'95, Proceedings of 21th International Conference on Very
               Large Data Bases, September 11-15, 1995, Zurich, Switzerland},
  publisher = {Morgan Kaufmann},
  year      = {1995},
  isbn      = {1-55860-379-4},
  pages     = {66-77},
  ee        = {db/conf/vldb/YanG95.html},
  crossref  = {DBLP:conf/vldb/95},
  bibsource = {DBLP}


Our experience with the SIFT [YGM95] information dissemination system (in use by over 7,000 users daily) has identified an important and generic dissemination problem: duplicate information. In this paper we explain why duplicates arise, we quantify the problem, and we discuss why it impairs information dissemination. We then propose a Duplicate Removal Module (DRM) for an information dissemination system. The removal of duplicates operates on a per user, per document basis - each document read by a user generates a request, or a duplicate restraint. In wide-area environments, the number of restraints handled is very large. We consider the implementation of a DRM, examining alternative algorithms and data structures that may be used. We present a performance evaluation of the alternatives and answer important design questions such as: Which implementation is the best? With the "best" scheme, how expensive will duplicate removal be? How much memory is required? How fast can restraints be processed?

Copyright © 1995 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Printed Edition

Umeshwar Dayal, Peter M. D. Gray, Shojiro Nishio (Eds.): VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland. Morgan Kaufmann 1995, ISBN 1-55860-379-4


References

Sergey Brin, James Davis, Hector Garcia-Molina: Copy Detection Mechanisms for Digital Documents. SIGMOD Conference 1995: 398-409
Tim Berners-Lee, Robert Cailliau, Jean-François Groff, Bernd Pollermann: World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy 1(2): 74-82(1992)
Pankaj Goyal: Duplicate record identification in bibliographic databases. Inf. Syst. 12(3): 239-242(1987)
Shoshana Loeb, Douglas B. Terry: Information Filtering - Preface to the Special Section. Commun. ACM 35(12): 26-28(1992)
Narayanan Shivakumar, Hector Garcia-Molina: SCAM: A Copy Detection Mechanism for Digital Documents. DL 1995: 0-
Tak W. Yan, Hector Garcia-Molina: Index Structures for Information Filtering Under the Vector Space Model. ICDE 1994: 337-347
Tak W. Yan, Hector Garcia-Molina: Index Structures for Selective Dissemination of Information Under the Boolean Model. ACM Trans. Database Syst. 19(2): 332-364(1994)
Tak W. Yan, Hector Garcia-Molina: SIFT - a Tool for Wide-Area Information Dissemination. USENIX Winter 1995: 177-186

Referenced by

  1. Tak W. Yan, Hector Garcia-Molina: Efficient Dissemination of Information on the Internet. IEEE Data Eng. Bull. 19(3): 48-54(1996)
VLDB Proceedings: Copyright © by VLDB Endowment
ACM SIGMOD Anthology: Copyright © by ACM
DBLP: Copyright © by Michael Ley, last change: Sat May 16 23:46:04 2009