Enhanced Hypertext Categorization Using Hyperlinks.

Soumen Chakrabarti, Byron Dom, Piotr Indyk: Enhanced Hypertext Categorization Using Hyperlinks. SIGMOD Conference 1998: 307-318
@inproceedings{DBLP:conf/sigmod/ChakrabartiDI98,
  author    = {Soumen Chakrabarti and
               Byron Dom and
               Piotr Indyk},
  editor    = {Laura M. Haas and
               Ashutosh Tiwary},
  title     = {Enhanced Hypertext Categorization Using Hyperlinks},
  booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
               on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
  publisher = {ACM Press},
  year      = {1998},
  isbn      = {0-89791-995-5},
  pages     = {307-318},
  ee        = {http://doi.acm.org/10.1145/276304.276332, db/conf/sigmod/ChakrabartiDI98.html},
  crossref  = {DBLP:conf/sigmod/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost on a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
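
As an illustration only, the general idea of relaxation labeling over a link neighborhood can be sketched in Python as below. This is a generic textbook-style loop, not the statistical model from the paper: the function name `relaxation_labeling`, the class-compatibility matrix `compat`, the toy documents, and all probabilities are assumptions made up for this example. Documents with known topics are clamped, and each remaining document repeatedly rescales its text-only class estimate by how well each candidate class agrees with the current estimates of its linked neighbors.

```python
def relaxation_labeling(text_probs, neighbors, known, compat, iterations=10):
    """Refine per-document class distributions using linked neighbors.

    text_probs: doc -> class probabilities from a text-only classifier
    neighbors:  doc -> list of linked docs
    known:      doc -> class index for documents whose topic is given
    compat:     compat[c][d] is an ASSUMED probability that a class-c
                document links to a class-d document (hypothetical values)
    """
    num_classes = len(compat)

    def one_hot(c):
        return [1.0 if k == c else 0.0 for k in range(num_classes)]

    # Start from the text-only estimates; documents with known topics are clamped.
    probs = {
        doc: one_hot(known[doc]) if doc in known else list(p)
        for doc, p in text_probs.items()
    }

    for _ in range(iterations):
        updated = {}
        for doc, p_text in text_probs.items():
            if doc in known:
                updated[doc] = probs[doc]
                continue
            scores = []
            for c in range(num_classes):
                score = p_text[c]
                # Multiply in the support each neighbor lends to class c.
                for nb in neighbors.get(doc, []):
                    score *= sum(
                        compat[c][d] * probs[nb][d] for d in range(num_classes)
                    )
                scores.append(score)
            total = sum(scores)
            updated[doc] = [s / total for s in scores] if total > 0 else probs[doc]
        probs = updated
    return probs


# Toy data: the text-only classifier slightly prefers class 1 for "X",
# but both of X's linked neighbors are known to belong to class 0.
text_probs = {"A": [0.5, 0.5], "B": [0.5, 0.5], "X": [0.4, 0.6]}
neighbors = {"X": ["A", "B"]}
known = {"A": 0, "B": 0}
compat = [[0.8, 0.2], [0.2, 0.8]]  # assumed: links mostly stay within a class

result = relaxation_labeling(text_probs, neighbors, known, compat)
print(result["X"])  # class 0 now outweighs class 1 for "X"
```

In this toy setting the neighbors with known topics override a weak text-only preference, which mirrors the abstract's point that the benefit depends on how many neighboring documents have known topics.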

Copyright © 1998 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.


Printed Edition

Laura M. Haas, Ashutosh Tiwary (Eds.): SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press 1998, ISBN 0-89791-995-5, SIGMOD Record 27(2), June 1998

Referenced by

  1. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, Eli Upfal: The Web as a Graph. PODS 2000: 1-10
  2. Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: Data Mining and the Web: Past, Present and Future. Workshop on Web Information and Data Management 1999: 43-47
  3. Ke Wang, Senqiang Zhou, Shiang Chen Liew: Building Hierarchical Classifiers Using Class Proximity. VLDB 1999: 363-374
  4. Soumen Chakrabarti, Martin van den Berg, Byron Dom: Distributed Hypertext Resource Discovery Through Examples. VLDB 1999: 375-386
  5. Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan: Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies. VLDB J. 7(3): 163-178(1998)