Extracting Schema from Semistructured Data
Svetlozar Nestorov, Serge Abiteboul, Rajeev Motwani
Full Paper (PDF)

Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.


References, where available, link to the DBLP on the World Wide Web.

Serge Abiteboul: Querying Semi-Structured Data. ICDT 1997: 1-18
Serge Abiteboul, Richard Hull, Victor Vianu: Foundations of Databases. Addison-Wesley 1995, ISBN 0-201-53771-0
Antonio Albano, R. Bergamini, Giorgio Ghelli, Renzo Orsini: An Object Data Model with Roles. VLDB 1993: 39-51
Peter Buneman: Semistructured Data. PODS 1997: 117-121
Peter Buneman, Susan B. Davidson, Mary F. Fernandez, Dan Suciu: Adding Structure to Unstructured Data. ICDT 1997: 336-350
Peter Buneman, Susan B. Davidson, Gerd G. Hillebrand, Dan Suciu: A Query Language and Optimization Techniques for Unstructured Data. SIGMOD Conf. 1996: 505-516
R. G. G. Cattell: The Object Database Standard: ODMG-93 (Release 1.1). Morgan Kaufmann 1994
Roy Goldman, Jennifer Widom: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB 1997: 436-445
Svetlozar Nestorov, Jeffrey D. Ullman, Janet L. Wiener, Sudarshan S. Chawathe: Representative Objects: Concise Representations of Semistructured, Hierarchial Data. ICDE 1997: 79-90
Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Jennifer Widom: Querying Semistructured Heterogeneous Information. DOOD 1995: 319-344
Dan Suciu: Query Decomposition and View Maintenance for Query Languages for Unstructured Data. VLDB 1996: 227-238
Jeffrey D. Ullman: Principles of Database and Knowledge-Base Systems, Volume I. Computer Science Press 1988, ISBN 0-7167-8158-1
Jeffrey D. Ullman: Principles of Database and Knowledge-Base Systems, Volume II. Computer Science Press 1989, ISBN 0-7167-8162-X
Moshé M. Zloof: Query-by-Example: A Data Base Language. IBM Systems Journal 16(4): 324-343(1977)

author = {Svetlozar Nestorov and
Serge Abiteboul and
Rajeev Motwani},
editor = {Laura M. Haas and
Ashutosh Tiwary},
title = {Extracting Schema from Semistructured Data},
booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
publisher = {ACM Press},
year = {1998},
isbn = {0-89791-955-5},
pages = {295-306},
crossref = {DBLP:conf/sigmod/98},
bibsource = {DBLP, http://dblp.uni-trier.de}

DBLP: Copyright ©1999 by Michael Ley (ley@uni-trier.de).