A Layered Architecture for Querying Dynamic Web Content

Hasan Davulcu       Juliana Freire*       Michael Kifer
University at Stony Brook       Bell Labs       University at Stony Brook
davulcu@cs.sunysb.edu       juliana@research.bell-labs.com       kifer@cs.sunysb.edu

I.V. Ramakrishnan
University at Stony Brook
ram@cs.sunysb.edu

Abstract

There is growing interest in developing web-based applications that allow end users to shop around for products and services on the web without performing a series of tedious manual form fillouts. The design of database systems to support such applications (called {\em webbases}, a term coined by the Araneus project) is an active area of current database research. A particularly challenging problem is designing webbases for querying dynamic web content, that is, data that can only be extracted through multiple form fillouts.

In this paper we propose a layered architecture for designing and implementing dynamic webbases, which closely corresponds to the traditional layering of database systems. In our architecture, the lowest layer, which we call the {\em virtual physical layer}, provides {\em navigation independence} by shielding the user from the complexities of retrieving data from the raw web sources. Next up is the {\em logical layer}, which is akin to the traditional logical database layer but, in addition, provides {\em site independence}. The {\em conceptual layer} is functionally analogous to the corresponding layer in traditional databases.
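The layering described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the system's actual interface: the class names, the attribute-renaming scheme, the two site names, and the canned car data that stands in for real form fillouts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


@dataclass
class VirtualPhysicalRelation:
    """Virtual physical layer: hides how a single site is navigated."""
    site: str
    navigate: Callable[[Row], List[Row]]  # stand-in for real form fillouts

    def fetch(self, inputs: Row) -> List[Row]:
        return self.navigate(inputs)


@dataclass
class LogicalRelation:
    """Logical layer: one site-independent relation over several sites."""
    name: str
    # Each source pairs a site relation with a map: logical attr -> site attr.
    sources: List[Tuple[VirtualPhysicalRelation, Dict[str, str]]]

    def fetch(self, inputs: Row) -> List[Row]:
        rows: List[Row] = []
        for relation, to_site in self.sources:
            to_logical = {site: logical for logical, site in to_site.items()}
            # Translate the bound inputs into the site's own vocabulary.
            site_inputs = {to_site[k]: v for k, v in inputs.items()}
            for row in relation.fetch(site_inputs):
                # Translate extracted rows back into logical attribute names.
                rows.append({to_logical[k]: v for k, v in row.items()})
        return rows


def conceptual_query(relation: LogicalRelation,
                     select: List[str], where: Row) -> List[Row]:
    """Conceptual layer: the user names attributes only, never sites."""
    return [{a: row[a] for a in select} for row in relation.fetch(where)]


# Hypothetical canned data in place of live web navigation.
def site_a(inputs: Row) -> List[Row]:
    cars = [{"make": "honda", "cost": "9000"}, {"make": "ford", "cost": "7000"}]
    return [c for c in cars if all(c[k] == v for k, v in inputs.items())]


def site_b(inputs: Row) -> List[Row]:
    cars = [{"brand": "honda", "price": "8500"}]
    return [c for c in cars if all(c[k] == v for k, v in inputs.items())]


used_cars = LogicalRelation("used_cars", [
    (VirtualPhysicalRelation("site-a.example", site_a),
     {"make": "make", "price": "cost"}),
    (VirtualPhysicalRelation("site-b.example", site_b),
     {"make": "brand", "price": "price"}),
])

result = conceptual_query(used_cars, select=["price"], where={"make": "honda"})
print(result)
```

In this sketch, navigation independence corresponds to the caller never seeing `navigate`, and site independence corresponds to the query mentioning only the logical attributes `make` and `price`, even though the two hypothetical sites name them differently.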

We show that the proposed layered architecture allows us to automate the process of data extraction and querying of dynamic web content to a much greater degree than previous proposals. In particular, our approach makes it possible to create all necessary wrappers for the virtual physical schema semi-automatically, by simply asking the webbase designer to navigate through the sites of interest. We call this approach {\em mapping by example}. Thus, the designer of a webbase need not have expertise in the language that maps the physical schema to the raw web (we use a subset of {\em Transaction F-logic} to describe such mappings). This should be contrasted with other approaches, which require expertise in various web-enabling flavors of SQL. Furthermore, our architecture lets us take advantage of the vast body of work on schema integration and simplify the task of mapping the logical layer to the virtual physical layer.

For the conceptual layer, we propose a variant of the universal relation interface, which we call the {\em hierarchical universal relation}. We argue that this interface provides powerful, yet reasonably simple, ad hoc querying capabilities for the end user (e.g., a web shopper) compared to the currently prevailing ``canned'' form-based interfaces on the one hand and complex web-enabling extensions of SQL on the other. Finally, we discuss the lessons we learned while designing and implementing our system.