A Layered Architecture for Querying Dynamic Web Content
| Hasan Davulcu | Juliana Freire* | Michael Kifer | I.V. Ramakrishnan |
| University at Stony Brook | Bell Labs | University at Stony Brook | University at Stony Brook |
| davulcu@cs.sunysb.edu | juliana@research.bell-labs.com | kifer@cs.sunysb.edu | ram@cs.sunysb.edu |
In this paper we propose a layered architecture for designing and implementing dynamic webbases, which closely corresponds to the traditional layering of database systems. In our architecture, the lowest layer, which we call the {\em virtual physical layer}, provides {\em navigation independence}: it shields the user from the complexities associated with retrieving data from the raw web sources. Above it is the {\em logical layer}, which is akin to the traditional logical database layer but, in addition, provides {\em site independence}. The {\em conceptual layer} is functionally analogous to the corresponding layer in traditional databases.
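The layering described above can be illustrated with a minimal Python sketch. It is not the system presented in this paper: the class and function names, the two car-listing sites, and their rows are all hypothetical, and the wrappers return canned rows instead of navigating real pages. The sketch only shows how each layer hides one kind of detail from the layer above it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


@dataclass
class VirtualPhysicalRelation:
    """Virtual physical layer: hides how a site's pages are navigated."""
    site: str
    fetch: Callable[[], List[Row]]  # hypothetical stand-in for a site wrapper

    def scan(self) -> List[Row]:
        return self.fetch()


@dataclass
class LogicalRelation:
    """Logical layer: one site-independent relation over several sites."""
    name: str
    sources: List[VirtualPhysicalRelation]
    # Per-site attribute renaming into the shared logical schema.
    renames: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def scan(self) -> List[Row]:
        rows: List[Row] = []
        for source in self.sources:
            mapping = self.renames.get(source.site, {})
            for row in source.scan():
                rows.append({mapping.get(k, k): v for k, v in row.items()})
        return rows


def conceptual_query(
    relations: List[LogicalRelation],
    select: List[str],
    where: Dict[str, str],
) -> List[Tuple[str, ...]]:
    """Conceptual layer (simplified): the user names attributes only."""
    needed = set(select) | set(where)
    answers: List[Tuple[str, ...]] = []
    for relation in relations:
        for row in relation.scan():
            if needed.issubset(row) and all(row[k] == v for k, v in where.items()):
                answers.append(tuple(row[a] for a in select))
    return answers


# Hypothetical sites with canned rows; a real wrapper would navigate pages.
site_a = VirtualPhysicalRelation("site_a", lambda: [
    {"make": "Honda", "model": "Civic", "price": "9000"},
    {"make": "Ford", "model": "Focus", "price": "7000"},
])
site_b = VirtualPhysicalRelation("site_b", lambda: [
    {"manufacturer": "Honda", "model": "Accord", "cost": "12000"},
])
cars = LogicalRelation(
    "cars",
    [site_a, site_b],
    {"site_b": {"manufacturer": "make", "cost": "price"}},
)
result = conceptual_query([cars], select=["model", "price"], where={"make": "Honda"})
```

In this sketch the query mentions neither a site nor a navigation path, yet `result` contains the Honda listings from both hypothetical sites. The attribute-only query function is a much simplified stand-in for the conceptual-layer interface and does not model its hierarchical structure.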
We show that the proposed layered architecture allows us to automate the process of data extraction and querying of dynamic web content to a much greater degree than previous proposals have offered. In particular, our approach makes it possible to create all necessary wrappers for the virtual physical schema semi-automatically, by simply asking the webbase designer to navigate through the sites of interest. We call this approach {\em mapping by example}. Thus, the designer of a webbase need not have expertise in the language that maps the physical schema to the raw Web (we use a subset of {\em Transaction F-logic} to describe such mappings). This should be contrasted with other approaches, which require expertise in various web-enabling flavors of SQL. Furthermore, our architecture lets us take advantage of the vast body of work on schema integration and simplify the task of mapping the logical layer to the virtual physical layer.
For the conceptual layer, we propose a variant of the universal relation interface, which we call the {\em hierarchical universal relation}. We argue that this interface provides powerful, yet reasonably simple, ad hoc querying capabilities for the end user (e.g., a web shopper), compared with the currently prevailing ``canned'' form-based interfaces on the one hand and complex web-enabling extensions of SQL on the other. Finally, we discuss the lessons we learned in designing and implementing our system.