Integrated query of the Hidden Web
Last modified: 2009-08-27
Abstract
There is a need for software that can access multiple Web sites in the same domain through a single, common interface. This would allow users, for example, to compare flights for a particular trip across all relevant airline sites by posing a single query. This paper reports on an investigation into automating this process in the case where Web sites comprise query forms accessing airline databases hidden behind the Web (the so-called Deep Web or Hidden Web). One approach is to automatically construct queries that attempt to extract all data from these hidden databases in order to create a local copy that can subsequently be accessed. In our work we used the alternative approach of creating an integrated query interface for users to pose specific queries, and evaluated mechanisms for translating these into corresponding queries on each of the relevant Hidden Web interfaces. Methods of checking, interpreting and merging result pages were also investigated. We first constructed a prototype which provided integrated querying of a handful of pre-determined airline sites. This proved useful in detecting commonalities and differences in the sites, and in selecting the most suitable technologies for working with multiple forms. A generic system was then designed and components of the prototype were incrementally replaced in order to gauge the extent to which the approach could handle arbitrary sites in the domain. Our results for the airline domain were promising for result interpretation, with 89% of response pages successfully handled. However, query formulation presented many problems, with only 39% of query interfaces automatically interpreted correctly, and even fewer amenable to automated query propagation. This paper describes the overall architecture of our system, the techniques employed for querying and for result interpretation, and the problems encountered.
We present a review of 55 airlines offering travel to African destinations, along with our findings on performance and on the analysis of their query forms and result pages. We conclude that automated access to the Hidden Web is considerably more challenging than automatically consolidating information on the Surface Web.