Publications

Active learning for hierarchical wrapper induction

Abstract

Information mediators that allow users to integrate data from several Web sources rely on wrappers that extract the relevant data from the Web documents. Wrappers turn collections of Web pages into database-like tables by applying a set of extraction rules to each individual document. Even though the extraction rules can be written by humans, this is undesirable because the process is tedious, time-consuming, and requires a high level of expertise. As an alternative to manually writing extraction rules, we created STALKER (Muslea, Minton, & Knoblock 1999), a wrapper induction algorithm that learns high-accuracy extraction rules. The major novelty introduced by STALKER is the concept of hierarchical wrapper induction: the extraction of the relevant data is performed in a hierarchical manner based on the embedded catalog tree (ECT), which is a user-provided description of the information to be extracted. Consider the sample document <html> Name: Joe’s <br> Cuisine: American
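
As a rough illustration of how a wrapper turns pages into table rows by applying one extraction rule per field, the following Python sketch may help. It is not the STALKER rule language or its learning algorithm: the rule format (a start landmark and an end landmark), the names `extract_between`, `RULES`, and `wrap`, the closing `<br>` on each page, and the second page are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple


def extract_between(page: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first `start` landmark and the next `end`.

    Hypothetical, hand-written rule format used only for illustration.
    """
    begin = page.find(start)
    if begin == -1:
        return None  # the start landmark does not occur in this page
    begin += len(start)
    stop = page.find(end, begin)
    if stop == -1:
        stop = len(page)  # no end landmark: take the rest of the page
    return page[begin:stop].strip()


# One (start landmark, end landmark) pair per field; assumed, not learned.
RULES: Dict[str, Tuple[str, str]] = {
    "name": ("Name:", "<br>"),
    "cuisine": ("Cuisine:", "<br>"),
}


def wrap(pages: List[str]) -> List[Dict[str, Optional[str]]]:
    """Apply every rule to every page, producing one table row per page."""
    return [
        {field: extract_between(page, start, end)
         for field, (start, end) in RULES.items()}
        for page in pages
    ]


# The first page follows the fragment quoted in the abstract (with an assumed
# trailing <br>); the second page is invented for the example.
pages = [
    "<html> Name: Joe's <br> Cuisine: American <br>",
    "<html> Name: Luigi's <br> Cuisine: Italian <br>",
]
table = wrap(pages)
```

Here the rules are written by hand, which is exactly the manual step the abstract describes as tedious; a wrapper induction algorithm would instead learn such rules from examples.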

Date
July 18, 1999
Authors
Ion Muslea, Steven Minton, Craig Knoblock
Conference
AAAI/IAAI
Pages
975