Publications
Active learning for hierarchical wrapper induction
Abstract
Information mediators that allow users to integrate data from several Web sources rely on wrappers that extract the relevant data from the Web documents. Wrappers turn collections of Web pages into database-like tables by applying a set of extraction rules to each individual document. Even though the extraction rules can be written by humans, this is undesirable because the process is tedious, time-consuming, and requires a high level of expertise. As an alternative to manually writing extraction rules, we created STALKER (Muslea, Minton, & Knoblock 1999), a wrapper induction algorithm that learns high-accuracy extraction rules. The major novelty introduced by STALKER is the concept of hierarchical wrapper induction: the extraction of the relevant data is performed in a hierarchical manner based on the embedded catalog tree (ECT), which is a user-provided description of the information to be extracted. Consider the sample document: <html> Name: Joe’s <br> Cuisine: American
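To make the idea of a wrapper concrete, here is a minimal sketch in Python of applying extraction rules to the sample document from the abstract. It is an illustration under stated assumptions, not STALKER's actual rule language or learning algorithm: the function names `extract_between` and `apply_wrapper`, the landmark-pair rule format, and the flat `ECT` dictionary (a one-level stand-in for an embedded catalog tree) are all hypothetical. The rules here are written by hand, whereas STALKER learns them.

```python
# Hypothetical illustration only: the rule format and all names below are
# assumptions made for this sketch, not STALKER's actual rule language.

SAMPLE = "<html> Name: Joe's <br> Cuisine: American"


def extract_between(text, start_landmark, end_landmark=None):
    """Return the text after start_landmark, up to end_landmark or end of text."""
    start = text.find(start_landmark)
    if start == -1:
        return None  # the rule does not match this document
    start += len(start_landmark)
    end = text.find(end_landmark, start) if end_landmark else -1
    if end == -1:
        end = len(text)  # no closing landmark: take the rest of the document
    return text[start:end].strip()


# A flat, one-level stand-in for an embedded catalog tree: each field to be
# extracted maps to a (start landmark, end landmark) pair.
ECT = {
    "name": ("Name:", "<br>"),
    "cuisine": ("Cuisine:", "<br>"),
}


def apply_wrapper(doc, ect):
    """Apply every field's rule to one document, yielding one table row."""
    return {field: extract_between(doc, s, e) for field, (s, e) in ect.items()}


row = apply_wrapper(SAMPLE, ECT)
print(row)
```

Applying `apply_wrapper` to every page in a collection would yield one row per page, which is the sense in which a wrapper turns Web pages into a database-like table.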
- Date: July 18, 1999
- Authors: Ion Muslea, Steven Minton, Craig Knoblock
- Conference: AAAI/IAAI
- Pages: 975