Htmcleaner: Extracting Relevant Text from Web
Girardi, Christian
2007-01-01
Abstract
In this paper we present Htmcleaner, a tool for automatically cleaning HyperText Markup Language (HTML) files. It removes HTML tags and irrelevant text, such as the words of navigation menus and the headers and footers shared by all pages of a site. It then reformats the remaining relevant text with a basic encoding of the page structure, using a minimal set of symbols to mark the beginning of headers, paragraphs and list elements.
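The cleaning steps named in the abstract (strip tags, drop text shared by every page of a site, mark headers, paragraphs and list elements with a few symbols) can be illustrated with a minimal Python sketch. It is not Htmcleaner's implementation. The marker symbols `#`, `@` and `*`, the names `TextExtractor`, `extract_blocks` and `clean_site`, and the rule "drop a block if it occurs on every page" are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical markers: the abstract does not say which symbols Htmcleaner uses.
MARKERS = {"h1": "#", "h2": "#", "h3": "#", "h4": "#", "h5": "#", "h6": "#",
           "p": "@", "li": "*"}
SKIPPED = {"script", "style"}  # content that is never readable text


class TextExtractor(HTMLParser):
    """Collects (marker, text) pairs for headers, paragraphs and list items."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (marker, text) pairs
        self._marker = None    # marker of the block being read, if any
        self._parts = []       # text fragments of the block being read
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in MARKERS:
            self.flush()
            self._marker = MARKERS[tag]

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in MARKERS:
            self.flush()

    def handle_data(self, data):
        # Simplification: text outside a marked block is discarded.
        if not self._skip_depth and self._marker is not None:
            self._parts.append(data)

    def flush(self):
        """Close the current block, normalising its whitespace."""
        text = " ".join("".join(self._parts).split())
        if self._marker is not None and text:
            self.blocks.append((self._marker, text))
        self._marker = None
        self._parts = []


def extract_blocks(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def clean_site(pages):
    """Return one cleaned text per page, dropping blocks found on every page.

    Assumed heuristic: a block repeated on all pages of a site is treated
    as navigation, header or footer text.
    """
    per_page = [extract_blocks(html) for html in pages]
    counts = Counter(block for blocks in per_page for block in set(blocks))
    cleaned = []
    for blocks in per_page:
        kept = [f"{marker} {text}" for marker, text in blocks
                if len(pages) == 1 or counts[(marker, text)] < len(pages)]
        cleaned.append("\n".join(kept))
    return cleaned
```

Calling `clean_site` on a list of HTML strings from the same site returns one plain-text string per page, one marked block per line. With a single page, nothing can be identified as shared, so every block is kept.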