
Htmcleaner: Extracting Relevant Text from Web

Girardi, Christian
2007-01-01

Abstract

In this paper we present Htmcleaner, a tool for automatically cleaning HyperText Markup Language (HTML) files. It removes HTML tags and irrelevant text, such as the words of navigation menus and the header and footer shared by all pages of a site. It also reformats the relevant text it finds, encoding the basic structure of the page with a minimal set of symbols that mark the beginning of headers, paragraphs, and list elements.
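
To make the tag-stripping and structure-marking idea concrete, here is a minimal sketch in Python built on the standard library's `html.parser`. It is not Htmcleaner's implementation: the marker symbols (`# `, `~ `, `* `), the class `TextExtractor`, and the function `extract_text` are all hypothetical names chosen for illustration, since the abstract does not say which symbols the tool uses. The sketch also makes no attempt to detect navigation menus or headers and footers shared across a site, which would require comparing several pages.

```python
from html.parser import HTMLParser

# Hypothetical markers: the abstract does not specify Htmcleaner's actual symbols.
MARKERS = {"p": "~ ", "li": "* "}
MARKERS.update({f"h{level}": "# " for level in range(1, 7)})

# Elements whose text content is never shown to a reader.
SKIPPED = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text found inside headers, paragraphs, and list items."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished output lines
        self._marker = None    # marker of the block currently open, if any
        self._parts = []       # text pieces collected for the open block
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in MARKERS:
            self._flush()
            self._marker = MARKERS[tag]

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in MARKERS:
            self._flush()

    def handle_data(self, data):
        # Text outside a marked block (or inside a skipped element) is dropped.
        if self._skip_depth == 0 and self._marker is not None:
            self._parts.append(data)

    def _flush(self):
        # Join inline fragments, then collapse runs of whitespace.
        text = " ".join("".join(self._parts).split())
        if self._marker is not None and text:
            self.blocks.append(self._marker + text)
        self._marker = None
        self._parts = []


def extract_text(html):
    """Return the marked-up plain text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit a block left open by unclosed markup
    return "\n".join(parser.blocks)
```

Calling `extract_text("<h1>Title</h1><p>Hello <b>world</b>!</p>")` yields two lines, `# Title` and `~ Hello world!`. Inline tags such as `<b>` are discarded while their text is kept.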


Use this identifier to cite or link to this document: https://hdl.handle.net/11582/3417