HLT Web Manager

Girardi, Christian
2011-01-01

Abstract

This report illustrates the functionality of the HLT Web Manager toolkit. It should quickly enable you to [1] crawl interesting HTML pages from the Web into an optimized archive (WebDownload stores web pages in a compressed archive); [2] specify, with regular expressions, the URLs that the crawler may or may not follow and may or may not store; and [3] extract information about web pages from the archive and display it in XML format (by default, the URL, the encoding and the download date are saved with the page content). The tool can detect and extract relevant content in any language. In addition, Web Manager allows the user to download not only the text of a website but also other encoded content (e.g. images, PDF, video files). By default, the toolkit follows the links found in the <a> tags of the downloaded page and any other URL retrieved from JSON and, where possible, from JavaScript code.
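The regular-expression filtering described in the abstract can be illustrated with a minimal Python sketch. This is not the toolkit's own code or interface: the names `AnchorCollector`, `allowed`, `classify_links`, `FOLLOW` and `STORE`, the rule format and the example URLs are all assumptions made for illustration. The sketch only shows the general idea of collecting links from <a> tags and deciding, with separate include and exclude patterns, whether each URL would be followed and whether it would be stored.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def allowed(url, include, exclude):
    """Return True if url matches an include pattern and no exclude pattern."""
    if any(re.search(p, url) for p in exclude):
        return False
    return any(re.search(p, url) for p in include)


def classify_links(html, base_url, follow_rules, store_rules):
    """Return (url, follow?, store?) for every <a> link found in html."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    return [
        (url, allowed(url, *follow_rules), allowed(url, *store_rules))
        for url in parser.links
    ]


# Hypothetical rules, each a pair (include patterns, exclude patterns).
FOLLOW = ([r"^https?://example\.org/"], [r"/login"])
STORE = ([r"/news/"], [r"\.(jpg|png)$"])

page = (
    '<a href="/news/1.html">news</a> '
    '<a href="/login">login</a> '
    '<a href="http://other.net/x">other</a>'
)
result = classify_links(page, "http://example.org/", FOLLOW, STORE)
for url, follow, store in result:
    print(url, "follow" if follow else "skip", "store" if store else "discard")
```

Keeping the follow and store decisions separate mirrors the distinction drawn in the abstract: a page may be traversed to discover further links without being saved, or saved without its links being followed.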
Use this identifier to cite or link to this document: https://hdl.handle.net/11582/23969