HLT Web Manager

Girardi, Christian

This report describes the functionality of the HLT Web Manager toolkit. It should quickly enable you to [1] crawl HTML pages of interest from the Web into an optimized archive (WebDownload stores web pages in a compressed archive); [2] specify, with regular expressions, which URLs the crawler may or may not follow and which it may or may not store; and [3] extract information about web pages from the archive and display it in XML format (by default, the URL, the encoding and the download date are saved with the page content). The tool can detect and extract relevant content in any language. In addition, Web Manager can download not only a website's text but also other encoded content (e.g. images, PDF and video files). By default, the toolkit follows the links found in the <a> tags of each downloaded page, as well as any other URL retrieved from JSON and, where possible, from JavaScript code.
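The regular-expression filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the toolkit's code or configuration format, which this record does not show: the class `AnchorCollector`, the helper `allowed`, the four rule lists and the `example.org` URLs are all hypothetical. The sketch collects links from `<a>` tags only, then decides independently which URLs to follow and which to store.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def allowed(url, include, exclude):
    """True if url matches some include pattern and no exclude pattern."""
    return any(p.search(url) for p in include) and not any(
        p.search(url) for p in exclude
    )


# Hypothetical rules: the real toolkit's rule syntax is an assumption here.
FOLLOW_INCLUDE = [re.compile(r"^https?://example\.org/")]
FOLLOW_EXCLUDE = [re.compile(r"/login")]
STORE_INCLUDE = [re.compile(r"^https?://example\.org/news/.*\.html$")]
STORE_EXCLUDE = [re.compile(r"/news/archive/")]

page = (
    '<a href="/news/a.html">A</a> '
    '<a href="/login?next=/">Login</a> '
    '<a href="http://other.net/x.html">X</a> '
    '<a href="/about">About</a>'
)

collector = AnchorCollector("http://example.org/index.html")
collector.feed(page)

# Following and storing are decided separately for each URL.
to_follow = [u for u in collector.links if allowed(u, FOLLOW_INCLUDE, FOLLOW_EXCLUDE)]
to_store = [u for u in collector.links if allowed(u, STORE_INCLUDE, STORE_EXCLUDE)]
```

With these made-up rules, the crawler would follow the news page and the about page but store only the news page. Keeping the two decisions separate lets a page serve as a navigation hub without being archived.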

2011-01-01

Year

2011

Appears in categories:

5.12 Other


Use this identifier to cite or link to this document: https://hdl.handle.net/11582/23969
