HLT Web Manager
Girardi, Christian
2011-01-01
Abstract
This report illustrates the functionalities of the HLT Web Manager toolkit. It should quickly enable you to (1) crawl HTML pages of interest from the Web into an optimized archive (WebDownload stores web pages in a compressed archive); (2) specify, with regular expressions, which URLs the crawler may or may not follow and may or may not store; and (3) extract information about web pages from the archive and display it in XML format (by default, the URL, the encoding, and the download date are saved with the page content). The tool can detect and extract relevant content in any language. In addition, Web Manager lets the user download not only a website's text but also any other encoded content (e.g. images, PDF, video files). By default, the toolkit follows the links found in the <a> tags of the downloaded page, as well as any other URL retrieved from JSON and, where possible, from JavaScript code.
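To make the follow/store distinction concrete, here is a minimal Python sketch of how links taken from <a> tags might be filtered with regular expressions. It does not use the HLT Web Manager's own interface or configuration syntax, which this abstract does not describe. Every name in it (AnchorCollector, allowed, classify_links, and the rule layout as a pair of include and exclude pattern lists) is a hypothetical choice made for illustration, and it covers only <a> links, not URLs found in JSON or JavaScript.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def allowed(url, include, exclude):
    """Return True if url matches an include pattern and no exclude pattern."""
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return any(re.search(pattern, url) for pattern in include)


def classify_links(html, base_url, follow_rules, store_rules):
    """Split the <a> links of a page into URLs to follow and URLs to store.

    Each rule set is a hypothetical (include_patterns, exclude_patterns) pair.
    """
    parser = AnchorCollector(base_url)
    parser.feed(html)
    to_follow = [u for u in parser.links if allowed(u, *follow_rules)]
    to_store = [u for u in parser.links if allowed(u, *store_rules)]
    return to_follow, to_store
```

In this sketch the two rule sets are kept independent, so a page can be followed for its links without being stored, or stored without its links being followed, which mirrors the "follow (or not) and store (or not)" wording above.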