Web site evolution is characterized by a limited support to the understanding activities offered to the developers. In fact, design diagrams are often missing or outdated. A potentially interesting option is to reverse engineer high level views of Web sites from the content of the Web pages. Clustering is a valuable technique that can be used in this respect. Web pages can be clustered together based on the similarity of summary information about their content, represented as a list of automatically extracted keywords. This paper presents an empirical study that was conducted to determine the meaningfulness for Web developers of clusters automatically produced from the analysis of the Web page content. Natural Language Processing (NLP) plays a central role in content analysis and keyword extraction. Thus, a second objective of the study was to assess the contribution of some shallow NLP techniques to the clustering task.

Using Keyword Extraction for Web Site Clustering

Tonella, Paolo;Ricca, Filippo;Pianta, Emanuele;Girardi, Christian
2003-01-01

Abstract

Web site evolution is characterized by a limited support to the understanding activities offered to the developers. In fact, design diagrams are often missing or outdated. A potentially interesting option is to reverse engineer high level views of Web sites from the content of the Web pages. Clustering is a valuable technique that can be used in this respect. Web pages can be clustered together based on the similarity of summary information about their content, represented as a list of automatically extracted keywords. This paper presents an empirical study that was conducted to determine the meaningfulness for Web developers of clusters automatically produced from the analysis of the Web page content. Natural Language Processing (NLP) plays a central role in content analysis and keyword extraction. Thus, a second objective of the study was to assess the contribution of some shallow NLP techniques to the clustering task.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11582/959
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
social impact