In this paper, we present a software clustering approach that leverages the information conveyed by the zone in which each lexeme appears in the classes of object oriented systems. We define six zones in the source code: Class Name, Attribute Name, Method Name, Parameter Name, Comment, and Source Code Statement. These zones may convey information with different levels of relevance, and so their contribution should be differently weighed according to the software system under study. To this aim, we define a probabilistic model of the lexemes distribution whose parameters are automatically estimated by the Expectation-Maximization algorithm. The weights of the zones are then exploited to compute similarities among source code classes, which are then grouped by a k-Medoid clustering algorithm. To assess the validity of our solution in the software architecture recovery field, we applied our approach to 19 software systems from different application domains. We observed that the use of our probabilistic model and the defined zones improves the quality of clustering results so that they are close to a theoretical upper bound we have proved.

Weighing lexical information for software clustering in the context of architecture recovery

Maggio, Valerio;
2016-01-01

Abstract

In this paper, we present a software clustering approach that leverages the information conveyed by the zone in which each lexeme appears in the classes of object oriented systems. We define six zones in the source code: Class Name, Attribute Name, Method Name, Parameter Name, Comment, and Source Code Statement. These zones may convey information with different levels of relevance, and so their contribution should be differently weighed according to the software system under study. To this aim, we define a probabilistic model of the lexemes distribution whose parameters are automatically estimated by the Expectation-Maximization algorithm. The weights of the zones are then exploited to compute similarities among source code classes, which are then grouped by a k-Medoid clustering algorithm. To assess the validity of our solution in the software architecture recovery field, we applied our approach to 19 software systems from different application domains. We observed that the use of our probabilistic model and the defined zones improves the quality of clustering results so that they are close to a theoretical upper bound we have proved.
File in questo prodotto:
File Dimensione Formato  
PrintedPaper.pdf

solo utenti autorizzati

Descrizione: Articolo principale
Tipologia: Documento in Post-print
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 1.07 MB
Formato Adobe PDF
1.07 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11582/307367
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
social impact