Monday 8 October 2007

Master of Science - Finally! : )

I successfully defended my master of science thesis on Friday the 5th of October 2007. The title of the thesis is "Methods for Analysis of Research Related Data in the IST World Application".

I am grateful to my mentors, Acad. Prof. Dr. Ivan Bratko from the University of Ljubljana and Doc. Dr. Dunja Mladenič from the Jožef Stefan Institute. The other members of the committee before which I defended the thesis were Prof. Dr. Hans Uszkoreit (University of Saarbrücken and DFKI) and Prof. Dr. Blaž Zupan (University of Ljubljana).

In the thesis I describe the machine learning and data mining algorithms used in the IST World Application for two tasks: data integration (solving the record linkage problem) and data analysis (identifying and tracking topics of work and communities of collaboration). A short abstract follows:

Abstract: In this master's thesis we implement and apply (1) machine learning algorithms to support the integration of data coming from different data sources and (2) data mining algorithms for the analysis of research related data to support the partner search process in the knowledge transfer scenario. We begin by giving an overview of the IST World portal, an online information system we developed to support partner search in the knowledge transfer process. The portal, accessible at http://www.ist-world.org, is the environment in which the described algorithms for data integration and data analysis are put to use. We then describe the machine learning methods applied to integrate research related data, which originates from several data sources, into a single integrated dataset. We developed an integration approach based on state-of-the-art data analysis methods such as text mining, inverted indexing and active learning for solving the record linkage problem in the scope of the IST World system. The approach was empirically evaluated in an experiment on integrating research related data from the European CORDIS database of research projects. The second part of the thesis is centered on the analysis of research related data for the purpose of supporting the partner search process. The goal of the developed and applied data mining algorithms is to automatically identify and track the topics of work and the collaboration communities of the analyzed research actors. We describe the text and graph mining algorithms used to identify competences, consortia, competence development and consortium development. We conclude by illustrating the effectiveness of these algorithms in several experiments and by showing that the results agree with human intuition.

Keywords: data integration, record linkage, duplicate detection, machine learning, text mining, string kernel, edit distance, active learning, support vector machine, partner search, data mining, data visualization, competence, consortium, latent semantic indexing, singular value decomposition, multidimensional scaling and clustering
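
For readers unfamiliar with record linkage, here is a minimal Python sketch of the general idea behind two of the techniques named above: an inverted index to narrow down candidate pairs, followed by an edit-distance comparison. This is not the code from the thesis. The function names, the sample records and the 0.8 threshold are all made up for illustration, and the active learning and support vector machine parts are left out entirely.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def candidate_pairs(records: dict) -> set:
    """Use an inverted index (token -> record ids) so that only records
    sharing at least one token are compared."""
    index = defaultdict(set)
    for rid, name in records.items():
        for token in set(name.lower().split()):
            index[token].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def link_records(records: dict, threshold: float = 0.8) -> list:
    """Return candidate pairs whose similarity reaches the (assumed) threshold."""
    return sorted(
        (a, b)
        for a, b in candidate_pairs(records)
        if similarity(records[a].lower(), records[b].lower()) >= threshold
    )


# Hypothetical sample records, invented for this example.
records = {
    1: "Jozef Stefan Institute",
    2: "Jozef Stefan Institut",
    3: "University of Ljubljana",
    4: "Stefan Banach Center",
}
print(link_records(records))
```

The inverted index keeps the number of comparisons down: record 3 shares no token with the others, so it is never compared against them, while record 4 is compared but falls below the threshold.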