
A Benchmarking Methodology



1 Methodology

Research is a process largely relying on self-organization. In a way, communicating research is itself a negotiation process aimed at reaching an agreement about knowledge, who holds this knowledge, and what its value is. Large research networks therefore naturally form complex communities with overlapping groups of varying sizes. An individual’s work within this network depends on the whole and on their connections to others. It is not least through this influence and identification with sub-groups that the network interacts.

Both professional and rich professional competences of the actors are necessary for such a research network to be successful (cf. Rychen & Salganik, 2003). Professional competences relate to the expertise of the network: they refer to the potential to construct domain-specific knowledge within the network or within parts of it. Rich professional competences transcend these domains and encompass, amongst others, social competence, self-competence, and methodological competence. Social competence, for example, refers to the potential to undertake (collaborative) actions in order to identify, manage, and master conflicts (see Erpenbeck & Rosenstiel, 2003).

1.1 Methodology, General

Evidence for inspecting the characteristics and for estimating the distribution of professional competence can be seen reflected in the various research outputs produced along the scientific value-creation chain: peer-reviewed media such as conference, journal, and workshop publications, and, in recent years, also non-peer-reviewed online articles and blog postings. Prize-winning and keynote activities provide more strongly weighted data sources. Funding data and information about joint projects can complement the picture.

Evidence about the distribution of rich professional competences, on the other hand, cannot be derived directly from the artefacts produced. It has to be inferred indirectly from the actions and relationships of the protagonists of the network: most notably their collaboration in authoring, but also their joint attendance at events and meetings, or their affiliation with organisations and special interest groups, can serve this purpose.

This analytical work focuses on creating higher professional awareness of the shifts that the field of technology-enhanced learning is experiencing over time: new topics emerge, old research strands end, major topics become minor, and minor ones become major.

For this contribution, we have therefore employed qualitative and quantitative analysis methodologies to depict the changes in the semantic structure of the field of technology-enhanced learning, as exposed in one representative, large-scale TEL conference: ED-MEDIA (see section 3).

ED-MEDIA started as an international conference on ‘Educational Multimedia, Hypermedia & Telecommunications’ in 1993, as a follow-up to an earlier series of International Conferences on Computers and Learning (ICCAL), which started in 1987. The idea was to create a multi-disciplinary forum for the discussion of research and development in this area. It is one of the larger and more international conferences in the field, with regularly about 1,500 participants from some 70 different countries.

For this initial, simple analysis we have set aside journal publications: although journals have a higher quality profile in the scientific value chain, their output cycle is much slower. So whilst they may reflect a more mature consensus around field themes, journal articles are harder to tie to specific points in time. High-quality journals, in particular, can have very heterogeneous and often very slow publication cycles.

1.2 Benchmarking via Topics

We suggest that the terminology used in the titles of research articles generally captures the essence of the contribution. When monitoring a larger number of contributions over time, changes in the use of this terminology can be detected that reflect shifting interests in the topics covered by a research network. These changes can be of very different kinds.

Quantitatively, the growth of the dictionary over the years analysed is of interest. Overlaps in terminology, together with newly introduced and disappearing keywords (aka ‘terms’), mark quantitatively how the terminology shrinks or grows. Bursts in frequency facilitate the detection of the shifting relevance of keywords, especially among the medium-frequent ones, which are considered the most semantically discriminative.

Qualitatively, the structure of the semantic relationships in this dictionary is of interest: some terms are closer to others thus allowing us to see the dictionary as a semantic network: nodes are keywords and links represent their weighted semantic closeness. In such a network, the keyword nodes are strongly connected if they are semantically close to each other – thus mimicking the conceptual structures while at the same time allowing for machine supported analysis.

There are many competing models to automatically determine the ‘closeness’ of terms with the help of natural language processing. Among these models, latent-semantic analysis (Landauer et al., 1990; Wild, 2006), an extension of the classical vector space model (Salton et al., 1975), has been shown to provide high performance. Mapping the terminology into a lower-dimensional, ‘latent-semantic’ vector space makes it possible to measure the geometrical distance between terms as a proxy for their semantic closeness. For the two years 2000 and 2008, a comparison of the resulting graph structures has been conducted (see section 3).
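The latent-semantic step can be illustrated as follows. The actual analysis uses the R package lsa; this is only a toy Python sketch on invented frequency counts, showing how a truncated singular value decomposition projects terms into a low-dimensional space where cosine similarity approximates semantic closeness.

```python
# Toy latent-semantic analysis sketch (invented counts, not the ED-MEDIA data):
# project a term-document frequency matrix into a lower-dimensional space via
# truncated SVD, then compare terms by cosine similarity in that space.
import numpy as np

# Rows = terms, columns = document titles (toy frequencies).
terms = ["learning", "multimedia", "hypermedia", "assessment"]
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 2, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                          # number of singular values kept (truncation)
term_vecs = U[:, :k] * s[:k]   # term coordinates in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(term_vecs[0], term_vecs[1])  # 'learning' vs 'multimedia'
```

The choice of k is the crucial free parameter; the study estimates it with the dimcalc-share heuristic rather than fixing it by hand.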

The publication data of the ED-MEDIA conference series was partitioned by year. The titles of the contributions were sanitised by stripping all non-alphanumeric characters, converting all remaining words to lower case, and removing all words shorter than two characters. A lower frequency threshold was then introduced, eliminating all words appearing only twice or less in the titles. On the other side of the frequency spectrum, English stopwords (see Wild, 2008) were removed to keep pronouns and other functional terms such as 'the' or 'it' from distorting the analyses. The remaining vocabulary was aggregated along word stems using Porter's snowball stemmer (see Temple Lang, 2009).

The first step of the analysis focuses on the change in terminology. Therefore, the frequency of the words of this remaining vocabulary in the document titles is computed. Subsequently, the yearly change in vocabulary use is assessed: a tabulation of the normalised frequencies gives insight into the nature of the terminology (depicted by bar plots and log density curves). When comparing the tabulations of two years, the terms that disappeared, the terms that are new, and changes in distribution can be assessed. The changes in distribution can reveal diminishing or strengthened roles of keywords via simple burst detection, i.e. a significant increase or decrease in usage frequency in the comparison data set. Pseudo-documents reflecting the frequency distribution of keywords in the classes 'new', 'gone', 'diminished', and 'enforced' are created and used as input for word-cloud diagrams that render higher frequencies in larger letters.
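The sanitisation and vocabulary-comparison steps can be sketched roughly as below. This is a simplified Python sketch, not the authors' R pipeline: a tiny inline stopword set stands in for the full published list, raw counts stand in for normalised frequencies, and stemming is omitted.

```python
# Simplified sketch of title sanitisation and yearly vocabulary comparison.
import re
from collections import Counter

# Tiny inline stopword set; the study used a full published English list.
STOPWORDS = {"the", "it", "and", "of", "for", "in", "a", "an", "with"}

def sanitise(titles, min_freq=3):
    """Strip non-alphanumerics, lower-case, drop short, stop, and rare words."""
    counts = Counter()
    for title in titles:
        words = re.sub(r"[^0-9a-z ]", " ", title.lower()).split()
        counts.update(w for w in words if len(w) >= 2 and w not in STOPWORDS)
    # the study eliminates words appearing only twice or less
    return Counter({w: c for w, c in counts.items() if c >= min_freq})

def classify(freq_a, freq_b):
    """Partition the vocabulary of two years into the four change classes."""
    new = set(freq_b) - set(freq_a)
    gone = set(freq_a) - set(freq_b)
    shared = set(freq_a) & set(freq_b)
    enforced = {w for w in shared if freq_b[w] > freq_a[w]}
    diminished = {w for w in shared if freq_b[w] < freq_a[w]}
    return new, gone, enforced, diminished
```

The four returned sets correspond directly to the pseudo-documents fed into the word-cloud diagrams.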

In a second step, a cluster analysis is applied to the document titles in order to identify meaningful core components. For each year, a latent-semantic space is calculated by conducting and truncating a singular value decomposition over a frequency table with the sanitised vocabulary of that year in the rows and the document titles in the columns, so that each cell holds the frequency of a keyword in a document. As an estimator of a reasonable number of singular values to keep, dimcalc-share (Wild et al., 2005) is used. For each year, a distance matrix using cosine distances is calculated, which serves as input to the divisive clustering algorithm Diana (Maechler, 2008). In the resulting cluster hierarchy, a reasonable cut-off point is estimated visually using the dendrogram, and the tree is cut into a set of clusters. For each cluster in each year, the graph component of this cluster is extracted, and a separate network plot (see Butts, Hunter, & Handcock, 2008) is created, effectively linking the closest terms (using all positive cosine distances as a weighted proxy). The 'backbone' structure of the component interactions is calculated by focusing on the single maxima in the directed incidence matrix of the cosine distances.

The deployed software is the language and environment R with the packages lsa (Wild, 2008), network (Butts, Hunter, & Handcock, 2008), sna (Butts, 2007), and cluster (Maechler, 2008). The analyses' R source code is available upon request from the authors.
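The cosine-distance matrix and the 'backbone' extraction can be sketched as follows. Again this is an illustrative Python analogue of the R workflow, not the authors' code: it builds pairwise cosine similarities between term vectors and then keeps, for every term, only the link to its single closest neighbour.

```python
# Sketch of the cosine matrix and the row-maximum 'backbone' described above.
import numpy as np

def cosine_matrix(term_vecs):
    """Pairwise cosine similarities between term vectors (one per row)."""
    unit = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    return unit @ unit.T

def backbone(sim):
    """Keep, for every term, only the edge to its single closest neighbour."""
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)   # ignore self-similarity
    return [(i, int(j)) for i, j in enumerate(s.argmax(axis=1))]
```

In the full analysis the distance matrix is instead handed to the divisive clustering algorithm Diana, and the backbone is drawn per cluster component.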

1.3 Benchmarking via Authorship & Citations

Like any scientific venue, ED-MEDIA can be described through a set of characteristics such as the number of published papers, the number of authors, the distribution of papers per author or community, co-citation networks, etc. These characteristics are studied by the fields of bibliometrics and scientometrics (Osareh, 1996; Osareh, 1996a; Hood & Wilson, 2001). Bibliometrics measures the production and consumption of scientific material, while scientometrics focuses on measuring the production process. Together, these two sub-fields of informetrics have developed the methodologies and tools needed to gain deeper insight into the characteristics of conferences, journals, and particular research groups.

The main sources for the data analysed in section 4 were the Digital Library of the AACE and the proceedings CDs of the ED-MEDIA conference. The first source, the Digital Library, was used to extract the title, authors, and affiliation of each paper from ED-MEDIA 1999 to 2008. The second source, the proceedings CDs, was used to extract the bibliographical references of each paper from ED-MEDIA 2005 to 2008.

The extraction of the data from the Digital Library was done by web-scraping the ED-MEDIA pages (AACE, 2008). Small Java scripts were used to convert those pages from HTML to structured XML documents. For each paper, the list of authors, affiliation, country, and year was extracted. This information is mainly used in the authorship and co-authorship analyses.
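The original conversion scripts were written in Java; a minimal Python sketch of the same idea is shown below. The markup and the class names `title`, `author`, and `country` are hypothetical stand-ins for the actual Digital Library pages.

```python
# Sketch of HTML-to-structured-record extraction on hypothetical markup.
from html.parser import HTMLParser

class PaperParser(HTMLParser):
    """Collect text from <span> elements whose class marks a metadata field."""
    FIELDS = {"title", "author", "country"}

    def __init__(self):
        super().__init__()
        self.field = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") in self.FIELDS:
            self.field = attrs["class"]

    def handle_data(self, data):
        if self.field:
            self.record.setdefault(self.field, []).append(data.strip())
            self.field = None

page = ('<span class="title">Adaptive Hypermedia</span>'
        '<span class="author">A. Smith</span>')
parser = PaperParser()
parser.feed(page)
```

A real scraper would of course have to match the actual page structure and handle malformed markup; the point here is only the HTML-to-record mapping.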

The main problem faced in the extraction was the variability in the affiliation names of the authors. In some cases, the affiliation included the department, faculty, university, city, and country; in others, there was just the country. In addition, the way in which names were formatted varied across ED-MEDIA editions: initials for the middle name, for example, are sometimes included and sometimes not. Most of these problems were solved through the use of text-similarity algorithms such as edit distance (Ristad & Yianilos, 1998), or by removing obvious errors with a list of reserved words (for example, the words ‘University’ or ‘Department’ appearing in the author-name field). The extraction error rate, evaluated against human expert judgement on a sample of 100 papers, was below 5%.
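Both cleanup techniques can be sketched compactly. The edit distance here is the plain Levenshtein variant rather than the learned distance of Ristad & Yianilos, and the reserved-word list is truncated to two illustrative entries.

```python
# Sketch of the two cleanup steps: edit distance and reserved-word removal.
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Reserved words whose presence in an author-name field flags an obvious error.
RESERVED = {"university", "department"}

def clean_author(name):
    return " ".join(w for w in name.split() if w.lower() not in RESERVED)
```

Two affiliation strings whose edit distance falls below a chosen threshold can then be merged into a single canonical form.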

The extraction of the bibliographical references for each paper from the proceedings CD was considerably more complex. First, the content table was obtained from the START.PDF file present in the proceedings. This table helped us to divide the main proceedings file, PROCBOOK.PDF, into individual papers. These individual papers were converted to text files using pdftotext, an open-source tool that is part of the larger xpdf package. For the reference extraction, two main tools were evaluated: ParaTools v1.10, a set of Perl modules used in the ParaCite site, and ParsCit, another set of Perl modules developed at the National University of Singapore. Both toolsets were tested on the ED-MEDIA data, and both had advantages and shortcomings. As most errors were complementary (usually one tool got a reference right when the other failed), we decided to combine the output of both tools to produce the final results. Benchmarked against human expert annotation, the percentage of correctly extracted references was about 70% in a sample of 100 papers. Circa 30% of the references were not extracted by either tool because of the non-standard way in which some ED-MEDIA papers cite their references.

Finally, 6,690 papers and 10,689 authors from 92 different countries were identified for the period between 1999 and 2008. Also, 2,946 papers were analysed from the period 2005-2008 to extract a total of 35,347 citations to 26,378 different papers. This data is used in the following analysis to gain insight into the characteristics of ED-MEDIA (see section 4).

Annex 4 provides an overview on how we intend to extend this social network analysis beyond this initial data set and beyond the constructs presented here.