Concerns
From Stellar Deliverable 6.4
1 Science 2.0 evangelists’ expectations
Advocates of the Science 2.0 approach consider that scientists should not have to deal with tedious metadata input in the archive. Rather, the archive itself should process and consume the various feeds issued from their institutional archives, from journals, editors or other digital libraries where their publications reside (see deliverable 6.3 section 3.1.A). This converges with the expectation of many users to escape manual input thanks to automatic processing, which allows information to be handled once and for all. This expectation having been expressed clearly, no further effort has been spent on getting Stellar members to upload their resources manually. SOA will accordingly adopt a new strategy for obtaining and assembling the Stellar resources.
SOA can be seen as an aggregator of feeds coming from the partner institutional repositories, or better, as a harvester of feeds (since it stores metadata in its own database). Integrating the resources requires, at a minimum, sharing the basic information (bibliographical description) in a common format with integrated and coherent metadata. This proved more challenging than expected. Most partners were not aware of how much the descriptions and metadata in their repositories currently vary (discrepancies in descriptors, limitation to basic Dublin Core, variation in formats). This is analysed in the next section.
In this approach, the issue of storing PDF files remains open and might be a main concern. The current policy is that files cannot be stored without the explicit approval of the author, and therefore not automatically. If this policy has to be enforced, it should be examined how Stellar will assume the responsibility.
2 Analysis of the situation, the case of feeds
To fulfil the general requirement, and to verify the claim that gathering feeds from the various institutional repositories would be sufficient, an investigation was carried out to assess the feasibility of automatic extraction and to decide on the best processes to implement. The approach started with a few case studies (investigating where and how publications can be found for six representative researchers in six different Stellar institutions). The picture turned out to be much less simple, and less favourable, than expected regarding the potential feeds. Either feeds do not exist at all (despite the belief of the users), or they are limited in richness of information and/or processability. Furthermore, there is in general no easy way to ascertain that publications are TEL related.
2.1 N case studies
To illustrate, here are a few concrete points about some valuable sources of information mentioned by these researchers:
2.1.1 The case of some well established community services
- IEEEXplore (http://ieeexplore.ieee.org) and the ACM portal (http://portal.acm.org/) do not provide RSS feeds and require membership for full services. One may assemble and distribute links that point to works in these digital libraries, but no more. One can get bibliographical references, for example in BibTex format, but only case by case, not in batch mode (e.g. all the publications of a given conference or author).
- DBLP (http://dblp.mpi-inf.mpg.de/dblp-mirror/index.php) also provides BibTex records, and all the ECTEL papers are available. But again, only one at a time.
- EPFL InfoScience (http://infoscience.epfl.ch) is an institutional database which is OAI-PMH compliant, but no OAI interface was available.
However, the effort undertaken by some institutions was encouraging. For example, the Know-Center (Graz) is currently working on a "Graz" feed which would report all TEL related publications from the Know-Center, Graz University of Technology, and JUCS (Journal of Universal Computer Science).
This led us to re-draw the picture of SOA data sources:
- Users
  - Online input forms
  - BibTex import
- Institutional repositories
  - Feeds from Stellar Institutions (16)
  - other archives …
- Conference proceedings
- Journals
In the meantime, a BibTex import mechanism was prototyped in the SOA and is being fed with references copied and pasted from the various sources mentioned above. This should increase the number of publications in the SOA more rapidly and thus help make it more attractive.
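As an illustration of what such an import has to do, here is a minimal sketch of parsing one pasted BibTex record and reporting which descriptive fields are absent. It is not the SOA implementation: the function names, the regular expressions and the sample record are all assumptions made for this example, and only simple `name = {value}` or `name = "value"` fields are handled.

```python
import re

# Hypothetical record, invented for illustration only.
SAMPLE = """@inproceedings{example2009,
  author = {Doe, Jane and Roe, Richard},
  title = {A {Hypothetical} Paper Title},
  booktitle = {Example Workshop},
  year = {2009}
}"""

# One entry: @type{key, fields...}
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*)\}\s*$", re.DOTALL)
# One field: name = {value} (one level of nested braces) or name = "value".
# Bare values such as `year = 2009` are not handled by this sketch.
FIELD_RE = re.compile(
    r"(\w+)\s*=\s*(?:\{((?:[^{}]|\{[^{}]*\})*)\}|\"([^\"]*)\")", re.DOTALL
)


def parse_bibtex_entry(text):
    """Parse a single BibTex entry into a dict with type, key and fields."""
    match = ENTRY_RE.match(text.strip())
    if match is None:
        raise ValueError("not a single BibTex entry")
    entry_type, cite_key, body = match.groups()
    fields = {}
    for name, braced, quoted in FIELD_RE.findall(body):
        value = braced if braced else quoted
        # Collapse line breaks and repeated spaces inside the value.
        fields[name.lower()] = " ".join(value.split())
    return {"type": entry_type.lower(), "key": cite_key, "fields": fields}


def missing_fields(record, wanted=("abstract", "keywords")):
    """Return the wanted field names that the record does not provide."""
    return [name for name in wanted if name not in record["fields"]]


record = parse_bibtex_entry(SAMPLE)
print(record["fields"]["title"])
print("missing:", missing_fields(record))
```

A report of missing fields like this one is a cheap way to flag imported records that will need manual enrichment later.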
2.1.2 The case of ECTEL
Most of the ECTEL papers, for example, have been entered into the archive using the BibTex import mechanism. Still, it must be noted that BibTex records issued from DBLP contain no abstract and no keywords; more generally, BibTex records do not represent the author/institution relationship.
However, the resources collected this way consist only of 'notices' (i.e. bibliographical references plus, if available, a link to more information or to the file itself). No actual file (pdf, ps ...) can be collected this way, for both copyright and technical reasons.
In the case of ECTEL, we could take advantage of the work from the 'Conference paper database' building block; by processing their database we could enrich the SOA database (keyword extraction, citations etc.). But, as they state (http://www.stellarnet.eu/d/6/3/Conference_paper_database), "We are not allowed to share the PDF-files", and thus the SOA will not fully play its role. Uploading pdfs remains the responsibility of authors (copyright issues).
Papers from the ‘Science2.0 for TEL’ workshop (ECTEL 2009) were made available in real time as well as papers from the ‘Future Learning Spaces’ workshop (Stellar Alpine Rendez-Vous 2009).
- The case of ICCE
ICCE (International Conference on Computers in Education) is a significant conference in the TEL field. Contacts have been made with the conference secretariat in order to enrich SOA with the bibliographical references from the two existing editions (ICCE 2008 and 2009). This represents around 600 papers. The request was welcomed by ICCE, but unfortunately the data are not available in a database or in an easily processable format. Various techniques are therefore under consideration and development to process the existing documents (html and text summaries, pdf files) in the most efficient way. However good the processing is, it never yields a complete, rich metadata set, and manual processing is still required. The author/lab affiliation in particular remains a difficult point. Again, pdfs will not be copied to the SOA but remain available at the ICCE site via a web link.
2.1.3 The case of on-line journals
Journals’ repositories can also be data sources (once the copyright restriction on resources lapses, after 1, 2 or 3 years) and will be investigated later on.
2.2 Conclusion: need for feeds in and out
Finally, and this has become the main current issue, the question arose of having Stellar partners produce a feed that the SOA could consume. The Know-Center demonstrated a BibTex to RSS feed converter (http://ext216.know-center.tugraz.at/html/visitelf/convert.php) with the implicit goal that such feeds would go into the open archive. This would imply that the open archive (SOA) is able to process these feeds in order to store metadata in its database.
Deliverable 6.3 reports the progress made in defining the feed format ("publications feeds" building block (http://www.stellarnet.eu/d/6/3/Publication_feeds)). Paragraph 4 below reports the progress made by each Stellar partner in this direction.
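To make the idea of a BibTex to RSS conversion concrete, the following sketch turns already-parsed bibliographic records into a small RSS 2.0 document carrying Dublin Core `creator` and `date` elements. It is not the Know-Center converter and does not follow the format defined in Deliverable 6.3; the record layout (`title`, `authors`, `year`, `url`), the function name and the sample data are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace, used here for author and date.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def records_to_rss(records, channel_title, channel_link):
    """Serialise a list of record dicts as an RSS 2.0 string."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Publication feed (illustrative)"
    for record in records:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = record["title"]
        if record.get("url"):
            ET.SubElement(item, "link").text = record["url"]
        # One dc:creator element per author keeps authors separately processable.
        for author in record.get("authors", []):
            ET.SubElement(item, f"{{{DC_NS}}}creator").text = author
        if record.get("year"):
            ET.SubElement(item, f"{{{DC_NS}}}date").text = str(record["year"])
    return ET.tostring(rss, encoding="unicode")


# Hypothetical record, invented for illustration only.
feed = records_to_rss(
    [{"title": "A Hypothetical Paper Title",
      "authors": ["Doe, Jane", "Roe, Richard"],
      "year": 2009,
      "url": "http://example.org/paper"}],
    channel_title="Example institution publications",
    channel_link="http://example.org/",
)
print(feed)
```

Plain RSS 2.0 has no field for affiliation, keywords or abstract, which is one reason an 'enriched' feed format is needed on top of it.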
On the other side, the SOA has to make its data widely available and processable. Export services in a mix of RSS / RDF / DC / SWRC / BuRST formats are suitable for this, on top of the existing OAI-PMH API.
The feed issues are thus twofold:
- - INPUT: usually submission via input forms, or BibTex import. This has to be complemented by submission of 'enriched feeds' that the SOA would process (extended RSS feed mechanism enabling harvesting, OAI harvesting, etc.)
- - OUTPUT:
- - standard OAI-PMH interface (http://oa.stellarnet.eu/open-archive/oai)
- - RSS 2.0 feeds for latest resources
- - RSS 1.0 (rdf / burst / swrc) 'feed', providing the enriched feed corresponding to any request made to the SOA, in order to satisfy any use of the metadata by other services
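As a sketch of how another service might consume the OAI-PMH output, the code below builds a standard `ListRecords` request for unqualified Dublin Core (`oai_dc`) and extracts identifiers, titles and creators from a response. No network call is made: the response shown is a hand-written, hypothetical example, and the function names are assumptions made for this illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_dc_records(response_xml):
    """Pull identifier, titles and creators out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in record.iterfind(".//dc:creator", NS)],
        })
    return records


# Hand-written response with invented content, for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://oa.stellarnet.eu/open-archive/oai"))
print(extract_dc_records(SAMPLE_RESPONSE))
```

A real harvester would also have to follow `resumptionToken` elements to page through large result sets, which this sketch leaves out.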
One should not hide that the challenge is a hard one: replacing manual input by automatic processing is far from trivial and should rely, in our case, on data flows that are already of high quality. Nonetheless, human checking will still be essential to maintain consistency in the archive. Just to give an example:
- Mapping Web Personal Learning Environments (4th European Conference on Technology Enhanced Learning (EC-TEL) - Workshop on Mash-Up Personal Learning Environments (MUPPLE’09), Nice, France, September 29 - October 2, 2009.)
Two occurrences of this publication can be found at:
-> EPFL infoscience : http://infoscience.epfl.ch/record/140942
-> KMI : http://kmi.open.ac.uk/publications/fridolin-wild
Obviously it is the same publication, deposited in two different repositories by two of the co-authors; nonetheless the abstracts differ …
The current operational agenda of SOA is therefore to develop tools to process feeds coming from partners, with a consensus on a minimal common standard. In the following section we review the situation of the Stellar partners with respect to this agenda and in relation to the needs or expectations they express.
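A duplicate of this kind can at least be flagged automatically, leaving the decision to a human. The sketch below groups harvested records by a normalised title and year, then lists the fields on which the grouped records disagree. The matching rule, the function names and the record layout are assumptions made for this example, and the abstract values are placeholders, not the actual abstracts held by the two repositories.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title):
    """Normalise a title: strip accents, lowercase, drop punctuation."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def find_duplicates(records):
    """Group records sharing a normalised title and year; keep groups of 2+."""
    groups = defaultdict(list)
    for record in records:
        groups[(title_key(record["title"]), record.get("year"))].append(record)
    return [group for group in groups.values() if len(group) > 1]


def conflicting_fields(group, fields=("abstract",)):
    """Return the fields whose values differ inside a duplicate group."""
    return [f for f in fields if len({r.get(f) for r in group}) > 1]


# Placeholder abstracts: the real texts are not reproduced here.
harvested = [
    {"source": "EPFL infoscience",
     "title": "Mapping Web Personal Learning Environments",
     "year": 2009, "abstract": "(abstract as deposited in repository A)"},
    {"source": "KMI",
     "title": "Mapping web personal learning environments.",
     "year": 2009, "abstract": "(abstract as deposited in repository B)"},
]

for group in find_duplicates(harvested):
    sources = [r["source"] for r in group]
    print(sources, "differ on:", conflicting_fields(group))
```

Exact matching on a normalised title is deliberately conservative; it misses retitled versions, and which abstract to keep still has to be decided by a person.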