M1A: Metadata

Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints

Martin Toepfer and Christin Seifert

Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document-level. Therefore, we propose a novel approach that detects documents rather than concepts where quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.

Metadata Synthesis and Updates on Collections Harvested using the Open Archives Initiative Protocol for Metadata Harvesting

Sarantos Kapidakis

Harvesting tasks gather information into a central repository. We studied the metadata returned from 744,179 harvesting tasks issued to 2,120 harvesting services in 529 harvesting rounds over a period of two years. To achieve that, we initiated nearly 1,500,000 tasks, because a significant share of the Open Archives Initiative harvesting services never worked or have ceased working, while many other services fail occasionally. We studied the synthesis (elements and verbosity of values) of the harvested metadata and how it evolved over time. We found that most services utilize almost all Dublin Core elements, but there are services with minimal descriptions. Most services make very minimal updates and, overall, the harvested metadata is slowly improving over time. Our results help us better understand how and when the metadata are improved, and to have more realistic expectations about metadata quality when we design harvesting or information systems that rely on them.

Metadata Enrichment of Multi-Disciplinary Digital Library: A Semantic-based Approach

Hussein Al-Natsheh, Lucie Martinet, Fabrice Muhlenbach, Fabien Rico and Djamel A. Zighed

In scientific digital libraries, papers from different research communities can be described by community-dependent keywords even if they share a similar topic. Articles without appropriate keywords are poorly indexed, which causes information retrieval problems and limits the potentially fruitful exchanges between scientific disciplines.

T1A: Entity Disambiguation

Harnessing Historical Corrections to Build Test Collections for Named Entity Disambiguation

Florian Reitz

Matching mentions of persons to the actual persons (the name disambiguation problem) is central for several digital library applications.

Homonym Detection in Curated Bibliographies: Learning from dblp’s Experience

Marcel R. Ackermann and Florian Reitz

Identifying (and fixing) homonymous and synonymous author profiles is one of the major tasks of curating personalized bibliographic metadata repositories like the dblp computer science bibliography. In this paper, we present and evaluate a machine learning approach to identify homonymous author bibliographies using a simple multilayer perceptron setup. We train our model on a novel gold-standard data set derived from the past years of active, manual curation at the dblp computer science bibliography.

T2A: Data Management

Research Data Preservation Using Process Engines and Machine-actionable Data Management Plans

Asztrik Bakos, Tomasz Miksa and Andreas Rauber

Scientific experiments in various domains nowadays require collecting, processing, and reusing data. Researchers have to comply with funder policies that prescribe how data should be managed, shared and preserved. In most cases this has to be documented in data management plans.

Maturity Models for Data and Information Management – A State of the Art

Diogo Proença and José Borbinha

A maturity model is a widely used technique that has proved valuable for assessing business processes or certain aspects of organizations, as it represents a path towards an increasingly organized and systematic way of doing business. A maturity assessment can be used to measure the current maturity level of a certain aspect of an organization in a meaningful way, enabling stakeholders to clearly identify strengths and improvement points, and accordingly prioritize what to do in order to reach higher maturity levels. This paper collects and analyzes current practice on maturity models in the data and information management domains by analyzing a collection of maturity models from the literature. It also clarifies the options available to practitioners and opportunities for further research.

An operationalized DDI Infrastructure to document, publish, preserve and search social science research data

Claus-Peter Klas and Oliver Hopt

Research data is an important issue for research and researchers for several reasons: it adds to the reputation of authors, just like papers, and it is good scientific practice, as it lets others evaluate research and re-use data for new ideas. In addition, funding agencies are starting to fund only projects with a research data plan and the expectation that the research data will be accessible and preserved for re-use. The social sciences are in a privileged position here, as the Data Documentation Initiative (DDI) already defines a metadata standard for survey data. But even though the DDI standard has existed since the year 2000, it is not widely used because almost no (open source) tools are available. The reasons are manifold, e.g. that DDI is a living standard with major changes during its lifetime. Because of this complexity it became heterogeneous across version changes: it varies over different versions of DDI, different groupings, and unequal interpretations of elements. Attempts to model a complex DDI structure in a database model result in high development costs in time and money, are application specific, and lead to non-reusable models. In this article we present the technical infrastructure developed within GESIS to operationalize DDI, to use DDI as a living standard for documentation and preservation, and to support the publishing process and search functions that foster re-use and research. The main contribution of this paper is to present our DDI infrastructure, to showcase how to operationalize DDI, and to show the efficient and effective handling and usage of complex metadata. The infrastructure can be adopted and used as a blueprint for other domains. The software is published under open source licenses.

T2B: Scholarly Communication

Unveiling Scholarly Communities over Knowledge Graphs

Sahar Vahdati, Guillermo Palma, Rahul Jyoti Nath, Christoph Lange, Sören Auer and Maria-Esther Vidal

Knowledge graphs represent the meaning of properties of real-world entities and relationships among them in a natural way.

Metadata Analysis of Scholarly Events of Computer Science, Physics, Engineering, and Mathematics

Said Fathalla, Sahar Vahdati, Sören Auer and Christoph Lange

Although digitization has significantly eased publishing, finding a relevant and suitable channel of publishing still remains challenging.

Venue Classification of Research Papers in Scholarly Digital Libraries

Cornelia Caragea and Corina Florescu

Most open-access scholarly digital libraries periodically crawl a list of seed URLs in order to obtain appropriate collections of freely-available research papers. The metadata of the crawled papers, e.g., title, authors, and references, are automatically extracted before the papers are indexed in a digital library. The venue of publication is another important aspect of a scientific paper, reflecting its authoritativeness. However, the venue is not always readily available for a paper. Instead, it needs to be extracted from the reference lists of other papers that cite the target paper, a difficult process. In this paper, we explore a supervised learning approach to classifying the venue of a research paper that leverages information solely available from the content of the paper. We show experimentally on a dataset of approximately 44,000 papers that this approach outperforms several baselines on venue classification.

T3A: Digital humanities

Towards better Understanding Researcher Strategies in Cross-lingual Event Analytics

Simon Gottschalk, Viola Bernacchi, Richard Rogers and Elena Demidova

With an increasing amount of information on globally important events, there is a growing demand for efficient analytics of multilingual event-centric information. Such analytics is particularly challenging due to the large amount of content, the event dynamics and the language barrier. Although memory institutions increasingly collect event-centric Web content in different languages, very little is known about the strategies of researchers who conduct analytics of such content. In this paper we present researchers’ strategies for the content, method and feature selection in the context of cross-lingual event-centric analytics observed in two case studies on multilingual Wikipedia. We discuss the influence factors for these strategies, the findings enabled by the adopted methods along with the current limitations and provide recommendations for services supporting researchers in cross-lingual event-centric analytics.

Adding words to Manuscripts: from PagesXML to TEITOK

Maarten Janssen

Library digitalization projects almost always use a page-driven file format for the description of manuscript transcriptions. But for a searchable corpus, a text-driven file format such as TEI/XML is much more appropriate. This article shows how the TEITOK corpus framework provides a two-stage approach, dealing first with transcription in a page-driven manner, and afterwards converting losslessly to a text-driven format, leading to a fully searchable corpus closely linked to the manuscript images.

T4A: User Interaction

Predicting Retrieval Success Based on Information Use for Writing Tasks

Pertti Vakkari, Michael Völske, Martin Potthast, Matthias Hagen and Benno Stein

This paper asks to what extent querying, clicking, and text editing behavior can predict the usefulness of the search results retrieved during essay writing. To render the usefulness of a search result directly observable for the first time in this context, we cast the writing task as “essay writing with text reuse,” where text reuse serves as usefulness indicator. Based on 150 essays written by 12 writers using a search engine to find sources for reuse, while their querying, clicking, reuse, and text editing activities were recorded, we build linear regression models for the two indicators (1) number of words reused from clicked search results, and (2) number of times text is pasted, covering 69% (90%) of the variation. The three best predictors from both models cover 91-95% of the explained variation. By demonstrating that rather simple models can predict retrieval success, our study constitutes a first step towards incorporating usefulness signals in retrieval personalization for general writing tasks, presuming our results generalize.

Personalised Session Difficulty Prediction in an Online Academic Search Engine

Tuan Vu Tran and Norbert Fuhr

Search sessions consist of multiple user-system interactions. As a user-oriented measure for the difficulty of a session, we regard the time needed for finding the next relevant document (TTR). In this study, we analyse the search log of an academic search engine, focusing on the user interaction data without regarding the actual content. After observing a user for a short time, we predict the TTR for the remainder of the session. In addition to standard machine learning methods for numeric prediction, we investigate a new approach based on an ensemble of Markov models. Both types of methods yield similar performance. However, when we personalise the Markov models by adapting their parameters to the current user, this leads to significant improvements.
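The personalisation idea in the abstract above can be illustrated with a minimal sketch: estimate action-transition probabilities globally, then blend them with the current user's own counts. The action labels, sessions, and blending weight below are illustrative assumptions, not the paper's actual model.

```python
# Hedged sketch: first-order transition estimates over session actions,
# personalised by interpolating user-specific and global counts.
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count a -> b transitions across a list of action sequences."""
    counts = defaultdict(Counter)
    for s in sessions:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return counts

def personalised_prob(global_c, user_c, a, b, alpha=0.5):
    """Blend the user's own transition estimate with the global one."""
    def prob(c):
        total = sum(c[a].values())
        return c[a][b] / total if total else 0.0
    return alpha * prob(user_c) + (1 - alpha) * prob(global_c)

glob = transition_counts([["query", "click", "relevant"],
                          ["query", "query", "click"]])
user = transition_counts([["query", "click", "click"]])
print(round(personalised_prob(glob, user, "query", "click"), 3))  # → 0.833
```

With alpha = 0.5, the user's estimate (1.0) and the global one (2/3) blend to 5/6; how much weight to give the current user is exactly the adaptation question the paper studies.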

User Engagement with Generous Interfaces for Digital Cultural Heritage

Robert Speakman, Mark Michael Hall and David Walsh

Digital cultural heritage (DCH) institutions are experiencing transitory visitation patterns to their online collections through traditional search interfaces. Generous interfaces have been lauded as a replacement for traditional search, yet their effects on user engagement are relatively unexplored. This paper presents the results of an online experiment with three prolific DCH generous interfaces, which aimed to quantify the effects of component use on user engagement. The results highlight that, despite no significant difference in focused attention levels, novel generous interface components promote engagement factors. Participants who made more use of components were found to be more likely to experience user engagement. Additionally, the generous interfaces were found to promote serendipitous discovery of collection items and to support casual museum users despite low familiarity with the interfaces. The success of the tested generous interfaces is contingent upon the representation of the collection items and how interesting they are to participants on initial view.

T4B: Resources

Peer review and citation data in predicting university rankings, a large-scale analysis

David Pride and Petr Knoth

Most Performance-based Research Funding Systems (PRFS) draw on peer review and bibliometric indicators, two different methodologies which are sometimes combined. A common argument against the use of indicators in such research evaluation exercises is their low correlation at the article level with peer review judgments. In this study, we analyse 191,000 papers from 154 higher education institutes which were peer reviewed in a national research evaluation exercise. We combine these data with 6.95 million citations to the original papers. We show that when citation-based indicators are applied at the institutional or departmental level, rather than at the level of individual papers, surprisingly large correlations with peer review judgments can be observed, up to r = 0.802 (n = 37, p < 0.001) for some disciplines. In our evaluation of prediction performance based on citation data, we show we can reduce the prediction error by 25% compared to the current state of the art. This suggests that citation indicators can lessen the burden of peer review on national evaluation exercises, leading to considerable cost savings.

The MUIR Framework: Cross-Linking MOOC Resources to Enhance Discussion Forums

Ya-Hui An, Muthu Kumar Chandresekaran, Min-Yen Kan and Yan Fu

New learning resources are created and minted in Massive Open Online Courses every week — new videos, quizzes, assessments and discussion threads are deployed and interacted with in the era of on-demand online learning. However, these resources are often artificially siloed between platforms and web application models. Linking such resources facilitates learning and multimodal understanding, bettering learners’ experience.

Figures in Scientific Open Access Publications

Lucia Sohmen, Jean Charbonnier, Ina Blümel, Christian Wartena and Lambert Heller

This paper summarizes the results of a comprehensive statistical analysis on a corpus of open access articles and contained figures. It gives an insight into quantitative relationships between illustrations or types of illustrations, caption lengths, subjects, publishers, author affiliations, article citations and others.

W1A: Information Extraction

Finding Person Relations in Image Data of News Collections in the Internet Archive

Eric Müller-Budack, Kader Pustu-Iren, Sebastian Diering and Ralph Ewerth

The multimedia content in the World Wide Web is rapidly growing and contains valuable information for many applications in different domains. For this reason, the Internet Archive initiative has been gathering billions of time-versioned web pages since the mid-nineties. However, the huge amount of data is rarely labeled with appropriate metadata and automatic approaches are required to enable semantic search. Normally, the textual content of the Internet Archive is used to extract entities and their possible relations across domains such as politics and entertainment, whereas image and video content is usually neglected. In this paper, we introduce a system for person recognition in image content of the Internet Archive. Thus, the system complements entity recognition in text and allows researchers and analysts to track media coverage and relations of persons more precisely. Based on a deep learning face recognition approach, we suggest a system that automatically detects persons of interest and gathers sample material, which is subsequently used to identify them in the image data of the Internet Archive. We evaluate the performance of the face recognition system on an appropriate standard benchmark dataset and demonstrate the feasibility of the approach with some use cases.

Ontology-Driven Information Extraction from Research Publications

Vayianos Pertsas and Panos Constantopoulos

Extraction of information from a research article, association with other sources and inference of new knowledge is a challenging task that has not yet been entirely addressed. We present Research Spotlight, a system that leverages existing information from DBpedia, retrieves articles from repositories, extracts and interrelates various kinds of named and non-named entities by exploiting article metadata, the structure of text as well as syntactic, lexical and semantic constraints, and populates a knowledge base in the form of RDF triples. An ontology designed to represent scholarly practices drives the whole process. The system is evaluated through two experiments that measure the overall accuracy in terms of token- and entity-based precision, recall and F1 scores, as well as entity boundary detection, with promising results.

W2A: Information Retrieval

Scientific Claims Characterization for Claim-Based Analysis in Digital Libraries

José Maria González Pinto and Wolf-Tilo Balke

In this paper, we promote the idea of automatically characterizing the semantics of scientific claims to explore entity-entity relationships in digital collections. Our approach aims at alleviating the time-consuming analysis of query results when the information need is not just one document but an overview over a set of documents. With this semantic characterization, we propose to find what we call “dominant” claims, relying on two core properties: the consensual support of a claim in the light of the collection’s previous knowledge, and the assertiveness of the language the authors use when expressing it. We discuss features that efficiently capture these two core properties and formalize the idea of finding “dominant” claims by relying on Pareto dominance. We demonstrate the effectiveness of our method in a practical evaluation on a real-world document collection from the medical domain, showing the potential of our approach.
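The Pareto-dominance criterion named in the abstract above can be sketched in a few lines. The claim scores and field names below are illustrative stand-ins, not the authors' actual features:

```python
# Hedged sketch: a claim is "dominant" if no other claim is at least as
# good on both properties (consensual support, assertiveness) and
# strictly better on one. Scores are invented for illustration.

def dominates(a, b):
    """True if claim `a` Pareto-dominates claim `b`."""
    return (a["support"] >= b["support"]
            and a["assertiveness"] >= b["assertiveness"]
            and (a["support"] > b["support"]
                 or a["assertiveness"] > b["assertiveness"]))

def pareto_front(claims):
    """Return the claims not dominated by any other claim."""
    return [c for c in claims
            if not any(dominates(o, c) for o in claims if o is not c)]

claims = [
    {"id": "c1", "support": 0.9, "assertiveness": 0.4},
    {"id": "c2", "support": 0.6, "assertiveness": 0.8},
    {"id": "c3", "support": 0.5, "assertiveness": 0.3},  # dominated by c1
]
print([c["id"] for c in pareto_front(claims)])  # → ['c1', 'c2']
```

Note that neither c1 nor c2 dominates the other (each wins on one property), so both survive; Pareto dominance avoids having to weight the two properties against each other.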

Automatic Segmentation and Semantic Annotation of Verbose Queries in Digital Library

Susmita Sadhu and Plaban Kumar Bhowmick

In this paper, we propose a system for automatic segmentation and semantic annotation of verbose queries with predefined metadata fields. The problem of generating an optimal segmentation is modeled as simulated annealing with proposed solution cost and neighborhood functions. The annotation problem is modeled as sequence labeling and implemented with a Hidden Markov Model (HMM). Component-wise and holistic evaluations of the system were performed using a gold-standard annotation developed over a query log collected from the National Digital Library of India (NDLI). In the component-wise evaluation, the segmentation module yields 82% F1 and the annotation module performs with 56% accuracy. In the holistic evaluation, the system achieves an F1 of 33%.
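The simulated-annealing formulation of segmentation can be sketched minimally. The cost function below (penalising segments absent from a phrase vocabulary), the toy vocabulary, and the cooling schedule are all illustrative assumptions; the paper's actual cost and neighborhood functions are its own contribution.

```python
# Hedged sketch: simulated annealing over query segmentations, where a
# state is a set of boundary positions and a neighbor toggles one boundary.
import math
import random

def segments(tokens, boundaries):
    """Split the token list at the given boundary positions."""
    cuts = [0] + sorted(boundaries) + [len(tokens)]
    return [" ".join(tokens[i:j]) for i, j in zip(cuts, cuts[1:])]

def cost(tokens, boundaries, vocab):
    """Stand-in cost: one penalty per segment not in the vocabulary."""
    return sum(0 if s in vocab else 1 for s in segments(tokens, boundaries))

def anneal(tokens, vocab, steps=2000, temp=2.0, seed=0):
    rng = random.Random(seed)
    state, best = set(), set()
    for t in range(steps):
        cand = set(state)
        cand.symmetric_difference_update({rng.randrange(1, len(tokens))})
        delta = cost(tokens, cand, vocab) - cost(tokens, state, vocab)
        # Metropolis acceptance with a linearly decaying temperature.
        if delta <= 0 or rng.random() < math.exp(-delta / (temp * (1 - t / steps) + 1e-9)):
            state = cand
        if cost(tokens, state, vocab) < cost(tokens, best, vocab):
            best = set(state)
    return segments(tokens, best)

vocab = {"information retrieval", "digital library", "survey"}
print(anneal("information retrieval digital library survey".split(), vocab))
```

On this toy query the only zero-cost segmentation is the three vocabulary phrases, which the random walk finds quickly; the paper's second stage would then label each segment with a metadata field via an HMM.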

W2B: Recommendation

Open Source Software Recommendations Using Github

Miika Koskela, Inka Simola and Kostas Stefanidis

Our goal in this work is to provide open source software recommendations using the GitHub API. To this end, we propose a hybrid method that considers the languages, topics and README documents that appear in the user’s repositories. To demonstrate our approach, we implement a proof-of-concept prototype that provides software recommendations.

Recommending Scientific Videos based on Metadata Enrichment using Linked Open Data

Justyna Medrek, Christian Otto and Ralph Ewerth

The number of videos available on the Web has significantly increased, not only for entertainment but also for conveying educational or scientific information effectively. Several web portals offer access to the latter kind of video material. One of them is the TIB AV-Portal of the Leibniz Information Centre for Science and Technology (TIB), which hosts scientific and educational video content. In contrast to other video portals, automatic audiovisual analysis (visual concept classification, optical character recognition, speech recognition) is utilized to enhance metadata information and semantic search. In this paper, we propose to further exploit and enrich this automatically generated information by linking it to the Integrated Authority File (GND) of the German National Library. This information is used to derive a measure of the similarity of two videos, which serves as a basis for recommending semantically similar videos. A user study demonstrates the feasibility of the proposed approach.

P: Posters

TIB-arXiv: An Alternative Search Portal for the arXiv Pre-print Server

Matthias Springstein, Huu Hung Nguyen, Anett Hoppe and Ralph Ewerth

arXiv is a popular pre-print server focusing on natural science disciplines (e.g. physics, computer science, quantitative biology). As a platform focused on easy publishing services, it does not provide enhanced search functionality, but it offers programming interfaces that allow external parties to add such services. This paper presents extensions of the open source framework arXiv Sanity Preserver (SP). With respect to the original framework, it derestricts the topical focus and allows for text-based search and visualisation of all papers in arXiv. To this end, all papers are stored in a unified back-end; the extension provides enhanced search and ranking facilities and allows the exploration of arXiv papers through a novel user interface.

An Analytics Tool for Exploring Scientific Software and Related Publications

Anett Hoppe, Jascha Hagen, Helge Holzmann, Günter Kniesel and Ralph Ewerth

Scientific software is one of the key elements of reproducible research. However, classic publications and related scientific software are typically not (sufficiently) linked, and tools to jointly explore these artefacts are lacking. In this paper, we report on our work on developing an analytics tool for jointly exploring software and publications. The presented prototype, a concept for automatic code discovery, and two use cases demonstrate the feasibility and usefulness of the proposal.

Digital Museum Map

Mark Michael Hall

The digitisation of cultural heritage has created large digital collections that have the potential to open up our cultural heritage. However, the search box, which for non-expert users presents a significant obstacle, remains the primary interface for accessing these. This demo presents a fully automated, data-driven algorithm for generating a generous interface for exploring DCH collections.

ORCID iDs in the Open Knowledge Era

Marina Morgan and Naomi Eichenlaub

The focus of this poster is to highlight the importance of sufficient metadata in ORCID records for the purpose of name disambiguation. In 2017 the authors counted ORCID iDs containing minimal information. They invoked RESTful API calls using Postman software and searched ORCID records created between 2012-2017 that did not include affiliation or organization name, Ringgold ID, and any work titles. A year later, they reproduced the same API calls and compared with the results achieved the year before. The results reveal that a high number of records are still minimal or orphan, thus making the name disambiguation process difficult. The authors recognize the benefit of a unique identifier that facilitates name disambiguation and remain confident that with continued work in the areas of system interoperability and technical integration, alongside continued advocacy and outreach, ORCID will grow and develop not only in number of iDs but also in metadata robustness.

Revealing Historical Events out of Web Archives

Quentin Lobbé

Corpora of web archives are wide and sparse. As the living web expands, worldwide volumes of web archives constantly increase, making it difficult to identify archived webpages relevant to a specific sociological or historical study. We propose an application for detecting historical events in a corpus of web archives and discovering pertinent archived digital content. We introduce a new entity called the Web Fragment to reduce issues of corpus quality and consistency and to effectively guide researchers through the exploration of web archives. A web fragment is defined as a semantic and syntactic subset of a given webpage, with the particularity of being indexed by its edition date (the time when the web fragment was written) instead of its archiving date (the time when its parent webpage was crawled and saved). Building on web fragments, we show how this application can be used to study a large archived Moroccan forum and to understand how this online collective reacted to the Arab Spring at the end of 2010.

The FAIR Accessor as a tool to reinforce the authenticity of digital archival information

André Pacheco

The constant increase in the volume, variety and complexity of digital information poses a plethora of problems that hinder the preservation of archival information while ensuring that it remains authentic, reliable, accessible, trustworthy, intelligible and reusable for as long as necessary. This study explores a possible implementation of a FAIR Accessor, a technology developed with the aim of providing findable, accessible, interoperable and reusable research data, as an infrastructure that can support and aid archival information description, with the goal of ensuring its authenticity. A qualitative literature review of representative works in the fields of Information Science, Diplomatics and the FAIR principles is followed by a discussion of how the key concepts of each field overlap and thus complement each other. It is concluded that the FAIR Accessor infrastructure can prove useful in enriching archival description and, ultimately, in helping to ascertain the authenticity of records.

Who Cites What in Computer Science? – Analysing Citation Patterns across Conference Rank and Gender

Tobias Milz and Christin Seifert

Citations are a means to refer to previous, relevant scientific bodies of work.

Back to the Source: Recovering Original (Hebrew) Script from Transcribed Metadata

Aaron Christianson, Rachel Heuberger and Thomas Risse

Due to technical constraints of the past, metadata in languages written with non-Latin scripts have frequently been entered using various systems of transcription. While this transcription is essential for data curators who may not be familiar with the source script, it is often an encumbrance for researchers in discovery and retrieval, especially with more complex forms of transcription, as are common for Arabic and Hebrew scripts. The University Library Johann Christian Senckenberg holds a very large Judaica collection with many works in Hebrew and Yiddish. Until 2011, all these works were catalogued with transcription only. The aim of this work is to develop an open-source system that aids the automatic conversion of Hebrew transcription back into Hebrew script, using a multi-faceted approach.

From Handwritten Manuscripts to Linked Data

Lise Stork, Andreas Weber, Jaap van den Herik, Aske Plaat, Fons Verbeek and Katherine Wolstencroft

Museums, Archives, Libraries and other institutes, specifically those in the cultural heritage domain, make increasing use of Semantic Web technologies to enrich and publish their collection items. Fewer cases exist where the contents of those items are also enriched using similar methods, disclosing the details contained within historical handwritten manuscripts. We argue that the enrichment of historical manuscripts is of central importance to the disclosure of cultural heritage archives. Elucidating the contents of historical manuscripts is, however, a time-consuming process that requires domain expertise. Different workflows have therefore been proposed to accelerate and improve this process. In this study, we present an analysis of different approaches, focussing specifically on the provenance requirements for annotating and interpreting historical manuscripts so that the contents can be published online as FAIR (Findable, Accessible, Interoperable and Reusable) data. Furthermore, we argue that provenance can play a central role in quality assessment. We demonstrate our findings with a case study from the natural history domain, where we have developed a semantic framework for extracting, annotating and curating regions of interest from digitised handwritten, historical manuscripts.

A Study on the Monetary Value Estimation of User Satisfaction with the Digital Library Service Focused on Construction Technology Information in South Korea

Seong-Yun Jeong

The Korea Institute of Civil Engineering and Building Technology has, since 2001, been constructing a database by collecting, classifying, and processing the construction technology data required by construction engineers and practitioners, and providing a database information service through the Construction Technology Digital Library system. In this study, the economic feasibility of the information service was analyzed with the aim of using the limited information service budget to improve satisfaction with the service. As part of the analysis, user satisfaction with the construction technology information service and the willingness-to-pay price for the service were surveyed among the members of the system. The value of satisfaction with the information service was estimated by applying the double-bounded dichotomous choice contingent valuation method to the survey results.

Visual Analysis of Search Results in Scopus Database

Ondrej Klapka and Antonin Slaby

The enormous growth of research and development has been accom-

False-positive Reduction in Ontology Matching Based on Concepts’ Domain Similarity

Audun Vennesland and Trond Aalberg

The quality of an ontology matching system is ultimately determined by its ability to promote true positive semantic relations and disregard false positive ones. In this study we explore if considering the domain similarity between concepts to be matched can contribute to this. This is particularly relevant in areas where the universe of discourse encompasses several diverse domains, such as cultural heritage. Our approach is based on an algorithm that employs the lexical resource WordNet Domains to filter out relations where the two concepts to be matched are associated with different domains. While this study focuses on reducing false positive relations from string matching alignments, the same approach is also transferable to other matching techniques. We evaluate our approach in an experiment involving Bibframe and Schema.org, two ontologies of complementary nature. The results from the evaluation show that the use of such a domain filter indeed can have a positive effect on reducing false positives and consequently contribute to improving alignment quality.
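The filtering idea described above can be sketched generically: drop candidate alignments whose concepts have disjoint domain sets. The `DOMAINS` lookup and the concept names below are illustrative stand-ins for the WordNet Domains resource and the Bibframe/Schema.org vocabularies, not the actual data.

```python
# Hedged sketch: false-positive reduction by filtering out candidate
# matches whose concepts belong to disjoint domains.

DOMAINS = {  # concept -> associated domains (illustrative values)
    "Work":       {"publishing", "art"},
    "Opus":       {"music", "art"},
    "Periodical": {"publishing"},
    "Season":     {"time_period", "sport"},
}

def domains_overlap(c1, c2):
    """True if the two concepts share at least one domain."""
    return bool(DOMAINS.get(c1, set()) & DOMAINS.get(c2, set()))

def filter_alignment(candidates):
    """Keep only (concept1, concept2, confidence) triples with a shared domain."""
    return [(a, b, conf) for a, b, conf in candidates if domains_overlap(a, b)]

candidates = [("Work", "Opus", 0.9),        # same domain: kept
              ("Periodical", "Season", 0.7)]  # disjoint domains: dropped
print(filter_alignment(candidates))  # → [('Work', 'Opus', 0.9)]
```

As in the paper, the filter is independent of how the candidates were produced, so it can sit behind string matching or any other matcher.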

Association Rule based Clustering of Electronic Resources in University Digital Library

Debashish Roy, Chen Ding, Lei Jin and Dana Thomas

Library analytics is used to analyze the large amounts of data that most colleges and universities collect when their library's electronic resources are browsed. In this research work, we analyzed library usage data to cluster e-resource items. We compared different clustering algorithms and found that association rule mining (ARM) based clustering is more accurate than the others; it also identifies hidden relationships between articles that are not similar in content. We also show that items in the same cluster offer a good source for recommendations.
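The kind of association rules involved can be illustrated with a toy example. The sketch below mines pairwise rules from hypothetical usage sessions (items browsed together); it is a simplified stand-in for the paper's ARM-based clustering, with invented data and thresholds.

```python
from itertools import combinations
from collections import Counter

# Hypothetical usage sessions: each set holds e-resource items that were
# browsed together. These sessions are invented for illustration.
sessions = [
    {"a1", "a2", "a3"},
    {"a1", "a2"},
    {"a2", "a3"},
    {"a1", "a2", "a4"},
]

def pair_rules(sessions, min_support=0.5, min_conf=0.6):
    """Mine pairwise association rules x -> y with support and confidence."""
    n = len(sessions)
    item_counts = Counter(i for s in sessions for i in s)
    pair_counts = Counter(p for s in sessions
                          for p in combinations(sorted(s), 2))
    rules = []
    for (x, y), c in pair_counts.items():
        if c / n < min_support:      # pair must be frequent enough
            continue
        for a, b in ((x, y), (y, x)):
            conf = c / item_counts[a]  # P(b in session | a in session)
            if conf >= min_conf:
                rules.append((a, b, c / n, conf))
    return rules

for a, b, sup, conf in pair_rules(sessions):
    print(f"{a} -> {b} (support={sup:.2f}, confidence={conf:.2f})")
```

Items linked by strong rules can then be placed in the same cluster even when their texts share no vocabulary, which is the content-independence advantage the abstract points to.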

Hybrid Image Retrieval in Digital Libraries: A Large Scale Multicollection Experimentation of Deep Learning techniques

Jean-Philippe Moreux and Guillaume Chiron

While digital heritage libraries have historically taken advantage of OCR to index their printed collections, access to iconographic resources has not progressed in the same way, and the latter remain in the shadows. Today, however, it would be possible to make better use of these resources, especially by leveraging the enormous volumes of illustrations segmented thanks to the OCR produced during the last two decades, and thus valorize these engravings, drawings, photographs, maps, etc. for their own value but also as an attractive entry point into the collections. This article presents an ETL (extract-transform-load) approach to this need that aims to: identify and extract iconography wherever it may be found, in image collections but also in printed materials; transform, harmonize, and enrich the images' descriptive metadata (in particular with deep learning classification and indexing models); and load it all into a web app dedicated to hybrid image retrieval. The approach is doubly pragmatic, since it leverages existing digital resources and (virtually) off-the-shelf technologies.
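The three ETL stages named above can be sketched as a minimal pipeline skeleton. All function bodies, data shapes, and the stand-in classifier below are hypothetical stubs, not the authors' implementation.

```python
# Hypothetical skeleton of the extract-transform-load steps described
# above; bodies are illustrative stubs, not the paper's code.

def extract(collections):
    """Gather illustrations from image collections and OCR-segmented prints."""
    for source in collections:
        yield from source  # each item: {"image": ..., "metadata": {...}}

def transform(item, classifier):
    """Harmonize the metadata and enrich it with deep-learning labels."""
    item["metadata"]["labels"] = classifier(item["image"])
    return item

def load(items, index):
    """Store the enriched records in the retrieval app's index."""
    index.extend(items)

# Toy run with a stand-in classifier and an in-memory "index".
collections = [[{"image": "map.jpg", "metadata": {}}]]
index = []
load((transform(i, lambda img: ["map"]) for i in extract(collections)), index)
print(index)  # -> [{'image': 'map.jpg', 'metadata': {'labels': ['map']}}]
```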

Grassroots meets grasstops: Integrated research data management with EUDAT B2 Services, Dendro and LabTablet

João Rocha da Silva, Nelson Pereira, Pedro Dias and Bruno Barros

We present an integrated research data management (RDM) workflow that captures data from the moment of creation until its long-term deposit. We integrated LabTablet, our electronic laboratory notebook; Dendro, a data organisation and description platform aimed at collaborative management of research data; and EUDAT's B2DROP and B2SHARE platforms. This approach combines the portability and automated metadata production abilities of LabTablet, Dendro as a collaborative RDM tool for dataset preparation, the scalable storage of B2DROP, and the long-term deposit of datasets in B2SHARE. The resulting workflow can be put to work in research groups where laboratory or field work is central.

Linked publications and research data: use cases for Digital Libraries

Fidan Limani, Atif Latif and Klaus Tochtermann

Linking publications to ever-increasing research data is becoming important for providing a more complete picture of research. Silos of publications and data, even within institutions, surely hamper this picture. In this work we explore a linking strategy for scholarly resources: publications and research data.

The Emergence of Thai OER to Support Open Education

Titima Thumbumrung and Boonlert Aroonpiboon

This paper presents the practical work of developing Open Educational Resources (OER) in a developing country, Thailand, to support open education and lifelong learning. Thai OER is an ongoing project under the Online Learning Resources for Distance Learning project in Celebration of the Auspicious Occasion of Her Royal Highness Princess Maha Chakri Sirindhorn's 60th Birthday Anniversary on 2nd April 2015. It is developed through the social movement and collaborative efforts of multiple stakeholders in both the public and private sectors to produce and share educational materials via the Internet under an open licensing agreement. The aim is to reduce the cost, access, and usage barriers faced by students, teachers, and learners, especially disadvantaged and disabled children and young people who lack opportunities to access education and knowledge. The materials provided in Thai OER cover a range of topics in different fields and come in different formats for all users. Thai OER also collects and shares resources from the country's GLAM sector (Galleries, Libraries, Archives and Museums). This paper presents the benefits of Thai OER at different levels and highlights major challenges in developing and adopting OER in Thailand. These challenges fall into four categories: technical infrastructure, economic, social, and legal.

Fair play at Carlos III University of Madrid Library

Belén Fernandez-Del-Pino Torres and Teresa Malo-De-Molina Martín-Montalvo

This article presents projects carried out at the Carlos III University of Madrid Library related to the FAIR principles: how we designed them, their implementation, and the main results. Over time, the Library is evolving from a traditional library into an "as open as possible, as closed as necessary" digital library.

Supporting description of research data: evaluation and comparison of term and concept extraction approaches

Cláudio Monteiro, Carla Teixeira Lopes and João Rocha da Silva

The importance of research data management is widely recognized, and it should start as early as possible in the research workflow to minimize the risk of data loss.

Anonymized distributed PHR using Blockchain for openness and non-repudiation guarantee

David Mendes, Irene Rodrigues, César Fonseca, Manuel Lopes, José García-Alonso and Javier Berrocal

We introduce our solution, developed for data privacy and specifically for cognitive security, which can be enforced and guaranteed using blockchain technology in SAAL (Smart Ambient Assisted Living) environments. Personal clinical and demographic information is segmented at various levels, which assures that it can only be rebuilt by the interested and authorized parties and that no profiling can be extracted from the blockchain itself. Using our proposal, access to a patient's clinical process resists the tampering and ransomware attacks that have recently plagued Hospital Information Systems (HIS) in various countries. The core of the blockchain model assures non-repudiation by any of the involved information producers, thus maintaining ledger fidelity of the enclosed historical process information. One important side effect of this data infrastructure is that it can be accessed in open form, for research purposes for instance, since no individual re-identification or group profiling is possible by any means.
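The tamper-resistance property described above rests on a standard blockchain idea: each ledger entry's hash covers the previous entry's hash, so altering any stored segment invalidates the rest of the chain. The sketch below illustrates only this generic mechanism with invented record fields; it is not the paper's implementation, which also covers segmentation for privacy and distributed consensus.

```python
import hashlib
import json

# Illustrative sketch (not the paper's system): split a personal health
# record into segments and chain their hashes, so that tampering with any
# stored segment breaks the chain and is detectable.

def chain_segments(segments, prev_hash="0" * 64):
    """Hash each segment together with the previous hash, ledger-style."""
    entries = []
    for seg in segments:
        payload = json.dumps(seg, sort_keys=True)
        h = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        entries.append({"hash": h, "prev": prev_hash})
        prev_hash = h
    return entries

def verify(segments, entries):
    """Recompute the chain and compare; False if any segment was altered."""
    return entries == chain_segments(segments, entries[0]["prev"])

record = [{"demographics": {"age": 74}}, {"clinical": {"dx": "T2DM"}}]
ledger = chain_segments(record)
print(verify(record, ledger))          # -> True
record[1]["clinical"]["dx"] = "HTN"    # simulate tampering
print(verify(record, ledger))          # -> False
```

Note that the ledger stores only hashes, which is consistent with the abstract's claim that no profiling can be extracted from the chain itself.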
