Enriching digital collections using tools for text mining, indexing and visualisation
Axel J. Soto
The tutorial will demonstrate a suite of tools for text mining, semantic indexing and visualisation that will facilitate enhanced searching and exploration of digital collections. Specifically, we aim to provide: (1) an introduction to modular text mining and indexing workflows developed using the Argo platform; (2) an overview of the Elasticsearch indexing engine and the Kibana visualisation platform; and (3) the know-how on building and visualising semantic indexes over textual collections without any programming effort.
The tutorial will cover the end-to-end automatic generation, querying and visualisation of a semantically enabled search index over textual collections. By the end of the tutorial, the audience will have gained knowledge of: (1) exemplar digital collections (e.g., the Biodiversity Heritage Library, British Medical Journal) enhanced with text-mined semantic metadata and visualisation tools; (2) information extraction methods for generating semantic metadata over textual collections; (3) generating Elasticsearch indexes to search over digital collections that were semantically enriched by constructing text mining workflows using Argo; and (4) using the Kibana platform to generate dashboards and visually explore digital collections indexed with Elasticsearch.
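Although the tutorial itself requires no programming, a rough idea of what a semantically enriched index looks like underneath may help. The following is only a minimal sketch, not the Argo output format: the annotation shape (`type`/`text` pairs), the field names and the helper functions are all assumptions made for illustration. It builds plain Python dictionaries in the style of an Elasticsearch 7+ mapping and Query DSL request, without contacting a server.

```python
def to_index_document(doc_id, text, annotations):
    """Flatten text-mined annotations into one keyword facet per semantic type.

    `annotations` is an assumed shape: a list of {"type": ..., "text": ...}.
    """
    doc = {"doc_id": doc_id, "text": text}
    for ann in annotations:
        field = ann["type"].lower()          # e.g. "Taxon" -> field "taxon"
        values = doc.setdefault(field, [])
        if ann["text"] not in values:        # keep each facet value once
            values.append(ann["text"])
    return doc


def build_mapping(semantic_types):
    """Index mapping: full text is analysed, semantic facets are exact keywords."""
    properties = {"doc_id": {"type": "keyword"}, "text": {"type": "text"}}
    for semantic_type in semantic_types:
        properties[semantic_type.lower()] = {"type": "keyword"}
    return {"mappings": {"properties": properties}}


def build_query(free_text, facet, value, top_n=10):
    """Full-text match restricted to one facet value, plus a terms aggregation
    of the kind a dashboard bar chart could be built on."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": free_text}}],
                "filter": [{"term": {facet: value}}],
            }
        },
        "aggs": {"top_values": {"terms": {"field": facet, "size": top_n}}},
    }


# Hypothetical example document and annotations.
page = to_index_document(
    "page-1",
    "Quercus robur was observed near Manchester.",
    [
        {"type": "Taxon", "text": "Quercus robur"},
        {"type": "Location", "text": "Manchester"},
        {"type": "Taxon", "text": "Quercus robur"},
    ],
)
mapping = build_mapping(["Taxon", "Location"])
query = build_query("observed", "taxon", "Quercus robur")
```

Storing the semantic facets as `keyword` fields means they are matched exactly and can be aggregated, which is what makes faceted search and dashboard counts over the extracted entities possible.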
Requirements: A Web browser, preferably Google Chrome, Mozilla Firefox or Safari, is required.
Riza Batista-Navarro (http://personalpages.manchester.ac.uk/staff/riza.batista) is a Lecturer at the School of Computer Science of the University of Manchester. She obtained her bachelor's and master's degrees in Computer Science from the University of the Philippines (where she was an assistant professor at the Department of Computer Science). She then came to Manchester to join the National Centre for Text Mining (NaCTeM) and to pursue her PhD in Computer Science. Riza has worked in a number of challenging areas such as event extraction, coreference resolution and named entity recognition for the biomedical domain. She has been involved in a number of projects including the transatlantic collaboration Mining Biodiversity, in which she led the work on enhancing the Biodiversity Heritage Library archives with text-mined semantic metadata and search capabilities.
Axel J. Soto (http://personalpages.manchester.ac.uk/staff/Axel.Soto/index.html) is a Research Associate at the University of Manchester and the National Centre for Text Mining (NaCTeM), UK. Before this position, he was a Research Associate with the Faculty of Computer Science and Adjunct Professor at Dalhousie University (Canada). He received his B.Sc. in Computer Systems Engineering and his PhD in Computer Science from Universidad Nacional del Sur (Argentina) in 2005 and 2010, respectively. Most of his research focusses on text mining, machine learning and visual analytics. During his postdoctoral work he has been conducting research in visual text analytics, aiming to improve analytical processes for mining large collections of documents and studying how interactive visualisations can help the exploration and extraction of knowledge from text.
Putting Historical Data in Context: How to use DSpace-GLAM (slides)
The proposed tutorial will introduce attendees to DSpace-GLAM (Galleries, Libraries, Archives, Museums), the Digital Library Management System based on DSpace and DSpace-CRIS, developed by 4Science for the management, analysis and preservation of digital cultural heritage, covering its functional and technical aspects.
DSpace-GLAM is an additional open-source configuration for the DSpace platform. It extends the DSpace data model, providing the ability to manage, collect and expose data about every entity important for the cultural heritage domain, such as persons, events, places, concepts and so on.
The extensible data model will be explained in depth, through examples and discussions with participants.
Other main topics will be DSpace-GLAM “components”, relationship management and network analysis.
Finally, 4Science's new add-ons for the fruition and analysis of digital cultural resources (the IIIF – International Image Interoperability Framework – Image Viewer, the Audio/Video Streaming Module, the OCR Module and the CKAN integration) will be illustrated.
At the end of the tutorial the participants will be able to understand the DSpace-GLAM data model, to adapt it to their needs and to evaluate whether DSpace-GLAM fits the needs of their institution.
Requirements: The level of this tutorial is introductory. It is addressed to librarians, archivists, historians, archaeologists, researchers and to those who want to build their own digital library but do not want to write their own software or buy a proprietary solution. No programming ability is required. Basic knowledge of digital library and repository architectures and of the relational model, though not mandatory, will make for a better learning experience.
Andrea Bollini is Chief Technology and Innovation Officer (CTO / CTIO) at 4Science, responsible for ensuring that the most efficient technologies are used to achieve the results of each project. Andrea is actively involved in various international open source and open standards communities, often with leading roles: DSpace committer, Deputy Leader of the euroCRIS Task Group “CERIF” and member of the COAR Working Group “Next Generation Repositories”. He has served as chair, speaker and reviewer at major conferences. Before joining 4Science, Andrea worked for two Italian University Consortia, CILEA and CINECA, where he was responsible for the development, design and management of IT solutions and projects in the field of research, electronic publishing and open access repositories.
Claudio Cortese is Project Manager and Business Analyst at 4Science. He has a PhD in archaeology and is a lecturer in “Computer Applications in Archaeology” at the Catholic University of Milan. For several years he has been dealing with modeling, management and analysis of data for cultural heritage using a great variety of methods, standards and technologies: from relational databases to GIS, from Digital Libraries to architectures based on the Semantic Web. Before joining 4Science, he worked for two Italian University Consortia, CILEA and CINECA, mainly dealing with Digital Library Management Systems to preserve, use and distribute digital cultural resources. In the historical/archaeological field Claudio has focused on the design and creation of databases for many Archaeological Missions and University Research Units. He regularly provides consultancy, lessons and courses to universities and public/private institutions.
Innovation search presents many challenges to the research community and also to professional searchers and search solution providers. Patents are complex technical documents, whose content appears in many languages and contains images, chemical and genomic structures and other forms of data, intermixed and cross-referring with the text material. Further, much innovation search involves searching other forms of technical information, such as scientific papers, or integrating sources such as open linked data with the patent data. Finally, the realistic presentation of search and analysis results to often non-technical and time-poor audiences for the purpose of strategic decision making presents particular challenges.
The course will review the state of the art and point out where the key challenges are, especially for early stage researchers and innovation professionals in patent search and related disciplines. The objectives of the tutorial are:
- Understand the international patent system, patent searching and the relevant state of the art.
- Understand the key limitations and challenges for the research community in the development of patent retrieval and innovation search systems in general.
- Understand how recent developments in information retrieval, multilingual and interactive information access may be applied to patent searching research.
Requirements: We expect the target audience to consist mainly of two groups. First, postgraduate students and post-doc researchers from academia engaging in studies related to information retrieval, professional search systems and natural language processing. Second, researchers from other related disciplines and professionals (e.g. search solutions providers) who will be given the opportunity to enhance their expertise in the area of innovation search. Since the class will introduce foundations and basic concepts of patents and innovation search, it will also be accessible to individuals not familiar with the field of information retrieval. We will therefore not rely on any particular prior knowledge.
Michail Salampasis is a professor at the department of Informatics at the TEI of Thessaloniki, Greece. He has a BSc in Informatics and a PhD in Computing. His main research interests are in applied studies in information science, innovation search and distributed information retrieval. He has published about 80 papers in refereed journals and international conferences. He was the coordinator of the COST Action IC1002 on “Multilingual and Multifaceted Interactive Information Access” and a Marie Curie Fellow at the Institute of Software Technology and Interactive Systems, Vienna University of Technology on a program entitled “Pluggable Platform for Personalised Multilingual Patent Search”. Salampasis has taught at different levels, from BSc/MSc lectures on information retrieval and patent search to tutorials aimed at PhD students and researchers. Web site: www.it.teithe.gr/~cs1msa.
Enabling Precise Identification and Citability of Dynamic Data – Recommendations of the RDA Working Group on Data Citation (slides)
“Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse.” (Data Citation Principles)
While the importance of these Data Citation Principles is by now widely accepted, several challenges persist when it comes to actually providing the services needed to support precise identification and citation of data, particularly in dynamic environments. In order to repeat an earlier study, or to apply data from an earlier study to a new model, we need to be able to precisely identify the very subset of data used. Verbal descriptions of how the subset was created (e.g. by providing selected attribute ranges and time intervals) are hardly precise enough and do not support automated handling, while keeping redundant copies of the data in question does not scale up to the big data settings encountered in many disciplines today. Conventional approaches, such as assigning persistent identifiers to entire data sets or individual subsets or data items, are not sufficient to meet these requirements. This problem is further exacerbated if the data itself is dynamic, i.e. if new data keeps being added to a database, if errors are corrected or if data items are being deleted.
Starting from the Data Citation Principles, we will review the challenges identified above and discuss the solutions and recommendations that have been elaborated within the context of a Working Group of the Research Data Alliance (RDA) on Data Citation: Making Dynamic Data Citeable. These approaches are based on versioned and time-stamped data sources, with persistent identifiers being assigned to the time-stamped queries/expressions that are used for creating the subset of data. We will review examples of how these can be implemented for different types of data, including SQL-style databases, comma-separated value (CSV) files and others, and take a look at operational implementations in a variety of data centers.
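The idea of versioned, time-stamped data with identifiers assigned to time-stamped queries can be sketched for an SQL-style database as follows. This is only an illustrative sketch under stated assumptions, not the Working Group's reference implementation: the table layout, the `VersionedStore` class, the logical clock, the UUID-based placeholder identifier and the result hash are all choices made here for the example. A real deployment would use a proper persistent identifier scheme and real timestamps.

```python
import hashlib
import sqlite3
import uuid


def digest(rows):
    """Fingerprint of a result set, used to verify a re-executed query."""
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


class VersionedStore:
    """Hypothetical versioned table plus a store of cited queries."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.executescript("""
            CREATE TABLE readings (
                station TEXT, value REAL,
                valid_from INTEGER NOT NULL,  -- when this row version appeared
                valid_to INTEGER              -- NULL while the version is current
            );
            CREATE TABLE query_store (
                pid TEXT PRIMARY KEY, min_value REAL,
                executed_at INTEGER, result_hash TEXT
            );
        """)
        self.clock = 0  # logical clock standing in for real timestamps

    def _tick(self):
        self.clock += 1
        return self.clock

    def insert(self, station, value):
        self.db.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)",
                        (station, value, self._tick()))

    def delete(self, station):
        # Rows are never physically removed, only marked as no longer valid.
        self.db.execute("UPDATE readings SET valid_to = ? "
                        "WHERE station = ? AND valid_to IS NULL",
                        (self._tick(), station))

    def update(self, station, value):
        # A correction closes the old version and adds a new one.
        now = self._tick()
        self.db.execute("UPDATE readings SET valid_to = ? "
                        "WHERE station = ? AND valid_to IS NULL", (now, station))
        self.db.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)",
                        (station, value, now))

    def select_as_of(self, min_value, as_of):
        """Run the subset query against the state of the data at `as_of`."""
        return self.db.execute(
            "SELECT station, value FROM readings "
            "WHERE value >= ? AND valid_from <= ? "
            "AND (valid_to IS NULL OR valid_to > ?) ORDER BY station",
            (min_value, as_of, as_of)).fetchall()

    def cite(self, min_value):
        """Store the query parameters with a timestamp; return an identifier."""
        as_of = self.clock
        rows = self.select_as_of(min_value, as_of)
        pid = "pid:" + uuid.uuid4().hex  # placeholder, not a real PID scheme
        self.db.execute("INSERT INTO query_store VALUES (?, ?, ?, ?)",
                        (pid, min_value, as_of, digest(rows)))
        return pid

    def resolve(self, pid):
        """Re-execute the cited query as of its timestamp and verify the result."""
        min_value, as_of, expected = self.db.execute(
            "SELECT min_value, executed_at, result_hash FROM query_store "
            "WHERE pid = ?", (pid,)).fetchone()
        rows = self.select_as_of(min_value, as_of)
        if digest(rows) != expected:
            raise ValueError("result set no longer matches the cited subset")
        return rows
```

In this sketch only the query parameters, the timestamp and a hash are stored for each citation, rather than a copy of the subset, so the cost of a citation does not grow with the size of the data it identifies.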
Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego, CA: FORCE11; 2014.
Andreas Rauber is Associate Professor at the Department of Software Technology and Interactive Systems (ifs) at the Vienna University of Technology (TU-Wien). He is also president of AARIT, the Austrian Association for Research in IT, a Key Researcher at Secure Business Austria (SBA-Research) and Co-Chair of the RDA Working Group on Dynamic Data Citation. He received his MSc and PhD in Computer Science from the Vienna University of Technology in 1997 and 2000, respectively. In 2001 he joined the National Research Council of Italy (CNR) in Pisa as an ERCIM Research Fellow, followed by an ERCIM Research position at the French National Institute for Research in Computer Science and Control (INRIA), at Rocquencourt, France, in 2002. From 2004 to 2008 he was also head of the iSpaces research group at the eCommerce Competence Center (ec3).
His research interests cover the broad scope of digital libraries and information spaces, including specifically text and music information retrieval and organization, information visualization, as well as data analysis and digital preservation, all of which have recently started to merge under the umbrella of reproducible science.