Herbert Van de Sompel

A Web-Centric Pipeline for Archiving Scholarly Artifacts

Keynote Abstract

Scholars are increasingly using a wide variety of online portals to conduct aspects of their research and to convey research results. These portals exist outside of the established scholarly publishing system and can be dedicated to scholarly use, such as myexperiment.org, or general purpose, such as GitHub and SlideShare. The combination of productivity features and global exposure offered by these portals is attractive to researchers, who happily deposit scholarly artifacts there. Most often, institutions are not even aware of the existence of these artifacts created by their researchers. More importantly, no infrastructure exists to systematically and comprehensively archive them, and the platforms that host them rarely provide archival guarantees; often their terms state quite the opposite.

Initiatives such as LOCKSS and Portico offer approaches to automatically archive the output of the established scholarly publishing system. Platforms like Figshare and Zenodo allow scholars to upload scholarly artifacts created elsewhere. They are appealing from an open science perspective and researchers like the citable DOIs that are provided for contributions. But these platforms don’t offer a comprehensive archive for scholarly artifacts since not all scholars use them, and the ones that do are selective regarding their contributions.

The Scholarly Orphans project, funded by the Andrew W. Mellon Foundation, explores how these scholarly artifacts could automatically be archived. Because of the scale of the problem – the number of platforms and artifacts involved – the project starts from a web-centric resource capture paradigm inspired by current web archiving practice. Because the artifacts are often created by researchers affiliated with an institution, the project focuses on tools for institutions to discover, capture, and archive these artifacts. The Scholarly Orphans team has started devising a prototype of an automatic pipeline that covers all three functions. Trackers monitor the APIs of productivity portals for new contributions by an institution’s researchers. The Memento Tracer framework generates web captures of these contributions. Its novel capturing approach allows generating high-quality captures at scale. The captures are subsequently submitted to a – potentially cross-institutional – web archive that leverages IPFS technology and supports the Memento “Time Travel for the Web” protocol. All components communicate using Linked Data Notifications carrying ActivityStreams2 payloads.
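To make the messaging between components more concrete, the following is a minimal sketch of what a tracker might send when it discovers a new contribution, and of the request header a client would use to ask a Memento-compliant archive for a capture near a given date. It is an illustration only: the abstract does not specify the project's actual payload vocabulary, so the choice of the ActivityStreams2 types `Create`, `Person`, `Document`, and `Application`, the function names, and all example URLs are assumptions. What is standard is the ActivityStreams2 context URL, the `application/ld+json` media type used by Linked Data Notifications, and the `Accept-Datetime` header with its RFC 1123 date format from the Memento protocol (RFC 7089).

```python
import json
from datetime import datetime, timezone
from email.utils import format_datetime

# Standard JSON-LD context for ActivityStreams 2.0.
AS2_CONTEXT = "https://www.w3.org/ns/activitystreams"

# Linked Data Notifications are delivered by POSTing JSON-LD to an inbox URL.
LDN_HEADERS = {"Content-Type": "application/ld+json"}


def build_notification(actor_url, artifact_url, portal_name, published):
    """Build a hypothetical 'new artifact' notification as an AS2 activity.

    The selection of types and properties is an assumption made for
    illustration; it is not the project's documented message format.
    """
    return {
        "@context": AS2_CONTEXT,
        "type": "Create",
        "actor": {"type": "Person", "id": actor_url},
        "object": {"type": "Document", "id": artifact_url},
        "generator": {"type": "Application", "name": portal_name},
        # AS2 timestamps are xsd:dateTime values; normalize to UTC.
        "published": published.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }


def accept_datetime_header(when):
    """Return the Memento header requesting a capture close to `when`."""
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


if __name__ == "__main__":
    note = build_notification(
        actor_url="https://example.org/researchers/jane",  # hypothetical
        artifact_url="https://example.org/artifacts/42",   # hypothetical
        portal_name="ExamplePortal",                       # hypothetical
        published=datetime(2018, 6, 1, 12, 0, tzinfo=timezone.utc),
    )
    # A tracker would POST this body, with LDN_HEADERS, to the inbox of the
    # next component in the pipeline; here it is only printed.
    print(json.dumps(note, indent=2))
    print(accept_datetime_header(datetime(2018, 6, 1, tzinfo=timezone.utc)))
```

The sketch deliberately stops short of any network call, since the inbox locations and the archive's TimeGate URL are not given in the abstract.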

Without adequate infrastructure, scholarly artifacts will vanish from the web in much the same way regular web resources do. The Scholarly Orphans project team hopes that its work will help raise awareness regarding the problem and contribute to finding a sustainable and scalable solution for systematically archiving web-based scholarly artifacts. This talk will be the first public communication about the team’s experimental pipeline for archiving scholarly artifacts.


Herbert Van de Sompel is an Information Scientist at the Los Alamos National Laboratory and, for 15 years, has led the Prototyping Team. The Team does research regarding various aspects of scholarly communication in the digital age, including information infrastructure, interoperability, and digital preservation. Herbert has played a major role in creating the Open Archives Initiative Protocol for Metadata Harvesting, the Open Archives Initiative Object Reuse & Exchange specifications, the OpenURL Framework for Context-Sensitive Services, the SFX linking server, the bX scholarly recommender service, info URI, Web Annotation, ResourceSync, Memento “time travel for the Web”, Robust Links, and Signposting the Scholarly Web. He graduated in Mathematics and Computer Science at Ghent University, Belgium, and holds a Ph.D. in Communication Science from the same university.
