Legacy Research Web Collections

Research-related collections of digital content on the web that are now outdated and/or no longer actively maintained. This can include software and published or unpublished source code.

Web Research Outputs

Examples

Academic and institutional websites from the first decade of the web containing details of research projects and interests as well as research data

Imminence

4/5
Action is recommended within twelve months. Detailed assessment is a priority.

Effort

4/5
Loss seems likely. By the time tools or techniques have been developed, the material will likely have been lost.

Hazards

Inaccessible to web archive; bespoke code; insufficient documentation; uncertainty over IPR or the presence of orphaned works

Out of Band

Mitigations

Secured by web archive; documentation and rights information published alongside material

Bit List History

Added to list: 2019

2024: No change.

2023: No change.

Last Review

2023 Review

This entry was added in 2019. While there are overlaps with the ‘Semi-Published Research Data’ and ‘Unpublished Research Data’ entries, it is a separate entry to distinguish between ‘current’ and ‘legacy’ collections, which have different risk profiles. In 2020, the fact that the materials in legacy web collections were no longer actively maintained increased the risk classification to Critically Endangered. The 2021 Jury agreed with these distinctions, adding that loss has already occurred and that future loss can be prevented through approaches such as web archiving and code preservation. They identified a 2021 trend towards greater risk based on noted security issues posed by hosting legacy technology, software and services, which prompted the imminent disposal of content without adequate review or selection. The 2022 Taskforce agreed with this assessment, noting no change to the trend (it remained on the same basis as before).

The 2023 Council agreed with the Critically Endangered classification, with risks remaining on the same basis as before (‘No change’ to trend), but also noted a greater inevitability of loss compared to previous reviews. Additionally, the Council recommended that a nomination received for a new entry, on unpublished digital indices and transcriptions in the DIMEV Open-Access Digital Edition of the Index of Middle English Verse, would serve better as an example within this entry than as a new, standalone entry. The 2023 Council further recommended that the next major review consider rescoping the entry, possibly splitting it into separate areas to assess the different levels of risk relating to published and unpublished source code in legacy research web collections.

2024 Interim Review

These risks remain on the same basis as before, with no significant trend towards even greater or reduced risk (‘No change’ to trend).

Additional Information

These collections are valuable but lose funding and care as institutions reconfigure their tasks and as individuals step back from them, whether through retirement or, in the case of volunteers, old age.

There are a very large number of legacy research web resources that people are not aware of.

This is not necessarily a technical challenge but a resource challenge.

The Internet Archive and national web archiving bodies hold copies of many websites that would fit into this category, but by no means all. There is also a distinction between the data and the software or code used to deliver the user experience. Such code is secondary to the content.

This issue can be intensified by legacy IT infrastructure in cases where much of the content is hosted there, as security concerns may lead to the imminent disposal of content. In these scenarios, action becomes more urgent given the security issues posed by hosting legacy technology, software and services.

Case Studies & Examples

The British Library cyber incident is a case example of the issues that arise when working with legacy systems. See Learning Lessons from the Cyber-attack: British Library cyber incident review, The British Library (2024) [accessed at 2024-09-06].
One example of an at-risk legacy research web collection, provided by the nominator of this entry, is the unpublished digital indices and transcriptions in the DIMEV Open-Access, Digital Edition of the Index of Middle English Verse. The index comprises transcriptions of Middle English text made by a research team, which were gathered as XML sheets and built upon a print publication: the Index of Middle English Verse (1943). These transcriptions involved significant financial and time investment, and many are transcriptions of material unavailable online as digital facsimiles. It is uncertain how the data that underlies the web resource is stored, whether it is being stored by a university, or whether it could easily be recovered. See The DIMEV: An Open-Access, Digital Edition of the Index of Middle English Verse, Mooney, L., Mosser, D., Solopova, E., Thorpe, D., Hill Radcliffe, D., Hatfield, L., Cornelius, I. and Johnston, M. [accessed at 2023-10-24].
The recovery of the VecNet archive of malaria-related publications offers another example, one with obvious public health implications. VecNet was founded in 2011 as a network of institutions assembled to address the concerns and recommendations of the Malaria Eradication Research Agenda initiative. It became a portal for malaria information and analysis tools, with the goal of extending present vector control interventions and enabling incorporation of additional interventions to achieve elimination. By 2019 an important component of the portal, the DataCite repository, had ceased to be available. However, the Vector-Borne Disease Network Data Warehouse (VecNet-DW), a project of departments of the University of Notre Dame and the Institute of Tropical Health and Medicine at James Cook University, retained the relevant data and is collaborating with Data Futures, which created the new Invenio repository. See VecNet, Invenio [accessed at 2023-10-24].
Preserving the Carmichael Watson Research Project website at the University of Edinburgh: a case study of how this project website, online only from 2013 until 2018, came to be at imminent risk of permanent loss, and of the strategy undertaken to transform it into a more sustainable format through web archiving and to revive its public accessibility. See Using ArchiveWeb.page to capture the Carmichael Watson Project, Day Thomas, S. and Hawes, A. (2021), Web Archiving & Preservation Working Group [accessed at 2023-10-24].
See ‘Secure your digital datasets — by letting a data centre look after them!’, Fellgett, M. (2021), British Geological Survey Blogs [accessed at 2023-10-24].
