Technical component

PDFfetcher

Unlocking knowledge through PDF acquisition

PDFfetcher acquires the full text of publications by downloading PDF and XML files from the URLs associated with them. As of March 2026, it manages more than 60 million PDF/XML items, including about 19 million full texts relevant to SciLake pilots.

Publications:

P. Koloveas, S. Chatzopoulos, C. Tryfonopoulos, T. Vergoulis (2023) BIP! NDR (NoDoiRefs): A Dataset of Citations From Papers Without DOIs in Computer Science Conferences and Workshops, doi: https://doi.org/10.48550/arXiv.2307.12794

Functionalities

Fetches PDFs/XML full texts associated with publication graph entities

Delivers documents to the Information Inference Service (IIS) as input for plain-text extraction

Feeds text & data mining pipelines for further analysis
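
The fetching step listed above can be illustrated with a minimal Python sketch. It is not PDFfetcher's actual implementation: the function names (`detect_format`, `http_get`, `fetch_full_texts`), the User-Agent string, and the magic-byte check are all assumptions made for illustration. The sketch downloads each URL, keeps only payloads that look like PDF or XML, and skips anything else, such as an HTML landing page.

```python
from typing import Callable, Dict, Optional, Tuple
from urllib.request import Request, urlopen


def detect_format(payload: bytes) -> Optional[str]:
    """Guess the full-text format from the first bytes of the payload."""
    head = payload[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"<?xml"):
        return "xml"
    return None  # e.g. an HTML landing page instead of the document


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw bytes behind a URL (hypothetical default fetcher)."""
    request = Request(url, headers={"User-Agent": "fulltext-fetch-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_full_texts(
    urls_by_id: Dict[str, str],
    fetch: Callable[[str], bytes] = http_get,
) -> Dict[str, Tuple[str, bytes]]:
    """Map each publication id to (format, payload) for usable downloads."""
    results: Dict[str, Tuple[str, bytes]] = {}
    for pub_id, url in urls_by_id.items():
        try:
            payload = fetch(url)
        except OSError:
            # Network-level failures (URLError, timeouts) are skipped here;
            # a real service would presumably log and retry them.
            continue
        fmt = detect_format(payload)
        if fmt is not None:
            results[pub_id] = (fmt, payload)
    return results
```

Passing the fetcher as a parameter keeps the format check testable without network access. The returned payloads could then be handed to a downstream plain-text extraction step.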

For

Research Communities

Service Providers

Provided by

Contacts

Claudio Atzori

Miriam Baglioni

Marek Horst