Technical component
PDFfetcher
Unlocking knowledge through PDF acquisition
PDFfetcher is a tool that acquires the full text of publications by downloading PDFs from their URLs. As of March 2026, it manages 60+ million PDF/XML items, including ~19 million full texts relevant to SciLake pilots.
Publications:
- P. Koloveas, S. Chatzopoulos, C. Tryfonopoulos, T. Vergoulis (2023) BIP! NDR (NoDoiRefs): A Dataset of Citations From Papers Without DOIs in Computer Science Conferences and Workshops, doi: https://doi.org/10.48550/arXiv.2307.12794
Functionalities
- Fetches PDF/XML full texts associated with publication graph entities
- Delivers documents to the IIS as input for plain-text extraction
- Feeds text & data mining pipelines for further analysis
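
The fetching step can be illustrated with a minimal Python sketch. It is not PDFfetcher's actual implementation: the names `FetchedDocument`, `detect_kind`, `download`, and `fetch_full_text`, the user-agent string, and the magic-byte check are all assumptions made for illustration. The idea is to download whatever a publication URL returns and keep it only if the payload looks like a PDF or an XML document, since many publication links resolve to HTML landing pages instead of the full text.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.request import Request, urlopen


@dataclass
class FetchedDocument:
    """Hypothetical record for one successfully fetched full text."""
    url: str
    kind: str       # "pdf" or "xml"
    payload: bytes


def detect_kind(payload: bytes) -> Optional[str]:
    """Classify a payload by its leading bytes (assumed heuristic)."""
    head = payload[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"<?xml"):
        return "xml"
    # Anything else (e.g. an HTML landing page) is not treated as a full text.
    return None


def download(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw bytes behind a URL using the standard library."""
    request = Request(url, headers={"User-Agent": "pdf-fetch-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_full_text(
    url: str,
    downloader: Callable[[str], bytes] = download,
) -> Optional[FetchedDocument]:
    """Return the document behind `url`, or None if it is unusable."""
    try:
        payload = downloader(url)
    except OSError:
        # Covers network failures, HTTP errors, and timeouts from urllib.
        return None
    kind = detect_kind(payload)
    if kind is None:
        return None
    return FetchedDocument(url=url, kind=kind, payload=payload)
```

The `downloader` parameter is injectable so that the classification logic can be exercised without network access. A result of `None` means the link did not yield a usable PDF or XML payload, and only non-`None` results would be passed on to plain-text extraction.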