AvantGraph

Service Component Info:
- AvantGraph, Service Component Intro: AvantGraph is a high-performance graph processing engine for scientific data lakes. It supports a wide range of data processing tasks and lets services built on top of it run analytics on graphs. , Service Component Logo: , Service Component Motto: High-performance graph analytics, https://avantgraph.io/,
  Documentation: https://avantgraph.io/
  
  Source code:
  
  AvantGraph: https://doi.org/10.5281/zenodo.20431260
  
  GraphAlg: An Embeddable Language for Writing Graph Algorithms in Linear Algebra: https://doi.org/10.5281/zenodo.20440843
  
  Demo: https://github.com/avantlab/scilake-demo
  
  Publications:
  
  van Leeuwen, W., Mulder, T., van de Wall, B., Fletcher, G., & Yakovets, N. (2022). AvantGraph Query Processing Engine. Proc. VLDB Endow., 15, 3698-3701. DOI:10.14778/3554821.3554878
  
  Mulder, T., Fletcher, G. & Yakovets, N. Optimizing navigational graph queries. The VLDB Journal 34, 16 (2025). https://doi.org/10.1007/s00778-024-00892-7; arXiv: https://doi.org/10.48550/arXiv.2406.05417
  
  W. van Leeuwen, G. Fletcher, and N. Yakovets (2024) HomeRun: A Cardinality Estimation Advisor for Graph Databases. Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (GRADES-NDA '24). Association for Computing Machinery, New York, NY, USA, Article 6, 1–9. DOI: https://doi.org/10.1145/3661304.3661902
SC Contact:
- SC Contact Person: Nick Yakovets
SC Users:
- Service Component Users: Service Providers
- Service Component Users: Research Organisations
- Service Component Users: Funders
SC Organisations:
- SC Org Logo: , SC Org URL: Eindhoven University of Technology
SC Features:
- SC Features Name: Graph storage + analytics engine for SKGs, SC Features Desc: designed for robust and efficient processing at scale.
- SC Features Name: Cypher query support
- SC Features Name: GraphAlg integration, SC Features Desc: algorithm language, to run custom analytics within the database/query workflow
- SC Features Name: Performance-oriented compilation
- SC Features Name: Supports Lake API deployment

Technical component

AvantGraph is a high-performance graph processing engine for scientific data lakes. It supports a wide range of data processing tasks and lets services built on top of it run analytics on graphs.

Data Catalogue

Service Component Info:
- Data catalogue, Service Component Intro: The SciLake Catalogue is a central registry for discovering Scientific Knowledge Graphs (SKGs) and tools across the SciLake ecosystem. , Service Component Logo: , Service Component Motto: Enriching research through comprehensive resource description,
  It provides a single point of access with rich metadata (e.g., location, provenance, dependencies), independent of where resources are hosted or running. The catalogue is hosted on D4Science, ensuring a stable and sustainable environment for publishing, managing, and long-term access to SciLake resources.
  
  URL: https://services.d4science.org/web/scilake_lab/catalogue
SC Contact:
- SC Contact Person: Miriam Baglioni
SC Users:
- Service Component Users: Research Communities
- Service Component Users: Service Providers
SC Organisations:
- SC Org Logo: , SC Org URL: OpenAIRE
SC Features:
- SC Features Name: Central registry, SC Features Desc: of available SKGs and tools
- SC Features Name: Search and discovery, SC Features Desc: through rich metadata descriptions
- SC Features Name: Governance support, SC Features Desc: via provenance and dependency information
- SC Features Name: Long-term availability, SC Features Desc: via D4Science hosting

The SciLake Catalogue is a central registry for discovering Scientific Knowledge Graphs (SKGs) and tools across the SciLake ecosystem.

Information Inference Service

Service Component Info:
- Information Inference Service, Service Component Intro: Information Inference Service (IIS) is a flexible data processing system for handling big data based on Apache Hadoop technologies. It is a subsystem of the OpenAIRE system and it uses algorithms to extract new entities and relations from full texts to enrich SKGs. , Service Component Logo: , Service Component Motto: Enhancing metadata through text and data mining,
  In practice, IIS defines data processing workflows that connect various modules, each one with well-defined input and output.
  
  A high-level overview of IIS can be found in the paper “Information Inference in Scholarly Communication Infrastructures: The OpenAIREplus Project Experience”, Procedia Computer Science, vol. 38, 2014, 92-99.
  
  Documentation: Enrichment by mining | OpenAIRE Graph Documentation
  
  Publications:
  
  Fedoryszak, M., Tkaczyk, D., Bolikowski, Ł. (2013). Large Scale Citation Matching Using Apache Hadoop. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_37
  
  Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. doi:10.1007/978-3-319-08425-1_10
  
  P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, and L. Bolikowski, "Content Analysis of Scientific Articles in Apache Hadoop Ecosystem", Stud. Comp.Intelligence, vol. 541, 2014.
  
  Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. doi:10.1007/978-3-031-16802-4_9
  
  Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. doi:10.1007/978-3-319-67008-9_28
  
  Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. In International Journal on Document Analysis and Recognition, 2015, vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.
SC Contact:
- SC Contact Person: Marek Horst
SC Users:
- Service Component Users: Content Providers
- Service Component Users: Research Communities
- Service Component Users: Research Organisations
- Service Component Users: Innovators
- Service Component Users: Funders & Policy Makers
SC Organisations:
- SC Org Logo: , SC Org URL: ICM
SC Features:
- SC Features Desc: Enhance metadata with information obtained through text and data mining
- SC Features Desc: Improved linked open science
- SC Features Desc: Improved research analytics
- SC Features Desc: Improved research monitoring and impact assessment
- SC Features Desc: Customers get structured metadata related to the publications
- SC Features Desc: Funders have access to a list of publications that acknowledge their projects
- SC Features Desc: Content providers (Repository managers/ OA publishers) may enrich their content

Information Inference Service (IIS) is a flexible data processing system for handling big data based on Apache Hadoop technologies. It is a subsystem of the OpenAIRE system and it uses algorithms to extract new entities and relations from full texts to enrich SKGs.

KG creation assistant & Interlinking

Service Component Info:
- KG creation assistant & Interlinking, Service Component Intro: The Knowledge Graph creation assistant & Interlinking tool is designed to extract knowledge graphs from unstructured or semi-structured data sources and enrich their content. , Service Component Logo: , Service Component Motto: Discovering Dependencies, Enriching Knowledge,
  Through the use of the GGDminer tool, it aims to discover Graph Generating Dependencies (GGDs) and showcase information about the graph's content. This process involves applying topological and differential constraints to generate meaningful dependencies.
  
  Source code:
  
  R2PG-DM: Relational to Property Graph Direct Mapping: https://doi.org/10.5281/zenodo.20430312
  
  sHINER: GGD Validation via G-Core on Spark: https://doi.org/10.5281/zenodo.20430948
  
  ProGGD: Profiling Knowledge Graphs with Graph Generating Dependencies: https://doi.org/10.5281/zenodo.20430854
  
  GGDMiner: Automatic Discovery of Graph Generating Dependencies: https://doi.org/10.5281/zenodo.20430786
  
  Publications:
  
  L.C. Shimomura, G. Fletcher, & N. Yakovets (2023) ProGGD - Data Profiling on Knowledge Graphs using Graph Generating Dependencies. International Workshop on the Semantic Web.
  
  W. van Leeuwen, G. Fletcher, N. Yakovets (2023) A General Cardinality Estimation Framework for Subgraph Matching in Property Graphs. IEEE Transactions on Knowledge and Data Engineering, 35(6), 5485–5505. DOI: https://doi.org/10.1109/TKDE.2022.3161328
  
  W. van Leeuwen, G. Fletcher, and N. Yakovets (2024) HomeRun: A Cardinality Estimation Advisor for Graph Databases. Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (GRADES-NDA '24). Association for Computing Machinery, New York, NY, USA, Article 6, 1–9. DOI: https://doi.org/10.1145/3661304.3661902
  
  L. C. Shimomura, N. Yakovets, G. Fletcher (2024) Discovering Graph Generating Dependencies for Property Graph Profiling. CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. DOI: https://doi.org/10.1145/3627673.3679764; arXiv: https://doi.org/10.48550/arXiv.2403.17082
  
  L. C. Shimomura, N. Yakovets, G. Fletcher (2024) Reasoning on property graphs with graph generating dependencies. Information Sciences, Volume 672, 120675. DOI: https://doi.org/10.1016/j.ins.2024.120675
SC Contact:
- SC Contact Person: Nick Yakovets
SC Users:
- Service Component Users: Service Providers
- Service Component Users: Research Communities
SC Organisations:
- SC Org Logo: , SC Org URL: Eindhoven University of Technology
SC Features:
- SC Features Name: Data Interlinking, SC Features Desc: GGDs as prescription rules for data interlinking
- SC Features Name: Relational to graph schema mapping, SC Features Desc: Mapping of relational schemas to property graphs towards KG creation
- SC Features Name: Understanding the KG, SC Features Desc: GGDs as description rules to drive KG creation
- SC Features Name: Scales efficiently, SC Features Desc: connection pooling, multi-threading, and memory optimizations, supporting datasets up to 10GB (TPC-H) and delivering 90%+ runtime reduction.
- SC Features Name: Graph modeling, SC Features Desc: mapping join tables to labeled edges with properties for natural many-to-many relationships.
- SC Features Name: Standards-ready output, SC Features Desc: PG-Schema generation aligned with the emerging GQL standard.

The Knowledge Graph creation assistant & Interlinking tool is designed to extract knowledge graphs from unstructured or semi-structured data sources and enrich their content.
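
As an illustration of the relational-to-graph mapping listed among the features above, the following sketch turns entity-table rows into labeled nodes and join-table rows into labeled edges with properties. It is a minimal example under stated assumptions, not the R2PG-DM implementation: the tables, column names, labels (Author, Paper, WROTE) and helper functions are all hypothetical.

```python
# Hypothetical relational data: two entity tables and one join table.
authors = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}]
papers = [{"id": 10, "title": "Graphs"}]
# Join table for the many-to-many relation, with one extra column.
authorship = [
    {"author_id": 1, "paper_id": 10, "position": 1},
    {"author_id": 2, "paper_id": 10, "position": 2},
]


def rows_to_nodes(label, rows):
    """Map each entity-table row to a labeled node keyed by (label, id)."""
    return {(label, r["id"]): {"label": label, "properties": dict(r)} for r in rows}


def join_rows_to_edges(label, rows, src, dst):
    """Map each join-table row to a labeled edge.

    src and dst are (node_label, foreign_key_column) pairs. Columns that are
    not foreign keys become edge properties.
    """
    src_label, src_col = src
    dst_label, dst_col = dst
    edges = []
    for r in rows:
        props = {k: v for k, v in r.items() if k not in (src_col, dst_col)}
        edges.append({
            "label": label,
            "source": (src_label, r[src_col]),
            "target": (dst_label, r[dst_col]),
            "properties": props,
        })
    return edges


nodes = {**rows_to_nodes("Author", authors), **rows_to_nodes("Paper", papers)}
edges = join_rows_to_edges(
    "WROTE", authorship, ("Author", "author_id"), ("Paper", "paper_id")
)
```

Keeping the join table as an edge rather than as an intermediate node means a many-to-many relationship can be traversed in a single hop, while extra columns such as position are preserved as edge properties.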

Lake API

Service Component Info:
- Lake API, Service Component Intro: The Lake API is a GraphQL web service that gives a single point of access to SciLake’s Scientific Knowledge Graphs. Through one endpoint, it lets users search and retrieve information about research products and their domain-specific enrichments (such as authors and organisations, venues, topics, funding, and other related entities). The software is open source and released under the GPL‑v2 license: https://github.com/athenarc/scilake-api , Service Component Logo: , Service Component Motto: Simplifying access to knowledge, https://github.com/athenarc/scilake-api
SC Contact:
- SC Contact Person: Nick Yakovets
SC Users:
- Service Component Users: Research Communities
- Service Component Users: Service Providers
SC Organisations:
- SC Org Logo: , SC Org URL: Eindhoven University of Technology
SC Features:
- SC Features Name: Single GraphQL endpoint, SC Features Desc: to access SciLake Scientific Knowledge Graphs
- SC Features Name: Search across entities, SC Features Desc: research products, people/organisations, venues, topics, funding, pilot-specific entities
- SC Features Name: Flexible filtering, SC Features Desc: exact and partial matches, list-based filters, nested AND/OR logic
- SC Features Name: Cross-entity graph navigation, SC Features Desc: traverse relationships in one query, e.g., products ↔ technologies ↔ agents
- SC Features Name: Pagination & sorting, SC Features Desc: for scalable browsing of large result sets
- SC Features Name: Exploration of rich relationships, SC Features Desc: users can access both core product properties (title, citation count, popularity) and related entities (technologies) in a single request, supporting comprehensive analytical and discovery tasks

The Lake API is a GraphQL web service that gives a single point of access to SciLake’s Scientific Knowledge Graphs. Through one endpoint, it lets users search and retrieve information about research products and their domain-specific enrichments (such as authors and organisations, venues, topics, funding, and other related entities).

The software is open source and released under the GPL‑v2 license: https://github.com/athenarc/scilake-api

Documentation: https://scilake-api.athenarc.gr/

Source code: https://doi.org/10.5281/zenodo.20445193

User Interface: https://scilake-api.athenarc.gr/graphql
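
To give a feel for how a client might talk to a GraphQL endpoint like this one, the sketch below builds (but does not send) an HTTP POST request with a query, a nested AND/OR filter, and pagination variables. Only the endpoint URL comes from this page. The query name, field names, argument names, and filter shape are hypothetical placeholders, so check the documentation linked above for the actual schema.

```python
import json
import urllib.request

# Endpoint taken from the "User Interface" link on this page.
ENDPOINT = "https://scilake-api.athenarc.gr/graphql"

# Hypothetical query: type, field, and argument names are placeholders,
# not the real Lake API schema.
QUERY = """
query Products($filter: ProductFilter, $limit: Int, $offset: Int) {
  products(filter: $filter, limit: $limit, offset: $offset) {
    title
    citationCount
    technologies { name }
  }
}
"""


def build_request(variables):
    """Build a standard GraphQL-over-HTTP POST request without sending it."""
    body = json.dumps({"query": QUERY, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical nested AND/OR filter plus pagination settings.
variables = {
    "filter": {
        "AND": [
            {"title": {"contains": "graph"}},
            {"OR": [{"year": {"eq": 2023}}, {"year": {"eq": 2024}}]},
        ]
    },
    "limit": 10,
    "offset": 0,
}
request = build_request(variables)

# To actually send the request (requires network access):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Sending the query and its variables separately, rather than formatting values into the query string, keeps the query text fixed and lets the server validate variable types.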

Machine Translation System

Service Component Info:
- Machine Translation System, Service Component Intro: The Machine Translation system ensures accurate and contextually appropriate translations by fine-tuning general-purpose machine translation models with domain-specific scientific data. , Service Component Logo: , Service Component Motto: Domain-specific Machine Translation, http://scinem.imsi.athenarc.gr/,
  SciLake provides open, domain-adapted machine translation models to improve the translation of scientific text, including specialised terminology and complex sentence structures. Three models were developed for French→English, Spanish→English, and Portuguese→English (fine-tuned from OPUS‑MT and specialised for the project pilot domains).
  
  The models are open-source and can be downloaded from the Hugging Face platform:
  
  French-English: https://huggingface.co/ilsp/opus-mt-big-fr-en_ct2_ft-SciLake
  
  Portuguese-English: https://huggingface.co/ilsp/opus-mt-pt-en_ct2_ft-SciLake
  
  Spanish-English: https://huggingface.co/ilsp/opus-mt-big-es-en_ct2_ft-SciLake
  
  Publications:
  
  S. Kotitsas, P. Kounoudis, E. Koutli, H. Papageorgiou (2024) Leveraging fine-tuned Large Language Models with LoRA for Effective Claim, Claimer, and Claim Object Detection. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). URL: https://aclanthology.org/2024.eacl-long.156
SC Contact:
- SC Contact Person: Sokratis Sofianopoulos
SC Users:
- Service Component Users: Research Communities
SC Organisations:
- SC Org Logo: , SC Org URL: Athena Research and Innovation Center
SC Features:
- SC Features Name: Domain-adapted translation for scientific terminology
- SC Features Name: Coverage of FR/ES/PT → EN language pairs
- SC Features Name: Integration-ready, SC Features Desc: for workflows that process multilingual scholarly content (e.g., titles/abstracts)

Technical component

The Machine Translation system ensures accurate and contextually appropriate translations by fine-tuning general-purpose machine translation models with domain-specific scientific data.

PDFfetcher

Service Component Info:
- PDFfetcher, Service Component Intro: PDFfetcher is a tool designed to acquire the full text of publications by collecting PDFs from URL links. With a coverage of over 60 million PDF articles, it provides a comprehensive resource for researchers. , Service Component Logo: , Service Component Motto: Unlocking knowledge through PDF acquisition, http://scinem.imsi.athenarc.gr/,
  Publications:
  
  P. Koloveas, S. Chatzopoulos, C. Tryfonopoulos, T. Vergoulis (2023) BIP! NDR (NoDoiRefs): A Dataset of Citations From Papers Without DOIs in Computer Science Conferences and Workshops, doi: https://doi.org/10.48550/arXiv.2307.12794
SC Contact:
- SC Contact Person: Claudio Atzori
- SC Contact Person: Miriam Baglioni
- SC Contact Person: Marek Horst
SC Users:
- Service Component Users: Research Communities
- Service Component Users: Service Providers
SC Organisations:
- SC Org Logo: , SC Org URL: OpenAIRE AMKE
SC Features:
- SC Features Desc: Fetches PDFs/XML full texts associated with publication graph entities
- SC Features Desc: Delivers documents to the IIS as input for plain-text extraction
- SC Features Desc: Feeds text & data mining pipelines for further analysis

PDFfetcher is a tool designed to acquire the full text of publications by collecting PDFs from URL links. As of March 2026, it manages 60+ million PDF/XML items, including ~19 million full texts relevant to SciLake pilots.

SciLake at GRADES-NDA ’24

Workshop

By Stefania Amodeo, Daan de Graaf

SciLake recently participated in the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), held on June 14, 2024, in Santiago, Chile. This prestigious event brings together researchers from academia, industry, and government worldwide to discuss and share the latest breakthroughs in large-scale graph data management and graph analytics systems. It also provides a platform to discuss novel methods and techniques that address domain-specific challenges in real-world graphs.

July 2, 2024

Daan de Graaf (TU/e) at GRADES-NDA ’24

Our SciLake partner, Daan de Graaf, had the opportunity to present an accepted article on behalf of authors Wilco van Leeuwen, George Fletcher, and Nikolay Yakovets, all from Eindhoven University of Technology (TU/e). The team showcased "HomeRun", a tool specifically designed for comparing different cardinality estimation techniques in graph databases.

For those new to the topic: in a graph database, cardinality refers to the number of elements in a set, such as the number of edges connected to a node, the total number of nodes in the database, or the number of results a query produces. Accurate cardinality estimation is crucial for query performance, because the query planner relies on these estimates to choose the most efficient way to retrieve data.
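
To make the idea concrete, the sketch below compares the true number of two-hop paths in a tiny directed graph with a naive estimate that assumes every node has the average out-degree, and measures the gap with the q-error (the factor by which an estimate is off). This is an illustrative example only: the graph, the estimator, and the function names are assumptions, and it does not reproduce how HomeRun or AvantGraph estimate cardinalities.

```python
from collections import Counter


def two_hop_true(edges):
    """Exact number of directed paths a -> b -> c in the graph."""
    out_degree = Counter(src for src, _ in edges)
    # Each edge (a, b) can be extended by every outgoing edge of b.
    return sum(out_degree[dst] for _, dst in edges)


def two_hop_estimate(edges):
    """Naive estimate: assume every node has the average out-degree."""
    nodes = {n for edge in edges for n in edge}
    avg_out_degree = len(edges) / len(nodes)
    return len(edges) * avg_out_degree


def q_error(estimate, truth):
    """Factor by which the estimate is off; both values must be positive."""
    return max(estimate / truth, truth / estimate)


# A star-shaped graph: node 0 links to 1, 2, 3 and each of them links back.
edges = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 0)]
truth = two_hop_true(edges)         # 12 two-hop paths
estimate = two_hop_estimate(edges)  # 6 edges * 1.5 average out-degree = 9.0
error = q_error(estimate, truth)    # 12 / 9, roughly 1.33
```

The estimate is too low here because the hub node has a much higher degree than the average, which is the kind of skew that makes cardinality estimation on real graphs hard.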

One of HomeRun's key features is its ability to evaluate the performance of different cardinality estimation techniques in given usage scenarios. The tool generates visualisations automatically, helping users understand the trade-offs between various techniques. This tool is particularly useful for database developers when they face performance issues, like long-running queries, with specific query and dataset combinations.

In SciLake, HomeRun is being used to optimise database system performance in the context of WP2, Data Lake Search and Navigation.

For more information about HomeRun, you can refer to the paper:

Wilco van Leeuwen, George Fletcher, and Nikolay Yakovets. 2024. HomeRun: A Cardinality Estimation Advisor for Graph Databases. In Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (GRADES-NDA '24). Association for Computing Machinery, New York, NY, USA, Article 6, 1–9. https://doi.org/10.1145/3661304.3661902

SciNeM

Service Component Info:
- SciNeM, Service Component Intro: SciNeM is a data science tool for metapath-based querying and analysis of Heterogeneous Information Networks. It enables entity ranking, similarity searches, and community detection. , Service Component Logo: , Service Component Motto: Data Science Tool for Heterogeneous Network Mining, http://scinem.imsi.athenarc.gr/,
  SciNeM is a data science tool for metapath-based querying and analysis of Heterogeneous Information Networks (HINs). It currently supports the following operations, given a user-specified metapath:
  
  ranking entities using a random walk model,
  
  retrieving the most similar pairs of entities,
  
  finding the most similar entities to a query entity, and
  
  discovering entity communities via several community detection algorithms.
  
  All supported operations have been implemented in a scalable manner, using Apache Spark to scale out through parallel and distributed computation. SciNeM has a modular architecture that makes it easy to extend with additional algorithms and functionalities. Moreover, it provides an intuitive, Web-based user interface to build and execute complex constrained metapath-based queries and to explore and visualise the corresponding results.
  
  URL: https://github.com/athenarc/SciNeM
  
  Source code: https://doi.org/10.5281/zenodo.20448137
  
  Publications: https://doi.org/10.5441/002/edbt.2021.76
SC Contact:
- SC Contact Person: Serafeim Chatzopoulos
- SC Contact Person: Thanasis Vergoulis
SC Users:
- Service Component Users: Research Communities
- Service Component Users: Research Managers
- Service Component Users: Research Organisations
- Service Component Users: Innovators
- Service Component Users: Funders & Policy Makers
SC Organisations:
- SC Org Logo: , SC Org URL: Athena Research and Innovation Center
SC Features:
- SC Features Name: Ranking, SC Features Desc: Assigns centrality scores to entities in a graph using a random walk model.
- SC Features Name: Similarity search, SC Features Desc: Identifies most similar entities to a given entity.
- SC Features Name: Community detection, SC Features Desc: Discovers communities of entities in the graph.

SciNeM is a data science tool for metapath-based querying and analysis of Heterogeneous Information Networks. It enables entity ranking, similarity searches, and community detection.
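
As a small illustration of what a metapath-based similarity search involves, the sketch below counts instances of the metapath Author-Paper-Author in a toy network and ranks authors by PathSim, a well-known metapath-based similarity measure. PathSim is used here only as an example; this page does not state which measures SciNeM implements, and the data and function names are hypothetical.

```python
# Hypothetical toy network: which author wrote which papers.
author_paper = {
    "alice": {"p1", "p2"},
    "bob": {"p2", "p3"},
    "carol": {"p4"},
}


def apa_path_count(a, b):
    """Number of Author-Paper-Author metapath instances between a and b."""
    return len(author_paper[a] & author_paper[b])


def pathsim(a, b):
    """PathSim score of a and b for the symmetric metapath Author-Paper-Author."""
    denominator = apa_path_count(a, a) + apa_path_count(b, b)
    return 2 * apa_path_count(a, b) / denominator if denominator else 0.0


def most_similar(query):
    """Rank all other authors by decreasing similarity to the query author."""
    others = [a for a in author_paper if a != query]
    return sorted(others, key=lambda a: pathsim(query, a), reverse=True)


ranking = most_similar("alice")  # bob shares one paper with alice, carol none
```

On networks of realistic size the same path counts are usually obtained by multiplying adjacency matrices along the metapath, which is the kind of workload that benefits from distributed computation.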
