- Service Component Info:
- Data catalogue, Service Component Intro:
The SciLake Catalogue is a central registry for discovering Scientific Knowledge Graphs (SKGs) and tools across the SciLake ecosystem. , Service Component Logo:
, Service Component Motto:
Enriching research through comprehensive resource description, It provides a single point of access with rich metadata (e.g., location, provenance, dependencies), independent of where resources are hosted or running. The catalogue is hosted on D4Science, ensuring a stable and sustainable environment for publishing, managing, and long-term access to SciLake resources.
URL: https://services.d4science.org/web/scilake_lab/catalogue
- SC Contact:
- SC Contact Person:
Miriam Baglioni, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Users:
- Service Component Users:
Research Communities
- Service Component Users:
Service Providers
- SC Organisations:
- SC Features:
- SC Features Name:
Central registry, SC Features Desc:
of available SKGs and tools
- SC Features Name:
Search and discovery, SC Features Desc:
through rich metadata descriptions
- SC Features Name:
Governance support, SC Features Desc:
via provenance and dependency information
- SC Features Name:
long-term availability, SC Features Desc:
via D4Science hosting
The SciLake Catalogue is a central registry for discovering Scientific Knowledge Graphs (SKGs) and tools across the SciLake ecosystem.
- Service Component Info:
- Information Inference Service, Service Component Intro:
Information Inference Service (IIS) is a flexible data processing system for handling big data based on Apache Hadoop technologies. It is a subsystem of the OpenAIRE system and it uses algorithms to extract new entities and relations from full texts to enrich SKGs. , Service Component Logo:
, Service Component Motto:
Enhancing metadata through text and data mining, In practice, IIS defines data processing workflows that connect various modules, each one with well-defined input and output.
A high-level overview of IIS can be found in the paper “Information Inference in Scholarly Communication Infrastructures: The OpenAIREplus Project Experience", Procedia Computer Science, vol. 38, 2014, 92-99”.
Documentation: Enrichment by mining | OpenAIRE Graph Documentation
Publications:
- Fedoryszak, M., Tkaczyk, D., Bolikowski, Ł. (2013). Large Scale Citation Matching Using Apache Hadoop. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_37
- Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. doi:10.1007/978-3-319-08425-1_10
- P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, and L. Bolikowski, "Content Analysis of Scientific Articles in Apache Hadoop Ecosystem", Stud. Comp.Intelligence, vol. 541, 2014.
- Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. doi:10.1007/978-3-031-16802-4_9
- Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. doi:10.1007/978-3-319-67008-9_28
- Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. In International Journal on Document Analysis and Recognition, 2015, vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.
- SC Contact:
- SC Contact Person:
Marek Horst, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Users:
- Service Component Users:
Content Providers
- Service Component Users:
Research Communities
- Service Component Users:
Research Organisations
- Service Component Users:
Innovators
- Service Component Users:
Funders & Policy Makers
- SC Organisations:
- SC Org Logo:
, SC Org URL:
ICM
- SC Features:
- SC Features Desc:
Enhance metadata with information obtained through text and data mining
- SC Features Desc:
Improved linked open science
- SC Features Desc:
Improved research analytics
- SC Features Desc:
Improved research monitoring and impact assessment
- SC Features Desc:
Customers get structured metadata related to the publications
- SC Features Desc:
Funders have access to a list of publications that acknowledge their projects
- SC Features Desc:
Content providers (Repository managers/ OA publishers) may enrich their content
Information Inference Service (IIS) is a flexible data processing system for handling big data based on Apache Hadoop technologies. It is a subsystem of the OpenAIRE system and it uses algorithms to extract new entities and relations from full texts to enrich SKGs.
- Service Component Info:
- KG creation assistant & Interlinking, Service Component Intro:
The Knowledge Graph creation assistant & Interlinking tool is designed to extract knowledge graphs from unstructured or semi-structured data sources and enrich their content. , Service Component Logo:
, Service Component Motto:
Discovering Dependencies, Enriching Knowledge, Through the use of the GGDminer tool, it aims to discover Graph Generating Dependencies (GGDs) and showcase information about the graph's content. This process involves applying topological and differential constraints to generate meaningful dependencies.
Source code: http://github.com/avantlab/R2PG-DM
Publications:
- L.C. Shimomura, G. Fletcher, & N. Yakovets (2023) ProGGD - Data Profiling on Knowledge Graphs using Graph Generating Dependencies. International Workshop on the Semantic Web. URL link
- W. van Leeuwen, G. Fletcher, N. Yakovets (2023) A General Cardinality Estimation Framework for Subgraph Matching in Property Graphs. IEEE Transactions on Knowledge and Data Engineering, 35(6), 5485–5505. DOI: https://doi.org/10.1109/TKDE.2022.3161328
- W. van Leeuwen, G. Fletcher, and N. Yakovets (2024) HomeRun: A Cardinality Estimation Advisor for Graph Databases. Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (GRADES-NDA '24). Association for Computing Machinery, New York, NY, USA, Article 6, 1–9. DOI: https://doi.org/10.1145/3661304.3661902
- L. C. Shimomura, N. Yakovets, G. Fletcher (2024) Discovering Graph Generating Dependencies for Property Graph Profiling. CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. DOI: https://doi.org/10.1145/3627673.3679764; arXiv: https://doi.org/10.48550/arXiv.2403.17082
- L. C. Shimomura, N. Yakovets, G. Fletcher (2024) Reasoning on property graphs with graph generating dependencies. Information Sciences, Volume 672, 120675. DOI: https://doi.org/10.1016/j.ins.2024.120675
- SC Contact:
- SC Contact Person:
Nick Yakovets, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Users:
- Service Component Users:
Service Providers
- Service Component Users:
Research Communities
- SC Organisations:
- SC Features:
- SC Features Name:
Data Interlinking, SC Features Desc:
GGDs as prescription rules for data interlinking
- SC Features Name:
Relational to graph schema mapping, SC Features Desc:
Mapping of relational schemas to property graphs towards KG creation
- SC Features Name:
Understanding the KG, SC Features Desc:
GGDs as description rules to drive KG creation
- SC Features Name:
Scales efficiently, SC Features Desc:
connection pooling, multi-threading, and memory optimizations, supporting datasets up to 10GB (TPC-H) and delivering 90%+ runtime reduction.
- SC Features Name:
Graph modeling, SC Features Desc:
mapping join tables to labeled edges with properties for natural many-to-many relationships.
- SC Features Name:
Standards-ready output, SC Features Desc:
PG-Schema generation aligned with the emerging GQL standard.
The Knowledge Graph creation assistant & Interlinking tool is designed to extract knowledge graphs from unstructured or semi-structured data sources and enrich their content.
- Service Component Info:
- Lake API, Service Component Intro:
The Lake API is a GraphQL web service that gives a single point of access to SciLake’s Scientific Knowledge Graphs. Through one endpoint, it lets users search and retrieve information about research products and their domain-specific enrichments (such as authors and organisations, venues, topics, funding, and other related entities). The software is open source and released under the GPL‑v2 license: https://github.com/athenarc/scilake-api , Service Component Logo:
, Service Component Motto:
Simplifying access to knowledge, https://github.com/athenarc/scilake-api
- SC Contact:
- SC Contact Person:
Nick Yakovets, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Users:
- Service Component Users:
Research Communities
- Service Component Users:
Service Providers
- SC Organisations:
- SC Features:
- SC Features Name:
Single GraphQL endpoint, SC Features Desc:
to access SciLake Scientific Knowledge Graphs
- SC Features Name:
Search across entities, SC Features Desc:
research products, people/organisations, venues, topics, funding, pilot-specific entities
- SC Features Name:
Flexible filtering, SC Features Desc:
exact and partial matches, list-based filters, nested AND/OR logic
- SC Features Name:
Cross-entity graph navigation, SC Features Desc:
traverse relationships in one query, e.g., products ↔ technologies ↔ agents
- SC Features Name:
Pagination & sorting, SC Features Desc:
for scalable browsing of large result sets
- SC Features Name:
Exploration of rich relationships, SC Features Desc:
users can access both core product properties (title, citation count, popularity) and related entities (technologies) in a single request, supporting comprehensive analytical and discovery tasks
The Lake API is a GraphQL web service that gives a single point of access to SciLake’s Scientific Knowledge Graphs. Through one endpoint, it lets users search and retrieve information about research products and their domain-specific enrichments (such as authors and organisations, venues, topics, funding, and other related entities).
- Service Component Info:
- PDFfetcher, Service Component Intro:
PDFfetcher is a tool designed to acquire the full text of publications by collecting PDFs from URL links. With a coverage of over 60 million PDF articles, it provides a comprehensive resource for researchers. , Service Component Logo:
, Service Component Motto:
Unlocking knowledge through PDF acquisition, http://scinem.imsi.athenarc.gr/, Publications:
- P. Koloveas, S. Chatzopoulos, C. Tryfonopoulos, T. Vergoulis (2023) BIP! NDR (NoDoiRefs): A Dataset of Citations From Papers Without DOIs in Computer Science Conferences and Workshops, doi: https://doi.org/10.48550/arXiv.2307.12794
- SC Contact:
- SC Contact Person:
Claudio Atzori, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Contact Person:
Miriam Baglioni, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Contact Person:
Marek Horst, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Users:
- Service Component Users:
Research Communities
- Service Component Users:
Service Providers
- SC Organisations:
- SC Features:
- SC Features Desc:
Fetches PDFs/XML full texts associated with publication graph entities
- SC Features Desc:
Delivers documents to the IIS as input for plain-text extraction
- SC Features Desc:
Feeds text & data mining pipelines for further analysis
PDFfetcher is a tool designed to acquire the full text of publications by collecting PDFs from URL links. As of March 2026, it manages 60+ million PDF/XML items, including ~19 million full texts relevant to SciLake pilots.
Workshop
SciLake at GRADES-NDA ’24
By Stefania Amodeo, Daan de Graaf
SciLake recently participated in the 7th Joint Workshop on Graph Data Management Experiences Systems (GRADES) and Network Data Analytics (NDA), held on June 14, 2024, in Santiago, AA, Chile. This prestigious event unites researchers from academia, industry, and government sectors worldwide to discuss and share the latest breakthroughs in large-scale graph data management and graph analytics systems. It also provides a platform to discuss novel methods and techniques to address domain-specific challenges in real-world graphs.
Daan de Graaf (TU/e) at GRADES-NDA ‘24
Our SciLake partner, Daan de Graaf, had the opportunity to present an accepted article on behalf of authors Wilco van Leeuwen, George Fletcher, and Nikolay Yakovets, all from Eindhoven University of Technology (TU/e). The team showcased "HomeRun", a tool specifically designed for comparing different cardinality estimation techniques in graph databases.
For those new to the topic, the cardinality of a graph database refers to the number of elements in a set, such as the number of edges connected to a node or the total number of nodes in the database. Accurate cardinality estimation is crucial for optimising the performance of queries, as it helps plan the most efficient way to retrieve data.
One of HomeRun's key features is its ability to evaluate the performance of different cardinality estimation techniques in given usage scenarios. The tool generates visualisations automatically, helping users understand the trade-offs between various techniques. This tool is particularly useful for database developers when they face performance issues, like long-running queries, with specific query and dataset combinations.
In SciLake, HomeRun is being used to optimise the database system performance in the context of the WP2 Data Lake Search and Navigation.
For more information about HomeRun, you can refer to the paper:
- Wilco van Leeuwen, George Fletcher, and Nikolay Yakovets. 2024. HomeRun: A Cardinality Estimation Advisor for Graph Databases. In Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (GRADES-NDA '24). Association for Computing Machinery, New York, NY, USA, Article 6, 1–9. https://doi.org/10.1145/3661304.3661902
- Service Component Info:
- SciNem, Service Component Intro:
SciNeM is data science tool for metapath-based querying and analysis of Heterogeneous Information Networks. It enables entity ranking, similarity searches, and community detection. , Service Component Logo:
, Service Component Motto:
Data Science Tool for Heterogeneous Network Mining, http://scinem.imsi.athenarc.gr/, SciNeM is a data science tool for metapath-based querying and analysis of Heterogeneous Information Networks (HINs). It currently supports the following operations, given a user-specified metapath:
- ranking entities using a random walk mode,
- retrieving the most similar pairs of entities,
- finding the most similar entities to a query entity, and
- discovering entity communities via several community detection algorithms.
All supported operations have been implemented in a scalable manner, utilising Apache Spark for scaling out through parallel and distributed computation. SciNeM has a modular architecture making it easy to extend it with additional algorithms and functionalities. Moreover, it provides an intuitive, Web-based user interface to build and execute complex constrained metapath-based queries and to explore and visualise the corresponding results.
URL: https://github.com/athenarc/SciNeM
Publications: https://doi.org/10.5441/002/edbt.2021.76
- SC Contact:
- SC Contact Person:
Serafeim Chatzopoulos, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Contact Person:
Thanasis Vergoulis, SC Contact Mail:
This email address is being protected from spambots. You need JavaScript enabled to view it.
- SC Users:
- Service Component Users:
Research Communities
- Service Component Users:
Research Managers
- Service Component Users:
Research Organisations
- Service Component Users:
Innovators
- Service Component Users:
Funders & Policy Makers
- SC Organisations:
- SC Features:
- SC Features Name:
Ranking, SC Features Desc:
Assigns centrality scores to entities in a graph using a random walk mode.
- SC Features Name:
Similarity search, SC Features Desc:
Identifies most similar entities to a given entity.
- SC Features Name:
Community detection, SC Features Desc:
Discovers communities of entities in the graph.
SciNeM is data science tool for metapath-based querying and analysis of Heterogeneous Information Networks. It enables entity ranking, similarity searches, and community detection.