
SciNoBo: Science No Borders


SCILAKE TECHNICAL COMPONENTS


By Stefania Amodeo

In a recent webinar for SciLake partners, Haris Papageorgiou from Athena RC presented the SciNoBo toolkit for Open Science and discussed its benefits for science communities. SciNoBo, which stands for Science No Borders, is a powerful toolkit designed to facilitate open science practices. 

In this blog post, we recap the key points from the webinar and explore the different functionalities offered by SciNoBo.

The Toolkit

The SciNoBo toolkit provides a comprehensive range of modules and functionalities to support researchers in their scientific endeavors. Let's take a closer look at each of these modules and their benefits:

  • Publication Analysis

    Processes publications in PDF format and extracts valuable information such as tables, figures, text, affiliations, authors, citations, and references.

  • Field of Science (FoS) Analysis

    Uses a hierarchical classifier to assign one or more labels to a publication based on its content and metadata. The hierarchical system consists of 6 levels, with the first 3 levels being standard in the literature. This approach adheres to well-established taxonomies in the scientific literature while also capturing the dynamics of scientific developments at levels 5 and 6, where new topics emerge and others fade out (see image below).

  • Collaboration Analysis

    Analyzes collaborations between fields and identifies multidisciplinary papers. Provides insights and indicators to help researchers understand the interdisciplinarity of a publication and joint efforts of researchers from different disciplines.

  • Claim/Conclusion Detection

    Detects claims and conclusions in scientific publications, providing insights for analyzing disinformation and misinformation. Helps determine whether news statements are grounded in scientific evidence, and can collect claims and conclusions from different papers.

  • Citation Analysis

    Aggregates conclusions from various sources, aiding researchers in conducting surveys on citation analysis. Facilitates a comprehensive understanding of how the scientific community adopts or builds upon previous findings.

  • SDG Classification

    Categorizes publications and artifacts based on the Sustainable Development Goals (SDGs). It is a multi-label classifier that assigns multiple labels to a publication, allowing researchers to align their work with specific SDGs.

  • Interdisciplinarity

    Explores research classification at various levels and highlights interdisciplinary aspects. Helps identify collaborations across different fields.

  • Bio-Entity Tagging

    Extracts and annotates health publications based on bio-entities such as genes or proteins. Helps identify and analyze relevant biological information.

  • Citance Semantic Analysis

    Analyzes statements on previous findings in a specific topic. Assesses the scientific community's adoption or expansion of these conclusions, helping researchers understand the endorsement or acceptance of previous research.

  • Research Artifact Detection

    Extracts mentions and references of research artifacts from publications. The type of artifact extracted depends on the specific domain, such as software (computer science), surveys (social sciences, humanities), genes or proteins (biomedical sciences). The goal is to accurately extract all mentions and find all the metadata that coexist in the publication. From there, we can build a database or a knowledge graph that includes all of these artifacts.
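To make the classification modules above concrete, here is a minimal sketch of multi-label tagging in the spirit of the SDG classifier: each publication can receive several labels at once. The keyword rules and labels below are invented for illustration; the actual SciNoBo module is a trained classifier over content and metadata, not a keyword matcher.

```python
# Toy multi-label tagger: assigns every matching label, not just one.
# Keyword rules are illustrative only, not the SciNoBo classifier.

SDG_KEYWORDS = {
    "SDG 3 (Good Health)": {"cancer", "disease", "treatment"},
    "SDG 7 (Clean Energy)": {"photovoltaics", "solar", "energy"},
    "SDG 9 (Industry & Innovation)": {"automation", "infrastructure"},
}

def tag_sdgs(abstract: str) -> list[str]:
    """Return every SDG whose keywords appear in the abstract (multi-label)."""
    words = set(abstract.lower().split())
    return sorted(label for label, kws in SDG_KEYWORDS.items() if words & kws)

print(tag_sdgs("Solar photovoltaics for energy-efficient cancer treatment centres"))
```

Note that a single abstract can trigger two labels here, which is exactly the multi-label behavior the SDG module provides.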

Connection to SciLake Communities

The SciNoBo toolkit aims to remove barriers in the scientific community by providing a collaborative and intuitive assistant. It utilizes powerful language models and integrates with various datasets, including OpenAIRE, Semantic Scholar, and citation databases. Researchers can interact with the assistant using natural language, asking scientific questions and receiving insights from the available modules.

One of the main features of SciNoBo is its ability to help users find literature related to specific research areas or topics within a given domain. The platform provides a list of entries ranked by significance, along with their associated metadata. This allows researchers to easily access relevant publications and explore the research conducted in their field.

Once researchers have identified publications of interest, SciNoBo offers a wide range of functionalities to support their analysis. Users can explore the conclusions, methodology, and results of specific papers, and even read the full paper. The platform also enables users to analyze research artifacts mentioned in the papers, such as databases, genes, or medical ontologies. By examining the usage, citations, and related topics of these artifacts, researchers can gain a deeper understanding of the research landscape in their chosen field.

Each pilot project utilizes a different branch of the hierarchy to narrow down the publications that users may want to further analyze. Here are some examples of possible applications:

  • The Cancer pilot can create a Chronic Lymphocytic Leukemia (CLL) specific knowledge graph (see image below).
  • The Transportation pilot can identify publications that examine "automation" in transportation domains.
  • The Energy pilot can identify publications that examine "photovoltaics" and "distributed generation".
  • The Neuroscience pilot can identify publications that examine "Parkinson's disease" and "PBM treatment".

Six levels of the Field of Science classification system for the Cancer pilot use case

The platform equips researchers with the tools and functionalities to ask any type of question and receive insights based on their collected data. By using retrieval-augmented techniques to feed language models with the collection of publications, SciNoBo ensures accurate and relevant results.

Furthermore, SciNoBo allows users to create their own collections of publications and save their results. This feature enables researchers to build their own knowledge graph and share their findings with the scientific community. By collaborating and expanding on each other's work, users can collectively develop a comprehensive understanding of their respective fields.

Conclusion

In conclusion, the SciNoBo platform is a valuable resource for science communities engaged in open science practices. With its wide range of tools and functionalities, researchers can explore and analyze publications, classify research fields, detect claims and conclusions, and analyze citations. By leveraging the power of large language models and access to diverse data sources, SciNoBo provides an intuitive and immersive platform for researchers to interact with the scientific community and gain valuable insights from scientific literature.


SciLake 2nd Plenary Meeting


Consortium meeting


By Stefania Amodeo

The SciLake team met in Barcelona and online on November 9-10, 2023. The meeting, hosted by SIRIS Academic, provided an opportunity to review the progress made in the past year and plan future work.

This blog post gives a summary of the important topics discussed during the meeting, including the main goals and vision of the project, the challenges for the upcoming months, and the expectations for the pilot projects.

SciLake Main Motivation

The main motivation behind the SciLake project is to address the challenges of combining domain knowledge with open Scientific Knowledge Graphs (SKGs) and to build added-value services tailored to specific domains. This combination is hampered by various issues related to the way domain-specific knowledge is organized (e.g., fragmentation, heterogeneous formats, multilingual texts, and interoperability issues with domain-agnostic SKGs).

"The project goal is to overcome these challenges and create a seamless integration between domain knowledge and open SKGs, ultimately empowering researchers and fostering a more interconnected and efficient scientific community." - Thanasis Vergoulis, SciLake coordinator

SciLake Vision

SciLake is developing a user-friendly “Scientific-Lake-as-a-Service” that is open, transparent, and customizable. This service, built upon the OpenAIRE Graph, will host both domain-specific and general knowledge, making it easier for communities to create, connect, and maintain their own SKGs, while also offering a unified way to access and search the respective information. The project is also developing two specialised services on top of the Scientific Lake: one to assist users in navigating the respective vast knowledge space by exploiting indicators of scientific impact, and another to improve research reproducibility in specific research domains. Finally, real-world pilot tests will be conducted to customise, test, and showcase these services in practice.

Challenges for Next Months

In the upcoming months, the project will focus on understanding the specific needs of the pilots in order to tailor the SciLake services effectively. Roadmaps will be developed for each component of the SciLake services, leading up to their alpha release in June 2024. The release will include comprehensive documentation and demos of each component.

Pilots Role

Pilots in the SciLake project play a crucial role in identifying relevant datasets, texts, knowledge bases/graphs, and ontologies for their domains. They also provide feedback on graph querying, knowledge discovery, and reproducibility requirements. Each pilot will create and update a domain-specific knowledge graph, while demo use cases will be used to test and evaluate the SciLake components for further refinement and improvement.

The SciLake plenary meeting in Barcelona was a productive gathering where the team reviewed their progress and outlined the future plans for the project. Overall, the SciLake project is making significant strides in bridging the gap between domain knowledge and open SKGs, bringing us closer to a more interconnected and efficient scientific community. 
Stay tuned for more updates on the progress of SciLake!


AvantGraph: the Next-Generation Graph Analytics Engine


SCILAKE TECHNICAL COMPONENTS


By Stefania Amodeo

In a webinar for SciLake partners, Nick Yakovets, Assistant Professor at the Department of Mathematics and Computer Science, Information Systems WSK&I at Eindhoven University of Technology (TU/e), introduced AvantGraph, a next-generation knowledge graph analytics engine. Yuanjin Wu and Daan de Graaf, graduate students in Nick's research group, presented a demo of the tool. 

Developed by TU/e researchers, AvantGraph aims to provide a unified execution platform for graph queries, supporting everything from simple questions to complex algorithms. In this blog post, we will delve into the philosophy behind AvantGraph, its query processing pipeline, and its impact on graph analytics.

 

The Philosophy: Questions over Graphs

The fundamental purpose of a database is to answer questions about data. For a graph database like AvantGraph, the focus is on asking questions over graphs. We can categorize these questions by their expressiveness and by the degree to which databases can optimize their execution. Expressiveness refers to the richness and difficulty of the questions being asked, while optimization refers to how easily databases can answer them. Based on this categorization, the range of questions that can be asked over graphs varies in complexity, as shown in the graphic below, from simple local look-ups to general algorithms that introduce iteration:

  • Local look-ups (e.g., the properties of data associated with a full text)
  • Neighborhood look-ups
  • Subgraph isomorphism (matching specific patterns of the graph)
  • Recursive path queries (introducing connectivity)
  • General algorithms (introducing iterations, e.g., PageRank)

Optimization level as a function of questions’ complexity.

AvantGraph aims to cover this full spectrum of questions, allowing users to optimize the execution of their queries and explore the richness of their data. It utilizes cutting-edge technologies to enable efficient processing of very large graphs on personal laptops.
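The spectrum above can be made concrete with a toy example. The plain-Python sketch below is not AvantGraph code; it only illustrates four of the categories (a local look-up, a neighborhood look-up, a recursive path query, and an iterative general algorithm) on an invented four-node graph. Subgraph isomorphism, the pattern matching that Cypher and SPARQL specialize in, is omitted for brevity.

```python
# A tiny in-memory graph: node -> list of outgoing neighbors.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}

# 1-2. Local and neighborhood look-ups: cheap, fully optimizable.
out_degree = len(graph["a"])   # local look-up: a property of one node
neighbors = graph["a"]         # neighborhood look-up: adjacent nodes

# 3. Recursive path query: connectivity needs unbounded recursion.
def reachable(src, dst, seen=None):
    if src == dst:
        return True
    seen = set(seen or ())
    seen.add(src)
    return any(reachable(n, dst, seen) for n in graph[src] if n not in seen)

# 4. General algorithm: PageRank introduces iteration until convergence.
ranks = {n: 1 / len(graph) for n in graph}
for _ in range(50):
    ranks = {n: 0.15 / len(graph)
             + 0.85 * sum(ranks[m] / len(graph[m])
                          for m in graph if n in graph[m])
             for n in graph}

print(out_degree, neighbors, reachable("a", "d"), reachable("a", "c"))
```

Moving down this list, each step is harder for a database to optimize; AvantGraph's goal is to handle all of them in one engine.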

AvantGraph Query Processing Pipeline

AvantGraph query processing pipeline, adapted from DOI:10.14778/3554821.3554878 

AvantGraph employs a standard database pipeline. It supports query languages like Cypher and SPARQL, and it features three additional main components that enable the execution of complex questions, such as general algorithms:

  • the QuickSilver execution engine, a multi-thread execution system allowing for efficient query parallelization and hardware utilization;
  • the Magellan Planner, a query optimizer that returns efficient execution plans tailored to each query, taking into account the recursive and iterative nature of graph queries;
  • the BallPark cardinality estimator, a cost model that determines the best execution plan for different circumstances, optimizing query performance.

In addition, AvantGraph supports secondary storage, utilizing both memory and disk effectively. This allows it to process very large graphs on laptops without requiring excessive amounts of RAM.

Preparations for SciLake Pilots

As part of the SciLake project, AvantGraph is being extended with powerful data analytics capabilities and novel technologies to support research communities in defining graph algorithms.

Why do we need it?

Graph query languages such as Cypher or SPARQL are specifically designed for "subgraph matching". This makes them highly effective when you need to retrieve information such as "get me the neighbors of a specific node" or "find the shortest path between two nodes in the graph". However, these query languages are too limited for complex graph analytics such as PageRank.

Traditional solutions to this issue involve the database vendor providing a library of built-in algorithms that can be applied to the graph. While this works well if the library includes the algorithm needed to solve the problem, it cannot accommodate simple variations or fully custom algorithms.

What AvantGraph offers

AvantGraph introduces Graphalg, a programming language designed specifically for writing graph algorithms. Graphalg is fully integrated into AvantGraph, meaning, for example, that it can be embedded into Cypher queries.

Graphalg is based on linear algebra, which makes its syntax and operations easy to learn. The goal is for Graphalg to be a high-level language that is both user-friendly and efficiently executed by a database. This is achieved by transforming queries and Graphalg programs into a unified representation that can be optimized effectively, enabling optimizations that cross the boundary between query and algorithm and would not otherwise be possible.
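To illustrate the linear-algebra view that Graphalg builds on, here is a sketch of breadth-first reachability expressed as repeated boolean matrix-vector products, the style popularized by GraphBLAS-like systems. This is plain Python standing in for Graphalg, whose actual syntax differs; the graph is invented for the example.

```python
# Adjacency matrix of a 4-node graph (A[i][j] = 1 means edge i -> j).
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

def step(frontier):
    """One BFS step: the boolean vector-matrix product frontier * A."""
    return [any(frontier[i] and A[i][j] for i in range(len(A)))
            for j in range(len(A))]

# Reachability from node 0: iterate until the visited set stops growing.
visited = [True, False, False, False]
frontier = visited[:]
while any(frontier):
    frontier = [f and not v for f, v in zip(step(frontier), visited)]
    visited = [v or f for v, f in zip(visited, frontier)]

print(visited)  # node 3 is unreachable from node 0
```

Because traversal reduces to algebraic operations like this, the optimizer can treat an algorithm and the query that embeds it as one plan.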

AvantGraph supports the client-server model, which is commonly used by most modern database engines, including Postgres, MySQL, Neo4j, Amazon Neptune, Memgraph, and more. This allows AvantGraph databases to be queried through more than just a Command Line Interface.

As of now, AvantGraph databases can be queried from most major programming languages, including through a Python API, and support will be expanded in the future with more algorithms and functionality.

Conclusion

AvantGraph represents a significant advancement in knowledge graph analytics. By addressing the limitations of traditional graph query languages and introducing Graphalg, AvantGraph empowers users to perform complex graph analytics with ease. Its unified execution of simple questions to general algorithms, coupled with its efficient query processing pipeline, makes it a valuable tool for researchers and data scientists. As AvantGraph continues to evolve and gain traction within the research community, we can expect to see exciting advancements in graph analytics and a deeper understanding of complex data relationships.

Learn more

AvantGraph is presented in:

Leeuwen, W.V., Mulder, T., Wall, B.V., Fletcher, G., & Yakovets, N. (2022). AvantGraph Query Processing Engine. Proc. VLDB Endow., 15, 3698-3701.

DOI:10.14778/3554821.3554878

For more information about AvantGraph and its publications, visit https://avantgraph.io/

AvantGraph will be released under an open license soon. To test its functionalities and perform graph queries, check out the docker container available on GitHub at https://github.com/avantlab/avantgraph/.


Defining the Roadmap for a European Cancer Data Space


WORKSHOP


By Stefania Amodeo

SciLake representatives participated in the EOSC4Cancer consultation to define a Roadmap for a European Cancer Data Space. This article recaps the key points from the workshop.

  

EOSC4Cancer, the European-wide foundation to accelerate data-driven cancer research, recently held a face-to-face workshop in Brussels to define a Roadmap for a European Cancer Data Space. SciLake representatives participated in a discussion with around 30 stakeholders from various sectors including research, industry, patient care, survivor groups, and EOSC. The discussion focused on key aspects for a sustainable cancer dataspace, such as access models, governance, data quality, security, and privacy.

The insights collected will be used in the creation of a roadmap, scheduled for publication in early 2025. This roadmap aims to shape the future of the European cancer dataspace, with policy recommendations to the European Commission.

EOSC4Cancer Objectives

EOSC4Cancer is a Europe-wide initiative that aims to accelerate data-driven cancer research. Launched in September 2022, this 30-month project will provide an infrastructure to exploit cancer data. It brings together comprehensive cancer centers, research infrastructures, leading research groups, and major computational infrastructures from across Europe.

The expected outcomes include a platform designed for storing, sharing, accessing, analyzing, and processing cancer research data. This involves interconnecting and ensuring interoperability of relevant datasets, as well as providing scientists with easy access to cancer research data and analysis systems. Additionally, the initiative will contribute to the Horizon Europe EOSC Partnership and other partnerships relevant to cancer research.

A Federated Digital Infrastructure for accelerating Cancer Research

The EOSC4Cancer Roadmap envisions a federated digital infrastructure to accelerate cancer research in the EU. It will rely on existing European and national structures and serve as the digital data infrastructure for the future Virtual European Cancer Research Institute, a platform enabling storage, sharing, access, analysis, and processing of research data. The structure will align with major European efforts for enabling the use and re-use of cancer-related data for research and will accommodate varying levels of maturity across member states. The federated infrastructure will include centralized components and capabilities, remote software execution, and a Cancer Research Commons repository. Each EU Member State is expected to have a National Data Hub for Cancer-related data and other relevant structures, such as national nodes and reference hospitals.

The plan is to develop in stages, first focusing on National Data Hubs, then involving Competence Centers and other participants. The National Data Hubs will reflect the structure of the Digital Hub. They will host a national database for cancer research data, allow the use of Research Environments, provide computing power, and coordinate outreach efforts.

Roundtable Discussions

The workshop held roundtable discussions on four topics: missing data types, missing data sources, requirements for national nodes, and any other element missing from the roadmap.

The first topic focused on identifying missing data types, such as synthetic data, specified clinical data, reference data, social data, patient-generated data, and the epidemiology of survivors. The discussion also emphasized the importance of preparing data for future use, connecting different types of data across various platforms, and ensuring that data comes from reliable sources by implementing quality checks and guidelines for data submission.

The second topic revolved around the data sources, discussing what is missing in the long list of data sources, how trusted they are, and how they can be connected and prioritized. The key sources identified include structural biology data, patient-generated data, demographic data, complete clinical trial datasets, and biobanks.

The third topic concerned the requirements for national nodes. The discussion recognized that these nodes should be flexible and adaptable to the needs of different countries. It was agreed that coordinating existing initiatives, while not necessarily simple, is preferable to creating something new.

Finally, additional elements for the roadmap were identified. These relate to the patient journey, the treatment and product development journey, and non-tangible aspects such as data governance, clinical data usage, and cybersecurity. The roadmap should also consider incentives for scientists, communication strategies, and a plan for the platform's sustainability.

Conclusion

The workshop was a significant step towards defining the Roadmap for a European Cancer Data Space. The insightful discussions and collaborative efforts of the participants identified the requirements for comprehensive cancer data, trusted data sources, and functional national nodes.

Through these collaborative efforts, the EOSC4Cancer initiative is paving the way for a more data-driven and interconnected future in cancer research across Europe.


Amplifying Valuable Research: How?


SCILAKE TECHNICAL COMPONENTS


By Stefania Amodeo

In today's fast-paced scientific landscape, it has become increasingly challenging for researchers to identify valuable articles and conduct meaningful literature reviews. The exponential growth of scientific output, coupled with the pressure to publish, has made the process of discovering impactful research a daunting task.

In a webinar for SciLake partners, Thanasis Vergoulis, Development and Operation Director of OpenAIRE AMKE and scientific associate at IMSI, Athena Research Center, discussed the technologies developed to assist knowledge discovery by leveraging impact indicators as part of SciLake's WP3: Smart Impact-driven Discovery service.

This article recaps the key points from the webinar.

Impact Indicators and knowledge discovery

Scientists heavily rely on existing literature to build their expertise. The first step is to identify valuable articles before reading them. Unfortunately, this process has become increasingly tedious due to the overwhelming volume of scientific output. The increasing number of researchers and the notorious publish-or-perish culture have contributed to this exponential growth, making it difficult and time-consuming to identify truly valuable research in specific areas of interest.

Impact indicators have been widely used to address this challenge. The main idea is to look at how many articles cite a particular article, which serves as an indication of its scientific impact. This is formalized through the citation count indicator.

Thanks to the adoption of Open Science principles, there is now a wealth of citation data available from initiatives like OpenCitations and Crossref. As a result, the available citation data offer adequate coverage to estimate the scientific impact of an article by analyzing its citations. Of course, scientific impact is not always highly correlated with scientific merit, so it is important to remember that an article of great value might not always be popular.

While citation count is a popular impact indicator used in academic search engines like Google Scholar, it has its limitations. Scientific impact is multifaceted, and one indicator alone is not sufficient to measure it. Other pitfalls of citation count that may hinder the discovery of valuable research are its bias against recent articles and the potential for gaming the system through citation malpractice. To mitigate such issues, it is crucial to use indicators that capture a wide range of impact aspects. Additionally, considering indicator semantics and provenance helps protect against improper use and misconceptions.

How it started…

To study this problem, researchers from Athena Research Center (ARC) conducted a comprehensive survey and a series of extensive experiments to explore different ways to calculate impact indicators and rank papers based on them. Four major aspects of scientific impact were identified that should be combined:

  • Traditional impact, estimated with citation counts
  • Influence, estimated using the PageRank algorithm, which considers the impact of an article even if it is not directly cited
  • Popularity, estimated using the AttRank algorithm, which considers the fact that recent papers have not had sufficient time to accumulate citations
  • Impulse, estimated using 3-year citation counts to capture how quickly a paper receives attention after its publication
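Two of these aspects can be sketched on toy data: citation count (traditional impact) and impulse (citations arriving within 3 years of publication). The papers and citation years below are invented for illustration; influence and popularity require the full citation network and the PageRank/AttRank algorithms, so they are omitted here.

```python
from collections import Counter

# Invented toy data: publication years and (citing year, cited paper) pairs.
pub_year = {"p1": 2015, "p2": 2020}
citations = [(2016, "p1"), (2017, "p1"), (2022, "p1"),
             (2021, "p2"), (2022, "p2")]

# Traditional impact: total citation count.
citation_count = Counter(cited for _, cited in citations)

# Impulse: only citations received within 3 years of publication.
impulse = Counter(cited for year, cited in citations
                  if year - pub_year[cited] <= 3)

print(citation_count["p1"], impulse["p1"])  # 3 citations overall, 2 early
print(citation_count["p2"], impulse["p2"])  # all of p2's citations are early
```

The contrast between the two counters is the point: p1 has more citations overall, but p2 attracted attention faster relative to its age.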

Building upon these aspects, ARC researchers developed a workflow to calculate these indicators for a vast number of research products and made them openly available to the community, enabling third-party services to be built on top of them.

…how it is going

The workflow behind the BIP! database and services

The current workflow developed by ARC starts with the OpenAIRE Graph, where citations are collected as a first proxy of impact, and a citation network is built from this information. ARC has developed an open-source Spark-based library called BIP! Ranker, which calculates indicators for approximately 150 million research products. While computationally intensive, the calculations can be performed within minutes or hours on a computer cluster, depending on the indicator.

The resulting indicators are available on Zenodo as the BIP! DB dataset, and advanced services, such as BIP! Finder, BIP! Scholar, and BIP! Readings, are provided based on these indicators. Finally, the indicators are integrated back into the OpenAIRE Graph, ensuring their inclusion in any downloaded snapshot of the graph. In addition, the workflow classifies research products based on their ranking and can provide, for example, the results within certain percentage thresholds (e.g., being in the top 1% of the whole dataset or of a particular topic).

During the calculation of the indicators, various checks take place. For instance, to prevent the duplication of citations, it is ensured that multiple versions of the same article, such as pre-prints and published versions, are not counted twice. Additionally, there are plans to eliminate self-citations in the future and to offer the option of considering only citations from peer-reviewed articles.
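The percentile classes mentioned above can be sketched in a few lines. The thresholds and data below are illustrative only, not the official BIP! class boundaries.

```python
def percentile_class(rank, total):
    """Map a 1-based rank (after sorting by an indicator) to a class label."""
    pct = rank / total
    if pct <= 0.01:
        return "top 1%"
    if pct <= 0.10:
        return "top 10%"
    return "rest"

# Toy indicator values for 200 papers, then rank and classify them.
scores = {f"paper{i}": 1000 - i for i in range(200)}
ranked = sorted(scores, key=scores.get, reverse=True)
classes = {p: percentile_class(i + 1, len(ranked)) for i, p in enumerate(ranked)}

print(classes["paper0"], classes["paper10"], classes["paper150"])
```

In the real workflow the same classification can be computed per topic rather than over the whole dataset, which is what makes "top 1% of a particular topic" possible.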

So, what can we use?

BIP! Finder: the service that improves literature search through impact-based ranking

In the BIP! Finder interface, users can perform keyword-based searches and rank results based on different aspects of impact (e.g., popularity or influence). This allows users to customize the order of the results. Each result also displays the class that each publication has according to the four main impact indicators available through the interface. The service also provides insight into how a paper is ranked among others in a specific topic. This is particularly useful for highly specialized papers, which would be unlikely to rank high in a large database.

Preparations for SciLake pilots

The BIP! services bundle now includes the BIP! spaces service, which allows building domain-specific, tailored BIP! Finder replicas. These spaces will mainly serve as demonstrators for the pilots of the SciLake project. The service will provide knowledge discovery functionalities based on impact indicators, incorporating information from the domain-specific knowledge graphs that the pilots are building.

What each pilot gets:

  • a preset in the search conditions, such as the preferred way to rank the results,
  • query expansions with additional keywords based on domain-specific synonyms (e.g., synonyms in gene names in cancer research),
  • query results including domain-specific annotations based on pilots' scientific knowledge graphs.
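The query-expansion bullet above can be sketched as a simple synonym lookup. The synonym table is invented for illustration; the real pilots draw synonyms from their own domain-specific knowledge graphs.

```python
# Hypothetical domain synonym table (illustrative entries only).
SYNONYMS = {
    "tp53": {"p53", "tumor protein p53"},
    "pbm": {"photobiomodulation"},
}

def expand_query(terms):
    """Return the original terms plus any domain-specific synonyms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(sorted(SYNONYMS.get(term.lower(), ())))
    return expanded

print(expand_query(["TP53", "leukemia"]))
```

The expanded term list is then sent to the search backend, so a query for a gene symbol also matches papers that spell the name out in full.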

Future extensions:

  • support for annotating publications to extend the domain-specific SKGs:
    • enabling users to add connections of publications and other objects to domain-specific entities and include these relations into their SKG,
  • additional indicators,
  • support for domain-specific highlights:
    • flags for collections of papers that are important in a specific community,
  • topic summarization & evolution visualisation features.

Conclusions

By leveraging impact indicators, researchers can navigate the vast scientific landscape more effectively, discover valuable research, and make informed decisions in their respective fields. This paves the way for accelerating knowledge discovery and amplifying the impact of valuable research.

Stay tuned for more updates on how SciLake is amplifying valuable research!
