Scientific data modelling and aggregation of marine data

Scientific data integration and aggregation is a challenging, time-consuming and error-prone process. Marine data (e.g. biodiversity data, data about fisheries and stocks, etc.) is scattered across heterogeneous databases that have no standard structure, were not intended to interoperate with one another (database silos), and follow heterogeneous guidelines for how they are populated.

The main challenge in BlueBRIDGE was to integrate and aggregate marine data coming from heterogeneous data sources in a semantically rich way in order to build the Global Record of Stocks and Fisheries (GRSF) VRE.

Although the above processes were carried out in the marine domain in the context of the BlueBRIDGE project, these practices can be applied to any other domain.

The BlueBRIDGE best practice

 

To map, integrate and aggregate marine scientific data, BlueBRIDGE adopted a six-step process:

  1. Selection of Competency Queries: The competency queries express the scientific questions that the final system (knowledge base, application, etc.) should be able to answer. Competency queries are usually referred to as query requirements. They are formulated by potential users with adequate knowledge of the domain, assisted by the semantic model developers, and are critical to the success of the practice. The competency queries affect the selection of the semantic model and its extension, the mappings, the design of the services and even the final GUI. An example competency query is “Give me all the stocks for the species Thunnus albacares that are harvested by fisheries in the Atlantic Ocean” (a SPARQL sketch of this query is given after this list).
  2. Selection / extension of the Semantic Model: The second step of the practice is to select or extend a semantic model that is adequate for the domain and has the expressive power to support the competency queries. This model acts as the semantic backbone, and all the data will eventually be transformed according to its constraints. In the context of BlueBRIDGE, the MarineTLO ontology was adopted, which a) provides consistent abstractions or specifications of the concepts included in all data models or ontologies of the marine data sources and b) contains the necessary properties to make GRSF a coherent source of facts relating observational data to the respective spatiotemporal context.
  3. Exploit mapping technologies: The step that follows the selection of the semantic model is the creation and application of the mappings between the source schemata and the semantic model. In the context of BlueBRIDGE, the X3ML language was used to express the mappings. The X3ML framework offers a plethora of applications; an indicative one is the 3M Editor, which offers an interactive way of defining mappings and makes the task relatively simple, even for users without an IT background.
  4. Transform Data into a common (semantic) format: A basic step for integrating data from heterogeneous data sources is to transform them into a common format. This requires the mappings defined during the previous step. More specifically, the X3ML engine was used: it receives the source data and the formulated mappings as input and transforms the data into RDF in accordance with the semantic model (an illustrative sketch of this transformation is given after this list).
  5. Ingest data into a semantic warehouse: The next step of the process is the actual creation of the aggregated warehouse. In the context of BlueBRIDGE, MatWare was exploited: a tool that automatically creates semantic warehouses by importing RDF files into a semantic triplestore (see also the sketch after this list).
  6. Assess the connectivity of the semantic warehouse: After ingesting all the data into the semantic warehouse, it is important to inspect how connected the integrated dataset is. Connectivity is assessed in terms of metrics, such as the number of common URIs and literals, the average degree, etc., which measure how connected the resources of the semantic warehouse are (a sketch of two such metrics is given after this list). These metrics can be computed automatically after the construction (or refresh) of the semantic warehouse, and they also enable errors and redundant data sources to be spotted.
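
As an illustration of the first step, the example competency query above can be rendered in SPARQL and evaluated over the integrated dataset. The sketch below (in Python, using rdflib) is a minimal, hypothetical rendering: the namespace and the class/property names (Stock, hasSpecies, isHarvestedBy, occursIn), as well as the file name, are placeholders rather than the actual MarineTLO/GRSF vocabulary.

    # Minimal sketch: run the example competency query with rdflib.
    # All URIs below are hypothetical placeholders, not the real GRSF vocabulary.
    from rdflib import Graph

    COMPETENCY_QUERY = """
    PREFIX grsf: <http://example.org/grsf/>

    SELECT ?stock
    WHERE {
      ?stock a grsf:Stock ;
             grsf:hasSpecies    grsf:Thunnus_albacares ;
             grsf:isHarvestedBy ?fishery .
      ?fishery grsf:occursIn grsf:Atlantic_Ocean .
    }
    """

    g = Graph()
    g.parse("grsf_warehouse.ttl", format="turtle")  # assumed local RDF dump of the warehouse

    for row in g.query(COMPETENCY_QUERY):
        print(row.stock)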
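
Steps 4 and 5 can be pictured with the following sketch. It is not the X3ML engine or MatWare themselves; it only illustrates the idea of lifting one tabular source record into RDF (here with rdflib) and pushing the result to a triplestore through the standard SPARQL 1.1 Graph Store HTTP Protocol. The namespace, property names, record fields and endpoint URL are all assumptions.

    # Illustrative sketch of steps 4-5 (not the actual X3ML/MatWare tooling):
    # lift one tabular record into RDF and ingest it into a triplestore.
    import requests
    from rdflib import Graph, Literal, Namespace, RDF

    GRSF = Namespace("http://example.org/grsf/")  # hypothetical target namespace

    # A source record as it might come from one of the original databases
    record = {"stock_id": "YFT-ATL", "species": "Thunnus albacares", "area": "Atlantic Ocean"}

    g = Graph()
    stock = GRSF[record["stock_id"]]
    g.add((stock, RDF.type, GRSF.Stock))
    g.add((stock, GRSF.hasSpeciesName, Literal(record["species"])))
    g.add((stock, GRSF.hasAssessmentArea, Literal(record["area"])))

    # Common semantic format: serialise the triples as Turtle
    g.serialize(destination="stock_record.ttl", format="turtle")

    # Ingest into a named graph of a triplestore via the SPARQL 1.1
    # Graph Store HTTP Protocol (the endpoint URL is a placeholder)
    with open("stock_record.ttl", "rb") as f:
        requests.post(
            "http://localhost:3030/grsf/data",
            params={"graph": "http://example.org/grsf/warehouse"},
            data=f,
            headers={"Content-Type": "text/turtle"},
        )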
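
For step 6, two of the simplest connectivity metrics mentioned above (the number of common URIs between two sources, and the average degree of the merged graph) could be computed along the following lines. The file names are placeholders, and the connectivity metrics suite actually used in BlueBRIDGE is richer than this sketch.

    # Sketch of two simple connectivity metrics: common URIs between two
    # source datasets and the average node degree of the merged warehouse.
    from collections import Counter
    from rdflib import Graph, URIRef

    def uris(graph):
        """All URIs appearing as subject or object of some triple."""
        return {t for s, _, o in graph for t in (s, o) if isinstance(t, URIRef)}

    src_a, src_b = Graph(), Graph()
    src_a.parse("firms.ttl", format="turtle")        # placeholder file names
    src_b.parse("fishsource.ttl", format="turtle")

    print("Common URIs between the two sources:", len(uris(src_a) & uris(src_b)))

    warehouse = src_a + src_b                        # union of the two graphs
    degree = Counter()
    for s, _, o in warehouse:
        degree[s] += 1
        if isinstance(o, URIRef):
            degree[o] += 1
    print("Average degree: %.2f" % (sum(degree.values()) / len(degree)))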

The “key” features of the aforementioned steps that guarantee the efficient integration and aggregation of heterogeneous marine data are the following:

  • The exploitation of mappings, which separates the work of the domain experts (who actually define the mappings) from that of the IT people (who define how the information will be transformed). This decoupling of roles eliminates the bottlenecks that usually appear when integrating data.
  • The assessment of the connectivity of the integrated data, which is of paramount importance: it guarantees that the data from the original sources are properly connected and also helps to spot errors (e.g. redundant or irrelevant data sources).
  • The automation of the entire process, which allows the integrated dataset to be reconstructed from scratch, or particular parts of it to be refreshed, by relying on innovative technical components (i.e. MatWare).

 

Why this is considered a best practice

Best Practice Analysis

Validation

The output of the above-described marine data modelling and aggregation best practice is the Global Record of Stocks and Fisheries VRE (GRSF VRE). The GRSF VRE, and therefore the full modelling and aggregation process, has been validated by FAO (FIRMS database), the University of Washington (RAM database) and SFP (FishSource database). The feedback was positive, as the users (experts) reported that:

1)     they are able to answer queries that they could not answer before;

2)     they are able to browse through and discover complete sets of data about stocks and fisheries, something that was not possible before;

3)     the data is of better quality, providing them with “hints” as to how to improve their own databases.

Innovation

The best practice contributed to delivering the (first) Global Record of Stocks and Fisheries. Scientists adopted it as a methodology for iteratively improving the final result, with the overall aim of delivering a high-quality product. In this respect, when a draft version of the registry became available, scientists adapted the competency queries to alleviate issues and capture more concepts, which in turn triggered all the subsequent steps towards re-constructing the registry.

 

As a result, the final version of the registry can be considered a knowledge base containing a coherent set of facts on stocks and fisheries that can be used to carry out advanced stock and fishery assessment activities.

Success Factors

The main factors to guarantee that the practice was successful are the following:

  1. The practice has been adopted for constructing several versions of the Global Record of Stocks and Fisheries.
  2. It has been presented and approved in three different technical working groups organized by FIRMS (the FAO-coordinated Fisheries and Resources Monitoring System), with the participation of different marine stakeholders.
  3. It has been published and presented at a scientific conference.

Sustainability

Curating and maintaining the integrated dataset is an activity that never ceases. The semantic warehouse relies on data coming from different data sources and, as such, it might require adaptations as the source data change. These adaptations concern changes in the mappings, as well as changes in the algorithms that integrate the data.

 

In addition, competency queries might change. As new user requirements emerge, mappings and algorithms that integrate data need to be updated to capture them.

 

The aforementioned activities should be carried out by a person familiar with mapping technologies and software. Although we cannot estimate the potential effort and cost of carrying this out (clearly this depends on the complexity and the number of new requests), there is confidence that the benefits to be obtained will be great.

Replicability and/or up-scaling

The proposed practice can be used by:

a)     Institutions that perform data mapping or data aggregation activities;

b)     Companies that exploit integrated data for profit (e.g. companies exploiting marine resources);

c)     Non-profit organizations that use scientific data for decision making and for the prediction of potentially disastrous scenarios (e.g. FAO predicting the depletion of a marine resource).

 

The organisations involved in the context of the BlueBRIDGE project were:

  • FORTH, who led the technical activities for the design and implementation of the services integrating marine data.
  • FAO, who provided data, validated the results and assisted in delineating the policies.
  • Sustainable Fisheries Partnership, who provided FishSource data and validated the results.
  • CNR, who assisted with the front-end development activities.
  • RAM Legacy Stock Assessment Database, who provided data.

Although the best practice was tailored to integrating data for the marine domain, it should be clarified that it can be applied to any other domain with little or no modification (i.e. only the semantic model would need to be updated).

 

Lessons Learnt

The collation of information is both difficult and time-consuming, as the information is scattered across different databases and is modelled using different formats and standards. The proposed best practice defines the steps required to integrate data from different sources, allowing them to be exploited to respond to queries that could not be answered before (as they require combining knowledge from heterogeneous data sources), and to assess the connectivity of the resulting semantic warehouse. Thanks to this practice, scientists can easily discover large and complete datasets to assist them in their research; investors have central access to rich information to support their decisions to invest in a specific field (e.g. harvesting tuna in the North Atlantic); industry can acquire and analyse data to predict the sustainability of their businesses and improve their profits; and ordinary users can obtain valid information in their domain of interest.

 

The main lessons learned during the application of the best practice in the context of BlueBRIDGE were:

  1. Collaboration with the providers of the data sources is critical. Without their support in understanding the semantics, their clarifications on the data and their validation of the mappings, it is extremely difficult to attain a successful aggregation.
  2. The use of a central semantic model is mandatory to overcome the heterogeneity that exists among the schemata of the original data sources. Furthermore, it removes the complexity of updating technical components when the schemata of the original sources change (in this case only the mappings need to be updated, and all the technical components remain unchanged).
  3. Collection of the query requirements (the competency queries) should always be the first step. A wrong estimation of the querying capabilities of the resulting system can lead to a completely wrong design incapable of serving the query needs of the marine community.