DBpedia Development Wiki devilopment bible

This page was included from https://github.com/dbpedia/databus-maven-plugin.

Discuss via Slack #releases: dbpedia.slack.org

NOTE: A manual is being written at https://github.com/dbpedia/databus-maven-plugin/wiki/User-Manual-v1.3 and this page is being merged into it. The information here is outdated.

Databus Maven Plugin

Aligning data and software lifecycle with Maven

The plugin was developed to use the features of the Maven software build automation tool for data releases and metadata generation. The tool has the following features:

  • once configured properly (1-3 hours), data can be released and re-released systematically in minutes
  • auto-detect RDF formats with syntax validation
  • RDF is NOT a requirement: any data can be released (binary, csv, xml), but the tool offers more features for RDF
  • auto-detect compression variant
  • private key signature, sha256sum and provenance (WebID) generation
  • generation of metadata compatible with:
    • RSS feeds
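
As a rough illustration of two of the features listed above (compression auto-detection and sha256sum generation), here is a minimal Python sketch. It is not the plugin's implementation: the function names `detect_compression` and `sha256sum` are hypothetical, and the detection only inspects the leading magic bytes of a few common formats.

```python
import hashlib

# Leading "magic bytes" of a few common compression formats.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def detect_compression(head: bytes) -> str:
    """Guess the compression variant from the first bytes of a file."""
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "none"


def sha256sum(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading only the first few bytes keeps detection cheap even for very large dumps, and it does not depend on the file extension being correct.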

Metadata Standards

DBpedia’s DataID fulfills 31 of 35 Best Practices from the W3C Data on the Web Best Practices Working Group, cf. implementation report


We are planning the following features:

  • DCAT and DCAT-AP interoperability
  • FAIR Data principles
  • automatic generation of H2020 EU Data Management Plan Deliverables
    • feature exists, but is not yet integrated:
    • https://wiki.dbpedia.org/use-cases/data-management-plan-extension-dataid
  • automatic upload of metadata to other repositories:
    • http://lod-cloud.net
    • CKAN
    • RE3

Did we forget something? Suggest more interoperability in the issue tracker: https://github.com/dbpedia/databus-maven-plugin/issues


Bundle, dataset, distribution

In this section, we describe the basic terminology and how it relates to Maven.


  • Dataset - a set of files that share a common description. The fact that they can be described together shows an inherent coherence and that they belong together. Apart from this criterion, how datasets are defined is quite arbitrary, so the approach is pragmatic: there is no need to duplicate documentation (several datasets with the same description) or to subspecialise it (this dataset is about X, but some files are about Y).
    • the databus maven plugin requires that all files of a dataset start with the dataset name
  • Distribution - one file of a dataset
  • Format variance - a dataset can have files in different formats. Format variance is abstracted well: different distributions are created with the same metadata except for the format field.
  • Compression variance - compression is handled separately from format, i.e. the same format can be compressed in different ways
  • Content variance - besides format variance, a dataset can have a certain degree of content variance. This normally determines how the dataset is distributed over the files. The easiest example is DBpedia, where each dataset contains all Wikipedia languages, so in this case the content variance is the language. The data could also be differentiated by type, e.g. a company dataset that produces one distribution for each organisation form (non-profit, company, etc.). As a guideline, content variance can be chosen arbitrarily; the only criterion is whether there are use cases where users would want only part of the dataset. Otherwise, merging everything into one big file is fine.
  • Group - a collection of datasets released together. This is also a pragmatic definition. The framework will not work well if you combine datasets with different release cycles (e.g. some daily, some monthly) or with varying metadata (e.g. different publishers or versioning systems) in the same bundle.
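
To make the terminology concrete, the following Python sketch splits a distribution file name into dataset, content variants, format and compression. The naming layout used here (`<dataset>_<variant>.<format>.<compression>`) is an assumption for illustration only; the only rule stated above is that file names must start with the dataset name. `Distribution`, `parse_distribution` and `COMPRESSIONS` are hypothetical names.

```python
from dataclasses import dataclass

# Extensions treated as compression variants (assumed list, not exhaustive).
COMPRESSIONS = {"gz", "bz2", "xz", "zip"}


@dataclass
class Distribution:
    dataset: str
    content_variants: list
    format: str
    compression: str


def parse_distribution(filename: str, dataset: str) -> Distribution:
    """Split a file name under the assumed <dataset>_<variant>.<format>.<compression> layout."""
    if not filename.startswith(dataset):
        raise ValueError(f"{filename!r} does not start with dataset name {dataset!r}")
    rest = filename[len(dataset):]
    stem, *extensions = rest.split(".")
    # Content variants are the underscore-separated parts after the dataset name.
    variants = [part for part in stem.split("_") if part]
    compression = "none"
    if extensions and extensions[-1] in COMPRESSIONS:
        compression = extensions.pop()
    file_format = ".".join(extensions) if extensions else "unknown"
    return Distribution(dataset, variants, file_format, compression)


print(parse_distribution("labels_en.ttl.bz2", "labels"))
```

Under this assumed layout, `labels_en.ttl.bz2` and `labels_en.ttl.gz` differ only in compression variance, while `labels_en.ttl` and `labels_de.ttl` differ only in content variance.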

Relation to Maven

Maven was established to automate software builds and release them (mostly Java). A major outcome of the ALIGNED project (http://aligned-project.eu/) was to establish which parts of data releases can be captured by Maven. Here is a practical summary:

Maven uses a Parent POM (Project Object Model) to define a software project. The POM is saved in a file called pom.xml. Each project can have multiple modules where the code resides. These modules refer to the parent POM and inherit any values unless they are overridden. While in software the programming language defines a complex structure that has to be followed, in data everything is arbitrary except for the concrete file, which is a clearly defined thing. Hence the model imposed for the databus is simpler than for software:

  • Bundle relates to the Parent POM and passes its metadata on to the modules/datasets
  • Datasets are modules and receive their metadata from the bundle/parent POM (and can extend or override it)
  • Distributions are the files of the dataset and are normally stored in src/main/databus/${version}/ for each module
  • Each dataset/module has its own artifactId; the distributions/files must start with the artifactId
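
The layout rule above can be checked before a release with a few lines of Python. This is a sketch under the assumption that distributions sit directly in src/main/databus/<version>/ of the module; `misnamed_distributions` is a hypothetical helper and not part of the plugin.

```python
from pathlib import Path


def misnamed_distributions(module_dir: str, artifact_id: str, version: str) -> list:
    """List files in src/main/databus/<version>/ whose names do not start with the artifactId."""
    version_dir = Path(module_dir) / "src" / "main" / "databus" / version
    return sorted(
        path.name
        for path in version_dir.iterdir()
        if path.is_file() and not path.name.startswith(artifact_id)
    )
```

An empty result means every file in the version directory follows the prefix rule for that module.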


Changes in software can be tracked very well, and versions can be assigned manually. Data behaves two-fold: schematic information, e.g. schema definitions, taxonomies and ontologies, can be versioned like software. The data itself follows the Pareto principle: the first 80% needs 20% of the effort, the last 20% needs 80%. Fixing the last error in data is extremely expensive. Hence, we recommend using a time-based version, i.e. YEAR.MONTH.DAY in the format YYYY.MM.DD (alphabetically sortable). Another possibility is to align the version number with either:

  1. the software version used to create it (as a consequence the software version needs to be incremented for each data release)
  2. the ontology version if and only if the ontology is contained in the bundle and versioned like software
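
The recommended time-based version can be generated as in the Python sketch below, which also shows why the zero-padded YYYY.MM.DD form sorts correctly as plain text. `time_based_version` is a hypothetical helper name and the dates are made up.

```python
from datetime import date


def time_based_version(day: date) -> str:
    """Format a date as a zero-padded YYYY.MM.DD version string."""
    return day.strftime("%Y.%m.%d")


releases = [date(2018, 12, 1), date(2018, 2, 15), date(2019, 1, 3)]
versions = [time_based_version(day) for day in releases]

# Plain string sorting yields chronological order because every field is zero-padded.
print(sorted(versions))
```

Without zero padding, a version such as 2018.12.1 would sort before 2018.2.15, which is why the fixed-width format matters.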



The software is licensed under the AGPL, with copyleft intended. We expect you to make your best effort to commit upstream to improve this tool, or at least to make your extensions available again. Any contribution will be merged under the copyright of the DBpedia Association.

Development rules

  • configuration values taken from Maven are configured in Properties.scala; use its ‘sub-trait’ Locations.scala to derive filesystem locations from these, and Parameters.scala to compute all other values derived from the original Maven properties (refactoring into this separation is not yet complete, but please heed these guidelines for additional configuration-derived fields nonetheless)
  • Datafile.scala is a quasi decorator for files, use getInputStream to open any file
  • Use the issue tracker and use branches instead of forks (we can give access); we will merge with master
  • Document options in the archetype pom and here


Download from http://databus.dbpedia.org:8081/repository/ fails, no dependency information available

Note: this section can be removed after completion of https://github.com/dbpedia/databus-maven-plugin/issues/12

Possible reason: we have installed a dev Archiva for now. Depending on your organisation’s network configuration, code might only be accepted from Maven Central and local/allowed Maven repositories.

  • [WARNING] The POM for org.dbpedia.databus:databus-maven-plugin:jar:1.0-SNAPSHOT is missing, no dependency information available
  • Could not resolve dependencies for project org.dbpedia.databus:databus-maven-plugin:maven-plugin:1.0-SNAPSHOT: Failure to find org.dbpedia.databus:databus-shared-lib:jar:0.1.4

This can potentially be fixed by installing the shared-lib locally:

  • Download Jar: http://databus.dbpedia.org:8081/#artifact-details-download-content/org.dbpedia.databus/databus-shared-lib/0.1.4
  • Install dependency: https://maven.apache.org/guides/mini/guide-3rd-party-jars-local.html
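
For the install step, the linked Maven guide uses the `install:install-file` goal. The Python sketch below assembles that command for the shared-lib coordinates that appear in the error message above. The jar file name and location are assumptions (use wherever you saved the download), and `install_file_command` is a hypothetical helper.

```python
import shlex


def install_file_command(jar_path: str) -> list:
    """Build the mvn install:install-file call for databus-shared-lib 0.1.4."""
    return [
        "mvn", "install:install-file",
        f"-Dfile={jar_path}",
        "-DgroupId=org.dbpedia.databus",
        "-DartifactId=databus-shared-lib",
        "-Dversion=0.1.4",
        "-Dpackaging=jar",
    ]


# Print the command for copy and paste; to execute it instead,
# pass the list to subprocess.run(..., check=True).
# The jar path below is an assumed download location.
print(shlex.join(install_file_command("databus-shared-lib-0.1.4.jar")))
```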

Then clone the repo and run mvn install, which will install the databus-maven-plugin locally.

databus plugin goals are not found after installing the plugin via sources (mvn install)

[ERROR] Could not find goal 'metadata' in plugin org.dbpedia.databus:databus-maven-plugin:1.1-SNAPSHOT among available goals -> [Help 1]
org.apache.maven.plugin.MojoNotFoundException: Could not find goal 'metadata' in plugin org.dbpedia.databus:databus-maven-plugin:1.1-SNAPSHOT among available goals 

Try wiping your m2 (the local Maven repository): make a copy of it, delete the original, and then build again.

BUILD FAILURE, no mojo-descriptors found (when using mvn install to install the databus-maven-plugin)

This is most likely caused by using an old Maven version (observed in version 3.0.5). A workaround would be replacing:




in databus-maven-plugin/pom.xml

UTF-8 - Encoding Errors in the produced data

On Unix, run grep "LC_ALL" .* in your /root/ directory and make sure that

.bash_profile:export LC_ALL=en_US.UTF-8
.bashrc:export LC_ALL=en_US.UTF-8

is set.
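
Besides checking the dotfiles, the locale that is actually active in the current shell can be inspected. The Python sketch below tests whether the value of LC_ALL names a UTF-8 codeset; `is_utf8_locale` is a hypothetical helper, and it only checks the variable, not whether the locale is installed on the system.

```python
import os


def is_utf8_locale(value) -> bool:
    """Return True if a locale string such as 'en_US.UTF-8' names a UTF-8 codeset."""
    if not value:
        return False
    # Locale strings look like language_TERRITORY.codeset@modifier.
    codeset = value.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


print("LC_ALL is UTF-8:", is_utf8_locale(os.environ.get("LC_ALL")))
```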