DBpedia Development Wiki

This page was included from https://github.com/dbpedia/databus-maven-plugin.

Discuss via Slack #releases: dbpedia.slack.org

Databus Maven Plugin

Aligning data and software lifecycle with Maven

The plugin was developed to use the features of the Maven software build automation tool for data releases and metadata generation. The tool has the following features:

  • once configured properly (1-3 hours), data can be released and re-released systematically in minutes
  • auto-detect RDF formats with syntax validation
  • RDF is NOT a requirement; any data can be released (binary, CSV, XML), but the tool has more features for RDF
  • auto-detect compression variant
  • private key signature, sha256sum and provenance (WebID) generation
  • generation of metadata compatible with:
    • RSS feeds

Metadata Standards

DBpedia’s DataID fulfills 31 of 35 best practices of the W3C Data on the Web Best Practices Working Group, cf. the implementation report.

Roadmap

We are planning the following features:

  • DCAT and DCAT-AP interoperability
  • FAIR Data principles
  • automatic generation of H2020 EU Data Management Plan Deliverables
    • feature exists, but is not yet integrated:
    • https://wiki.dbpedia.org/use-cases/data-management-plan-extension-dataid
  • automatic upload of metadata to other repositories:
    • http://lod-cloud.net
    • CKAN
    • RE3

Did we forget something? Suggest more interoperability targets in the issue tracker: https://github.com/dbpedia/databus-maven-plugin/issues

Bundle, dataset, distribution

In this section, we describe the basic terminology and how the terms relate to Maven.

Terminology

  • Dataset - a dataset is a collection of files with a common description. The fact that they can be described together shows an inherent coherence, i.e. that they belong together. Beyond this criterion, how datasets are delimited is quite arbitrary, so we take a pragmatic approach: there is no need to duplicate documentation, i.e. to have several datasets with the same description, or to subspecialise, i.e. "this dataset is about X, but some files are about Y"
    • the databus maven plugin requires that all files of a dataset start with the dataset name
  • Distribution - one file of a dataset
  • Formatvariance - a dataset can have files in different formats. Format variance is abstracted quite well: different distributions are created with the same metadata except for the format field
  • Compression variance - compression is handled separately from format, i.e. the same format can be compressed in different ways
  • Contentvariance of a dataset - besides format variance, a dataset can have a certain degree of content variance, which normally determines how the dataset is distributed over its files. The easiest example is DBpedia, where each dataset covers all Wikipedia languages, so the content variance is the language. The data could also be differentiated by type, e.g. a company dataset that produces a distribution for each organisation form (non-profit, company, etc.). As a guideline, content variance can be chosen arbitrarily; the only criterion is whether there are use cases where users would want only part of the dataset, otherwise merging everything into one big file is fine.
  • Bundle - a collection of datasets released together, also a pragmatic definition. The framework will not work well if you combine datasets with different release cycles (e.g. some daily, some monthly) or with metadata variance, such as different publishers or versioning systems, in the same bundle.

Relation to Maven

Maven was established to automate software builds and releases (mostly for Java). A major outcome of the ALIGNED project (http://aligned-project.eu/) was to establish which parts of data releases can be captured by Maven. Here is a practical summary:

Maven uses a parent POM (Project Object Model) to define a software project. The POM is saved in a file called pom.xml. Each project can have multiple modules where the code resides. These modules refer to the parent pom and inherit its values unless they are overridden. While in software the programming language imposes a complex structure that has to be followed, in data there is no inherent structure except for the concrete file, which is the only clearly defined unit. Hence the model imposed for the databus is simpler than for software:

  • Bundle relates to the parent POM, which passes its metadata down to the modules/datasets (see the sketch after this list)
  • Datasets are modules and receive their metadata from the bundle/parent pom (and can extend or override it)
  • Distributions are the files of the dataset and are normally stored in src/main/databus/${version}/ for each module
  • Each dataset/module has its own artifactid, the distributions/files must start with the artifactid
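
To make the relation concrete, here is a minimal sketch of the two pom levels, using the placeholder coordinates from the archetype example later in this document (org.example.data, animals, mammals); element values are illustrative, not prescriptive:

<!-- parent pom.xml of the bundle (sketch with placeholder values) -->
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example.data</groupId>
    <artifactId>animals</artifactId>
    <version>2018.08.15</version>
    <packaging>pom</packaging>
    <modules>
        <module>mammals</module>
    </modules>
</project>

<!-- pom.xml of a dataset module; it inherits groupId and version from the parent -->
<project>
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.example.data</groupId>
        <artifactId>animals</artifactId>
        <version>2018.08.15</version>
    </parent>
    <artifactId>mammals</artifactId>
</project>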

Versioning

Changes in software can be tracked very well, and versions can be assigned manually. Data behaves in a two-fold way: schematic information, e.g. schema definitions, taxonomies and ontologies, can be versioned like software. The data itself follows the Pareto principle: the first 80% need 20% of the effort, the last 20% need 80%. Fixing the last error in data is extremely expensive. Hence, we recommend using a time-based version, i.e. YEAR.MONTH.DAY in the format YYYY.MM.DD (alphabetically sortable; see the example after the list below). Another possibility is to align the version number with either:

  1. the software version used to create it (as a consequence the software version needs to be incremented for each data release)
  2. the ontology version if and only if the ontology is contained in the bundle and versioned like software
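
For example, a time-based version can be derived from the current date and applied with the versions plugin (the same command is shown in the GitHub setup section below):

# a sketch: derive a YYYY.MM.DD version from the current date and set it
VERSION=$(date +%Y.%m.%d)
mvn versions:set -DnewVersion=$VERSION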

Files & folders

By default, files and folders are laid out as follows:

${bundle}/ 
+-- pom.xml (parent pom with bundle metadata and ${version}, all artifactids are listed as `<modules>` )
+-- ${artifactid1}/ (module with artifactid as datasetname)
|   +-- pom.xml (dataset metadata)
|   +-- src/main/databus/${version}/
|   |   *-- ${artifactid1}_cvar1.nt (distribution, content variance 1, formatvariance nt, compressionvariant none)
|   |   *-- ${artifactid1}_cvar1.csv (distribution, content variance 1, formatvariance csv, compressionvariant none)
|   |   *-- ${artifactid1}_cvar1.csv.bz2 (distribution, content variance 1, formatvariance csv, compressionvariant bzip)
|   |   *-- ${artifactid1}_cvar2.ttl (distribution, content variance 2, formatvariance ttl, compressionvariant none)
|   |   *-- ${artifactid1}_cvar2.csv (distribution, content variance 2, formatvariance csv, compressionvariant none)

An example is given in the example folder of this repo.

(Important) File input path

The file input path defaults to src/main/databus/${version}/, relative to the module. This path can be configured in the parent pom.xml using the <databus.dataInputDirectory> parameter; absolute paths are allowed.
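
A sketch of overriding it in the parent pom.xml; we assume here that the parameter can be set via the <properties> section (mirroring the -D property usage shown in the Usage section), and the absolute path is a made-up example:

<properties>
    <!-- hypothetical example path; the default is src/main/databus/${version}/ relative to the module -->
    <databus.dataInputDirectory>/data/releases/input</databus.dataInputDirectory>
</properties>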

(Important) File copying

During a normal Maven build, the code is typically duplicated 6-7 times: for each module, the code is first copied and compiled into the target/classes folder, then copied and compressed again into a .jar file, and all of this is then copied again. The databus-maven-plugin behaves differently:

  • the target/databus folder is used to assemble metadata (which is not large)
  • mvn clean deletes the target folder and will therefore only delete the generated metadata
  • no input data is copied into the target folder, i.e. the process does not duplicate data, for storage reasons
  • mvn databus:package-export will copy the files to an external location as given in <databus.packageDirectory> (a configuration sketch follows this list)
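
A configuration sketch for the export location, again assuming the parameter can be set as a pom property just like on the command line (the path is the same example value used in the Usage section below):

<properties>
    <databus.packageDirectory>/var/www/mydata.org/datareleases</databus.packageDirectory>
</properties>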

Usage

How to make a release

Once the project is configured properly (see Configuration), releases are easy to generate and update. The only technical requirement is Maven 3 (sudo apt-get install maven). We regularly deploy the plugin to our Archiva at http://databus.dbpedia.org:8081/ and will later deploy to Maven Central; Maven will install the plugin automatically (note that the archetype for configuration has to be installed manually at the moment). We assume that you have set up the private key and the WebID, that the data resides in src/main/databus/${version}/, and that the pom.xml files are configured properly.

# deleting any previously generated metadata
mvn clean 

# validate setup of private key/webid
mvn databus:validate

# validate syntax of rdf data, generates parselogs in target/databus/parselogs
# Note: this is a resource intensive step. It can be skipped (-DskipTests=true)
mvn databus:test-data

# generate metadata in target/databus/dataid
mvn databus:metadata

# export the release to a local directory as given in <databus.packageDirectory>
# copies data from src, and metadata and parselogs from target
mvn databus:package-export

# submit/upload the generated metadata to the databus metadata repository
mvn databus:deploy

# output folder or any parameter can be set on the fly 
mvn databus:package-export -Ddatabus.packageDirectory="/var/www/mydata.org/datareleases"
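
The goals above can also be chained into a single invocation, e.g. for a full release:

# full release in one call (sketch; chains the goals listed above)
mvn clean databus:validate databus:test-data databus:metadata databus:package-export databus:deploy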

Github setup

The pom.xml files can be versioned via GitHub as we do for DBpedia (see folder). Add the following to .gitignore to exclude data from being committed to git (a fuller sketch follows below): ${bundlefolder}/*/*/src/
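
A minimal .gitignore along these lines could look as follows; the target/ entry is a common Maven addition we suggest here, not something the plugin requires:

# exclude the data files from version control
${bundlefolder}/*/*/src/
# exclude generated build output (standard Maven practice)
target/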

Change version of the whole bundle

mvn versions:set -DnewVersion=2018.08.15

Run the example

There are working examples in the example folder, which you can copy and adapt.

# clone the repository
git clone https://github.com/dbpedia/databus-maven-plugin.git
cd databus-maven-plugin
cd example/animals

# validate, parse, generate metadata and package
mvn databus:validate databus:test-data databus:metadata databus:package-export

Configuration

File setup and conventions

Generate a release configuration with an archetype

Note: for datasets with few artifacts, you can also copy the example and adjust it.

We provide a Maven archetype for easy and automatic project setup. In short, Archetype is a Maven project templating toolkit: https://maven.apache.org/guides/introduction/introduction-to-archetypes.html. The template is created from an existing project, found in archetype/existing-projects; variables are replaced upon instantiation.

Install databus archetype

We provide two archetype templates:

  • bundle-archetype generates a bundle with one dataset (called add-one-dataset)
  • add-one-dataset-archetype adds a module to an existing bundle

The archetypes need to be installed into the local Maven repository:

git clone https://github.com/dbpedia/databus-maven-plugin.git
cd databus-maven-plugin/archetype/existing-projects
./deploy.sh

deploy.sh runs mvn archetype:create-from-project and mvn install on bundle and bundle/add-one-dataset (roughly as sketched below).
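
For orientation, roughly what deploy.sh does; this is a sketch based on the description above (target/generated-sources/archetype is the default output location of create-from-project), see the script itself for the authoritative steps:

for project in bundle bundle/add-one-dataset ; do
    # turn the existing project into an archetype template
    ( cd $project && mvn archetype:create-from-project )
    # install the generated archetype into the local Maven repository
    ( cd $project/target/generated-sources/archetype && mvn install )
done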

Instantiate a new project

With the archetype you can create one bundle with arbitrarily many datasets/artifacts. Here is how:

# Generate the bundle

# version number of bundle
VERSION=2018.08.15
# domain 
GROUPID=org.example.data
# bundle artifactid
BUNDLEARTIFACTID=animals
# configure list of datasets/artifacts to be created
DATASETARTIFACTID="mammals birds fish"

mvn archetype:generate -DarchetypeCatalog=local -DarchetypeArtifactId=bundle-archetype -DarchetypeGroupId=org.dbpedia.databus.archetype -DgroupId=$GROUPID -DartifactId=$BUNDLEARTIFACTID -Dversion=$VERSION -DinteractiveMode=false

# Generate datasets/modules 

# go into the bundle
cd $BUNDLEARTIFACTID

for i in ${DATASETARTIFACTID} ; do 
	mvn archetype:generate -DarchetypeCatalog=local -DarchetypeArtifactId=add-one-dataset-archetype -DarchetypeGroupId=org.dbpedia.databus.archetype -DgroupId=$GROUPID -DartifactId=$i -Dversion=$VERSION -DinteractiveMode=false
	# some clean up, since archetype does not set parent automatically  
	# TODO we are trying to figure out how to automate this
	sed -i "s|<artifactId>bundle</artifactId>|<artifactId>$BUNDLEARTIFACTID</artifactId>|" */pom.xml
	sed -i "s|<groupId>org.dbpedia.databus.archetype</groupId>|<groupId>$GROUPID</groupId>|" */pom.xml
	sed -i "s|<version>1.0.0</version>|<version>$VERSION</version>|" */pom.xml
done

# delete add-one-dataset
rm -r add-one-dataset
sed -i  's|<module>add-one-dataset</module>||' pom.xml

# wipe the example data files
rm */src/main/databus/$VERSION/*
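
If everything worked, the instantiated bundle should look roughly like this (a sketch using the variable values from above):

animals/
+-- pom.xml (parent pom, lists mammals, birds and fish as modules)
+-- mammals/
|   +-- pom.xml
|   +-- src/main/databus/2018.08.15/ (empty, put your data files here)
+-- birds/
|   +-- ...
+-- fish/
|   +-- ...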

Development

License

The software is licensed under AGPL with intended copyleft. We expect you to spend your best effort to commit changes upstream to make this tool better, or at least to make your extensions available again. Any contribution will be merged under the copyright of the DBpedia Association.

Development rules

  • All paths are configured in Properties.scala, which is a trait for the Mojos (Maven plugin classes); please handle all paths there
  • Datafile.scala is a quasi-decorator for files; use getInputStream to open any file
  • Use the issue tracker; create branches instead of forks (we can give access) and we will merge them into master
  • Document options in the archetype pom and here

Troubleshooting

Download from http://databus.dbpedia.org:8081/repository/ fails, no dependency information available

Note: this section can be removed after completion of https://github.com/dbpedia/databus-maven-plugin/issues/12. Possible reason: we have installed a dev Archiva for now; depending on your org’s network configuration, artifacts might only be accepted from Maven Central and local/allowed Maven repositories.

  • [WARNING] The POM for org.dbpedia.databus:databus-maven-plugin:jar:1.0-SNAPSHOT is missing, no dependency information available
  • Could not resolve dependencies for project org.dbpedia.databus:databus-maven-plugin:maven-plugin:1.0-SNAPSHOT: Failure to find org.dbpedia.databus:databus-shared-lib:jar:0.1.4

This can potentially be fixed by locally installing the shared-lib:

  • Download Jar: http://databus.dbpedia.org:8081/#artifact-details-download-content/org.dbpedia.databus/databus-shared-lib/0.1.4
  • Install the dependency: https://maven.apache.org/guides/mini/guide-3rd-party-jars-local.html (see the sketch below)
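
Following that guide, the install command should look roughly like this (the jar filename is an assumption based on the download above):

# install the downloaded jar into the local Maven repository
mvn install:install-file -Dfile=databus-shared-lib-0.1.4.jar \
    -DgroupId=org.dbpedia.databus -DartifactId=databus-shared-lib \
    -Dversion=0.1.4 -Dpackaging=jar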

Then clone the repo and run mvn install, which will install the databus-maven-plugin locally.

BUILD FAILURE, no mojo-descriptors found (when using mvn install to install the databus-maven-plugin)

This is most likely caused by an old Maven version (observed with 3.0.5). A workaround is to replace:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-plugin-plugin</artifactId>
    <version>3.4</version>
</plugin>

with

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-plugin-plugin</artifactId>
    <version>3.4</version>
    <configuration>
        <skipErrorNoDescriptorsFound>true</skipErrorNoDescriptorsFound>
    </configuration>
    <executions>
        <execution>
            <id>mojo-descriptor</id>
            <goals>
                <goal>descriptor</goal>
            </goals>
        </execution>
    </executions>
</plugin>

in databus-maven-plugin/pom.xml

UTF-8 - Encoding Errors in the produced data

On Unix, run grep "LC_ALL" .* in your /root/ directory and make sure that

.bash_profile:export LC_ALL=en_US.UTF-8
.bashrc:export LC_ALL=en_US.UTF-8

are set.
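
If the locale itself is missing on the system, it may need to be generated first; on Debian/Ubuntu, for example:

# generate the UTF-8 locale (Debian/Ubuntu)
sudo locale-gen en_US.UTF-8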