Development of the plugin started with the idea to load data into software as data dependencies, in a similar manner to how Maven loads software dependencies automatically via Maven Central and Archiva. Initially, we started with this plugin and implemented all features here. Meanwhile, the Databus Website exists, as well as the concept of Databus Mods and the Databus Client. Therefore we are slimming down this plugin in the following manner:
- `dataid.ttl` stays stable as an interface. If the Databus Maven Plugin is thinner, it is easier to implement in other programming languages.
- `dataid.ttl` description (also how to query it) and a list of planned changes, as they might break queries.
- … `.ttl` is secure if you have good local security and are the only root admin or trust all other admins.

Databus is a Digital Factory Platform and transforms data pipelines into data networks. The databus-maven-plugin is the current CLI to generate metadata about datasets and then upload this metadata, so that anybody can query, download, derive and build applications with this data via the Databus. It follows a strict Knowledge Engineering methodology; in comparison to Big Data, Data Science and Git-for-data approaches, we implemented well-engineered processes that are repeatable, transparent and have a favourable cost-benefit ratio:
- `mvn` with options on the CLI, like it were any other software that you configure and then execute
- `derive` and `describe` links and deploy tons of applications automatically. However, you can put a password in front of your data and charge for access as well.

The Databus solves fundamental problems of data management processes in a decentralised network. Each new release is a workflow step in a global network consisting of Load-Derive-Release. Data is downloaded based on the Databus metadata, derived and re-published.
The best example is DBpedia itself, where we load the Wikimedia dumps, derive RDF with the extraction framework and release the data on our server and metadata on the databus. Other users repeat this procedure.
The main purpose of the Databus Maven Plugin: data has been loaded and processed locally and is now available in files again. These files are made accessible on the web (`package`) and automatically retrievable via the Databus metadata (`deploy`).
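In short, the two corresponding Maven goals look like this (both are explained in detail in the configuration sections below):

# package: copy the data files and the generated dataid.ttl metadata to the
# web-accessible databus.packageDirectory on your server
mvn package
# deploy: upload the dataid.ttl metadata to the Databus SPARQL API
mvn deploy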
Databus uses a mix of several concepts that have proven successful:
- Maven's `group`, `artifact`, `version` structure (compare this library with this dataset)

The Maven Plugin uses client certificate (X.509) authentication to establish HTTPS connections, plus private key signatures of the data files and metadata.
Publishers are required to keep their own private key/X.509 bundle, which they use to:
- sign the `dataid.ttl` metadata file
- establish client-authenticated HTTPS connections (see `curl https://databus.dbpedia.org/system/api/accounts`)
Private keys are like passwords. The publisher is responsible for due diligence (don't lose, don't share).
In addition, publishers are required to publish their public key on their servers as a WebID, such as http://webid.dbpedia.org/webid.ttl#this. Clients can verify the signature of files against the publisher's public key, either by retrieving the public key from the WebID or via this query (the SPARQL API caches all WebIDs on a daily basis):
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX cert: <http://www.w3.org/ns/auth/cert#>
# Get all files
SELECT DISTINCT ?file ?publisher ?publickey ?signature WHERE {
?dataset dataid:artifact <https://databus.dbpedia.org/dbpedia/mappings/geo-coordinates-mappingbased> .
?dataset dcat:distribution ?distribution .
?distribution dcat:downloadURL ?file .
?distribution dct:publisher ?publisher .
?publisher cert:key [ cert:modulus ?publickey ] .
?distribution dataid:signature ?signature .
}
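To run this query yourself, you can send it to the public Databus SPARQL endpoint with curl (the endpoint URL is the same one used in the download example further below; verify-signatures.sparql is a placeholder for a local file containing the query above):

# send the query above to the Databus SPARQL endpoint and request a TSV result
curl "https://databus.dbpedia.org/repo/sparql" \
     --data-urlencode query@verify-signatures.sparql \
     --data-urlencode format=text/tab-separated-values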
The Databus prescribes a minimal structure to bring order into datasets; beyond this, the decisions on how to use it stay with the publisher. Below is a set of decisions that need to be taken.
The basis for any data publication on the bus are files, which are grouped into versions and then artifacts. The main aspect that should drive the dataset structure is the potential users, with two basic rules:
and as a practical tip:
However, we advise against breaking changes, as a new artifact can be created instead. Below are some use cases as a template for designing artifacts. In complex cases, it might be helpful to read about Maven's Inheritance vs. Aggregation.
At its core, DBpedia has four different kinds of extractions: Generic (19 artifacts), Mappingbased (6 artifacts), Text (2 artifacts) and Wikidata (18 artifacts), each of which has its own dependencies and ways to run and set up. Hence we created four different groups with different group metadata (e.g. Text has a different license). This way, we can run them individually on different servers. Furthermore, each artifact relates approximately to one piece of the code, e.g. the Label Extractor extracts Generic/Labels. In this manner, the produced data is directly linked to the code and is also run together in an adequate way (not too small, as too much setup would be required, but also not too big: it would take 8-12 weeks to run all extractions on one server, but only 2 days to run just the mappings group).
Some artifacts are produced in 130 languages. However, it would have been very impractical to create 130 artifacts, as the documentation would be copy/paste, so we encoded the language as a content variant in the file name (`lang=en`), which increases the number of files per artifact but keeps the overall massive data publication process manageable.
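For illustration, a single version directory of such an artifact could look like the listing below (artifact and version are hypothetical; the file naming matches the real example mappingbased-literals_lang=id.ttl.bz2 shown further down):

# hypothetical listing of one artifact version that uses lang=... content variants
ls labels/2019.08.30/
# labels_lang=de.ttl.bz2
# labels_lang=en.ttl.bz2
# labels_lang=fr.ttl.bz2
# ... one file per language, 130 in total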
In addition to this, we have a clean contribution process: DBpedia contributors, like Marvin, release data in their own space on an extraction server, and we quality-check it before downloading it from their Databus space onto our server and re-releasing it under DBpedia.
We have several organisations and individuals that are contributing links to DBpedia. We are currently preparing a working example and will link it here later.
Most of them, however, have a single artifact called `links` with a file `links.nt.bz2` and produce new versions infrequently. Some publish `tsv` files, where small converters filter and `derive` an RDF version upon release.
In an industrial environment, we have implemented the Databus in the following way:
A customer sends PDF documents for digitalisation of data about 10k different machines. There are several documents from different contexts (maintenance report, certification, etc.) for each machine. Upon receipt, all the relevant files for one machine are loaded into one artifact with the machine identifier, e.g. `M2133`. Each context is encoded as a content variant, so per artifact there are 4 files: `M2133_maintenance.pdf`, `M2133_certificate.pdf`, etc. Since the same files are sent each year for each machine in a different week, the files are packaged on a local server and deployed to the SPARQL API on an artifact basis within the PDF group.
The OCR and NLP tools scan the SPARQL API twice each day. As soon as they find new versions, they begin extracting facts and save them as JSON files using the same artifact IDs, but in the JSON group, and then package and deploy them, including the `prov:wasDerivedFrom` link to the PDF group/artifact/version.
The JSON to RDF converter scans the SPARQL API as well and also has individually versioned artifacts.
The main reason to separate these three cycles here is that each of them is executed by a different department. During the RDF conversion, we implemented SHACL and other tests, which report issues alongside the RDF releases. These can be queried via the SPARQL API as well and are treated as issues for the OCR/NLP extraction group as upstream feedback. In this manner, the provenance is very granular; however, the versioning and release process is more complex. A changelog is kept (coming soon).
In the Semantic Web, interoperability is achieved by co-evolution of your own data with DBpedia. This means linking entities to DBpedia identifiers and mapping ontologies to the DBpedia Ontology and other vocabularies. The basic setup for such a project is to:
In the beginning, when your own ontology and data change a lot, the version can be set to `<version>0000-initial-dev</version>` and overwritten each time the data is used internally. Later, the data processes should be automated and versions should increase.
When it is sufficiently stable, the linking can be automated in the following way:
This setup automates the alignment to DBpedia.
NOTE: `version` MUST NOT contain any of these characters: `\/:"<>|?*` as they conflict with Maven, URIs and filenames.
The main decision to take here is how often the files are released. This can differ widely according to the use case. In general, the `<version>` is a free field, so you can use arbitrary names like `<version>dev-snapshot</version>`. We also allow re-releases of the same version at the moment, so it is possible to re-release `dev-snapshot` according to your development needs, i.e. one person from the team shapes and tinkers on the data and re-releases it 2-4 times per workday, while another person re-downloads and tests it. It is also possible to add minutes to the version string (`2019.01.05-14.09` or `2019.01.05T14.07.01Z`), if you need to keep track of the development history of the data. Please be considerate of these two facts:
- Queries sort versions with `ORDER BY DESC` on the version string.
- Queries asking for the highest version number sort alphanumerically, meaning `<version>ZZZZZZZZ</version>` will almost always be shown as the latest version.

The user has the freedom and the responsibility to choose versions accordingly.

Guideline: The publisher should not be burdened with providing additional formats
A dedicated download client has not been written yet, but isomorphic derivations of the data can be done during download. We distinguish between format and compression variants. A simple example of how to download NTriples/bzip2 as RDFXML/GZip:
# query the Databus SPARQL endpoint for the download URLs of the artifact (TSV result)
FILES=`curl "https://databus.dbpedia.org/repo/sparql?default-graph-uri=&query=PREFIX+dataid%3A+%3Chttp%3A%2F%2Fdataid.dbpedia.org%2Fns%2Fcore%23%3E%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+dcat%3A++%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fdcat%23%3E%0D%0A%0D%0A%23+Get+all+files%0D%0ASELECT+DISTINCT+%3Ffile+WHERE+%7B%0D%0A%09%3Fdataset+dataid%3Aartifact+%3Chttps%3A%2F%2Fdatabus.dbpedia.org%2Fdbpedia%2Fmappings%2Fgeo-coordinates-mappingbased%3E+.%0D%0A%09%3Fdataset+dcat%3Adistribution+%3Fdistribution+.%0D%0A%09%3Fdistribution+dcat%3AdownloadURL+%3Ffile+.%0D%0A%7D%0D%0ALimit+1&format=text%2Ftab-separated-values&timeout=0&debug=on"`
# strip quotes from the TSV result, then download, decompress, convert and recompress each file
for f in `echo ${FILES} | sed 's/"//g'` ; do
    curl $f | lbzip2 -dc | rapper -i turtle -o rdfxml -O - - file | gzip > /tmp/downloadedfile.rdf.gz
done
The directive here is to outsource the easy conversion operations to the download client upon download. This accounts for compression and simple format conversions, but also for more sophisticated operations such as loading the data directly into a database (e.g. RDF into Virtuoso or HDT), and furthermore downloading complex mappings along with the SPARQL query to use RML or vocabulary rewrite operations, e.g. download schema.org data as DBpedia Ontology. However, these mappings and tools will be provided by third parties and should not burden the publisher.
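As a minimal sketch of such a consumer-side derivation, here is a pure compression-variant change (the file name is a placeholder taken from the dataid example further below):

# repack a downloaded bzip2-compressed file as gzip without touching the format
lbzip2 -dc mappingbased-literals_lang=id.ttl.bz2 | gzip > mappingbased-literals_lang=id.ttl.gz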
mvn validate
Install Maven:
sudo apt-get install maven
Check with `mvn --version`.
You need a WebID with a certificate bundle (`.pfx`) file. The WebID tutorial is here: https://github.com/dbpedia/webid#webid
Note: The WebID MUST be hosted on a server supporting HTTPS
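A quick way to check that your WebID is reachable via HTTPS is a plain curl against the WebID URL (the URL below is the example WebID used in this document; replace it with your own):

# follow redirects and show the response headers of the WebID document
curl -IL https://webid.dbpedia.org/webid.ttl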
Go to the Databus Website and click Register to register a new account via email, or click Login and select GitHub to use your existing GitHub account name and credentials for authentication.
Once you have a verified account, log in to your Databus account and enter your WebID in the Account > WebID URI field. Click Save. Note: it can take up to 30 minutes until the change takes effect on the Databus. Moreover, take care that the URI begins with `https://` (you can verify everything worked by adapting this SPARQL query).
Option 1 (recommended): configure the `<privateKey>` and `<passphrase>` as a `<server>` entry in `${user.home}/.m2/settings.xml`:
<server>
  <id>databus.defaultkey</id>
  <privateKey>${user.home}/.m2/certificate_generic.pfx</privateKey>
  <passphrase>this is my password</passphrase>
</server>
Option 2: set the `databus.pkcs12File` property directly, e.g. in the `<properties>` of your `pom.xml`:
<databus.pkcs12File>${user.home}/.m2/certificate_generic.pfx</databus.pkcs12File>
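Since `databus.pkcs12File` is an ordinary Maven property, it should also be possible to override it on the command line for a one-off run (this is an assumption based on standard Maven behaviour, not a documented plugin feature):

# one-off override of the certificate bundle location (assumes standard -D property override)
mvn deploy -Ddatabus.pkcs12File=${HOME}/.m2/certificate_generic.pfx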
Three artifacts with one version each:
~/IdeaProjects/databus-maven-plugin/example/animals$ tree
.
├── birds
│ ├── 2018.08.15
│ │ └── birds_mockingbird.nt
│ ├── birds.md
│ └── pom.xml
├── fish
│ ├── 2018.08.15
│ │ ├── fish_mappingbased_ussorted.ttl
│ │ ├── fish_sorttest.txt
│ │ ├── fish_subspecies=carp.ttl
│ │ └── fish_subspecies=goldfish.ttl
│ ├── fish.md
│ └── pom.xml
├── mammals
│ ├── 2018.08.15
│ │ ├── mammals-2018.08.17_cat.nt
│ │ ├── mammals_binary.bin
│ │ ├── mammals_carnivore_cat.nt.patch
│ │ ├── mammals_carnivore_cat.trig
│ │ ├── mammals_monkey.nt.bz2
│ │ └── mammals.nt
│ ├── mammals.md
│ ├── pom.xml
│ └── provenance.tsv
├── pom.xml
├── test-cert-bundle.p12
└── test-cert-bundle-with-password.p12
Put `dbpedia/*/*/*/` in `.gitignore` to exclude the data from git.
Per default, the layout looks like this:
${groupId}/
+-- pom.xml (parent pom with common metadata and current ${version}, all artifactids are listed as `<modules>` )
+-- ${artifactid1}/ (module with artifactid as datasetname)
| +-- pom.xml (dataset metadata)
| +-- ${version}/
| | *-- ${artifactid1}_cvar1.nt (distribution, contentvariance 1, formatvariance nt, compressionvariant none)
| | *-- ${artifactid1}_cvar1.csv (distribution, contentvariance 1, formatvariance csv, compressionvariant none)
| | *-- ${artifactid1}_cvar1.csv.bz2 (distribution, contentvariance 1, formatvariance csv, compressionvariant bzip)
| | *-- ${artifactid1}_cvar2.ttl (distribution, contentvariance 2, formatvariance ttl, compressionvariant none)
| | *-- ${artifactid1}_cvar2.csv (distribution, contentvariance 2, formatvariance csv, compressionvariant none)
| | *-- ${artifactid1}.csv (distribution, no content variant, formatvariance csv, compressionvariant none)
To ensure that metadata for files to be published can be determined correctly, the names of these files have to fulfil a specific schema. This schema can be described by the following EBNF:
inputFileName ::= fileNamePrefix contentVariant* formatExtension+? compressionExtension*
fileNamePrefix ::= [^_]+? /* a non-empty string consisting of any chars except '_' */
contentVariant ::= '_' [A-Za-z0-9]+ | '_' [A-Za-z0-9]+ '=' [A-Za-z0-9]+
formatExtension ::= '.' [A-Za-z] [A-Za-z0-9]*
compressionExtension ::= '.' ( 'bz2' | 'gz' | 'tar' | 'xz' | 'zip' )
Note: `+?` in the grammar above denotes a reluctant one-or-more quantifier such that, for example, the production rule for the `fileNamePrefix` (the artifact name) will not 'parse into' the `formatExtension`s when `contentVariant`s are absent.
Some valid filenames from the `mammals` artifact of the `animals` example:
mammals.nt - `nt` as format variant
mammals_species=carnivore_cat.nt.patch - `species=carnivore` and `cat` as content variants, `nt` and `patch` as format variants
mammals_monkey.nt.bz2 - `monkey` as content variant; `nt` as format variant; `bz2` as compression variant
mammals-2018.08.17_cat.nt - `cat` as content variant; `nt` as format variant; `fileNamePrefix` contains a date
Invalid (counter-)examples:
mammals.zip.nt, mammals_monkey.nt.001.bz2, mammals_2018.08.17_cat.nt
As mentioned above, filenames are not only required to conform to the aforementioned schema, but the `fileNamePrefix` also has to start with the name of the artifact. (Files with names starting differently will be ignored.)
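The following bash sketch illustrates the naming scheme by decomposing a file name into its parts. It is not the plugin's actual parser and only handles the common case where the `fileNamePrefix` contains no dots (it would, for example, misread mammals-2018.08.17_cat.nt):

#!/usr/bin/env bash
# decompose <prefix>_<contentVariant>...<formatExtension>...<compressionExtension>
f="mammals_monkey.nt.bz2"

name="${f%%.*}"              # everything before the first '.'  -> mammals_monkey
extensions="${f#"$name"}"    # all extensions                    -> .nt.bz2
prefix="${name%%_*}"         # artifact name / fileNamePrefix    -> mammals
variants="${name#"$prefix"}" # content variants (may be empty)   -> _monkey

case "${f##*.}" in           # the last extension
  bz2|gz|tar|xz|zip) compression="${f##*.}" ;;
  *)                 compression="" ;;
esac

echo "prefix=$prefix variants=$variants extensions=$extensions compression=$compression"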
It is highly recommended that you use the pattern `YYYY.MM.DD` as `version`. If you deviate from this, please make sure that the version numbers are alphabetically sortable, i.e. `1.10` sorts before `1.2`, so you need to use `01.02` and `01.10` instead. In bash: date +'%Y.%m.%d'
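A quick demonstration of this sorting pitfall:

# without zero-padding, 1.10 sorts before 1.2
printf '1.2\n1.10\n' | sort     # prints: 1.10  1.2
# zero-padded versions sort as intended
printf '01.02\n01.10\n' | sort  # prints: 01.02  01.10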
Setting the version programmatically (will change the version in all pom.xml files):
mvn versions:set -DnewVersion=2018.08.15
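Combining the two commands above, today's date can be set as the version in one go:

# set today's date (YYYY.MM.DD) as the version in all pom.xml files
mvn versions:set -DnewVersion=$(date +'%Y.%m.%d')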
Example snippet from the `pom.xml` of the Mappings group, showing the common metadata properties you will need to describe your data:
<properties>
<databus.packageDirectory>
/media/bigone/25TB/www/downloads.dbpedia.org/repo/lts/${project.groupId}/${project.artifactId}
</databus.packageDirectory>
<databus.downloadUrlPath>
https://downloads.dbpedia.org/repo/lts/${project.groupId}/${project.artifactId}/${project.version}/
</databus.downloadUrlPath>
<databus.publisher>https://webid.dbpedia.org/webid.ttl#this</databus.publisher>
<!-- moved to settings.xml
databus.pkcs12File>${user.home}/.m2/certificate_generic.pfx</databus.pkcs12File-->
<databus.maintainer>https://termilion.github.io/webid.ttl#this</databus.maintainer>
<databus.license>http://purl.oclc.org/NET/rdflicense/cc-by3.0</databus.license>
<databus.documentation><![CDATA[
documentation added to dataset using dataid:groupdocu
]]></databus.documentation>
</properties>
As stated above, the data is hosted on your server. The packageDirectory gives the local location where the files are copied to upon running `mvn package`. The following files are copied for each artifact: the data files, the generated `dataid.ttl` metadata and the `${artifactId}.md` files.
The package can be copied to the target directory and later moved manually online:
<databus.packageDirectory>
${session.executionRootDirectory}/target/databus/repo/${project.groupId}/${project.artifactId}
</databus.packageDirectory>
When on same server, the package can be copied to /var/www
directly:
<databus.packageDirectory>
/var/www/data.example.org/repo/${project.groupId}/${project.artifactId}
</databus.packageDirectory>
You can also use the build environment as the publishing environment:
<databus.packageDirectory>
.
</databus.packageDirectory>
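If you package into the local target directory first (the first option above), one way to move the files onto the web server afterwards is a plain rsync; host and paths below are placeholders and must match your databus.packageDirectory and databus.downloadUrlPath:

# copy the locally packaged repo tree to the web server's document root
rsync -av target/databus/repo/ user@data.example.org:/var/www/data.example.org/repo/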
The command `mvn prepare-package` generates a turtle file with relative URIs in the `target` folder:
# <> is relative and expands to the file:// URL
<> a dataid:DataId ;
# internal fragment in the dataid.ttl file
<#mappingbased-literals_lang=id.ttl.bz2>
a dataid:SingleFile ;
# refers to a file in the same folder (external relative reference)
dcat:downloadURL <mappingbased-literals_lang=id.ttl.bz2> ;
Upon `mvn package`, the dataid.ttl is copied to the `databus.packageDirectory` and all relative URIs are rewritten against:
<databus.downloadUrlPath>
https://downloads.dbpedia.org/repo/lts/${project.groupId}/${project.artifactId}/${project.version}/
</databus.downloadUrlPath>
Result (taken from http://downloads.dbpedia.org/repo/lts/mappings/mappingbased-literals/2018.12.01/dataid.ttl):
<https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-literals/2018.12.01/dataid.ttl>
a dataid:DataId ;
<https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-literals/2018.12.01/dataid.ttl#mappingbased-literals_lang=id.ttl.bz2>
a dataid:SingleFile ;
dcat:downloadURL <https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-literals/2018.12.01/mappingbased-literals_lang=id.ttl.bz2> ;
NOTE: Don't forget the `/` at the end here; if you enter `#quatsch#`, the URIs will start with `#quatsch#`. This allows the necessary freedom to bend the URIs to the download location of the data.
- The WebID URL including `#this`. Used to retrieve the public key and account.
- A maintainer WebID, if different from the publisher (normally the person doing the release).
- Pick one from here: http://rdflicense.appspot.com/ or link your own license.
- UNSTABLE, will change, see Issue 84
- A `<![CDATA[ ]]>` field with markdown, which will be added to the Dataset as `dataid:groupdocu`
`${artifactId}.md` documents the artifact. Do not repeat what is in the `rdf:type` already, or the `$groupId` or `$publisherName`, which is context. It is good to be specific about the 'From Where?', 'How?', 'What?', especially for ETL processes. Examples:
- `rdfs:comment`
- `dct:description`
- `<databus.documentation>` will be kept separate in `dataid:groupdocu`
- generated on `mvn prepare-package`, can be overwritten manually, format: `2019-02-07T10:08:27Z`
- for `mvn test` or `mvn databus:test-data`
- for `mvn package` or `mvn databus:package-export`
- for `mvn deploy` or `mvn databus:deploy`
<databus.deployRepoURL>https://databus.dbpedia.org/testrepo</databus.deployRepoURL>
We provide a working example in the repo: https://github.com/dbpedia/databus-maven-plugin/tree/master/example/animals
All commands should work, except `mvn deploy`.
Use `mvn -T 6` to run everything in parallel with 6 threads. The log will look messy.
mvn versions:set -DnewVersion=2018.08.15
Some background info, in case you would like to include scripts and other things between `validate` and `test`.
Note that we are using a super pom, which deactivates all software compilers:
<parent>
  <groupId>org.dbpedia.databus</groupId>
  <artifactId>super-pom</artifactId>
  <version>1.3-SNAPSHOT</version>
</parent>
<!-- currently still needed to find the super-pom, once the super-pom is in maven central,
this can be removed as well -->
<repositories>
  <repository>
    <id>archiva.internal</id>
    <name>Internal Release Repository</name>
    <url>http://databus.dbpedia.org:8081/repository/internal</url>
  </repository>
  <repository>
    <id>archiva.snapshots</id>
    <name>Internal Snapshot Repository</name>
    <url>http://databus.dbpedia.org:8081/repository/snapshots</url>
    <snapshots>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>
Once the project is configured properly, releases are easy to generate and update by typing:
mvn deploy
Deploy is the last phase in the maven lifecycle and is the same as running:
mvn validate prepare-package package deploy
# deleting any previously generated metadata
mvn clean
# validate setup of private key/webid and some values
mvn databus:validate
# generate metadata in target/databus/dataid
mvn databus:metadata
# export the release to a local directory as given in <databus.packageDirectory>
# copies data from src, metadata and parselogs from data
mvn databus:package-export
# upload the generated metadata to the databus metadata repository
mvn databus:deploy
There are working examples in the example folder, which you can copy and adapt. `mvn deploy` will not work (no account).
# clone the repository
git clone https://github.com/dbpedia/databus-maven-plugin.git
cd databus-maven-plugin
cd example/animals
# validate, test, generate metadata and package
mvn package
# or
mvn databus:validate databus:test-data databus:metadata databus:package-export
TODO: `deploy` does a multipart POST: