Debugging a large codebase such as the DIEF repository is hard. One common approach is to write unit tests (e.g. with JUnit in Java). We have therefore started to cover the DIEF code with unit tests as well. This improves the debugging experience and makes it possible to evaluate quality changes between older and newer code.
Maven's default behavior is to execute every test it finds in the code base during the “install” goal.
Thus, to install the DIEF, simply clone the repository, enter the directory, and execute

    git clone https://github.com/dbpedia/extraction-framework.git && cd extraction-framework
    mvn clean install # add "-Dmaven.test.skip -DskipTests" to skip all tests during the install goal

For troubleshooting, check that you meet the following requirements:
- A compatible shell (e.g. bash) to follow the instructions
- The version control system Git
- Java JDK 1.8 (the code does not compile with JDK 11)
- Apache Maven 3.3 or higher
- Scala 2.11.4 (should only matter in an IDE, e.g. IntelliJ)
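The requirements listed above can be checked quickly before starting a build. The following is a minimal sketch, not part of the DIEF repository: `check_tools` is a hypothetical helper that only tests whether each tool is on the PATH, and the version lines are printed for manual comparison against the required versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the DIEF): report which tools are on PATH.
# Returns 0 if every tool was found, 1 otherwise.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools git java mvn || echo "Install the missing tools before running mvn clean install."

# Print version banners for manual comparison (JDK 1.8, Maven 3.3 or higher).
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
fi
if command -v mvn >/dev/null 2>&1; then
  mvn -version | head -n 1
fi
```

The script does not parse version numbers; it only confirms that the tools exist and leaves the comparison to the reader.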
To evaluate the quality of the DIEF development process, we introduce the minidump tests. The main goal of this test collection is to provide a global overview of the extraction quality. This is needed because an improvement in one part of the code can sometimes cause a decline or failure in other parts.
The minidump tests use subsets of individual official Wikipedia dumps as extraction input. For now, the implementation runs the following tests
To run only the minidump tests, change to the dumps directory and execute

    cd dumps/ # << $DIEF_DIR/dumps
    mvn test
To run a single test suite:

    mvn test -Dsuites="org.dbpedia.extraction.dump.MinidumpTests"

To exclude or include individual steps, e.g.

    mvn test -Dsuites="org.dbpedia.extraction.dump.MinidumpTests" -DtagsToExclude="DownloadTest"

ScalaTest runner arguments: -n to include tags, -l to exclude tags.

Available tags: DownloadTest, ExtractionTest, PostProcessingTest, ConstructValidationTest, ShaclTest
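When several steps should be skipped at once, the command line can be assembled from a list of tags. The sketch below assumes that tagsToExclude accepts a comma-separated list, as the scalatest-maven-plugin documents for this property. `minidump_cmd` is a hypothetical helper that only prints the resulting command, so it can be inspected before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the DIEF): build the Maven call that runs the
# minidump suite while excluding the given tags. It prints the command only.
minidump_cmd() {
  local excluded
  # Join all arguments with commas (assumed list syntax for tagsToExclude).
  excluded=$(IFS=,; echo "$*")
  echo "mvn test -Dsuites=\"org.dbpedia.extraction.dump.MinidumpTests\" -DtagsToExclude=\"$excluded\""
}

minidump_cmd DownloadTest ShaclTest
# prints: mvn test -Dsuites="org.dbpedia.extraction.dump.MinidumpTests" -DtagsToExclude="DownloadTest,ShaclTest"
```

Printing instead of executing keeps the helper safe to try outside the dumps directory; the printed line can be pasted into the shell once it looks right.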
The latest test code can be found inside MiniDumpTests.scala
Further, we designed a quality assessment approach which can be used to evaluate given RDF data (more formats will be accessible in the future).
In short, the evaluation uses a number of defined IRI namespace and literal patterns. These models are stored in the RDF Turtle serialization, for example dbpedia-specific-ci-tests.ttl.
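To illustrate the idea of an IRI namespace pattern, the following sketch counts how many subject IRIs in an N-Triples stream start with a given namespace. It is an assumption-laden illustration only: `namespace_coverage` is a hypothetical helper, the sample triples are made up, and the actual evaluation reads its patterns from the Turtle models rather than from a command-line argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the DIEF validation code): read N-Triples on stdin
# and print "<matching>/<total>" for subjects that start with the namespace $1.
namespace_coverage() {
  awk -v ns="<$1" '
    NF == 0 || /^#/ { next }          # skip blank lines and comments
    { total++ }
    index($1, ns) == 1 { hits++ }     # subject begins with "<namespace"
    END { printf "%d/%d\n", hits, total }
  '
}

# Made-up sample data: one subject inside the namespace, one outside.
printf '%s\n' \
  '<http://dbpedia.org/resource/Leipzig> <http://www.w3.org/2000/01/rdf-schema#label> "Leipzig"@en .' \
  '<http://example.org/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .' \
  | namespace_coverage 'http://dbpedia.org/resource/'
# prints: 1/2
```

A simple prefix match is used here for brevity; literal patterns would need a similar check on the object position.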
TODO ValidationLauncher.scala, maybe rename class
You can deploy your own instance of an ad hoc extraction server on your local machine to see the extraction results for a single entity/resource/article (see e.g. http://dbpedia.informatik.uni-leipzig.de:9999/server/extraction/en/).
If you want to contribute to this debugging process, feel free to add a unit test for a given part of the DIEF (e.g. one of the implemented data parsers) and create a pull request.