DBpedia Development Wiki devilopment bible

Download the DBpedia Knowledge Graph

This is technical documentation on how to customize SPARQL queries over the Databus SPARQL API, so that you can retrieve exactly the download links you need.

Overview

  • DBpedia extracts information from all Wikipedia languages, Wikidata, Commons and other projects
  • The extraction runs monthly, around the 7th; for details, see the Improve DBpedia section
  • You can also read the documentation and create custom SPARQL queries for individual datasets at the Databus DBpedia Account
  • The data is split into different groups or modules according to their dependencies

Core Groups

  • Databus URI Pattern: https://databus.dbpedia.org/dbpedia/$group
  • SPARQL pattern: ?dataset dataid:group <https://databus.dbpedia.org/dbpedia/generic> .
  • For documentation, add ?dataset rdfs:comment ?comment . ?dataset dct:description ?description . to your queries.
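
The URI pattern and the triple patterns above can be combined programmatically. The following is a minimal Python sketch, not part of the DBpedia tooling: the function names group_uri and group_docs_query are hypothetical, and the generated query simply adds the two documentation patterns from the list above to the group pattern.

```python
DATABUS_ACCOUNT = "https://databus.dbpedia.org/dbpedia"


def group_uri(group: str) -> str:
    """Fill the $group placeholder of the Databus URI pattern."""
    return f"{DATABUS_ACCOUNT}/{group}"


def group_docs_query(group: str) -> str:
    """Build a query listing the datasets of a group with their documentation."""
    return f"""PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?dataset ?comment ?description WHERE {{
    ?dataset dataid:group <{group_uri(group)}> .
    ?dataset rdfs:comment ?comment .
    ?dataset dct:description ?description .
}}"""


# Print the query text for the generic group; nothing is sent over the network.
print(group_docs_query("generic"))
```

Replacing "generic" with "mappings" or "wikidata" yields the corresponding query for the other core groups.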

generic

  • monthly, deployed
  • available for ~140 languages and based on the pages-articles-multistream Wikimedia dumps. It runs the automatic extractors, written in Scala, on the wiki syntax and produces triples with predicates of the form http://dbpedia.org/property or http://$lang.dbpedia.org/property as well as predicates from other standard vocabularies, such as foaf, rdfs:label, skos and wgs84. These datasets have the broadest coverage and decent quality.
    • The generic group has over 20 artifacts with almost 3000 files per version in total
    • The filter in the query below is set to ‘English only’
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat:  <http://www.w3.org/ns/dcat#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Get all latest English files of generic extraction
SELECT DISTINCT ?file ?shasum WHERE {
    ?dataset dataid:group <https://databus.dbpedia.org/dbpedia/generic> .
    ?dataset dcat:distribution ?distribution .
    ?distribution dataid-cv:lang "en"^^xsd:string .
    ?dataset dct:hasVersion ?latestVersion .
    {
        SELECT (max(?version) as ?latestVersion) WHERE {
           ?dataset dataid:group <https://databus.dbpedia.org/dbpedia/generic> .
           ?dataset dct:hasVersion ?version .
        }
    }
    ?distribution dcat:downloadURL ?file .
    ?distribution dataid:sha256sum ?shasum .
    # debug will be removed in a while
    FILTER NOT EXISTS {?distribution dataid-cv:tag 'debug'^^xsd:string} .       
}
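
The query returns each download URL together with a ?shasum value. Below is a minimal sketch of how a downloaded file could be checked against it, assuming ?shasum is the hex-encoded SHA-256 digest of the file (suggested by the property name dataid:sha256sum, but not stated on this page). The helper names sha256_of and matches_shasum are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_shasum(path, expected):
    """Compare a file against a ?shasum value returned by the query.

    Assumption: ?shasum is a hex string; case and surrounding whitespace are ignored.
    """
    return sha256_of(path) == expected.strip().lower()


# Self-contained demo on a temporary file instead of a real Databus download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
demo_path = tmp.name
expected = hashlib.sha256(b"example payload").hexdigest()
print(matches_shasum(demo_path, expected))   # True
print(matches_shasum(demo_path, "0" * 64))   # False
os.remove(demo_path)
```

Reading in chunks matters here because the dump files can be large; loading a whole file into memory just to hash it is unnecessary.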

mappings

  • monthly, deployed
  • available for ~40 languages. The InfoboxMappingsExtractor can be configured and optimized with easier-to-write rules called mappings, which are edited in the Mappings Wiki. This module produces triples with http://dbpedia.org/ontology/ predicates. They are of higher quality but fewer in number, and they are an improved complement to the generic module. Ontology types using rdf:type are also in this module.
    • The mappings group is far more heterogeneous than generic or wikidata:
      • not every artifact has every language; coverage is quite mixed
      • some artifacts get extra post-processing, such as mappingbased-objects, or inference, such as instance-types
      • the first query below shows the different variants available
      • Luckily, there are only six artifacts in total, of which these four are the most popular: mappingbased-objects, mappingbased-literals, geo-coordinates-mappingbased, instance-types
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat:  <http://www.w3.org/ns/dcat#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Show all content variants of mappings, grouped by artifact
SELECT DISTINCT ?artifact ?cvproperty (group_concat(?cvtmp; separator=",") as ?cv) WHERE {
    {
        SELECT DISTINCT ?artifact ?cvproperty ?cvtmp {
            ?dataset dataid:group <https://databus.dbpedia.org/dbpedia/mappings> .
            ?dataset dataid:artifact ?artifact .
            ?dataset dcat:distribution ?distribution .
            ?cvproperty rdfs:subPropertyOf dataid:contentVariant .
            ?distribution ?cvproperty ?cvtmp .
        }
    }
} GROUP BY ?artifact ?cvproperty
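
Each result row of this query carries a comma-separated ?cv string. The following is a hedged Python sketch of turning such rows into a lookup table. The sample rows are invented for illustration and are not actual Databus results, and variants_by_artifact is a hypothetical helper.

```python
from collections import defaultdict


def variants_by_artifact(rows):
    """Map artifact -> content-variant property -> sorted list of distinct values.

    Each row is an (artifact, cvproperty, cv) tuple, where cv is the
    comma-separated string produced by group_concat in the query.
    """
    table = defaultdict(dict)
    for artifact, cvproperty, cv in rows:
        values = sorted({value for value in cv.split(",") if value})
        table[artifact][cvproperty] = values
    return dict(table)


# Invented sample rows, shaped like the query's three result columns.
sample_rows = [
    ("mappings/instance-types", "dataid-cv:lang", "en,de,en"),
    ("mappings/instance-types", "dataid-cv:tag", "transitive"),
]
print(variants_by_artifact(sample_rows))
```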
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat:  <http://www.w3.org/ns/dcat#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Get the 4 most popular mappings artifacts, English only
SELECT DISTINCT ?file WHERE {
    ?dataset dcat:distribution ?distribution .
    ?dataset dct:hasVersion ?latestVersion .
    ?distribution dcat:downloadURL ?file .
    # English
    ?distribution dataid:contentVariant "en"^^xsd:string .
    {
        ?dataset dataid:artifact ?artifact .
        FILTER (?artifact in (
            <https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals>,
            <https://databus.dbpedia.org/dbpedia/mappings/geo-coordinates-mappingbased>
        )) .
    } UNION {
        ?dataset dataid:artifact <https://databus.dbpedia.org/dbpedia/mappings/instance-types> .
        # pre-calculated transitive closure over rdf:type
        ?distribution dataid:contentVariant "transitive"^^xsd:string .
    } UNION {
        ?dataset dataid:artifact <https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects> .
        ?dataset dcat:distribution ?distribution .
        # removes debugging info about disjoint domain and ranges
        FILTER NOT EXISTS { ?distribution dataid-cv:tag ?tag . }
    }

    {
        SELECT (max(?version) as ?latestVersion) WHERE {
            ?dataset dataid:artifact ?artifact .
            ?dataset dct:hasVersion ?version .
            FILTER (?artifact in (
                <https://databus.dbpedia.org/dbpedia/mappings/instance-types>,
                <https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects>,
                <https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals>,
                <https://databus.dbpedia.org/dbpedia/mappings/geo-coordinates-mappingbased>
            )) .
        }
    }
}
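
To turn a query like the ones above into a list of download links, it has to be sent to the Databus SPARQL API. The following is a minimal sketch using only the Python standard library and the standard SPARQL 1.1 protocol (a GET request with a query parameter, JSON results). The endpoint URL is deliberately left as a parameter because it is not given on this page, and build_request, parse_bindings and run_query are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

SPARQL_JSON = "application/sparql-results+json"


def build_request(endpoint, query):
    """Build a SPARQL 1.1 protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": SPARQL_JSON})


def parse_bindings(json_text):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    bindings = json.loads(json_text)["results"]["bindings"]
    return [{var: cell["value"] for var, cell in row.items()} for row in bindings]


def run_query(endpoint, query, timeout=60):
    """Send the query and return the parsed rows (performs a network request)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

With these helpers, [row["file"] for row in run_query(endpoint, query)] would yield the download URLs, where endpoint is the SPARQL endpoint URL of the Databus and query is one of the query texts from this page.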

wikidata

  • monthly, deployed
  • applies a set of extractors and mappings to the Wikidata XML dumps to make Wikidata compatible with generic and mappings. Uses http://wikidata.dbpedia.org/resource/Q[0-9]+ as subject. It also has a configurable Mappings Extractor to map P[0-9]+ to http://dbpedia.org/ontology and other standard vocabularies.
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat:  <http://www.w3.org/ns/dcat#>

# Get all latest files of wikidata extraction
SELECT DISTINCT ?file ?shasum WHERE {
    ?dataset dataid:group <https://databus.dbpedia.org/dbpedia/wikidata> .
    ?dataset dcat:distribution ?distribution .
    ?dataset dct:hasVersion ?latestVersion .
    {
        SELECT (max(?version) as ?latestVersion) WHERE {
           ?dataset dataid:group <https://databus.dbpedia.org/dbpedia/wikidata> .
           ?dataset dct:hasVersion ?version .
        }
    }
    ?distribution dcat:downloadURL ?file .
    ?distribution dataid:sha256sum ?shasum .
    # debug will be removed in a while
    FILTER NOT EXISTS {?distribution <http://dataid.dbpedia.org/ns/cv#tag> 'debug'^^<http://www.w3.org/2001/XMLSchema#string>} .       
}
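
The subject and property patterns mentioned for this group can be expressed as regular expressions. The following small Python sketch assumes the intended patterns are Q or P followed by one or more digits; the sample identifiers are arbitrary and the helper names are hypothetical.

```python
import re

# Subjects: http://wikidata.dbpedia.org/resource/Q<digits>
SUBJECT_PATTERN = re.compile(r"http://wikidata\.dbpedia\.org/resource/Q[0-9]+")
# Wikidata property identifiers: P<digits>
PROPERTY_PATTERN = re.compile(r"P[0-9]+")


def is_wikidata_subject(uri):
    """True if the whole URI matches the DBpedia Wikidata subject pattern."""
    return SUBJECT_PATTERN.fullmatch(uri) is not None


def is_property_id(identifier):
    """True if the identifier looks like a Wikidata property ID."""
    return PROPERTY_PATTERN.fullmatch(identifier) is not None


print(is_wikidata_subject("http://wikidata.dbpedia.org/resource/Q42"))  # True
print(is_wikidata_subject("http://wikidata.dbpedia.org/resource/Q"))    # False
print(is_property_id("P31"))                                            # True
```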

text

  • under maintenance, offline
  • available in 140 languages. Uses the HTML queried from the Wikipedia API to extract short and long abstracts and other information relevant for Natural Language Processing via the NIF Extractor. Requires online requests to https://en.wikipedia.org/w/api.php or a local mirror set up with http://www.nongnu.org/wp-mirror/
    • Will run Oct/Nov 2019

ontology

  • in development, on ontology edit
  • Provides version snapshots of the DBpedia Ontology downloaded from the Mappings Wiki.
    • Snapshots are currently developed by Denis and will be moved soon.
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat:  <http://www.w3.org/ns/dcat#>

# Get the latest ontology snapshot files in Turtle
SELECT DISTINCT ?file ?latestVersion ?mediatype WHERE {
    ?dataset dataid:artifact <https://databus.dbpedia.org/denis/ontology/dbo-snapshots> .
    ?dataset dcat:distribution ?distribution .
    ?distribution dcat:downloadURL ?file ;
                  dct:hasVersion ?latestVersion ;
                  dcat:mediaType ?mediatype .
    # Turtle only; remove this FILTER to see all available media types
    FILTER (?mediatype = <http://dataid.dbpedia.org/ns/mt#TextTurtle>)
    {
        SELECT (?version as ?latestVersion) WHERE {
            ?dataset dataid:artifact <https://databus.dbpedia.org/denis/ontology/dbo-snapshots> .
            ?dataset dct:hasVersion ?version .
        } ORDER BY DESC (?version) LIMIT 1
    }
}
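
Both max(?version) in the earlier queries and ORDER BY DESC (?version) LIMIT 1 here pick the latest version by comparing version values. If those values are plain strings, the comparison is lexicographic, which matches chronological order only for zero-padded, fixed-width version strings. The Python sketch below illustrates the difference; the version strings are invented examples, not actual Databus versions.

```python
def latest_version(versions):
    """Pick the latest version by plain string comparison, like max() on strings."""
    return max(versions)


# Zero-padded, fixed-width strings: lexicographic order equals chronological order.
padded = ["2019.08.30", "2019.10.01", "2019.09.01"]
print(latest_version(padded))    # 2019.10.01

# Without zero-padding the comparison goes wrong: "9" sorts after "1".
unpadded = ["2019.8.30", "2019.10.1", "2019.9.1"]
print(latest_version(unpadded))  # 2019.9.1, although 2019.10.1 is newer
```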

transition

  • no updates, will be refactored
  • mixed datasets from older releases, which we need to consolidate into the new structure
    • contains links to Freebase and many others
  • in development
  • ID management creates new DBpedia global IDs for any accepted external URI spaces
  • It is bootstrapped from interlanguage links from wikidata/sameas-all-wikis and generic/interlanguage-links
  • In the future, we will add more links and IDs taken from datasets on the Databus, such as geonames or musicbrainz

Community Extensions

Please read the documentation on the Databus: