DBpedia extracts knowledge from Wikipedia using a set of extractors. Some extractors are generic (language independent), while others are language specific and usually have to be configured for the target language. This page describes how to configure the different extractors for a particular language.
Note that some configurations are easy to adjust, while others require a better understanding of the code that relies on them. In such cases we recommend reading the code to see how the configuration is used.
DO NOT FORGET to announce your configuration improvements to the DBpedia core team (e.g. by reporting on http://forum.dbpedia.org) so that we enable the particular extractor for your language (in case it is not yet enabled) in the next DBpedia release.
Most of the configurations can be found at:
https://github.com/dbpedia/extraction-framework/tree/master/core/src/main/scala/org/dbpedia/extraction/config
Custom parsers for various data types (date-time, months, durations, geo-coordinates, etc.).
Date-Time parser - configuration: DateTimeParserConfig.scala. Language specific configurations:
"de" -> Map("januar"->1,"februar"->2,"märz"->3,"maerz"->3,
"en" -> Map("BCE" -> -1, "BC" -> -1, "CE"-> 1, "AD"-> 1, "AC"-> -1, "CE"-> 1)
"en" -> "st|nd|rd|th"
"birth date" -> Map ("year" -> "1", "month"-> "2", "day" -> "3")
Duration parser - configuration: DurationParserConfig.scala. Language specific configurations:
"cs" -> Map(
"vteřiny" -> "second", ...)
Flag template parser - configuration: FlagTemplateParserConfig.scala. Language specific configurations:
"en" -> Set(
"flagicon", //0
"flag", //0
"flagcountry" //0
),
"fr" ->
Map(
"ALA"->"Åland",
"AFG"->"Afghanistan",
"ZAF"->"Afrique du Sud",
"ALB"->"Albanie",
GeoCoordinate parser - configuration: GeoCoordinateParserConfig.scala. Language specific configurations:
longitudeLetterMap = Map(
"en" -> Map("E" -> "E", "W" -> "W"),
Scale parser - configuration: ParserUtilsConfig.scala. Language specific configurations:
"en" -> Map(
"thousand" -> 3,
"million" -> 6,
"mio" -> 6,
"mln" -> 6,
"billion" -> 9,
"bln" -> 9,
"trillion" -> 12,
"quadrillion" -> 15
The mapping configurations can be found at:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/config/mappings/
Date interval parser - configuration: DateIntervalMappingConfig.scala. Language specific configurations:
"en" -> Set("present", "now"),
"en" -> "since",
"en" -> "onward",
"en" -> "to"
Disambiguation extractor - configuration: DisambiguationExtractorConfig.scala. Language specific configurations:
"cs" -> " (rozcestník)",
"de" -> " (Begriffsklärung)",
"el" -> " (αποσαφήνιση)",
"en" -> " (disambiguation)",
Gender extractor - configuration: GenderExtractorConfig.scala. Before configuring the gender extractor, please familiarize yourself with the code in GenderExtractor.scala. Note that for some languages the gender extractor may be difficult to configure. Language specific configurations:
"en" -> Map("she" -> "female", "her" -> "female", "he" -> "male", "his" -> "male", "him" -> "male", "herself" -> "female", "himself" -> "male",
"She" -> "female", "Her" -> "female", "He" -> "male", "His" -> "male", "Him" -> "male", "Herself" -> "female", "Himself" -> "male" //TODO why not just do case insensitive matches?
)
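To give a feel for what a pronoun map enables, the sketch below counts pronoun hits in a text and returns the gender that occurs most often. This is a simplified majority vote written for this page, not the logic of GenderExtractor.scala; the object `GenderSketch` and the `guess` method are hypothetical.

```scala
object GenderSketch {
  // Pronoun -> gender, copied from the "en" excerpt above.
  val pronouns: Map[String, Map[String, String]] = Map(
    "en" -> Map(
      "she" -> "female", "her" -> "female", "he" -> "male", "his" -> "male",
      "him" -> "male", "herself" -> "female", "himself" -> "male",
      "She" -> "female", "Her" -> "female", "He" -> "male", "His" -> "male",
      "Him" -> "male", "Herself" -> "female", "Himself" -> "male"
    )
  )

  /** Returns the gender whose pronouns occur most often; None on a tie or no hits. */
  def guess(lang: String, text: String): Option[String] = {
    val table = pronouns.getOrElse(lang, Map.empty[String, String])
    val counts = text.split("""[^\p{L}]+""").toList
      .flatMap(word => table.get(word).toList)      // keep only known pronouns
      .groupBy(identity)
      .map { case (gender, hits) => gender -> hits.size }
      .toList
      .sortBy { case (_, n) => -n }                 // most frequent first
    counts match {
      case (_, n1) :: (_, n2) :: _ if n1 == n2 => None
      case (gender, _) :: _                    => Some(gender)
      case Nil                                 => None
    }
  }
}
```

A language whose pronouns do not encode gender, or whose verbs carry it instead, would need a different approach, which is one reason the configuration can be difficult.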
Homepage extractor - configuration: HomepageExtractorConfig.scala. Language specific configurations:
"cs" -> Set("Webová stránka", "Oficiální web"),
"de" -> Set("website", "homepage", "webpräsenz", "web", "site", "siteweb", "site web"),/*cleanup*/
"el" -> Set("ιστότοπος", "ιστοσελίδα"),
"en" -> Set("website", "homepage", "web", "site"),
"cs" -> "Odkazy",
"de" -> "Weblinks?",
"el" -> "(?:Εξωτερικοί σύνδεσμοι|Εξωτερικές συνδέσεις)",
"en" -> "External links?",
"cs" -> "oficiální",
"de" -> "offizielle",
"el" -> "(?:επίσημος|επίσημη)",
"en" -> "official",
Image extractor - configuration: ImageExtractorConfig.scala. Language specific configurations:
"en" -> """(?i)\{\{\s?non-free""".r,
NIF text extraction - configuration: nifextractionconfig.json. Language specific configurations:
"nif-find-pageend":[
"span[id=Gesprochene_Version]",
"span[id=Weblinks]",
"span[id=Literatur]",
"span[id=Anmerkungen]",
"span[id=Quellen]",
"span[id=Einzelnachweise]"
],"nif-find-pageend":[
"span[id=Gesprochene_Version]",
"span[id=Weblinks]",
"span[id=Literatur]",
"nif-remove-elements":[
"a[title*='Datei'] ~ sup",
"a[title*='Datei']",
"a[title*='Hörbeispiel']",
".hauptartikel-pfeil"
],
"nif-note-elements":[
"div.sieheauch -> ($c)",
"div.hauptartikel -> ($c)"
]
Great, you have learned how to configure extractors for different languages. Now go ahead and add or update the configuration for your language and open a pull request. Some tips on where to start:
DO NOT FORGET to announce your improvements to the DBpedia core team (e.g. by reporting on http://forum.dbpedia.org) so that we enable the particular extractor for your language (in case it is not yet enabled).
Thank you for your contribution!
TODOs
- add more info on text config, integrate https://github.com/dbpedia/extraction-framework/wiki/Improving-CSS-selectors-for-NIF-extraction and https://github.com/dbpedia/extraction-framework/issues/535