sparql - How to handle Wikipedia Named Entities that have the same Category name
I am trying to extract companies, and I ran the following query:
prefix cat: <http://dbpedia.org/resource/Category:>
prefix dcterms: <http://purl.org/dc/terms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>

select distinct ?page ?subcat {
  ?subcat skos:broader* cat:Companies_of_the_United_States_by_industry .
  ?page dcterms:subject ?subcat .
  ?page rdfs:label ?pagename .
}
In a snapshot of the results, Amgen and Pfizer are both company categories, so I end up collecting everything filed under Pfizer and Amgen (people, products). I found out that these entries belong to a Wikipedia category called Category:Wikipedia_categories_named_after_companies_of_the_United_States or Category:Wikipedia_categories_named_after_pharmaceutical_companies_of_the_United_States. I tried to filter out these categories and did this:
select distinct ?page ?subcat {
  ?subcat skos:broader* cat:Companies_of_the_United_States_by_industry .
  ?page dcterms:subject ?subcat .
  ?page rdfs:label ?pagename .
  filter( !regex(?subcat, "Wikipedia_categories_named_after_pharmaceutical_companies_of_the_United_States") )
}
But no luck; they are still there. Any idea how to avoid this problem?
The problem doesn't have to do with them having the same name. Wikipedia categories don't form a type hierarchy, so it doesn't make sense to treat them as one. The reason you see the results you're seeing is that there's a category Pfizer, its broader values include the company listings, and it is the dcterms:subject of dbpedia:Alprazolam, dbpedia:Cetirizine, etc. That doesn't make sense as a type hierarchy, but it's fine for organizing article topics. If you want just companies back, then ask for things that are companies:
select distinct ?page ?subcat {
  ?subcat skos:broader* category:Companies_of_the_United_States_by_industry .
  ?page dcterms:subject ?subcat .
  ?page rdfs:label ?pagename .
  ?page a dbpedia-owl:Company
}
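To check this explanation for a single page, you can ask which of its categories lead up to the top category. This is only a sketch: it assumes dbpedia:Alprazolam is still filed under a Pfizer category in the DBpedia release you query, and it spells out the prefixes (dbpedia:, category:) that the public endpoint normally predefines.

```sparql
prefix dcterms:  <http://purl.org/dc/terms/>
prefix skos:     <http://www.w3.org/2004/02/skos/core#>
prefix dbpedia:  <http://dbpedia.org/resource/>
prefix category: <http://dbpedia.org/resource/Category:>

# Which categories of this one page reach the company listing?
select distinct ?subcat {
  dbpedia:Alprazolam dcterms:subject ?subcat .
  ?subcat skos:broader* category:Companies_of_the_United_States_by_industry .
}
```

If a Pfizer category shows up in ?subcat, the drug page is reaching the company listing through a topic category, not because it is a company.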
We can clean up the company query a bit, though. You're not using ?pagename, so you can remove it. You can use some of the shorter syntaxes to make things a little bit cleaner. You can also note that "Companies … industry" has a skos:broader value "Companies of the United States", which makes the intent of the query a bit clearer.
select distinct ?company ?subcategory {
  ?company dcterms:subject ?subcategory ;
           a dbpedia-owl:Company .
  ?subcategory skos:broader* category:Companies_of_the_United_States .
}
limit 1000
As a final note, the category hierarchy doesn't mean that each company has a single path to the top category. That is, a company might be listed multiple times, e.g.:
company    subcategory
------------------------------------
companyx   textile_companies
companyx   companies_in_new_hampshire
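If you do want to keep the subcategories but still get one row per company, one option is to aggregate them. This is a sketch rather than part of the original answer: it assumes a SPARQL 1.1 endpoint that supports group_concat, and the separator and the ?subcategories variable name are arbitrary choices.

```sparql
prefix dcterms:     <http://purl.org/dc/terms/>
prefix skos:        <http://www.w3.org/2004/02/skos/core#>
prefix category:    <http://dbpedia.org/resource/Category:>
prefix dbpedia-owl: <http://dbpedia.org/ontology/>

# One row per company; all matching subcategories joined into one string.
select ?company
       (group_concat(distinct str(?subcategory); separator=" | ") as ?subcategories)
{
  ?company a dbpedia-owl:Company ;
           dcterms:subject ?subcategory .
  ?subcategory skos:broader* category:Companies_of_the_United_States .
}
group by ?company
limit 1000
```

With the aggregate in place, limit applies to companies, not to company/subcategory pairs.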
Unless you need the listing of subcategories, you might consider eliminating it from the query, in which case you can have (using property paths):
select distinct ?company {
  ?company a dbpedia-owl:Company ;
           dcterms:subject/skos:broader* category:Companies_of_the_United_States .
}
limit 1000