Wikipedia:Wikipedia Signpost/2023-11-06/Wikidata

Wikidata

Evaluating qualitative systemic bias in large article sets on Wikipedia

The presence of systemic bias on Wikipedia has been well established by several studies.[1][2][3][4] Most studies demonstrate simple quantitative bias by noting that a particular class of articles has fewer instances than another similar class (e.g. there are more biographies about men than women, and there are more articles about Paris than all of Africa[5]).

Qualitative bias in article content (i.e. biased information in existing articles) is more difficult to assess. Creation of new articles helps address simple quantitative bias in the number of articles, but it does not independently remedy bias in the encyclopedia if those articles are orphaned or poorly interlinked. Integration of missing topics requires an assessment of due weight to determine where, for example, a new biography about a female mathematician should be linked from articles about her area of expertise, awards she won, or other mathematicians she may have influenced. It's also possible that some of her works would meet notability guidelines (WP:BK) and are also missing. The complexity of this situation means that assessing qualitative bias on the basis of even a limited class of articles (e.g. 1,000 women mathematicians) quickly implicates many thousands of other articles outside that class.

I propose a systematic approach for assessing bias by using Wikidata to create an ontological map of related articles. This approach generates a set of statements that could potentially be missing from the encyclopedia, and is intended to be a step toward assessing qualitative bias in large sets of articles. In combination with a separate assessment of sources to determine due weight (only marginally developed here) this approach could lay a foundation for identifying and integrating missing information from large sets of articles and help counter systemic bias (WP:CSB).

Environmental justice on Wikipedia

For this exercise, I examined articles related to environmental conflicts and environmental justice. Wikipedia is overwhelmingly created by white males in Europe and North America[6] – a demographic that generally benefits from environmental injustice. Information about Indigenous people and the Global South is notably absent,[7][8] creating gaps in knowledge about those portions of the world that bear the burdens of environmental injustice. There are about 4,000 environmental conflicts currently listed in the Environmental Justice Atlas, and the number is growing. Given the complexity discussed above, these 4,000 conflicts could have implications for tens of thousands of articles (some existing, and some missing) about environmental justice campaigns, resource extraction projects (e.g. mines, pipelines, gas fields), notable environmental defenders (individuals and organizations), corporations, commodities, disasters and other events, threatened ecosystems, and more.

Method

For this exercise, I examined 800 environmental conflicts from the EJAtlas to determine how they were represented on Wikipedia and Wikidata. I used script-aided matching to iteratively sort the conflicts and establish an ontological structure comprising statements that define relationships between entities. Wikidata statements have a subject-predicate-object format. For example, the relationship "The Escobal mine protests oppose the Escobal mine" is represented as: Q106830477 (Escobal mine protests) → P5004 (in opposition to) → Q16957078 (Escobal mine). That statement does not exist in Wikidata at the time of this writing.
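The subject-predicate-object format can be illustrated with a small sketch. The `Statement` class and its tab-separated output are assumptions made for illustration, not the scripts used in this exercise; only the three identifiers come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One subject-predicate-object statement using Wikidata identifiers."""
    subject: str    # item ID (Q...)
    predicate: str  # property ID (P...)
    obj: str        # item ID (Q...)

    def as_tsv(self) -> str:
        # Tab-separated triple; this layout is a hypothetical choice for the sketch.
        return "\t".join((self.subject, self.predicate, self.obj))


# The example from the text: Escobal mine protests, in opposition to, Escobal mine.
escobal = Statement("Q106830477", "P5004", "Q16957078")
print(escobal.as_tsv())
```

Keeping each statement as an immutable triple makes it straightforward to deduplicate statements in a set before proposing them for Wikidata.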

After compiling the list of conflict titles, I queried them in the English Wikipedia, returning the two most relevant matches. Of these conflicts, 488 returned at least one proposed matching entity (61%). I then sorted through the conflicts to remove false positives and confirm that the proposed matches clearly related to the EJAtlas conflict title. This process confirmed a match for about 20% of the 800 EJAtlas conflicts; roughly two thirds of the initial 488 results were potentially false positives.
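A query of this kind could be sketched as follows. This is not the script used in the exercise: it assumes the lookup goes through the MediaWiki Action API's full-text search (`list=search`), and the function names and User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(title: str, limit: int = 2) -> str:
    """Build a MediaWiki full-text search URL for one conflict title."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": title,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_titles(payload: dict) -> list[str]:
    """Extract candidate article titles from a search response."""
    return [hit["title"] for hit in payload.get("query", {}).get("search", [])]


def search(title: str, limit: int = 2) -> list[str]:
    """Query the live API (requires network access)."""
    headers = {"User-Agent": "ejatlas-matching-sketch/0.1"}  # hypothetical agent name
    req = Request(build_search_url(title, limit), headers=headers)
    with urlopen(req, timeout=30) as resp:
        return parse_titles(json.load(resp))
```

Calling `search("Escobal mine")` would return up to two candidate titles, which still need the manual confirmation step described above to weed out false positives.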

Following this initial matching, I reduced the size of the set to the first 250 conflicts (including unmatched conflicts), and developed a second script that enabled matching of multiple entities to a single conflict and placement of the matched entities into one of several categories:

  • Conflict (protest, social movement; directly corresponds to EJAtlas entry)
  • Project (e.g. mine, pipeline)
  • Resource (e.g. lake, protected area)
  • Company
  • Environmental organization
  • Disaster

To aid accurate and detailed matching, this script also provided a text input and one-click querying to retrieve additional information about the conflicts: up to five relevant Wikipedia entities, the lead paragraph of a Wikipedia article, the lead paragraph of an EJAtlas description, or the company name from a Wikipedia infobox.[a]
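The matching of several categorised entities to a single conflict can be pictured with a simple data structure. This is a minimal sketch, not the script itself: the class and function names are hypothetical, the two matched records reuse examples cited in this article, and the third record is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    CONFLICT = "conflict"
    PROJECT = "project"
    RESOURCE = "resource"
    COMPANY = "company"
    ENVIRONMENTAL_ORGANIZATION = "environmental organization"
    DISASTER = "disaster"


@dataclass
class ConflictRecord:
    """One EJAtlas conflict and the Wikidata entities matched to it."""
    ejatlas_id: int
    title: str
    matches: dict[str, Category] = field(default_factory=dict)  # QID -> category

    def add_match(self, qid: str, category: Category) -> None:
        self.matches[qid] = category


def share_matched(records: list[ConflictRecord]) -> float:
    """Fraction of conflicts with at least one matched entity."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.matches) / len(records)


gabon = ConflictRecord(172, "Uranium in Gabon")
gabon.add_match("Q455484", Category.COMPANY)      # Areva
katondo = ConflictRecord(120, "Demeter International Katondo Farm")
katondo.add_match("Q3364406", Category.RESOURCE)  # Bwabwata National Park
unmatched = ConflictRecord(0, "placeholder conflict")  # hypothetical, no matches
print(share_matched([gabon, katondo, unmatched]))
```

Storing matches as a mapping from entity ID to category lets one conflict carry any number of entities while keeping each entity in exactly one category.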

I was able to match 161 entities to 113 conflicts from the reduced set of 250, so 45% of conflicts were matched with at least one entity, up from 20% in the first iteration. Forty-eight conflicts (about 20% of the set) were matched to more than one entity. Most of the 161 entities were projects (45%) or companies (26%).

I used these matches to generate a partial ontology for 250 conflicts consisting of 481 relationships, including 250 general statements establishing the conflicts themselves. Very few of these statements are presently included on Wikidata, although a few were added during the activity. Many of the conflicts remain undefined, and some properties that seem integral to environmental conflict ontologies do not exist.

Partial ontology for 250 conflicts
| Entity | Example | Property | Entity | Example | Number of statements |
| --- | --- | --- | --- | --- | --- |
| Conflict | Q106830477 Escobal mine protests | in opposition to P5004 | Project | Q16957078 Escobal mine | 70 |
| Conflict | Q106830477 Escobal mine protests | in opposition to P5004 | Company | Q7675621 Tahoe Resources | 42 |
| Organization | Q4807540 Asociación pola defensa da ría | advocates for P2650 (interested in) | Resource | Q3326808 Ria de Pontevedra | 1 |
| Organization | Q4807540 Asociación pola defensa da ría | participant in P1344 | Conflict | EJAtlas ID 76 (Pontevedra industrial complex) | 5 |
| Resource | Q945954 Penobscot River | polluted by (property does not exist) | Project | Q122334386 Old Town paper mill | 9 |
| Project | Q5101849 Chirano gold mine | owned by P127 | Company | Q546880 Kinross Gold | 21 |
| Company | Q455484 Areva | participant in P1344 | Conflict | EJAtlas ID 172 (Uranium in Gabon) | 42 |
| Project | Q2219131 Paraguaná Refinery Complex | caused P1536 or significant event P793 | Disaster | Q5803217 Amuay tragedy | 4 |
| Conflict | EJAtlas ID 120 (Demeter International Katondo Farm) | has location P276 or advocates for P2650 | Resource | Q3364406 Bwabwata National Park | 29 |
| Disaster | Q115812365 Earlimart pesticide poisoning | facet of P1268 | Conflict | Q109968152 Pesticide incidents in the San Joaquin Valley (EJAtlas ID 140) | 8 |
| General statements | Conflict | instance of P31 | | environmental conflict (Q5683226) | 250 |

These statements are a small portion of all possible relationships that could comprise environmental conflict ontologies; they were chosen to illustrate the process and to (mostly) make use of existing Wikidata relationships more than to recommend a particular ontological structure. Obviously, a different set of relationships would have to be derived for a different set of articles (such as that for female mathematicians discussed above).
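As a check on the arithmetic, the statement counts in the table above can be tallied in a few lines. The counts are copied from the table; the dictionary layout and variable names are only illustrative.

```python
from collections import Counter

# Statement counts per (subject type, property, object type), copied from the table.
COUNTS = {
    ("Conflict", "in opposition to (P5004)", "Project"): 70,
    ("Conflict", "in opposition to (P5004)", "Company"): 42,
    ("Organization", "advocates for (P2650)", "Resource"): 1,
    ("Organization", "participant in (P1344)", "Conflict"): 5,
    ("Resource", "polluted by (no property exists)", "Project"): 9,
    ("Project", "owned by (P127)", "Company"): 21,
    ("Company", "participant in (P1344)", "Conflict"): 42,
    ("Project", "caused (P1536) or significant event (P793)", "Disaster"): 4,
    ("Conflict", "has location (P276) or advocates for (P2650)", "Resource"): 29,
    ("Disaster", "facet of (P1268)", "Conflict"): 8,
}
GENERAL_STATEMENTS = 250  # one "instance of (P31)" statement per conflict

relational = sum(COUNTS.values())
total = relational + GENERAL_STATEMENTS

# Tally the relational statements by the type of entity in the subject position.
by_subject = Counter()
for (subject, _prop, _obj), n in COUNTS.items():
    by_subject[subject] += n

print(relational, total)
print(by_subject.most_common())
```

The relational statements sum to 231, which together with the 250 general statements gives the 481 relationships reported for the partial ontology.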

Suggestions for further work

The low matching rate found in this activity limits possible statements, and is at least partially due to poor coverage of environmental conflicts on Wikimedia platforms.[b] Although this experiment did not systematically determine what proportion of missing conflicts would meet general notability guidelines (GNG), only seven articles about a conflict were found in the final set of 250 conflicts for which an ontology was generated (7/250 = 3%).[c] Given that the EJAtlas is a moderated platform that requires secondary sourcing, and that conflicts in general are frequently newsworthy topics of discussion in academic literature, this rate seems absurdly low. I did identify several missing entities that met GNG and worked with another editor to create articles for them.[d] Since notability requirements for Wikidata are lower, presence of a conflict in the EJAtlas is sufficient to establish a conflict entity as an instance of (P31) environmental conflict (Q5683226). The statements proposed above assume that these entities would be created on the basis of the EJAtlas entry.

This examination suggests that Wikipedia's coverage of environmental conflict is poor. It also suggests that without further script development, it would take a little over fifty hours of work to accurately match about half of the conflicts listed in the EJAtlas to at least one existing Wikidata entity. Accurate and efficient matching of the remaining 50% (and more complete matching in general) would probably require additional development. These matches could be used to facilitate an information exchange between the EJAtlas and Wikipedia that has potential to improve the quality of information on both platforms.[e]

This work could set a foundation for correction of bias in coverage of environmental conflicts in Wikipedia by identifying entities that are missing entirely, or articles that may lack information about environmental conflict. The structured relationships also provide a framework that should facilitate script-aided editing to establish the missing information (in combination with further development to identify supporting sources).

Sources and due weight

In order to evaluate whether a missing entity meets GNG, relevant sources would have to be assessed. That assessment is mostly beyond the scope of this exercise, but I did develop a script to extract sources from an EJAtlas entry about a conflict. Similar scripts could be developed to identify relevant sources on Google Scholar or other databases, and these tools should make it relatively easy to establish whether a particular entity meets GNG — although that assessment is frequently subjective, as evidenced by many impassioned debates at AFD! Assessing due weight for inclusion of a conflict (or any concept) in other related articles is much more difficult, though tools could and should be developed to aid that assessment.
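Source extraction of this kind might look like the following sketch. It is not the script mentioned above, and it makes no assumption about the actual markup of an EJAtlas page: it simply collects absolute links from any HTML fragment using Python's standard `html.parser` module. The sample fragment and the names `LinkCollector` and `extract_links` are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            is_absolute = bool(value) and value.startswith(("http://", "https://"))
            if name == "href" and is_absolute and value not in self.links:
                self.links.append(value)


def extract_links(html: str) -> list[str]:
    """Return unique absolute links in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up fragment for illustration; not the real markup of an EJAtlas page.
sample = (
    '<p>See <a href="https://example.org/report">report</a> '
    'and <a href="/about">about</a>.</p>'
)
print(extract_links(sample))
```

A list of links is only a starting point: each source would still have to be read and judged against GNG by an editor.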

Implications and questions

It would be possible to extend this methodology to any class of articles likely to suffer from systemic bias in its representation on the encyclopedia. Our earlier example of women mathematicians could be represented as an ontology consisting of biographies that are related by statements about mathematical concepts, other biographies, places, awards, books, and technologies. In the case of environmental conflicts, the EJAtlas makes it easier to organise this information and provides a resource for identifying many of the related entities. For other sets, a similar resource would be helpful.

In the case of environmental conflict, some care is warranted to ensure that we develop an ontological structure that minimises harm. The central question is which categories should be differentiated and which should remain ambiguous. Within the platforms explored here, conflicts and conflictive projects are frequently conflated. My initial experiment conflated conflicts and disasters (in the single category of events), though I differentiated these categories in the final iteration. Any structure will erase certain distinctions while preserving others, and this erasure has the potential to do harm.

Continued development of this approach seems worthwhile, given the ongoing violence of environmental injustice, the possibility that this work could reduce harm by making information about that violence more accessible, and the reality that some of this ontology is already implicit in the existing structure of Wikidata, Wikipedia, and the EJAtlas. It will, however, require additional attention to the details of the ontological structure. I doubt that quandaries about how to structure these relationships will have clear and unambiguous answers.

It is also true that sorting through large sets of articles is a lot of work, and perhaps a more organic and less systematic approach could eventually address systemic bias. But given how little attention this problem seems to get (WP:CSB has struggled to remain viable for years, and it teeters on the edge of inactivity) and given the scale of the problem, some organised and systematic approach carried out by a small number of editors seems advisable. This approach is intended to save labour in the long run by facilitating script-aided editing.

Notes

  a. ^ The script displays two conflicts per page and my matching rate is about 1 conflict/minute (including negative matches). Scripts available on request.
  b. ^ Other factors that reduced the matching rate were limitation of queries to the English Wikipedia, inconsistent formatting of conflict titles and descriptions in the EJAtlas, and relatively unsophisticated scripts. Improved scripts could match multiple articles in a single operation or potentially use artificial intelligence as a tool for confirming matches.
  c. ^ At this writing there are only about 165 instances of environmental conflicts on Wikidata, about 25 of which are statements added to existing entities as part of this exercise. Those 25 instances were identified from the 800 conflicts examined in the first iteration of the exercise, so 25/800 = 3% of that set had preexisting articles about conflicts (including articles about environmental disasters).
  d. ^ These include Curipamba project (EJID 776), Detroit waste incinerator (EJID 136), Old Town paper mill (EJID 16), Christmas massacre (Bolivia) (EJID 752), Collum coal mine (EJID 1915), El dorado gold mine (El Salvador) (EJID 781), Bajo la alumbrera mine (EJID 729), Mapithel dam (EJID 943), Mandena mine (expanded) (EJID 1058).
  e. ^ This exchange already exists, of course. Some EJAtlas cases clearly take information directly from Wikipedia, and EJAtlas cases are routinely cited on Wikipedia.

References

  1. ^ Hube, Christoph (2017-04-03). "Bias in Wikipedia". Proceedings of the 26th International Conference on World Wide Web Companion. WWW '17 Companion. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee: 717–721. doi:10.1145/3041021.3053375. ISBN 978-1-4503-4914-7.
  2. ^ Graham, Mark (12 November 2009). "Mapping the Geographies of Wikipedia Content". Mark Graham: Blog. ZeroGeography. Archived from the original on 8 December 2009. Retrieved 16 November 2009.
  3. ^ Livingstone, Randall M. (2010-11-23). "Let's Leave the Bias to the Mainstream Media: A Wikipedia Community Fighting for Information Neutrality". M/C Journal. 13 (6). doi:10.5204/mcj.315. ISSN 1441-2616.
  4. ^ Bjork-James, Carwil (2021-07-03). "New maps for an inclusive Wikipedia: decolonial scholarship and strategies to counter systemic bias". New Review of Hypermedia and Multimedia. 27 (3): 207–228. doi:10.1080/13614568.2020.1865463. ISSN 1361-4568.
  5. ^ Greig, Jonathan (16 April 2021). "For Wikipedia's 20th anniversary, students across Africa add vital information to site". TechRepublic. Archived from the original on 2021-04-19. Retrieved 11 May 2021.
  6. ^ Vrana, Adele Godoy; Sengupta, Anasuya; Bouterse, Siko (2020-10-13), "Toward a Wikipedia For and From Us All", Wikipedia @ 20, The MIT Press, pp. 239–258, ISBN 978-0-262-36059-3, retrieved 2023-10-14
  7. ^ Sethuraman, Manasvini; Grinter, Rebecca E.; Zegura, Ellen (2020-06-15). "Approaches to Understanding Indigenous Content Production on Wikipedia". COMPASS. ACM: 327–328. doi:10.1145/3378393.3402249. ISBN 978-1-4503-7129-2.
  8. ^ Duncan, Alexandra (2020). "Towards an activist research: Is Wikipedia the problem or the solution?". Art Libraries Journal. 45 (4): 155–161. doi:10.1017/alj.2020.24. ISSN 0307-4722.