Wikipedia:Wikipedia Signpost/2015-05-20/In focus

Source: Wikipedia, the free encyclopedia.
In focus

The awful truth about Wikimedia's article counts

The article counts of many Wikimedia wikis suddenly changed on 29 March 2015: as the Signpost reported at the time, sixty-five wikis fell below milestones tracked at the Wikimedia News Meta page, and three increased to new milestones. Among these wikis, the largest absolute changes were a decrease of 281,624 articles in the English Wikisource (a 27% drop) and an increase of 4421 entries in the Persian Wiktionary (an 8% rise). The most extreme relative changes were a 98% decrease in Sindhi Wikinews articles (from 749 to 13), and a 23% increase in Bengali Wiktionary entries (from 920 to 1134).
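The relative changes quoted above can be reproduced from the before and after counts. The following is only a small arithmetic sketch; the function name `relative_change` is illustrative and does not come from any Wikimedia tool.

```python
def relative_change(before: int, after: int) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined when the starting count is zero")
    return (after - before) / before * 100.0


# Before/after figures quoted in the paragraph above.
sindhi_wikinews = relative_change(749, 13)
bengali_wiktionary = relative_change(920, 1134)

print(f"Sindhi Wikinews:    {sindhi_wikinews:+.1f}%")
print(f"Bengali Wiktionary: {bengali_wiktionary:+.1f}%")
```

Rounded to whole percentages, these give the 98% decrease and 23% increase stated above.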

The proximate cause of the large changes was the running of a maintenance script that recounts articles from scratch. For several reasons the article counts reported by most Wikimedia wikis have long been inaccurate: many negligibly so, but some by ridiculously large amounts. The maintenance script corrected these inaccuracies on almost all of Wikimedia's content wikis (except Wikibooks), but because not all of the root causes of the incorrect counts have been fixed, the script will be run once per month so that the article counts can no longer drift far from their correct values for long.

The three main reasons the article counts needed to be fixed are:

  • The definition of what constitutes an "article" has varied in the past.
  • The actual implementation of article counting in different parts of the MediaWiki software has been inconsistent.
  • The article counts have never been recalculated for all Wikimedia wikis at once to fix the resulting problems.

What follows is a brief explanation of why the article counts became wrong over time, a description of what changes in article counts were actually observed when the counts were fixed, and some suggestions of what issues related to article counting the wider Wikimedia community may wish to discuss further. Further technical details are available at meta:Article counts revisited.

What defines an article?

Because articles are the main "public face" of a wiki (the mechanism by which content is presented to readers), both the Wikimedia wiki communities and the Wikimedia Foundation like to advertise article counts. The raw data is visible on each wiki's Special:Statistics page, available off-wiki through the MediaWiki API, and presented at pages such as meta:List of Wikipedias/Table; article-count milestones (such as reaching 10,000 or 100,000 articles) are routinely announced on the Wikimedia News page, and major milestones are often reported by the Signpost.

The first definition of what constituted an article in the early years of Wikipedia was that it was a page in the main namespace that was not a redirect and contained at least one comma. This worked fine for articles in English, since it ruled out exceptionally short pages that were not to be considered useful articles, but proved a poor solution for projects in other languages, especially those that represent or use commas differently, or lack them entirely. A short discussion and vote were held on the Meta-Wiki in March 2003, where it was decided that the definition would be changed: a page would now be counted as an article if it was a non-redirect in the main namespace, as before, and contained at least one internal wikilink instead of a comma. While it is not clear whether the voters intended for category links or interwiki links to count, the actual implementation of the decision in the MediaWiki software simply checked for the string "[[" anywhere in the source text of the page, thereby counting several different types of legitimate links (1–5 below), as well as two types of "fake" links (6 and 7), and one type of non-link (8):

  1. page links → [[Babel]], [[Talk:Babel]], etc.
  2. category links → [[Category:Software]]
  3. image/file links → [[File:Yes.png]]
  4. interlanguage links → [[de:Wikipedia:Hauptseite]] or [[:de:Wikipedia:Hauptseite]]
  5. interwiki links → [[species:]]
  6. hidden links → <!-- [[don't look at me]] -->
  7. deactivated links → <nowiki>[[look at me]]</nowiki>
  8. any text containing the string "[[" → wikilinks start with "[["

(Note that links like [[:Category:Software]] and [[:File:Yes.png]], which start with an initial colon, are regular page links of type 1.)

Discovering this, some editors started gaming the system by routinely placing <!--[[--> (an HTML comment) on all of their pages, just to get them counted as articles!
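The substring check described above is easy to imitate. The sketch below is not MediaWiki's code; `naive_has_link` is a hypothetical stand-in that only shows why hidden links, deactivated links, and the bare string all pass the test.

```python
def naive_has_link(wikitext: str) -> bool:
    """Imitate the check described above: is "[[" anywhere in the page source?"""
    return "[[" in wikitext


# Sample page sources, mostly taken from the numbered list above.
samples = {
    "page link": "See [[Babel]].",
    "category link": "[[Category:Software]]",
    "hidden link": "<!-- [[don't look at me]] -->",
    "deactivated link": "<nowiki>[[look at me]]</nowiki>",
    "bare string": 'wikilinks start with "[["',
    "gaming comment": "Stub text <!--[[-->",
    "no link at all": "Just one plain sentence.",
}

for label, text in samples.items():
    print(f"{label:17} counted as article: {naive_has_link(text)}")
```

Every sample except the last is counted, including the comment that exists only to game the count.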

At the same time, the comma-based definition of an article was not entirely abandoned. Instead, a configuration variable was added to allow wikis to optionally retain that counting criterion. Additional variables were added over the next few years to allow more flexibility in article counting. In June 2006 a variable was added to allow articles to be in namespaces other than the main one; this change was taken advantage of mainly by Wikisources, which routinely count "Author", "Page", and "Index" namespaces along with the main one as their "content namespaces". In May 2011 a variable switch was added that enabled any one of three different article-counting criteria. Two were essentially reimplementations of the link- and comma-based counting criteria already in use, while the third introduced the option of counting any content namespace page as an article. Currently the English Wikibooks and Portuguese Wikibooks use the "comma" criterion; the Czech Wikinews, Chinese Wikinews, and Gujarati Wikisource use the "any" criterion; and all other Wikimedia wikis use the "link" based criterion.
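The three criteria and the notion of content namespaces can be sketched as a single predicate. This is a simplified illustration under stated assumptions: the `Page` class, the `method` strings, and the redirect test are hypothetical, and the "link" branch uses a plain substring check as a stand-in for a real link test.

```python
from dataclasses import dataclass


@dataclass
class Page:
    namespace: str  # "" stands for the main namespace; e.g. "Author", "Talk"
    text: str       # raw page source


def is_redirect(page: Page) -> bool:
    # Simplification: a page whose source starts with "#REDIRECT" is a redirect.
    return page.text.lstrip().upper().startswith("#REDIRECT")


def is_article(page: Page, method: str, content_namespaces: frozenset) -> bool:
    """Apply one of the three counting criteria ("link", "comma", "any")."""
    if page.namespace not in content_namespaces or is_redirect(page):
        return False
    if method == "any":
        return True                 # every non-redirect content page counts
    if method == "comma":
        return "," in page.text     # original comma-based criterion
    if method == "link":
        return "[[" in page.text    # stand-in for a real link test
    raise ValueError(f"unknown counting method: {method!r}")


stub = Page(namespace="", text="Short stub without links")
author = Page(namespace="Author", text="Born 1800, died 1870.")

print(is_article(stub, "any", frozenset({""})))                  # True
print(is_article(stub, "link", frozenset({""})))                 # False
print(is_article(author, "comma", frozenset({""})))              # False
print(is_article(author, "comma", frozenset({"", "Author"})))    # True
```

The last two calls show how adding "Author" to the content namespaces, as Wikisources do, changes whether the same page is counted.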

How are they counted?

The article count of a MediaWiki-based wiki is set to an initial value by a script when the wiki is first created. Afterwards, it can be recounted from scratch using either of two maintenance scripts, but otherwise all operations on the article count are relative changes in response to different actions on the wiki itself. The article count may (or may not) increase if a new page is created or imported into the wiki, or if a previously deleted page is "undeleted"; decrease if a page is deleted or if its edit history is merged with that of another page; and change in either direction if a page is edited or moved between namespaces.
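The idea of updating the count by relative changes can be sketched as follows. The helper `count_delta`, the list of actions, and the starting value are all hypothetical; the point is only that each action contributes -1, 0, or +1 depending on whether the page qualified before and after it.

```python
def count_delta(was_article: bool, is_article_now: bool) -> int:
    """Relative change to the article count caused by one action: -1, 0, or +1."""
    return int(is_article_now) - int(was_article)


# (description, counted before the action, counted after the action)
actions = [
    ("create a page that qualifies", False, True),
    ("create a page that does not qualify", False, False),
    ("edit removes the qualifying content", True, False),
    ("edit restores the qualifying content", False, True),
    ("move out of a content namespace", True, False),
    ("delete a page that was not counted", False, False),
]

article_count = 100  # hypothetical starting value
for description, before, after in actions:
    article_count += count_delta(before, after)
    print(f"{description:38} -> {article_count}")
```

Because the running total is never checked against the pages themselves, a single wrong delta stays in the count until a recount from scratch.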

It had been known at least since 2007 that different elements of MediaWiki code used different criteria for determining whether a page counted as an article. For example, every time a page in the main namespace was saved, it was checked for the string "[["; if articles were recounted using the maintenance script, a certain database table was checked to see if each page really linked to another page on the same wiki (thus counting only links of the first type listed above, but including links provided by templates); if the wiki's site statistics were completely recalculated, each potential article was simply checked to see if it contained any text at all (i.e. a page length greater than zero). The May 2011 changes to the code were supposed to fix such inconsistencies, but it is not clear whether this was fully accomplished.
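The three code paths described above can disagree about the same page. The sketch below models each check as a separate function; the class, field, and function names are hypothetical, and `outgoing_page_links` stands in for whatever the link table records, including links generated by templates.

```python
from dataclasses import dataclass


@dataclass
class Page:
    source: str               # raw wikitext as saved
    outgoing_page_links: int  # page links recorded for this page, templates included


def counted_on_save(page: Page) -> bool:
    return "[[" in page.source            # substring check at save time


def counted_by_recount_script(page: Page) -> bool:
    return page.outgoing_page_links > 0   # real page links only


def counted_by_full_recalculation(page: Page) -> bool:
    return len(page.source) > 0           # any text at all


examples = {
    "category link only": Page("Text [[Category:Software]]", 0),
    "link only via template": Page("{{Infobox}} plain text", 2),
    "plain text, no links": Page("Plain text.", 0),
}

for label, page in examples.items():
    print(
        f"{label:24}",
        counted_on_save(page),
        counted_by_recount_script(page),
        counted_by_full_recalculation(page),
    )
```

None of the three example pages gets the same answer from all three checks, which is how counts could drift depending on which code path last touched them.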

Bugs affecting article counting have continued to be a problem. One particularly tenacious bug, which caused newly imported articles to increase the count by the total number of revisions (page edits) rather than the number of articles, was not resolved until February 2015. It seems likely that unknown but significant issues remain, calling into question whether article counts can ever truly be accurate.
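The effect of the import bug can be illustrated with a toy import batch. The titles and revision numbers below are invented, and the sketch assumes for simplicity that every imported page qualifies as an article.

```python
# Hypothetical import batch: page title -> number of revisions in its history.
imported_pages = {"Alpha": 12, "Beta": 3, "Gamma": 1}


def buggy_import_delta(pages: dict) -> int:
    """Behaviour described above: the count grows by the number of revisions."""
    return sum(pages.values())


def correct_import_delta(pages: dict) -> int:
    """Intended behaviour: the count grows by the number of imported articles."""
    return len(pages)


overcount = buggy_import_delta(imported_pages) - correct_import_delta(imported_pages)
print(f"buggy: +{buggy_import_delta(imported_pages)}, "
      f"correct: +{correct_import_delta(imported_pages)}, "
      f"overcount: {overcount}")
```

Wikis that import pages with long edit histories would accumulate a large overcount this way.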

For high-traffic wikis like the English Wikipedia, it may never be possible to get a completely accurate article count. Over all of 2014, the English Wikipedia averaged about 3 million page edits per month, working out to just over 1 edit per second. At peak periods of editing, the number of edits per second can be far higher. Any of these edits can potentially change the article count (though most do not); even a single edit to a template can change the "article status" of several pages at the same time. Add to this the fact that many servers are simultaneously rendering, caching, and changing page content, and the very existence of a "true" article count becomes debatable. Even if one assumes that at any instant in time there is a true count, it may not be possible for a script to successfully determine it, since the counting process itself is far from instantaneous—edits made in the intervening time may impact the results.
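The edit-rate figure above follows from simple arithmetic. In the sketch below, the monthly total is the rounded value from the text, and the 10-minute recount duration is a purely hypothetical number chosen to show how many edits could land while a recount is running.

```python
EDITS_PER_MONTH = 3_000_000                 # "about 3 million", from the text
SECONDS_PER_DAY = 24 * 60 * 60
AVG_SECONDS_PER_MONTH = 365 * SECONDS_PER_DAY / 12   # 2014 was not a leap year

edits_per_second = EDITS_PER_MONTH / AVG_SECONDS_PER_MONTH
print(f"{edits_per_second:.2f} edits per second")     # prints "1.14 edits per second"

# Hypothetical recount duration, for illustration only.
RECOUNT_SECONDS = 10 * 60
edits_during_recount = edits_per_second * RECOUNT_SECONDS
print(f"about {edits_during_recount:.0f} edits during a 10-minute recount")
```

Even at the average rate, hundreds of edits would be made during a recount of that length, each of which could in principle change the result.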

What about Wikistats?

For much of Wikimedia's history Erik Zachte's Wikistats website has been collecting monthly article counts (and many other statistics) for most Wikimedia wikis, counted offline from periodic database dumps using custom Perl scripts written for that purpose. Unfortunately these statistics cannot be seen as "more accurate" versions of the MediaWiki article counts, because Wikistats counts articles in a completely different way.

Wikistats presents two different types of article counts, a so-called "official" count and an "alternate" count. The "official" count uses a link-based criterion that parallels MediaWiki's link-based one, but treats category links (links that place pages into categories) as being equivalent to regular page links, whereas MediaWiki completely ignores category links when counting articles. The "alternate" count is similar, but adds an additional requirement that the page length be at least 200 characters (or 50 characters for certain non-Latin scripts). Neither matches any of the article-counting criteria used in the MediaWiki software. In addition, the Wikistats counts are recalculated every month on the basis of the then-current edit histories of each wiki, meaning that deleted articles "disappear" from all previous monthly counts going back to when those articles were created. To be more specific, if Wikistats reports that a given wiki contains 1000 articles in February 2015, the next month's report might list that wiki as having had only 990 articles in February 2015, if 10 articles that existed in that month were deleted before the next month's report was compiled.
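The retroactive effect described in the example above can be sketched in a few lines. This is not the Wikistats implementation (which the text says is written in Perl); the `Page` class and `count_for_month` function are hypothetical, and every page is assumed to satisfy the article criterion.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Month = Tuple[int, int]  # (year, month), compared lexicographically


@dataclass
class Page:
    created: Month
    deleted: Optional[Month] = None  # month of deletion, or None if still present


def count_for_month(pages, month: Month, dump_month: Month) -> int:
    """Articles reported for `month` when counted from a dump taken in `dump_month`.

    Pages deleted on or before the dump month are absent from the dump, so they
    drop out of every earlier month as well.
    """
    return sum(
        1
        for p in pages
        if p.created <= month and (p.deleted is None or p.deleted > dump_month)
    )


# 1000 pages existed in February 2015; 10 of them were deleted in March 2015.
pages = [Page(created=(2015, 1)) for _ in range(990)]
pages += [Page(created=(2015, 1), deleted=(2015, 3)) for _ in range(10)]

print(count_for_month(pages, (2015, 2), dump_month=(2015, 2)))  # 1000
print(count_for_month(pages, (2015, 2), dump_month=(2015, 3)))  # 990
```

The same month is reported as 1000 and then 990 articles, depending only on when the dump was taken.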

Erik Zachte explains his approach in this way:

What has recounting done?

On 10 May 2012, following a bug report, articles were recounted on all of the language editions of Wiktionary and Wikisource (the articles of which are often called "entries" and "text units", respectively). As reported at Wikimedia News at the time, this caused 8 Wiktionaries to rise to higher milestone levels, and 24 to fall to lower levels; also, 15 Wikisources rose to higher levels and 13 fell to lower levels. 14 Wiktionaries lost all of their entries (a 100% decrease), but all of them had 5 or fewer entries before the change. The total article count of Wiktionary, summed across all language editions, decreased by 220,590 entries (a 1.6% decrease). As for Wikisource, the most extreme changes were seen in the French Wikisource, which increased by 819,297 text units (a 291% increase), and the Thai Wikisource, which decreased by 8,548 text units (a 63% decrease). Including French, 10 Wikisources more than doubled their text units: in fact, the total article count for Wikisource, summed across all language editions, almost exactly doubled, increasing by 1,599,639 text units.

Since it was clear at the time that the articles of Wikimedia's other content projects would also eventually have to be recounted, additional tables were created by a user in May 2012 that showed the changes that would occur if these wikis had their articles recounted. Not surprisingly, it was found that changes of a similar magnitude to those already observed would be seen in the rest of the projects. In particular, 3 Wikipedias, 2 Wikibooks, and 1 Wikiversity would have risen to new milestone levels, whereas 26 Wikipedias, 29 Wikibooks, 30 Wikiquotes, 11 Wikinews, and 2 Wikiversities would have fallen to lower levels (Wikivoyage was not a Wikimedia project at this time). The full information is available for review.

On 29 March 2015, the articles were recounted on all of the languages of Wikipedia, Wiktionary, Wikiquote, Wikisource, Wikinews, Wikiversity, and Wikivoyage. Wikibooks was left out because, as previously mentioned, the English and Portuguese Wikibooks use the "comma" criterion to determine articles, and the maintenance script does not correctly implement that counting method. As the Signpost reported last month, 3 wikis rose to higher milestones and 65 wikis fell to lower milestones. The previous report summarizes the major changes seen in those 68 wikis, but more details are available. In addition, among all 679 recounted wikis the most extreme changes were seen in the English Wikisource, which decreased by 281,199 (a 27% drop), and the English Wikipedia, which increased by 97,285 (a 2% rise). The largest relative increase was seen in the Norwegian (Nynorsk) Wikiquote, which rose by 40% (479 pages). One Wikipedia and 14 Wikiquotes lost all of their articles, all of which had 7 or fewer before the change. The mean absolute change was 1340 articles (up or down); the median absolute change was 39 articles. The total article count for all recounted wikis decreased by 551,440, a 0.9% decrease. A full accounting of the observed changes (including changes to total pages and page edits, for comparison) is available.
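The large gap between the mean and the median absolute change suggests that a few wikis with very large changes dominate the mean. The sketch below uses five invented values, not real wiki data, chosen only so that the toy mean and median match the figures quoted above.

```python
from statistics import mean, median

# Invented signed changes for five imaginary wikis (not real data).
toy_changes = [-3, 10, -39, 120, -6528]

absolute_changes = [abs(change) for change in toy_changes]

print(f"mean absolute change:   {mean(absolute_changes):.0f}")
print(f"median absolute change: {median(absolute_changes):.0f}")
```

One outlier is enough to pull the mean to 1340 while the median stays at 39, so the median is the better guide to what a typical wiki experienced.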

[Figure: Wikis by Number of Articles Lost or Gained (2015, n=679)]

Where do we go from here?

As things currently stand, the article-counting script is set to run on the 21st of each month. This should keep the on-wiki article counts reasonably correct from now on. However, there are some remaining issues that should probably be considered by the wider Wikimedia community. In each case, this will require working with the MediaWiki developers to determine which possible decisions could realistically be implemented.

  • Do we need to reconsider what should count as an article?
    Should MediaWiki's link-based article criterion treat category links the same as regular page links? What about interlanguage links? Should regular page links only count when they are to another page in the main namespace? Should support for the comma-based criterion be dropped since it is slow and still not fully implemented across all parts of the MediaWiki code?
  • Can Wikibooks be included in the periodic recounts?
    Can the two wikis using the comma-based criterion be recounted another way? Can the rest of the Wikibooks project be recounted without those two wikis? Should the English and Portuguese Wikibooks communities consider switching to the "all pages in all content namespaces" criterion instead? Can the maintenance scripts be fixed to correctly implement comma-based article counting?
  • Should other content projects be included in the periodic recounts?
    What about including Wikispecies or Commons? Should every Wikimedia wiki get recounted at least once, if not periodically? Has anyone analyzed the additional server load caused by the recounting?
  • Can article counting ever be improved to the point where script-based recounting is unnecessary to maintain correct "live" counts?
    (Probably not.)
  • Might other site statistics, such as total pages or images, need recounting?
    Should the full site statistics of every Wikimedia wiki be recalculated at least once? Might this make some of the other statistics less correct (i.e., due to bugs)?


The issues raised above as well as many more technical details on the article-counting problem are presented at Article counts revisited on the Meta-Wiki.