Wikipedia:Wikipedia Signpost/2015-06-24/Special report

Source: Wikipedia, the free encyclopedia.


Special report

Small impact of the large Google Translation Project on Telugu Wikipedia

Belgium, according to the Telugu Wikipedia, translated from the English Wikipedia original; one of the thousands of articles generated by the Google Translation Project. But how much did this help Wikipedia?

During 2009–2011 Google ran the Google Translation Project (GTP), a program that used paid translators to translate the most popular English Wikipedia articles into various Indian-language Wikipedias. The program was organized as part of a bid to extend and improve Google Translate software services in various languages: in a presentation[1] at Wikimania 2010 a company presenter stated that "Google has been working with the Wikimedia Foundation, students, professors, Google volunteers, paid translators, and members of the Wikipedia community to increase Wikipedia content in Arabic, Indic languages, and Swahili". For more background on the effort, see Signpost coverage of Wikimania 2010 and of the Bengali and Swahili experience.

The Google Translation Project was at first visible only through "Recent changes" items whose comments mentioned the use of a "Google translator toolkit".[2] This toolkit was first made public in June 2009; Google initially experimented with Hindi, but quickly expanded the initiative to Arabic, Tamil, Telugu, Bengali, Kannada and Swahili. Google shared the details in a presentation[1] at Wikimania 2010. At the same event, a critique of the GTP[3] was presented by a Tamil Wikipedian on behalf of its author, Ravishankar, who could not attend due to visa delays. The critique identified many of the issues facing the project. First, what was popular amongst English readers rarely matched what was popular amongst Tamil readers. Second, the translations had many quality problems: too many red links, mechanical translation, and operational problems such as the overwriting of stub articles. Google tried to address the community's recommendations for improving the quality of the generated content by engaging in a dialogue, but did not succeed. In response to a query from the author, Google announced the closure of the project in June 2011.[4] It also announced the launch of the "Indic web" and the availability of Google Translate for several Indic languages.[5]

As one of the first large-scale human-aided machine translation efforts on Wikipedia, the project also exposed important philosophical friction within the community over the nature of volunteerism on the projects. Still unaddressed, that friction would re-emerge in the debate over the role and propriety of bots on the Swedish Wikipedia, which passed the million-article milestone in 2013 with almost half of its articles (~454,000) created by bots.

In this review, Wikipedia's metrics on contributions and page requests are used to analyze the impact of the project by focusing, as a case study, on just one of the targeted wikis: the Telugu Wikipedia. The full data and code are also available[6] so that other communities can validate the analysis and apply it to their own Wikipedias.

Impact

As of April 2015 the 61,000 articles of the Telugu Wikipedia make it the third largest Indian-language Wikipedia, behind Hindi and Tamil.[7] The site has 54 active editors (editors making more than 5 edits a month), ranking it 53rd among the 247 Wikipedia projects with more than 1,000 articles.[8] The site draws ~2.6M page requests per month.[9] About one third of the wiki's articles describe villages within the Telugu-speaking states of India, most of them initially created via bot scripts.

Google provided very little information about the Google Translation Project directly, so the details of its contributions and impact on the Telugu Wikipedia were gathered by scanning revision comments for the automatically inserted link to the toolkit, "http://translate.google.com/toolkit".[10] From this it can be gathered that the Telugu Wikipedia branch of the project ran for 2 years and involved 65 translators and 1989 pages, amounting to approximately 7.5 million words.[11] The total cost is estimated at ~750,000 USD.[12] The project increased the article count by only 4.6%, but because the translated articles were large, the project size (measured by word count) increased by 200%.
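The identification step and the cost estimate above can be sketched as follows. This is a minimal illustration, not the author's actual method (the article's footnotes point to an SQL query on Quarry): the `Revision` record, the function names and the sample rows are hypothetical, and only the marker URL, the word count and the assumed rate of 0.10 USD per word come from the article.

```python
from dataclasses import dataclass

# Link that the toolkit inserted into revision comments (from the article).
TOOLKIT_MARKER = "http://translate.google.com/toolkit"


@dataclass(frozen=True)
class Revision:
    """Hypothetical minimal view of one revision record."""
    page_title: str
    user: str
    comment: str


def gtp_revisions(revisions):
    """Keep only revisions whose comment carries the toolkit link."""
    return [r for r in revisions if TOOLKIT_MARKER in r.comment]


def summarize(revisions):
    """Count distinct pages and distinct accounts among toolkit revisions."""
    hits = gtp_revisions(revisions)
    return {
        "pages": len({r.page_title for r in hits}),
        "translators": len({r.user for r in hits}),
    }


def estimated_cost_usd(words, rate_per_word=0.10):
    """Cost estimate under the article's assumed per-word translation rate."""
    return round(words * rate_per_word)


# Made-up sample rows, for illustration only.
sample = [
    Revision("Page A", "translator1", "Translated with " + TOOLKIT_MARKER),
    Revision("Page A", "volunteer1", "copyedit"),
    Revision("Page B", "translator2", TOOLKIT_MARKER),
    Revision("Page C", "volunteer2", "new village stub"),
]

print(summarize(sample))
print(estimated_cost_usd(7_500_000))
```

Run against the full revision history, the same filter would yield the page and translator counts quoted above; the cost line simply multiplies the estimated 7.5 million words by the assumed rate.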

Wikipedia edits

Did the work done as part of the Google Translation Project have a long-term impact on editing within the Telugu Wikipedia, positive or negative?

Telugu Wikipedia key parameters percentage change during 2008–2012

The graph above charts the yearly change in the number of Wikipedians (accounts, active or inactive, that have made at least one edit) and the growth in project size (measured in millions of words) against the translation project's timetable, demarcated in red. The wiki was coming off a peak in 2007–2008, driven by a large influx of accounts and edits that followed the publication of a few features on Wikipedia in the Sunday edition of Eenadu, a major Telugu newspaper, during 2006 and 2007.[13] This growth was unrelated to the translation project and lies outside the period under consideration, which starts in June 2009, the month the translation project began. Percentage growth had already fallen by the time the GTP commenced. The percentage growth in the number of Wikipedians declined during the project period and continued to do so after its conclusion, indicating that the GTP had little to no impact on participation levels. The percentage growth in content, on the other hand, jumped from about 44% year-on-year in June 2009 to 86% and 91% in June 2010 and June 2011, respectively.

Did this rapid growth stimulate further development in the year afterwards? No. The Google Translation Project did multiply the wiki's content while it was active (from 3.6 to 12.8 million words between June 2009 and June 2011), but once it concluded, growth returned to roughly 1 million words per year, the same rate the core volunteer community had generated before the project began. The absolute figures are shown below:

Month     Accounts     Accounts growth   Project size          Project size growth
          with edits   (year-on-year)    (millions of words)   (year-on-year)
2012/06   506          +78               13.8                  +1.0
2011/06   428          +80               12.8                  +6.1
2010/06   348          +67               6.7                   +3.1
2009/06   281          +57               3.6                   +1.1
2008/06   224          +121              2.5                   +1.2
2007/06   103                            1.3
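The percentage growth rates quoted in the text can be recomputed from these absolute figures. The sketch below copies the figures from the table; the variable and function names are illustrative and not taken from the author's published code.

```python
# (month, accounts with edits, project size in millions of words), oldest first.
# Figures copied from the table above.
SNAPSHOTS = [
    ("2007/06", 103, 1.3),
    ("2008/06", 224, 2.5),
    ("2009/06", 281, 3.6),
    ("2010/06", 348, 6.7),
    ("2011/06", 428, 12.8),
    ("2012/06", 506, 13.8),
]


def yearly_growth_percent(snapshots):
    """Return {month: (accounts growth %, size growth %)} versus the prior year."""
    growth = {}
    for prev, cur in zip(snapshots, snapshots[1:]):
        _, prev_accounts, prev_size = prev
        month, accounts, size = cur
        growth[month] = (
            100 * (accounts - prev_accounts) / prev_accounts,
            100 * (size - prev_size) / prev_size,
        )
    return growth


growth = yearly_growth_percent(SNAPSHOTS)
for month, (accounts_pct, size_pct) in growth.items():
    print(f"{month}: accounts {accounts_pct:.0f}%, size {size_pct:.0f}%")
```

The size column reproduces the 44%, 86% and 91% figures for June 2009, 2010 and 2011, while the accounts column falls steadily after the 2008 peak.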

Even allowing for the large volume of articles generated and the small size of the core editing community, engagement between volunteers and the GTP articles remains threadbare: as of May 2015 just 9 of the 1989 articles created by the project (~0.45% of the total) have received substantial improvements from community volunteers.

Page requests

Google's hypothesis behind the launch of the translation project was that more content (meaning more words) naturally leads to more Google searches and thereby to more page requests. Is this true?

Telugu Wikipedia page requests during 2008–2012

Above I chart page requests for the entire Telugu Wikipedia for the period June 2009 – June 2012, both raw (in black) and smoothed (in blue). Again, red demarcates the GTP's time period. Page requests reached an all-time peak of 4.5M in February 2010 and declined rapidly afterwards; they appear to have returned to base levels even before the conclusion of the project. A negative effect on page requests cannot be ascribed to the project, but neither can a positive one. Instead the data simply backs up what Ravishankar had pointed out all the way back in 2010: what was popular amongst English-language readers rarely mattered to their Tamil equivalents, and the same evidently holds for Telugu readers. Despite the size and expense of the translation project, the articles generated by the GTP together account for just 6% of total page requests (as of March 2014).

The same information can be put another way. Every week since June 2007, the Telugu Wikipedia has featured on its front page an article that a senior Wikipedian has assessed as being of reasonable quality, without a formal review process. These volunteer-developed articles, equivalent to B-class on the English Wikipedia, receive on average four times the page requests of GTP pages (also as of March 2014), as detailed in the following section.

(Un)popularity of GTP pages

To understand their popularity, non-mobile page requests for GTP pages and for non-GTP featured articles are compared for the month of March 2014. (The featured set covers articles featured up to December 2013 and excludes 6 improved GTP pages that were themselves featured.) The entire wiki received 1.9M non-mobile page requests. The 1989 GTP pages received a total of 107,424 page requests, amounting to 5.7% of the total. By contrast, the 328 non-GTP featured pages received a total of 66,805 page requests, amounting to 3.5% of the total. On a per-page basis, volunteer-contributed featured articles, with 204 page requests per page, were about four times as popular as GTP pages, with 54 page requests per page.
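The per-page comparison is simple arithmetic on the March 2014 totals quoted above, sketched here for readers who want to repeat it on another wiki. The names are illustrative, and the wiki-wide total uses the rounded 1.9M figure given in the text.

```python
WIKI_TOTAL_REQUESTS = 1_900_000  # non-mobile requests, March 2014 (rounded in the text)

# name: (number of pages, total page requests), as quoted in the article.
GROUPS = {
    "GTP pages": (1989, 107_424),
    "non-GTP featured pages": (328, 66_805),
}


def popularity(pages, requests, wiki_total=WIKI_TOTAL_REQUESTS):
    """Return (requests per page, share of all wiki requests in percent)."""
    return requests / pages, 100 * requests / wiki_total


stats = {name: popularity(*figures) for name, figures in GROUPS.items()}
for name, (per_page, share) in stats.items():
    print(f"{name}: {per_page:.0f} requests/page, {share:.1f}% of traffic")

ratio = stats["non-GTP featured pages"][0] / stats["GTP pages"][0]
print(f"featured vs GTP, per page: {ratio:.1f}x")
```

The ratio comes out a little under four, which is what the text rounds to "about four times".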

The popularity of these pages can also be compared with that of the village article stubs created during a bot run in 2008, which have remained largely unimproved. About 29,820 such pages received an estimated 175,640 page requests (based on a sample of 1,000 pages receiving 5,546 requests, and using 5.89 requests per page, the upper bound of the 95% confidence interval from a one-sample t-test). This amounts to 9.2% of page requests. A GTP page, with 54 page requests, is thus about 9 times more popular than a bot-created village stub, with at most 5.89.
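The extrapolation from the 1,000-page sample can be sketched as below. The raw sample is not reproduced in the article, so the extrapolation uses the quoted upper bound of 5.89 directly. The helper `upper_bound_95` is an assumption of this sketch: it uses a normal approximation rather than the t-test named in the text (the two are very close at n = 1,000), and the data passed to it here is synthetic.

```python
import math
import statistics


def upper_bound_95(sample):
    """Upper end of a two-sided 95% confidence interval for the mean.

    Normal approximation; a stand-in for the t-based interval in the text.
    """
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return statistics.fmean(sample) + z * standard_error


# Figures quoted in the article.
STUB_PAGES = 29_820
UPPER_BOUND_PER_PAGE = 5.89      # requests per stub page, upper bound
WIKI_TOTAL_REQUESTS = 1_900_000  # non-mobile requests, March 2014
GTP_PER_PAGE = 54

estimated_stub_requests = STUB_PAGES * UPPER_BOUND_PER_PAGE
stub_share = 100 * estimated_stub_requests / WIKI_TOTAL_REQUESTS
gtp_vs_stub = GTP_PER_PAGE / UPPER_BOUND_PER_PAGE

print(f"estimated stub requests: {estimated_stub_requests:.0f}")
print(f"share of traffic: {stub_share:.1f}%")
print(f"GTP vs stub, per page: {gtp_vs_stub:.1f}x")

# Synthetic per-page counts, only to show how the helper is called.
synthetic_sample = [4, 6] * 50
print(f"synthetic upper bound: {upper_bound_95(synthetic_sample):.2f}")
```

Using the upper bound rather than the sample mean (5,546 / 1,000, or about 5.5) gives the stubs the benefit of the doubt, so the ninefold gap is a conservative estimate.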

The present situation and lessons learned

The improvement generated by the GTP could not be scaled: the GTP failed to stimulate growth in the Telugu Wikipedia community, which therefore did not have the resources to improve the articles it generated.

Bringing a semi-automatically translated page like Belgium (illustrated above) up to the quality expected of a typical featured article requires 6–8 hours of work, an appreciation of what makes a page good, easy access to the original English page to clarify translation problems, and help in matching translated pages with interested Wikipedians. Some community members were upset when volunteer-developed pages were overwritten by the GTP; the community discussed the poor quality of the translated pages several times while the project was running and also proposed stopping it. Because Google had started engaging with the Tamil community, and assured the Telugu community that its concerns would be addressed once progress had been made with Tamil, the community did not pursue the matter further. The wait proved futile: the Tamil community engagement did not succeed, and Google announced the project's closure in June 2011. Decision making on small wikis is typically difficult: only about 5–10 people take part in a discussion, and because consensus is hard to assess in such a small group, even a single objection usually results in the rejection of a proposal.

I have not been able to assess quantitatively, through surveys, the quality perceived by a reader who lands on a bot-created stub or a translated page. The only WMF survey applicable to Telugu (the Global South survey of 2014) did not yield useful results because of flaws in its design and implementation. There have, however, been on-wiki discussions about people treating Wikipedia as a game, repeatedly clicking the 'Random article' link until they hit a decent article not created by a bot and counting the clicks as a measure of quality. Based on these, I can say that a negative impression is certainly created in the mind of a casual reader in either case when such articles make up a significant percentage of the wiki.

Wikipedias grow organically through human editing, with the occasional involvement of bots that seed stubs on topics of interest to the human editors. The GTP provides an example of an intervention that attempted to grow content in depth for a smaller number of pages, and it did partially succeed in the sense that the results were significantly more popular than the earlier bot-created village stubs. Unlike those free, volunteer-developed bots, however, this project cost hundreds of thousands of dollars of external money and time. Given the gap between the size and expense of the project and the tiny sliver of the wiki's traffic its pages generate, and given the lack of impact on both page requests and editor numbers, the GTP is a quantifiable failure from the perspective of Wikipedia's editors and readers.

The Wikimedia Foundation was not able to see past Google's sole success criterion, the number of words added, to the potential adverse effects of the project on the wikis. It thought the communities would be able to deal with the issue, as it related to content. Being small, the communities could do little more than voice protests, which did not affect the GTP. The WMF was thankful when it received a $2M donation from Google.[14] I hope that this project serves as an eye-opener and leads the Foundation to play a more active role in dealing with external agencies that bring their own agendas to small Wikipedias.

To better manage the growth of Wikipedia, bot projects or translation projects for new pages should carefully consider the ability of the community to support the intervention. They should define the expected quality, such as the scope and depth of coverage, and proceed in phases with appropriate prioritization. Since this study provides a baseline expectation of popularity for either kind of initiative, proposals should set a specific target above that baseline. Each subsequent phase should be taken up, with appropriate modifications, only after the results have been assessed against the target. This assessment should also include a survey of readers to ascertain the usefulness of the initiative and its impact on the perception of Wikipedia's quality. In languages like Telugu with small active communities, this type of initiative should be taken up only when the Foundation or a corporate sponsor properly sponsors at least one person (full or part time, depending on the nature and scope of the initiative). Simply trying out such a project because it worked on a large Wikipedia, or seemed to work in a small pilot, will not be useful.

Acknowledgements

The author acknowledges the Wikipedia tool makers Domas Mituzas, Henrik, Erik Zachte and Yuvi Panda for their excellent statistics and query support tools. He also acknowledges Ravishankar for the critical review of the Google Translation Project. He thanks Vyzasatya, a Telugu Wikipedian, for his help in reviewing this article. He also thanks the R project team for the excellent open-source R language, and the faculty and community of the Coursera R programming course for helping him learn R and use it for this analysis. Thanks are due to The Signpost's editors for their feedback and help in improving the article.

References

  1. ^ a b Galvej, Michael (2010). "Submissions/Google translation – Wikimania 2010 in Gdańsk". wikimania2010.wikimedia.org. Retrieved 19 June 2015.
  2. ^ "Google Translate Blog: Translating Wikipedia". googletranslate.blogspot.in. 2010. Retrieved 28 May 2015.
  3. ^ Ayyakanu, Ravishankar (2010). "A Review on Google Translation project in Tamil Wikipedia" (PDF). Retrieved 28 May 2015.
  4. ^ A, Ravishankar (2011). "[Wikimediaindia-l] Google's Indic Wikipedia translation project closing down". lists.wikimedia.org. Retrieved 19 June 2015.
  5. ^ "Official Google Blog: Google Translate welcomes you to the Indic web". googleblog.blogspot.in. 2011. Retrieved 19 June 2015.
  6. ^ Chavala, Arjuna Rao (2015). "Github repository with data and analysis for data scientists for reproducing/validating the research". github.com. Retrieved 17 June 2015.
  7. ^ Latest summary statistics of Telugu Wikipedia
  8. ^ Latest statistics of Telugu Wikipedia
  9. ^ Latest page requests for Indian languages
  10. ^ An example of a database query taking advantage of this fact:
    Chavala, Arjuna Rao (2015). "SQL query using Quarry for Pages translated using Google Translate". quarry.wmflabs.org. Retrieved 17 June 2015.
    Incidentally the Google Translation Toolkit is still live and publicly accessible to everyone with a Google account.
  11. ^ Based on an average contribution of 1M words per year by volunteer Wikipedians during 2008/06–2009/06 and 2011/06–2012/06, from the Wikipedia statistics table presented.
  12. ^ Assuming 0.10 USD per word of translation, the standard industry rate.
  13. ^ EENADU (http://eenadu.net), Nov 5, 2006, "మన తెలుగు...వెబ్‌లో బహుబాగు", February 8, 2007 and "వెబ్ లో తెలుగు వెలుగులు", Jun 10, 2007
  14. ^ "Press releases/Wikimedia Foundation announces $2 million grant from Google – Wikimedia Foundation". wikimediafoundation.org. 2010. Retrieved 22 June 2015.