Wikipedia:Bots/Requests for approval/Seppi333Bot

Source: Wikipedia, the free encyclopedia.

Revision as of 05:15, 25 December 2019

Operator: Seppi333 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 18:00, Wednesday, November 6, 2019 (UTC)

Function overview:

  • Update 4 massive wikitables every 24 hours (via a task scheduler) with any new/revised content from a database that is updated at the same frequency

Automatic, Supervised, or Manual:

  • Automatic

Programming language(s):

  • Python

Source code available:

Links to relevant discussions (where appropriate): These sections include discussions about the bot itself (i.e., writing the tables using User:Seppi333Bot):

These links pertain more to the wikitables generated by the Python script (the bot function in that script automates writing the pages via Seppi333Bot) than to the bot itself:

Edit period(s):

  • Daily

Estimated number of pages affected:

  • Exactly 4 (listed below)

Namespace(s):

  • Mainspace

Exclusion compliant (Yes/No): Yes

Function details:

You can easily check the function of this bot on the Python code page; the functions were written in a modular fashion. It performs the following tasks in sequence (a minimal sketch of this flow follows the list):

  1. Download the complete set of human protein-coding genes from the HGNC database hosted on an FTP server and write it to a text file.
  2. Read from the text file and write 4 text files containing 4 wikitables, each with ~5000 rows of gene entries.
  3. Log in to Wikipedia on the User:Seppi333Bot account, open the 4 text files that were written to the drive, then:
    1. Open List of human protein-coding genes 1, replace the source code with the wikitext markup in the 1st text file, and save the page.
    2. Open List of human protein-coding genes 2, replace the source code with the wikitext markup in the 2nd text file, and save the page.
    3. Open List of human protein-coding genes 3, replace the source code with the wikitext markup in the 3rd text file, and save the page.
    4. Open List of human protein-coding genes 4, replace the source code with the wikitext markup in the 4th text file, and save the page.
  4. Delete all 5 text files from the drive.
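
As referenced above, a minimal sketch of this four-step flow, using Pywikibot for the page writes. The HGNC URL, local file names, and table layout below are illustrative assumptions, not the operator's actual code:

# Minimal sketch of the four-step update flow; names and table layout are assumptions.
import os
import urllib.request

import pywikibot

HGNC_URL = "ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/locus_groups/protein-coding_gene.txt"
RAW_FILE = "protein-coding_gene.txt"
PAGE_TITLES = ["List of human protein-coding genes {}".format(i) for i in range(1, 5)]

def render_table(rows):
    # Placeholder rendering: a bare sortable wikitable with two columns, assuming
    # the HGNC TSV layout puts the gene symbol and name in the 2nd and 3rd columns.
    lines = ['{| class="wikitable sortable"', "! Symbol !! Name"]
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        lines += ["|-", "| {} || {}".format(fields[1], fields[2])]
    lines.append("|}")
    return "\n".join(lines)

def main():
    # 1. Download the complete HGNC protein-coding gene set to a text file.
    urllib.request.urlretrieve(HGNC_URL, RAW_FILE)

    # 2. Split the dataset into 4 chunks of ~5000 rows and write one wikitable per chunk.
    with open(RAW_FILE, encoding="utf-8") as f:
        rows = f.readlines()[1:]  # skip the TSV header row
    chunk_size = -(-len(rows) // 4)  # ceiling division
    chunk_files = []
    for i in range(4):
        path = "wikitable_{}.txt".format(i + 1)
        with open(path, "w", encoding="utf-8") as g:
            g.write(render_table(rows[i * chunk_size:(i + 1) * chunk_size]))
        chunk_files.append(path)

    # 3. Log in as the bot account and overwrite each list page with the new wikitext.
    site = pywikibot.Site("en", "wikipedia")
    site.login()
    for title, path in zip(PAGE_TITLES, chunk_files):
        page = pywikibot.Page(site, title)
        with open(path, encoding="utf-8") as f:
            page.text = f.read()
        page.save(summary="Bot: update table from the latest HGNC dataset")

    # 4. Delete the downloaded dataset and the 4 generated files.
    for path in [RAW_FILE] + chunk_files:
        os.remove(path)

if __name__ == "__main__":
    main()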

Discussion

FWIW, I've been testing the runBot() function on these sandbox pages since yesterday morning, so I know it works exactly as intended. I just need approval to run this bot in the mainspace. The other functions in the script have been in operation since last week.

Also, does anyone know how I can extend the timeout duration when saving a page? The wikitables are massive, so publishing the edits takes a while, and I can't seem to find a timeout setting in the Pywikibot library. Seppi333 (Insert ) 18:00, 6 November 2019 (UTC)[reply]

Don't know enough about Python to answer the timeout setting question, but what does happen when a human editor alters the tables in any way? Will their edits be overwritten? Jo-Jo Eumerus (talk) 18:05, 6 November 2019 (UTC)[reply]
Well, since the bot blanks the page prior to writing content from the text files, yes. The way I look at these pages is similar to a protected template; you sort of need to request a change to the underlying source code in order for it to stick on the page itself. Seppi333 (Insert ) 18:26, 6 November 2019 (UTC)[reply]
@Jo-Jo Eumerus: I realize this is a bit premature given that I'm not going to run the bot in the mainspace without approval of my bot; but, I've attempted to address the issue you mentioned by notifying other editors about the automatic page updates and where to make an edit request using these edit notices: Template:Editnotices/Page/List of human protein-coding genes 1 and Template:Editnotices/Page/List of human protein-coding genes 2. If you'd like to revise the wording or remove/change the image in the edit notices, please feel free to do so. I know some people find File:Blinking stop sign.gif a bit annoying given how attention-grabbing it is, but I chose it in this particular case to ensure that editors see the edit notice. Seppi333 (Insert ) 02:15, 24 November 2019 (UTC)[reply]
Well, we've had problems in the past with bots adding incorrect content and then re-adding it after human editors tried to fix it; that's why I asked. An edit notice seems like a good idea; perhaps it should also say where to ask about incorrect edits by the bot. Jo-Jo Eumerus (talk) 09:12, 25 November 2019 (UTC)[reply]
@Jo-Jo Eumerus: Hmm. From an algorithmic standpoint, I'm almost positive that the only circumstance in which the bot could write incorrect/invalid content to the page is if the HGNC's protein-coding_gene.txt file is corrupt, since a download failure would raise an error and stop my script (I think this is true - switching to airplane mode mid-download raised an error and stopped the script; I'll test in a few minutes to be certain and reprogram the script to stop if I'm wrong). Should I add another line to the edit notice asking users to revert the page to the last correct version and contact an administrator to block the bot in the event that happens? Seppi333 (Insert ) 21:52, 25 November 2019 (UTC)[reply]
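As an illustration of that fail-fast behaviour, a minimal sketch (the function and variable names are assumptions, not the actual script): any exception raised during the download aborts the run before the wikitables are regenerated.

# Illustrative only: a download failure stops the run, so no stale or partial
# data reaches the wikitables. Names here are assumptions, not the actual script.
import sys
import urllib.error
import urllib.request

def download_dataset(url, dest):
    try:
        urllib.request.urlretrieve(url, dest)
    except (urllib.error.URLError, OSError) as exc:
        sys.exit("HGNC download failed, aborting update: {}".format(exc))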
@Jo-Jo Eumerus: I realized that, because my bot is exclusion compliant, literally any editor can block my bot from editing those pages. Since it's faster for an editor to block my bot by editing the pages upon identifying a problem (which would need to be done anyway) than blocking it by contacting an administrator, I've opted to list that method in the edit notices (e.g., see Template:Editnotices/Page/List of human protein-coding genes 3). It also would save me time to get it up and running again since I wouldn't have to appeal a block. If people start to abuse that block method though, I'll have to make my bot non-exclusion compliant. Seppi333 (Insert ) 10:36, 26 November 2019 (UTC)[reply]
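For reference, a minimal sketch of how such an exclusion check could look with Pywikibot, whose botMayEdit() method honours the {{bots}}/{{nobots}} templates; the wrapper function and its arguments are placeholders, not the operator's code:

# Sketch only: skip any page where an editor has added a bots/nobots exclusion.
import pywikibot

def save_if_allowed(page, new_text, summary):
    # botMayEdit() returns False when the page carries a {{bots}}/{{nobots}}
    # exclusion that applies to this bot account, so the save is skipped.
    if not page.botMayEdit():
        pywikibot.output("Skipping " + page.title() + ": bot excluded on this page")
        return
    page.text = new_text
    page.save(summary=summary)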
  • information Note: This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT 18:24, 7 November 2019 (UTC)[reply]
    Seems to be a sandbox page. Seppi333: would it make sense for you to copy the content to mainspace using your main account, to test for objections to the content itself? –xenotalk 14:52, 16 November 2019 (UTC)[reply]
    That's my intent once I've ironed out the issues with the wikilinks mentioned here; I intend to fix those problems in the coming week and then move it to the mainspace. Seppi333 (Insert ) 19:37, 16 November 2019 (UTC)[reply]

@Xeno: Since I've fixed all of the dablinks (~750ish), I moved these pages to List of human protein-coding genes 1 and List of human protein-coding genes 2 and listed them in a hatnote in the Human genome#Human protein-coding genes section. To centralize discussions about the lists, I also left messages on the talk pages of the lists requesting that new messages be posted on the talk page where the source code is located.

There are still a few mistargeted wikilinks (likely ~100 or so) in the table; I'm going to find and fix all of those soon after I write the natural language processing script that I mentioned in this section. I've identified all of them in User:Seppi333/GeneListNLP#mistargetedLinks.txt.

In any event, since these pages are now in the article space, approval of my bot would make it much easier for me to update those lists; it wouldn't hurt to wait a few days to see if anyone takes issue with any parts of the tables or the layout of the article though. I doubt anyone will object to the existence of the pages themselves since the complete set of human protein-coding genes constitutes the human exome and the human protein-coding genes section cites literature that discusses those genes as a set, so the existence of a list of them on Wikipedia is justifiable by the notability criteria in WP:Notability#Stand-alone lists. It's pretty self-evident that every gene in the table is notable as well given that virtually all ~20000 entries contain 2+ links to databases that provide information about the gene and encoded protein. Seppi333 (Insert ) 21:16, 23 November 2019 (UTC)[reply]

So the first page is about 875k pre-expansion of templates, which seems a bit large (didn’t look at page 2). Onetwothreeip moved it to draft space citing the size and other, unspecified reasons; perhaps they can comment here as to the latter, and you can think about splitting up the page into 4 or more pieces? –xenotalk 09:58, 24 November 2019 (UTC)[reply]
The external links are completely unnecessary and should be removed. I tried to remove them myself but was unable to. Onetwothreeip (talk) 10:50, 24 November 2019 (UTC)[reply]
@Xeno: The List of human genes article (it's technically a WP:Set index article due to the fact that the chromosome pages aren't list articles) breaks up the list by chromosome; I don't think that's a particularly useful way to break up a complete list of protein-coding genes since the gene families (i.e., groups of gene symbols that share a common root symbol that is followed by a number; e.g., TAAR and HTR are root symbols for groups of genes that are covered in those articles) within the tables would be split up and spread across multiple list pages. This is because the genes within a given gene family are not necessarily located on the same chromosome (e.g., the 13 HTR genes are located on 10 different chromosomes). The proteins encoded by the genes within a given family share a common function; consequently, those gene groups constitute sub-lists of related entries in the wikitables, so splitting gene groups across two pages isn't ideal.
With that in mind, I think splitting the list into 4 parts would be fine; I'd rather not split it any more than that because it becomes progressively harder to navigate the list and increasingly likely that more gene groups will be split across two pages. I might end up replacing the locus group column with the gene location, pending feedback at WT:MCB. If I were to do that, the page size would shrink by about 50-100k more.
@Onetwothreeip: (1) not notifying me when you draftified those pages was just rude; (2) I have no clue why you didn't just ask me about the page size; (3) I know that there are no fixed page size limits for list articles specified in policy/guideline pages, so it really irritates me that I had to look for the list of the largest pages in the mainspace instead of you linking to it directly to point out the page size of those lists relative to the others listed in Special:LongPages. Seppi333 (Insert ) 13:14, 24 November 2019 (UTC)[reply]
  • Is this bot approved for a trial? It appears to be editing in mainspace and projectspace without approval. Special:Contributions/Seppi333Bot. ST47 (talk) 23:54, 24 November 2019 (UTC)[reply]
    It isn't. I've blocked it for editing without approval. — JJMC89(T·C) 00:12, 25 November 2019 (UTC)[reply]
All of the edits by the bot were only performed in the project space; it didn’t occur to me until right now that the revision history after a page move would appear to show the bot editing in the mainspace though. I suppose I should’ve just copy/pasted the source code instead.
In any event, the project space pages I was editing were - and still are - merely being used as sandboxes. It would’ve made no difference at all whether I created them in my user space or as a subpage of WP:MCB. I figured creating/editing a new page in the project space with a bot wouldn’t be against policy, but apparently my interpretation wasn’t correct. So, my bad. @JJMC89: If you’re willing to unblock the bot, I’ll confine its edits to my user space. I don’t intend to perform any further edits with my bot if you unblock it though. Seppi333 (Insert ) 02:25, 25 November 2019 (UTC)

On an unrelated note, I’d appreciate input from a BAG editor on this request. I can update the tables in the article space with or without a bot. The only purpose it serves is for my convenience: I don’t want to have to regularly update the tables. Failing approval for my bot, I’ll probably just end up updating the tables on a monthly basis after a while. With approval, the tables would be updated daily since I’d just use a task scheduler to run the bot script every 24 hours. Seppi333 (Insert ) 02:48, 25 November 2019 (UTC)
There are at most 226 mistargeted links present in the current revisions of the 4 list pages; I could fix those right now, but I'd prefer to wait for a trial to use my bot to rewrite the relevant links as piped links due to the amount of time it takes to publish an edit in my browser window. Seppi333 (Insert ) 10:36, 26 November 2019 (UTC)[reply]

I'm going to fix this in the next 24 hours since I notified WT:WikiProject Disambiguation and they've rendered some assistance with disambiguating the links in that list. Seppi333 (Insert ) 07:21, 27 November 2019 (UTC)[reply]
 Done I've fixed the corresponding links in all 4 articles. I have no further updates/changes planned for the tables in the algorithm; all the wikilinks are now correctly disambiguated and without targeting issues. If someone actually responds to this request at some point, please ping me. Seppi333 (Insert ) 15:04, 27 November 2019 (UTC)[reply]

Feedback requested

  • {{BAG assistance needed}}. A trial I think? –xenotalk 02:52, 25 November 2019 (UTC)[reply]
    xeno, I find it difficult to approve a trial for a bot run on pages that are of a questionable nature. In looking through the various discussions I'm seeing concerns about the necessity of the external links, the size of the pages, the need to update every day, and the necessity of the pages themselves. The bot has been (more or less) shown to run properly at its current remit, but if the scope of the page changes then the entire bot run would almost need to go back through a trial. At the very least I'd like to see a consensus about the format/layout/size of the article before approving this. Primefac (talk) 15:33, 8 December 2019 (UTC)[reply]
@Primefac: You mentioned multiple issues here, so I figured I'd follow up on each individually. To summarize what I've stated below, I'm flexible on both the page size (which has since been reduced) and update frequency as well as how to format the references for each entry (NB: these are currently included as ELs). I'm not open to deleting the gene/protein references altogether due to the problem it would create w.r.t. the stand-alone list guideline (WP:CSC - 1st bullet), as explained below. I'm not sure that I understand your concern regarding the page scope.
  • Re: the need to update every day - they really don't need to be updated every day. The database is updated daily, so I figured that would be the most natural frequency. I'm flexible on this point, so if you think the bot should update less frequently, I think an update frequency of anything between once per day to once per week seems fine. An update period of >1 week seems arbitrarily long IMO given that there have been substantive changes to the database entries that are included in the wikitables several times a month since the time that I created them; there's no way to predict when such changes occur.
  • Re: the size of the pages - I'm actually somewhat flexible on this. Around 2 weeks after you replied here, Pigsonthewing raised this issue here: "Page size" thread. In response, I reduced the page size of each list by just over 100,000 bytes and mentioned how I might be able to reduce the page size even further. It's probably worth reading this thread for context.
  • Re: the necessity of the external links - Onetwothreeip and I are at an impasse on this issue; we've been discussing it here: "Still problems with these articles" thread. He believes they serve no purpose and should be deleted instead of being converted to a citation within reference tags. Since there are approximately 8000 redlinked gene symbols in the tables and on the basis of the first bullet point under WP:Stand-alone lists#Common selection criteria, I included an external link for the gene (HGNC) and encoded protein(s) (UniProt) because they serve as an official link for the gene/protein (see the discussion thread for an explanation as to what makes them "official") while simultaneously serving as a reference for the gene and protein (NB: WP:ELLIST explicitly states: "In other cases, such as for lists of political candidates and software, a list may be formatted as a table, and appropriate external links can be displayed compactly within the table: ... In some cases, these links may serve as both official links and as inline citations to primary sources."). The notability for the redlinked entries in these lists is not readily apparent, and whether those entries are relevant to the list is not easily verifiable w/o the HGNC and UniProt links. Many of the lists on the first page of Special:LongPages also employ this method to cite list entries since an external link uses less markup (and hence reduces the page size) relative to adding reference tags for each reference. For context, <ref></ref> is exactly 11 characters long; if I placed 1 pair of reference tags around each of the HGNC and UniProt ELs (w/o any additional reference formatting) for all 5000 gene entries in each wikitable, it would add 11*2*5000=110,000 bytes to the size of each page.
  • I object to your marginalisation of opposition to this. I haven't seen many articles that are blue on your list that really are about the respective gene. Practically all such articles really are about the respective protein encoded by the gene (even if the article begins "XYZ is a gene"). So this list is a fake from the start. The reason is, of course, that genes have no function per se except that of being the object of transcription: it's all in the proteins. --SCIdude (talk) 07:05, 19 December 2019 (UTC)[reply]
    Regarding "I object to your marginalisation of opposition to this": what are you referring to, specifically? I don't really follow your argument. No articles about a protein-coding gene are about just the gene or the encoded protein; they're about both. There are a very limited number of circumstances where it's appropriate to topically separate a gene and an encoded protein across 2+ articles (e.g., it might be prudent to do so when a gene encodes multiple proteins), but even in such cases, the article scope still encompasses the encoded proteins in the gene article and the gene which encodes the protein in the protein articles. The way the bluelinked articles are supposed to be written is covered in MOS:MCB#Sections. The majority of those sections relate to the protein because the encoded protein is the mechanism through which this class of genes affects the organism. Seppi333 (Insert ) 11:07, 19 December 2019 (UTC)[reply]
    So, at least, the pages are misnamed; they are not "lists of human protein-coding genes". --SCIdude (talk) 14:09, 19 December 2019 (UTC)[reply]
  • Re: "if the scope of the page changes then the entire bot run would almost need to go back through a trial" - I'm not certain I fully understand what you meant by this. The only way the list's scope could change is if I were to change the underlying dataset (i.e., "protein-coding_gene.txt") used by my algorithm to a different one; that would entail a complete rewrite of my algorithm (i.e., the wikilink dictionary would need to be deleted or replaced since it'd be moot, the function that generates the tables would have to be entirely rewritten to reflect the scope and structure of the new dataset, and the dataset download function would require at least partial revision). I do not intend to, and wouldn't even consider, doing something like that. If you meant something else though, please clarify.
I'm not sure how I might be able to obtain consensus on these issues since most people don't really care about lists like this; I'll try asking for feedback about the page size of the lists and the external links at WT:MCB and WT:WikiProject Molecular Biology to see if anyone is willing to offer some though. If you have any advice on how I might establish a consensus or have any feedback about the lists, I'd appreciate your input. Seppi333 (Insert ) 05:07, 19 December 2019 (UTC)[reply]
Given that all the articles on Special:LongPages are too long, it's not a good idea to use them as an example of what should be done. There is simply no reason why each gene should be referenced, let alone given an external link. Onetwothreeip (talk) 10:01, 19 December 2019 (UTC)[reply]
While I agree that examples don't make for a very good argument, these lists are also very long; the 20000 entries in these list pages probably constitute one of the longest lists on WP by number of list entries. While the inherent notability of a protein-coding gene may be obvious to some, I'm virtually certain that some editors who aren't privy to this discussion will object to the inclusion of all the redlinks at some point, justifying their argument with Wikipedia:What Wikipedia is not#Wikipedia is not an indiscriminate collection of information, if there are no references provided for them. Given the sheer size of this list (due to its completeness) and the large number of redlinks in it, I think it's necessary to include at least 1 of the current ELs to avoid future objections pertaining to WP:CSC. It's worth pointing out that the sole templated reference that's currently included on these pages merely links to a webpage with download links for machine-readable text/json files that are virtually unreadable to a person (e.g., this is the rather incomprehensible text file that the algorithm uses to generate the tables). TBH, I wouldn't really care about cutting these links if doing so didn't create a different problem. There may be a solution that would address both concerns, but I can't think of one at present. Seppi333 (Insert ) 11:07, 19 December 2019 (UTC)[reply]
I'm concerned not only about the size of these pages, but also the "indiscriminate" aspect. What is the use case for them? Who will use them? What do they offer that a category would not? Or that Wikidata does not? If a significant part of their function is to provide red links for others to work on, then perhaps a Wikipedia-space project page would be appropriate, where the list(s) could be built by WP:Listeria (as indeed they could in mainspace, at least on more enlightened Wikipedias)? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 12:14, 19 December 2019 (UTC)[reply]
Sigh. I'm not sure it's technically correct to use the term "indiscriminate" in this case because the list is literally complete - there are no missing genes in this list which are known to encode a protein - so the list entry selection criteria can't possibly be random. This list was created as a side project when I needed a complete list of protein-coding genes for training a speech-to-text AI, so my own use case inspired the creation of the lists. See here, namely the collapsed section. There are undoubtedly other use cases. HGNC doesn't provide an easily accessible and complete, human-readable list, so I put one here while I was working on my transcription AI. For context on what the entire list represents, search for the word "exome" on this page and read what I wrote previously. FWIW, I imagine there are more use cases related to whole exome sequencing than training an AI, since that list represents the whole exome (at least, the vast majority of it; there are likely a few hundred protein-coding genes - which I imagine are mostly smORFs - that have yet to be identified, hence why I want a bot to update the list).
Addendum: Like I said before, I'm open to splitting the list into 10 pages, but that would nearly triple the amount of time it takes me to update the lists. It already takes me several minutes to do it with the current 4, so I'm not going to split the list further unless either (1) someone offers to help me perform manual updates on 10 pages or (2) this bot is approved to do it for me. The lists have actually been out of date w.r.t. the HGNC database for several days since the gene counts differ, but I haven't updated them yet because performing an update is both tedious and an annoying time sink even with 4 pages. You can probably imagine how motivated I'd be to regularly perform manual updates on 10 pages. Seppi333 (Insert ) 13:04, 19 December 2019 (UTC)[reply]
@Pigsonthewing: 2nd addendum − I stumbled across Wikipedia:Manual of Style/Tables#Size, which links to WP:SPLITLIST; both advocate against splitting lists, and tables in particular, at arbitrary cutpoints. I don't think splitting the list into 4 tables was arbitrary since the pages were by far the largest pages in the mainspace prior to the split into 4 distinct pages. However, splitting the tables across 10 pages doesn't seem like it would be compliant with the MOS guideline on article size (WP:SPLITLIST) or tables (Wikipedia:Manual of Style/Tables#Size) based upon what those sections indicate. There are two final means I can think of that would allow for further reduction of the page size across all 4 pages, without loss of content or context about what the list entries reflect.
The first is to create a template:UniP redirect to template:uniprot and change all of the uniprot templates in the tables to the redirect. This would reduce the bytes per entry by 3, so the first 3 pages would shrink by 3x5000=15000 bytes without any visible change to the lists. The second would be to remove the status column AND all of the gene entries with a status of "Entry Withdrawn", which are the last 100 entries (indices 19201−19300) in List of human protein-coding genes 4, so that the lists only contain approved gene symbols. I would then need to indicate somewhere on these pages that all the gene symbols in these lists are the approved symbols for their corresponding genes. For context, "Entry withdrawn" has a very specific meaning in the HGNC database: it indicates "a previously approved HGNC symbol for a gene that has since been shown not to exist." Cutting that column would reduce the page size of the first 3 pages by 10 bytes per row ⇒ 10x5000=50000 bytes per page. If both of these changes sound fine to you and no one has any objections, I can go ahead and reduce the page size of the first 3 pages by 65000 bytes, but that seems to be as far as I can reduce it simply by restructuring the tables in a manner that doesn't sacrifice content. Let me know what you think when you get a chance. Seppi333 (Insert ) 05:14, 25 December 2019 (UTC)[reply]
  • Actually it's not easy to do this in Wikidata, because of the conflated content of these articles and the multitude of concept types they are linked from (gene / protein / protein family). With this query I get 5,911 missing articles:
SELECT DISTINCT ?gene ?geneLabel
{
  ?gene wdt:P31 wd:Q7187 .        # instance of: gene
  ?gene wdt:P703 wd:Q15978631 .   # found in taxon: Homo sapiens
  ?gene wdt:P688 ?protein .       # the gene encodes some protein
  ?protein wdt:P361 ?family .     # the protein is part of some family
  MINUS {                         # exclude genes that already have an enwiki article
    ?article schema:about ?gene ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
  MINUS {                         # exclude genes whose protein has an enwiki article
    ?article schema:about ?protein ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
  MINUS {                         # exclude genes whose protein family has an enwiki article
    ?article schema:about ?family ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}


So why are pages needed? --SCIdude (talk) 14:27, 19 December 2019 (UTC)[reply]
@SCIdude: That's sort of an odd question. I don't think any article on WP is "needed"; it's really just a question of whether an article meets the notability criteria.
My question was (cryptic): why are these list pages needed if you can get the full list with such a query? --SCIdude (talk) 15:10, 19 December 2019 (UTC)[reply]
Well, I suppose there are 2 reasons. The first is that an article is readily accessible; a query requires a working knowledge of SPARQL or a pre-written script to run. The second, which I think is more important, is that the WP lists would remain fully up-to-date with the HGNC database, whereas WD is updated by PBB, which pulls gene data from NCBI gene, which in turn pulls its approval/nomenclature data from HGNC. So basically, for the WP lists there's no intermediate database that might delay data currency with HGNC (provided that a bot updates them). Seppi333 (Insert ) 00:25, 20 December 2019 (UTC)
Also, if there are ~6000 missing articles in Wikidata and ~8000 redlinks in the tables, then there are roughly 2000 gene symbols that lack a redirect to an existing article on the gene/protein (they're typically categorized with {{R from gene symbol}}). Do you know of any simple methods to determine which gene symbols those articles correspond to using WD's data? Seppi333 (Insert ) 15:01, 19 December 2019 (UTC)
To clarify, the query lists those gene items that 1. don't have an enwiki article and 2. where the encoded protein(s) don't have an enwiki article, and 3. where the associated families don't have an enwiki article. So it does exactly what your list pages do. --SCIdude (talk) 15:10, 19 December 2019 (UTC)
Also there are no missing items in Wikidata: we have all human genes and proteins, and our InterPro family import goes up to IPR040000, so it is quite recent. That's why this query does what I said above. --SCIdude (talk) 15:14, 19 December 2019 (UTC)
I understood what you meant; my point is that there are more sitelinks to gene/protein articles in wikidata than there are bluelinks in the gene lists. The only reason that would happen is if there are articles on genes/proteins which are not located at the pagename of the corresponding gene symbol and which lack a redirect from that gene symbol. I was hoping you knew of a way to identify which redlinked gene symbols need to be redirected to an existing gene/protein article. Seppi333 (Insert ) 15:21, 19 December 2019 (UTC)[reply]
WD usually has no item that links to a redirect, so this can't be done from WD alone. But I have put a useful query on your talk page. Also, I just noticed that the above query counted articles on protein families multiple times because families are associated with every member of the family. The rewritten query would remove the third MINUS block and list all gene items that 1. don't have an enwiki article and 2. where the encoded protein(s) don't have an enwiki article. The number is now 6,078. --SCIdude (talk) 17:24, 19 December 2019 (UTC)

@SCIdude: That dataset you gave me is exactly what I needed, thanks! I should be able to process it and identify the missing gene symbol links and the corresponding redirect targets fairly easily. I can create about 2000 redirects using that data, but it’s going to require another bot approval request. Sigh. Seppi333 (Insert ) 22:27, 19 December 2019 (UTC)[reply]