Wikipedia:Bots/Requests for approval/ST47ProxyBot: Difference between revisions

Browse history interactively

← Previous edit Next edit →

Content deleted Content added

VisualWikitext

Inline

Revision as of 22:10, 1 December 2019

ST47ProxyBot

Operator: ST47 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 07:33, Sunday, December 1, 2019 (UTC)

Function overview: Block IP addresses belonging to open proxies, public VPN services, and web hosts.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Combination of Python and Perl

Source code available: No, to protect sources and testing methods. Mostly uses pywikibot to interact with the wiki.

Links to relevant discussions (where appropriate): WP:NOP authorizes open proxies to be blocked, Wikipedia:Bots/Requests for approval/ProcseeBot demonstrates that a bot may be used for this task.

Edit period(s): Continuously

Estimated number of pages affected: Initially a higher rate due to a large number of not-currently-blocked proxies, once it reaches steady state, probably about 100 logged actions per day.

Namespace(s): None

Exclusion compliant (Yes/No): No, not applicable

Adminbot (Yes/No): Yes

Function details: WP:NOP states that open proxies may be blocked at any time. Open proxies allow editors to evade blocks, avoid detection, or appear to be multiple users when they were actually only one person. Waiting until these proxies are abused is not practical, as multiple easily-searchable websites advertise thousands of such proxies. The number of blocks to be made is high enough that automation is required.

There is already a bot approved for this task. However, it is beneficial to have multiple independent tools. There are a large number of data sources out there, and I've already been able to find a number of proxies that I can access, but which haven't been blocked. I'm sure others can find things that I can't. Since this involves trying to test the proxies, operators in different geographical areas or homed to different ISPs may have different views on the internet, one ISP may "blackhole" a network that hosts a lot of malicious proxies while another ISP may not, or one bot might have an outage of some sort while another is still running normally.

This bot makes three types of blocks:

Open (HTTP/HTTPS/SOCKS) proxies, tested by accessing Wikipedia through the proxy
Open VPN servers, verified by at least two data sources
IP ranges used for web hosting, cloud services, etc, manually reviewed and curated

In the first two cases, it uses open sources to build a list of IP addresses, which it scans through. For proxy types that can be automatically tested, the bot attempts to access the Userinfo API. VPN testing cannot easily be automated, so instead the bot performs checks against several independent sources to determine if the IP address is a VPN node or not. The original proxy list is the first source, and the bot requires positive responses from at least two more sources before blocking. Once the bot has either successfully accessed API:Userinfo through the proxy or verified the proxy against enough independent data sources, it adds the IP to a list to be blocked.

The block duration varies based on the number of previous proxy blocks of that IP address. Currently, the first block is set for 14 days, ramping up to 2 years after enough blocks. The block is set to account-creation-blocked with anon-only unselected. I.e., this is a hardblock. The block message uses {{blocked proxy}} with a comment providing enough information for me to investigate why the IP was blocked in case there is any question. If the IP is already blocked, the bot skips it. This includes if it is already subject to a local rangeblock. The bot does not check global blocks. If the proxy is an IPv6 IP, the bot blocks the /64 range.

In the case of web hosting IP ranges, these are ranges that I identify through whois data and other sources as belonging to web hosting or other similar types of companies. I generally find an address that is being abused, usually by a spambot, and then investigate and block the entire address space assigned to web hosting at that company. In this case, I manually review the list of ranges to be blocked, and the bot blocks them for a fixed time period, generally using the {{colocationwebhost}} template, and a description of the IP ownership in the block comment. This task will never be fully automated, as it is based on me finding and deciding to block a given set of IP ranges. However, I'm bundling it with this bot request because it also entails making a large number of blocks at once.

Further, this bot may modify blocks that it initially issued, either extending them if they are near expiration and the proxy is still active, or removing them if the proxy is confirmed to be inactive. (The removal would only be after several checks in a row, over a period of time, confirm that the proxy isn't active, and this isn't currently implemented.) ST47 (talk) 07:33, 1 December 2019 (UTC)[reply]

Discussion

To which degree does this task overlap with that of User:ProcseeBot? Also, summoning their botop slakr here. Jo-Jo Eumerus (talk) 11:35, 1 December 2019 (UTC)[reply]

@Jo-Jo Eumerus: Same objective, but I am finding a large number of proxies that are not currently blocked, and that need to be. Possibly because I'm using different data sources or different testing methods. ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]

What mechanism do you have in place to prevent wheel warring, especially with human admins? — xaosflux ^Talk 14:17, 1 December 2019 (UTC)[reply]

@Xaosflux: I will add a check that it will not block any IP address that has ever been unblocked by a human admin. Instead it will log those to a file or to userspace. ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]

Does this bot support IPv6? SQL ^{Query me!} 16:12, 1 December 2019 (UTC)[reply]

Also, as far as the major compute services (amazon, azure, google) - how are you identifying these? SQL ^{Query me!} 16:14, 1 December 2019 (UTC)[reply]

@SQL: It does support IPv6, and blocks the /64 of detected IPv6 proxies. There is no special handling for the major compute services. Most of them are already blocked, if a blocked IP (including via a range block) shows up on one of the proxy lists, I currently don't even bother testing it, no point since it's already blocked. (For the ancillary task of blocking web hosting types of IP ranges, that would be through manual review of the whois information. Basically, if I CU a spambot and find that it's on some minor cloud services company's IP range, I run it through ISP rangefinder, review the netblock names, and block everything that looks webhost-y. The only automation for those is to save me from clicking the block button 100 times.) ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]
ST47, It is often not the case that the major providers are already blocked. Amazon, and azure get new ranges all the time, see: User:SQL/Non-blocked compute hosts. I've always used [1] to detect azure, and [2] to detect amazon. Google is a bit more complicated. SQL ^{Query me!} 18:39, 1 December 2019 (UTC)[reply]

@SQL: What would you suggest doing with this information? Refrain from blocking proxy IPs within that range, or block the entire range? ST47 (talk) 19:14, 1 December 2019 (UTC)[reply]

ST47, I normally completely block those ranges by hand, It can be a bit tedious / tiresome. They're webhosts, and very commonly used by spammers / UPE. SQL ^{Query me!} 21:02, 1 December 2019 (UTC)[reply]

@SQL: Right, I guess my point is that if that range isn't blocked yet, it's still good to directly block any proxies within that range. (Hopefully, that range won't be unblocked for more than a couple of days, and once the range does get blocked, the bots will stop scanning it for proxies, and the short initial block will eventually expire, leaving the long-term rangeblock.) Improving the automation around detecting (and hopefully eventually blocking and unblocking) hosting ranges is important, but isn't this bot request's objective. ST47 (talk) 21:26, 1 December 2019 (UTC)[reply]

Fair enough, I thought that this would fall under point 3, IP ranges used for web hosting, cloud services, etc, manually reviewed and curated. SQL ^{Query me!} 22:10, 1 December 2019 (UTC)[reply]

What authentication mechanism do you plan to use for this bot? Do you plan to secure the account with 2FA? — xaosflux ^Talk 18:19, 1 December 2019 (UTC)[reply]
- @Xaosflux: BotPasswords with only the necessary permissions granted, and obviously a strong unique password on the main account. I don't currently use the beta 2FA extension, preferring instead to use strong random passwords which are only used for a single account on a single website. ST47 (talk) 18:32, 1 December 2019 (UTC)[reply]

How are you going to avoid wheeling with User:ProcseeBot which it appears uses different block lengths than your proposal. We really don't need that bot coming around and blocking for one period, then you coming around and blocking for a different period. — xaosflux ^Talk 18:41, 1 December 2019 (UTC)[reply]
- @Xaosflux: I don't really see what would be wheeling about that. If an IP is already blocked, neither this bot not ProcseeBot would change the block duration. If an IP isn't currently blocked, then whichever bot or human admin gets around to it first would choose a block duration. ST47 (talk) 19:05, 1 December 2019 (UTC)[reply]
  - @ST47: ok so if the other bot thinks a one year block is fine, your bot isn't going to go and change it to a 2 year? — xaosflux ^Talk 20:08, 1 December 2019 (UTC)[reply]
    - @Xaosflux: That is absolutely correct. If an IP is blocked, either directly or via a range block, this bot does not modify the block settings at all. ST47 (talk) 20:14, 1 December 2019 (UTC)[reply]

@@ Line 76: / Line 76: @@
 ::::::{{u|ST47}}, I normally completely block those ranges by hand, It can be a bit tedious / tiresome. They're webhosts, and very commonly used by spammers / UPE. [[User:SQL|<span style="font-size:7pt;color: #fff;background:#900;border:2px solid #999">SQL</span>]][[User talk:SQL|<sup style="font-size: 5pt;color:#999">Query me!</sup>]] 21:02, 1 December 2019 (UTC)
 :::::::{{re|SQL}} Right, I guess my point is that if that range isn't blocked yet, it's still good to directly block any proxies within that range. (Hopefully, that range won't be unblocked for more than a couple of days, and once the range does get blocked, the bots will stop scanning it for proxies, and the short initial block will eventually expire, leaving the long-term rangeblock.) Improving the automation around detecting (and hopefully eventually blocking and unblocking) hosting ranges is important, but isn't this bot request's objective. [[User:ST47|ST47]] ([[User talk:ST47|talk]]) 21:26, 1 December 2019 (UTC)
+::::::::Fair enough, I thought that this would fall under point 3, {{tq|IP ranges used for web hosting, cloud services, etc, manually reviewed and curated}}. [[User:SQL|<span style="font-size:7pt;color: #fff;background:#900;border:2px solid #999">SQL</span>]][[User talk:SQL|<sup style="font-size: 5pt;color:#999">Query me!</sup>]] 22:10, 1 December 2019 (UTC)
 *What authentication mechanism do you plan to use for this bot? Do you plan to secure the account with 2FA? — [[User:Xaosflux|<span style="color:#FF9933; font-weight:bold; font-family:monotype;">xaosflux</span>]] <sup>[[User talk:Xaosflux|<span style="color:#009933;">Talk</span>]]</sup> 18:19, 1 December 2019 (UTC)