Statistical reporting has reached EoL

Sad news. I am no longer able to reach the machine where the statistics and code for the WP:5000 live, even after multiple restarts. Basic triage by colleagues in Philadelphia has been unsuccessful. The next time I travel to PHL I will take a look at the hardware, but that is not going to happen any time soon. I would be happy to assist someone in setting up an identical service elsewhere using my code. Can someone template this situation appropriately on the main page? West.andrew.g (talk) 18:18, 20 January 2020 (UTC)

@West.andrew.g: what's necessary machine- and networking-wise to run your code? - Scarpy (talk) 22:52, 20 January 2020 (UTC)
@Scarpy: The daily ingestion and weekly reporting are pretty straightforward. A shell script runs via 'cron' a couple of times daily to see if the WMF has uploaded a new stats file, brings it back via 'curl' if it has, and then gets to work. I store things in a MySQL database that probably swells to ~200GB (maybe less) with a year's worth of data (one table logs activity, and there is another table for each week's data). The report is published using the API with en.wp account credentials. The yearly report is a bit messier and uses some very long join/union statements that take considerable time to run. The workload/style probably isn't great for a laptop, could work on a constantly connected desktop, and would work best on a non-dedicated server (absent the yearly processing, it is probably working for 30min-1hr each day). Thanks, West.andrew.g (talk) 17:04, 24 January 2020 (UTC)
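
For illustration only, the check-then-fetch step described above could be sketched as follows. This is Python rather than the actual shell script, and every name in it is hypothetical: `list_remote`, `download`, and `process` stand in for the 'curl' calls and the database load, and `seen` stands in for whatever bookkeeping records which files were already ingested.

```python
from typing import Callable, Iterable, List, Set


def ingest_new_files(
    list_remote: Callable[[], Iterable[str]],
    download: Callable[[str], bytes],
    process: Callable[[str, bytes], None],
    seen: Set[str],
) -> List[str]:
    """Fetch and process any published stats files not ingested yet.

    All four arguments are hypothetical hooks: list_remote returns the file
    names currently published, download fetches one file (the 'curl' step),
    process loads it into storage, and seen records what was already handled.
    """
    ingested = []
    for name in sorted(list_remote()):
        if name in seen:
            continue  # already ingested on an earlier cron run
        process(name, download(name))
        seen.add(name)
        ingested.append(name)
    return ingested
```

Because the function does nothing when no unseen file exists, running it from 'cron' several times a day is harmless; only the first run after an upload does any work.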

FYI, I am working on a collaboration that might restore weekly statistical reporting. West.andrew.g (talk) 20:44, 27 January 2020 (UTC)

@West.andrew.g: k. would be happy to host if that doesn't work out, or even in the meantime. - Scarpy (talk) 22:18, 27 January 2020 (UTC)

Large images on the list

When I was reading this list, I saw four giant images pop up. They are: the logo of YouTube (#80), the flag of Brazil (#480), TVA Sports (#1493), and WhatsApp (#1908). What is the reason behind the appearance of these giant images? They look very strange. Sanjay7373 (talk) 21:10, 1 February 2020 (UTC)

We are re-working the statistical aggregation. Entries from the "File:" namespace were not excluded in this early version of the code. The list will be as it always has been (i.e., sans images) once we are fully up and running again on the new platform. West.andrew.g (talk) 21:18, 1 February 2020 (UTC)
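
As a rough illustration of the exclusion mentioned above (not the code actually used for the report), a namespace filter can be as small as a prefix check. The `(title, views)` row shape, the function name, and the `EXCLUDED_PREFIXES` constant are assumptions made for this sketch.

```python
from typing import Iterable, List, Tuple

# Assumption: titles carry the usual "Namespace:Title" prefix form.
EXCLUDED_PREFIXES = ("File:",)


def drop_excluded(rows: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep only (title, views) rows whose title is outside the excluded namespaces."""
    return [(title, views) for title, views in rows
            if not title.startswith(EXCLUDED_PREFIXES)]
```

For example, `drop_excluded([("YouTube", 10), ("File:YouTube Logo.svg", 9)])` keeps only the article row.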

Would a top 1000 daily report be helpful, as a stand-in?

I just noticed that this report is down. It looks like the top 1000 articles by daily views can be easily gathered with a single REST API query. I think I could script that up and run a daily print-to-wiki cronjob pretty quickly. Maybe within a day or so. Would that be helpful, until we can get the top 5000 weekly job working again? Let me know if I can help... J-Mo 20:22, 14 March 2020 (UTC)
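
A minimal sketch of the single-query idea above, assuming the Wikimedia Pageviews REST API's "top" endpoint is the one meant. The helper names `top_url` and `parse_top` are hypothetical, and the HTTP fetch itself is left out so the snippet stays self-contained.

```python
import json
from datetime import date
from typing import List, Tuple

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"


def top_url(day: date, project: str = "en.wikipedia", access: str = "all-access") -> str:
    """Build the request URL for one day's most-viewed list."""
    return f"{API_ROOT}/{project}/{access}/{day:%Y/%m/%d}"


def parse_top(body: str) -> List[Tuple[str, int]]:
    """Extract (article, views) pairs from the JSON response body."""
    articles = json.loads(body)["items"][0]["articles"]
    return [(a["article"], a["views"]) for a in articles]
```

The body can be fetched with any HTTP client (for example `urllib.request.urlopen` on `top_url(...)`); Wikimedia asks API clients to send a descriptive User-Agent header.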

Since I had all the code bits and authentication set up already for a different project, I went ahead and built a top 1000 daily report on test.wikipedia.org. I used my work account this time, since that account had the proper OAuth credentials for testwiki, but if this report is useful I could easily make it a daily HostBot task running on a page like User:HostBot/Top_1000_report (or anywhere else). J-Mo 22:02, 14 March 2020 (UTC)
Update: I went ahead and did this. It's running as a daily cron job. Code on GitHub, announcement on Village Pump. Still interested in improvement suggestions or other ways to help out. Cheers, J-Mo 21:01, 15 March 2020 (UTC)
@Jtmorgan: thanks! Is it possible to get it to aggregate over weeks and months? - Scarpy (talk) 02:41, 23 March 2020 (UTC)
@Scarpy: I've updated the top 1000 report. It still runs daily, but now it calculates the cumulative weekly pageviews for all articles that appeared among the top 1000 most-viewed articles on any day within the past week. It will update daily at around 15:00 UTC. This report makes a few assumptions that vastly simplify the process of calculating weekly views. As a result, it may miss a few articles that never quite made it into a daily top 1000 list, but nevertheless had a consistently high-ish volume of views over a 7-day period. However, this limitation should only impact the 'tail' of the report; you can trust that the ranking and the total weekly counts for the vast majority of the articles listed in the report are representative. Let me know if you have questions or suggestions! Cheers, J-Mo 00:05, 2 April 2020 (UTC)
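
To make the limitation concrete, here is one possible simplification, sketched in Python. It is an assumption for illustration and not necessarily what the HostBot code does: it sums only the views from days on which an article appeared in that day's list, so days spent just outside the daily top 1000 contribute nothing. The function name and the dict-per-day input shape are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def weekly_totals(daily_lists: Iterable[Dict[str, int]],
                  limit: int = 1000) -> List[Tuple[str, int]]:
    """Sum per-article views across daily top lists and rank the result."""
    totals: Counter = Counter()
    for day in daily_lists:
        totals.update(day)  # adds that day's views for each listed article
    # Sort by views descending, then by title, so ties are ordered deterministically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]
```

Under this scheme an article that sits just below the daily cutoff all week never enters `totals`, which matches the described effect on the tail of the report, while articles near the top are counted on every day.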