Launched as Indie Map!
Bear and Ben also took a stab at this. See the wiki page and GitHub project.
When I have to decide whether to implement a feature in Bridgy, or how to prioritize tasks, I often make assumptions like “most indie web sites have an h-card” or “PSCs and PSLs never got much traction.” I know they’re based on anecdotal evidence, not actual data, but it’s all I have, so I run with it.
Clearly not ideal. I’d love to use real data instead! Here’s a project idea: crawl indieweb sites and generate usage stats for microformats2 classes and other indieweb features.
Tantek and others have proposed a similar Indie ThinkUp idea for more non-technical statistics, e.g. frequency of each post type (post vs reply vs like, etc.), how often you thank people, how often you curse, etc.
Straw man design proposal:
- Seed from IRC_People and maybe all domains that have ever logged into IndieAuth. Don’t even bother spidering, at least to start; just crawl those domains.
- Try to identify the server. (Known, WordPress, etc.)
- Parse every h-entry on the front page and every h-feed linked from the front page.
- Count all instances of mf2 classes. Identify them by the mf2 prefixes: h-, p-, u-, dt-, and e-.
- Aggregate per page and per domain so we can answer questions like “what fraction of posts are photo posts?” and “how many people use syndication links?”
- Generate a static HTML report with simple graphs using D3 or Google Charts or whatever.
- Set up a cron job to do all this once a day or so.
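Here’s a rough sketch of the counting and aggregation steps above, in Python with only the standard library. It’s an illustration under assumptions, not a real design: the names `Mf2ClassCounter`, `count_mf2_classes`, and `aggregate_by_domain` are made up for this example, fetching pages is left out entirely, and the regex is a simplified guess at what counts as an mf2 class (a known prefix followed by lowercase words), so it will pick up some false positives like an unrelated `p-large` class. A real crawler would probably want a proper mf2 parser instead.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Simplified assumption: an mf2 class is one of the known prefixes followed
# by a lowercase name. This rejects things like "p-4" but not every lookalike.
MF2_CLASS = re.compile(r"^(h|p|u|dt|e)-[a-z][a-z0-9-]*$")


class Mf2ClassCounter(HTMLParser):
    """Counts mf2-looking class names on every element of one page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if MF2_CLASS.match(cls):
                        self.counts[cls] += 1


def count_mf2_classes(html):
    """Return a Counter of mf2 class names found in one HTML document."""
    parser = Mf2ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def aggregate_by_domain(pages):
    """Sum per-page counts into one Counter per domain.

    pages is an iterable of (url, html) pairs that were already fetched.
    """
    per_domain = {}
    for url, html in pages:
        domain = urlparse(url).netloc
        per_domain.setdefault(domain, Counter()).update(count_mf2_classes(html))
    return per_domain
```

Keeping the per-page `Counter` separate from the per-domain sum means the same data can answer both kinds of question: per-page counts for things like post types, per-domain totals for things like how many sites use `u-syndication` at all.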
Stretch goals:
@snarfed I really like this idea! I already have too many projects to work on but I’m super tempted.
an unrelated comment from kevin marks on github led me to the 2005 Web Authoring Statistics survey. very much the same spirit as this idea, and way more ambitious. thanks for the prior art kevin!
HTTP Archive is even better, since it’s updated regularly. (The Web Authoring Statistics survey above was one time only, in 2005.)