Disclaimer: we were interested in machine learning at least a week before the recent hype started. With that out of the way…
What if there were tens of thousands of criminal justice agencies, each with dozens of potentially useful data sources sprawled across a handful of websites and subdomains? Actually…this is not hypothetical. Finding useful gems on the internet is an ongoing project of ours. Humans are good at identifying whether a web page is about the police, whether it’s useful, and in which format it’s published. Can we teach a computer to do the same thing?
One of our creative volunteers used Common Crawl to generate a list of 50,000 URLs that may contain data about the police. They also set up a text classification pipeline for adding labels. Another spectacular volunteer trained a working machine learning model on those labels, and we're starting to be able to automatically flag URLs that are likely to be relevant. Exciting stuff!
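If you're curious what that looks like under the hood, here's a minimal sketch of this kind of text classification pipeline. To be clear, this is an illustration, not our volunteers' actual code: the toy page snippets, the labels, and the choice of TF-IDF features with logistic regression are all assumptions made for the example.

```python
# Illustrative sketch only -- not our actual pipeline. The toy data and
# model choice (TF-IDF + logistic regression) are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled page text: 1 = about police data, 0 = not.
texts = [
    "county sheriff incident reports and daily arrest log",
    "police department records request portal",
    "city council meeting agenda for parks and recreation",
    "high school football schedule and scores",
]
labels = [1, 1, 0, 0]

# TF-IDF turns page text into word-frequency features; logistic
# regression learns which words signal a relevant page.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# Score an unseen page: estimated probability it's about police data.
print(model.predict_proba(["municipal court citations database"])[0][1])
```

In practice the training set is thousands of labeled URLs rather than four lines, which is exactly why labeling help matters so much.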
We have lots of good ideas for next steps, but we need your help. All skill levels are welcome: we need help labeling URLs to train the model, and we'll show you how. We also welcome feedback from any text classification experts willing to review our code. Just reply to this email if that's you!
We have expanded our Data Sources database to include everything covered by our friends at OpenPoliceData. Their tool makes it easy to access records of police-public interactions from 80+ agencies and several entire states. Please support their important work by using and sharing it.
Funding opportunity: MuckRock’s grants for preserving critical documents
We've set an ambitious goal for this year: to 10x the number of individual donations we receive and hit $15,000. If you think this work is important, you can donate here.
That’s all for now. As always, you can reply with questions or comments. Thanks for reading!