Data Journalism Heist
Data Journalism Heist
Data Journalism Heist
How to get in, get the data, and get the story out - and
make sure nobody gets hurt
Paul Bradshaw
This book is for sale at http://leanpub.com/DataJournalismHeist
This version was published on 2015-06-10
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
2013 - 2015 Paul Bradshaw
Contents
Huh? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
3
Getting in . . . . . . . . . . . . . . . . . . . . . . . .
The crowbar of data journalism: pivot tables . . .
Who might your story be about? . . . . . . . . . .
What about that who might we want to find out?
Advanced filters - a dry run . . . . . . . . . . . .
Advanced filters - using wildcards . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5
6
10
11
19
24
27
The Debrief . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
No one gets hurt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Why does it matter - why do I care? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
35
36
Huh?
Rusty: Youd need at least a dozen guys doing a combination of cons.
Danny: Like what, do you think?
Rusty: Off the top of my head, Id say youre looking at a Boeski, a Jim Brown, a Miss Daisy, two
Jethros and a Leon Spinks, not to mention the biggest Ella Fitzgerald ever.
Oceans Eleven (2001)
.
Can you learn data journalism in an hour?
That was the challenge I was set in late 2011, when I was invited to Bristol to deliver a short
workshop. There was a certain appeal in the challenge: there is a myth that data journalism has
to be complicated, spectacular, or resource-intensive. But data journalism is not always like that.
For every headline-grabbing Wikileaks story or MPs expenses saga, there are dozens of everyday
uses of data journalism that go unnoticed. It might be working out whos the top-scoring Englishman
in the Premier League, or seeing whether theres been an outbreak of flu in your area. It might be
finding out the worst performing schools, or that seasons biggest fashion trends.
So I stripped back everything to some basic techniques. This book covers the bare bones of data
journalism: the basic skills to do those simple stories - from finding data in the first place, to getting
to the story you want quickly, to following it up and telling it well.
Can you learn data journalism in an hour? Not all of it. But you can learn enough to get started, and
get your first stories. More importantly, you can learn enough to see whats possible, with results
that provide a basis to begin to learn more (Ill talk about places to go next at the end).
So this is the Data Journalism Heist: nothing illegal, but rather a concept designed to reinforce the
rough and ready, fast and clean aspect of this approach - as well as the importance of the last part:
No one gets hurt.
It is a book all about speed, and were wasting time. Time to get started.
http://www.theguardian.com/world/iraq-war-logs
http://www.telegraph.co.uk/news/newstopics/mps-expenses/
If you are reporting on - or particularly interested in - a specific field like crime, health, education,
welfare or the environment, try to find bodies (local, national and international) that publish data
regularly in that area.
Here are just a few general sources of regular data in the UK alone:
http://ckan.org/instances/
http://data.gov.uk
http://www.regionofwaterloo.ca/en/regionalgovernment/OpenDataHome.asp
http://datablog.ahref.eu/en/ahref-log/opendata/open-data-la-regione-emilia-romagna-presenta-il-suo-portale
https://data.cityofchicago.org/
http://en.wikipedia.org/wiki/List_of_national_and_international_statistical_services
http://whatdotheyknow.com
http://www.asktheeu.org
http://ChangeDetection.com
Education: the Higher Education Statistics Agency (HESA), the Higher Education Funding
Council (HEFCE) and Universities and Colleges Admission Service (UCAS) all hold data on
universities and students. Ofsted, the regulator of schools, hold data on pupils and education
at pre-16 level.
The NHS Information Centre (NHSIC) and Health Episode Statistics (HES) hold data on
hospital admissions and local doctors.
NOMIS holds data on the labour market: where people are employed and unemployed.
Data.police.uk holds data on crime and policing
.
Birmingham City Councils expenditure data is at Birmingham.gov.uk/payment-data. Youll find
monthly spending data going back over the last year, and can request older data by email.
It comes in two formats: PDFs, and spreadsheets. Given the choice, always avoid PDFs.
If you prefer to work on some international data, download loans data from the European
Investment Bank - change the search to the widest possible criteria and then use the (easy to
miss) Export link at the bottom of the page to get an Excel spreadsheet.
http://www.eib.org/projects/loans/list/index.htm
.
The spreadsheets are shown here as a Microsoft Excel icon, but they are actually CSV files which
will work on any spreadsheet software, including free options like Google Drive spreadsheets and
Open Office (which can also open Excel spreadsheets).
http://www.birmingham.gov.uk/payment-data
http://www.openoffice.org/
CSV stands for Comma Separated Values - this means that the value in each column is
separated by a comma like so: Name, Date of birth, Address. When the spreadsheet
software opens this, it replaces each comma with a new column.
Click on the most recent spreadsheet version of the spending data (not the PDF) and download it to
your computer. Make sure you save it somewhere you can find later, like your desktop.
Once downloaded, open it in your preferred spreadsheet software - either by double-clicking it or
uploading to a web-based tool like Google Drive.
Now, were in.
If PDF is the only option, try a PDF-to-Excel converter like PDFtoExcelOnline.com,
PDF2XL or Wondershare PDF Converter. You can also try a quick phonecall or FOI
request to get the data in spreadsheet format.
https://support.google.com/drive/answer/2424368?hl=en&ref_topic=2375187
PDFtoExcelOnline.com
http://www.cogniview.com/pdf-to-excel/pdf2xl-basic
http://www.wondershare.com/pdf-converter/
http://leanpub.com/scrapingforjournalists
Getting in
Sometimes it takes an outsider, someone with fresh eyes to see the truth.
Ally Carter, Heist Society
.
Once were in the data, we need to know where were going. We dont want to spend a minute
longer than we have to.
Every dataset will have stories it can tell. Here are just some typical stories often found in data, some
taking more work than others:
1. Top of the pops/flops: who is top or bottom?
2. Holding up a mirror - or the face behind the figures: what does the data tell us about
ourselves as a country, industry or region?
3. A freak occurrence/unusual change: what changed in the data from one time or place to
another - and why?
4. A claim debunked or supported: does the data support claims being made by a politician,
expert or pundit?
5. Trend: whats going up or going down?
6. Postcode lottery: are people in different places getting different treatment or access, or being
charged differently?
7. Juking the stats/No accountability: is the data reliable or complete? Or is someone trying
to prevent people seeing it? Or is it not even collected in the first place? Why not?
8. Victims, Winners and Losers: whats the impact of that data in real life? What are the human
stories - including reaction to the data itself?
9. Things happen together: does one thing go up when another thing goes up? Or does one thing
go down when another goes up, or vice versa? Note that this does not mean causation.
10. Its not working: if someone made a change in the past about how things were done, has it
had the impact that was promised?
Getting in
You will often find yourself distracted by many different possible stories in the data - but if youre
up against a deadline then youll need to be focused.
Well be focusing on one of the simplest stories to tell with data: who is top or bottom. Specifically:
who gets the most money?
Getting in
Once you find the pivot table option, click on it. A new window should appear checking that
Excel has understood correctly where your data is. This is why you needed to make sure you were
somewhere in your data before you started this. Excel will automatically detect the edges of your
data from that point - in other words, where it hits an empty row or column it will take that to be
the limits of the cells containing data.
Getting in
The window that appears when you create a pivot table - note the dotted line to the right of the window showing
where it thinks the edge of your data is
Check its understood correctly where your data is - a dotted line will have appeared around the
edges of the data it thinks you want to play with - if its wrong, start again making sure that you
are starting with only one cell selected (do not press any cursor keys while still in the new window
- trust me on this one). Then click OK.
What happens now depends on your version of Excel, but you should be able to see an empty pivot
table with the following areas:
One area saying Data or Data area or Values
One area saying Row or Row area or Row labels
One area saying Column or Column area or Column labels
Look around to see if you can find these. If possible, click on the table area. Below are images showing
three common views (if you have a different view, please get in touch and Ill update the book).
Getting in
Getting in
10
Whichever version of Excel youre using, what happens next is the same: you are going to fill some
of those boxes with the subjects of your story by dragging them from the list of fields (column
headings).
The first box to focus on is Rows. Rows normally concern the who of our story
So, who could we tell a story about?
Getting in
11
So, weve narrowed our field to two strong candidates for the who part of our story: companies,
and directorates.
Lets drag companies into the Row box (if you have more than one, try just one and see what
happens - sometimes there is more than one way to do the same thing, and you can always try the
other method next time).
Your pivot table should now update to show a list of all the companies in the data.
This is quite useful in itself: our original data had one row for every payment. If we wanted to look
through those manually we would have to glance through over 15,000 rows.
This list has much fewer rows - we could look through it for any interesting companies.
But were on a deadline, so we need to move on.
Getting in
12
How those payments are counted depends on how the data is interpreted.
In most cases, Excel or Google Drive will see that these are numbers and assume that you want to
add them all together, with a function called SUM. If that has happened, your Values box should say
SUM of Amount.
Sometimes, however - generally when the data is not numeric - it will think you want to count
how many there are, with a function called (can you bear the suspense?) COUNT. If that has
happened, your Values box should say COUNT of Amount.
Getting in
13
The PivotTable Field window will appear. There are two areas to play with here:
On the left, under Summarize by: are the options to aggregate values by their SUM, or COUNT,
or AVERAGE, etc. You can also choose to just show the biggest value (i.e. the largest individual
payment) for each row (MAX) or smallest (MIN). If you cant see SUM, make sure you scroll
up.
On the right the Number... button allows you to change how the number displays. If youre
dealing with money, for example, you can choose currency here, which will add a currency
sign and thousand separators (i.e. the figure 1000000 will become 1,000,000 - much easier to
understand).
Choose what you want to show and click OK. The pivot table should update.
This will open a new area of the window titled Show data as:
14
Getting in
The drop-down menu here allows you to specify whether you want to show your numbers
differently to normal. For example % of row will show your values not as a whole number
(for example the number of payments or inspections, or the amount of money), but in terms of
their percentage of that row (for example, the proportion of all inspections, or the percentage of
payments to that company).
Similarly, % of column will show you what each cells value represents as a proportion of the
column in which it sits, and % of total will tell you exactly that.
There are other more complex options here too that allow you to display the value in terms of its
percentage of, or difference from, another value, such as the previous years number, or a specific
or gender or category.
In this example the value is being shown as a percentage difference from the previous year
And Running total in will show the values mounting up across or down, depending on that base
field you choose. Choosing your year field, for example, will mean each years value in the pivot
15
Getting in
table is actually that years total plus all the preceding years. Choosing your category field will
mean the values increase as each category appears in each row, so that the last category in your
pivot table rows will show the totals for all categories (a running total of that category and all
preceding ones).
For video tutorials on some of these techniques see this Contextures tutorial.
http://www.contextures.com/xlPivot10.html#PctDiffFrom
Spotting these stories is easier if you spend time reading the local news, but its also a good starting
point to begin familiarising yourself with the local news: searching for company names may bring
up previous reports involving them. It will also give you ideas for new angles or angles that can be
re-visited.
There are also many other possible leads if you continue to scan down the list of companies, however.
Dont overlook the smaller amounts - these sometimes include quirky recipients (my personal
favourite was a chicken hire company) or ones that raise questions (whats the story behind a gift
bought from a business gifts company?) which might make for more colourful items.
Here is a list of some of the questions you might write down as you scan down the list of companies:
How much is spent on care homes? Who benefits most? What about companies with a bad
history? You could also ask the same question of childrens homes.
Getting in
16
How much is spent through agencies on temporary staff? Why are temporary staff being hired
at the same time as redundancies are being made?
There are a lot of charities. How big a role does the voluntary sector have in providing public
services? Are they at risk from cuts?
There are large payments to sporting bodies such as the British Cycling Federation and the
Lawn Tennis Association - what is that for? (A similar lead was followed up for this front
page story by the Manchester Evening News - shown in the image below.
There are payments to other councils - what is that for?
How much is spent on transport? Who is that for?
There are some B&B (bed and breakfast) hotels in the list costing tens of thousands of pounds
- what for? (That spending was the source of this front page story in the Birmingham Post
- see second image below)
Why are they spending thousands on Starbucks?
Why are they spending over 1000 on Jaguar hire?
And:
Redacted personal data - whats that all about?
http://www.manchestereveningnews.co.uk/news/greater-manchester-news/you-splash-out-300k-australian-4033385
http://www.birminghammail.co.uk/news/local-news/city-council-spends-nearly-1m-1330817
Getting in
17
18
Getting in
Youll notice that many of these stories dont require any further work on the data - so at this point
its time to make a clean exit, and start speaking to people. If thats you, skip to the last chapter
now.
For some, however, youre talking about more than one company - and for that youre going to need
a second technique.
19
Getting in
That drop-down menu can then be used to filter the data below: for example, you could click on the
drop-down menu for Vendor name, untick Select All and then scroll down, ticking the companies
you only want to show.
For a quicker way of doing the same, instead of scrolling down you can use the search box in the
drop-down menu to find companies that contain the term youre looking for - and then tick them
from here (some versions of Excel will automatically select them for you).
Getting in
20
This easy way of filtering is very appealing - but you are likely to need a lot more control over your
filter than this, as well see. And thats where the advanced filter option comes in.
Getting in
21
Under that column heading we have one entry: ELMDON CARS (TAXIS) LTD - again, this
has been copied and pasted from the original data because advanced filters will only bring
back exact matches - unless we use special characters, which well come on to.
For a dry run, create a similar mini table in your own spreadsheet - remembering to leave an
empty column, and copy and paste the column heading and the name of one entry in that column.
Once youve created that, youre ready to use the advanced filter. But before you do that
Make sure your cursor is somewhere in the original data first. This is because, like pivot tables,
Excel is going to use this to guess which data you want to filter - and of course you want to filter
the original data, not the new mini table. Once your cursor is back in that main data you can select
Advanced filter.
The advanced filter, like the automatic filter, is normally under the Data menu.
Getting in
22
Once you click on Advanced filter you should see a new window appear with some options. The
main ones you want to look for are the List range box, which should show the range of cells you
want to filter: something like $A$1:$H$16283.
Ignore the dollar signs (these fix each part of the cell location references so they cannot be changed),
what that means is it is selecting data from cell A1 (column A, row 1) to cell H16283 (column H, row
16283). If your spreadsheet has more rows or columns the second cell reference will be different but the principle is the same.
Clicking in this box should also make a faint dotted line appear in the background spreadsheet,
around your original data (like it does in the pivot table wizard) so you can visibly see the edges of
the data Excel has assumed you want to filter.
Getting in
23
The box underneath this - titled Criteria range: - is where you tell Excel about your little mini table,
where you listed the column name and criteria you want to filter on.
Click in that box, and then click and drag across the two cells in your mini table to select them (you
may have to move the Advanced Filter window first, which you can do by clicking at the top of it
and dragging it aside).
As you do this, the cell references should appear in the box. This time it may look a bit more
complicated: something like '300627Payments_over_500_Octobe'!$J$1:$J$2 - this is because this
time the cell reference includes the sheet name too, followed by an exclamation mark before the cells
themselves: J1 to J2.
Dont worry about checking this (avoid using the arrow keys to try to see the end of that reference
- it will move the cursor on the sheet instead) - the main thing is that a dashed line should have
appeared around your mini table to indicate that it has been selected.
You can also forget about the other options on this window. Instead, click OK and you should see
the results of your advanced filter appear.
With my example, I have four rows of results (you may have more). Note the row numbers, however
- they are not sequential:
This is because the filter only hides non-matching data. Likewise, our mini table has disappeared
too - apart from the heading. The rows under that heading have been hidden. Thats important, as
well find out.
Getting in
24
Now, we could have done all this much more quickly with an automatic filter. But this was just a
dry run - now well really use the power of advanced filters.
This will bring back any rows where the Vendor Name field contains none or any characters followed
by the letters T, A, X, I, followed by none or any characters.
In other words that will match any companies with names containing the characters T, A, X, I, in
that order.
Treating the characters separately is important here: Bobs Taxidermist would be caught by the
same criteria, for example (well deal with excluding those results later).
Now, again, we could have something similar with an automatic filter - but an advanced filter allows
you to do this with multiple criteria. So, underneath *TAXI* we could also add:
*COACH*
*TRANSPORT*
*CAR*
*HIRE*
*BUS*
and any other key terms you can expect to appear in the name of taxi firms.
Getting in
25
Try that now with your own advanced filter. Youll need to clear your old filter first - which you
can do by selecting the Data menu and clicking on Clear filters or similar. Then change the criteria
in your mini table - making sure to leave the heading as it was.
When you go through the process of creating a new advanced filter (Data > Advanced filter),
remember to put your cursor back in the main data first, and specify the criteria range again, as it
is now a longer range of cells than it was previously (in the example above, seven cells rather than
two).
Now you should get hundreds of results - including many irrelevant ones. Why? Because our *CAR*
criteria is matching against words like CARE (so all the care home companies are included), and
*HIRE* is matching against words like YORKSHIRE.
To reduce the likelihood of this happening we need to add just one character to our criteria: a space.
Its easy to overlook spaces in company names and other data, but computers see these as a character
like any other, and in fact they are hugely useful in filtering. Look at the difference between these:
*CAR*
*CAR *
The first option will match CAR HIRE and CARE HOME.
The second will not match CARE HOME, because it is looking for a space after the R.
Getting in
26
However, for the same reason it will also not match HIRE CARS. To address that you might add
another criteria: *CARS*
Apply the same principle to *HIRE* - this time putting the space before it like so:
* HIRE*
At this stage you will still have some mismatches - but as long as they dont dominate the results
as they did previously, we can filter them out more quickly using pivot tables, as Ill explain in the
next section.
.
Once youve tweaked your advanced filter to the point where you have some decent results, you
can use the same pivot table techniques to get an overview on that - but not yet.
For the moment this filtered data is just a view of the data - rows that dont match our criteria are
still there, but hidden, so any pivot table we generate will also include those hidden rows.
To address this, we need to copy the filtered data into a new sheet:
While the data is still filtered, select CTRL+A to select all (CMD+A on a Mac).
Press CTRL+C (CMD+C on a Mac) to copy the selection - or select the menu Edit>Copy if
you prefer.
Youll notice that the copied data has dashed lines across it - this is because you are only
copying the filtered data, not the hidden data.
Create a new sheet to paste this data into, by selecting the Insert menu and then Sheet >
Blank sheet (you can also right-click at the tabs across the bottom and select Insert sheet, or
press SHIFT+F11)
Paste the copied data into this sheet by pressing CTRL+V (CMD+V on a Mac) or Edit>Paste
Scroll to the bottom of the sheet to check it has only pasted the filtered data: there should be
a couple hundred rows rather than over 15,000.
28
You can now use this data to generate a pivot table, as you did before with the whole data. So do
that, putting Vendor Name in the rows, and Invoice amount in the data area to give you a total for
each company.
This will also give you a list of companies which is easier to whittle down further: at the top of the
column containing the company names should be a coloured cell saying Row labels. On the right of
that is a downward arrow that opens up a drop-down menu showing all the company names. Click
on this and you can untick any companies you dont want to appear on this pivot table.
Above we can see that ALLSPORTS SCHOOLS COACHING LTD is not a transport company, so we
can untick that. Likewise CENTRAL COACHING AND SPORTS ACADEMY, DOABA TENT HIRE
SERVICE, JOHN GALLAGHER SPORT COACHING and so on (These are just examples in my data
- look for similar examples in your own.)
Some might not be immediately obvious: is FREIGHT TRANSPORT ASSOCIATION a transport
company? A quick search on Google will tell you its a trade body, so that can be excluded as well.
What about FINESSE COACHES LTD? That could be a coach in either sense of the word, but it turns
out to be the vehicle kind. And in fact, while Googling, you may find a public inquiry judgement
relating to a company of the same name in the Midlands by the Traffic Commissioner, dated a year
previously. Thats something to follow up - perhaps with other companies here too (but dont get
thrown off track: finish the story youre working on before beginning another).
Once youve filtered out the companies you dont want in this table, you have a dataset you can
re-pivot for a number of possible stories:
Whos getting the most money?
Whos spending the most on transport?
What sort of companies are being used?
29
Sorting column B from largest to smallest again would bring our biggest recipient to the top (in my
data thats one taxi company getting 67,000 in this one month alone).
What if we wanted to look at spenders? Well, we still need to retain the filter on company names:
to do that, drag the Vendor Name from the Rows box into the Filter box. That will just give us a
grand total - but we can also drag something new into Rows: Directorate. Your four boxes should
now look something like this (depending what version of Excel youre using):
This helps us understand whos spending the money, and perhaps what on.
The overwhelming majority of payments, for example, are being made by the directorate CYPF another quick search, phonecall or local knowledge would tell you this stood for Children, Young
People and Families. This covers all business areas relating to education and social care for
children and young people.
So these may include taxis and coaches for transporting pupils to and from school (individually and
in groups), and on school and other trips, among other things.
From this point you could drill down further to just look at spending by a particular directorate - it
all depends on what youre interested in, or whats more newsworthy at the moment. For example:
If the council is making lots of redundancies, spending money on temporary staff may be
worth focusing on.
If someone in the council is making statements about wasting public money, is their
department practising what she preaches?
If there are big cuts, where is money being spent unnecessarily? Or where are cuts not being
made?
If there are cuts elsewhere - for example in welfare payments - how might that be having an
impact at a local level? For example: an increase in homelessness shown by increased spending
on bed and breakfast accommodation.
http://www.birmingham.gov.uk/cs/Satellite?c=Page&childpagename=Legal-Services%2FPageLayout&cid=1223092568895&pagename=BCC%
2FCommon%2FWrapper%2FWrapper
30
If a particular issue is making the headlines, such as teenage obesity or roads in need of repair,
how much money is being spent on that issue, or how has that changed?
You can add many more - the more you read the news, the more questions will spring to mind (youll
also spot the names of people to ask about them).
For example, in our data, if you were only interested in spending by one department (say, Adults
and Communities), you could click on the drop-down menu next to Row labels and tick just that
one. Then:
drag Directorate to the Report filter box, and
drag Vendor Name back into the Rows box
so youd have something like this:
Have a play around with different combinations of filters and rows. And as always, be ready to
leave the data and pick up the phone at the earliest opportunity: this spreadsheet isnt going to give
you the story alone.
Searching for RBL23 filetype:xls on Google will only show you spreadsheets containing
the code
Searching for RBL23 filetype:pdf on Google will only show you PDFs containing the code
Searching for RBL23 site:birmingham.gov.uk on Google will only show you pages and
documents on the Birmingham council website
You can adapt the same techniques to other documents types (filetype:doc for Word docs, for
example) and sites (site:gov.uk for all local and central government sites, for example). The terms
site: and filetype: are called search operators - these can save you a lot of time looking for
information generally.
Sometimes you will find a spreadsheet or document which lists all the codes in your data, and
adds extra data (for example where schools or authorities are represented by a code, you might
find resources like this listing who they are, their addresses, and so on.
Sometimes you will find other documents, such as guidance, or details on individual payments, or
templates for submitting information. All of these can be useful if you have to resort to a Freedom
of Information request, because they tell you what information is gathered (or should be!).
If you cant find more detail, try picking up the phone again to the part of the authority that
publishes the data - there may be contact details on the page where it is published, or you may
have to call the switchboard and ask if they can put you through to whoever is responsible.
If that person cant help then try the officer responsible for Freedom of Information requests again, it may be that they can help you without you having to submit a formal request.
Even if both people cannot help, dont see it as a failure: youve made two useful contacts!
The final option, then, is the Freedom of Information (FOI) request. This can be made directly to
the authority (search for freedom of information email or similar, with the name of the body most have a webpage which provides details), or through a free service like WhatDoTheyKnow
or AsktheEU. In the US you can get help from sites like the Reporters Committee for Freedom of
the Presss FOIA Letter Generator.
The advantage of services like WhatDoTheyKnow is they make it much easier to make and track FOI
requests, and easier for others with a similar interest to find and even help you. The disadvantage
is that your request is public, so if you want to retain some sort of exclusivity to the information,
you wont. If you lack confidence in making FOI requests, sites like WhatDoTheyKnow are probably
a good way to start.
When making an FOI request relating to data, check the following:
Are you requesting the data in digital spreadsheet format (CSV or XLS)? If you dont
specify this, the data may come back as a PDF, or even printed on paper, which isnt helpful.
Ask if different parts of the codes have particular meanings. For example, the first two
characters might refer to a department, and the next three to something else.
Try requesting the data dictionary for the data youve been working on. This is a list of all
the fields in the database - not just those that they published. You should explain that, too,
as they might claim they dont know what that is, or that they dont have one. Alternatively,
31
you might request the full data, including any fields that have not been published - bearing
in mind the next point
Anticipate any objections that may be made. Typical exemptions under FOI laws include
cost, privacy, and commercial confidentiality. If you can familiarise yourself with those
exemptions, and judgements relating to them, then quote them in your request. For example,
you might point out their own obligations under those exemptions (e.g. having to explain the
reasoning, or help you rephrase the request). You can also shape your request to explicitly
exclude information which may be subject to exemptions, such as personal data.
Remember that you can request documents as well as data: are there documents which
might shed more light on your story, such as individual payment receipts (payment
reference numbers of some kind are helpful to quote here) or claim forms. Policy or guidance
documents might be useful too: for example, on spending, commissioning or claiming
money.
As requests like these often take weeks or even months, its worth making these requests even
before you plan to chase any stories, so you have basic contextual data available when the next
datasets come out.
http://www.dwp.gov.uk/docs/faqs-la-code-list.xls
http://whatdotheyknow.com/
http://www.asktheeu.org/
http://www.rcfp.org/foia
32
The Debrief
Rusty: You look down, they know youre lying and up, they know you dont know the truth. Dont
use seven words when four will do. Dont shift your weight, look always at your mark but dont
stare, be specific but not memorable, be funny but dont make him laugh. Hes got to like you then
forget you the moment youve left his side. And for Gods sake, whatever you do, dont, under any
circumstances
Livingston: Rus?
Rusty: Yeah?
Livingston: Come look at this?
Rusty: Sure.
Oceans Eleven (2001)
.
Once again: once you have your story lead, theres no need to waste more time in the data. And
turning that lead into a story means adding context, colour, and people.
Lets take some of the top of the pops stories we identified earlier:
How might you flesh that data out? In the case of Millions being paid every month on equal pay
settlements you might want to use the same techniques on other data to add some historical data,
add up how much has been paid in total, and which was the biggest month.
But equally, you will need to get some reaction and explanation of this: youll need to ask the body
spending the money, and the body receiving it, some questions. As always the five Ws and a H are
useful starting points:
Why are you spending this money? (If you dont already know from previous reports)
How much has been spent, and how much is yet to be spent? (Even if you think you know,
its always good to see what answer you get, or check your figures)
When does it end? When did it begin?
34
The Debrief
What is the impact of this? (Where is the money being taken from)
Who does this affect?
Where does this affect? (There may be some areas suffering more than others as a result)
The same applies to the profile piece on construction projects. This data gives you a list of the names
getting the most money - but youll need to speak to them to find out what its going on, and speak
to the council to find out why theyre spending money on those projects.
The other two stories rely on reactions: in Outsourcing of public services leaves millions of pounds of
spending closed to public scrutiny you need to ask a relevant body or individual why thats important
and how they feel about that. An organisation that campaigns for transparency in public office or
spending, or even a political party that relies on that transparency to hold power to account, would
likely have an opinion on this.
The hypothetical Council urged to cut its energy bills is a headline subject to change: once you
speak to some people (an expert on energy efficiency; the council) or perhaps look at other councils
energy bills or historical bills, you might decide to go with Council cuts energy bills by 50% or Council
praised for energy policy. But you wont know what your story is until you seek some reaction on
the information youve obtained.
The Debrief
35
The key to clear data visualisation is to strip the data down to the bare essentials. This is no different
to what a journalist does with any text story: cutting out waffle and irrelevant detail until the core
of the story remains. In data visualisation, try to avoid having more than four or five data points:
slices of a pie, bars or lines in a chart. Any more and it becomes hard to see what you are telling a
story about.
To do this, pick the top four or five, or divide into fewer, broader, categories, or drill down to just
the top four or five within one category. Try showing one item compared to everything else or
the average.
For the same reason, avoid using too many colours: theres no need to have a separate colour for
every bar, for example. Instead use one colour for your focal point, and another for everything else:
in a bar chart for example, the key bar might be red, and all the others dark grey. In a pie chart or
line graph all the non-focal slices might be different shades of grey, and so on.
If you want to map your data the same principles apply: pick as few colours as possible if you can.
In terms of websites to try out, BatchGeo is a good introductory tool - simply paste your data into
the empty box on the homepage (although the free version has a limit to the number of rows you
can map). OpenHeatMap is a good option if you want to map areas like constituencies or regions
- make sure to read the documentation to find out how to ensure your data works with it. Google
Fusion Tables gives you much more control than either, but has a steeper learning curve - look for
guides and videos online.
http://datawrapper.de/
http://www-958.ibm.com/software/analytics/manyeyes/
http://www.tableausoftware.com/public/
http://batchgeo.com/
http://www.openheatmap.com/
https://github.com/petewarden/openheatmap/wiki
http://www.google.com/drive/apps.html#fusiontables
The Debrief
36
In each case you mustnt take those replies at face value: its just another source. Nor should you
report it as he said, she said. Look for evidence, revisit your data, speak to an independent expert, or
whatever will help you be confident which party is correct, and no one is getting hurt unjustifiably.
Where next?
At some point you will probably need to expand your skillset into other areas of data journalism.
Here is a list of those areas, and some useful resources for each:
Getting data
Weve covered some sources of data releases in this book, and touched on Freedom of Information.
FOI laws will differ from country to country, so look for local resources. In the UK Heather Brookes
Your Right to Know is a great reference volume, while Montague and Amins FOIA Without the
Lawyer is a useful complement.
Most introductory journalism textbooks will cover the process of making contacts - consider
building relationships with FOI officers, data analysts, statisticians, academics and others who
The Debrief
work with data directly and can explain the problems with it.
The final way of getting data is scraping: automating the collection of data from anything from
one to thousands of webpages or online documents. If you want to know more about that my book
Scraping for Journalists covers the skills from very basic simple scrapers right through to more
powerful tools.
Data cleaning
Sadly there arent many comprehensive resources on data cleaning (I may write an ebook on that
some day!), but check out the online documentation for Open Refine (formerly Google Refine),
and blogs like Tony Hirsts OUseful.info. You can also find a number of posts about data cleaning
on my Online Journalism Blog.
Statistics
There are some excellent books that you should check out if you dont want to overlook problems
with the data youre working on. How to Lie With Statistics by Darrell Huff is a classic - and also
mercifully succinct if you dont like big books. Michael Blastland and Andrew Dilnots The Tiger
That Isnt is a very insightful read into problems with statistics, written by the people behind a
radio programme all about the subject, and Kaiser Fungs Numbers Rule Your World is a similar
book from someone who blogs about the same topics. Ben Goldacres Bad Science not only looks
specifically at science and health stories, but is also a very enjoyable read.
Visualisation
The Wall Street Guide to Information Graphics by Dona Wong is one of the most accessible books
Ive read on visualisation in general. Visualize This by Nathan Yau covers the technical side of
visual communication.
Spreadsheet formulae
You can get a number of general purpose books on Excel and spreadsheet software. These
are generally aimed at people using them for non-journalistic purposes, so you need a bit of
imagination.
A different approach is to simply search for what you need to do and the words Excel or Google
spreadsheets, e.g. adding two numbers Excel or calculating a percentage Excel. Youll generally
find a helpful tutorial, and if you prefer videos, use YouTube.
My next book, Telling stories with spreadsheets: recipes for interviewing data - and getting answers,
demonstrates a range of techniques for asking more questions of your data, from calculating
percentages to adding extra data, cleaning it, and combining data from different sources.
If you find any other books useful that arent mentioned here, please let me know and Ill add it to
the book, along with an acknowledgement.
https://leanpub.com/scrapingforjournalists
http://OUseful.info
http://onlinejournalismblog.wordpress.com/tag/google-refine/
http://astore.amazon.co.uk/onlijourblog-21/detail/0140136290
http://astore.amazon.co.uk/onlijourblog-21/detail/1846681111
37
The Debrief
http://astore.amazon.co.uk/onlijourblog-21/detail/0071626530
http://astore.amazon.co.uk/onlijourblog-21/detail/0007240198
http://astore.amazon.co.uk/onlijourblog-21/detail/0393072959
http://astore.amazon.co.uk/onlijourblog-21/detail/0470944889
https://leanpub.com/spreadsheetstories/
38