Understanding whether a policy works is critical to ensuring government programmes deliver for the public. The recent UK Budget committed to ensuring that “all programmes are supported by robust implementation and evaluation plans.” But where do you start? How do you go about evaluating your policy?
The What Works Team have devised a set of questions to help policy makers identify potential methods they could use to evaluate their policy. Ask yourselves these questions as you start developing a policy. This will help make sure your policy is set up in a way that allows for effective evaluation.
Disclaimer: This is not an exhaustive list of all types of methods (it would be impossible to cover them all in a single article!), but we hope these questions will get you thinking about the various ways you could find out whether your policy works.
- What does success or failure look like?
Before thinking about evaluation methods, first ask yourself: what does your policy aim to achieve? What does success or failure look like? What context are you operating in? Have a go at answering the questions in the TIDieR framework.
The next step is to develop a Theory of Change for your policy or programme, to help you map out your aims and objectives and how your policy or programme should help achieve them. This process encourages you to think about all the other things that might influence the success of your policy, and can help you identify what you need to measure to understand whether your policy has worked, why or why not, for whom, and under what conditions. Check out the multitude of resources online about how to develop a Theory of Change e.g. here, here and here.
- Could you test the policy before you begin delivering at scale?
A robust, small-scale trial can be a cost-effective way of finding out whether an intervention should be rolled out at scale. Randomised Controlled Trials (RCTs) involve randomly assigning participants to either a treatment group that receives an intervention or a control group that does not. The difference in the outcomes of the two groups is then measured and analysed after a period of time to determine the effectiveness of the intervention. Well-designed RCTs are usually the most accurate way of determining whether an intervention is working.
Example case: The Family Nurse Partnership
Read more about developing public policy with RCTs in Test, Learn, Adapt and check out this blog post from the Education Endowment Foundation on the pros and cons of RCTs, plus some top tips.
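If you like to see the nuts and bolts, here is a minimal sketch of how an RCT's results might be analysed, using entirely simulated data; the sample size, effect size and variable names are ours, purely for illustration:

```python
# Minimal sketch of an RCT analysis on simulated data (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Randomly assign 1,000 participants to treatment or control.
n = 1000
treated = rng.integers(0, 2, size=n).astype(bool)

# Simulate an outcome: the hypothetical intervention adds 0.3 on average.
outcome = rng.normal(loc=0.0, scale=1.0, size=n) + 0.3 * treated

# Estimate the treatment effect as the difference in group means,
# with a two-sample t-test for statistical significance.
effect = outcome[treated].mean() - outcome[~treated].mean()
t_stat, p_value = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"Estimated effect: {effect:.3f} (p = {p_value:.4f})")
```

A real trial analysis would be pre-specified and usually more sophisticated (e.g. adjusting for baseline covariates), but the core idea is exactly this: random assignment plus a comparison of group outcomes.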
- Could you test a number of variations of the policy to understand the most effective approach?
Rather than put all your eggs in one basket, why not design policy with inbuilt variation to test which version is most effective?
Multi-armed RCTs simultaneously test more than one variation of a policy. In this approach, participants are randomly assigned to receive one of the variants. This enables a comparison of all the different variants and helps you to establish which one works, or works best. One of the biggest benefits of a multi-arm RCT is that it does not need to include a control group, so it is a great solution when it is not possible to deny or delay implementation of the policy for eligible participants.
Example case: Testing low-cost interventions to improve road safety
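As a rough illustration, the analysis of a multi-arm trial might look something like the sketch below. The arm names and effect sizes are invented for the example:

```python
# Minimal sketch of a multi-arm trial analysis on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Randomly assign 900 participants across three hypothetical variants
# of a policy letter (names and true effects are made up).
arms = {"original": 0.0, "simplified": 0.2, "personalised": 0.35}
assignment = rng.choice(list(arms), size=900)
outcome = np.array([rng.normal(arms[a], 1.0) for a in assignment])

# One-way ANOVA tests whether mean outcomes differ across the arms...
groups = [outcome[assignment == a] for a in arms]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# ...then compare each arm's mean outcome to see which performs best.
for a in arms:
    print(a, outcome[assignment == a].mean().round(3))
```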
- Could you phase the rollout of the programme to help you understand its impact?
Limited resources or simple logistics might mean that the delivery of a programme has to be staggered using a waiting list or phased rollout. This offers an opportunity to use a stepped-wedge or waitlist design, which allows those who have not yet received the intervention to act as a temporary control group.
It is possible to estimate impact by (1) comparing outcomes for treatment and control groups for the time period prior to the control group receiving the intervention, or (2) comparing outcomes between groups who have been exposed to the intervention for varying amounts of time. While the order in which participants receive the intervention can be randomised, this method is often used where randomisation is not possible.
Example case: Evaluating the role of language training in integration
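One common way to analyse a stepped-wedge rollout is a regression with group and time-period fixed effects, so that each group acts as its own control before it crosses over. Here is a simplified sketch on simulated data; the group sizes, rollout schedule and effect size are made up:

```python
# Minimal sketch of a stepped-wedge analysis on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Three groups cross over to the intervention in successive periods
# (group 0 at period 1, group 1 at period 2, group 2 at period 3).
rows = []
for group in range(3):
    for period in range(4):
        treated = int(period > group)
        for _ in range(50):  # 50 participants per group-period cell
            y = 0.1 * period + 0.4 * treated + rng.normal(0, 1)
            rows.append({"group": group, "period": period,
                         "treated": treated, "outcome": y})
df = pd.DataFrame(rows)

# Regress the outcome on treatment with group and period fixed effects,
# separating the intervention's effect from underlying time trends.
model = smf.ols("outcome ~ treated + C(group) + C(period)", data=df).fit()
print(model.params["treated"])  # estimate of the true effect (0.4)
```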
- Can you mine existing data, or exploit natural variation, to understand the programme’s impact?
If good-quality administrative data is already available, it is possible to construct control groups based on observable demographic characteristics (e.g. gender, age, occupation). The aim is to build a control (or "matched" comparison) group that, at the start of a programme, looks as similar as possible to the intervention group. The outcomes of both groups are then tracked and compared to estimate the effectiveness of the intervention.
However, matching can quickly become impractical when evaluation teams are required to match a high number of characteristics (e.g. offending profile, age, gender, ethnicity, socio-economic background, employment status and area of residence).
Propensity score matching (PSM) provides a workaround in this situation. The technique involves building a model that predicts an individual’s likelihood of being exposed to an intervention based on their observable characteristics. The “propensity scores” generated via this process are used to create comparison groups that enable the effect of the intervention to be analysed. This is done by matching beneficiaries and non-beneficiaries of the intervention who have similar scores and comparing their outcomes. For a great explainer on data matching and PSM that we regularly draw on, see this guide by the Government of Canada.
Example case: Evaluating support for troubled families
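To make the two-step logic of PSM concrete, here is a simplified sketch on simulated data. The characteristics, model choices and effect size are all invented for the example, and a real evaluation would also check balance and overlap between the matched groups:

```python
# Minimal sketch of propensity score matching on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 2000

# Two observable characteristics (e.g. age and prior income, standardised).
X = rng.normal(size=(n, 2))

# Exposure to the programme depends on those characteristics
# (i.e. there is selection bias), and the true effect is 0.5.
p_treat = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
treated = rng.random(n) < p_treat
outcome = X @ np.array([1.0, 0.7]) + 0.5 * treated + rng.normal(0, 1, n)

# A naive comparison of group means is biased by the selection.
naive = outcome[treated].mean() - outcome[~treated].mean()
print(f"Naive comparison (biased): {naive:.3f}")

# Step 1: model each individual's propensity to be treated.
scores = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each beneficiary to the non-beneficiary with the closest
# propensity score, then compare outcomes across the matched pairs.
nn = NearestNeighbors(n_neighbors=1).fit(scores[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(scores[treated].reshape(-1, 1))
effect = outcome[treated].mean() - outcome[~treated][idx.ravel()].mean()
print(f"Matched estimate: {effect:.3f}")  # near the true effect of 0.5
```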
If historical trend data exists, it might be possible to use the difference-in-differences approach. This involves comparing the outcomes between a treatment group and a control group that have historically followed the same trend in the outcome(s) of interest (e.g. two countries whose exam results have remained parallel over time). If the outcomes for the two groups differ following an intervention (e.g. the abolition of school league tables), the change in the size of the difference can be used to estimate the effect of the intervention (hence “difference in difference”).
Example case: Using difference-in-differences to assess the impact of publishing school league tables
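In a regression, the difference-in-differences estimate is simply the coefficient on the interaction between being in the exposed group and being in the post-intervention period. A minimal sketch on simulated data, with all numbers invented:

```python
# Minimal sketch of a difference-in-differences estimate on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Two groups observed before and after a policy change; only one is exposed.
df = pd.DataFrame({
    "group": np.repeat([0, 1], 400),               # 1 = exposed to the policy
    "post":  np.tile(np.repeat([0, 1], 200), 2),   # 1 = after the change
})

# Both groups share a common trend (+0.3 after), the groups differ by a
# fixed gap (0.5), and the policy adds a true effect of 0.4 to the exposed.
df["outcome"] = (0.5 * df["group"] + 0.3 * df["post"]
                 + 0.4 * df["group"] * df["post"]
                 + rng.normal(0, 1, len(df)))

# The coefficient on the interaction term is the DiD estimate.
model = smf.ols("outcome ~ group * post", data=df).fit()
print(model.params["group:post"])  # estimate of the true effect (0.4)
```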
If a programme has a quantifiable eligibility threshold (e.g. age 18, or an income below £50,000), then regression discontinuity design is a good evaluation option. In practice, the group that falls just outside the cut-off is very similar to the group that just qualifies: any difference between them is likely down to chance, so differences in outcomes between the two groups can be attributed to the programme.
Example case: The impact of payday loans
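As a sketch of the core idea, using a made-up £50,000 income threshold and simulated data: restrict the analysis to a narrow band either side of the cut-off and estimate the jump in outcomes at the threshold. Real applications involve careful bandwidth and specification choices:

```python
# Minimal sketch of a regression discontinuity estimate on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 5000

# Hypothetical eligibility rule: only people with income below £50,000
# receive the programme; the true effect at the cut-off is 0.6.
income = rng.uniform(20_000, 80_000, n)
eligible = (income < 50_000).astype(int)
outcome = 0.00002 * income + 0.6 * eligible + rng.normal(0, 1, n)

# Keep observations in a narrow bandwidth either side of the threshold,
# where eligible and ineligible individuals are most comparable.
df = pd.DataFrame({"income": income, "eligible": eligible, "outcome": outcome})
df = df[(df.income > 45_000) & (df.income < 55_000)].copy()
df["dist"] = df.income - 50_000  # distance from the cut-off

# Fit separate slopes either side; the jump at zero is the effect estimate.
model = smf.ols("outcome ~ eligible * dist", data=df).fit()
print(model.params["eligible"])  # estimate of the true effect (0.6)
```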
- Do you have the capacity to run trials and conduct evaluations? Do you need support?
Once you’ve thought about the potential ways you could evaluate your policy, you will need to think through the detail of how to design and carry out the evaluation. Consider seeking help from analysts within your department, or look at commissioning an evaluation from an expert provider. For those in the UK Civil Service, there are additional resources on hand to help.
If you’d like to know more about the methods mentioned in this blog, check out the following helpful guides:
- HM Treasury - Magenta Book - the UK Government's official guidance on designing an evaluation
- Impact and Innovation Unit, Government of Canada - Measuring Impact by Design (2019)
- Alliance for Useful Evidence, The Experimenter’s Inventory (2020) - a jargon-free guide to experimental methods with simple advice on the pros and cons of different methods