Impact Evaluations, Part 3: What Are Their Limits?

by Quentin Wodon

In the first post of this series, I argued that impact evaluations can be highly valuable for organizations such as Rotary, as a way to assess the impact of innovative interventions that have the potential to be replicated and scaled up by others if successful. In the second post I suggested that a range of techniques are available to implement impact evaluations. In this third and last post in the series, I would like to mention some of the limits of impact evaluations. Specifically, I will discuss four limits: (1) limits as to what can be randomized or quasi-randomized; (2) limits in terms of external validity; (3) limits in terms of explanation as opposed to attribution; and finally (4) limits in terms of short-term versus long-term effects.

Can Everything Be Randomized?

The gold standard for impact evaluations is the randomized controlled trial (RCT), as discussed in the second post in this series. When it is not feasible to randomize the beneficiaries of an intervention, statistical and econometric techniques can sometimes be used to assess impact through “quasi-randomization”. But not all types of interventions can be randomized or quasi-randomized. A major policy change affecting households throughout a country, for example, may be hard to randomize.

One example would be the privatization of a large public company with a monopoly in the delivery of a specific good. The company can be privatized, but it is typically difficult to privatize only part of it, so assessing the impact of privatization on households may be hard because of the absence of a good counterfactual. Another example would be a major change in the way public school teachers are evaluated or compensated nationally. At times, even with such reforms, it may be feasible to sequence the new policy, for example by covering some geographic areas first and others later, which can provide data and ways to assess impacts. But in many cases the choice is “all or nothing”. Under such circumstances, the techniques used for impact evaluations may not work. Some have argued that for many of the most important policies that affect development outcomes, the ability to randomize is the exception rather than the rule.

For the types of projects that most Rotary clubs implement, I would be skeptical of the argument that randomization is not feasible, at least at some level. This does not mean that all or even most of our projects should be evaluated. But we should recognize that most of our projects are small and local, which makes it easier to randomize (some of) them, when appropriate for evaluation. For larger programs or policy changes, one must however be aware that randomization or quasi-randomization is not always feasible.

Internal Versus External Validity

When RCTs or quasi-randomization are used to assess the impact of interventions, the evaluators often pay special attention to the internal validity of the evaluation. For example, are the control and treatment groups truly comparable, so that inferences about impact are legitimate? Careful evaluation design and research help in achieving internal validity.

But while good evaluations can be trusted in terms of their internal validity, do the results also have external validity? Do they apply beyond the design of the specific evaluation that has been carried out? Consider the case of an NGO doing great work in an area of health through an innovative pilot program. If the innovative model of that NGO is found to be successful and scaled up by a Ministry of Health, will the same results be observed nationally? Or is there a risk that with the scale-up, some of the benefits observed in the pilot will vanish, perhaps because the staff of the Ministry of Health are not as well trained or dedicated as the staff of the NGO? There have been cases in which the original promise of a pilot did not materialize once the intervention was scaled up.

Attribution Versus Explanation

Consider again the example of the dictionary project mentioned in the previous post. An impact evaluation could lead to the conclusion that the project improves some learning outcomes for children, or that it does not. Impact evaluations are great at attributing impacts and establishing cause and effect. But they do not necessarily tell us why an impact is observed or not. For that, an understanding of the context of the intervention is needed. Such context is often provided by so-called process evaluations, as opposed to impact evaluations. There is always a risk that an impact evaluation will be like a black box – impacts can be attributed, but the reasons for success or lack thereof may not be clear. This in turn can be problematic when scaling up programs that were successful as pilots: scaling up often requires altering some of the parameters of the intervention that was evaluated, and without rich context the potential consequences of those changes may not be known.

Short-Term Versus Long-Term Effects

Another issue with impact evaluations is the time horizon to which they refer. Some interventions may have short-term positive impacts but no long-term gains. An evaluation carried out one or two years after an intervention may suggest positive impacts, but those could very well vanish after a few years. Conversely, other interventions may show no clear impact in the short term, but positive impacts later on. Ideally, one would like to have information on both short-term and long-term impacts, but this may not be feasible. Most evaluations, by design, tend to look at short-term rather than long-term impacts.

Implications of this Discussion

The above remarks should make it clear that impact evaluations are no panacea. They can be very useful – and I believe that Rotary should invest more in them for innovative projects that could be scaled up by others if successful – but they are not appropriate for all projects, and they should be designed with care.

I hope that this three-part series has helped some of you to understand better why impact evaluations have become so popular in development and service work, but also why they require hard work to set up well. Again, if you are considering impact evaluations in your service work, please let me know, and feel free to comment and share your own experience on this topic.

Note: This post is part of a series of three on impact evaluations. The three posts are available here: Part 1, Part 2, and Part 3.

 

Impact Evaluations, Part 2: How Are They Done?

by Quentin Wodon

Having argued in the first post of this three-part series that we need more impact evaluations in Rotary, I now turn to the next question: how are such evaluations to be done? One must first choose the evaluation question, and then use an appropriate technique to answer it. The purpose of this post is to briefly describe these two steps. A useful resource for those interested in knowing more is an open access book entitled Impact Evaluation in Practice published by the World Bank a few years ago. The book is thorough, yet not technical (or at least not mathematical), and therefore accessible to a large audience.

As mentioned in the first post in this series, impact evaluations seek to answer cause-and-effect questions such as: what is the impact of a specific program or intervention on a specific outcome? Not every project requires an impact evaluation – but it makes sense to evaluate the impact of selected projects that are especially innovative and relatively untested, replicable at larger scale, strategically relevant for the aims of the organization implementing them, and potentially influential if successful. It is also a good practice to combine impact evaluations with a cost-effectiveness analysis, but this will not be discussed here.

Evaluation Question

An impact evaluation starts with a specific project and a question to be asked about that project. Consider the dictionary project, whereby hundreds if not thousands of Rotary clubs distribute free dictionaries to primary school students, mostly in the United States. This project has been going on for many years in many clubs. In Washington, DC, where I work, local Rotary clubs – and especially the Rotary Club of Washington DC – distribute close to 5,000 dictionaries every year to third graders. Some 50,000 dictionaries have been distributed in the last ten years. This is the investment made in just one city. My guess is that millions of dictionaries have been distributed by Rotarians in schools throughout the US.

The dictionary project is a fun, feel-good activity for Rotarians, and it also helps bring the members of a club together because it is easy for many of them to participate. I have distributed dictionaries in schools several times, the last time with my daughters and two other Interactors. Everybody was happy, especially the students, who received their dictionaries with big smiles. Who could argue against providing free dictionaries in public schools for children, many of whom are from underprivileged backgrounds?

I am not going to argue here against the dictionary project. But for this project, as for many others, I would like to know whether it works to improve the prospects and lives of beneficiaries – in this case the children who receive the dictionaries. It could perhaps be enough to justify the project that the children are happy to receive their own dictionary and that a few use it at home. But the project does have a cost, not only in terms of the direct cost of purchasing the dictionaries, but also in terms of the opportunity cost for Rotarians to go to the schools and distribute them. Rotary clubs could decide to continue the project even if it were shown to have limited or no medium-term impact on various measures of learning for the children. But having information on impact, as well as on potential ways to increase impact, would be useful in deciding whether to continue this type of service project or not. It would not matter much if dictionaries were distributed only by a few clubs in a few schools – but this is a rather large project for clubs in the US.

An impact evaluation question for the project would be of the form: “What is the impact of the distribution of free dictionaries on X?” X could be – among many other possibilities – children’s success rates on an English exam, the propensity of children to read more at home, a measure of new vocabulary gained by children, or an assessment of the quality of the spelling in the children’s writing. One could come up with other potential outcomes that the project could affect. In order to assess impact, one would need to compare students in schools where children did receive dictionaries to students in schools where children did not. This could be done some time after the dictionaries had been distributed.

About two years ago I tried to find out whether any impact evaluation of the dictionary project had been done. I could not find any. Maybe I missed something (let me know if I did), but it seems that this project, which requires quite a bit of funding from clubs as well as a lot of time from thousands of Rotarians every year, has not been properly evaluated. It would be nice to know whether the project actually achieves results. This is precisely what impact evaluations are designed to do.

Evaluation Techniques

In order to estimate project impacts, data collection is required. Typically, quantitative data are used for impact evaluations. For the dictionary project, one could have children take a vocabulary test before receiving the dictionary and again one year after receiving it. One would then compare a “treatment” group (those who received the dictionary) to a “control” group (those who did not). This could be done using data collected specifically for the evaluation, or using existing information such as standardized tests administered by schools; the latter would reduce the cost of the impact evaluation substantially, but would also limit the outcomes considered to those on which students are already tested by the schools.
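To make this concrete, below is a minimal sketch in Python of how one might compare average test-score gains between a treatment and a control group. The scores, sample sizes, and effect sizes are entirely made up for illustration; they are not taken from any actual evaluation of the dictionary project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) vocabulary test scores, before the dictionary
# distribution and one year later, for students in treatment and control schools.
n = 200
pre_treat = rng.normal(50, 10, n)
post_treat = pre_treat + rng.normal(3, 5, n)   # assumed small gain, for illustration only
pre_ctrl = rng.normal(50, 10, n)
post_ctrl = pre_ctrl + rng.normal(1, 5, n)

# Compare average gains between the two groups.
gain_treat = post_treat - pre_treat
gain_ctrl = post_ctrl - pre_ctrl
diff = gain_treat.mean() - gain_ctrl.mean()

# Standard error of the difference in mean gains (unequal variances).
se = np.sqrt(gain_treat.var(ddof=1) / n + gain_ctrl.var(ddof=1) / n)
print(f"Estimated impact on test-score gains: {diff:.2f} points (SE {se:.2f})")
```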

The gold standard for establishing the treatment and control groups is a randomized controlled trial (RCT). Under this design, a number of schools would be randomly selected to receive dictionaries, while other schools would not. Under most circumstances, comparisons of outcomes (say, reading proficiency) between students in schools with and without dictionaries would then yield unbiased estimates of impacts. In many interventions, the randomization is applied to direct beneficiaries – here the students. But for the dictionary project that would probably not work: it would seem too unfair to give dictionaries to some students in a given school and not others, and the impact on some students could spill over to other students, making the impact evaluation less clean than it should be (even if there may be ways to control for that). This issue of fairness in choosing beneficiaries in an RCT is very important, and the design of RCT evaluations typically has to be vetted ethically by institutional review boards (IRBs).
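As a purely hypothetical illustration of what school-level randomization could look like, the snippet below assigns a list of fictitious schools at random to a treatment group (receiving dictionaries) and a control group. The school names are invented; a real evaluation would of course use the actual list of participating schools and document the assignment procedure for the IRB.

```python
import random

random.seed(42)

# Fictitious list of participating schools (identifiers are invented).
schools = [f"School_{i:02d}" for i in range(1, 21)]

# Randomly assign half of the schools to receive dictionaries (treatment)
# and the other half to serve as the comparison group (control).
random.shuffle(schools)
treatment = sorted(schools[:len(schools) // 2])
control = sorted(schools[len(schools) // 2:])

print("Treatment schools:", treatment)
print("Control schools:", control)
```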

A number of other statistical and econometric techniques can be used to evaluate impacts when an RCT is not feasible or appropriate. These include (among others) regression discontinuity design, difference-in-differences estimation, and matching estimation. I will not discuss these techniques here because this would be too technical, but the open access Impact Evaluation in Practice book that I mentioned earlier does this very well.
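For readers curious about one of these techniques, here is a minimal, purely illustrative sketch of a difference-in-differences calculation using made-up average test scores. In practice the estimation would be done with individual-level data and regression methods, as the Impact Evaluation in Practice book explains.

```python
# Difference-in-differences on made-up group averages (illustration only).
# Average test scores before and after the intervention, by group.
treat_before, treat_after = 50.0, 55.0   # schools that received dictionaries
ctrl_before, ctrl_after = 50.0, 52.0     # comparison schools

# The impact estimate is the change over time in the treatment group
# minus the change over time in the control group.
did_estimate = (treat_after - treat_before) - (ctrl_after - ctrl_before)
print(f"Difference-in-differences estimate: {did_estimate:.1f} points")
```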

Finally, apart from measuring the impact of programs through evaluations, it is also useful to better understand the factors that lead to impact or lack thereof – what is often referred to as the “theory of change” for how an intervention achieves impact. The question here is not whether a project is having the desired impact, but why it does or does not. This can be done in different ways, using both qualitative and quantitative data. For example, for the dictionary project, a few basic questions could be asked, such as: 1) did the child already have access to another dictionary at home when he or she received the dictionary provided by Rotary?; 2) how many times has the child looked at the dictionary over the last month?; and 3) did the dictionary provided by Rotary have unique features that led the child to learn new things? Having answers to these types of questions helps in interpreting the results of impact evaluations.

Conclusion

Only so much can be discussed in one post, and the question of how to implement impact evaluations is complex. Still, I hope that this post gave you a few ideas and a basic understanding of how impact evaluations are done, and why they can be useful. If you are considering an impact evaluation, please let me know, and I will be happy to help if I can. In the next and final post in this series, I will discuss some of the limits of impact evaluations.

Note: This post is part of a series of three on impact evaluations. The three posts are available here: Part 1, Part 2, and Part 3.