Impact Evaluations, Part 3: What Are Their Limits?

by Quentin Wodon

In the first post of this series, I argued that impact evaluations can be highly valuable for organizations such as Rotary to assess the impact of innovative interventions that, if successful, could be replicated and scaled up by others. In the second post I suggested that a range of techniques are available to implement impact evaluations. In this third and last post in the series, I would like to mention some of the limits of impact evaluations. Specifically, I will discuss four: (1) limits as to what can be randomized or quasi-randomized; (2) limits in terms of external validity; (3) limits in terms of explanation as opposed to attribution; and finally (4) limits in terms of short-term versus long-term effects.

Can Everything Be Randomized?

The gold standard for impact evaluations is the randomized controlled trial (RCT), as discussed in the second post in this series. When it is not feasible to randomize the beneficiaries of an intervention, statistical and econometric techniques can sometimes be used to assess impact through “quasi-randomization”. But not all types of interventions can be randomized or quasi-randomized. The impact on households of a major policy change in a country, for example, may be hard to randomize.

One example would be the privatization of a large public company with a monopoly in the delivery of a specific good. The company can be privatized, but typically it is difficult to privatize only part of it, so assessing the impact of privatization on households may be hard to do because of the absence of a good counterfactual. Another example would be a major change in the way public school teachers are evaluated or compensated nationally. At times, even with such reforms, it may be feasible to sequence the new policy, for example by covering first some geographic areas and not others, which can provide data and ways to assess impacts. But in many cases the choice is “all or nothing”. Under such circumstances techniques used for impact evaluations may not work. Some have argued that for many of the most important policies that affect development outcomes, the ability to randomize is the exception rather than the rule.

For the types of projects that most Rotary clubs are implementing, it would be hard to argue that randomization is not feasible, at least at some level. This does not mean that all or even most of our projects should be evaluated. But we should recognize that most of our projects are small and local, which makes it easier to randomize (some of) them when appropriate for evaluation. For larger programs or policy changes, one must however be aware that randomization or quasi-randomization is not always feasible.

Internal Versus External Validity

When RCTs or quasi-randomization are used to assess the impact of interventions, the evaluators often pay special attention to the internal validity of the evaluation. For example, are the control and treatment groups truly comparable, so that inferences about impact are legitimate? Careful evaluation design and research help in achieving internal validity.

But while good evaluations can be trusted in terms of their internal validity, do the results also have external validity? Do they apply beyond the design of the specific evaluation that has been carried out? Consider the case of an NGO doing great work in an area of health through an innovative pilot program. If the innovative model of that NGO is found to be successful and scaled up by a Ministry of Health, will the same results be observed nationally? Or is there a risk that with the scale-up, some of the benefits observed in the pilot will vanish, perhaps because the staff of the Ministry of Health are not as well trained or dedicated as the staff of the NGO? There have been cases in which the original promise of pilots did not materialize at scale.

Attribution Versus Explanation

Consider again the example of the dictionary project mentioned in the previous post. An impact evaluation could lead to the conclusion that the project improves some learning outcomes for children, or that it does not. Impact evaluations are great for attributing impacts and establishing cause and effect. But they do not necessarily tell us why an impact is observed or not. For that, an understanding of the context of the intervention is needed. Such context is often provided by so-called process evaluations, as opposed to impact evaluations. There is always a risk that an impact evaluation will be like a black box: impacts can be attributed, but the reasons for success or lack thereof may not be clear. This in turn can be problematic when scaling up programs that were successful as pilots, because scaling up often requires altering some parameters of the original intervention, and without rich context the consequences of those changes may not be known.

Short-Term Versus Long-Term Effects

Another issue with impact evaluations is the time horizon to which they refer. Some interventions may have short-term positive impacts but no long-term gains. An evaluation carried out one or two years after an intervention may suggest positive impacts, but those could very well vanish after a few years. Conversely, other interventions may show no clear impact in the short term, but positive impacts later on. Ideally, one would like to have information on both short-term and long-term impacts, but this may not be feasible. By design, most evaluations tend to look at short-term rather than long-term impacts.

Implications of this Discussion

The above remarks should make it clear that impact evaluations are no panacea. They can be very useful – and I believe that Rotary should invest more in them for innovative projects that could be scaled up by others if successful – but they are not appropriate for all projects, and they should be designed with care.

I hope that this three-part series has helped some of you better understand why impact evaluations have become so popular in development and service work, but also why they require hard work to set up well. Again, if you are considering impact evaluations in your service work, please let me know, and feel free to comment and share your own experience on this topic.

Note: This post is part of a series of three on impact evaluations. The three posts are available here: Part 1, Part 2, and Part 3.

 

Impact Evaluations, Part 2: How Are They Done?

by Quentin Wodon

Having argued in the first post in this series of three that we need more impact evaluations in Rotary, I now turn to the next question: how are such evaluations to be done? One must first choose the evaluation question, and then use an appropriate technique to answer it. The purpose of this post is to briefly describe these two steps. A useful resource for those interested in knowing more is an open access book entitled Impact Evaluation in Practice, published by the World Bank a few years ago. The book is thorough, yet not technical (or at least not mathematical), and therefore accessible to a large audience.

As mentioned in the first post in this series, impact evaluations seek to answer cause-and-effect questions such as: what is the impact of a specific program or intervention on a specific outcome? Not every project requires an impact evaluation – but it makes sense to evaluate the impact of selected projects that are especially innovative and relatively untested, replicable at larger scale, strategically relevant for the aims of the organization implementing them, and potentially influential if successful. It is also a good practice to combine impact evaluations with a cost-effectiveness analysis, but this will not be discussed here.

Evaluation Question

An impact evaluation starts with a specific project and a question to be asked about that project. Consider the dictionary project whereby hundreds if not thousands of Rotary clubs distribute free dictionaries to primary school students, mostly in the United States. This project has been going on for many years in many clubs. In Washington DC where I work, local Rotary clubs – and especially the Rotary Club of Washington DC – distribute close to 5,000 dictionaries every year to third graders. Some 50,000 dictionaries have been distributed in the last ten years. This is the investment made in just one city. My guess is that millions of dictionaries have been distributed by Rotarians in schools throughout the US.

The dictionary project is a fun, feel-good activity for Rotarians, which also helps bring members of a club together because it is easy for many members to participate. I have distributed dictionaries in schools several times, the last time with my daughters and two other Interactors. Everybody was happy, especially the students, who received their dictionaries with big smiles. Who could argue against providing free dictionaries in public schools to children, many of whom are from underprivileged backgrounds?

I am not going to argue here against the dictionary project. But for this project, as for many others, I would like to know whether it works to improve the prospects and lives of beneficiaries, in this case the children who receive the dictionaries. It could perhaps be enough to justify the project that the children are happy to receive their own dictionary and that a few use it at home. But the project does have a cost, not only the direct cost of purchasing the dictionaries, but also the opportunity cost of the time Rotarians spend going to schools to distribute them. Rotary clubs could decide to continue the project even if it were shown to have limited or no medium-term impact on various measures of learning for the children. But having information on impact, as well as on potential ways to increase impact, would be useful in making appropriate decisions about whether to continue this type of service project. It would not matter much if dictionaries were distributed only by a few clubs in a few schools, but this is a rather large project for clubs in the US.

An impact evaluation question for the project would be of the form: “What is the impact of the distribution of free dictionaries on X?” X could be, among many other possibilities, the success rate on an English exam for the children, the propensity of children to read more at home, a measure of new vocabulary gained by children, or an assessment of the quality of spelling in the children’s writing. One could come up with other potential outcomes that the project could affect. In order to assess impact, one would need to compare students in schools where children received dictionaries to students in schools where children did not. This could be done some time after the dictionaries have been distributed.

About two years ago I tried to find out whether any impact evaluation of the dictionary project had been done. I could not find any. Maybe I missed something (let me know if I did), but it seems that this project, which requires quite a bit of funding from clubs as well as a lot of time from thousands of Rotarians every year, has not been properly evaluated. It would be nice to know whether the project actually achieves results. This is precisely what impact evaluations are designed to do.

Evaluation Techniques

In order to estimate project impacts, data collection is required. Quantitative data are typically used for impact evaluations. For the dictionary project, one could have children take a vocabulary test before receiving the dictionary and again one year after having received it. One would then compare a “treatment” group (those who received the dictionary) to a “control” group (those who did not). This could be done using data collected specifically for the evaluation, or using other information such as standardized tests administered by schools. Relying on existing tests would reduce the cost of the impact evaluation substantially, but it would also limit the outcomes considered to those on which schools already test students.
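To make this concrete, here is a minimal sketch in Python of such a treatment-versus-control comparison. The test scores and group sizes are entirely made up for illustration and are not drawn from any actual evaluation.

```python
import numpy as np
from scipy import stats

# Hypothetical post-intervention vocabulary test scores (0-100 scale)
treatment_scores = np.array([72, 68, 75, 80, 66, 71, 78, 74])  # students whose schools received dictionaries
control_scores = np.array([70, 65, 69, 73, 64, 68, 72, 67])    # students in comparison schools

# Estimated impact as the simple difference in mean scores between the groups
effect = treatment_scores.mean() - control_scores.mean()

# Two-sample t-test (unequal variances) to gauge whether the difference
# could plausibly be due to chance alone
t_stat, p_value = stats.ttest_ind(treatment_scores, control_scores, equal_var=False)

print(f"Estimated impact (difference in means): {effect:.1f} points")
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.3f}")
```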

The gold standard for establishing the treatment and control groups is the randomized controlled trial (RCT). Under this design, a number of schools would be randomly selected to receive dictionaries, while other schools would not. Under most circumstances, comparisons of outcomes (say, reading proficiency) between students in schools with and without dictionaries would then yield unbiased estimates of impacts. In many interventions, the randomization is applied to direct beneficiaries, here the students. But for the dictionary project that would probably not work: it would seem too unfair to give dictionaries to some students in a given school and not others, and the impact on some students could spill over to other students, making the impact evaluation less clean than it should be (even if there may be ways to control for that). This issue of fairness in choosing beneficiaries in an RCT is very important, and the design of RCT evaluations typically has to be vetted ethically by institutional review boards (IRBs).
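As a hypothetical illustration of school-level random assignment, the short sketch below splits a list of invented school names evenly into treatment and control groups; the names, the 50/50 split, and the fixed seed are assumptions made only for the example.

```python
import random

# Hypothetical list of participating schools (names are illustrative only)
schools = [f"School {i}" for i in range(1, 21)]

# Randomly assign half of the schools to receive dictionaries (treatment)
# and the other half to serve as the comparison group (control)
random.seed(42)  # fixed seed so the assignment can be reproduced and audited
random.shuffle(schools)
treatment_schools = sorted(schools[: len(schools) // 2])
control_schools = sorted(schools[len(schools) // 2 :])

print("Treatment schools:", treatment_schools)
print("Control schools:", control_schools)
```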

A number of other statistical and econometric techniques can be used to evaluate impacts when an RCT is not feasible or appropriate. These include, among others, regression discontinuity designs, difference-in-differences estimation, and matching estimators. I will not discuss these techniques here because this would be too technical, but the open access Impact Evaluation in Practice book mentioned earlier covers them very well.
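As one example of these techniques, a basic difference-in-differences calculation can be sketched in a few lines; the before-and-after averages below are invented purely for illustration.

```python
# Hypothetical average test scores before and after the intervention,
# for schools that received dictionaries (treatment) and schools that did not (control)
treatment_before, treatment_after = 60.0, 68.0
control_before, control_after = 61.0, 64.0

# Difference-in-differences: the change in the treatment group minus the
# change in the control group, which nets out trends common to both groups
did_estimate = (treatment_after - treatment_before) - (control_after - control_before)

print(f"Difference-in-differences estimate of impact: {did_estimate:.1f} points")
```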

Finally, apart from measuring the impact of programs through evaluations, it is also useful to better understand the factors that lead to impact or lack thereof, which is often referred to as the “theory of change” for how an intervention achieves impact. The question here is not whether a project is having the desired impact, but why it does or does not. This can be investigated in different ways, using both qualitative and quantitative data. For example, for the dictionary project, a few basic questions could be asked, such as: 1) did the child already have access to another dictionary at home when he or she received the dictionary provided by Rotary?; 2) how many times has the child looked at the dictionary over the last month?; and 3) did the dictionary provided by Rotary have unique features that led the child to learn new things? Having answers to these types of questions helps in interpreting the results of impact evaluations.

Conclusion

Only so much can be discussed in one post, and the question of how to implement impact evaluations is complex. Still, I hope that this post gave you a few ideas and a basic understanding of how impact evaluations are done, and why they can be useful. If you are considering an impact evaluation, please let me know; I will be happy to help if I can. In the next and final post in this series, I will discuss some of the limits of impact evaluations.

Note: This post is part of a series of three on impact evaluations. The three posts are available here: Part 1, Part 2, and Part 3.

Rotary Foundation Basics, Part 3: What’s Great, What Could Be Improved?

by Quentin Wodon

This last post in a series of three on The Rotary Foundation (TRF) looks at what is great about the foundation, and what could probably be improved. TRF support for Rotary projects is first discussed, based on my own perceptions and those of a few fellow Rotarians to whom I talked before writing this post. Ratings received by the foundation as a charity are then briefly reviewed.

TRF Support for Rotary Projects

On the plus side, TRF support for polio has been instrumental in the near eradication of the disease, as mentioned in the previous post in this series. The focus on polio has also helped Rotary in getting a seat at the table with major partners such as the World Health Organization and the Bill and Melinda Gates Foundation. Even more importantly for Rotarians involved in service projects, the matching system whereby TRF co-funds grants is well appreciated. Both district and global grants benefit from TRF support, but I will focus in this post on global grants.

TRF provides up to $200,000 in matching funds for global grants, with a minimum match of $15,000, for projects that reach a minimum size of $30,000 in overall cost/funding. The system for global grants has been fundamentally revised in recent years in order to have fewer but larger grants, which should help ensure that projects have a bigger impact on the ground and are well managed. Six areas of focus have been selected for the grants, which also helps narrow down the scope of what is funded (even if this scope remains fairly broad). The rules of the game for putting together global grants are clear, which also helps.
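As a purely illustrative sketch, the funding thresholds cited above can be expressed as a simple eligibility check; actual TRF rules involve additional requirements (for example, how club, district, and World Fund contributions combine), which are not modeled here.

```python
def meets_global_grant_thresholds(total_budget: float, requested_match: float) -> bool:
    """Check only the dollar thresholds cited above: a project of at least
    $30,000 overall, with a TRF match between $15,000 and $200,000.
    This is an illustration, not TRF's actual approval logic."""
    MIN_PROJECT, MIN_MATCH, MAX_MATCH = 30_000, 15_000, 200_000
    return total_budget >= MIN_PROJECT and MIN_MATCH <= requested_match <= MAX_MATCH

print(meets_global_grant_thresholds(45_000, 15_000))    # True
print(meets_global_grant_thresholds(25_000, 15_000))    # False: project below $30,000
print(meets_global_grant_thresholds(500_000, 250_000))  # False: match above $200,000
```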

In terms of potential areas for improvement, the Grants Online System may not be as user-friendly as it could be, given today’s technology. Several Rotarians mentioned to me that there may also at times be issues with the grant review process. Hopefully reviewers are as objective and qualified as they should be, but this is something that could be assessed. In addition, despite efforts to help Rotarians put together great global grants, more could be done in terms of e-learning resources and other tools to help the membership develop impactful projects, beyond the management and processing aspects of grants.

Many global grants are complex and require substantial expertise. It is not always clear that project teams have enough expertise. The system relies largely on volunteer hours to prepare and implement grants. This helps not only for cost savings but also for getting Rotarians’ hands dirty. Personal experiences gained through hands-on work are invaluable, especially when working directly with project beneficiaries. But it may be useful in some cases to rely more on external paid expertise, especially for large grants. In principle Rotarians can get help from Rotarian Action Groups (RAGs) for the design and implementation of projects. These are great resources, but it is not fully clear how active and effective some of the RAGs are.

One area of concern is the ability of TRF to respond to crises, with the most recent case being Ebola in West Africa. There are two issues here. One issue is fundraising. TRF does not seem to have a good system to provide incentives (read matching funds) for individual Rotarians to donate in times of crisis. Many Rotarians donate when a major crisis hits, but they often do so through other organizations because TRF does not have a good system to attract these donations. If TRF could set aside funds to match individual donations by Rotarians for major crises, this could help the foundation raise more funds. It would also help TRF gain in visibility as a humanitarian organization. The other issue is about the allocation of the funds that could be raised. Part of the funds could be allocated to Rotary clubs in affected countries for their projects to respond to crises with some type of fast track approval. Part of the funds could also be transferred to well established national and international NGOs active on the ground in responding to crises. Overall, setting up a stronger crisis response mechanism within TRF could strengthen the Rotary brand while providing much needed rapid support to vulnerable groups in countries affected by major crises.

Finally, more expertise and commitment from TRF are needed for proper monitoring and evaluation of global grants, and for disseminating the results of such evaluations. My perception is that few projects are evaluated in depth with baseline and endline data collection to assess impact. Impact evaluations can be expensive, so not all projects should be evaluated in that way. But more should be done in this area, including in partnership with some of the NGOs implementing TRF projects. If TRF could fund more innovative projects that would be evaluated seriously, it could have a larger impact, because other organizations with more resources could then bring successful TRF pilots to scale.

Ratings for TRF as a Charity

The comments above point to some great features of TRF, but also to some potential areas for improvement. One should not forget, however, that overall TRF is very well rated as a charity. Given that many of the followers of this blog are new, let me repeat here what I mentioned on TRF ratings a few months ago on this blog, as well as in another post for Rotary Voices.

In the US, Charity Navigator provides ratings for charities. Three ratings are available: one for financial performance, one for accountability and transparency, and one combining both. Charities can receive one to four stars overall. TRF has the highest possible rating (four stars). The yellow dot in the Figure below shows exactly how the foundation is rated: it has a score of 89.8 out of a maximum of 100 for financial performance and 97.0 for accountability and transparency, which yields a four-star rating overall.

[Figure: Charity Navigator rating of The Rotary Foundation (RI Foundation Graph)]

For financial performance, Charity Navigator considers seven main indicators: the share of the charity’s budget spent on programs and services, the share spent on administrative expenses, the share spent on fundraising expenses, the fundraising efficiency ratio, primary revenue growth, program expenses growth, and the working capital ratio. Details are available on the Charity Navigator website. For accountability and transparency, a total of 17 indicators are used. TRF could have scored even higher were it not for the fact that its donor privacy policy requires donors to opt out if they do not want their basic information (potentially) shared with other charities.
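To illustrate how some of these indicators are computed, the sketch below uses hypothetical financial figures; the ratios shown are commonly used ones and do not reproduce Charity Navigator’s exact scoring method.

```python
# Hypothetical annual figures for a charity (illustrative only)
program_expenses = 90_000_000       # spending on programs and services
administrative_expenses = 5_000_000
fundraising_expenses = 8_000_000
total_contributions = 110_000_000   # donations received during the year

total_expenses = program_expenses + administrative_expenses + fundraising_expenses

program_share = program_expenses / total_expenses        # share of budget spent on programs
admin_share = administrative_expenses / total_expenses   # share spent on administration
fundraising_share = fundraising_expenses / total_expenses
fundraising_efficiency = fundraising_expenses / total_contributions  # cost to raise one dollar

print(f"Programs: {program_share:.1%} of expenses")
print(f"Administration: {admin_share:.1%} of expenses")
print(f"Fundraising: {fundraising_share:.1%} of expenses")
print(f"Fundraising efficiency: ${fundraising_efficiency:.2f} spent per dollar raised")
```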

Conclusion

Overall, TRF helps fund great projects on the ground, and it is also well rated as a charity. The reform of the global grants model over the last few years, defining areas of focus and implementing fewer but larger grants, was smart. But as for any other organization, there are areas where TRF could probably do better, especially in terms of the user-friendliness of the Grants Online System, ensuring that project teams have the expertise they need, the ability to respond to humanitarian crises, and better evaluating the impact of projects that appear especially innovative. What do you think?

Note: This post is part of a series of three on TRF: Part 1, Part 2, Part 3.

Reducing the Gender Gap in Education

by Quentin Wodon

The International Day of the Girl Child earlier this month was an opportunity to remind ourselves that girls are among the primary victims of violence, and that in many countries they continue to have limited education and employment opportunities.

There has been substantial progress towards gender equity in basic education, but large gaps remain at the secondary level. In the Figure below, from the World Bank’s just-published Global Monitoring Report (GMR), countries are ranked on the horizontal axis according to GDP per capita, and gaps in secondary school completion by gender are displayed on the vertical axis. The size of each dot represents the country’s population. Data are shown for sub-Saharan African countries in orange and South Asian countries in blue. On average, a boy remains 1.55 times more likely than a girl to complete secondary school in the countries in the sample, and the gaps are larger in poorer countries. But there is also a lot of variation around the regression line, suggesting that it is feasible to reduce gender gaps in attainment even in low-income countries.

Ratio of Secondary School Completion Rates by Gender


Source: World Bank Global Monitoring Report.
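As a small illustration of how the ratio plotted in the figure is computed, the sketch below uses hypothetical completion rates chosen so that the result matches the 1.55 average cited above.

```python
# Hypothetical secondary school completion rates (share of the age cohort completing)
completion_rate_boys = 0.31
completion_rate_girls = 0.20

# Ratio shown in the figure: a value of 1 would indicate gender parity,
# while values above 1 indicate that boys are more likely to complete
gender_gap_ratio = completion_rate_boys / completion_rate_girls

print(f"Boys are {gender_gap_ratio:.2f} times as likely as girls to complete secondary school")
```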

Multiple reasons may explain why boys and girls drop out before completing secondary school. For example, in a 2012/13 survey in Uganda, parents mentioned the cost of education as the main reason for dropping out for both boys and girls. The fact that a child was not willing to continue his or her education came up next, but for girls an even more important reason for dropping out was pregnancy, often linked to early marriage. A sickness or calamity in the family was also mentioned as a reason for dropping out, as was the fact that some children did not make enough progress in school. When similar questions were asked of head teachers, differences between boys and girls emerged even more clearly. For boys, lack of interest and employment were key reasons for dropping out. For girls, pregnancies and child marriage came up strongly, with these in turn likely related to poverty and limited employment prospects as well as cultural factors.

Because multiple reasons may contribute to gender gaps in attainment, the types of interventions that could be implemented to reduce these gaps are also multiple. Should the distance to schools be reduced, whether by building new schools in remote areas or by cutting travel time through public transportation? Should scholarships be provided to girls, as successfully pioneered by Bangladesh several decades ago? Should more female teachers be hired? Should the priority be to make separate toilet blocks available for boys and girls? Should more focus be placed on understanding and changing cultural practices? Choosing between these and many other potential interventions is often difficult, and responses clearly depend on country context. But reviews of the evidence can help, and such reviews are becoming more available thanks to a substantial increase in rigorous impact evaluations in recent years.

One such review was published in June 2014 by a team of academics led by UNESCO and funded by the UK’s Department for International Development. The review assessed the evidence on the impact of interventions for girls’ education focusing on (i) providing resources (including transfers) and infrastructure, (ii) changing institutions, and (iii) changing norms and including the most marginalized in education decision making. The review summarized the impact of different types of interventions on three outcomes: participation, learning, and empowerment. For each type of intervention and category of outcome, the evidence on the likelihood of impact was classified as strong, promising, limited, or needed (i.e., weak).

For participation, the evidence on the impact of conditional cash transfers, information about the potential employment returns to education, and the provision of additional schools in underserved and unsafe areas was found to be strong. This was also the case for the evidence on some interventions related to teacher training, group-learning, and measures to promote girl-friendly schools as well as learning outside the classroom, for example through tutoring. Several of these interventions (group-learning, programs for learning outside the classroom, and scholarships linked to student performance) were also found to have clear impacts on learning. The evidence on the impact of interventions on empowerment was generally found to be weaker.

This type of review, and the studies on which such reviews are based, are of high value for policy-makers. The World Bank has also started to put together a systematic database of impact evaluations, and its Strategic Impact Evaluation Fund is providing funding for rigorous evaluations. What else is needed? We need more experiments and evaluations. But we also need assessments of the cost-effectiveness of various types of interventions, so that Ministries of Education can make the right decisions under their budget constraints. And we need more research on the political economy of program expansion, to understand how great innovations can be scaled up and sustained.

Note: This post is reproduced with minor edits from a post published on October 29, 2014 on the World Bank’s Let’s Talk Development blog, available at https://blogs.worldbank.org/developmenttalk/reducing-gender-gap-education