How to use Gradepro to conduct a GRADE based review
… in which I describe a basic tutorial
Introduction and some background
In this tutorial I am going to show you how you can use GRADEpro to build a simple evidence portfolio for a study. An evidence portfolio combines a summary of findings from one or more studies on a defined outcome with a table that lays out the rationale for the decision on the quality of the evidence for that outcome. Both activities are based on the principles of the GRADE (Grading of Recommendations, Assessment, Development and Evaluation) process. GRADEpro is a web based app, and you can download and install it on your chosen browser (Firefox, Chrome, Safari, Edge or another one). If you are familiar with the principles of GRADE and know how to use the GRADEpro app, you will be able to develop a set of guidelines or construct the tables of evidence for a systematic review for your research. Let’s first briefly discuss the principles of GRADE, and then see how you can use GRADEpro to develop tables for systematic reviews, health technology assessments and guidelines. This is a very simple introductory tutorial to get you started with GRADEpro. In subsequent tutorials I will show you how you can use it for health technology assessments or guidelines, or indeed other forms of evidence synthesis.
Principles of GRADE
GRADE is an acronym for “Grading of Recommendations, Assessment, Development, and Evaluations”. This grading refers to the quality of evidence presented for health care decision making and drafting of guidelines. GRADE is based on the following concepts:
- First, in GRADE, the emphasis is on outcomes across studies. In contrast, in systematic reviews and meta analyses, the emphasis is usually placed on studies across outcomes. This is an interesting difference: GRADE is used for developing treatment guidelines, rather than for assessing the internal validity of individual studies and abstracting information on all of their outcomes.
As a result, in GRADE, we can assess all kinds of outcomes, beneficial as well as harmful. For example, consider medications and interventions for the relief of neck pain. While relief from neck pain is a beneficial or desirable outcome, side effects of medications such as peptic ulcer or renal failure are adverse outcomes that must also be considered in the balance of benefits and harms if you want to develop guidance for clinicians and patient advocates on which pain relieving agents a patient should use or a physician can prescribe. Hence, while a systematic review of the effectiveness or efficacy of specific medications might consider only benefits (such as relief of neck pain) and underplay harms, in evaluating the quality of evidence to develop real world advice, both harms and benefits need to be mapped and taken into consideration.
- Second, in GRADE, you can study the effectiveness or efficacy of different interventions, or how effective a screening programme or a diagnostic tool is. To do this, we must only consider studies that compare at least two interventions or two tests, rather than purely observational studies where a course of action is followed over time or a risk factor is characterised. Hence, GRADE (and therefore GRADEpro) is useful for testing interventions or diagnostic or screening procedures for a defined outcome or a set of defined outcomes. Do not use GRADE for reporting results of observational epidemiological studies in order to map risk factors. What this means is: when we use GRADE and GRADEpro, we are limited either to controlled trials (where no intervention, a placebo, or an existing alternative intervention serves as the comparison for the effectiveness and efficacy of an intervention) or to another type of experimental or quasi-experimental study design where two competing interventions are studied. Again, this is a point of departure from systematic reviews. In a systematic review (but not a meta analysis), you can summarise any study or any type of study for virtually any kind of research question.
- Third, in GRADE we appraise the quality of evidence. This appraisal is not the same as the appraisal of quality in systematic reviews, or the appraisal of single studies such as a randomised controlled trial. In the most common form of critical appraisal, whether for single studies or for aggregates of studies as in a systematic review, what we mean by quality is internal validity. “Internal validity” here refers to whether a study or its findings are “correct” or “believable”, that is, whether they appropriately answer the research question they set out to answer. When we appraise internal validity, we ask whether there are biases in the study so serious that they make the findings null and void. The term bias here refers to systematic errors on the part of the investigators or the respondents in conducting a trial or reporting an observational study. For example, in a randomised controlled trial the investigator will have minimised selection bias through the randomisation process itself; nevertheless, if the participants are not blinded, the trial may still end up with information bias.
On the other hand, the appraisal of quality in GRADE is different and is based on the question, as to what will we do with the information or how shall we use this information in real life. In GRADE therefore, quality appraisal is more complex. To make it somewhat approachable, quality appraisal in GRADE is based on a scoring algorithm.
In this scoring algorithm, one assigns quality scores, stars or plus signs to categories of evidence pertaining to an outcome. Remember that in GRADE we do not worry about studies as much as we are concerned about specific outcomes, and we use the outcomes to come to a decision about the usability of an intervention to achieve a specific outcome. Therefore, in GRADE we have a starter condition. We assign an initial high score to randomised controlled trials, and a lower score to observational study designs. You may argue that this is really quite pointless; but then, this is just a starter. Then, for both classes of studies, we read the studies closely and award merit and demerit points; finally, for a given outcome and a given intervention, we arrive at a summary quality estimate. We will learn how we do so in this tutorial. This is different from quality appraisal of RCTs or of any other studies in systematic reviews and meta analyses.
How studies are awarded high or low quality points in GRADE
In GRADE, the quality of a body of evidence for a particular outcome with respect to an intervention (or an intervention-control pair) is rated on a scale from four pluses (or stars) down to one. The levels are as follows:
| Rating | What does that mean? |
| Four pluses | High: further evidence is very unlikely to change our confidence in the estimate |
| Three pluses | Moderate: further evidence may change the estimate |
| Two pluses | Low: further evidence is very likely to change the estimate |
| One plus | Very low: any estimate of the effect is very uncertain |
The basis, therefore, is: how much confidence can we have in the existing set of evidence at hand? Can we say that, based on the available evidence, we can proceed with the stated effectiveness of the intervention (either it is effective or not) and that this is the final verdict? Or should we wait for more evidence to emerge, which may modify our decision? This is a central point in GRADE style quality appraisal: our emphasis is on the use of this information. As the evidence hierarchy places meta analyses and randomized controlled trials at the highest level, we start with clinical trials and assign them four stars. Observational study designs (that is, all studies that are not experimental study designs) start lower; in the GRADE guidance they start at two stars. Then, after we have done so, we upgrade and downgrade the evidence based on the following eight points.
What decisions or points do we consider for downgrading evidence?
- Risk of Bias. On careful review of the body of evidence, what biases could account for the findings? If you are reviewing meta analyses and RCTs, check for blinding and intention to treat analyses. For observational study designs, read the study designs and methods carefully. In particular, check whether there were important differences between the comparison groups that were not accounted for. If you suspect a risk of bias, downgrade. If the risk of bias is serious, deduct one point; if it is very serious, deduct a maximum of two points.
- Inconsistency of the findings. Remember that in the GRADE framework, we consider outcomes and interventions in assessing a body of evidence. If you cannot find meta analyses but find several different experimental studies, use these studies instead and conduct your own meta analysis (assuming that you have randomised controlled trials that you can synthesise).
When you have a meta analysis or systematic review, test the heterogeneity of the studies. If you are working from a meta analysis such as a Cochrane Review, check the I² statistic; as a rule of thumb, if the I² statistic is less than 30%, you may assume that heterogeneity is not a problem. Another way to assess heterogeneity is to use the Q statistic. If the Q statistic is not statistically significant, say at the 5% level, then you may state that heterogeneity is not a serious problem in this set of studies. In any case, if heterogeneity exists, can you explain it on the basis of the populations included in the studies, the interventions, or differences in the outcomes?
- Imprecision. Examine the point estimate and the 95% confidence interval of the summary estimate (for example, an odds ratio) if you are working with meta analyses or systematic reviews, or of the relative risk estimate if you are dealing with individual studies. It may be helpful to think of imprecision in terms of the point estimate and the confidence interval band around it. If the 95% confidence interval straddles the null value (this will be “0” where the outcome is measured as a risk difference, and “1” where relative risk is used as the effect measure), then the findings are imprecise. On the other hand, if the lower and upper limits of the 95% confidence interval point in the same direction, then no points are deducted and the imprecision is not serious. In addition to the point estimate and the 95% confidence interval, the researcher is also advised to estimate what is referred to as the Optimal Information Size (OIS). This is pertinent when the researcher has to rely on meta analyses and systematic reviews. In these situations, the researcher estimates the sample size that a single trial would need in order to detect a clinically relevant effect, and compares it with the combined sample size of the meta analysis or systematic review. If the estimated sample size is higher than the sample size reported in the systematic review or RCT, then the researcher would conclude that the OIS has not been met and would deduct a point.
- Indirectness of the estimate. This is where you examine how the outcome is reported. If the investigators have used a proxy for the outcome rather than measuring it directly, then they have used an indirect measure. As an example, say that in a study of death from heart disease, the authors have used LDL-C as an indicative measure for myocardial infarction as an outcome. While LDL-C is itself an important measure, we would consider LDL-C a proxy or indirect measure of the risk of myocardial infarction. When you see this in a report that you are using for GRADE, you will downgrade the evidence quality.
- Publication Bias. Consider two studies that investigate the same intervention-outcome combination. One is a large study with hundreds of participants that reports a modest effect size with a narrow 95% confidence interval around the point estimate. The other is a small study with an equivocal estimate and a wide interval. Which study do you think has the better chance of getting published in a prestigious peer-reviewed journal? Also, while on this topic, if one study gets published and the other does not, what does that mean for the overall summary estimate when someone like you is collating the results of _all_ such studies? When large studies with positive effects are favoured and published over small studies with equivocal findings, this leads to a bias, and this bias is referred to as “publication bias”. When you are summarising results of studies you must pay attention to the possibility of publication bias; when you find evidence of publication bias and you are using GRADE to assimilate study findings, you should downgrade the quality of evidence for that particular intervention-outcome combination. This is easy when you work with already conducted meta analyses, as reporting on publication bias is mandatory for them. So, you are in business if you are dealing with meta analyses such as Cochrane style meta analyses. However, on several occasions, you are on your own. In meta analyses, it is usual to report what is referred to as a funnel plot. In a funnel plot, the investigators plot the effect size on the x-axis and, on the y-axis, the sample size of the study from which the effect estimate was calculated.
If there were no publication bias, you would expect the smaller and the larger studies to be equally distributed with respect to their effect estimates; more typically, larger studies with positive or large effects are over represented in the literature, and smaller studies and studies with negative effects are under represented or absent. The graph would appear like a funnel, with a small number of studies with large sample sizes and effect estimates close to the summary estimate at the top, and all other studies evenly scattered below them within the funnel. If there is a gross deviation from the funnel shape, you suspect that there is considerable publication bias in the synthesis. If you sense publication bias, deduct points from the overall quality of evidence.
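Two of the checks in the list above reduce to simple arithmetic, so a small sketch may help. The first function derives I² from Cochran's Q and the number of studies using the usual formula I² = 100 × (Q − df) / Q, floored at zero; the second checks whether a confidence interval includes the null value. The function names and the example numbers are hypothetical and are not taken from any review.

```python
def i_squared(q, n_studies):
    """Return I² (as a percentage) from Cochran's Q and the number of studies."""
    df = n_studies - 1  # degrees of freedom of the Q statistic
    if q <= 0 or df <= 0:
        return 0.0
    # I² is the share of the variation in Q beyond what chance alone would give.
    return max(0.0, (q - df) / q) * 100.0


def crosses_null(ci_low, ci_high, null_value):
    """Return True when the confidence interval includes the null value.

    Use null_value=0.0 for differences (risk difference, mean difference)
    and null_value=1.0 for ratios (relative risk, odds ratio).
    """
    return ci_low <= null_value <= ci_high


# Hypothetical numbers for illustration only.
print(i_squared(q=20.0, n_studies=11))             # 50.0
print(crosses_null(0.80, 1.30, null_value=1.0))    # True: a ratio CI that straddles 1
print(crosses_null(-25.0, -10.0, null_value=0.0))  # False: both limits point the same way
```

With these made-up inputs, an I² of 50% is above the 30% rule of thumb, so heterogeneity would need a closer look, and the first interval would count as imprecise.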
So these are the five points on the basis of which you can downgrade the quality of the evidence you appraise for a particular outcome. You can also upgrade the quality of evidence if you find information that fulfils the following three criteria:
- Large effect size. This is particularly relevant for observational studies. If the effect measure reported in the study is of the order of, say, a relative risk of 3.0 or higher, this will qualify as a large effect size. In that case, increase the quality score.
- Control for all possible confounding factors. This is usually a good reason why RCTs are rated higher than observational studies, but if you find that your observational study has accounted for the main confounding variables (judged from your understanding of the exposure or intervention variable and the outcome variable), then award a higher quality score to the evidence presented.
- Evidence of dose response effect. Check the results section; if for different levels of the exposure or the intervention variable you see correspondingly higher or lower levels of the outcome, award higher points. Increase by one or two points.
So, based on these three plus points and five minus points, you take different combinations of outcomes and interventions and draw up a series of tables. How to construct those tables of evidence, in the form of a summary of findings and an evidence portfolio, is described below.
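The scoring described above can be sketched in a few lines. This is only an illustration of the arithmetic, not how GRADEpro itself is implemented. It assumes the GRADE guidance convention in which randomised trials start at four pluses and observational studies at two, that "serious" costs one point and "very serious" two, and that the result is kept between one and four pluses. The names `grade_rating`, `START_LEVEL`, `DOWNGRADE` and `LABELS` are hypothetical.

```python
# Assumed starting levels and labels, following the GRADE guidance.
START_LEVEL = {"rct": 4, "observational": 2}
LABELS = {4: "high", 3: "moderate", 2: "low", 1: "very low"}
DOWNGRADE = {"not serious": 0, "serious": 1, "very serious": 2}


def grade_rating(design, downgrades, upgrades=None):
    """Return (number of pluses, label) for one outcome.

    design     -- "rct" or "observational"
    downgrades -- dict mapping a domain (e.g. "inconsistency") to a judgement
                  ("not serious", "serious" or "very serious")
    upgrades   -- dict mapping a criterion (e.g. "large effect") to points added
    """
    level = START_LEVEL[design]
    level -= sum(DOWNGRADE[judgement] for judgement in downgrades.values())
    level += sum((upgrades or {}).values())
    level = max(1, min(4, level))  # keep the rating within one to four pluses
    return level, LABELS[level]


# Illustrative judgements for a body of randomised trials.
judgements = {
    "risk of bias": "not serious",
    "inconsistency": "serious",
    "indirectness": "not serious",
    "imprecision": "not serious",
    "publication bias": "not serious",
}
print(grade_rating("rct", judgements))  # (3, 'moderate')
```

In practice the judgements themselves are the hard part; the sum only records them, and reviewers are expected to explain each downgrade or upgrade in the evidence profile.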
Log in first or create an account and log in
When you log in, you get to see this:
GRADEpro is organised around projects. Each study you do in GRADEpro is a project. A project can contain many questions that are put together to synthesise the evidence. So, we start with a new project, and let’s call it “Neck Pain Project”. In this project we shall create an evidence profile of all interventions that relate to the relief of neck pain in individuals and study different outcomes (neck pain relief, adverse effects). In the process we shall examine what evidence exists on which health technologies (drugs, devices, procedures) work best for relief of neck pain and what can be done.
So, click on “New Project”, this will bring you up to the following window:
See that the fields are all filled in. The Name is the name of the project. Here, we shall create an evidence profile; you could also create just a summary of findings table, an Evidence to Decision Framework, or a Guideline from Scratch. We will explore each type of project below.
Once you hit “Create Project”, it will result in the following window where we have now activated all modules.
Right now, we are only interested in comparing different interventions. In a future tutorial, I will show you that you can actually create a full guideline from scratch, or create your own guideline for practice based on available evidence. As this is team work (although you can work on your own), you can set tasks for teams, and so on. This way it becomes a complete tool for you to develop guidelines for your own use and practice. Here, we shall keep things simple and click “Comparisons”, and then choose whether we want to add a diagnostic question or a management question. Here, we want to add a management question, as we would like to know what treatment would best help in relief of neck pain.
Then we add a management question as follows:
At this stage, let’s grab the PDF copy of the systematic review and meta analysis of an article that looked at low level laser therapy for neck pain. You can find the paper here, and we will use this paper to extract data and fill in the form. As we do so, I will explain the various terms and the different concepts. In real life, we will use similar systematic reviews and meta analyses but we will also find our own primary studies, review the reference lists of these reviews and identify newer studies, and add/edit the study lists to develop our own guidelines or advisory. Here, for the sake of learning how to use Gradepro, we will use just this systematic review to learn about the process of Gradepro. You can work on other meta analyses and systematic reviews to produce your own evidence portfolio for other study questions.
Step 1: First, get or download the article from here
Step 2: Open the article and follow along
Here are the sixteen studies they covered in this systematic review and meta analyses on which we will also work. You can do additional meta analyses on your own if you like.
Once you save the entries shown in the image above (Figure 6), you will see a window that looks as follows:
Now we will add the outcome that we are interested in. In real life, you will be adding many outcomes. Some of these outcomes will be beneficial outcomes, others will be harm related outcomes. A beneficial outcome in this case is “relief from neck pain”; a harmful outcome might be “nausea” or “increased pain”. In this review, the authors did not report harms, as they failed to identify harm related studies or the studies they selected for this review did not contain any reference to harms. Hence, we are somewhat limited in the number of outcomes we can cover here. In real life, you will create a guideline only when you have information on both harms and benefits. Here, we shall only cover chronic neck pain relief, and we will see what we can say based on this article alone.
So, go ahead, and click “Add Outcome”. Here, we are going to add pain relief and after we
Now we fill in the details of the studies. We will use 14 studies as in Figure 4 of the paper. See:
In order to fill in these boxes, not only do we need to access the meta analysis, but we will also need to conduct our own analyses where needed, and access the original RCTs to read more about them. This is a must. So, for this particular outcome, after we fill in the details, this is what the boxes are going to look like:
As can be seen in the above figure, based on the description in the article, we have added the following information:
- The number of studies. We kept this at 14, because this was the number of studies that measured relief from neck pain as an outcome.
- Study design. All of them were randomized controlled trials.
- Risk of Bias. From the description of the studies, all of them were either single or double blinded, and all of them used randomisation that was described clearly. In this particular case, based on all these studies, we would rank the risk of bias as “not serious”. If you were to read studies closely in other settings, you might find that some of them had serious errors. In that case, you may have to isolate those studies and report separately on their quality of evidence. In our case, at least in this situation, we’d like to leave all of these studies as not having any serious errors.
- Inconsistency. This is labelled as “serious”. The reason is that if you review the Figure above, you see that the test of heterogeneity is not passed (p < 0.001). That said, while the risk estimates as a whole do not pass the test of heterogeneity, and we therefore conclude that the studies are dissimilar, the risk estimates are not too far off from each other. In these situations, your task is first to review whether the interventions, the outcomes or the populations are widely dissimilar from each other. If that dissimilarity would explain the heterogeneity, then, as a next step, you would combine those studies that are very similar to each other, recalculate the effect size and re-run the test of heterogeneity. Here, we did not find that the studies were too dissimilar from each other with respect to their populations or their interventions. Therefore, when the studies as a whole do not pass the test of heterogeneity despite such similarities, we assign “serious” to this quality criterion. The “semantics” of assigning terms such as “serious”, “not serious” or “very serious” are subjective, but on this occasion we would like to mark the quality criterion as “serious” and deduct one point from the quality score.
- You can see that the measures are direct, and that the point estimate and the 95% confidence interval lie in the desired direction. Hence these are not serious errors. Therefore, for the corresponding boxes, that is “imprecision” and “indirectness”, we have registered “not serious”. We also examined whether there was publication bias in this set of studies, and we found this:
The plot above is referred to as a “funnel plot”. This graph plots the effect size on the x-axis and the sample size of the included studies on the y-axis. In an ideal world where no publication bias exists, the studies (dots in the graph) will look like an inverted funnel. There would be one or two studies at the top with large samples, reporting an effect size close to the summary estimate; there would be an increasing number of studies further down, evenly scattered with respect to the effect estimate and spread apart. The largest spread will be somewhere near the bottom. This would suggest that while many studies have been reported, some of them capture the true figure and others do not, but everyone had a fair go and got represented. If any “quadrant” were missing, that would suggest that there is publication bias to consider.
With this in mind, check this plot. We see a few studies at the top of this graph and a lot of studies at the bottom; there is no set pattern, and we cannot find that any quadrant is missing. It does not look as if the authors left out studies with low sample size and negative or equivocal effect estimates. Hence, on the basis of this meta analysis and this set of studies, we cannot point to any evidence of publication bias. We will therefore mark publication bias as “not serious”.
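Reading a funnel plot by eye is subjective, so reviewers sometimes add a numerical check. The sketch below is not something the tutorial or the review uses; it shows the idea behind Egger's regression, a different but related technique. It regresses each study's standardised effect (effect divided by its standard error) on its precision (one divided by the standard error) and returns the intercept, which sits near zero when the funnel is symmetric. It assumes you have standard errors rather than sample sizes, the function name `egger_intercept` and the data are hypothetical, and a real test would also need the intercept's standard error and p-value.

```python
def egger_intercept(effects, std_errors):
    """Intercept of the least squares regression of effect/SE on 1/SE."""
    precision = [1.0 / se for se in std_errors]
    standardised = [e / se for e, se in zip(effects, std_errors)]
    n = len(precision)
    mean_x = sum(precision) / n
    mean_y = sum(standardised) / n
    sxx = sum((x - mean_x) ** 2 for x in precision)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(precision, standardised))
    slope = sxy / sxx
    return mean_y - slope * mean_x


# Hypothetical data: four studies with different standard errors.
ses = [1.0, 2.0, 4.0, 5.0]

# Every study estimates the same effect, so the funnel is symmetric.
symmetric = [-10.0 for _ in ses]
print(egger_intercept(symmetric, ses))  # close to 0

# Less precise studies report systematically larger effects.
skewed = [-10.0 - 2.0 * se for se in ses]
print(egger_intercept(skewed, ses))     # close to -2
```

With only a handful of studies such a check has little power, so it supplements rather than replaces the visual inspection described above.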
Next, we click “Add Outcome” and add an outcome. In our case the outcome was “relief of chronic neck pain”, and we fill in the boxes. It now looks as below:
A few things to note here:
- Gradepro automatically selects that it is not going to report Relative Risk estimate as you have specified that your outcome variable is measured in scale measures or continuous outcome measures. So, you can see that little dash mark in the corresponding box next to “Relative Risk” that is greyed out.
- You can also see that, on the basis of all the inputs you gave GRADEpro, it has decided that the overall evidence for this particular outcome, for the body of studies provided, stands at “moderate”. It has therefore given a three star rating (three plus signs within the circles).
- Also, note that on the nine-point scale (1 to 9) used to decide whether an outcome is critical, important, or not important, you have marked this outcome as 9, and so GRADEpro calls it a critical outcome.
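The importance rating in the last point follows a simple cut-off rule. The sketch below assumes the GRADE guidance on rating outcomes on a scale from 1 to 9, where 7 to 9 is critical, 4 to 6 is important and 1 to 3 is of limited importance (labelled “not important” here to match the wording above). The function name `importance_category` is hypothetical.

```python
def importance_category(rating):
    """Map a 1-9 importance rating to its category (assumed GRADE cut-offs)."""
    if not 1 <= rating <= 9:
        raise ValueError("importance rating must be between 1 and 9")
    if rating >= 7:
        return "critical"
    if rating >= 4:
        return "important"
    return "not important"


print(importance_category(9))  # critical
```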
This completes your evidence portfolio for this outcome for the set of studies you checked. Let’s summarise the key points with respect to the summary of findings table and the evidence portfolio:
- Your outcome for this study or this problem was “relief from chronic neck pain”.
- You drew your evidence on the effectiveness or efficacy of interventions for this particular outcome from a set of 14 single or double blind randomized controlled trials.
- Based on the information you gave Gradepro, it judged that the overall evidence for this outcome-intervention pair was “moderate”.
- You found from the summary estimate in the meta analysis that with low level laser therapy there was a 19 point drop in chronic neck pain on a 100 point scale.
- You consider that chronic neck pain relief is a critical outcome.
Remember that this is just for one outcome and for one specific intervention. Even if we keep the intervention same, we will now need to search for other outcomes and other sets of studies. Other outcomes can also come from the same set of studies. What is important here is not what set of studies we look for but what outcome we study, how critical is it, and what is the quality of the overall evidence we are presented with. Then, we make a call as to whether on the basis of the evidence presented to us, if we are confident enough to recommend this particular intervention for a particular set of outcomes that are achievable. This is based on the strength of the recommendations. As you can see, this strength will be derived from the following:
- The summary effect measure, its magnitude, and its precision
- Whether we consider the outcome critical to our decision or not. If it is critical or important, and it has a large effect size that is also precise, we know that this outcome-intervention combination will feature high on our list. On the other hand, if the effect size is small and imprecise, and the outcome is not deemed important, then it will figure quite low on our decision list.
- The overall quality of evidence we have for this particular outcome and this particular intervention.
In real life you will need to go over a list of defined outcomes for an intervention and construct an evidence portfolio before you can come to a conclusion about the usability of this information. In this example, however, we have worked on only one outcome, for which we identified a set of studies. The goal of this lesson was just to show you what you can do with GRADEpro, not to work through a real world exercise.
If we were to form an advisory on this outcome and intervention alone, we would see that low level laser therapy for chronic neck pain may be justified: it has moderate quality of evidence, we consider chronic neck pain relief a critical outcome, and we also see that it leads to an overall 19 point drop on a 100 point scale of pain reporting. In real life, however, we would consider more outcomes. Some of these outcomes will be beneficial, and others will be harmful. Finally, after we have exhausted the list of outcomes and their rank order, we will take stock of the situation to test whether we have sufficient strength in the evidence we have obtained. On this basis, we would draw a conclusion about the real world effectiveness of an intervention for a family of outcomes, a health problem or an issue.
This was a quick tour of the core ideas behind GRADE as a decision making tool for your health problem. Below you will find more resources (some annotated) so that you can learn more about GRADE and use it to develop guidelines and resources for your health and healthcare related questions.
The Gradepro Website & the Webapp
GRADEpro | GDT
GRADEpro GDT is an easy to use, all‐in‐one web solution to summarise and present information for healthcare decision…
This is the tool that you will use on a daily basis. This is a web based tool and is frequently upgraded. Use it on any modern web browser (Safari, Edge, Google Chrome/Chromium, Vivaldi, Firefox and clones).
In subsequent tutorials, I shall get into the details of the different types of research questions and framing of guidelines from existing reviews.
I have linked a set of 11 core articles from where you will learn more about the GRADE process.
Guyatt, G., Oxman, A. D., Akl, E. A., Kunz, R., Vist, G., Brozek, J., … & Rind, D. (2011). GRADE guidelines: 1. Introduction — GRADE evidence profiles and summary of findings tables. Journal of clinical epidemiology, 64(4), 383–394.
Guyatt, G. H., Oxman, A. D., Kunz, R., Atkins, D., Brozek, J., Vist, G., … & Schünemann, H. J. (2011). GRADE guidelines: 2. Framing the question and deciding on important outcomes. Journal of clinical epidemiology, 64(4), 395–400.
Balshem, H., Helfand, M., Schünemann, H. J., Oxman, A. D., Kunz, R., Brozek, J., … & Guyatt, G. H. (2011). GRADE guidelines: 3. Rating the quality of evidence. Journal of clinical epidemiology, 64(4), 401–406.
Guyatt, G. H., Oxman, A. D., Vist, G., Kunz, R., Brozek, J., Alonso-Coello, P., … & Norris, S. L. (2011). GRADE guidelines: 4. Rating the quality of evidence — study limitations (risk of bias). Journal of clinical epidemiology, 64(4), 407–415.
Guyatt, G. H., Oxman, A. D., Montori, V., Vist, G., Kunz, R., Brozek, J., … & Williams, J. W. (2011). GRADE guidelines: 5. Rating the quality of evidence — publication bias. Journal of clinical epidemiology, 64(12), 1277–1282.
Guyatt, G. H., Oxman, A. D., Kunz, R., Brozek, J., Alonso-Coello, P., Rind, D., … & Jaeschke, R. (2011). GRADE guidelines 6. Rating the quality of evidence — imprecision. Journal of clinical epidemiology, 64(12), 1283–1293.
Guyatt, G. H., Oxman, A. D., Kunz, R., Woodcock, J., Brozek, J., Helfand, M., … & Norris, S. (2011). GRADE guidelines: 7. Rating the quality of evidence — inconsistency. Journal of clinical epidemiology, 64(12), 1294–1302.
Guyatt, G. H., Oxman, A. D., Kunz, R., Woodcock, J., Brozek, J., Helfand, M., … & Akl, E. A. (2011). GRADE guidelines: 8. Rating the quality of evidence — indirectness. Journal of clinical epidemiology, 64(12), 1303–1310.
Guyatt, G. H., Oxman, A. D., Sultan, S., Glasziou, P., Akl, E. A., Alonso-Coello, P., … & Jaeschke, R. (2011). GRADE guidelines: 9. Rating up the quality of evidence. Journal of clinical epidemiology, 64(12), 1311–1316.
Brunetti, M., Shemilt, I., Pregno, S., Vale, L., Oxman, A. D., Lord, J., … & Jaeschke, R. (2013). GRADE guidelines: 10. Considering resource use and rating the quality of economic evidence. Journal of clinical epidemiology, 66(2), 140–150.
Guyatt, G., Oxman, A. D., Sultan, S., Brozek, J., Glasziou, P., Alonso-Coello, P., … & Rind, D. (2013). GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes. Journal of clinical epidemiology, 66(2), 151–157.