Graphing Relationship Between Continuous Variable and Categorical Variable
Choosing the Best Graph Type
Adapted from Frank E. Harrell Jr. on graphics:
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatGraphCourse/graphscourse.pdf
- Bar charts have many problems:
  - High ink-to-information ratio
  - Error bars cause perception errors
  - Can only show one-sided confidence intervals well
  - Thick bars reduce the number of categories that can be shown
  - Labels on vertical bar charts may be difficult to read
- Dot plots are almost always better
- Consider a multi-panel side-by-side display for comparing several contrasting or similar cases
- Use the same scales for both x- and y-axes across different panels
- Consider ordering categories by the values represented, for more accurate perception
Bar Charts, Error Bars and Dot Plots
As noted previously, bar charts can be problematic. Here is another one presenting means and error bars, but the error bars are misleading because they extend in only one direction. A better alternative would have been to use full error bars with a scatter plot, as illustrated previously (right).
Source: Hummer BT, Li XL, Hassel BA (2001) Role for p53 in gene induction by double-stranded RNA. J Virol 75:7774-7777, Figure 4
Consider the four graphs below presenting the incidence of cancer by type. The upper left graph unnecessarily uses bars, which take up a lot of ink. This layout also makes the fonts for the types of cancer too small. Small font is also a problem for the dot plot at the upper right, which also has unnecessary grid lines across the entire width.
The graph at the lower left has more readable labels and uses a simple dot plot, but the rank order is difficult to figure out.
The graph at the lower right is clearly the best, since the labels are readable, the magnitude of incidence is shown clearly by the dot plots, and the cancers are sorted by frequency.
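The ordering idea can be tried without any plotting library. The following is a minimal sketch of a text-based dot plot that sorts categories by value before drawing them; the function name `text_dot_plot`, the category labels, and the counts are all invented for illustration and are not taken from the cancer incidence figure.

```python
def text_dot_plot(values, width=40):
    """Return the lines of a horizontal dot plot, largest value first."""
    if not values:
        return []
    top = max(values.values())
    label_width = max(len(name) for name in values)
    lines = []
    # Sorting by value lets the reader see rank order by reading down the page.
    for name, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        # Dot position on a common scale that starts at zero.
        pos = round(value / top * width) if top > 0 else 0
        lines.append(f"{name:<{label_width}} |{'.' * pos}o {value:g}")
    return lines


# Hypothetical counts, invented for illustration only.
incidence = {"Type A": 120, "Type B": 45, "Type C": 210, "Type D": 80}
for line in text_dot_plot(incidence):
    print(line)
```

Each category takes one line, so labels stay readable however many categories there are, and the single dot per row uses far less ink than a bar.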
Single Continuous Numeric Variable
In this situation a cumulative distribution function conveys the most information and requires no grouping of the variable. A box plot will show selected quantiles effectively, and box plots are especially useful when stratifying by multiple categories of another variable.
Histograms are also possible. Consider the examples below.
Density Plot | Histogram | Box Plot
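The quantities behind a cumulative distribution plot and a box plot can be computed directly. This is a rough sketch using only the Python standard library; the helper names `ecdf` and `five_number_summary` and the sample values are assumptions made for illustration. The "inclusive" quantile method is one of several conventions, so other software may give slightly different quartiles.

```python
import statistics


def ecdf(data):
    """Return (x, F(x)) pairs: the fraction of observations <= x at each distinct x."""
    ordered = sorted(data)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered, start=1):
        if points and points[-1][0] == x:
            points[-1] = (x, i / n)  # tied value: keep the highest step
        else:
            points.append((x, i / n))
    return points


def five_number_summary(data):
    """Minimum, quartiles and maximum: the values a basic box plot draws."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


# Hypothetical sample, invented for illustration only.
sample = [2, 3, 3, 5, 8, 13, 21, 34]
for x, fraction in ecdf(sample):
    print(f"{x:>3}  {fraction:.3f}")
print(five_number_summary(sample))
```

The empirical cumulative distribution uses every observation without binning, which is why it needs no grouping choice, whereas a histogram depends on the bin width selected.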
Two Variables
Adapted from Frank E. Harrell Jr. on graphics:
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatGraphCourse/graphscourse.pdf
Two categorical variables
- Use frequency table
One categorical variable and one continuous variable
- Box plots of continuous variable values for each category of categorical variable
- Side-by-side dot plots (means + measure of uncertainty, SE or confidence interval)
- Do not link means across categories!
Two continuous variables
- Scatter plot of raw data if sample size is not too large
- Prediction with confidence bands
The two graphs below summarize BMI (Body Mass Index) measurements in four categories, i.e., younger and older men and women. The graph on the left shows the means and 95% confidence interval for the mean in each of the four groups. This is easy to interpret, but the viewer cannot see that the data is actually quite skewed. The graph on the right shows the same information presented as a box plot. With this presentation method one gets a better understanding of the skewed distribution and how the groups compare.
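To see why a mean with a confidence interval can hide skewness, compare it with the quartiles of the same data. The sketch below uses made-up, right-skewed values (not real BMI measurements) and hypothetical helper names. The interval uses a normal approximation, which is an assumption; for a sample this small a t-based interval would be more appropriate.

```python
import math
import statistics


def mean_ci(data, level=0.95):
    """Mean with a normal-approximation confidence interval for the mean."""
    m = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return m, m - z * se, m + z * se


def quartiles(data):
    """First quartile, median and third quartile (inclusive method)."""
    return tuple(statistics.quantiles(data, n=4, method="inclusive"))


# Hypothetical right-skewed values, invented for illustration only.
group = [21, 22, 22, 23, 24, 25, 26, 28, 33, 46]
m, lo, hi = mean_ci(group)
q1, med, q3 = quartiles(group)
print(f"mean {m:.1f} (95% CI {lo:.1f} to {hi:.1f})")
print(f"median {med:.1f}, quartiles {q1:.1f} to {q3:.1f}, range {min(group)} to {max(group)}")
```

The interval around the mean is symmetric by construction, so it cannot show that the upper tail is much longer than the lower one. The quartiles and extremes, which are what a box plot draws, make that asymmetry visible.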
The next example is a scatter plot with a superimposed smoothed line of prediction. The shaded region embracing the blue line is a representation of the 95% confidence limits for the estimated prediction. This was created using "ggplot" in the R programming language.
Source: Frank E. Harrell Jr. on graphics: http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatGraphCourse/graphscourse.pdf (page 121)
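The figure uses a smoother fitted in R. As a simpler stand-in, the sketch below fits a straight line by least squares and computes an approximate 95% confidence band for the mean prediction. It is not the method used for the figure: the function name `linear_band` and the data are hypothetical, and the band uses a normal quantile where a t quantile with n - 2 degrees of freedom would be the usual choice.

```python
import math
import statistics


def linear_band(xs, ys, level=0.95):
    """Fit y = a + b*x by least squares and return a function that gives
    (fitted value, lower limit, upper limit) for the mean response at x."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation

    def predict(x):
        fit = a + b * x
        half_width = z * s * math.sqrt(1 / n + (x - x_bar) ** 2 / sxx)
        return fit, fit - half_width, fit + half_width

    return predict


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
predict = linear_band(xs, ys)
for x in (1, 4.5, 8):
    fit, lower, upper = predict(x)
    print(f"x={x}: fit {fit:.2f}, band {lower:.2f} to {upper:.2f}")
```

The band is narrowest at the mean of x and widens toward the ends of the data, which is the flared shape a shaded confidence region typically has.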
Multivariate Data
- If there aren't too many variables, it may be possible to display the relationship among variables using a line plot with multiple lines.
- Another option is to display the data in multiple panels rather than in a single plot with multiple lines that may be hard to distinguish.
- In any event, be sure to use consistent axes and colors across panels.
The example below shows the use of multiple panels.
Source: Cleveland WS. The Elements of Graphing Data. Hobart Press, Summit, NJ, 1994.
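Keeping axes consistent across panels comes down to computing one range from all the panels and using it for every one of them. Here is a minimal sketch of that idea with text output; the helper names `shared_limits` and `to_column` and the panel data are invented for illustration.

```python
def shared_limits(panels, pad=0.05):
    """Return one (low, high) range covering every panel, with a small margin."""
    values = [v for series in panels.values() for v in series]
    low, high = min(values), max(values)
    margin = (high - low) * pad
    return low - margin, high + margin


def to_column(value, limits, width=30):
    """Map a value to a character column using the shared limits."""
    low, high = limits
    if high == low:
        return width // 2  # all values identical: place the mark mid-axis
    return round((value - low) / (high - low) * width)


# Hypothetical panels, invented for illustration only.
panels = {"Site 1": [10, 14, 12], "Site 2": [30, 26, 34]}
limits = shared_limits(panels)
for name, series in panels.items():
    print(name)
    for v in series:
        print(" " * to_column(v, limits) + "o")
```

With per-panel limits the two sites would appear to span similar ranges. With shared limits the difference in level between them is visible at a glance.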
Source: https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/datapresentation/DataPresentation7.html