Spot Potential Relationships Between Two Variables Using the Scatter Diagram

Anantha Kollengode
04/07/2010

Finally, the spring season is here, and my family and I have begun planning a road trip for this summer. One of the places we would like to visit is Yellowstone National Park, and we are set on seeing the Old Faithful geyser erupt. To witness this natural wonder, we could plan to stay at the park for half a day, or we could use data to our advantage and plan a much shorter wait to see the geyser erupt to a height of over 100 feet. What tool from the quality tool kit would come in handy to minimize our wait while maximizing the odds of seeing a long-duration eruption?

This question leads me to the scatter diagram (also known as the scatter plot or X-Y plot), one of the seven basic quality tools, which provides a visual display of potential relationships between two variables. The tool is helpful early in problem solving (the Measure phase of Six Sigma DMAIC) for analyzing the possible relationship between two variables. Typically, the two variables are plotted on the two axes (hence the name X-Y plot) to support or challenge a suspected cause-and-effect relationship between two continuous variables. The scatter diagram helps show whether the two variables are correlated and gives a sense of how strong the relationship is.

A Scatter Diagram Example

To illustrate how to use the scatter diagram, let’s consider the following example. A lab analyzing a particular specimen has encountered more defects in this sample than in the previous study, and the project team suspects that the reagent temperature has affected the results. The team therefore conducts a small study to count the specimen defects at various reagent temperatures. The results are given below.

Test Reagent Temperature (Celsius)	Number of Defects in the Results
22	2
29	9
21	3
30	8
25	4
23	4
22	3
29	7
26	7
25	5

The data are difficult to interpret from the table alone. The scatter diagram, however, brings out the trend visually. In this hypothetical example, the scatter diagram helped the team see the effect of reagent temperature on the number of defects: as the reagent’s temperature increased, the number of defects also increased.

Building a scatter diagram: The first step in building a scatter diagram is to identify the two variables that you suspect are related and that are of interest to your team. Verify that you are able to collect information on both variables at the same time (for example, the number of defects that occur at 25 °C). Preferably, collect 50 to 100 pairs of data. Plot the paired data on the two axes, with the suspected cause (temperature) on the x-axis and the suspected effect (number of defects) on the y-axis. The scale of the axes can be changed to get a better visual representation (in the above example, the x-axis starts at 20 °C whereas the y-axis starts at 0 defects).
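The plotting step can be sketched in a few lines of Python. This is only an illustration, not the tool used to produce the original figure: it draws the ten pairs from the table as a text grid using the standard library alone, and the function name text_scatter and the axis limits (20 to 30 °C, 0 to 10 defects) are assumptions chosen for this example. In practice a spreadsheet or plotting library would do the same job.

```python
# Paired observations from the table above: (temperature in °C, defects).
pairs = [(22, 2), (29, 9), (21, 3), (30, 8), (25, 4),
         (23, 4), (22, 3), (29, 7), (26, 7), (25, 5)]


def text_scatter(pairs, x_min=20, x_max=30, y_min=0, y_max=10):
    """Return a character-grid scatter diagram (hypothetical helper).

    Suspected cause (temperature) runs along the x-axis and the
    suspected effect (defects) along the y-axis. Axis limits are
    assumptions for this example.
    """
    points = set(pairs)
    rows = []
    for y in range(y_max, y_min - 1, -1):  # top row holds the largest y
        cells = ["*" if (x, y) in points else "." for x in range(x_min, x_max + 1)]
        rows.append(f"{y:>2} | " + " ".join(f"{c:>2}" for c in cells))
    # Horizontal axis line, then the temperature labels under each column.
    rows.append("   +" + "-" * (len(rows[0]) - 4))
    rows.append("     " + " ".join(f"{x:>2}" for x in range(x_min, x_max + 1)))
    return "\n".join(rows)


print(text_scatter(pairs))
```

Each asterisk marks one temperature and defect pair. The points drift from the lower left toward the upper right, which is the upward trend described in the example.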

Interpretation of the scatter diagram: There are two key considerations for interpreting the scatter diagram, namely the type of correlation and the degree of correlation. There are four types of correlation: positive correlation (increasing X increases Y), negative correlation (increasing X decreases Y), no correlation (increasing X tells us nothing about how Y will change), and non-linear correlation (such as a U-shaped, S-shaped, curved, or partially curved pattern). The degree of correlation can be strong (strongly positive or strongly negative), weak (weakly positive or weakly negative), or zero (no correlation).

If you see correlation in your scatter diagram: If you see either a strong or weak correlation (linear or non-linear) in your data, it implies one of the following: a) there is a cause-and-effect relationship between the two variables, b) both variables are affected by a third variable, or c) the correlation between the two variables is purely coincidental. Some common analyses and tools help in making an appropriate interpretation. These include the correlation coefficient, which ranges from -1 (a perfect negative correlation) through 0 (no correlation) to +1 (a perfect positive correlation); a regression line (the best-fit line through the data points); and the standard error (a measure of how widely the data points are spread around the regression line).
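As a rough illustration of these three measures, the sketch below computes them for the ten pairs in the reagent example using only Python’s standard library. The function name linear_fit is an assumption made for this example, and the standard error is taken here to mean the standard error of the estimate around a least-squares line (residual sum of squares divided by n - 2, then square-rooted).

```python
import math

# Data from the reagent temperature example.
temps = [22, 29, 21, 30, 25, 23, 22, 29, 26, 25]    # degrees Celsius
defects = [2, 9, 3, 8, 4, 4, 3, 7, 7, 5]            # defects in the results


def linear_fit(xs, ys):
    """Return (r, slope, intercept, std_err) for paired data (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)           # Pearson correlation coefficient
    slope = sxy / sxx                        # least-squares regression line
    intercept = mean_y - slope * mean_x
    # Spread of the observed points around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2))
    return r, slope, intercept, std_err


r, slope, intercept, std_err = linear_fit(temps, defects)
print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}, "
      f"standard error = {std_err:.2f}")
```

For this small data set the correlation coefficient comes out at roughly 0.93, a strong positive correlation, with a fitted slope of about 0.69 additional defects per degree Celsius. With only ten pairs, well short of the 50 to 100 recommended above, these figures should be read as indicative rather than conclusive.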

It is important to note that while the scatter diagram may suggest a cause-and-effect relationship between two variables, it does not by itself prove that one variable causes the other.

If you do not see any correlation in your scatter diagram: Consider stratifying the data, for example by day of the week, shift, supplier, or specialty. In addition, check whether the x-axis variable (temperature in our example) was measured over a wide enough range, because a relationship can be hidden when the range is too narrow. Statistical analysis, which is beyond the scope of this article, is also helpful before concluding that there is no relationship between the variables.

Examples of scatter diagram use in healthcare: The scatter diagram can be a good first step in exploring possible correlations in scenarios such as the following: the number of errors related to the hours of overtime in a hospital, the time of day when patient falls occur in a unit, and the time taken by staff to admit a patient related to their years of experience in a department.

Getting Back to Old Faithful

So where does the scatter diagram fit in with helping us determine how much time we really need to ensure we see a long-duration Old Faithful eruption? If we plot eruption duration (x-axis) against the waiting time between eruptions (y-axis), we can quickly see that there are two types of eruptions: the short-wait-short-duration type and the long-wait-long-duration type. Armed with this newfound knowledge, we can plan to wait about 90 minutes to catch an eruption that lasts longer than 3 minutes. Now that’s worth the trip!