Suppose we take two attributes or variables in a community, such as gender and habit of smoking. Now gender is a socially constructed attribute with usually two categories i.e. male and female. I can also classify habit of smoking into only two categories: smokers and non-smokers. Now I want to see whether being a particular gender has any relationship, or in statistical terms, association with the habit of smoking. Maybe I have a hypothesis, maybe not. For the sake of being methodical, I can say that the null hypothesis is, there is no relationship between gender and habit of smoking in a community. That is the hypothesis which signifies null, or no association between any two variables under consideration.
H0: There is no association between gender and habit of smoking
Then we arrange the total number of persons in the community in a 2X2 table, putting them in the group they belong to. This 2X2 table is also called the contingency table, with each row and each column containing only 2 cells for entry of data, which is a very common method to represent categorical data such as gender and habit of smoking.
| Gender | Habit of smoking: Smoker | Habit of smoking: Non-smoker | TOTAL |
|---|---|---|---|
| Male | a | b | a + b |
| Female | c | d | c + d |
| TOTAL | a + c | b + d | n |
There are “a” people who are male and smokers, and “b” people who are male and non-smokers. Similarly, there are “c” people who are female and smokers, and “d” people who are female and non-smokers. In total, there are (a + b + c + d) people, or “n” people in the community. We assume that this number “n” stays fixed.
I want to make clear that I have not just randomly put Gender in rows and Habit of smoking in columns. Since Gender does not depend on Habit of smoking, it is the independent variable here. However, Habit of smoking might depend on Gender, so it is the dependent variable. By convention, we put the independent variable in rows and the dependent variable in columns. Now that we are clear on that, we can proceed to find whether there is any association between Gender and Habit of smoking (against the Null hypothesis).
We test this hypothesis with the Chi-square test of association. The Chi-square value calculated from the data is compared against a critical value. If it is greater than the critical value, we can reject the Null hypothesis and say there is an association between the two categorical variables. The formula for the Chi-square statistic (χ²) is:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where O is the observed frequency, E is the expected frequency, and i indexes the cells of the 2X2 table, so the sum runs over all four cells. The observed frequency is the count actually present in each cell. The expected frequency is the count we would see in that cell if the Null hypothesis were true, i.e. if there were no association. For each cell, it is the row total multiplied by the column total, divided by the grand total n. We can apply this to calculate the expected values:
| Gender | Habit of smoking: Smoker | Habit of smoking: Non-smoker | TOTAL |
|---|---|---|---|
| Male | (a + b)(a + c)/n | (a + b)(b + d)/n | a + b |
| Female | (c + d)(a + c)/n | (c + d)(b + d)/n | c + d |
| TOTAL | a + c | b + d | n |
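The expected-frequency rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original study: the function name `expected_frequencies` and the small counts are made up for the example.

```python
def expected_frequencies(observed):
    """Return the expected counts under the Null hypothesis.

    `observed` is a list of rows, e.g. [[a, b], [c, d]].
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count in a cell = (row total * column total) / grand total
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts: rows = Male, Female; columns = Smoker, Non-smoker
observed = [[30, 20], [10, 40]]
expected = expected_frequencies(observed)
print(expected)  # [[20.0, 30.0], [20.0, 30.0]]
```

Note that the expected counts keep the same row and column totals as the observed table; only the distribution inside the table changes.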
The more the observed values deviate from the expected values, the larger the Chi-square statistic becomes, and the stronger the evidence for an association between the two variables. This is what the Chi-square statistic is about. Let us now take an example with data from a study by Peters SAE, Huxley RR and Woodward M published in 2014. The following table shows the observed and expected values (in parentheses) in Socioeconomic stratum 5 of the study:
| Gender | Habit of smoking: Smoker | Habit of smoking: Non-smoker | TOTAL |
|---|---|---|---|
| Male | 12504 (11261.4) | 13099 (14341.6) | 25603 |
| Female | 10858 (12100.6) | 16653 (15410.4) | 27511 |
| TOTAL | 23362 | 29752 | 53114 |
Now, we can calculate the Chi-square statistic:
$$\chi^2 = \frac{(12504 - 11261.4)^2}{11261.4} + \frac{(13099 - 14341.6)^2}{14341.6} + \frac{(10858 - 12100.6)^2}{12100.6} + \frac{(16653 - 15410.4)^2}{15410.4} = 472.58$$
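The same arithmetic can be checked with a short Python sketch. It uses the observed counts from the table above and recomputes the expected values from the row and column totals instead of typing in the rounded figures; the variable names are illustrative only.

```python
# Observed counts: rows = Male, Female; columns = Smoker, Non-smoker
observed = [[12504, 13099], [10858, 16653]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected count for this cell
        chi_square += (o - e) ** 2 / e         # this cell's contribution

print(round(chi_square, 2))  # 472.58
```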
To calculate whether this Chi-square statistic is representing any significant association or not, we need to determine the critical value of this statistic. The critical value is determined by:
· the degree of freedom for this test, and
· the type I error (α).
The degree of freedom is calculated by:
(number of rows – 1)(number of columns – 1)
In a 2X2 table, the value of df is 1.
The degrees of freedom would increase as the number of rows and columns of our table increases. In a 3X4 table, the value of df would be 6.
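As a quick check of the two examples above, the degrees-of-freedom formula can be written as a one-line function (the name `degrees_of_freedom` is just for illustration):

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a table with n_rows rows and n_cols columns."""
    return (n_rows - 1) * (n_cols - 1)


print(degrees_of_freedom(2, 2))  # 1
print(degrees_of_freedom(3, 4))  # 6
```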
The Type I error (α) is the probability of rejecting the Null hypothesis when it is actually true. It is conventionally taken as 0.1 (10%), 0.05 (5%) or 0.01 (1%). An α value of 0.05 means that, if the Null hypothesis is true, only 5% of the possible Chi-square values will be greater than the critical value.
The critical value of the Chi-square statistic at α of 5% and 1 degree of freedom is 3.84. If the calculated value is equal to or lower than that, we cannot reject the Null hypothesis. This does not prove that the Null hypothesis is true, only that the data do not give enough evidence against it. Each combination of α and degrees of freedom has its own critical value.
In our example, the calculated Chi-square statistic is 472.58, which is much higher than 3.84. So we can reject the Null hypothesis and conclude that there is a statistically significant association between Gender and Habit of smoking.
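Critical values are normally read from a Chi-square table or obtained from statistical software. For the special case of 1 degree of freedom, they can also be reproduced with the Python standard library, because a Chi-square variable with 1 degree of freedom is the square of a standard normal variable. The sketch below relies on that fact, so it does not apply to larger tables; the function name `chi_square_critical_df1` is hypothetical.

```python
from statistics import NormalDist


def chi_square_critical_df1(alpha):
    """Critical Chi-square value for df = 1 at significance level alpha."""
    # For df = 1, P(chi-square > x) = P(|Z| > sqrt(x)) with Z standard normal,
    # so the critical value is the squared two-sided z value.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(chi_square_critical_df1(alpha), 2))
# 0.1 2.71
# 0.05 3.84
# 0.01 6.63
```

A smaller α gives a larger critical value, so stronger evidence is needed to reject the Null hypothesis.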
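The decision can also be expressed as a p value, i.e. the probability of getting a Chi-square value at least this large if the Null hypothesis were true. A minimal sketch for 1 degree of freedom only, using the complementary error function from the standard library (the function name `chi_square_p_value_df1` is an assumption for this example):

```python
import math


def chi_square_p_value_df1(chi_square):
    """Upper-tail p value of a Chi-square statistic with df = 1."""
    # For df = 1, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi_square / 2))


chi_square = 472.58    # calculated statistic from the example
critical_value = 3.84  # alpha = 0.05, df = 1

print(chi_square > critical_value)                # True: reject the Null hypothesis
print(chi_square_p_value_df1(chi_square) < 0.05)  # True
print(round(chi_square_p_value_df1(3.84), 3))     # 0.05
```

Comparing the statistic with the critical value and comparing the p value with α always lead to the same decision.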
Note: The results of this statistical test are only valid as long as the expected value in every cell is 5 or more (a common rule of thumb).
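This condition is easy to check before running the test. The sketch below treats the threshold as "at least 5", which is an assumption about how the rule is applied; the function name `expected_counts_ok` and the small second table are hypothetical.

```python
def expected_counts_ok(observed, minimum=5):
    """Return True if every expected cell count is at least `minimum`."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return all(r * c / n >= minimum for r in row_totals for c in col_totals)


# Data from the example: all expected counts are in the thousands
print(expected_counts_ok([[12504, 13099], [10858, 16653]]))  # True

# Hypothetical small sample: the smallest expected count is 5 * 4 / 10 = 2
print(expected_counts_ok([[3, 2], [1, 4]]))  # False
```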
But how can we know the direction of the association? We still do not know if males have a greater smoking habit or vice versa. This can be done by simply calculating either the row or the column percentages in the table. In our example:
Row percentages:

| Gender | Habit of smoking: Smoker | Habit of smoking: Non-smoker | TOTAL |
|---|---|---|---|
| Male | 12504 (48.8%) | 13099 (51.2%) | 25603 |
| Female | 10858 (39.5%) | 16653 (60.5%) | 27511 |
48.8% of males are smokers compared to 39.5% of females. We can interpret this as: the proportion of smokers is statistically significantly higher among males (48.8%) than among females (39.5%), as given by the Chi-square value of 472.58, df 1 and p value < 0.05. A similar interpretation can be made with column percentages, as below.
Column percentages:

| Gender | Habit of smoking: Smoker | Habit of smoking: Non-smoker |
|---|---|---|
| Male | 12504 (53.5%) | 13099 (44.0%) |
| Female | 10858 (46.5%) | 16653 (56.0%) |
| TOTAL | 23362 | 29752 |
The interpretation can be: males make up a statistically significantly larger share of smokers (53.5%) than of non-smokers (44.0%), as given by the Chi-square value of 472.58, df 1 and p value < 0.05.
However, we have defined the independent variable in rows, and we want to actually know the effect of the independent variable (gender) on the dependent variable (habit of smoking). So we should be comparing the difference in proportion of smoking habit across both the categories of gender (male and female). Thus, in interpretation we usually consider the row percentages to calculate the required proportions.
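Both sets of percentages can be reproduced from the observed counts. This is a minimal sketch with illustrative variable names: row percentages divide each cell by its row total, and column percentages divide each cell by its column total.

```python
# Observed counts: rows = Male, Female; columns = Smoker, Non-smoker
observed = [[12504, 13099], [10858, 16653]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Row %: share of smokers / non-smokers within each gender
row_pct = [[round(100 * o / row_totals[i], 1) for o in row]
           for i, row in enumerate(observed)]

# Column %: share of males / females within each smoking category
col_pct = [[round(100 * o / col_totals[j], 1) for j, o in enumerate(row)]
           for row in observed]

print(row_pct)  # [[48.8, 51.2], [39.5, 60.5]]
print(col_pct)  # [[53.5, 44.0], [46.5, 56.0]]
```

Each row of `row_pct` adds up to 100%, which is why it answers the question "what proportion of each gender smokes?".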
Conclusion: Whenever you want to test for an association between two categorical variables, first fix the independent and the dependent variables. Then form a Null hypothesis stating that there is no association between the two variables.
Put the independent variable in rows and the dependent variable in columns. In a 2X2 table, we first arrange the data and calculate the expected frequency in each cell. After calculating the Chi-square value and the degrees of freedom, we find the critical value of Chi-square at a specified value of α and the same degrees of freedom. If the calculated value exceeds the critical value, we can reject the Null hypothesis and infer that there is a statistically significant association between the variables. Its direction can be determined by comparing the row percentages in the same table.
Written by:
Dr. Ria Roy,
Senior Resident
Department of Community and Family Medicine, AIIMS Patna
Interests: Adolescent Health, Nutrition, Biostatistics, Epidemiology, NCDs