Hypothesis testing for categorical variables
There are several methods to test hypotheses using categorical data. The most useful are calculating confidence intervals for proportions or for differences in proportions, and using hypothesis tests for proportions, differences in proportions, independence between variables, and goodness of fit. Below we briefly explore these methods.
While there are many summary and descriptive statistics that can be calculated for categorical data, there is no straightforward way to calculate confidence intervals. Below we explore several ways to approximate confidence intervals.
Bootstrapping is a technique that samples with replacement from the data and repeats the same calculation over many iterations. For example, if a proportion is calculated 100 times from different resamplings of the data, the interval between the lower and upper percentiles of those 100 values approximates a confidence interval for the true proportion of the category in the population.
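The idea above can be sketched in base R before turning to the infer helpers. This is a minimal illustration with a made-up sample (the data, sample size, and number of iterations are arbitrary choices, not from the original text):

```r
# Bootstrap CI for a proportion, sketched in base R with toy data
set.seed(42)
gender <- c(rep("female", 60), rep("male", 40))  # hypothetical sample, n = 100

# Resample with replacement and recompute the proportion many times
boot_props <- replicate(1000, {
  resample <- sample(gender, length(gender), replace = TRUE)
  mean(resample == "female")
})

# The 2.5th and 97.5th percentiles approximate a 95% confidence interval
ci <- quantile(boot_props, c(0.025, 0.975))
print(ci)
```

The infer functions described next wrap this same resample-and-recalculate loop in a tidyverse-style pipeline.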
The infer library provides helper functions that allow you to calculate confidence intervals through bootstrapping:
install.packages("infer")
library(infer)
specify() allows you to declare the variable of interest in the data and the value to treat as a success when calculating the summary statistic, for example the value "female" from the variable "gender".
generate() performs the bootstrapping operation and lets you set the number of iterations.
calculate() sets the summary statistic to be computed for the variable of interest across the bootstrap iterations.
Example:
CI <- df %>%
  specify(response = gender, success = "female") %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "prop")
P-hat sampling distribution. It is also possible to calculate the standard error (SE) from the sampling distribution of p-hat. Once the standard error is known, a confidence interval can be calculated. The SE of p-hat is calculated as sqrt(phat * (1 - phat) / n).
Note that for this method to be valid, the observations must be independent from each other and n must be large enough (a common rule of thumb is at least 10 successes and 10 failures).
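As a worked sketch of the formula above, a 95% interval is p-hat plus or minus 1.96 standard errors (the numbers here are hypothetical, chosen only to illustrate the arithmetic):

```r
# Normal-approximation CI for a proportion (toy numbers)
n <- 100
phat <- 0.6                                    # e.g. 60 "female" out of 100
se <- sqrt(phat * (1 - phat) / n)              # SE of phat
ci <- c(phat - 1.96 * se, phat + 1.96 * se)    # 95% confidence interval
round(ci, 3)  # approximately 0.504 to 0.696
```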
As reviewed in the section above, the R package infer allows calculating confidence intervals through bootstrapping using the helper functions specify(), generate() and calculate(). To test a hypothesis we add the helper function hypothesize(), which declares a null hypothesis:
Example:
null <- df %>%
  specify(response = gender, success = "female") %>%
  hypothesize(null = "point", p = 0.5) %>%
  generate(reps = 100, type = "simulate") %>%
  calculate(stat = "prop")
This method allows you to calculate the proportion values you would expect under the null hypothesis. To obtain a p-value from there, you calculate the fraction of simulated p-hats that are more extreme than the observed p-hat. For example:
null %>%
  summarize(p_value = mean(stat > phat)) %>%
  pull() * 2
Note that we multiply the pulled value by 2 to make the test two-sided, accounting for the left tail as well.
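The same simulation logic can be sketched in base R without infer: draw many sample proportions under the null p = 0.5, then double the right-tail fraction. The observed p-hat here is a hypothetical value, not from the original text:

```r
# Simulation-based two-sided p-value for H0: p = 0.5 (toy observed value)
set.seed(1)
n <- 100
phat_obs <- 0.60                                       # hypothetical observed proportion
null_stats <- rbinom(1000, size = n, prob = 0.5) / n   # p-hats simulated under the null
p_value <- 2 * mean(null_stats >= phat_obs)            # double the right tail
p_value
```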
You might want to measure the association between categorical variables under the null hypothesis that the variables of interest are independent from each other. One way to test this hypothesis is to create contingency tables and use the chi-squared statistic to reject the null.
To create tidy tables you can use the broom R package: install.packages("broom") (note that uncount() comes from the tidyr package).
Some helper functions for working with contingency tables are:
table() allows you to turn a selection of columns into a table object.
tidy() allows you to reorganize a table object into a tidy data frame, one row per cell.
uncount(n) allows you to expand summary counts back into one row per observation.
Once the variables of interest are selected and saved into a tidy table object, we can test the independence of the variables. To test for independence we go back to our helper functions specify(), hypothesize(), generate() and calculate(). This time we generate repetitions by permutation.
Example:
null <- data %>%
  specify(var1 ~ var2) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "Chisq")
This generates a null distribution of chi-squared statistics under the hypothesis of independence, which we can then compare to the statistic computed from our observed counts.
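Base R also offers a direct, theory-based version of this test: chisq.test() computes the chi-squared statistic, the expected counts under independence, and a p-value from a contingency table. A minimal sketch with made-up counts:

```r
# Chi-squared test of independence on a toy 2x2 contingency table
observed <- matrix(c(30, 20,
                     10, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("A", "B"),
                                   outcome = c("yes", "no")))
test <- chisq.test(observed)
test$expected   # expected counts under independence
test$p.value    # small p-value -> evidence against independence
```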
A goodness of fit test tries to establish whether the observed values match the values expected under a specific model. For example, say we expect family language strategy use to be equally distributed in a group of randomly sampled parents. If we have three possible family language strategies, we expect this model to be true: model <- c(opol = 1/3, `2_bilingual` = 1/3, `1_bilingual` = 1/3).
Once we have a model of how we expect the data to behave, we can use the chi-squared test as a goodness-of-fit measure by comparing our model to our observed data, as such:
chisq.test(observed_table, p = model)$statistic
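Putting the model and the test together, here is a runnable sketch of the family language strategy example; the counts are hypothetical, chosen only to show the mechanics:

```r
# Goodness-of-fit test against a uniform model (hypothetical counts)
observed_table <- c(opol = 40, `2_bilingual` = 30, `1_bilingual` = 30)
model <- c(opol = 1/3, `2_bilingual` = 1/3, `1_bilingual` = 1/3)

gof <- chisq.test(observed_table, p = model)
gof$statistic   # chi-squared statistic (here 2, with expected counts of 100/3)
gof$p.value     # large p-value -> no evidence against the uniform model
```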
hypothesize() allows you to declare a null hypothesis.