Relevance of chi-square test in logistic regression

Yes, it is still possible to use logistic regression even if a chi-square test is not significant (p > 0.05). The chi-square test is a hypothesis test that determines whether there is a significant association between two variables in a contingency table. Logistic regression, by contrast, models the outcome as a function of one or more predictors and can adjust for other variables, so a non-significant bivariate chi-square result does not by itself rule out fitting a logistic regression model.

What is a chi-square test?

A chi-square test is a statistical test that is used to determine whether there is a significant association between two categorical variables in a contingency table. The chi-square test is used to determine whether the observed frequencies or proportions in the contingency table are significantly different from what would be expected if there was no association between the variables.

The chi-square test involves calculating a test statistic called the chi-square statistic (X^2). The chi-square statistic measures the difference between the observed frequencies or proportions in the contingency table and the expected frequencies or proportions under the null hypothesis of no association between the variables. The null hypothesis assumes that there is no significant association between the variables.
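
For a table with observed count O_i and expected count E_i in each cell, the statistic is

    X^2 = sum over all cells of (O_i - E_i)^2 / E_i

where each expected count is (row total × column total) / grand total.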

The chi-square statistic follows a chi-square distribution with (r - 1) × (c - 1) degrees of freedom, where r is the number of rows in the contingency table and c is the number of columns. The degrees of freedom represent the number of independent pieces of information available to estimate the population parameters. The p-value of the chi-square test is the probability of obtaining a chi-square statistic at least as extreme as the observed statistic, assuming that the null hypothesis is true.

If the p-value is less than the significance level (usually 0.05), the null hypothesis is rejected, and it is concluded that there is a significant association between the variables. If the p-value is greater than the significance level, the null hypothesis is not rejected, and it is concluded that there is not enough evidence of an association between the variables.
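
As a minimal sketch of this decision rule in Python, assuming SciPy is available and using a made-up 2x2 contingency table:

    # Chi-square test of independence on a made-up 2x2 contingency table.
    from scipy.stats import chi2_contingency

    # Rows: exposed / not exposed; columns: outcome present / absent.
    table = [[20, 30],
             [25, 25]]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"X^2 = {chi2:.3f}, df = {dof}, p = {p_value:.3f}")

    alpha = 0.05
    if p_value < alpha:
        print("Reject H0: evidence of an association.")
    else:
        print("Fail to reject H0: not enough evidence of an association.")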

The chi-square test can be used for a variety of applications:

  1. testing for the independence of two categorical variables
  2. testing for goodness of fit of a theoretical distribution to an observed distribution (a sketch follows this list)
  3. testing for homogeneity of proportions across different groups
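
As a minimal sketch of the goodness-of-fit use in item 2, assuming SciPy and made-up die-roll counts:

    # Goodness-of-fit: does a die appear fair? (made-up counts for 120 rolls)
    from scipy.stats import chisquare

    observed = [18, 22, 16, 24, 20, 20]  # counts for faces 1 through 6
    # Under the null hypothesis of a fair die, each face is expected 20 times;
    # chisquare() assumes equal expected frequencies by default.
    stat, p_value = chisquare(observed)

    print(f"X^2 = {stat:.3f}, p = {p_value:.3f}")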

Contingency table

A contingency table is a table that shows the distribution of frequencies or proportions of two categorical variables across their categories, with the categories of one variable as rows and the categories of the other as columns.
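
As an illustration, assuming pandas and a handful of made-up records, a contingency table can be built with a cross-tabulation:

    # Build a contingency table (cross-tabulation) from raw categorical data.
    import pandas as pd

    # Hypothetical records: smoking status and disease status per subject.
    df = pd.DataFrame({
        "smoker":  ["yes", "yes", "no", "no", "yes", "no", "no", "yes"],
        "disease": ["yes", "no",  "no", "no", "yes", "yes", "no", "yes"],
    })

    table = pd.crosstab(df["smoker"], df["disease"])
    print(table)  # rows: smoker no/yes; columns: disease no/yes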

Confounding variables

Confounding variables are extraneous variables that can affect the relationship between the independent variable and the dependent variable in a study. They are related to both the independent variable and the dependent variable but are not part of the research question, and they can lead to biased estimates of the effect of the independent variable on the dependent variable.

For example, suppose a researcher wants to investigate the relationship between smoking and lung cancer. The independent variable is smoking, and the dependent variable is lung cancer. However, age is a confounding variable because it is related to both smoking and lung cancer risk. Older individuals are more likely to have smoked for a longer period of time and are also more likely to develop lung cancer. Therefore, age can confound the relationship between smoking and lung cancer, making it difficult to determine whether smoking is the cause of lung cancer.

To control for confounding variables, researchers can use statistical techniques such as regression analysis or propensity score matching.

  1. Regression analysis can be used to adjust for the effects of confounding variables by including them as covariates in the regression model (a sketch of this approach follows the list).
  2. Propensity score matching is a technique that matches individuals in different groups based on their propensity scores, which are the estimated probabilities of being in a particular group given their characteristics. This technique can help balance the distribution of confounding variables between the groups and reduce bias in the estimate of the treatment effect.
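
As a minimal sketch of the regression-adjustment approach in item 1, assuming statsmodels and simulated data in the spirit of the smoking example above (all numbers are made up):

    # Logistic regression adjusting for a confounder (age) via a covariate.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 500
    age = rng.integers(30, 80, size=n)
    # Older subjects are more likely to smoke, so age confounds the effect.
    smoker = (rng.random(n) < (age - 20) / 100).astype(int)
    # The outcome depends on both smoking and age.
    log_odds = -6 + 1.0 * smoker + 0.06 * age
    cancer = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

    df = pd.DataFrame({"cancer": cancer, "smoker": smoker, "age": age})

    # Including age as a covariate adjusts the smoking effect for age.
    model = smf.logit("cancer ~ smoker + age", data=df).fit(disp=False)
    print(model.params)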

Controlling for confounding variables is important in research studies because it allows researchers to make more accurate inferences about the relationship between the independent variable and the dependent variable. By controlling for confounding variables, researchers can reduce the likelihood of drawing incorrect conclusions about the relationship between variables.

Hypothesis testing

Hypothesis testing is a statistical method used to make decisions about a population based on a sample of data. It involves formulating a hypothesis about a population parameter (such as a mean or proportion) and collecting data to test that hypothesis. The hypothesis is usually stated as two complementary statements: the null hypothesis and the alternative hypothesis.

The null hypothesis (H0) is a statement that assumes there is no significant difference or relationship between two or more populations or samples. The alternative hypothesis (H1) is a statement that assumes there is a significant difference or relationship between two or more populations or samples.

The hypothesis testing process involves the following steps:

  1. State the null and alternative hypotheses.
  2. Choose a significance level (usually 0.05).
  3. Collect data and calculate a test statistic.
  4. Calculate the p-value of the test statistic.
  5. Compare the p-value to the significance level.
  6. Make a decision to reject or fail to reject the null hypothesis based on the p-value and the significance level.

If the p-value is less than the significance level, the null hypothesis is rejected, and it is concluded that there is sufficient evidence to support the alternative hypothesis. If the p-value is greater than the significance level, the null hypothesis is not rejected, and it is concluded that there is not enough evidence to support the alternative hypothesis.
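
As a minimal sketch of this process, assuming SciPy and two made-up samples, a two-sample t-test walks through the steps above:

    # Two-sample t-test following the steps above (made-up samples).
    from scipy.stats import ttest_ind

    # Step 1: H0: the two population means are equal; H1: they differ.
    # Step 2: choose a significance level.
    alpha = 0.05

    # Step 3: collect data (hypothetical measurements from two groups).
    group_a = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3]
    group_b = [5.8, 6.0, 5.5, 6.2, 5.9, 5.7, 6.1, 5.6]

    # Steps 3-4: compute the test statistic and its p-value.
    t_stat, p_value = ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

    # Steps 5-6: compare to alpha and decide.
    if p_value < alpha:
        print("Reject H0: the means differ significantly.")
    else:
        print("Fail to reject H0: not enough evidence of a difference.")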

Hypothesis testing is used in a wide range of applications, such as:

  1. testing the effectiveness of a new drug
  2. comparing the means of two populations
  3. testing for the independence of two categorical variables

It is an important tool for making decisions based on data and for drawing conclusions about populations based on a sample of data.

Logistic regression assumptions

Some of the key assumptions of logistic regression are:

  1. Binary outcome - the dependent variable in logistic regression is binary, meaning that it takes on one of two possible values (e.g., 0 or 1). If the dependent variable is not binary, logistic regression may not be appropriate.
  2. Independence - The observations in the dataset are assumed to be independent of each other. If the observations are not independent, such as in clustered or longitudinal data, alternative methods may be more appropriate.
  3. Linearity - Logistic regression assumes that the relationship between the independent variables and the log odds of the dependent variable is linear. If the relationship is non-linear, other modeling techniques may be more appropriate.
  4. No multicollinearity - The independent variables in logistic regression should not be highly correlated with each other. Multicollinearity can make it difficult to interpret the coefficients of the logistic regression.
  5. Large sample size - Logistic regression works best when the sample size is large. A small sample size can lead to unstable estimates of the coefficients and poor model performance.
  6. No influential outliers - Extreme values or outliers, particularly in the independent variables, can have a large influence on the model estimates and affect the performance of the model, so they should be identified and handled before fitting.
  7. No perfect separation - Perfect separation occurs when one or more independent variables can perfectly predict the outcome variable. This can lead to infinite estimates of the coefficients, making the model uninterpretable. If perfect separation exists in the data, alternative methods may be more appropriate.

It's important to check these assumptions before using logistic regression, as violating any of these assumptions can lead to biased estimates and poor model performance.
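
As a minimal sketch of checking one of these assumptions (no multicollinearity, item 4), assuming statsmodels and simulated predictors; variance inflation factors well above roughly 5-10 are a common warning sign:

    # Check the no-multicollinearity assumption with variance inflation factors.
    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools import add_constant

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # deliberately correlated with x1
    x3 = rng.normal(size=n)

    X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
    # A VIF well above ~5-10 suggests problematic multicollinearity.
    for i, name in enumerate(X.columns):
        if name != "const":
            print(f"VIF({name}) = {variance_inflation_factor(X.values, i):.2f}")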

Practical Notebooks

  • Students enrolled in any AI-related course at Carnegie Training Institute have access to Jupyter notebooks and class exercises illustrating this reasoning.
