Detection of Multicollinearity in Data sets using Statistical Testing. | by Erdogan Taskesen | Oct, 2023

Detecting multicollinearity in data sets is an important but challenging step. I will demonstrate how to detect variables with similar behavior in mixed data sets and how to examine their relationships more deeply with interactive charts.

Erdogan Taskesen
Towards Data Science
Photo by Erol Ahmed on Unsplash

Understanding the strength of relationships between variables in a data set is important because variables with statistically similar behavior can affect the reliability of models. To identify this so-called multicollinearity so that it can be removed, we can use correlation measures for continuous variables. However, when the data set also contains categorical variables, and is thus a mixed data set, testing for multicollinearity becomes more challenging. Statistical tests, such as the hypergeometric test and the Mann-Whitney U test, can be used to test for associations across variables in mixed data sets. Although this works well, it requires various intermediate steps, such as typing the variables, one-hot encoding, and multiple test correction, among others. This entire pipeline is implemented in a method named HNet. In this blog, I will demonstrate how to detect variables with similar behavior so that multicollinearity can be easily detected.
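To make these intermediate steps concrete, here is a minimal sketch of such a pipeline written directly with pandas and SciPy. It is not HNet's implementation: the toy data, the column names, the helper functions (`hypergeom_pvalue`, `mannwhitney_pvalue`, `holm_adjust`), and the choice of the Holm correction are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import hypergeom, mannwhitneyu


def hypergeom_pvalue(a, b):
    """One-sided P(overlap >= observed) for two boolean (one-hot) columns."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.size                # population size: all samples
    n_a = int(a.sum())            # samples where a is True
    n_b = int(b.sum())            # samples where b is True
    overlap = int((a & b).sum())  # samples where both are True
    # sf(overlap - 1) is P(X >= overlap)
    return float(hypergeom.sf(overlap - 1, total, n_a, n_b))


def mannwhitney_pvalue(values, mask):
    """Two-sided Mann-Whitney U test: values inside vs. outside a category."""
    values = np.asarray(values, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    return float(mannwhitneyu(values[mask], values[~mask],
                              alternative="two-sided").pvalue)


def holm_adjust(pvalues):
    """Holm step-down correction (an assumed choice of correction method)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(np.argsort(p)):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical toy data: two related categorical variables, one continuous
# variable that depends on 'smoker', and one unrelated continuous variable.
rng = np.random.default_rng(0)
n = 200
smoker = rng.random(n) < 0.4
coughs = np.where(rng.random(n) < 0.9, smoker, ~smoker)  # mostly copies smoker
df = pd.DataFrame({
    "smoker": np.where(smoker, "yes", "no"),
    "coughs": np.where(coughs, "yes", "no"),
    "lung_capacity": rng.normal(5.0, 0.5, n) - 1.0 * smoker,
    "shoe_size": rng.normal(42.0, 2.0, n),
})

# Step 1: typing. Numeric columns are continuous, the rest categorical.
num_cols = list(df.select_dtypes(include="number").columns)
cat_cols = [c for c in df.columns if c not in num_cols]

# Step 2: one-hot encode each categorical variable separately.
onehot = {c: pd.get_dummies(df[c], prefix=c).astype(bool) for c in cat_cols}

# Step 3: test category vs. category, and category vs. continuous.
rows = []
for var_a, var_b in combinations(cat_cols, 2):
    for col_a in onehot[var_a]:
        for col_b in onehot[var_b]:
            p = hypergeom_pvalue(onehot[var_a][col_a], onehot[var_b][col_b])
            rows.append((col_a, col_b, "hypergeometric", p))
for var in cat_cols:
    for col in onehot[var]:
        for num in num_cols:
            p = mannwhitney_pvalue(df[num], onehot[var][col])
            rows.append((col, num, "mann-whitney-u", p))

# Step 4: correct for multiple testing and rank the associations.
results = pd.DataFrame(rows, columns=["var_a", "var_b", "test", "p"])
results["p_holm"] = holm_adjust(results["p"])
print(results.sort_values("p_holm").to_string(index=False))
```

In this sketch the hypergeometric test is one-sided (it only flags categories that co-occur more often than expected), so a pair such as `smoker_yes` and `coughs_no` receives a P-value close to 1. Pairs that remain significant after correction are the candidates to inspect for multicollinearity.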

Real-world data often contains measurements with both continuous and discrete values. We need to look at each variable and use common sense to determine whether variables can be related to each other. But when there are tens of variables (or more), and each categorical variable can have multiple states, it becomes time-consuming and error-prone to check all the variables manually. We can automate this task by combining intensive pre-processing steps with statistical testing methods. This is where HNet [1, 2] comes into play: it uses statistical tests to determine the significant relationships across all variables in a data set. It allows you to input your raw unstructured data into the model and then outputs a network that sheds light on the complex relationships across variables. Let’s go to the next section where I will explain how to detect variables with similar behavior using statistical

This post originally appeared on TechToday.
