BACKGROUND: I have a dataset that includes Race (e.g., White, Black) and Ethnicity (e.g., Hispanic, Non-Hispanic) as observed variables. The dataset also includes Race_Ethnicity (e.g., Hispanic White, Non-Hispanic Black) as an engineered variable, if you will. I am wondering whether I should retain the observed variables in my supervised ML model.
The observed variables are obviously correlated with the engineered variable; in fact, Race_Ethnicity is fully determined by Race and Ethnicity together. This is an issue for ML (i.e., the multicollinearity problem), if I am thinking about this correctly (but please correct me if I'm wrong). However, it may be possible that Race interacts with a fourth variable, whereas Ethnicity does not. Thus, leaving out Race may be costing me an important boost in performance. (Race_Ethnicity may have a more "muddied" relationship with the fourth variable than Race alone.)
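To make the redundancy concern concrete, here is a minimal sketch on toy data (the category values, the `one_hot` helper, and the repetition count are illustrative assumptions, not the real dataset). It one-hot encodes all three variables and checks the rank of the resulting matrix. If the combined matrix has no higher rank than Race_Ethnicity alone, the observed columns add no new linear information on their own.

```python
import itertools
import numpy as np

races = ["White", "Black"]
ethnicities = ["Hispanic", "Non-Hispanic"]

# Toy data: every Race x Ethnicity combination, repeated three times.
rows = list(itertools.product(races, ethnicities)) * 3


def one_hot(values):
    """Hypothetical helper: one 0/1 column per distinct level."""
    levels = sorted(set(values))
    return np.array([[float(v == lvl) for lvl in levels] for v in values])


race = [r for r, _ in rows]
ethnicity = [e for _, e in rows]
# Engineered variable, built deterministically from the two observed ones.
race_ethnicity = [f"{e} {r}" for r, e in rows]

X_race = one_hot(race)                # 2 columns
X_eth = one_hot(ethnicity)            # 2 columns
X_combined = one_hot(race_ethnicity)  # 4 columns

# Stack all eight indicator columns side by side.
X_all = np.hstack([X_race, X_eth, X_combined])

rank_combined = np.linalg.matrix_rank(X_combined)
rank_all = np.linalg.matrix_rank(X_all)

print("columns:", X_all.shape[1])
print("rank with all three variables:", rank_all)
print("rank with Race_Ethnicity only:", rank_combined)
```

Equal ranks would mean the Race and Ethnicity indicators are exact linear combinations of the Race_Ethnicity indicators, which is the perfect-collinearity case for a linear model. This sketch says nothing about the interaction with the fourth variable, and tree-based models are generally less sensitive to redundant columns than linear ones, so it only illustrates one side of the question.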
QUESTION: What to do, y'all? Should they (the observed variables) stay or should they go?