5/7/2023

I have a rather high-dimensional data set (p > 1000) with several variables ranking significantly higher than the rest in terms of variable importance (measured by Gini impurity). However, these variables are highly correlated, so it is unclear whether each variable actually holds unique information or simply ranks high due to correlation with the causal variable.

My general approach for testing the significance of variable importance is a bootstrap permutation test: I first bootstrap from the training data and build a random forest model, then permute the columns in the out-of-bootstrap samples and check whether I observe a significant decline in accuracy. My assumption under H0 is that these variables are completely uninformative, so it shouldn't matter whether I permute the column of the training or the test set. In the case above, however, that assumption does not hold: the variables are predictive, and due to sub-sampling of variables at each split they will get selected during model training. If I now permute the corresponding columns of the test set, it is clear that the performance will drop, although this does not indicate whether these variables hold unique information at all. Therefore, I'm thinking about changing my approach and permuting the variables in the training set rather than in the test set.

My question is: Is this approach valid? Is permutation of variables in the training set a better approach in general? If not, in which cases would it be inappropriate?

To answer your question, it seems the same to me whether you scramble columns in the set you train on or before you pass the test set. Effectively, if you scramble before training, the model expects things in scrambled order, and then the canonical order looks scrambled to it; if you train on the canonical order, then the scrambled order looks scrambled. I wouldn't take either of these approaches to decide feature importance, because when I misplace a column, I don't just misplace it in isolation: I also displace the column where it has to go. Maybe you could devise some Hamming-distance scheme to figure out how much to weight the inaccuracies of models trained on scrambled datasets, but this would require a lot of models, be computationally expensive, and shouldn't be necessary.
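The bootstrap permutation test described in the question can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn's `RandomForestClassifier`; the toy data, the feature indices, the `n_boot` setting, and the helper name `oob_permutation_drop` are all illustrative assumptions, not from the post:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: feature 0 drives the label, feature 1 is correlated with it,
# the remaining columns are pure noise (shapes and indices are made up).
n, p = 300, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=n)  # correlated "backup" feature
y = (X[:, 0] > 0).astype(int)

def oob_permutation_drop(X, y, feature, n_boot=30):
    """Mean drop in out-of-bootstrap accuracy when `feature` is permuted
    in the out-of-bootstrap samples only (the test-set variant)."""
    drops = []
    for _ in range(n_boot):
        boot = rng.integers(0, len(X), size=len(X))      # bootstrap indices
        oob = np.setdiff1d(np.arange(len(X)), boot)      # out-of-bootstrap rows
        if len(oob) == 0:
            continue
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[boot], y[boot])
        base = model.score(X[oob], y[oob])               # accuracy on intact OOB data
        X_perm = X[oob].copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])  # break feature-target link
        drops.append(base - model.score(X_perm, y[oob]))
    return float(np.mean(drops))

print(oob_permutation_drop(X, y, feature=0))  # informative feature: positive drop
print(oob_permutation_drop(X, y, feature=5))  # noise feature: drop near zero
```

Note that permuting feature 0 here understates its importance exactly as the question describes: the forest can fall back on the correlated feature 1, so the accuracy drop reflects only the unique, non-redundant contribution of the permuted column.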