Sensitivity analysis with Random Forest #30536

Caakurugu · 2024-12-23T17:37:27Z

Hi everyone, I am a graduate student in Geophysics, and my research investigates the factors driving groundwater depletion. I am currently working on a sensitivity analysis to identify the main drivers of groundwater depletion using Random Forest (RF). My input dataset includes total precipitation, snow water equivalent (SWE), soil moisture, soil temperature, air temperature, and stream water level. The groundwater level (GWL) and the input features appear to show some seasonal trends. I have the following concerns:

1. Can the RF regression model deal with trended inputs (X) and outputs (Y)? Can RF model non-stationary data, e.g., when Y (the GWL of a bedrock well) is trending downward? If not, is de-trending of Y needed as a preprocessing step before input to the RF model? Or, since the RF model is fit to existing data without making a forecast (new inputs X --> a new Y), do we need to worry about the non-stationary Y?
2. The X variables have different magnitudes. Should they be scaled before input to the RF model, e.g., each input feature scaled to 0-1 before training/testing?
3. What splitting technique is recommended for this kind of analysis?

UmbertoFasci · 2024-12-26T19:22:01Z

Concerning your first point, I believe there are two paths you can take, depending on the objective of your project. While RFs can technically handle non-stationary data, they won't inherently model trends or seasonal patterns.

1. If you are trying to model within the same period, you can use the raw non-stationary data, as long as you include some time-based features to try to capture these patterns.
2. If you are trying to make forecasts, then you should detrend your data, as you mentioned.
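A minimal sketch of both paths, assuming a daily series and using only NumPy. The synthetic GWL series and the helper names `add_seasonal_features` and `linear_detrend` are hypothetical, and a straight-line trend is only one possible choice of trend model.

```python
import numpy as np


def add_seasonal_features(day_of_year):
    """Encode day of year on a circle so late December sits next to early January."""
    angle = 2.0 * np.pi * np.asarray(day_of_year, dtype=float) / 365.25
    return np.column_stack([np.sin(angle), np.cos(angle)])


def linear_detrend(y):
    """Subtract a least-squares straight line from y; return (residual, trend)."""
    y = np.asarray(y, dtype=float)
    t = np.arange(y.size, dtype=float)
    slope, intercept = np.polyfit(t, y, deg=1)
    trend = slope * t + intercept
    return y - trend, trend


# Synthetic stand-in for a daily GWL series: downward trend plus an annual cycle.
t = np.arange(3 * 365)
gwl = 50.0 - 0.01 * t + 2.0 * np.sin(2.0 * np.pi * t / 365.25)

# Path 1: keep the raw target and add time-based features to X.
# (t % 365 + 1 is a rough day-of-year that ignores leap days.)
seasonal = add_seasonal_features(t % 365 + 1)

# Path 2: detrend the target before fitting; keep the trend to add back later.
gwl_detrended, gwl_trend = linear_detrend(gwl)
```

With real data the day of year would normally come from the timestamp column rather than from a row counter.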

You might want to read this paper to get a grip on how you want to handle the non-stationary side of your data depending on how you want to use it.

While scaling is not always necessary for RF, you might want to consider StandardScaler, but remember to apply it appropriately to avoid data leakage: fit it on the training data only, then use it to transform the test data.
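One way to keep the scaler leak-free is to put it inside a scikit-learn `Pipeline`, so it is only ever fit on the rows passed to `fit`. This is a sketch on synthetic data; the two feature columns and the 80/20 chronological cut are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales.
X = np.column_stack([
    rng.normal(500.0, 100.0, 400),
    rng.normal(5.0, 1.0, 400),
])
y = 0.01 * X[:, 0] + X[:, 1] + rng.normal(0.0, 0.1, 400)

# Chronological 80/20 cut: the last 20% of rows are held out.
n_train = int(0.8 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

model = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# The scaler's mean and standard deviation are computed from X_train only.
model.fit(X_train, y_train)

# X_test is transformed with the training statistics before prediction.
y_pred = model.predict(X_test)
```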

For your splitting, you might want to consider time-based splitting, but there are a few options you can look into on this particular matter.
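As one example of time-based splitting, scikit-learn's `TimeSeriesSplit` produces folds in which every training index precedes every test index. The sketch below assumes the rows are already sorted by date; the 100-row array is a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data: 100 rows assumed to be in chronological order.
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(folds):
    # Each fold trains on the past and tests on the block that follows it.
    print(f"fold {i}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

`TimeSeriesSplit` also accepts a `gap` argument that leaves some rows out between the training and test blocks, which may help when neighbouring days are strongly autocorrelated.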

Let me know if this was helpful

Caakurugu · 2024-12-26T20:43:24Z

Thank you for your insightful feedback. The main aim of my analysis is to identify the most influential factors in groundwater level fluctuation. The groundwater level data from confined aquifer wells have exhibited a declining trend since 2016. Input variables include daily accumulated precipitation, daily accumulated SWE, daily average air temperature, daily average soil temperature, daily average soil moisture content (%), and stream water levels. All inputs and the output (GWL) are time series from 11/15/2017 to 06/06/2023.

All the input variables (except stream water levels) were obtained from one SNOTEL station. I used train_test_split to split the data into training (80%) and testing (20%) sets. Three feature importance methods (Gini importance, permutation importance, and SHAP importance) were used to rank the input variables by their influence on the output (GWL). I am not sure whether the random splitting technique has an effect on the feature importance identification.
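For reference, a workflow of this kind might look roughly like the sketch below. It runs on synthetic data with hypothetical feature names, not the SNOTEL dataset, and it covers only the two importance measures that ship with scikit-learn (SHAP needs the separate `shap` package). For a regressor, the impurity-based `feature_importances_` measures variance reduction, which is the regression counterpart of Gini importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["precip", "swe", "soil_moisture"]  # hypothetical subset

# Synthetic target: strong dependence on column 0, weak on column 1, none on column 2.
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, 500)

# Random 80/20 split, as described in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance, computed from the training data during fitting.
gini = rf.feature_importances_

# Permutation importance, computed on the held-out rows.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

for name, g, p in zip(feature_names, gini, perm.importances_mean):
    print(f"{name}: impurity={g:.3f}, permutation={p:.3f}")
```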

UmbertoFasci Dec 29, 2024

I am not sure if the random splitting technique has an effect on the feature importance identification or not?

Since we are now certain you won't be doing any forecasting, random splitting would not be viable in your case. The only way to see whether the split affects the feature importance is to test other splitting techniques and compare the results. In your case, the seasonal patterns may be captured better with other techniques. Let me know what you think.
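A rough sketch of such a comparison, assuming permutation importance as the measure and a synthetic series with a seasonal driver, a pure-noise feature, and a downward trend that is not among the features. The helper `importance_for_split` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def importance_for_split(X_train, X_test, y_train, y_test):
    """Fit an RF on the training part and score permutation importance on the test part."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)
    return result.importances_mean


rng = np.random.default_rng(0)
n = 600
t = np.arange(n)

# Column 0: noisy seasonal driver. Column 1: pure noise.
X = np.column_stack([
    np.sin(2.0 * np.pi * t / 365.25) + rng.normal(0.0, 0.1, n),
    rng.normal(size=n),
])
# Target follows the seasonal driver plus a downward trend that X does not contain.
y = 2.0 * X[:, 0] - 0.005 * t + rng.normal(0.0, 0.1, n)

# Random split: test rows are scattered through the whole record.
random_parts = train_test_split(X, y, test_size=0.2, random_state=0)

# Chronological split: shuffle=False holds out the last 20% of rows.
chrono_parts = train_test_split(X, y, test_size=0.2, shuffle=False)

imp_random = importance_for_split(*random_parts)
imp_chrono = importance_for_split(*chrono_parts)

print("random split:       ", imp_random)
print("chronological split:", imp_chrono)
```

If the ranking of features is the same under both splits, the split is probably not driving the result. If the rankings differ, that is a sign the random split is sensitive to the temporal structure of the data.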
