cart

June 24, 2012

Example: a study of patients admitted after a heart attack
19 variables were collected during the first 24 hours for 215 patients (those who survived the first 24 hours)
Question: Can the high-risk patients (those who will not survive 30 days) be identified?

Impurity of a Node
A measure of node impurity is needed to decide how to split a node, or which node to split
The measure should be at a maximum when a node is equally divided amongst all classes
The impurity should be zero if the node is all one class
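
The notes above do not name a specific impurity measure. As a minimal sketch, the Gini index and entropy are two common choices that satisfy both conditions; the function names and the sample labels below are illustrative assumptions, not taken from the study.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

# Hypothetical nodes: one pure, one equally divided between two classes.
pure = ["survived"] * 10
mixed = ["survived"] * 5 + ["died"] * 5

print(gini(pure), entropy(pure))    # both are zero for a pure node
print(gini(mixed), entropy(mixed))  # both are at their two-class maximum
```

For two classes the maximum is 0.5 for Gini and 1 bit for entropy; with more classes the maximum is still reached at the equal division.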

Predictor variables can be continuous or categorical

A classification tree is created if the response variable is categorical

A regression tree is created if the response variable is continuous

A large sample size is needed to split efficiently when there are many predictors
Interaction between predictors can be identified
Relative importance of predictors cannot be well identified
Missing observations form a separate category

Resubstitution Costs
It is the error of the tree estimated by running the development data set (the data used to build the tree) back through that same tree
As the number of nodes in the tree increases, the resubstitution cost decreases, because the maximal tree always fits the development data best
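
A small sketch of this effect, assuming the cost is measured as the misclassification rate when each terminal node predicts its own majority class. The three partitions below are made-up development labels (H = high risk, L = low risk), not data from the study.

```python
from collections import Counter

def resubstitution_cost(leaves):
    """Fraction of development cases misclassified when every leaf
    predicts its majority class. `leaves` is a list of label lists."""
    total = sum(len(leaf) for leaf in leaves)
    errors = sum(len(leaf) - max(Counter(leaf).values()) for leaf in leaves if leaf)
    return errors / total

# The same 10 hypothetical cases, partitioned by trees of increasing size.
root = [list("HHHLLLLLLL")]                          # no split: 1 terminal node
two_leaves = [list("HHHL"), list("LLLLLL")]          # one split: 2 terminal nodes
maximal = [list("HHH"), list("L"), list("LLLLLL")]   # every terminal node is pure

for name, leaves in [("root", root), ("two leaves", two_leaves), ("maximal", maximal)]:
    print(name, resubstitution_cost(leaves))
```

The cost falls as nodes are added and reaches zero for the maximal tree, which is why it is an optimistic estimate of the error on new data.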

All possible splits for all the input variables are considered
For example, consider a data set with 230 cases and 15 variables. CART considers up to 230 times 15 splits for a total of 3450 possible splits
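
The exhaustive search can be sketched as follows, assuming numeric predictors, midpoints between distinct values as candidate thresholds, and weighted Gini impurity as the score. The variable names (`age`, `bp`) and the six cases are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_thresholds(values):
    """Midpoints between consecutive distinct values: at most n - 1 per variable."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def best_split(rows, labels):
    """Try every candidate split on every variable and return
    (variable, threshold, weighted impurity) for the lowest impurity."""
    n = len(labels)
    best = None
    for var in rows[0]:
        for t in candidate_thresholds([r[var] for r in rows]):
            left = [y for r, y in zip(rows, labels) if r[var] <= t]
            right = [y for r, y in zip(rows, labels) if r[var] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (var, t, score)
    return best

# Hypothetical cases: two numeric predictors and a two-class response.
rows = [
    {"age": 45, "bp": 120}, {"age": 50, "bp": 85}, {"age": 62, "bp": 130},
    {"age": 70, "bp": 88}, {"age": 66, "bp": 80}, {"age": 58, "bp": 125},
]
labels = ["low", "high", "low", "high", "high", "low"]

print(best_split(rows, labels))
print(230 * 15)  # upper bound on splits for 230 cases and 15 variables
```

Here the search settles on a `bp` threshold that separates the two classes completely, after having tried every threshold on `age` as well.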

In case of missing values in a splitter variable, CART identifies surrogate predictive splitter variables
A surrogate splitter contains information that is typically similar to what would be found in the primary splitter
For example, in a given model,
CART splits data according to household income
If a value for income is not available, CART might substitute education level as a good surrogate
By using surrogates to stand in for missing values, robust and reliable predictive models are generated
CART considers all the other independent variables apart from the primary splitter, and prefers as surrogate the variable with the highest degree of association with it
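
A simplified sketch of the idea, assuming "degree of association" is measured as the fraction of cases that a candidate split sends to the same side as the primary split. The thresholds, the `education_years` and `age` variables, and the six households are hypothetical, and the details of how CART itself ranks surrogates are not reproduced here.

```python
def agreement(rows, primary, surrogate):
    """Fraction of cases, among those with both values present, that the
    surrogate split sends to the same side as the primary split."""
    pvar, pt = primary
    svar, st = surrogate
    both = [r for r in rows if r[pvar] is not None and r[svar] is not None]
    same = sum((r[pvar] <= pt) == (r[svar] <= st) for r in both)
    return same / len(both)

def best_surrogate(rows, primary, candidates):
    """Pick the candidate split that agrees most often with the primary split."""
    return max(candidates, key=lambda s: agreement(rows, primary, s))

def route(row, primary, surrogate):
    """Send a case left or right, falling back to the surrogate
    when the primary splitter value is missing."""
    var, t = primary if row[primary[0]] is not None else surrogate
    return "left" if row[var] <= t else "right"

# Hypothetical households; the last one has no income recorded.
rows = [
    {"income": 30000, "education_years": 12, "age": 25},
    {"income": 42000, "education_years": 12, "age": 51},
    {"income": 58000, "education_years": 16, "age": 33},
    {"income": 75000, "education_years": 18, "age": 47},
    {"income": 48000, "education_years": 16, "age": 38},
    {"income": None, "education_years": 17, "age": 29},
]
primary = ("income", 50000)
candidates = [("education_years", 14), ("age", 40)]

surrogate = best_surrogate(rows, primary, candidates)
print(surrogate, agreement(rows, primary, surrogate))
print(route(rows[5], primary, surrogate))
```

In this made-up data the education split agrees with the income split more often than the age split does, so it stands in for income when routing the household with the missing value.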
