Posts

Showing posts from June, 2012

cart

A study of patients after admission for a heart attack: 19 variables were collected during the first 24 hours for 215 patients (those who survived the first 24 hours). Question: can the high-risk patients (those who will not survive 30 days) be identified?

Impurity of a node: a measure of node impurity is needed to help decide how to split a node, or which node to split. The measure should be at its maximum when a node is equally divided amongst all classes, and it should be zero if the node is all one class.

Predictor variables can be continuous or categorical. A classification tree is created if the response variable is categorical; a regression tree is created if the response variable is continuous. A large sample size is needed to split efficiently when there are many predictors. Interactions between predictors can be identified. The relative importance of predictors cannot be well identified. Missing observations form a separate category.

Resubstitution costs: It is error of the tree estimat...
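The two requirements on an impurity measure (maximum for an even class mix, zero for a pure node) can be checked with a small sketch. Gini impurity and entropy are used here as two common choices; the post does not say which measure it has in mind, and the class labels below are made up for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1/p) over the classes."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

# Hypothetical node contents, not data from the heart attack study.
balanced = ["survive", "survive", "die", "die"]
pure = ["survive"] * 4

print("gini   :", gini(balanced), gini(pure))
print("entropy:", entropy(balanced), entropy(pure))
```

Both measures peak on the evenly divided node and drop to zero on the single-class node, which is the behaviour the notes ask for.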

de facto devaluation

There has been a de facto devaluation in Greece even though the country is still in the euro. As the local population is forced to spend less, prices have come down.

Logistic

The chi-square of the intercept should be larger than that of any other variable. If the chi-square of any independent variable is high, that variable will have an inordinately large impact on the dependent variable. VIF: take an independent variable, regress it on the other independent variables, find the R² and plug it into the formula 1/(1 − R²). Sum of squares. Concordance: in pairs of one good and one bad case, the good ones should have a higher score than the bad ones; the more such pairs, the higher the concordance. It should be in the neighbourhood of 60-70%.
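A minimal sketch of the two computations mentioned above. To stay within the standard library, the VIF function covers only the two-predictor case, where the R² of regressing one predictor on the other equals their squared correlation; with more predictors a full regression would be needed. The scores and predictor values are invented for illustration, and ties are counted as not concordant, which is a simplification.

```python
def concordance(good_scores, bad_scores):
    """Share of (good, bad) pairs in which the good case scores higher."""
    pairs = [(g, b) for g in good_scores for b in bad_scores]
    concordant = sum(1 for g, b in pairs if g > b)
    return concordant / len(pairs)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2), with R^2 from regressing x1 on x2 alone."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r_squared = sxy ** 2 / (sxx * syy)
    return 1.0 / (1.0 - r_squared)

# Hypothetical model scores for good and bad cases.
good = [0.9, 0.8, 0.4]
bad = [0.7, 0.3]
print("concordance:", concordance(good, bad))

# Hypothetical predictor columns.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
print("VIF:", vif_two_predictors(x1, x2))
```

Here five of the six good/bad pairs are concordant, and the two predictors share an R² of 0.6, giving a VIF of about 2.5.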

LP

Shadow price, the right term, is really a simple concept in LP. It indicates the maximum amount one should be willing to pay to relax a constraint by one unit in order to improve (maximize or minimize) the objective further.
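One way to see a shadow price is to relax a constraint by one unit and re-solve. The sketch below does this for a small textbook-style problem chosen purely for illustration (maximize 3x + 5y subject to x ≤ 4, 2y ≤ 12, 3x + 2y ≤ 18, x, y ≥ 0). It solves the LP by enumerating corner points, which is only practical for tiny two-variable problems; a real solver would report shadow prices (dual values) directly.

```python
from itertools import combinations

def solve_lp(rhs):
    """Maximize 3x + 5y s.t. x <= rhs[0], 2y <= rhs[1], 3x + 2y <= rhs[2], x, y >= 0.

    Each constraint is stored as (a, b, c), meaning a*x + b*y <= c.
    """
    cons = [(1, 0, rhs[0]), (0, 2, rhs[1]), (3, 2, rhs[2]), (-1, 0, 0), (0, -1, 0)]
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines, no corner point
        # Intersection of the two boundary lines (Cramer's rule).
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + 1e-9 for a, b, c in cons):
            value = 3 * x + 5 * y
            if best is None or value > best:
                best = value
    return best

base_rhs = [4, 12, 18]
base = solve_lp(base_rhs)

shadow_prices = []
for i in range(len(base_rhs)):
    relaxed = list(base_rhs)
    relaxed[i] += 1  # relax constraint i by one unit
    shadow_prices.append(solve_lp(relaxed) - base)

print("optimum:", base)
print("shadow prices:", shadow_prices)
```

The first constraint is not binding at the optimum, so relaxing it is worth nothing; the other two are binding and each extra unit improves the objective, which is the most one should pay for that unit.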

Granger causality test

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another. According to Granger causality, if a signal X1 "Granger-causes" (or "G-causes") a signal X2, then past values of X1 should contain information that helps predict X2 above and beyond the information contained in past values of X2 alone. Suppose that we have three terms, Xt, Yt, and Wt, and that we first attempt to forecast Xt+1 using past terms of Xt and Wt. We then try to forecast Xt+1 using past terms of Xt, Yt, and Wt. If the second forecast is found to be more successful, according to standard cost functions, then the past of Y appears to contain information helping in forecasting Xt+1 that is not in past Xt or Wt. In particular, Wt could be a vector of possible explanatory variables. Thus, Yt would "Granger cause" Xt+1 if (a) Yt occurs before Xt+1; and (b) it contains information useful in forecasting ...
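The comparison of the two forecasts can be sketched with two least-squares regressions. This is a simplified illustration under several assumptions that are not in the post: a single lag, no Wt terms, simulated data in which Y really does drive X, and a hand-written least-squares routine so that only the standard library is needed. The resulting statistic is the usual F ratio for nested regressions.

```python
import random

def ols_rss(X, y):
    """Residual sum of squares from a least-squares fit of y on the columns of X."""
    k = len(X[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    M = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= factor * M[col][j]
    # Back substitution.
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        rest = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - rest) / M[i][i]
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def granger_f(x, y):
    """F statistic for 'y Granger-causes x', using one lag of each series."""
    target = x[1:]
    restricted = [[1.0, x[t]] for t in range(len(x) - 1)]          # past of x only
    unrestricted = [[1.0, x[t], y[t]] for t in range(len(x) - 1)]  # past of x and y
    rss_r = ols_rss(restricted, target)
    rss_u = ols_rss(unrestricted, target)
    n = len(target)
    return ((rss_r - rss_u) / 1) / (rss_u / (n - 3))

# Simulated data (an assumption for illustration): x depends on lagged y.
rng = random.Random(0)
y = [rng.gauss(0, 1) for _ in range(200)]
x = [0.0]
for t in range(199):
    x.append(0.5 * x[t] + 0.8 * y[t] + rng.gauss(0, 0.1))

f_stat = granger_f(x, y)
print("F statistic for 'y Granger-causes x':", f_stat)
```

A large F statistic says that adding the past of Y reduces the forecast error for X by much more than chance would explain; in practice it would be compared against an F distribution to obtain a p-value.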

Factor Analysis

Factor loading: the regression coefficient between an indicator and its factor.

Arima

Lags of the differenced series appearing in the forecasting equation are called "auto-regressive" terms, lags of the forecast errors are called "moving average" terms, and a time series which needs to be differenced to be made stationary is said to be an "integrated" version of a stationary series. A very common pattern in time series data is one where the amplitude of the seasonal changes increases with the overall trend (i.e., the variance is correlated with the mean over the segments of the series). This pattern, called multiplicative seasonality, indicates that the relative amplitude of seasonal changes is constant over time; the absolute amplitude is therefore related to the trend.
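Two of the ideas above can be shown on toy numbers: differencing removes a trend, and taking logs turns multiplicative seasonality into swings of constant size. The series below are invented for illustration, and the log transform is a common remedy rather than something the post itself prescribes.

```python
from math import log

def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# A series with a linear trend becomes constant after one difference.
trend = [3, 5, 7, 9, 11]
print(difference(trend))

# Multiplicative seasonality: the level doubles, and so does the seasonal swing.
levels = [100, 100, 200, 200]
factors = [0.5, 1.5, 0.5, 1.5]
series = [level * factor for level, factor in zip(levels, factors)]
raw_swings = [series[1] - series[0], series[3] - series[2]]

# On the log scale the swing is the same size in both periods.
logged = [log(v) for v in series]
log_swings = [logged[1] - logged[0], logged[3] - logged[2]]

print(raw_swings)
print(log_swings)
```

The raw swings grow with the level while the logged swings stay equal, which is what "constant relative amplitude" means.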

F test

The F test is most often used when comparing statistical models that have been fit to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact F-tests mainly arise when the models have been fit to the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance ratio. Most F-tests arise by considering a decomposition of the variability in a collection of data in terms of sums of squares. The test statistic in an F-test is the ratio of two scaled sums of squares reflecting different sources of variability. These sums of squares are constructed so that the statistic tends to be greater when the null hypothesis is not true. Examples of F-tests include: the hypothesis that the means of several normally distributed populations, all having the same standard deviation, are equal. This is perhaps the best-known F-test, and plays an im...
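The "ratio of two scaled sums of squares" can be made concrete with the equal-means example. The sketch below computes the one-way F statistic for three small made-up groups: the between-group sum of squares divided by its degrees of freedom, over the within-group sum of squares divided by its degrees of freedom.

```python
def one_way_f(groups):
    """F statistic for the hypothesis that all group means are equal."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical samples; the third group has a clearly higher mean.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_f(groups)
print("F =", f_value)
```

For these numbers the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum is 6 on 6, so F is about 13; the further apart the group means, the larger the ratio.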

CHAID

CHAID is a tree-building algorithm used to construct non-binary trees for classification and regression problems. If the response variable is categorical, a chi-square test is used to determine the best next split. If the response variable is continuous, an F-test is used to determine the best next split.
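For the categorical case, the split choice can be sketched by computing a chi-square statistic for each candidate predictor against the response and keeping the strongest. This is a deliberate simplification: full CHAID also merges similar categories and compares adjusted p-values rather than raw statistics. The predictor names and counts below are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: categories of the predictor; columns: classes of the response.
candidates = {
    "age_group": [[30, 10], [10, 30]],  # strongly associated with the response
    "gender": [[21, 19], [19, 21]],     # nearly independent of the response
}

best = max(candidates, key=lambda name: chi_square(candidates[name]))
for name, table in candidates.items():
    print(name, chi_square(table))
print("split on:", best)
```

Because both tables have the same dimensions, ranking by the statistic gives the same order as ranking by p-value; with tables of different sizes the p-values would have to be compared instead.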

CART

CART is nonparametric. CART does not require variables to be selected in advance. The CART algorithm will itself identify the most significant variables and eliminate non-significant ones. CART results are invariant to monotone transformations of its independent variables. Changing one or several variables to their logarithm or square root will not change the structure of the tree. Only the splitting values (but not variables) in the questions will be different. CART can easily handle outliers. Outliers can negatively affect the results of some statistical models, like Principal Component Analysis (PCA) and linear regression. But the splitting algorithm of CART will easily handle noisy data: CART will isolate the outliers in a separate node. Boston Housing is a classical dataset which can be easily used for regression trees. On the one hand, we have 13 independent variables; on the other hand, there is the response variable, the value of the house (variable number 14). Boston hous...
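The invariance claim can be checked on a single split. The sketch below finds the best threshold on one predictor by minimizing the summed squared error of the two child nodes, first on the raw values and then on their logarithms. It is a one-split illustration on made-up data, not a full CART implementation.

```python
from math import log

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, indices sent left) for the lowest-error split on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical values
        left, right = order[:k], order[k:]
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, (lo + hi) / 2, frozenset(left))
    return best[1], best[2]

# Hypothetical predictor and response values.
x = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
y = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]

thr_raw, left_raw = best_split(x, y)
thr_log, left_log = best_split([log(v) for v in x], y)

print("raw threshold:", thr_raw, "left node:", sorted(left_raw))
print("log threshold:", thr_log, "left node:", sorted(left_log))
```

The threshold value changes under the transformation, but the same observations end up in each child node, because a monotone transformation preserves the ordering that the split search relies on.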

SEM

Two main components of models are distinguished in SEM: the structural model showing potential causal dependencies between endogenous and exogenous variables, and the measurement model showing the relations between latent variables and their indicators. Exploratory and Confirmatory factor analysis models, for example, contain only the measurement part, while path diagrams can be viewed as an SEM that only has the structural part. In specifying pathways in a model, the modeler can posit two types of relationships: (1) free pathways, in which hypothesized causal (in fact counterfactual) relationships between variables are tested, and therefore are left 'free' to vary, and (2) relationships between variables that already have an estimated relationship, usually based on previous studies, which are 'fixed' in the model. A structural model with linear relations is only an approximation. The world is unlikely to be linear. Indeed, the true relations between variables ar...

Importance of variance

If you multiply every number in a list by some constant K, you multiply the mean of the numbers by K. Similarly, you multiply the standard deviation by the absolute value of K. For example, suppose you have the list of numbers 1, 2, 3. These numbers have a mean of 2 and a (sample) standard deviation of 1. Now, suppose you were to take these 3 numbers and multiply them by 4. Then the mean would become 8, and the standard deviation would become 4, the variance thus 16. The point is, if you have a set of numbers X related to another set of numbers Y by the equation Y = 4X, then the variance of Y must be 16 times that of X, so you can test the hypothesis that Y and X are related by the equation Y = 4X indirectly by comparing the variances of the Y and X variables. This idea generalizes, in various ways, to several variables inter-related by a group of linear equations. The rules become more complex, the calculations more difficult, but the basic message remains the same -- you can test whethe...
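The arithmetic in this example is easy to verify with Python's `statistics` module, which uses the sample (n − 1) formulas that give a standard deviation of 1 for the list 1, 2, 3.

```python
from statistics import mean, stdev, variance

K = 4
x = [1, 2, 3]
y = [K * v for v in x]  # Y = 4X

print("X: mean", mean(x), "sd", stdev(x), "variance", variance(x))
print("Y: mean", mean(y), "sd", stdev(y), "variance", variance(y))

# The variance scales by K squared, the standard deviation by |K|.
variance_ratio = variance(y) / variance(x)
print("variance ratio:", variance_ratio)
```

A variance ratio far from 16 in real data would count as evidence against the hypothesis Y = 4X.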

Latent variable

Latent variables (as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Using latent variables reduces the dimensionality of data. A large number of observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data. Examples of latent variables from the field of economics include quality of life, business confidence, morale, happiness and conservatism: these are all variables which cannot be measured directly. Latent variables, as created by factor analytic methods, generally represent 'shared' variance, or the degree to which variables 'move' together. Variables that have no correlation cannot result in a latent construct based on the common factor model. Sometimes latent variables correspond to aspects of physical reality, which could in principle be measured, but may not be for pra...