# Advanced Statistical Analysis

## Abstract and Keywords

Definitions of what constitutes advanced statistical analysis often differ among social-work researchers and across disciplines. In this article, the term advanced statistical analysis refers to advanced models increasingly applied to social-work research that help address important research questions. Because contemporary statistical models fully rely on the maximum likelihood (ML) estimator, this article begins with an overview of ML that serves as a foundation for understanding advanced statistical analysis. Following the overview, the article describes six categories of analytic methods that are important in confronting a broad range of issues in social-work research (that is, hierarchical linear modeling, survival analysis, structural equation modeling, propensity score analysis, missing data imputation, and other corrective methods for causal inference such as instrumental variable approach and regression discontinuity designs). These analytic methods are used to address research issues such as generating knowledge for evidence-based practices; evaluating intervention effectiveness in studies that use an experimental or quasi-experimental design; assessing clients’ problems, well-being, and outcome changes over time; and discerning the effects of policies.

Keywords: analysis, assessment, estimator, maximum likelihood, modeling, quantitative research, statistical analysis

Maximum Likelihood (ML) Estimator

The entire business of statistical modeling is *optimization*; that is, with empirical data, researchers attempt to find out best (that is, optimal) model parameters to test theoretically derived hypotheses, to describe relationships (causal or associative) between key study variables, and to discern whether such sample-based estimates of parameters (that is, the so-called *coefficients* based on sample data) are statistically significant. If so, then the researchers can conclude that their research hypotheses or hypothesized relationships of variables may exist in the study population with a low chance of making a Type I error.

Two methods exist for achieving the goal of optimization: minimization and maximization. In minimization, the researcher estimates model coefficients and standard errors by minimizing errors. The set of coefficients and their standard errors are referred to as the *best* (that is, *optimal*) because using this set, and only this set, of coefficients, the researcher can make the overall sample errors minimal, where errors are defined as the differences between observed values of the dependent variable and model-predicted values of the same dependent variable for all sample observations. The most popular approach of minimization is the *ordinary least square* (OLS) estimator.

The second way to achieve optimization is maximization. With maximization, the researcher attempts to find a set of statistical coefficients (that is, the estimated parameters using sample data) by maximizing the likelihood (that is, the probability) of reproducing the sample. Researchers label model-estimated coefficients generated by a maximization algorithm as the best, or optimal, because this and only this set of coefficients provides the researcher with the highest likelihood of reproducing the sample. In sharp contrast to minimization that optimizes model parameters by minimizing errors, maximization achieves the same goal by maximizing the likelihood of reproducing sample. The primary advantage of maximization is its ability to estimate nonlinear equations, which is the dominant feature of contemporary statistical analysis. Almost all advanced statistical models are nonlinear models and cannot be estimated by the OLS algorithm. The ML estimator is the main, if not the sole, approach to statistical modeling via maximization.

Because ML aims to solve the so-called likelihood function, it inevitably involves a complex procedure called *numerical linear approximation* (or *numerical approach*) in the estimation procedure. Numerical approaches are at the core of modern science and are widely practiced in solving complicated mathematical functions such as calculating the trajectory of a spacecraft. As such, the numerical approach must use an iterative procedure to gradually approach the best-estimated coefficients. The most widely practiced method is the Newton–Raphson algorithm. The Newton–Raphson procedure employs a vector composed of the first-order partial derivatives of likelihood function with respect to parameters (that is, the so-called *gradient vector*, also known as the *score vector*) and a matrix composed of the second-order partial derivatives of likelihood function with respect to parameters (that is, the so-called *Hessian matrix*).

The Newton–Raphson algorithm follows four basic steps described below:

1. Based on the mathematical relationships of the variables and the statistical assumptions, the developer of a statistical model writes out a log-likelihood function (that is, ln

*ℓ*) to be maximized. Typically the right-hand side of the equation contains unknown parameters (for example, regression coefficients*β*_{0}and*β*_{1}for a simple logistic regression) and known data of the independent variables (for example, the known independent variable*x*for a simple logistic regression).2. The Newton–Raphson algorithm then inserts starting values of

*β*_{0}and*β*_{1}into the right-hand side of the equation. Given that*x*is known for the entire sample, the algorithm obtains a first guess of ln*ℓ*.3. The algorithm then inserts a different set of values of

*β*_{0}and*β*_{1}to obtain another guess of ln*ℓ*. By comparing the new ln*ℓ*value with the first value, the algorithm can discern the direction for trying the next set of*β*_{0}and*β*_{1}. The process from Step 2 to Step 3 is called*one iteration*.4. The algorithm replicates the above process many times (that is, runs several iterations) until it finds the largest value of ln

*ℓ*(that is, the maximum log-likelihood). The algorithm stops the search when the difference between the current ln*ℓ*and the ln*ℓ*of a previous iteration is less than a predetermined criterion (that is, a very small value called “*convergence criterion*,” such as .000001).

A technical problem exists in Step 3: how does the algorithm know the direction of the next guess about *β*_{0} and *β*_{1}? If a model contains only one unknown parameter, such as a Cox regression using one predictor, it is easy to know the direction of a guessed value in the next iteration (Guo, 2010). Unfortunately, with two or more unknown parameters, the number of possible ways to have guessed values that increase ln *ℓ* is extremely large—or even unlimited. It is for this reason that the Newton–Raphson algorithm uses the gradient vector and Hessian matrix to calculate the so-called direction vector and stepsize and uses these statistics to move from the least optimal estimates of parameters to the optimal ones (Gould, Pitblado, & Poi, 2010; Guo, in press

Variations exist in the numerical approaches. The Newton–Raphson algorithm is the most popular and widely practiced approach, but it is not the only approach. An alternative to Newton–Raphson is Fisher scoring, which is also an iterative method. Fisher scoring uses the *expected value* of the Hessian matrix, called the *expected information*, whereas Newton–Raphson uses the matrix itself, called the *observed information*. Another type of maximizing technique is known as quasi-Newton methods. These methods follow the same steps as the Newton–Raphson, but make a substitution for the Hessian matrix in computing which direction to go. The BHHH algorithm (pronounced B-H-cubed, standing for the four authors, Berndt, Hall, Hall, and Hausman) is an example of quasi-Newton method, which substitutes the outer product of the gradients for the observed negative Hessian matrix (Gould et al., 2010). The expectation–maximization (EM) algorithm is also a popular method to find ML. The EM iteration alternates between performing an expectation (*E*) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (*M*) step, which computes parameters maximizing the expected log-likelihood found in the *E* step. These parameter estimates are then used to determine the distribution of the latent variables in the next *E* step (Expectation-maximization, 2012).

All advanced statistical models use Newton–Raphson or similar approaches to estimate model parameters that maximize the likelihood function. That is, they all follow the basic procedure as that for a simple logistic regression, they all begin with a development of likelihood function, they all work in an iterative fashion, they all approach the optimal values of model parameters by searching estimates that maximize the likelihood, they all must calculate directions and stepsizes or similar quantities to speed up model convergence, and, most important, they all use the gradient vector and Hessian matrix to perform linear numerical approximation.

Hierarchical Linear Modeling

*Hierarchical linear modeling* (HLM; also known as the *random-effects model* or *mixed-effects model* and known as *growth-curve analysis* when researchers apply the method to multiwave panel data) is a class of statistical models developed (a) to analyze “nested” or clustered data, (b) to address multilevel influences on outcomes, and (c) to seek important predictors of change trajectories in a longitudinal study.

Social-work researchers frequently encounter nested or clustered data. A primary feature of nested data is its multilevel structure, such as children nested within families, students nested within classrooms, or clients nested within agencies. When researchers have multilevel data, they can choose one of two analytic strategies: conduct the analysis at either the macro level (for example, the larger social unit) or the micro level (for example, the individual level). To analyze the data at the macro level, the researcher aggregates the information for all individuals from the same macro-level unit. However, macro-level analysis has a clear drawback: it discards individual information and omits important variation among individuals within the same macro-level unit. Alternatively, to analyze data at the micro level, the researcher stacks the individuals across units together. Although individual-level analysis seemed promising, it was impracticable for quantitative researchers until HLM and similar approaches were invented. Before the development of HLM, researchers were limited to linear models such as OLS regression. Most linear models assume that observations are independent of one another; however, this assumption is violated when the study involves nested or clustered data. The essential problem of clustering is data redundancy. For example, suppose a researcher has a dataset consisting of 10 classrooms with 20 students per classroom. Although it might appear that the researcher has 200 observations available from the study, there are not 200 unique pieces of information because students who had the same teachers and shared the same classroom environments are similar in outcomes, and some of the apparent information in the sample is redundant.

A more important reason for conducting multilevel modeling is substantive: researchers need to test how individual characteristics interact with macro characteristics and, therefore, to test joint effects of the two-level characteristics on the outcome variable. For instance, an important question in poverty research is determining the joint impact of a child’s receipt of welfare and a parent’s receipt of welfare on the child’s academic achievement. More generally, this question is, “What is the impact of intergenerational dependence on welfare programs on children’s academic achievement?” This type of cross-level interaction, or propositions about macro-to-micro relations, cannot be answered by studies that use only conventional analytical approaches.

Researchers assess the extent of clustering in sample data by evaluating the sample *intraclass correlation coefficient*. The intraclass correlation coefficient measures the proportion of variance in the outcome variable that is between groups: a high intraclass correlation coefficient indicates that much of the variation in the outcome variable is caused by the groups (that is, greater redundancy in the individual-level data) and signifies the need to use a multilevel model (Raudenbush & Bryk, 2002; Snijders & Bosker, 1999). The intraclass correlation coefficient can be assessed by performing an unconditional analysis of variance with random effects.

Conceptually, a researcher might view HLM as a modeling process that runs OLS regression several times at different levels. A typical HLM analysis of cross-sectional multilevel data involves two levels. Suppose a researcher analyzes a study sample of children nested within families. At Level 1, the researcher regresses the outcome variable of all children on the children’s characteristics. The Level 1 model produces an estimated intercept, regression coefficient, or slope for each predictor variable, plus an error term. The intercept and slopes at Level 1 are called *fixed effects* because for a given value of the Level 1 predictor variable the effect is fixed. For instance, if a model estimates a gender effect of .5 and gender is coded as 1 for male and 0 for female, then a male has an outcome that is .5 higher than female—the gender effect is a fixed value regardless of the child’s other characteristics and which family he or she comes from.

At Level 2, the researcher takes each Level 1 fixed effect as a dependent variable and regresses the variable on the family characteristics. Each Level 2 regression equation then estimates intercept, the effect of each Level 2 variable (or macro characteristic), plus an error term. The error terms of the Level 2 equations are called *random effects* because the values vary by Level 2 unit (for example, if a study sample has 100 families, then the model estimates 100 specific values for each Level 2 error term). The Level 2 regression slopes of the equations that do not use the Level 1 intercept as the dependent variable are called *cross-level interaction*s, which play an important role in testing research hypotheses about multilevel influences. Hierarchical linear modeling estimates the fixed and random effects of a multilevel model by using a *full maximum likelihood* or a *restricted maximum likelihood* estimator in conjunction with an empirical Bayesian method.

Bryk and Raudenbush (1987) demonstrate the linkage between multilevel analysis and longitudinal study depicting change. Since their seminal paper, HLM has been widely used in research involving data collected at multiple time points (in this application, HLM is synonymous with growth-curve analysis) and in research aiming to estimate both multilevel influences on a dependent variable and the change of dependent variable over time (Singer & Willett, 2003). Growth-curve analysis may be specified as a two-level or three-level model. A three-level model, for instance, may formulate occasions, individuals, and organizations as three hierarchical levels to evaluate the rate of change of an outcome, differential rates of change between groups of study participants (that is, a cross-level interaction of occasion and individual), and differential change rates between organizations (that is, a cross-level interaction of occasion and organization). Indeed, HLM is a flexible and versatile model that can be applied to a variety of situations to answer diverse research questions.

Thus, the major difference between HLM and OLS is the treatment of clustering effects. Hierarchical linear modeling corrects for clustering by adding random effects to the model. Random effects are a level of heterogeneity for each Level 2 unit (or Level 3 unit). Researchers are not particularly interested in the substantive meaning of the random effects, but by explicitly specifying these effects in the model, the researcher can correct for biases triggered by clustering. The central task of running HLM is to correctly specify a model that includes appropriate random effects. A simple and useful test is to check the significance of the χ^{2} associated with the estimated variance component (Raudenbush & Bryk, 2002). Because this test requires all (or at least most) groups having sufficient data to compute initial estimates, it may lead to a biased decision. When necessary, researchers should consider using one of two methods to determine the inclusion or exclusion of random effects and, hence, to evaluate the final model’s *goodness of fit* to the data. The first method is the likelihood ratio test (also known as the *deviance test*), which compares models with different specifications on random effects (Snijders & Bosker, 1999). The second method is the comparison of models with different specifications of random effects using either the *Akaike information criterion* or the *Bayesian information criterion*. This test is similar to the deviance test, but instead of making a decision based on likelihood ratio, it uses the Akaike information criterion or Bayesian information criterion that corrects for bias induced by sample size and number of random effects (Verbeke & Molenberghs, 2000).

When planning a study that will use HLM as an analytic method, researchers must figure out the minimum sample size at each level to guarantee that the study has adequate statistical power. Determining adequate statistical power is often a required component of grant proposals. The needed sample size for HLM can be assessed under a general framework of power analysis (Cohen, 1988) and one of the following three frameworks: optimal design for cluster-randomized trials using HLM (Raudenbush, 1997), optimal design for multisite randomized trials using HLM (Raudenbush & Liu, 2000), and design for longitudinal study assessing group differences using HLM (Raudenbush & Liu, 2001a). The Optimal Design software package was specifically developed to accomplish these tasks (Raudenbush & Liu, 2001b); Optimal Design is available without charge on the Internet (http://sitemaker.umich.edu/group-based/optimal_design_software).

Hierarchical linear modeling assumes a normal distribution of the dependent variable. When users have discrete or limited dependent variables, they must run *hierarchical generalized linear models* that use a penalized quasi-likelihood estimator.

Multilevel modeling can also use a different approach called *generalized estimating equation* (GEE). Also known as a marginal approach, GEE was first developed by Zeger, Liang, and Albert (1988) and shares common features as the robust estimator of standard errors (also known as the Huber–White estimator; Huber & Ronchetti, 2009). The GEE and robust estimator of standard errors serve the same function of correcting for clustering effects as that of HLM.

Survival Analysis

*Survival analysis* (SA; also known as *event history analysis*, *duration analysis, transition analysis*, or *failure-time analysis*) is a collection of statistical methods addressing questions that have to do with *whether* and *when* an event of interest takes place. Precisely, SA is “the analysis of data that correspond to the time from a well-defined *time origin* until the occurrence of some particular event or *end-point*” (Collett, 1994, p. 1). Time-to-event data are ubiquitous in social-work research. For instance, child-welfare researchers and practitioners are concerned about the length of time children stay in foster care because federal law (that is, the *Adoption and Safe Families Act*, Pub. L. No. 105-89) requires that agencies make reasonable efforts to find safe, permanent, and nurturing homes for children within 12 months of their entry into the foster-care system. In studying welfare reform, researchers are interested in factors affecting the length of time individuals receive welfare support through the Temporary Assistance for Needy Families program because welfare reform laws (Personal Responsibility and Work Opportunity Reconciliation Act, Pub. L. No. 104-193) placed a lifetime cap of 60 months on support with Temporary Assistance for Needy Families. Researchers and practitioners evaluating mental-health treatment interventions closely monitor the timing of relapses of targeted problems because determining timing of relapse and reducing the incidence are both key measures of the effectiveness of an intervention.

The fundamental feature that distinguishes time-to-event data from conventional data is *censoring*. Censoring refers to data incompletion and has three basic forms. *Right-hand censoring* refers to the situation in which the ending point of a “spell” is unknown or the event of interest has not yet occurred when the data collection is completed. *Left-hand censoring* refers to the situation in which the origin or the starting point of a spell is unknown. *Random censoring* refers to the case in which the researcher observes both the origin and the ending points, but the observation is terminated for reasons other than the event of interest. Because of censoring, researchers use two variables to code time-to-event data: the continuous length of time variable and the indicator variable indicating whether the observed length of time is the time for event occurrence or the time for data censoring. Most SA models allow users to control for right-hand censoring and random censoring, although these models require that random censoring is noninformative (Allison, 1995).

The fundamental task of SA is to use statistics to reveal characteristics of a survival distribution. Biostatisticians have developed important functions to characterize survival distributions and derived rigorous theories to show relationships among these functions. Survival analysis has two most important functions, the first of which is the hazard function, or *h(t)*. Formally, the hazard function is an instantaneous probability measuring rate of change. The hazard function can be expressed as a ratio of conditional probability to have the event within an extremely small time interval (that is, when the time interval approaches zero or is infinitesimal) over the time interval. Hazard function is a formal way to represent change and can be interpreted as speed of change. The second important function of SA is the survivor function, or *S(t)*. The survivor function measures the probability of not having the event (surviving to, or remaining in the participant set of, having no event) by a specific time *t*. Although the term “survivor” might sound strange, it is analogous to a participant whose event of interest has not yet occurred at a given time point. Remember that survival analysis originates from studies of mortality, in which death was the event of interest; therefore, participants who did not have the event were survivors. Thus, we use the survivor function to measure the probability of not having a defined event by time *t*. The life-table method estimates hazard and survivor functions, and the Kaplan–Meier estimator (also known as the *product-limit* method) estimates survivor function. These two methods serve descriptive purposes and can be used to describe survival distribution of a given time-to-event variable. In addition, these two methods can be used to discern group differences on survivor functions using significance tests such as the log-rank test, the Wilcoxon test, the Breslow test, and the Tarone–Ware test. Because of censoring, conventional statistics such as mean and standard deviation cannot be used to describe event history data; instead, researchers must use hazard or survivor functions derived from the life-table or Kaplan–Meier estimator to perform univariate and bivariate analyses (Guo, 2010).

Social-work researchers can use three types of statistical models for multivariate analysis of time-to-event data to discern important predictors of timing of event occurrence. The first model type is the class of *discrete-time models* (Allison, 1982). The primary feature of this type of models is its approximation of hazard rate using probability estimated from a person–time data set. When the study focuses on a single event, the analyst uses a binary logistic regression to estimate the probability. When the study focuses on multiple events (that is, termination of study time caused by more than one reason), the analyst uses a multinomial logit regression to estimate multiple probabilities.

The second model type is the class of *parametric models*, which was the leading approach to multivariate analysis of time-to-event data prior to the Cox proportional hazards model. Parametric models have two main advantages: (a) parametric models enable the user to run models that include left-hand and interval censorings; and (b) when assumptions about survival distribution are tenable, the estimates provided by the model are usually good, unbiased, and efficient. Conversely, parametric models have two disadvantages: (a) parametric models cannot be used to analyze time-varying covariates; and (b) these models require prior knowledge about the nature of the survival distribution being analyzed; if such information is not available, the user must assume that the empirical distribution being analyzed is the same distribution suggested by the parametric model. However, when the above assumption is not valid and the actual distribution is not the same as the distribution suggested by the model, the user will obtain misleading and biased results.

Although the parametric models are not in widespread use today, an innovative development that integrates a discrete-time model and an exponential model called *piecewise exponential model* is promising and potentially could be applied to many empirical datasets to answer important research questions (for example, Sandefur & Cook, 1998). The central idea of the piecewise model can be summarized as follows: although a constant hazard rate assumed by the exponential model is not tenable in real data settings, a piecewise exponential assumption is flexible enough and fits most applications. If the researcher can divide a long study period into a series of short periods and assume that within each short period the hazard rate is constant, then the piecewise exponential model provides an efficient and consistent estimation about important predictors of event history data under study (Guo, 2010).

The last and most widely practiced multivariate model in SA is the *Cox proportional hazards model* (also known as *Cox regression*; Cox, 1972). The Cox regression is important and widely applied in biomedical, engineering, economic, social, and behavioral sciences for several reasons. First, before development of the Cox regression, the leading approach to multivariate survival analysis was the parametric model, which had distinct disadvantages, as previously mentioned. The Cox regression is a distribution-free model and does not require survival distribution information (that is, needed in the parametric model). Second, Cox developed the *partial likelihood* estimation method, which was innovative in several ways. Among other innovations, partial likelihood allows the user to estimate the regression coefficients of the proportional hazards model without having to specify the baseline hazard function (again, Cox regression is a distribution-free approach), and the estimates depend only on the event times’ *ranks*. Because the model depends only on ranks, any monotonic transformation of the event times will leave the coefficient estimates unchanged. Third, the Cox regression was the first model that permits the user to incorporate the time-varying covariates in SA. The parametric models cannot incorporate time-varying covariates, and although the discrete-time model can incorporate the time-varying covariates, this method was developed after Cox’s work. Fourth, with appropriate specifications, the Cox regression can be used to answer challenging research questions. Innovative applications using the Cox regression include competing risks analysis and a nonproportional hazards model that creates a time-varying covariate by interacting a time-fixed covariate with event time.

Statistical power analysis for SA can be performed using the PS program or “Power and Sample Size Calculation Software Package” developed by Dupont and Plummer (2009). In SA power analysis researchers use hazard ratio as a measure of effect size (Collett, 1994).

Recent advances in biomedical research incorporate Cox regression with multilevel analysis. These advances have yielded two types of such models: the *frailty model* and the *marginal model*. Between the two, the marginal model makes less restrictive assumptions and is easier to practice than the frailty model. The marginal models have much in common with the GEE approach (Zeger et al., 1988). In particular, two marginal models are especially useful to multilevel analysis of survival data: the WLW model (Wei, Lin, & Weissfeld, 1989) and the LWA model (Lee, Wei, & Amato, 1992). (Note these models are designated with the initials of the developers’ surnames). Software packages such as Stata and SAS offer routine programs to estimate both models (Guo, 2010).

Structural Equation Modeling

*Structural equation modeling* (SEM; also known as *covariance structure analysis* or *analysis of moment structures*) is probably the most widely practiced quantitative method in social-work research. This widespread use is not coincidental, but results from the fact that social-work practitioners and researchers commonly measure complex patterns of cognition, affect, and behavior. Attitudes (such as racism), cognitions (for example, self-perceptions), behavior patterns (such as aggression), social experiences (for example, social support), and emotions (such as depression) are complex phenomena that cannot be either directly observed or accurately measured with only one questionnaire item (Bowen & Guo, 2012).

Structural equation modeling offers a highly desirable approach to statistical analysis with latent variables (that is, variables that are not directly observed), making various types of statistical analyses feasible. Among the many types of SEM, examples of frequently used SEMs include confirmatory factor analysis in instrument development, structural models depicting recursive or nonrecursive relationships between latent variables, testing mediational effects with latent variables or without latent variables (that is, a path analysis), multitrait–multimethod model, and latent growth-curve analysis (Bollen, 1989). Since the 1970s, SEM has evolved from a technique restricted to the highest echelon of statisticians (for example, Jöreskog, 1971) to a practical tool used and valued by a broad scientific audience. Structural equation modeling enables researchers to study complex relationships, including those in which some variables are latent. The underlying relationship, or *structure*, among variables is formulated as a model and often articulated graphically through path diagrams. Structural equation modeling permits multiple indicators of latent variables and the ability to estimate and test hypothesized relationships while controlling for random and systematic measurement errors. Taking account of measurement error contrasts with the current common practice of treating observed variables, whether single variables or scales, as if observed variables are free of error. Another valuable SEM attribute is that it allows the user to compare the model with the empirical data, which leads to evaluating how well or to what extent the model “fits” the empirical data. If the fit is acceptable, then the model is not rejected and the hypothesized relationships between variables are considered consistent with the data. Alternatively, a poor fit raises questions about the appropriateness of a model even if problems are not evident with individual coefficients. Whereas analysis using SEM once would have required highly trained specialists, modern SEM software is widely available and easy to use.

The primary goal of an SEM analysis is to confirm research hypotheses about the observed means, variances, and covariances of a set of variables. The hypotheses are represented by a number of structural parameters (for example, factor loadings, regression paths) that is smaller than the number of observed data. When used as a confirmatory approach, it is crucial for researchers using SEM to test models that have strong theoretical or empirical foundations. Ideally, SEM is conducted with large sample sizes and continuous variables with multivariate normality. A recent development in SEM permits analysis of latent categorical variables (Muthén & Muthén, 2010). The level of equality between a population variance–covariance matrix and the model-implied variance–covariance matrix serves as a fundamental hypothesis of SEM, which is the key to understanding issues of identification, estimation, and goodness of fit of the estimated model (Bollen, 1989).

Structural equation models are often overidentified models that assume a subset of variables measure theoretically derived or hypothesized concepts or latent variables. The constraints imposed by overidentification are tested through various indices of *goodness of fit*, a term that refers to the extent of similarity between the population variance–covariance matrix and the model-implied variance–covariance matrix, and whether such similarity is sufficient for the researcher to claim support for the hypothesized model. The commonly practiced fit indices and their cutoff values for model acceptance are as follows: model χ^{2} with a nonsignificant *p* value (that is, *p* > .05); root mean square error of approximation with a value less than or equal to .05 indicates close fit, whereas a value between .05 and .08 indicates a reasonable fit and a value greater than or equal to .10 indicates poor fit; comparative fit index that is greater than or equal to .95; Tucker–Lewis index that is greater than or equal to .95; goodness-of-fit index that is greater than or equal to .90; and weighted root mean square residual (in Mplus only) that is greater than or equal to .90 (Bowen & Guo, 2012).

The statistical power analysis for SEM can be performed using a framework developed by MacCallum and colleagues (MacCallum, Browne, & Cai, 2006; MacCallum, Browne, & Sugawara, 1996). In this framework, the effect size of an SEM analysis is defined as the root mean square error of approximation, and the power analysis is performed for testing overall model fit in one model, or for comparing nested models, separately. McCallum and colleagues emphasized a key finding with regard to relationships among factors affecting power: the crucial factor affecting a study’s power is not sample size alone, but sample size in relation to a model’s degrees of freedom (*df*). A more complex model (that is, a model with more estimated parameters and fewer *df*) needs a larger sample to achieve the same level of power as a less complex model (that is, a model with fewer estimated parameters and more *df*).

Structural equation modeling shares common statistical assumptions and specifications with HLM when handling clustering effects (Willett & Sayer, 1994). The special type of SEM developed to analyze nested or clustered data is called the *latent-intercept-and-latent-slope model*, or, when the model is applied to longitudinal data, it is called latent growth-curve analysis (Bollen & Curran, 2006).

Propensity Score Analysis

The development of corrective procedures for observational studies, such as propensity score analysis, originated from a common need among social-science researchers to overcome the challenges of failure to conduct true experiments. Although the randomized controlled trial is considered the gold standard in research design, true experimental designs are not always possible, ethical, practical, or even desirable in human social-science research. Given that social-science research continues to rely heavily on quasi-experimental research design, researchers have increasingly sought methods of improved program evaluation or causal inferences.

Since about 1980, these methods have undergone a significant change because researchers recognized the need to develop more efficient ways to assess treatment effects using observational data as well as the need for methods to evaluate studies based on quasi-experimental designs. This growing interest in seeking consistent and efficient estimators of program effectiveness has led to a surge in work focused on estimating average treatment effects under various sets of assumptions. Statisticians (for example, Rosenbaum & Rubin, 1983) and econometricians (for example, Heckman, 1978, 1979) have made substantial contributions by developing and refining new approaches for the estimation of causal effects from observational data; collectively these approaches are known as *propensity score analysis*.

Researchers found the conventional covariance control approach had numerous flaws and, to draw causal inference, needed to be replaced with a method that was more rigorous. For instance, a common approach in sociological research to evaluate the treatment effect was a regression model (or a regression-type model) with a dummy variable (that is, treatment versus nontreatment); however, this practice has been criticized by Sobel (1996) and others. Indeed, the following summary highlights the primary problems of the covariance control approach discussed in the literature. First, the dummy treatment variable is specified by these models as exogenous, but in fact it is not; the determinants of incidental truncation or sample selection should be explicitly modeled first, and selection effects should be taken into consideration when estimating causal impacts on outcomes (Heckman, 1978, 1979). Second, the strongly ignorable treatment assignment assumption (that is, conditional upon covariates, the treatment assignment is independent from outcomes under both treatment and control conditions) is prone to violation in observational studies. Under such conditions, the presence of the endogeneity problem (that is, correlation of the residual term from a regression model with one or more independent variables used in the model) leads to a biased and inconsistent estimation of the regression coefficient (Berk, 2004; Imbens, 2004; Rosenbaum & Rubin, 1983). Last, covariance control does not automatically correct for the nonignorable treatment assignment (Guo & Fraser, 2010).

The Neyman–Rubin counterfactual framework (Morgan & Winship, 2007; Neyman, 1923; Rubin, 1974) is a conceptual model guiding the development of statistical models for causal inferences. Under this setting, a counterfactual is a potential outcome; that is, a counterfactual is what would have happened in the absence of the cause (Shadish, Cook, & Campbell, 2002). A *counterfactual framework* emphasizes that individuals selected into either the treatment or the nontreatment group have potential outcomes in both states: the condition or state in which they are observed and the condition in which they are not observed. The Neyman–Rubin framework offers a practical way to evaluate counterfactuals. Working with data from a sample that represents the population of interest, the standard estimator for the average treatment effect is considered the difference of two estimated means of outcome between treated and control participants.

The counterfactual framework makes two important assumptions for program evaluation or causal inference. One assumption is the strongly ignorable treatment assignment assumption (previously mentioned) and the other is the *stable unit treatment value assumption*, which was labeled and formally presented by Rubin in 1980. The stable unit treatment value assumption is simply the a priori assumption that the value of *Y* for unit *i* when exposed to treatment *w* will be the same no matter what mechanism is used to assign treatment *w* to unit *i* and no matter what treatments the other units receive; this holds for all participants in the treated and control conditions. When the stable unit treatment value assumption is violated (for example, when unrepresented versions of treatment exist or when interference exists between participants), researchers should seek remedial measures. The stable unit treatment value assumption is both an assumption that facilitates investigation of counterfactuals and a conceptual perspective that underscores the importance of analyzing differential treatment effects with appropriate estimators.

Of the many approaches using propensity scores, four approaches are particularly important (Guo & Fraser, 2010). Although these four approaches share closely related core features, they are technically distinct propensity score models as summarized below:

1.

(1978, 1979)*Heckman’s sample selection model*(Maddala, 1983). The crucial features of these models are (a) an explicit modeling of the structure of selection; (b) a switching regression that seeks exogenous factors determining study participants’ switch between two regimens (that is, the treated and nontreated conditions); (c) the use of the conditional probability of receiving treatment in the estimation of treatment effects; and (d) the use of estimated variability of residuals of the selection equation and the inverse Mills ratio to correct for bias in the outcome analysis that is caused by selection.*and its revised version estimating treatment effects*2.

(Rosenbaum & Rubin, 1983)*Propensity score matching*. The fundamental feature of propensity score matching is that it balances data through resampling or matching nontreated participants to treated participants on probabilities of receiving treatment (that is, the propensity scores) and permits follow-up bivariate or multivariate analysis (for example, stratification of outcome based on quintiles of propensity scores, OLS regression, survival modeling, SEM, HLM, or others) as would be performed on a sample generated by a randomized experiment. Reducing the dimensionality of covariates to a one-dimensional score—the propensity score—is a substantial contribution that leverages matching. The classic*and related models**greedy matching*, that is, in the form of*nearest-neighbor matching within caliper*or similar models, has substantial limitations because it requires a sizeable common-support region (that is, a sufficient overlap on estimated propensity scores between treated and control groups), is a local optimization rather than a global optimization approach, and often leads to sample reduction after matching. To overcome these limitations, researchers can use one of three recently developed methods, all of which are based on the framework of Rosenbaum and Rubin (1983). The three methods are (a) propensity score optimal matching (Haviland, Nagin, & Rosenbaum, 2007; Rosenbaum, 2002a), (b) propensity score weighting (Hirano & Imbens, 2001; McCaffrey, Ridgeway, & Morral, 2004; Rosenbaum, 1987), and (c) propensity score subclassification (Rosenbaum & Rubin, 1983). For an introduction to and summary of these methods, see Guo and Fraser (2010).3.

(Abadie & Imbens, 2002, 2006). The key feature of the matching estimators is to directly impute counterfactuals for both treated and nontreated participants using a vector norm derived from a positive definite matrix (that is, the inverse of sample variance–covariance matrix or the inverse of sample variance matrix). Various types of treatment effects can be estimated: (a) the sample average treatment effect; (b) the sample average treatment effect for the treated; (c) the sample average treatment effect for the controls; and (d) the equivalent effects for the population (that is, population average treatment effect, population average treatment effect for the treated, and population average treatment effect for the controls). Standard errors corresponding to these treatment effects are developed and used in significance tests.*Matching estimators*4.

(Heckman, Ichimura, & Todd, 1997, 1998). The critical feature of this method is the comparison of each treated participant with all nontreated participants based on distances between propensity scores. A nonparametric regression such as local linear regression or*Propensity score analysis with nonparametric regression**lowess*is used to produce an estimate of the average treatment effect for the treatment group. By applying the method to data at two time points, this approach estimates the average treatment effect for the treated group in a dynamic fashion, known as*difference-in-differences*.

The major limitation of propensity score analysis is its failure to control for unmeasured or hidden selection bias. Because of this limitation, this method cannot compete with a randomized controlled trial. Hidden bias is essentially a problem created by the omission of important variables in statistical analysis, and such omission renders the unobserved heterogeneity nonrandom (as reflected by an error term in regression equations). Although correcting the problem might involve steps such as respecifying an analytic model using additional variables, collecting additional data, or redesigning a study as a randomized experiment, these strategies can be expensive and time-consuming. To gauge the level of sensitivity to hidden selections for a completed study, Rosenbaum (2002a, 2002b, 2005) developed a special type of analytic method called *sensitivity analysis to hidden selection bias*. This analysis asks what the unmeasured covariates would have to be like to alter the conclusions of the study. If a study is less sensitive to hidden selections, then the study findings are considered more trustworthy than those from a study that is more sensitive to hidden bias.

Other Corrective Methods for Causal Inferences: Instrumental Variable Approach and Regression Discontinuity Designs

Other types of corrective models for causal-inference research are also available, including the instrumental variable approach, regression discontinuity designs, interrupted time series design, and Bayesian approach to inference of average treatment effect. In this section, we briefly describe the major features of the instrumental variable approach and the regression discontinuity designs.

After OLS regression, the instrumental variable approach is perhaps the second most widely practiced method in economic research (Wooldridge, 2002). We use economics here as an example because this is the leading field in social research that focuses on causal-inferential typed questions. As mentioned earlier, the selection bias problem is a problem of endogeneity of running regression. That is, the lack of randomization mechanism causes the residual term in a regression to be correlated with one or more independent variables. To solve the problem, researchers may find an observed variable *z* that satisfies the following two conditions: *z* is not correlated with the residual term, but *z* is highly correlated with the independent variable that causes endogeneity. If *z* meets these two conditions, then *z* is called an instrumental variable, and researchers can use a two-stage-least-square estimator to estimate the treatment effect through the instrumental variable approach. The *z* may not necessarily be a single variable and can be a vector that involves two or more variables. In practice, finding instrumental variables that meet the two required conditions can be extremely challenging. For this reason, researchers need alternative approaches to correct for endogeneity (Wooldridge, 2002). Propensity score analysis is an example of such a widely practiced alternative to the instrumental variable approach.

Regression discontinuity design (RDD) is a quasi-experimental approach that evaluates the treatment effect by assigning a cutoff or threshold value above or below which a treatment is assigned. By comparing observations lying closely on either side of the threshold, it is possible to estimate the local treatment effect in which a randomized controlled trial is not feasible. The method is highly similar to an interrupted time series design that compares outcomes before and after an intervention, except that the treatment in RDD is a function of a variable other than time. The RDD method was first proposed by Thistlewaite and Campbell (1960) when they analyzed the effect of student scholarships on career aspirations. In practice, researchers using RDD may distinguish between two general settings: the sharp and the fuzzy regression discontinuity designs. The estimation of treatment effects with both designs can be obtained using a standard nonparametric regression approach such as *lowess* with appropriately specified kernel function and bandwidth (Imbens & Lemieux, 2008). Two assumptions exist with RDD: (a) treatment assignment is equally as good as random selection at the threshold for treatment and (b) individuals are sampled independently. Violations of these assumptions lead to biased estimation of treatment effect. The most severe problem of RDD is misspecification of the functional form of the relation between treatment and outcome. Specifically, users run the risk of misinterpreting a nonlinear relationship between treatment and outcome as a discontinuity. Counterfactual values must be extrapolated from observed data below and above the application of the treatment. If the assumptions built into the RDD of extrapolation are unreasonable, then causal estimates are incorrect (Morgan & Winship, 2007).

Missing Data Imputation

Social-work researchers are increasingly applying imputation methods to handle missing data (Rose & Fraser, 2008; Saunders et al., 2006). When using datasets with any missing values, researchers must first understand the extent to which their datasets have missing values and the mechanisms that produce *missingness*. This groundwork is often the first and most important work to do at the early stage of data analysis. The extent to which data are missing from variables to be analyzed can be examined by obtaining univariate descriptive statistics. A careful examination of the pattern of missingness, that is, a simple count of the number of missing values on each variable and determining whether this number correlates with that of other variables, is always worth doing.

Three types of mechanisms (or causes) of missingness are frequently discussed in the literature: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). According to Graham (2009), “If the cases for which the data are missing can be thought of as a random sample of all the cases, then the missingness is MCAR. This means that everything one might want to know about the data set as a whole can be estimated from any of the missing data patterns, including the pattern in which data exist for all variables, that is, for complete cases” (p. 552). The MCAR mechanism is a restrictive form of missingness and it is typically difficult for researchers to discern the MCAR presence in empirical settings. The MAR mechanism refers to a pattern of missingness that can depend on observed data (for example, other variables in the study), but not on values of the outcome of interest (Saunders et al., 2006). According to Acock (2005), “The missing data for a variable are MAR if the likelihood of missing data on the variable is not related to the participant’s score on the variable, after controlling for other variables in the study” (p. 1012). The MNAR mechanism is a pattern of missingness that is systematic or based on values of the outcome variable of interest (Saunders et al., 2006).

Before the development of new methods for missing data imputation, researchers were restricted to dealing with missing data using several common methods including (a) listwise deletion or analysis based on complete cases; (b) pairwise deletion, that is, calculating a correlation or covariance matrix used by SEM using complete cases for each pair of variables, while ignoring missing values irrelevant to the pair of variables for which a correlation or covariance is constructed; (c) mean substitution by replacing missing values of a variable by its sample mean of the complete cases; (d) incorporation of a missingness dummy variable in the analysis in addition to the specially coded missing value; (e) regression-based single imputation; and (f) imputation of categorical values based on data from cases with similar response patterns on other variables. Of these older methods, only listwise deletion is still considered a valid method for statistical analysis and then only under certain conditions, such as if the study uses a large sample and the proportion of missing data is small. Researchers generally agree that all other methods yield biased parameter estimates and should not be used (Allison, 2002; Graham, 2009).

Since 1987, when Little and Rubin published their seminal work on missing data (updated in 2002; Little & Rubin, 2002), statisticians have developed three “modern” missing data procedures: (a) the EM algorithm, (b) multiple imputation under the normal model, and (c) full-information maximum likelihood (FIML) method. All three methods are now widely practiced by social-behavioral researchers.

A study that looked at both the traditional and the newer methods of handling missing values within the analyses using SEM found that both FIML and multiple imputation were superior to other approaches (Acock, 2005). Results from a simulation study comparing FIML to pairwise deletion, listwise deletion, and the “similar response pattern imputation” method indicated that FIML outperformed all of the comparison methods and (a) performed well under the MARcondition, as well as the MCAR, (b) worked well regardless of the amount of missing data (the authors tested samples with between 2 and 25% missing data), and (c) was the least likely of the methods to have convergence failures (Enders & Bandalos, 2001). One attractive feature of FIML is that the method deals with the missing data, conducts parameter estimation, and estimates standard errors all in a single step. Unlike the EM method, FIML offers good estimates of standard errors and permits researchers to perform hypothesis testing without serious bias (Graham, 2009). Unlike the multiple imputation method, FIML does not require multiple imputed files and postimputation aggregation and, therefore, provides final results in one step. In addition, FIML can be used with estimators other than ML in some programs, such as Mplus.

## References

Abadie, A., & Imbens, G. W. (2002). *Simple and bias-corrected matching estimators* (Technical report). Berkeley, CA: Department of Economics, University of California, Berkeley. Retrieved August 8, 2008, from http://ideas.repec.org/p/nbr/nberte/0283.htmlFind this resource:

Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. *Econometrica*, *74*, 235–267. doi:10.1111/j.1468-0262.2006.00655.xFind this resource:

Acock, A. C. (2005). Working with missing values. *Journal of Marriage and the Family*, *67*, 1012–1028. doi:10.1111/j.1741-3737.2005.00191.xFind this resource:

Allison, P. D. (1982). Discrete-time methods for the analysis of event histories. In S. Leighhardt (Ed.), *Sociological Methodology 1982* (pp. 61–98). San Francisco, CA: Jossey–Bass.Find this resource:

Allison, P. D. (1995). *Survival analysis using the SAS system*. Cary, NC: SAS Institute.Find this resource:

Allison, P. D. (2002). *Missing data*. Thousand Oaks, CA: Sage.Find this resource:

Berk, R. A. (2004). *Regression analysis: A constructive critique*. Thousand Oaks, CA: Sage.Find this resource:

Bollen, K. A. (1989). *Structural equations with latent variables*. New York, NY: Wiley.Find this resource:

Bollen, K. A., & Curran, P. J. (2006). *Latent curve models: A structural equation perspective*. Hoboken, NJ: Wiley.Find this resource:

Bowen, N. K., & Guo, S. (2012). *Structural equation modeling*. New York, NY: Oxford University Press.Find this resource:

Bryk, A. S., & Raudenbush, S. W. (1987). Application of hierarchical linear models to assessing change. *Psychological Bulletin*, *101*(1), 147–158. doi:10.1037/0033-2909.101.1.147Find this resource:

Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). Hillsdale, NJ: Erlbaum.Find this resource:

Collett, D. (1994). *Modelling survival data in medical research*. London, UK: Chapman & Hall.Find this resource:

Cox, D. R. (1972). Regression models and life tables (with discussion). *Journal of the Royal Statistical Society, B*, *74*, 187–220.Find this resource:

Dupont, W. D., & Plummer, W. D. (2009). *PS: Power and sample size calculation* (version 3.0). Retrieved from http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize

Enders, C. K., & Bandalos, D. L. (2001). The relative performance of full information maximum likelihood estimation for missing data in structural equation models. *Structural Equation Modeling*, *8*(3), 430–457. doi:10.1207/S15328007SEM0803_5Find this resource:

Expectation–maximization algorithm. (2012). In *Wikipedia*. Retrieved December 10, 2012, from http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

Gould, W., Pitblado, J., & Poi, B. (2010). *Maximum likelihood estimation with Stata* (4th ed.). College Station, TX: Stata Press.Find this resource:

Graham, J. W. (2009). Missing data analysis: Making it work in the real world. *Annual Review of Psychology*, *60*, 549–576. doi:10.1146/annurev.psych.58.110405.085530Find this resource:

Guo, S. (2010). *Survival analysis*. New York, NY: Oxford University Press.Find this resource:

Guo, S. (in press). The maximum likelihood estimator: The untold stories, caveats, and tips for application. *Chinese Sociological Review*, *45*(3).Find this resource:

Guo, S., & Fraser, W. M. (2010). *Propensity score analysis: Statistical methods and applications*. Thousand Oaks, CA: Sage.Find this resource:

Haviland, A., Nagin, D. S., & Rosenbaum, P. R. (2007). Combining propensity score matching and group-based trajectory analysis in an observational study. *Psychological Methods*, *12*, 247–267. doi:10.1037/1082-989X.12.3.247Find this resource:

Heckman, J. J. (1978). Dummy endogenous variables in a simultaneous equations system. *Econometrica*, *46*, 931–960. doi:10.2307/1909757Find this resource:

Heckman, J. J. (1979). Sample selection bias as a specification error. *Econometrica*, *47*, 153–161. doi:10.2307/1912352Find this resource:

Heckman, J. J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. *Review of Economic Studies*, *64*, 605–654. doi:10.2307/2971733Find this resource:

Heckman, J. J., Ichimura, H., & Todd, P. E. (1998). Matching as an econometric evaluation estimator. *Review of Economic Studies*, *65*, 261–294. doi:10.1111/1467-937X.00044Find this resource:

Hirano, K., & Imbens, G. W. (2001). Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. *Health Services & Outcomes Research Methodology*, *2*, 259–278.Find this resource:

Huber, P. J., & Ronchetti, E. M. (2009). *Robust statistics* (2nd ed.). Hoboken, NJ: Wiley. doi:10.1002/9780470434697Find this resource:

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. *Review of Economics and Statistics*, *86*, 4–29. doi:10.1162/003465304323023651Find this resource:

Imbens, G. W., & Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. *Journal of Econometrics 142*, 615–635. doi:10.1016/j.jeconom.2007.05.001Find this resource:

Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. *Psychometrika*, *36*, 409–426. doi:10.1007/BF02291366Find this resource:

Lee, E. W., Wei, L. J., & Amato, D. A. (1992). Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In J. P. Klein & P. K. Goel (Eds.), *Survival analysis: State of the art* (pp. 237–247). Dordrecht, The Netherlands: Kluwer Academic.Find this resource:

Little, R. A., & Rubin, D. B. (2002). *Statistical analysis with missing data* (2nd ed.). New York, NY: Wiley.Find this resource:

MacCallum, R. C., Browne, M. W., & Cai, L. (2006). Testing differences between covariance structure models: Power analysis and null hypothesis. *Psychological Methods*, *11*, 19–35. doi:10.1037/1082-989X.11.1.19Find this resource:

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. *Psychological Methods*, *1*, 130–149. doi:10.1037/1082-989X.1.2.130Find this resource:

Maddala, G. S. (1983). *Limited-dependent and qualitative variables in econometrics*. Cambridge, UK: Cambridge University Press.Find this resource:

McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. *Psychological Methods*, *9*, 403–425. doi:10.1037/1082-989X.9.4.403Find this resource:

Morgan, S. L., & Winship, C. (2007). *Counterfactuals and causal inference: Methods and principles for social research*. New York, NY: Cambridge University Press. doi:10.1017/CBO9780511804564Find this resource:

Muthén, L. K., & Muthén, B. O. (2010). *Mplus user’s guide*. Los Angeles, CA: Muthén & Muthén.Find this resource:

Neyman, J. S. (1923). Statistical problems in agricultural experiments. *Journal of the Royal Statistical Society Series B*, *2*, 107–180.Find this resource:

Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. *Psychological Methods*, *2*(2), 173–185. doi:10.1037/1082-989X.2.2.173Find this resource:

Raudenbush, S. W., & Bryk, A. S. (2002). *Hierarchical linear models: Applications and data analysis methods* (2nd ed.). Thousand Oaks, CA: Sage.Find this resource:

Raudenbush, S. W., & Liu, X. F. (2000). Statistical power and optimal design for multisite randomized trials. *Psychological Methods*, *5*(2), 199–213. doi:10.1037/1082-989X.5.2.199Find this resource:

Raudenbush, S. W., & Liu, X. F. (2001a). Effects of study duration, frequency of observation, and sample size on power in studies of group differences in polynomial change. *Psychological Methods*, *6*(4), 387–401. doi:10.1037/1082-989X.6.4.387Find this resource:

Raudenbush, S. W., & Liu, X. F. (2001b). *Optimal design*. Retrieved from http://sitemaker.umich.edu/group-based/optimal_design_software

Rose, R. A., & Fraser, M. W. (2008). A simplified framework for using multiple imputation in social work research. *Social Work Research*, *32*, 171–178. doi:10.1093/swr/32.3.171Find this resource:

Rosenbaum, P. R. (1987). Model-based direct adjustment. *Journal of the American Statistical Association*, *82*, 387–394.Find this resource:

Rosenbaum, P. R. (2002a). *Observational studies* (2nd ed.). New York, NY: Springer.Find this resource:

Rosenbaum, P. R. (2002b). Covariance adjustment in randomized experiments and observational studies. *Statistical Science*, *17*, 286–304. doi:10.1214/ss/1042727942Find this resource:

Rosenbaum, P. R. (2005). Sensitivity analysis in observational studies. In B. S. Everitt & D. C. Howell (Eds.), *Encyclopedia of statistics in behavioral science* (pp. 1809–1814). New York, NY: Wiley. doi:10.1002/0470013192.bsa606Find this resource:

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. *Biometrika*, *70*, 41–55. doi:10.1093/biomet/70.1.41Find this resource:

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. *Journal of Educational Psychology*, *66*, 688–701. doi:10.1037/h0037350Find this resource:

Sandefur, G. D., & Cook, S. T. (1998). Permanent exits from public assistance: The impact of duration, family, and work. *Social Forces*, *77*(2), 763–786.Find this resource:

Saunders, J. A., Morrow-Howell, N., Spitznagel, E., Dore, P., Proctor, E. K., & Pescarino, R. (2006). Imputing missing data: A comparison of methods for social work researchers. *Social Work Research*, *30*, 19–31. doi:10.1093/swr/30.1.19Find this resource:

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). *Experimental and quasi-experimental designs for generalized causal inference*. Boston, MA: Houghton Mifflin.Find this resource:

Singer, J., & Willett, J. B. (2003). *Applied longitudinal data analysis*. New York, NY: Oxford University Press. doi:10.1093/acprof:oso/9780195152968.001.0001Find this resource:

Snijders, T., & Bosker, R. (1999). *Multilevel analysis: An introduction to basic and advanced multilevel modeling*. Thousand Oaks, CA: Sage.Find this resource:

Sobel, M. E. (1996). An introduction to causal inference. *Sociological Methods & Research*, *24*, 353–379. doi:10.1177/0049124196024003004Find this resource:

Thistlewaite, D. L., & Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. *Journal of Educational Psychology*, *51*, 309–317. doi:10.1037/h0044319Find this resource:

Verbeke, G., & Molenberghs, G. (2000). *Linear mixed models for longitudinal data*. New York, NY: Springer-Verlag.Find this resource:

Wei, L. J., Lin, D. Y., & Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. *Journal of the American Statistical Association*, *84*(408), 1065–1073. doi:10.1080/01621459.1989.10478873Find this resource:

Willett, J. B., & Sayer, A. G. (1994). Using covariance structure analysis to detect correlates and predictors of individual change over time. *Psychological Bulletin*, *116*, 363–381. doi:10.1037/0033-2909.116.2.363Find this resource:

Wooldridge, J. M. (2002). *Econometric analysis of cross section and panel data*. Cambridge, MA: MIT Press.Find this resource:

Zeger, S. L., Liang, K. Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. *Biometrics*, *44*(4), 1049–1060. doi:10.2307/2531734Find this resource: