Thursday, 10 August 2017

Identifying and Imputing Multiple Missing Values


  1. The VIM Package.
  2. The Amelia Package.
  3. The mvnmle Package.
  4. The SeqKnn and rrcovNA Packages.


  1. The VIM Package.

The VIM package can also be used to do multiple imputation through the ‘irmi’ function, whose name stands for Iterative Robust Model-based Imputation. The function runs a sequence of regressions in which each step uses one variable as the outcome and the remaining variables as predictors. If the outcome has any missing values, they are replaced with the predicted values from that regression. A pass is complete once every variable in the data frame has served as the outcome, and passes are repeated until the imputed values stabilize.

> library(VIM)
> data(tao)
> summary(tao)
> imputed.tao <- irmi(tao)
> summary(imputed.tao)
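
The regression loop described above can be sketched in base R. This is a simplified, non-robust illustration of the idea only, not the ‘irmi’ algorithm itself: it uses ordinary least squares (‘lm’) instead of robust regression and a fixed number of passes instead of a convergence check. The helper name ‘iterative_impute’, its ‘n.iter’ argument, and the simulated ‘toy’ data are all assumptions made for this example.

```r
# Hypothetical helper (not part of VIM): iterative regression imputation
# with ordinary least squares and a fixed number of passes.
iterative_impute <- function(df, n.iter = 5) {
  miss <- is.na(df)                      # remember where the gaps were
  cols <- which(colSums(miss) > 0)       # variables that need imputing
  # Start from column means so every regression has complete predictors.
  for (j in cols) {
    df[[j]][miss[, j]] <- mean(df[[j]], na.rm = TRUE)
  }
  for (iter in seq_len(n.iter)) {
    for (j in cols) {
      # Regress variable j on all other variables, using only the rows
      # where variable j was actually observed.
      fit <- lm(reformulate(names(df)[-j], response = names(df)[j]),
                data = df[!miss[, j], , drop = FALSE])
      # Replace the current fill-ins with the regression predictions.
      df[[j]][miss[, j]] <- predict(fit, newdata = df[miss[, j], , drop = FALSE])
    }
  }
  df
}

# Simulated data with a few values removed (for illustration only).
set.seed(42)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50, sd = 0.3)
toy$z <- toy$x - toy$y + rnorm(50, sd = 0.3)
toy$y[c(3, 17, 30)] <- NA
toy$z[c(5, 17)] <- NA

toy.imputed <- iterative_impute(toy)
summary(toy.imputed)
```

Observed values are left untouched; only the cells that were originally NA are overwritten on each pass.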

VIM on CRAN

  2. The Amelia Package.

Another way of dealing with missing data is to use the Amelia package. The Amelia package (Honaker, King, & Blackwell, 2010a) is designed specifically for multiple imputation and handles a variety of data types, as long as the data are in a matrix or data frame. The imputation function, ‘amelia’, runs a bootstrapped EM algorithm on the incomplete multivariate data and returns new data sets in which the missing values have been replaced by imputed ones. The ‘amelia’ function has a variety of optional arguments, including a matrix of priors and bounds for the imputed values.

> library(Amelia)
Loading required package: foreign
##
## Amelia II: Multiple Imputation
## (Version 1.2-18, built: 2010-11-04)
## Copyright (C) 2005-2010 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
> data(africa)
> summary(africa)

Next, we can use the ‘amelia’ function to create the new data set(s). Notice the summary (below) tells us there were “5 imputed datasets” created. We could increase the number of data sets created by changing ‘m=5’ (default) to whatever number of data sets we wanted; however, Honaker, King, and Blackwell (2010b) state “unless the rate of missing-ness is very high, m = 5 (the program default) is probably adequate” (p. 4).

> a.out <- amelia(x = africa, m = 5, cs = "country", ts = "year", logs = "gdp_pc")
> summary(a.out)

Note: the variable named in ‘ts’ must contain integer time indices (such as years), not date-time values.
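
Once several imputed data sets exist, the usual practice is to run the same analysis on each one and then combine the results. A minimal base-R sketch of that combining step (commonly known as Rubin's rules) follows. The helper ‘pool_estimates’ is hypothetical and not part of Amelia, and the estimates and standard errors are made-up numbers standing in for the output of m = 5 separate analyses.

```r
# Hypothetical helper (not part of Amelia): combine one coefficient
# estimated on each of m imputed data sets.
pool_estimates <- function(est, se) {
  m <- length(est)
  q.bar <- mean(est)          # pooled point estimate
  u.bar <- mean(se^2)         # average within-imputation variance
  b <- var(est)               # between-imputation variance
  total.var <- u.bar + (1 + 1 / m) * b
  c(estimate = q.bar, se = sqrt(total.var))
}

# Made-up values for illustration: one estimate and one standard error
# per imputed data set.
est <- c(1.0, 1.2, 0.9, 1.1, 1.3)
se  <- c(0.20, 0.22, 0.21, 0.24, 0.23)

pooled <- pool_estimates(est, se)
pooled
```

The pooled standard error is larger than the average of the individual standard errors because it also carries the disagreement between the imputed data sets.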

Amelia on CRAN
Amelia GUI

  3. The mvnmle Package.

Another way of dealing with missing data involves using the ‘mvnmle’ package (Gross & Bates, 2009) to obtain a complete variance/covariance matrix from maximum likelihood estimates computed on the incomplete data. Notice that this is very different from the previous two methods, which were concerned with producing a new (imputed) data file. The mvnmle method yields only the estimated variance/covariance matrix; no completed data set is returned. This can be useful for multivariate analyses that work from a covariance matrix (e.g., structural equation modeling or principal components analysis).

> library(mvnmle)
> data(apple)
> summary(apple)
> imputed.cov.apple <- mlest(apple)$sigmahat
> imputed.cov.apple
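
As a sketch of how such a matrix might feed a later analysis, the base-R code below runs a principal components analysis directly from a covariance matrix by eigendecomposition. The matrix ‘sigma.hat’ here is a stand-in computed from simulated complete data; in practice it would be the matrix returned by ‘mlest’. The variable names and data are assumptions made for this example.

```r
# Stand-in covariance matrix from simulated data (for illustration only);
# in practice this would be the estimate obtained from mlest().
set.seed(1)
x <- matrix(rnorm(400), ncol = 4,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
sigma.hat <- cov(x)

# PCA from the covariance matrix alone: eigenvectors are the loadings,
# eigenvalues are the component variances (returned in decreasing order).
eig <- eigen(sigma.hat, symmetric = TRUE)
loadings <- eig$vectors
prop.var <- eig$values / sum(eig$values)

round(prop.var, 3)
```

Because only the covariance matrix is used, component scores for individual cases are not available from this route.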


  4. The SeqKnn and rrcovNA Packages.

Finally, another way of dealing with missing data is the k nearest neighbor (knn) approach. This method is simple in principle but effective, and it is often preferred over some of the more sophisticated methods described above. Nearest neighbors are records whose observed values are similar to those of the record with the missing value; the average of the k nearest neighbors’ observed values is used to impute the missing value of a variable (where k can be set by the analyst or R user). Hastie et al. (1999) have shown that a k ranging from 5 to 10 is adequate.

The knn approach assumes the data are missing at random (MAR), meaning that missingness depends only on the observed data. Its advantage is that it can exploit multivariate relationships in the completed data. Its disadvantage is that it does not include a component to model random variation; consequently, uncertainty in the imputed values is underestimated.
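
The basic knn idea can be sketched in a few lines of base R. This is a plain (non-sequential) version written for illustration, using only fully observed rows as donors; it is not the procedure implemented by any of the packages discussed here. The helper ‘knn_impute’ and the simulated ‘toy’ matrix are assumptions made for this example.

```r
# Hypothetical helper: plain knn imputation using complete rows as donors.
knn_impute <- function(x, k = 5) {
  x <- as.matrix(x)
  donors <- x[complete.cases(x), , drop = FALSE]
  for (i in which(!complete.cases(x))) {
    miss <- is.na(x[i, ])
    # Euclidean distance from row i to each donor, computed over the
    # columns that are observed in row i.
    diffs <- sweep(donors[, !miss, drop = FALSE], 2, x[i, !miss])
    d <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, nrow(donors)))]
    # Impute each missing entry with the mean of the k nearest donors.
    x[i, miss] <- colMeans(donors[nn, miss, drop = FALSE])
  }
  x
}

# Simulated data with a few values removed (for illustration only).
set.seed(7)
toy <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
toy[2, "b"] <- NA
toy[9, c("a", "c")] <- NA

toy.imputed <- knn_impute(toy, k = 5)
toy.imputed[c(2, 9), ]
```

Because each imputed value is an average of observed donor values, it always falls within the observed range of its column, which is one way to see why the approach understates uncertainty.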

To implement the knn approach in R, Kim and Yi (2009) have made available the ‘SeqKnn’ package, which performs a sequential knn procedure through the ‘SeqKNN’ function. Again, the example provided in the package documentation offers a quick introduction. The function takes only the data (a matrix or data frame) and k, the user-defined number of nearest neighbors (k = 10 below).

> library(SeqKnn)
> data(khan05)
> imputed.k05 <- SeqKNN(khan05,10)
2208
>

Summaries were not included above because the khan05 dataset has 64 variables, and the summary output would take up an unnecessary amount of space in this article. To get the summaries for comparison, simply type:

> summary(khan05)
> summary(imputed.k05)

The package ‘rrcovNA’ (Todorov, 2010) also has a function for sequential nearest neighbor imputation (‘impSeq’), as well as a robust variant of it (‘impSeqRob’). Like ‘SeqKNN’, the ‘impSeq’ function is simple to call: it requires only the data in matrix or data frame format. The difference between ‘impSeq’ and ‘SeqKNN’ lies in how distances between neighboring cases are determined. The ‘SeqKNN’ function uses Euclidean distances, while ‘impSeq’ uses statistical measures of distance (based on the mean and covariance). In the case of ‘impSeqRob’, the distances are determined by robust estimates of location and scatter. The ‘rrcovNA’ package requires several other packages. Again, the examples provided in the package documentation show how easy these functions are to use.

> library(rrcovNA)
> data(bush10)
> summary(bush10)

Below is an example of the standard ‘impSeq’ function.
> imputed.b10 <- impSeq(bush10)
> summary(imputed.b10)
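
To illustrate the difference between a Euclidean distance and a distance based on the mean and covariance, the base-R sketch below compares the two on simulated, strongly correlated data. The Mahalanobis distance is used as one common example of a covariance-based distance; whether ‘impSeq’ uses exactly this form is an assumption here, and the data and helper names are made up for this example.

```r
# Two strongly correlated variables (simulated for illustration only).
set.seed(3)
x1 <- rnorm(200)
x <- cbind(x1 = x1, x2 = x1 + rnorm(200, sd = 0.3))

center <- colMeans(x)
sigma <- cov(x)

# Two points at the same Euclidean distance from the center:
# one along the direction of correlation, one against it.
along   <- center + c(1,  1)
against <- center + c(1, -1)

euclid <- function(p) sqrt(sum((p - center)^2))
# mahalanobis() returns the squared distance, hence the square root.
maha <- function(p) sqrt(mahalanobis(p, center, sigma))

c(along = euclid(along), against = euclid(against))
c(along = maha(along), against = maha(against))
```

The Euclidean distances are equal, while the covariance-based distance treats the point that goes against the correlation as much farther away, so the two kinds of distance can pick different neighbors for the same case.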

How to Identify and Impute Multiple Missing Values using R.

Original link
