Fitting Logistic Regression Models using Machine Learning Server

Important

This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?

Logistic regression is a standard tool for modeling data with a binary response variable. In R, you fit a logistic regression using the glm function, specifying a binomial family and the logit link function. In RevoScaleR, you can use rxGlm in the same way (see Fitting Generalized Linear Models) or you can fit a logistic regression using the optimized rxLogit function; because this function is specific to logistic regression, you need not specify a family or link function.

A Simple Logistic Regression Example

As an example, consider the kyphosis data set in the rpart package. This data set consists of 81 observations of four variables (Age, Number, Kyphosis, Start) in children following corrective spinal surgery; it is used as the initial example of glm in the White Book (see Additional Resources for more information). The variable Kyphosis reports the absence or presence of this deformity.

We can use rxLogit to model the probability that kyphosis is present as follows:

library(rpart)
rxLogit(Kyphosis ~ Age + Start + Number, data = kyphosis)

The following output is returned:

Logistic Regression Results for: Kyphosis ~ Age + Start + Number
Data: kyphosis
Dependent variable(s): Kyphosis
Total independent variables: 4
Number of valid observations: 81
Number of missing observations: 0

Coefficients:
				Kyphosis
(Intercept) -2.03693354
Age          0.01093048
Start       -0.20651005
Number       0.41060119

The same model can be fit with glm (or rxGlm) as follows:

glm(Kyphosis ~ Age + Start + Number, family = binomial, data = kyphosis)

	 Call:  glm(formula = Kyphosis ~ Age + Start + Number, family = binomial,      data = kyphosis)

	 Coefficients:
	 (Intercept)          Age        Start       Number  
	    -2.03693      0.01093     -0.20651      0.41060  

	 Degrees of Freedom: 80 Total (i.e. Null);  77 Residual
	 Null Deviance:	    83.23
	 Residual Deviance: 61.38 	AIC: 69.38
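
The glm call above does not name a link function because logit is the default link of the binomial family. The following is a minimal sketch in base R that checks this and shows how a coefficient converts to an odds ratio. It uses simulated data rather than kyphosis so that it runs without extra packages; the data frame simData, its columns, and the simulated coefficients are assumptions made for illustration only.

```r
# Simulated data (hypothetical): binary response driven by one predictor
set.seed(42)
n <- 500
simData <- data.frame(x = rnorm(n))
simData$y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * simData$x))

# family = binomial uses the logit link by default
fitImplicit <- glm(y ~ x, family = binomial, data = simData)
fitExplicit <- glm(y ~ x, family = binomial(link = "logit"), data = simData)

# Both calls estimate the same model
all.equal(coef(fitImplicit), coef(fitExplicit))

# exp(coefficient) is the multiplicative change in the odds per unit of x
oddsRatio <- exp(coef(fitImplicit)["x"])
oddsRatio
```

Because rxLogit is specific to logistic regression, the same reasoning explains why it needs neither a family nor a link argument.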

Stepwise Logistic Regression

Stepwise logistic regression is an algorithm that helps you determine which variables are most important to a logistic model. You provide a minimal, or lower, model formula and a maximal, or upper, model formula, and using forward selection, backward elimination, or bidirectional search, the algorithm determines the model formula that provides the best fit based on an AIC or significance level selection criterion.
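
To make the idea concrete, here is a small in-memory sketch that uses the base R step function on simulated data. It is not the RevoScaleR implementation; the data frame simData, its columns, and the choice of a bidirectional AIC search are assumptions chosen to mirror the description above.

```r
# Simulated data (hypothetical): only x1 truly affects the response
set.seed(1)
n <- 300
simData <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
simData$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * simData$x1))

# Lower (minimal) model: intercept only
lowerFit <- glm(y ~ 1, family = binomial, data = simData)

# Bidirectional search between the lower and upper formulas, scored by AIC
stepFit <- step(lowerFit,
	scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
	direction = "both", trace = 0)

formula(stepFit)   # selected model formula
AIC(lowerFit)      # AIC of the starting model
AIC(stepFit)       # AIC of the selected model
```

The search only accepts a step when it lowers the AIC, so the selected model never scores worse than the starting model.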

RevoScaleR provides an implementation of stepwise logistic regression that is not constrained by the use of "in-memory" algorithms. Stepwise logistic regression in RevoScaleR is implemented by the functions rxLogit and rxStepControl.

Stepwise logistic regression begins with an initial model of some sort. We can look at the kyphosis data again and start with a simpler model: Kyphosis ~ Age:

initModel <- rxLogit(Kyphosis ~ Age, data=kyphosis)
initModel

	  Logistic Regression Results for: Kyphosis ~ Age
	  Data: kyphosis
	  Dependent variable(s): Kyphosis
	  Total independent variables: 2
	  Number of valid observations: 81
	  Number of missing observations: 0

	  Coefficients:
	  				Kyphosis
	  (Intercept) -1.809351230
	  Age          0.005441758

We can specify a stepwise model using rxLogit and rxStepControl as follows:

KyphStepModel <-  rxLogit(Kyphosis ~ Age,
	data = kyphosis,
	variableSelection = rxStepControl(method="stepwise",
		scope = ~ Age + Start + Number ))

	KyphStepModel
	  Logistic Regression Results for: Kyphosis ~ Age + Start + Number
	  Data: kyphosis
	  Dependent variable(s): Kyphosis
	  Total independent variables: 4
	  Number of valid observations: 81
	  Number of missing observations: 0

	  Coefficients:
	  			   Kyphosis
	  (Intercept) -2.03693354
	  Age          0.01093048
	  Start       -0.20651005
	  Number       0.41060119

The methods for variable selection (forward, backward, and stepwise), the definition of model scope, and the available selection criteria are all the same as for stepwise linear regression; see "Stepwise Variable Selection" and the rxStepControl help file for more details.

Plotting Model Coefficients

The ability to save model coefficients using the argument keepStepCoefs = TRUE within the rxStepControl call and to plot them with the function rxStepPlot was described in great detail for stepwise rxLinMod in Fitting Linear Models using RevoScaleR. This functionality is also available for stepwise rxLogit objects.

Prediction

As described above for linear models, the objects returned by the RevoScaleR model-fitting functions do not include fitted values or residuals. We can obtain them, however, by calling rxPredict on our fitted model object, supplying the original data used to fit the model as the data to be used for prediction.
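
As a rough illustration of what fitted values and residuals mean for a logistic model, the sketch below computes them by hand in base R. The simulated data and variable names are hypothetical, and the residual shown is the raw response residual (observed minus predicted probability), which is an assumption about the definition rather than a statement of what rxPredict writes.

```r
# Simulated data (hypothetical)
set.seed(7)
n <- 200
simData <- data.frame(x = rnorm(n))
simData$y <- rbinom(n, size = 1, prob = plogis(0.3 + 1.2 * simData$x))

fit <- glm(y ~ x, family = binomial, data = simData)

# Fitted probability = inverse logit of the linear predictor
linPred <- coef(fit)["(Intercept)"] + coef(fit)["x"] * simData$x
predByHand <- plogis(linPred)

# Same values from predict() on the response (probability) scale
predBuiltIn <- predict(fit, newdata = simData, type = "response")
all.equal(unname(predByHand), unname(predBuiltIn))

# Raw residual: observed response minus predicted probability
rawResid <- simData$y - predByHand
head(cbind(simData, pred = predByHand, resid = rawResid))
```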

For example, consider the mortgage default example in Tutorial: Analyzing loan data with RevoScaleR. In that example, we used ten input data files to create the data set used to fit the model. But suppose instead we use nine input data files to create the training data set and use the remaining data set for prediction. We can do that as follows (again, remember to modify the first line for your own system):

#  Logistic Regression Prediction

bigDataDir <- "C:/MRS/Data"
mortCsvDataName <- file.path(bigDataDir, "mortDefault", "mortDefault")
trainingDataFileName <- "mortDefaultTraining"
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
targetDataFileName <- "mortDefault2009.xdf"
ageLevels <- as.character(c(0:40))		
yearLevels <- as.character(c(2000:2009))
colInfo <- list(list(name = "houseAge", type = "factor",
	levels = ageLevels), list(name = "year", type = "factor",
	levels = yearLevels))
# Create the .xdf file on the first pass, then append rows to it
append <- "none"
for (i in 2000:2008)
{
	importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
	rxImport(inData = importFile, outFile = trainingDataFileName,
		colInfo = colInfo, append = append)
	append <- "rows"
}


rxImport(inData = mortCsv2009, outFile = targetDataFileName,
	colInfo = colInfo)

We can then fit a logistic regression model to the training data and predict with the prediction data set as follows:

logitObj <- rxLogit(default ~ year + creditScore + yearsEmploy + ccDebt,
	data = trainingDataFileName, blocksPerRead = 2, verbose = 1,
	reportProgress=2)
rxPredict(logitObj, data = targetDataFileName,
	outData = targetDataFileName, computeResiduals = TRUE)

The blocksPerRead argument is ignored if run locally using R Client.

To view the first 30 rows of the output data file, use rxGetInfo as follows:

rxGetInfo(targetDataFileName, numRows = 30)

Prediction Standard Errors and Confidence Intervals

You can use rxPredict to obtain prediction standard errors and confidence intervals for models fit with rxLogit in the same way as for those fit with rxLinMod. The original model must be fit with covCoef=TRUE:

#  Prediction Standard Errors and Confidence Intervals

logitObj2 <- rxLogit(default ~ year + creditScore + yearsEmploy + ccDebt,
	data = trainingDataFileName, blocksPerRead = 2, verbose = 1,
	reportProgress=2, covCoef=TRUE)

The blocksPerRead argument is ignored if run locally using R Client.

You then specify computeStdErr=TRUE to obtain prediction standard errors; if this is TRUE, you can also specify interval="confidence" to obtain a confidence interval:

rxPredict(logitObj2, data = targetDataFileName,
	outData = targetDataFileName, computeStdErr = TRUE,
	interval = "confidence", overwrite=TRUE)

The first ten lines of the file with predictions can be viewed as follows:

rxGetInfo(targetDataFileName, numRows=10)

	  File name: C:\Users\yourname\Documents\MRS\mortDefault2009.xdf
	  Number of observations: 1e+06
	  Number of variables: 10
	  Number of blocks: 2
	  Compression type: zlib
	  Data (10 rows starting with row 1):
	     creditScore houseAge yearsEmploy ccDebt year default default_Pred
	  1          617       20           8   4410 2009       0 6.620773e-06
	  2          623       11           7   5609 2009       0 4.610861e-05
	  3          758       17           4   7250 2009       0 4.259884e-04
	  4          687       22           5   3761 2009       0 3.770789e-06
	  5          663       15           6   6746 2009       0 2.312827e-04
	  6          676       10           2   7106 2009       0 1.092593e-03
	  7          721       23           2   2280 2009       0 8.515912e-07
	  8          680       18           7   2831 2009       0 6.011109e-07
	  9          734        9           5   3867 2009       0 3.144299e-06
	  10         688       16           8   6238 2009       0 5.350031e-05
	     default_StdErr default_Lower default_Upper
	  1    3.143695e-07  6.032422e-06  7.266507e-06
	  2    1.953612e-06  4.243427e-05  5.010109e-05
	  3    1.594783e-05  3.958500e-04  4.584203e-04
	  4    1.739047e-07  3.444893e-06  4.127516e-06
	  5    8.733193e-06  2.147838e-04  2.490486e-04
	  6    3.952975e-05  1.017797e-03  1.172880e-03
	  7    4.396314e-08  7.696409e-07  9.422675e-07
	  8    3.091885e-08  5.434655e-07  6.648706e-07
	  9    1.469334e-07  2.869109e-06  3.445883e-06
	  10   2.224102e-06  4.931401e-05  5.804197e-05
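
The bounds in the output above are not symmetric around the predicted value, which is what you would expect if the interval is built on the log-odds scale and then transformed to probabilities. The sketch below shows that construction in base R on simulated data. It is an assumption about the method, offered for intuition, and not documented behavior of rxPredict; the data and variable names are hypothetical.

```r
# Simulated data (hypothetical)
set.seed(11)
n <- 400
simData <- data.frame(x = rnorm(n))
simData$y <- rbinom(n, size = 1, prob = plogis(-1 + 0.9 * simData$x))
fit <- glm(y ~ x, family = binomial, data = simData)

newData <- data.frame(x = c(-1, 0, 1))

# Standard errors on the link (log-odds) scale
linkPred <- predict(fit, newdata = newData, type = "link", se.fit = TRUE)

# 95% interval built on the link scale, then mapped to probabilities
z <- qnorm(0.975)
pred  <- plogis(linkPred$fit)
lower <- plogis(linkPred$fit - z * linkPred$se.fit)
upper <- plogis(linkPred$fit + z * linkPred$se.fit)

# Delta-method standard error on the probability scale
stdErr <- linkPred$se.fit * pred * (1 - pred)

data.frame(newData, pred, stdErr, lower, upper)
```

Building the interval on the link scale keeps both bounds strictly between 0 and 1, which a symmetric interval on the probability scale cannot guarantee for predictions near 0.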

Using ROC Curves to Evaluate Estimated Binary Response Models

A receiver operating characteristic (ROC) curve can be used to visually assess binary response models. It plots the True Positive Rate (the number of correctly predicted TRUE responses divided by the actual number of TRUE responses) against the False Positive Rate (the number of incorrectly predicted TRUE responses divided by the actual number of FALSE responses), calculated at various thresholds. The True Positive Rate is the same as the sensitivity, and the False Positive Rate is equal to one minus the specificity.
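
The two rates can be computed directly from their definitions. The following base R sketch does so for a handful of thresholds. The helper rocPoint and the toy vectors are hypothetical, and the rule "predict TRUE when the predicted value is at or above the threshold" is an assumption; rxRoc may treat ties at the threshold differently.

```r
# Toy data (hypothetical): true classes and predicted probabilities
actual <- c(0, 0, 1, 1, 0, 1, 0, 1)
pred   <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9)

# rocPoint is a hypothetical helper, not a RevoScaleR function.
# Assumed rule: predict TRUE when the predicted value is >= threshold.
rocPoint <- function(threshold, actual, pred) {
	predictedTrue <- pred >= threshold
	tpr <- sum(predictedTrue & actual == 1) / sum(actual == 1)
	fpr <- sum(predictedTrue & actual == 0) / sum(actual == 0)
	c(threshold = threshold, truePositiveRate = tpr,
		falsePositiveRate = fpr, specificity = 1 - fpr)
}

# One row per threshold
rocTable <- t(sapply(c(0, 0.25, 0.5, 0.75, 1), rocPoint,
	actual = actual, pred = pred))
rocTable
```

Plotting truePositiveRate against falsePositiveRate for these rows traces the ROC curve.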

Let’s start with a simple example. Suppose we have a data set with 10 observations. The variable actual contains the actual responses, or the ‘truth’. The variable badPred contains the predicted responses from a very poor model. The variable goodPred contains the predicted responses from a great model.

# Using ROC Curves for Binary Response Models

sampleDF <- data.frame(
	actual = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
	badPred = c(.99, .99, .99, .99, .99, .01, .01, .01, .01, .01),
	goodPred = c( .01, .01, .01, .01, .01,.99, .99, .99, .99, .99))

We can now call the rxRocCurve function to compute the sensitivity and specificity for the ‘bad’ predictions, and draw the ROC curve. The numBreaks argument indicates the number of breaks to use in determining the thresholds for computing the true and false positive rates.

rxRocCurve(actualVarName = "actual", predVarNames = "badPred",
	data = sampleDF, numBreaks = 10, title = "ROC for Bad Predictions")

[Figure: rxRocCurve plot for badPred]

Since all of our predictions are wrong at every threshold, the ROC curve is a flat line at 0. The Area Under the Curve (AUC) summary statistic is 0.

At the other extreme, let’s draw an ROC curve for our great model:

rxRocCurve(actualVarName = "actual", predVarNames = "goodPred",
	data = sampleDF, numBreaks = 10, title = "ROC for Great Predictions")

[Figure: rxRocCurve plot for goodPred]

With perfect predictions, we see the True Positive Rate is 1 for all thresholds, and the AUC is 1. We’d expect a random guess ROC curve to lie along the white diagonal line.
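
The two AUC values quoted for the toy data can be reproduced with a trapezoid rule over the ROC points. This is a sketch in base R, not the calculation rxRocCurve performs internally; the helper aucByTrapezoid, the threshold grid, and the tie rule at the threshold are all assumptions.

```r
# Same toy data as in the text, repeated so the block runs on its own
sampleDF <- data.frame(
	actual = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
	badPred = c(.99, .99, .99, .99, .99, .01, .01, .01, .01, .01),
	goodPred = c(.01, .01, .01, .01, .01, .99, .99, .99, .99, .99))

# aucByTrapezoid is a hypothetical helper, not a RevoScaleR function.
# Assumed rule: predict TRUE when the predicted value is >= threshold.
aucByTrapezoid <- function(actual, pred, thresholds = seq(0, 1, by = 0.1)) {
	tpr <- sapply(thresholds, function(th) mean(pred[actual == 1] >= th))
	fpr <- sapply(thresholds, function(th) mean(pred[actual == 0] >= th))
	ord <- order(fpr, tpr)   # walk the curve from left to right
	fpr <- fpr[ord]
	tpr <- tpr[ord]
	# Sum of trapezoid areas between consecutive ROC points
	sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

aucBad  <- aucByTrapezoid(sampleDF$actual, sampleDF$badPred)
aucGood <- aucByTrapezoid(sampleDF$actual, sampleDF$goodPred)
c(bad = aucBad, good = aucGood)
```

Under these assumptions the helper gives 0 for badPred and 1 for goodPred, in agreement with the values reported for the two plots.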

Now let’s use actual model predictions in an ROC curve. We’ll use the small mortgage default sample data to estimate a logistic model and then compute predicted values. (The blocksPerRead argument is ignored if run locally using R Client.)

# Using mortDefaultSmall for predictions and an ROC curve

mortXdf <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall")
logitOut1 <- rxLogit(default ~ creditScore + yearsEmploy + ccDebt,
	data = mortXdf,	blocksPerRead = 5)

predFile <- "mortPred.xdf"

predOutXdf <- rxPredict(modelObject = logitOut1, data = mortXdf,
	writeModelVars = TRUE, predVarNames = "Model1", outData = predFile)

Now, let’s estimate a different model (with one fewer independent variable), and add the predictions from that model to our output data set:

# Estimate a second model without ccDebt
logitOut2 <- rxLogit(default ~ creditScore + yearsEmploy,
	data = predOutXdf, blocksPerRead = 5)

# Add predictions to prediction data file
predOutXdf <- rxPredict(modelObject = logitOut2, data = predOutXdf,
	predVarNames = "Model2")

Now we can compute the sensitivity and specificity for both models, using rxRoc:

rocOut <- rxRoc(actualVarName = "default",
	predVarNames = c("Model1", "Model2"),
	data = predOutXdf)
rocOut

		 threshold predVarName sensitivity specificity
	  1       0.00      Model1 1.000000000   0.0000000
	  2       0.01      Model1 0.825902335   0.9197118
	  3       0.02      Model1 0.647558386   0.9567965
	  4       0.03      Model1 0.569002123   0.9721488
	  5       0.04      Model1 0.481953291   0.9797647
	  6       0.05      Model1 0.437367304   0.9845472
	  7       0.06      Model1 0.386411890   0.9877825
	  8       0.07      Model1 0.335456476   0.9900130
	  9       0.08      Model1 0.305732484   0.9916406
	  10      0.09      Model1 0.288747346   0.9930272
	  11      0.10      Model1 0.261146497   0.9940520
	  12      0.11      Model1 0.237791932   0.9947252
	  13      0.12      Model1 0.225053079   0.9953682
	  14      0.13      Model1 0.208067941   0.9959107
	  15      0.14      Model1 0.197452229   0.9963528
	  16      0.15      Model1 0.182590234   0.9967648
	  17      0.16      Model1 0.171974522   0.9971064
	  18      0.17      Model1 0.161358811   0.9973877
	  19      0.18      Model1 0.152866242   0.9975886
	  20      0.19      Model1 0.150743100   0.9978298
	  21      0.20      Model1 0.144373673   0.9980307
	  22      0.21      Model1 0.138004246   0.9982518
	  23      0.22      Model1 0.131634820   0.9984527
	  24      0.23      Model1 0.131634820   0.9986034
	  25      0.24      Model1 0.129511677   0.9987340
	  26      0.25      Model1 0.123142251   0.9987843
	  27      0.26      Model1 0.116772824   0.9988546
	  28      0.27      Model1 0.116772824   0.9989149
	  29      0.28      Model1 0.114649682   0.9989752
	  30      0.29      Model1 0.108280255   0.9990355
	  31      0.30      Model1 0.101910828   0.9991158
	  32      0.31      Model1 0.099787686   0.9991661
	  33      0.32      Model1 0.091295117   0.9992264
	  34      0.33      Model1 0.087048832   0.9992866
	  35      0.34      Model1 0.082802548   0.9993469
	  36      0.35      Model1 0.080679406   0.9994072
	  37      0.36      Model1 0.074309979   0.9994474
	  38      0.37      Model1 0.072186837   0.9994675
	  39      0.38      Model1 0.070063694   0.9995077
	  40      0.39      Model1 0.067940552   0.9995780
	  41      0.40      Model1 0.063694268   0.9995881
	  42      0.41      Model1 0.063694268   0.9996182
	  43      0.42      Model1 0.063694268   0.9996684
	  44      0.43      Model1 0.055201699   0.9996986
	  45      0.44      Model1 0.050955414   0.9997287
	  46      0.45      Model1 0.048832272   0.9997790
	  47      0.46      Model1 0.046709130   0.9997790
	  48      0.47      Model1 0.042462845   0.9997991
	  49      0.48      Model1 0.040339703   0.9998091
	  50      0.49      Model1 0.040339703   0.9998292
	  51      0.50      Model1 0.040339703   0.9998493
	  52      0.51      Model1 0.040339703   0.9998794
	  53      0.52      Model1 0.033970276   0.9998895
	  54      0.53      Model1 0.031847134   0.9999096
	  55      0.54      Model1 0.031847134   0.9999196
	  56      0.55      Model1 0.031847134   0.9999297
	  57      0.58      Model1 0.029723992   0.9999397
	  58      0.59      Model1 0.027600849   0.9999498
	  59      0.60      Model1 0.023354565   0.9999598
	  60      0.61      Model1 0.016985138   0.9999598
	  61      0.63      Model1 0.014861996   0.9999598
	  62      0.65      Model1 0.014861996   0.9999799
	  63      0.70      Model1 0.014861996   0.9999900
	  64      0.72      Model1 0.012738854   0.9999900
	  65      0.74      Model1 0.010615711   0.9999900
	  66      0.78      Model1 0.010615711   1.0000000
	  67      0.80      Model1 0.008492569   1.0000000
	  68      0.83      Model1 0.006369427   1.0000000
	  69      0.89      Model1 0.004246285   1.0000000
	  70      0.91      Model1 0.000000000   1.0000000
	  71      0.00      Model2 1.000000000   0.0000000
	  72      0.01      Model2 0.108280255   0.9612776
	  73      0.02      Model2 0.000000000   0.9994474
	  74      0.03      Model2 0.000000000   1.0000000

With the removeDups argument set to its default of TRUE, rows containing duplicate entries for sensitivity and specificity were removed from the returned data frame. In this case, it results in many fewer rows for Model2 than Model1. We can use the rxRoc plot method to render our ROC curve using the computed results.
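
To see why removing duplicates shortens the table, consider the sketch below, which drops rows that repeat an earlier (sensitivity, specificity) pair for the same model. It is written in base R on a made-up table; the values are hypothetical, and the exact rule rxRoc applies when removeDups is TRUE is an assumption here.

```r
# Toy ROC table (hypothetical values)
rocTable <- data.frame(
	threshold   = c(0.00, 0.01, 0.02, 0.03, 0.04),
	predVarName = "Model2",
	sensitivity = c(1, 0.1, 0, 0, 0),
	specificity = c(0, 0.96, 1, 1, 1))

# Keep the first row of each (predVarName, sensitivity, specificity) combination
keyCols <- c("predVarName", "sensitivity", "specificity")
dedupedTable <- rocTable[!duplicated(rocTable[, keyCols]), ]
dedupedTable
```

A model whose predictions span a narrow range reaches sensitivity 0 and specificity 1 after only a few thresholds, so most of its rows are repeats and are dropped.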

plot(rocOut)

The resulting plot shows that the second model is much closer to the “random” diagonal line than the first model.

[Figure: ROC curves for Model1 and Model2, produced by plot(rocOut)]