Partilhar via


rxRoc: Receiver Operating Characteristic (ROC) computations and plot

Description

Compute and plot an ROC curve using actual and predicted values from binary classifier system

Usage

  rxRoc(actualVarName, predVarNames, data, numBreaks = 100, 
      removeDups = TRUE, blocksPerRead = 1, reportProgress = 0)

  rxRocCurve(actualVarName, predVarNames, data, numBreaks = 100, 
      blocksPerRead = 1, reportProgress = 0, computeAuc = TRUE, title = NULL,
      subtitle = NULL, xTitle = NULL, yTitle = NULL, legend = NULL,
      chanceGridLine = TRUE, ...)

 ## S3 method for class `rxRoc':
as.data.frame  ( x, ..., var = NULL)

 ## S3 method for class `rxRoc':
rxAuc  ( x )

 ## S3 method for class `rxRoc':
plot  (x, computeAuc = TRUE, title = NULL, subtitle, 
      xTitle = NULL, yTitle = NULL, legend = NULL, chanceGridLine = TRUE, ...)

Arguments

actualVarName

A character string with the name of the variable containing actual (observed) binary values.

predVarNames

A character string or vector of character strings with the name(s) of the variable containing predicted values in the [0,1] interval.

data

data frame, character string containing an .xdf file name (with path), or RxXdfData object representing an .xdf file containing the actual and observed variables.

numBreaks

integer specifying the number of breaks to use to determine thresholds for computing the true and false positive rates.

removeDups

logical; if TRUE, rows containing duplicate entries for sensitivity and specificity will be removed from the returned data frame. If performing computations for more than one prediction variable, this implies that there may be a different number of rows for each prediction variable.

blocksPerRead

number of blocks to read for each chunk of data read from the data source.

reportProgress

integer value with options:

  • 0: no progress is reported.
  • 1: the number of processed rows is printed and updated.
  • 2: rows processed and timings are reported.
  • 3: rows processed and all timings are reported.

computeAuc

logical value. If TRUE, the AUC is computed for each prediction variable and printed in the subtitle or legend text.

title

main title for the plot. Alternatively main can be used. If NULL a default title will be created.

subtitle

subtitle (at the bottom) for the plot. If NULL and computeAuc is TRUE, the AUC for a single prediction variable will be computed and printed in the subtitle.

xTitle

title for the X axis. Alternatively xlab can be used. If NULL, a default X axis title will be used.

yTitle

title for the Y axis. Alternatively ylab can be used. If NULL, a default Y axis title will be used.

legend

logical value. If TRUE and more than one prediction variable is specified, a legend is is created. If computeAuc is TRUE, the AUC is computed for each prediction variable and printed in the legend text.

chanceGridLine

logical value. If TRUE, a grid line from (0,0) to (1,1) is added to represent a pure chance model.

x

an rxRoc object.

var

an integer or character string specifying the prediction variable for which to extract data frame containing the ROC computations. If an integer is specified, it will use that as an index to an alphabetized list of predictionVarNames. If NULL, all of the computed data will be returned in a data frame.

...

additional arguments to be passed directly to an underlying function. For plotting functions, these are passed to the xyplot function.

Details

rxRoc computes the sensitivity (true positive rate) and specificity (true negative rate) using a variable containing actual (observed) zero and one values and a variable containing predicted values in the unit interval as the discrimination threshold is varied. The thresholds are determined by the numBreaks argument. The computations are done on chunks of data, so that they can be performed on very large data sets. If more than one prediction variable is specified, the computations will be performed for each prediction variable. Observations that have a missing value for the actual value or any of the prediction values are removed before computations are performed.

rxRocCurve and the S3 plot method for an rxRoc object plot the computed sensitivity (true positive rate) versus 1 - specificity (false positive rate). ROC curves were first used during World War II for detecting enemy objects in battle fields.

Value

rxRoc returns a data frame of class "rxRoc" containing four variables: threshold, sensitivity, specificity, and predVarName (a factor variable containing the prediction variable name).

The rxAuc S3 method for an rxRoc object returns the AUC (area under the curve) summary statistic.

Author(s)

Microsoft Corporation Microsoft Technical Support

See Also

rxPredict, rxLogit, rxGlm, rxLinePlot.

Examples


 ########################################################################
 # Example using simple created actual and prediction data
 ########################################################################
 # Create a data frame with made-up actual and predicted values
 sampleDF <- data.frame(actual = c(0,0,0,0,0, 1,1,1,1,1))
 sampleDF$prediction <- c(.6, .5, .4, .3, .2, .8, .7, .6, .5, .4)
 # Add predictions that are all wrong and all right
 sampleDF$wrongPrediction <- c(.99, .99, .99, .99, .99, .01, .01, .01, .01, .01)
 sampleDF$rightPrediction <- c( .01, .01, .01, .01, .01,.99, .99, .99, .99, .99)
 # Compute the ROC information for all three prediction variables
 rocOut <- rxRoc(actualVarName = "actual", predVarNames = 
   c("prediction", "wrongPrediction", "rightPrediction"), 
   data = sampleDF, numBreaks = 10)
 # View the computed sensitivity and specificity
 rocOut
 # Plot the results
 plot(rocOut, title = "ROC Curve for Simple Data",
   lineStyle = c("solid", "twodash", "dashed"))
 ########################################################################
 #  Example using data frame with one predicted variable
 ########################################################################
 # Estimate a logistic regression model using the internal 'infert' data
 rxLogitOut <- rxLogit(case ~ spontaneous + induced, data=infert )

 # Compute predictions for the model, creating a new data frame with
 # predictions and the original data used to estimate the model
 rxPredOut <- rxPredict(modelObject = rxLogitOut, data = infert, 
   writeModelVars = TRUE, predVarNames = "casePred1")

 # Compute the ROC data for the default number of thresholds   
 rxRocObject <- rxRoc(actualVarName = "case", predVarNames = c("casePred1"), 
     data = rxPredOut)

 # Draw the ROC curve
 plot(rxRocObject)

 #########################################################################
 #  Example using a data frame with two predicted variables and rxRocCurve
 #########################################################################

 # As in first example, estimate a logistic regression model and
 # compute predictions

 logitOut1 <- rxLogit(case ~ spontaneous + induced, data=infert )

 predOut <- rxPredict(modelObject = logitOut1, data = infert, 
   writeModelVars = TRUE, predVarNames = "Model1")

 # Estimate another model, and add predictions to prediction data frame
 logitOut2 <- rxLogit(case ~ spontaneous + induced + parity, data=infert )
 predOut <- rxPredict(modelObject = logitOut2, data = infert,
       outData = predOut, predVarNames = "Model2")

 # Do computations and plot ROC curve
 rxRocCurve(actualVarName = "case", predVarNames = c("Model1", "Model2"),
   data = predOut,
   title = "ROC Curves for 'case', including 'parity' in Model2") 

 #########################################################################
 #  Example using xdf files 
 #########################################################################  
 mortXdf <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall")

 logitOut1 <- rxLogit(default ~ creditScore + yearsEmploy + ccDebt, 
         data = mortXdf, blocksPerRead = 5)

 predFile <- tempfile(pattern = ".rxPred", fileext = ".xdf")

 # predOutXdf will be a data source object representing the
 # prediction xdf file (predFile)
 predOutXdf <- rxPredict(modelObject = logitOut1, data = mortXdf, 
     writeModelVars = TRUE, predVarNames = "Model1", outData = predFile)

 # Estimate a second model without ccDebt
 logitOut2 <- rxLogit(default ~ creditScore + yearsEmploy, 
     data = predOutXdf, blocksPerRead = 5)

 # Add predictions to prediction data file
 predOutXdf <- rxPredict(modelObject = logitOut2, data = predOutXdf, 
     predVarNames = "Model2")

 rxRocCurve(actualVarName = "default", 
     predVarNames = c("Model1", "Model2"), 
     data = predOutXdf)

 # Remove temporary file storing predictions       
 file.remove(predFile)