Freigeben über


Fitting Linear Models using RevoScaleR

Important

This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?

Linear regression models are fitted in RevoScaleR using the rxLinMod function. Like other RevoScaleR functions, rxLinMod uses an updating algorithm to compute the regression model. The R object returned by rxLinMod includes the estimated model coefficients and the call used to generate the model, together with other information that allows RevoScaleR to recompute the model. BSecause rxLinMod is designed to work with arbitrarily large data sets, quantities such as residuals, and fitted values are not included in the return object, although these can be obtained easily once the model has been fitted.

As a simple example, let’s use the sample data set AirlineDemoSmall.xdf and fit the arrival delay by day of week:

# Fitting Linear Models

readPath <- rxGetOption("sampleDataDir")
airlineDemoSmall <- file.path(readPath, "AirlineDemoSmall.xdf")
rxLinMod(ArrDelay ~ DayOfWeek, data = airlineDemoSmall)

	Call:
	rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airlineDemoSmall)

	Linear Regression Results for: ArrDelay ~ DayOfWeek
	File name:
		C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\ RevoScaleR\SampleData\AirlineDemoSmall.xdf
	Dependent variable(s): ArrDelay
	Total independent variables: 8 (Including number dropped: 1)
	Number of valid observations: 582628
	Number of missing observations: 17372

	Coefficients:
						ArrDelay
	(Intercept)         10.3318058
	DayOfWeek=Monday     1.6937981
	DayOfWeek=Tuesday    0.9620019
	DayOfWeek=Wednesday -0.1752668
	DayOfWeek=Thursday  -1.6737983
	DayOfWeek=Friday     4.4725290
	DayOfWeek=Saturday   1.5435207
	DayOfWeek=Sunday       Dropped

Because our predictor is categorical, we can use the cube argument to rxLinMod to perform the regression using a partitioned inverse, which may be faster and may use less memory than the standard algorithm. The output object also includes a data frame with the averages or counts for each category:

arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, cube = TRUE,
	data = airlineDemoSmall)

Typing the name of the object arrDelayLm1 yields the following output:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airlineDemoSmall,
	cube = TRUE)

	Cube Linear Regression Results for: ArrDelay ~ DayOfWeek
	File name:
	    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\ RevoScaleR\SampleData\AirlineDemoSmall.xdf
	Dependent variable(s): ArrDelay
	Total independent variables: 7
	Number of valid observations: 582628
	Number of missing observations: 17372

	Coefficients:
	                     ArrDelay
	DayOfWeek=Monday    12.025604
	DayOfWeek=Tuesday   11.293808
	DayOfWeek=Wednesday 10.156539
	DayOfWeek=Thursday   8.658007
	DayOfWeek=Friday    14.804335
	DayOfWeek=Saturday  11.875326
	DayOfWeek=Sunday    10.331806

Obtaining a Summary of the Model

The print method for the rxLinMod model object shows only the call and the coefficients. You can obtain more information about the model by calling the summary method:

# Obtaining a Summary of a Model

summary(arrDelayLm1)

This produces the following output, which includes substantially more information about the model coefficients, together with the residual standard error, multiple R-squared, and adjusted R-squared:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airlineDemoSmall,
	cube = TRUE)

Cube Linear Regression Results for: ArrDelay ~ DayOfWeek
File name:
	C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\ RevoScaleR\SampleData\AirlineDemoSmall.xdf
Dependent variable(s): ArrDelay
Total independent variables: 7
Number of valid observations: 582628
Number of missing observations: 17372

Coefficients:
	                    Estimate Std. Error t value Pr(>|t|)     | Counts
	DayOfWeek=Monday     12.0256     0.1317   91.32 2.22e-16 *** |  95298
	DayOfWeek=Tuesday    11.2938     0.1494   75.58 2.22e-16 *** |  74011
	DayOfWeek=Wednesday  10.1565     0.1467   69.23 2.22e-16 *** |  76786
	DayOfWeek=Thursday    8.6580     0.1445   59.92 2.22e-16 *** |  79145
	DayOfWeek=Friday     14.8043     0.1436  103.10 2.22e-16 *** |  80142
	DayOfWeek=Saturday   11.8753     0.1404   84.59 2.22e-16 *** |  83851
	DayOfWeek=Sunday     10.3318     0.1330   77.67 2.22e-16 *** |  93395
	---
	Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

	Residual standard error: 40.65 on 582621 degrees of freedom
	Multiple R-squared: 0.001869 (as if intercept included)
	Adjusted R-squared: 0.001858
	F-statistic: 181.8 on 6 and 582621 DF,  p-value: < 2.2e-16
	Condition number: 1

Using Probability Weights

Probability weights are common in survey data; they represent the probability that a case was selected into the sample from the population, and are calculated as the inverse of the sampling fraction. The variable perwt in the census data represents a probability weight. You pass probability weights to the RevoScaleR analysis functions using the pweights argument, as in the following example:

#  Using Probability Weights

rxLinMod(incwage ~ F(age), pweights = "perwt", data = censusWorkers)

Yields the following output:

	Call:
	rxLinMod(formula = incwage ~ F(age), data = censusWorkers, pweights = "perwt")

	Linear Regression Results for: incwage ~ F(age)
	File name:
	    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
	Probability weights: perwt
	Dependent variable(s): incwage
	Total independent variables: 47 (Including number dropped: 1)
	Number of valid observations: 351121
	Number of missing observations: 0

	Coefficients:
	                incwage
	(Intercept)  32111.3012
	F_age=20    -19488.0874
	F_age=21    -18043.7738
	F_age=22    -15928.5538
	F_age=23    -13498.2138
	F_age=24    -11248.8583
	F_age=25     -7948.6603
	F_age=26     -5667.3599
	F_age=27     -4682.8375
	F_age=28     -3024.1648
	F_age=29     -1480.8682
	F_age=30      -971.4461
	F_age=31       143.5648
	F_age=32      2009.0019
	F_age=33      2861.3031
	F_age=34      2797.9986
	F_age=35      4038.2131
	F_age=36      5511.8633
	F_age=37      5148.1841
	F_age=38      5957.2190
	F_age=39      7652.9870
	F_age=40      6969.9630
	F_age=41      8387.6821
	F_age=42      9150.0006
	F_age=43      8936.7590
	F_age=44      9267.0820
	F_age=45     10148.1702
	F_age=46      9099.0659
	F_age=47      9996.0450
	F_age=48     10408.1712
	F_age=49     10281.1324
	F_age=50     12029.6751
	F_age=51     10247.2529
	F_age=52     12105.0654
	F_age=53     10957.8211
	F_age=54     11881.4307
	F_age=55     11490.8457
	F_age=56      9442.3856
	F_age=57      9331.2347
	F_age=58      9353.1980
	F_age=59      7526.4912
	F_age=60      6078.4463
	F_age=61      4636.6756
	F_age=62      3129.5531
	F_age=63      2535.4858
	F_age=64      1520.3707
	F_age=65        Dropped

Using Frequency Weights

Frequency weights are useful when your data has a particular form: a set of data in which one or more cases is exactly replicated. If you then compact the data to remove duplicated observations and create a variable to store the number of replications of each observation, you can use that new variable as a frequency weights variable in the RevoScaleR analysis functions.

For example, the sample data set fourthgraders.xdf contains the following 44 rows:

	   height eyecolor   male reps
	14     44     Blue   Male    1
	23     44    Brown   Male    3
	9      44    Brown Female    1
	38     44    Green Female    1
	8      45     Blue   Male    1
	31     45     Blue Female    5
	2      45    Brown   Male    2
	51     45    Brown Female    1
	6      45    Green   Male    1
	47     45    Hazel Female    1
	37     46     Blue   Male    1
	21     46     Blue Female    3
	12     46    Brown   Male    2
	5      46    Brown Female    4
	19     46    Green   Male    1
	80     46    Green Female    3
	64     47     Blue   Male    1
	69     47     Blue Female    3
	39     47    Brown   Male    4
	18     47    Brown Female    3
	41     47    Green   Male    2
	7      47    Green Female    4
	26     47    Hazel   Male    3
	17     48     Blue   Male    4
	36     48     Blue Female    3
	13     48    Brown   Male    6
	62     48    Brown Female    4
	61     48    Green Female    2
	56     48    Hazel   Male    1
	10     49     Blue   Male    1
	15     49     Blue Female    5
	22     49    Brown   Male    2
	4      49    Brown Female    5
	3      49    Green   Male    3
	91     49    Green Female    1
	94     50     Blue   Male    1
	96     50     Blue Female    1
	97     50    Brown   Male    1
	1      50    Brown Female    1
	28     50    Green   Male    4
	52     50    Hazel   Male    1
	98     51     Blue   Male    1
	57     51    Brown Female    1
	90     51    Green Female    1

The reps column shows the number of replications for each observation; the sum of the reps column indicates the total number of observations, in this case 100. We can fit a model (admittedly not very useful) of height on eye color with rxLinMod as follows:

	#  Using Frequency Weights

	fourthgraders <- file.path(rxGetOption("sampleDataDir"),
	    "fourthgraders.xdf")
	fourthgradersLm <- rxLinMod(height ~ eyecolor, data = fourthgraders,
	    fweights="reps")

Typing the name of the object shows the following output:

	Call:
	rxLinMod(formula = height ~ eyecolor, data = fourthgraders, fweights = "reps")

	Linear Regression Results for: height ~ eyecolor
	File name:
	    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\fourthgraders.xdf
	Linear Regression Results for: height ~ eyecolor
	Frequency weights: reps
	Dependent variable(s): height
	Total independent variables: 5 (Including number dropped: 1)
	Number of valid observations: 44
	Number of missing observations: 0

	Coefficients:
	                    height
	(Intercept)    47.33333333
	eyecolor=Blue  -0.01075269
	eyecolor=Brown -0.08333333
	eyecolor=Green  0.40579710
	eyecolor=Hazel     Dropped

Using rxLinMod with R Data Frames

While RevoScaleR is primarily designed to work with data on disk, it can also be used with in-memory R data frames. As an example, consider the R sample data frame “cars”, which contains data recorded in the 1920s on the speed of cars and the distances taken to stop.

	#  Using rxLinMod with R Data Frames

	rxLinMod(dist ~ speed, data = cars)

We get the following output (which matches the output given by the R function lm):

	Call:
	rxLinMod(formula = dist ~ speed, data = cars)

	Linear Regression Results for: dist ~ speed
	Data: cars
	Dependent variable(s): dist
	Total independent variables: 2
	Number of valid observations: 50
	Number of missing observations: 0

	Coefficients:
	                  dist
	(Intercept) -17.579095
	speed         3.932409

Using the Cube Option for Conditional Predictions

If the cube argument is set to TRUE and the first term of the independent variables is categorical, rxLinMod will compute and return a data frame with category counts. If there are no other independent variables, or if the cubePredictions argument is set to TRUE, the data frame will also contain predicted values. Let’s create a simple data frame to illustrate:

#  Using the Cube Option for Conditional Predictions

xfac1 <- factor(c(1,1,1,1,2,2,2,2,3,3,3,3), labels=c("One1", "Two1", "Three1"))
xfac2 <- factor(c(1,1,1,1,1,1,2,2,2,3,3,3), labels=c("One2", "Two2", "Three2"))
set.seed(100)
y <- as.integer(xfac1) + as.integer(xfac2)* 2 + rnorm(12)
myData <- data.frame(y, xfac1, xfac2)

If we estimate a simple linear regression of y on xfac1, the coefficients are equal to the within- group means:

myLinMod <- rxLinMod(y ~ xfac1, data = myData, cube = TRUE)
myLinMod

	Call:
	rxLinMod(formula = y ~ xfac1, data = myData, cube = TRUE)

	Cube Linear Regression Results for: y ~ xfac1
	Data: myData
	Dependent variable(s): y
	Total independent variables: 3
	Number of valid observations: 12
	Number of missing observations: 0

	Coefficients:
						y
	xfac1=One1   3.109302
	xfac1=Two1   5.142086
	xfac1=Three1 8.250260

In addition to the standard output, the returned object contains a countDF data frame with information on within-group means and counts in each category. In this simple case the within-group means are the same as the coefficients:

	myLinMod$countDF
	   xfac1        y Counts
	1   One1 3.109302      4
	2   Two1 5.142086      4
	3 Three1 8.250260      4

Using rxLinMod with the cube option also allows us to compute conditional within-group means. For example:

myLinMod1 <- rxLinMod(y~xfac1 + xfac2, data = myData,
	cube = TRUE, cubePredictions = TRUE)
myLinMod1

	Call:
	rxLinMod(formula = y ~ xfac1 + xfac2, data = myData, cube = TRUE,
	cubePredictions=TRUE)

	Cube Linear Regression Results for: y ~ xfac1 + xfac2
	Data: myData
	Dependent variable(s): y
	Total independent variables: 6 (Including number dropped: 1)
	Number of valid observations: 12
	Number of missing observations: 0

	Coefficients:
						y
	xfac1=One1    7.725231
	xfac1=Two1    8.833730
	xfac1=Three1  8.942099
	xfac2=One2   -4.615929
	xfac2=Two2   -2.767359
	xfac2=Three2   Dropped

If we look at the countDF, we see the within-group means, conditional on the average value of the conditioning variable, xfac2:

myLinMod1$countDF
	     xfac1        y Counts
	  1   One1 4.725427      4
	  2   Two1 5.833926      4
	  3 Three1 5.942295      4

For the variable xfac2, 50% of the observations have the value “One2”, 25% of the observations have the value “Two2”, and 25% have the value “Three2”. We can compute the weighted average of the coefficients for xfac2 as:

myCoef <- coef(myLinMod1)
avexfac2c <- .5*myCoef[4] + .25*myCoef[5]

To compute the conditional within-group mean shown in the countDF for xfac1 equal to “One1”, we add this to the coefficient computed for “One1”:

condMean1 <- myCoef[1] + avexfac2c
condMean1

	xfac1=One1
	4.725427

Conditional within-group means can also be computed using additional continuous independent variables.

Fitted Values, Residuals, and Prediction

When you fit a model with lm or any of the other core R model-fitting functions, you get back an object that includes as components both the fitted values for the response variable and the model residuals. For models fit with rxLinMod or other RevoScaleR functions, it is usually impractical to include these components, as they can be many megabytes in size. Instead, they are computed on demand using the rxPredict function. This function takes an rxLinMod object as its first argument, an input data set as its second argument, and an output data set as its third argument. If the input data set is the same as the data set used to fit the rxLinMod object, the resulting predictions are the fitted values for the model. If the input data set is a different data set (but one containing the same variable names used in fitting the rxLinMod object), the resulting predictions are true predictions of the response for the new data from the original model. In either case, residuals for the predicted values can be obtained by setting the flag computeResiduals to TRUE.

For example, we can draw from the 7% sample of the large airline data set (available online) training and prediction data sets as follows (remember to customize the first line below for your own system):

bigDataDir <- "C:/MRS/Data"
sampleAirData <- file.path(bigDataDir, "AirOnTime7Pct.xdf")
trainingDataFile <- "AirlineData06to07.xdf"
targetInfile <- "AirlineData08.xdf"

rxDataStep(sampleAirData, trainingDataFile, rowSelection = Year == 1999 |
	Year == 2000 | Year == 2001 | Year == 2002 | Year == 2003 |
	Year == 2004 | Year == 2005 | Year == 2006 | Year == 2007)
rxDataStep(sampleAirData, targetInfile, rowSelection = Year == 2008)

We can then fit a linear model with the training data and compute predicted values on the prediction data set as follows:

arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek + UniqueCarrier + Dest,
	data = trainingDataFile)
rxPredict(arrDelayLm2, data = targetInfile, outData = targetInfile)

To see the first few rows of the result, use rxGetInfo as follows:

rxGetInfo(targetInfile,  numRows = 5)

	  File name: C:\MRS\Data\OutData\AirlineData08.xdf
	  Number of observations: 489792
	  Number of variables: 14
	  Number of blocks: 2
	  Compression type: zlib
	  Data (5 rows starting with row 1):
	    Year DayOfWeek UniqueCarrier Origin Dest CRSDepTime DepDelay TaxiOut TaxiIn
	  1 2008       Mon            9E    ATL  HOU   7.250000       -2      14      5
	  2 2008       Mon            9E    ATL  HOU   7.250000       -2      16      2
	  3 2008       Sat            9E    ATL  HOU   7.250000        0      18      5
	  4 2008       Wed            9E    ATL  SRQ  12.666667        6      20      4
	  5 2008       Sat            9E    HOU  ATL   9.083333       -9      15      9
	    ArrDelay ArrDel15 CRSElapsedTime Distance ArrDelay_Pred
	  1      -15    FALSE            140      696      8.981548
	  2        0    FALSE            140      696      8.981548
	  3        0    FALSE            140      696      5.377572
	  4        6    FALSE             86      445      7.377952
	  5      -23    FALSE            120      696      6.552242

Prediction Standard Errors, Confidence Intervals, and Prediction Intervals

You can also use rxPredict to obtain prediction standard errors, provided you have included the variance-covariance matrix in the original rxLinMod fit. If you choose to compute the prediction standard errors, you can also obtain either of two kinds of intervals: confidence intervals that for a given confidence level tells us how confident we are that the expected value is within the given interval, and prediction intervals that specify, for a given confidence level, how likely future observations are to fall within the interval given what has already been observed. Standard error computations are computationally intensive, and they may become prohibitive on large data sets with a large number of predictors. To illustrate the computation, we start with a small data set, using the example on page 132 of Introduction to the Practice of Statistics, (5th Edition). The predictor, neaInc, is the increase in “non-exercise activity” in response to an increase in caloric intake. The response, fatGain, is the associated increase in fat. We first read in the data and create a data frame to use in our analysis:

# Standard Errors, Confidence Intervals, and Prediction Intervals

neaInc <- c(-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571,
	580, 620, 690)
fatGain <- c( 4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0,
	0.4, 2.3, 1.1)
ips132df <- data.frame(neaInc = neaInc, fatGain=fatGain)

Next we fit the linear model with rxLinMod, setting covCoef=TRUE to ensure we have the variance-covariance matrix in our model object:

ips132lm <- rxLinMod(fatGain ~ neaInc, data=ips132df, covCoef=TRUE)

Now we use rxPredict to obtain the fitted values, prediction standard errors, and confidence intervals. By setting writeModelVars to TRUE, the variables used in the model will also be included in the output data set. In this first example, we obtain confidence intervals:

ips132lmPred <- rxPredict(ips132lm, data=ips132df, computeStdErrors=TRUE,
	interval="confidence", writeModelVars = TRUE)

The standard errors are by default put into a variable named by concatenating the name of the response variable with an underscore and the string “StdErr”:

ips132lmPred$fatGain_StdErr

	 [1] 0.3613863 0.3381111 0.3209344 0.2323853 0.2288433 0.2254018 0.1941840
	 [8] 0.1863180 0.1915656 0.2151569 0.2202366 0.2418892 0.2598922 0.2646223
	[15] 0.2865818 0.3279390

We can view the original data, the fitted prediction line, and the confidence intervals as follows:

rxLinePlot(fatGain + fatGain_Pred + fatGain_Upper + fatGain_Lower ~ neaInc,
	data = ips132lmPred, type = "b",
	lineStyle = c("blank", "solid", "dotted", "dotted"),
	lineColor = c(NA, "red", "black", "black"),
	symbolStyle = c("solid circle", "blank", "blank", "blank"),
	title = "Data, Predictions, and Confidence Bounds",
	xTitle = "Increase in Non-Exercise Activity",
	yTitle = "Increse in Fat", legend = FALSE)

The resulting plot is shown below:

rxLinePlot 2

The prediction intervals can be obtained and plotted as follows:

ips132lmPred2 <- rxPredict(ips132lm, data=ips132df, computeStdErrors=TRUE,
	interval="prediction", writeModelVars = TRUE)
rxLinePlot(fatGain + fatGain_Pred + fatGain_Upper + fatGain_Lower ~ neaInc,
	data = ips132lmPred2, type = "b",
	lineStyle = c("blank", "solid", "dotted", "dotted"),
	lineColor = c(NA, "red", "black", "black"),
	symbolStyle = c("solid circle", "blank", "blank", "blank"),
	title = "Prediction Intervals",
	xTitle = "Increase in Non-Exercise Activity",
	yTitle = "Increse in Fat", legend = FALSE)

The resulting plot is shown below:

rxLinePlot

We can fit the prediction standard errors on our large airline regression model if we first refit it with covCoef=TRUE:

arrDelayLmVC <- rxLinMod(ArrDelay ~ DayOfWeek + UniqueCarrier + Dest,
	data = trainingDataFile, covCoef=TRUE)

We can then obtain the prediction standard errors and a confidence interval as before:

rxPredict(arrDelayLmVC, data = targetInfile, outData = targetInfile,
	computeStdErrors=TRUE, interval = "confidence", overwrite=TRUE)

We can then look at the first few lines of targetInfile to see the first few predictions and standard errors:

rxGetInfo(targetInfile, numRows=10)

	  File name: C:\YourOutputPath\AirlineData08.xdf
	  Number of observations: 489792
	  Number of variables: 17
	  Number of blocks: 2
	  Compression type: zlib
	  Data (10 rows starting with row 1):
		 Year DayOfWeek UniqueCarrier Origin Dest CRSDepTime DepDelay TaxiOut TaxiIn
	  1  2008       Mon            9E    ATL  HOU   7.250000       -2      14      5
	  2  2008       Mon            9E    ATL  HOU   7.250000       -2      16      2
	  3  2008       Sat            9E    ATL  HOU   7.250000        0      18      5
	  4  2008       Wed            9E    ATL  SRQ  12.666667        6      20      4
	  5  2008       Sat            9E    HOU  ATL   9.083333       -9      15      9
	  6  2008       Mon            9E    ATL  IAH  16.416666       -3      25      4
	  7  2008      Thur            9E    IAH  ATL  18.500000        9      12      7
	  8  2008       Sat            9E    IAH  ATL  18.500000       -3      19      8
	  9  2008       Mon            9E    IAH  ATL  18.500000       -5      16      6
	  10 2008      Thur            9E    ATL  IAH  13.250000       -2      15      7
		 ArrDelay ArrDel15 CRSElapsedTime Distance ArrDelay_Pred ArrDelay_StdErr
	  1       -15    FALSE            140      696      8.981548       0.3287748
	  2         0    FALSE            140      696      8.981548       0.3287748
	  3         0    FALSE            140      696      5.377572       0.3293198
	  4         6    FALSE             86      445      7.377952       0.6380760
	  5       -23    FALSE            120      696      6.552242       0.2838803
	  6       -11    FALSE            145      689      7.530244       0.2952031
	  7        -8    FALSE            125      689     12.168922       0.2833356
	  8       -16    FALSE            125      689      6.552242       0.2838803
	  9       -24    FALSE            125      689     10.156218       0.2833283
	  10       21     TRUE            130      689      9.542948       0.2952058
		 ArrDelay_Lower ArrDelay_Upper
	  1        8.337161       9.625935
	  2        8.337161       9.625935
	  3        4.732117       6.023028
	  4        6.127346       8.628559
	  5        5.995847       7.108638
	  6        6.951657       8.108832
	  7       11.613594      12.724249
	  8        5.995847       7.108638
	  9        9.600905      10.711532
	  10       8.964355      10.121540

Stepwise Variable Selection

Stepwise linear regression is an algorithm that helps you determine which variables are most important to a regression model. You provide a minimal, or lower, model formula and a maximal, or upper, model formula, and using forward selection, backward elimination, or bidirectional search, the algorithm determines the model formula that provides the best fit based on an AIC selection criterion.

In SAS, stepwise linear regression is implemented through PROC REG. In open-source R, it is implemented through the function step. The problem with using the function step in R is that the size of the data set that can be analyzed is severely limited by the requirement that all computations must be done in memory.

RevoScaleR provides an implementation of stepwise linear regression that is not constrained by the use of "in-memory" algorithms. Stepwise linear regression in RevoScaleR is implemented by the functions rxLinMod and rxStepControl.

Stepwise linear regression begins with an initial model of some sort. Consider, for example, the airline training data set AirlineData06to07.xdf featured in Fitting Linear Models using RevoScaleR:

#  Stepwise Linear Regression

rxGetVarInfo(trainingDataFile)
	  Var 1: Year, Type: integer, Low/High: (1999, 2007)
	  Var 2: DayOfWeek
			 7 factor levels: Mon Tues Wed Thur Fri Sat Sun
	  Var 3: UniqueCarrier
			 30 factor levels: AA US AS CO DL ... OH F9 YV 9E VX
	  Var 4: Origin
			 373 factor levels: JFK LAX HNL OGG DFW ... IMT ISN AZA SHD LAR
	  Var 5: Dest
			 377 factor levels: LAX HNL JFK OGG DFW ... ESC IMT ISN AZA SHD
	  Var 6: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 23.9833)
	  Var 7: DepDelay, Type: integer, Low/High: (-1199, 1930)
	  Var 8: TaxiOut, Type: integer, Low/High: (0, 1439)
	  Var 9: TaxiIn, Type: integer, Low/High: (0, 1439)
	  Var 10: ArrDelay, Type: integer, Low/High: (-926, 1925)
	  Var 11: ArrDel15, Type: logical, Low/High: (0, 1)
	  Var 12: CRSElapsedTime, Type: integer, Low/High: (-34, 1295)
	  Var 13: Distance, Type: integer, Low/High: (11, 4962)

We are interested in fitting a model that will predict arrival delay (ArrDelay) as a function of some of the other variables. To keep things simple, we’ll start with our by now familiar model of arrival delay as a function of DayOfWeek and CRSDepTime:

initialModel <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
	data = trainingDataFile)
initialModel

	  Call:
	  rxLinMod(formula = ArrDelay ~ DayOfWeek + CRSDepTime, data = trainingDataFile)

	  Linear Regression Results for: ArrDelay ~ DayOfWeek + CRSDepTime
	  File name:
		  C:\MyWorkingDir\Documents\MRS\LMPrediction\AirlineData06to07.xdf
	  Dependent variable(s): ArrDelay
	  Total independent variables: 9 (Including number dropped: 1)
	  Number of valid observations: 3945964
	  Number of missing observations: 98378

	  Coefficients:
						ArrDelay
	  (Intercept)    -4.91416806
	  DayOfWeek=Mon   0.49172258
	  DayOfWeek=Tues -1.41878850
	  DayOfWeek=Wed   0.09677481
	  DayOfWeek=Thur  2.52841304
	  DayOfWeek=Fri   3.29474667
	  DayOfWeek=Sat  -2.86217838
	  DayOfWeek=Sun      Dropped
	  CRSDepTime      0.86703378

The question is, can we improve this model by adding more predictors? If so, which ones? To use stepwise selection in RevoScaleR, you add the variableSelection argument to your call to rxLinMod. The variableSelection argument is a list, most conveniently created by using the rxStepControl function. Using rxStepControl, you specify the method (the default, "stepwise", specifies a bidirectional search), the scope (lower and upper formulas for the search), and various control parameters. With our model, we first want to try some more numeric predictors, so we specify our model as follows:

airlineStepModel <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
	data = trainingDataFile,
	variableSelection = rxStepControl(method="stepwise",
		scope = ~ DayOfWeek + CRSDepTime + CRSElapsedTime +
			Distance + TaxiIn + TaxiOut ))
	  Call:
	  rxLinMod(formula = ArrDelay ~ DayOfWeek + CRSDepTime, data = trainingDataFile,
		  variableSelection = rxStepControl(method = "stepwise", scope = ~DayOfWeek +
			  CRSDepTime + CRSElapsedTime + Distance + TaxiIn + TaxiOut))

	  Linear Regression Results for: ArrDelay ~ DayOfWeek + CRSDepTime +
		  CRSElapsedTime + Distance + TaxiIn + TaxiOut
	  File name:
		  C:\MyWorkingDir\Documents\MRS\LMPrediction\AirlineData06to07.xdf
	  Dependent variable(s): ArrDelay
	  Total independent variables: 13 (Including number dropped: 1)
	  Number of valid observations: 3945964
	  Number of missing observations: 98378

	  Coefficients:
						ArrDelay
	  (Intercept)    -9.87474950
	  DayOfWeek=Mon   0.09830827
	  DayOfWeek=Tues -1.81998781
	  DayOfWeek=Wed  -0.62761606
	  DayOfWeek=Thur  1.53941124
	  DayOfWeek=Fri   2.45565510
	  DayOfWeek=Sat  -2.20111789
	  DayOfWeek=Sun      Dropped
	  CRSDepTime      0.75995355
	  CRSElapsedTime -0.20256102
	  Distance        0.02135381
	  TaxiIn          0.16194266
	  TaxiOut         0.99814931

Methods of Variable Selection

Three methods of variable selection are supported by rxLinMod:

  • "forward": starting from the minimal model, variables are added one at a time until no additional variable satisfies the selection criterion, or until the maximal model is reached.

  • "backward": starting from the maximal model, variables are removed one at a time until the removal of another variable won’t satisfy the selection criterion, or until the minimal model is reached.

  • "stepwise" (the default): a combination of forward and backward selection, in which variables are added to the minimal model, but at each step, the model is reanalyzed to see if any variables that have been added are candidates for deletion from the current model.

You specify the desired method by supplying a named component "method" in the list supplied for the variableSelection argument, or by specifying the method argument in a call to rxStepControl that is then passed as the variableSelection argument.

Variable Selection with Wide Data

We’ve found that generalized linear models do not converge if the number of predictors is greater than the number of observations. If your data has more variables, it won’t be possible to include all of them in the maximal model for stepwise selection. We recommend you use domain experience and insights from initial data explorations to choose a subset of the variables to serve as the maximal model before performing stepwise selection.

There are a few things you can do to reduce the number of predictors. If there are many variables that measure the same quantitative or qualitative entity, try to select one variable that represents the entity best. For example, include a variable that identifies a person’s political party affiliation instead of including many variables representing how the person feels about individual issues. If your goal with linear modeling is the interpretation of individual predictors, you want to ensure that correlation between variables in the model is minimal to avoid multicollinearity. This means checking for these correlations before modeling. Sometimes it is useful to combine correlated variables into a composite variable. Height and weight are often correlated, but can be transformed into BMI. Combining variables allows you to reduce the number of variables without losing any information. Ultimately, the variables you select will depend on how you plan to use the results of your linear model.

Specifying Model Scope

You use the scope argument in rxStepControl (or a named component "scope" in the list supplied for the variableSelection argument) to specify which variables should be considered for inclusion in the final model selection and which should be ignored. Whether you specify a separate value for scope also determines which models the algorithm tries next.

You can specify the scope as a simple formula (which will be treated as the upper bound, or maximal model), or as a named list with components "lower" and "upper" (either of which may be missing). For example, to analyze the iris data with a minimal model involving the single predictor Sepal.Width and a maximal model involving Sepal.Width, Petal.Length, and the interaction between Petal.Width and Species, we can specify our variable selection model as follows:

#	Specifying Model Scope

form <- Sepal.Length ~ Sepal.Width + Petal.Length
scope <- list(
	lower = ~ Sepal.Width,
	upper = ~ Sepal.Width + Petal.Length + Petal.Width * Species)

varsel <- rxStepControl(method = "stepwise", scope = scope)
rxlm.step <- rxLinMod(form, data = iris, variableSelection = varsel,
	verbose = 1, dropMain = FALSE, coefLabelStyle = "R")

In general, the models considered are determined from the scope argument as follows:

  • "lower" scope only: All models considered will include this lower model up to the base model specified in the formula argument to rxLinMod.

  • "upper" scope only: All models considered will include the terms in the base model (specified by the formula argument to rxLinMod), plus additional terms as specified in the upper scope.

  • Both "lower" and "upper" scope supplied: All models considered will include at least the terms in the lower scope and additional terms are added using terms from the upper scope until the stopping criterion is reached or the full upper scope model is selected.

  • No scope supplied: The minimal model includes just the intercept term and the maximal model include at most the terms specified in the base model.

(This convention is identical to that used by R’s step function.)

Specifying the Selection Criterion

By default, variable selection is determined using the Akaike Information Criterion, or AIC; this is the R standard. If you want a stepwise selection that is more SAS-like, you can specify stepCriterion="SigLevel". If this is set, rxLinMod uses either an F-test (default) or Chi-square test to determine whether to add or drop terms from the model. You can specify which test to perform using the test argument. For example, we can refit our airline model using the SigLevel step criterion with an F-test as follows:

#  
# Specifying selection criterion
#  
airlineStepModelSigLevel <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
	data = trainingDataFile, variableSelection =
		rxStepControl( method = "stepwise", scope = ~ DayOfWeek +
			CRSDepTime + CRSElapsedTime + Distance + TaxiIn + TaxiOut,
			stepCriterion = "SigLevel" ))

In this case (and with many other well-behaved models), the results of using the SigLevel step criterion are identical to those using AIC.

You can control the significance levels for adding and dropping models using the maxSigLevelToAdd and minSigLevelToDrop. The maxSigLevelToAdd specifies a significance level below which a variable can be considered for inclusion in the model; the minSigLevelToDrop specifies a significance level above which a variable currently included can be considered for dropping. By default, for method="stepwise", both the levels are set to 0.15. We can tighten the selection criterion to the 0.10 level as follows:

airlineStepModelSigLevel.10 <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
	data = trainingDataFile, variableSelection =
		rxStepControl( method = "stepwise", scope = ~ DayOfWeek +
			CRSDepTime + CRSElapsedTime +Distance + TaxiIn + TaxiOut,
			stepCriterion = "SigLevel",
			maxSigLevelToAdd=.10, minSigLevelToDrop=.10))

Plotting Model Coefficients

By default, the values of the parameters at each step of the stepwise selection are not preserved. Using an additional argument, keepStepCoefs, in your rxStepControl statement saves the values of the coefficients from each step of the regression. This coefficient data can then be plotted using another function, rxStepPlot.

Consider the stepwise linear regression on the iris data from Fitting Linear Models using RevoScaleR:

#  
# Plottings Model Coefficients at Each Step
#  
form <- Sepal.Length ~ Sepal.Width + Petal.Length
scope <- list(
	lower = ~ Sepal.Width,
	upper = ~ Sepal.Width + Petal.Length + Petal.Width * Species)

varsel <- rxStepControl(method = "stepwise", scope = scope, keepStepCoefs=TRUE)
rxlm.step <- rxLinMod(form, data = iris, variableSelection = varsel,
	verbose = 1, dropMain = FALSE, coefLabelStyle = "R")

Notice the addition of the argument keepStepCoefs = TRUE to the rxStepContol call. This produces an extra piece of output in the rxLinMod object, a dataframe containing the values of the coefficients at each step of the regression. This dataframe, stepCoefs, can be accessed as follows:

	rxlm.step$stepCoefs
	                               0         1          2          3
	  (Intercept)          2.2491402 0.9962913  1.1477685  0.7228228
	  Sepal.Width          0.5955247 0.4322172  0.4958889  0.5050955
	  Petal.Length         0.4719200 0.7756295  0.8292439  0.8702840
	  Petal.Width          0.0000000 0.0000000 -0.3151552 -0.2313888
	  Species1             0.0000000 1.3940979  1.0234978  1.2715542
	  Species2             0.0000000 0.4382856  0.2999359  1.0781752
	  Species3             0.0000000        NA         NA         NA
	  Petal.Width:Species1 0.0000000 0.0000000  0.0000000  0.2630978
	  Petal.Width:Species2 0.0000000 0.0000000  0.0000000 -0.5012827
	  Petal.Width:Species3 0.0000000 0.0000000  0.0000000         NA

Trying to glean patterns and information from a table can be difficult. So we’ve added another function, rxStepPlot, which allows the user to plot the parameter values at each step. Using the iris model object, we plot the coefficients:

rxStepPlot(rxlm.step)

rxStepPlot(rxlm.step)

From this plot, we can tell when a variable enters the model by noting the step when it becomes non-zero. Lines are labeled with the numbers on the right axis to indicate the parameter. The numbers correspond to the order they appear in the data frame stepCoef. You’ll notice that the 7th and 10th parameters don’t show up in this plot because the species3 parameter is the reference category for species.

The function rxStepPlot is easily customized by using additional graphical parameters from the matplot function from base R. In the following example, we update our original call to rxStepPlot to demonstrate how to specify the colors used for each line, adjust the line width, and add a proper title to the plot:

	colorSelection <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
	    "#0072B2", "#D55E00", "#CC79A7")
	rxStepPlot(rxlm.step, col = colorSelection, lwd = 2,
	    main = "Step Plot – Iris Coefficients")

rxStepPlot

By default, the rxStepPlot function uses seven line colors. If the number of parameters exceeds the number of colors, they are reused in the same order. However, the line types are set to vary from 1 to 5, so lines that have the same color may differ in line type. The line types can also be specified using the lty argument in the rxStepPlot call.

Fixed-Effects Models

Fixed-effects models are commonly associated with studies in which multiple observations are recorded for each test subject, for example, yearly observations of median housing price by city, or measurements of tensile strength from samples of steel rods by batch. To fit such a model with rxLinMod, include a factor variable specifying the subject (the cities, or the batch identifier) as the first predictor, and specify cube=TRUE to use a partitioned inverse and omit the intercept term.

For example, the MASS library contains the data set petrol, which consists of measurements of the yield of a certain refining process with possible predictors including specific gravity, vapor pressure, ASTM 10% point, and volatility measured as the ASTM endpoint for 10 samples of crude oil. The following example (reproduced from Modern Applied Statistics with S), first scales the numeric predictors, then fits the fixed-effects model:

#  Fixed-Effects Models

library(MASS)
Petrol <- petrol
Petrol[,2:5] <- scale(Petrol[,2:5], scale=F)
rxLinMod(Y ~ No + EP, data=Petrol,cube=TRUE)

	  Call:
	  rxLinMod(formula = Y ~ No + EP, data = Petrol, cube = TRUE)

	  Cube Linear Regression Results for: Y ~ No + EP
	  Data: Petrol
	  Dependent variable(s): Y
	  Total independent variables: 11
	  Number of valid observations: 32
	  Number of missing observations: 0

	  Coefficients:
					Y
	  No=A 32.5493917
	  No=B 24.2746407
	  No=C 27.7820456
	  No=D 21.1541642
	  No=E 21.5191269
	  No=F 20.4355218
	  No=G 15.0359067
	  No=H 13.0630467
	  No=I  9.8053871
	  No=J  4.4360767
	  EP    0.1587296

Least Squares Dummy Variable (LSDV) Models

RevoScaleR is capable of estimating huge models where fixed effects are estimated by dummy variables, that is, binary variables set to 1 or TRUE if the observation is in a particular category. Creation of these dummy variables is often accomplished by interacting two or more factor variables using “:” in the formula. If the first term in an rxLinMod (or rxLogit) model is purely categorical and the “cube” argument is set to TRUE, the estimation uses a partitioned inverse to save on computation time and memory.

A Quick Review of Interacting Factors

First, let’s do a quick, cautionary review of interacting factor variables by experimenting with a small made-up data set.

#  Least Squares Dummy Variable (LSDV) Models
#   A Quick Review of Interacting Factors

set.seed(50)
income <- rep(c(1000,1500,2500,4000), each=5) + 100*rnorm(20)
region <- rep(c("Rural","Urban"), each=10)
sex <- rep(c("Female", "Male"), each=5)
sex <- c(sex,sex)
myData <- data.frame(income, region, sex)

The data set has 20 observations with three variables: a numeric variable income and two factor variables representing region and sex. There are three easy ways to compute the within group means of every combination of age and region using RevoScaleR. First, we can use rxSummary:

rxSummary(income~region:sex, data=myData)

which includes in its output:

	 Category                            region sex    Means     StdDev   Min      
	 income for region=Rural, sex=Female Rural  Female  970.7522 98.01983  827.2396
	 income for region=Urban, sex=Female Urban  Female 2501.8303 47.10861 2450.1364
	 income for region=Rural, sex=Male   Rural  Male   1480.4378 92.29320 1355.4250
	 income for region=Urban, sex=Male   Urban  Male   3944.5127 40.82421 3883.3983

Second, we could use rxCube:

rxCube(income~region:sex, data=myData)

which includes in its output:

	 region    sex    income Counts
	1 Rural Female  970.7522      5
	2 Urban Female 2501.8303      5
	3 Rural   Male 1480.4378      5
	4 Urban   Male 3944.5127      5

Or, we can use rxLinMod with cube=TRUE. The intercept is automatically omitted, and four dummy variables are created from the two factors: one for each combination of region and sex. The coefficients are simply the within group means:

summary(rxLinMod(income~region:sex, cube=TRUE, data=myData))

which includes in its output:

	Coefficients:
	                         Estimate Std. Error t value Pr(>|t|)     | Counts
	region=Rural, sex=Female   970.75      33.18   29.26 2.66e-15 *** |      5
	region=Urban, sex=Female  2501.83      33.18   75.41 2.22e-16 *** |      5
	region=Rural, sex=Male    1480.44      33.18   44.62 2.22e-16 *** |      5
	region=Urban, sex=Male    3944.51      33.18  118.90 2.22e-16 *** |      5

The same model could be estimated using: lm(income~region:sex + 0, data=myData). Below we refer to these within group means as MeanIncRuralFemale, MeanIncUrbanFemale, MeanIncRuralMale, and MeanIncUrbanMale.

If we add an intercept, we encounter perfect multicollinearity and one of the coefficients will be dropped. The intercept is then the mean of a reference group and the other coefficients represent the differences or contrasts between the within group means and the reference group. For example,

lm(income~region:sex, data=myData)

produces:

	Coefficients:
	        (Intercept)  region:Rural:sexFemale  regionUrban:sexFemale  
	               3945                -2974                -1443  
	  regionRural:sexMale    regionUrban:sexMale  
	              -2464                   NA  

We can see that the dummy variable for urban males was dropped; urban males are the reference group and the Intercept is equal to MeanIncUrbanMale. The other coefficients represent contrasts from the reference group, so for example, regionRural:sexFemale is equal to MeanIncRuralFemale – MeanIncUrbanMale.

Another variation is to use “*”in the formula for factor crossing. Using a*b is equivalent to using a + b + a:b. For example:

lm(income~region*sex, data = myData)

which results in:

	Coefficients:
	      (Intercept)          regionUrban            sexMale  regionUrban:sexMale  
	            970.8             1531.1              509.7              933.0

The dropping of coefficients in rxLinMod can be controlled to obtain the same results:

rxLinMod(income~region*sex, data = myData, dropFirst = TRUE, dropMain = FALSE)

Coefficients using this model can be more difficult to interpret, and in fact are highly dependent on the order of the factor levels. In this case, we see the following relationship between the estimated coefficients and the within group means

(Intercept) MeanIncRuralFemale
regionUrban MeanIncUrbanFemale – MeanIncRuralFemale
sexMale MeanIncRuralMale – MeanIncRuralFemale
regionUrban:sexMale MeanIncUrbanMale – MeanIncUrbanFemale - MeanIncRuralMale + MeanIncRuralFemale

If we set up our data slightly differently, we get different results. Let’s use the same income data but a different naming convention for the factors:

region <- rep(c("Rural","Urban"), each=10)
sex <- rep(c("Woman", "Man"), each=5)
sex <- c(sex,sex)
myData1 <- data.frame(income, region, sex)

Using the same model, lm(income~region*sex, data=myData1), results in:

	Coefficients:
	           (Intercept)           regionUrban                sexWoman  
	                1480.4                  2464.1                  -509.7  
	regionUrban:sexWoman  
	                -933.0  

With this superficial modification to the data, the coefficient for the dummy variable for regionUrban has jumped from $1531 to $2464. This is due to the change in reference group; in this model the Urban coefficient represents the difference between mean urban and rural male income, while in the previous model it was the difference between mean urban and rural female income. That is, the relationship between the within group means and the new coefficients are:

(Intercept) MeanIncRuralMale
regionUrban MeanIncUrbanMale – MeanIncRuralMale
sexWoman MeanIncRuralFemale – MeanIncRuralMale
regionUrban:sexMale MeanIncUrbanFemale – MeanIncUrbanMale - MeanIncRuralFemale + MeanIncRuralMale

Omitting the intercept provides yet another combination of results. For example, using rxLinMod with the original factor labeling:

rxLinMod(income~region*sex, data=myData, cube=TRUE)

results in:

	Call:
	rxLinMod(formula = income ~ region * sex, data = myData, cube = TRUE)

	Cube Linear Regression Results for: income ~ region * sex
	Data: myData
	Dependent variable(s): income
	Total independent variables: 8 (Including number dropped: 4)
	Number of valid observations: 20
	Number of missing observations: 0

	Coefficients:
	                           income
	region=Rural            1480.4378
	region=Urban            3944.5127
	sex=Female             -1442.6824
	sex=Male                  Dropped
	region=Rural, sex=Female 932.9968
	region=Urban, sex=Female  Dropped
	region=Rural, sex=Male    Dropped
	region=Urban, sex=Male    Dropped
region=Rural MeanIncRuralMale
region=Urban MeanIncUrbanMale
Sex=Female MeanIncUrbanFemale – MeanIncUrbanMale
region=Rural, sex=Female (MeanIncRuralFemale – MeanIncRuralMale) - (MeanIncUrbanFemale – MeanIncUrbanMale)

With large data sets it is common to estimate many interaction terms, and if some categories have zero counts, it may not even be obvious what the reference group is. Also note that setting cube=TRUE in the preceding model is of limited use: only the first term from the expanded expression (in this case region) is estimated using a partitioned inverse.

Using Dummy Variables in rxLinMod: Letting the Data Speak Example 2

In previous articles, we looked at the CensusWorkers.xdf data set and examined the relationship between wage income and age. Now let’s add another variable, and examine the relationship between wage income and sex and age.

We can start with a simple dummy variable model, computing the mean wage income by sex:

#   Using Dummy Variables in rxLinMod

censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
rxLinMod(incwage~sex, data=censusWorkers, pweights="perwt", cube=TRUE)

which computes:

	Call:
	rxLinMod(formula = incwage ~ sex, data = censusWorkers, pweights = "perwt",
		cube = TRUE)

	Cube Linear Regression Results for: incwage ~ sex
	File name:
		C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
	Probability weights: perwt
	Dependent variable(s): incwage
	Total independent variables: 2
	Number of valid observations: 351121
	Number of missing observations: 0

	Coefficients:
				incwage
	sex=Male   43472.71
	sex=Female 26721.09

Similarly, we could look at a simple linear relationship between wage income and age:

linMod1 <- rxLinMod(incwage~age, data=censusWorkers, pweights="perwt")
summary(linMod1)

resulting in:

	Call:
	rxLinMod(formula = incwage ~ age, data = censusWorkers, pweights = "perwt")

	Linear Regression Results for: incwage ~ age
	File name:
		C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
	Probability weights: perwt
	Dependent variable(s): incwage
	Total independent variables: 2
	Number of valid observations: 351121
	Number of missing observations: 0

	Coefficients:
				 Estimate Std. Error t value Pr(>|t|)    
	(Intercept) 12802.980    247.963   51.63 2.22e-16 ***
	age           572.980      5.947   96.35 2.22e-16 ***
	---
	Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

	Residual standard error: 180800 on 351119 degrees of freedom
	Multiple R-squared: 0.02576
	Adjusted R-squared: 0.02576
	F-statistic:  9284 on 1 and 351119 DF,  p-value: < 2.2e-16
	Condition number: 1

Computing the two end points on the regression line, we can plot it:

age <- c(20,65)
coefLinMod1 <- coef(linMod1)
incwage_Pred <- coefLinMod1[1] + age*coefLinMod1[2]
plotData1 <- data.frame(age, incwage_Pred)
rxLinePlot(incwage_Pred~age, data=plotData1)

rxLinePlot(incwage_Pred~age, data=plotData1)

The next typical step is to combine the two approaches by estimating separate intercepts for males and females:

linMod2 <- rxLinMod(incwage~sex+age, data = censusWorkers, pweights = "perwt",
cube=TRUE)
summary(linMod2)

which results in:

	Call:
	rxLinMod(formula = incwage ~ sex + age, data = censusWorkers,
		pweights = "perwt", cube = TRUE)

	Cube Linear Regression Results for: incwage ~ sex + age
	File name:
	C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
	Probability weights: perwt
	Dependent variable(s): incwage
	Total independent variables: 3
	Number of valid observations: 351121
	Number of missing observations: 0

	Coefficients:
				Estimate Std. Error t value Pr(>|t|)     |  Counts
	sex=Male   20479.178    250.033   81.91 2.22e-16 *** | 3866542
	sex=Female  3723.067    252.968   14.72 2.22e-16 *** | 3276744
	age          573.232      5.816   98.56 2.22e-16 *** |        
	---
	Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

	Residual standard error: 176800 on 351118 degrees of freedom
	Multiple R-squared: 0.06804 (as if intercept included)
	Adjusted R-squared: 0.06804
	F-statistic: 1.282e+04 on 2 and 351118 DF,  p-value: < 2.2e-16
	Condition number: 1

We create a small sample data set with the same variables we use in censusWorkers, but with only four observations representing the high and low value of age for both sexes. Using the rxPredict function, the predicted values for each one of these sample observations is computed, which we then plot:

age <- c(20,65,20,65)
sex <- factor(rep(c(1, 2), each=2), labels=c("Male", "Female"))
perwt <- rep(1, times=4)
incwage <- rep(0, times=4)
plotData2 <- data.frame(age, sex, perwt, incwage)
plotData2p <- rxPredict(linMod2, data=plotData2, outData=plotData2)
rxLinePlot(incwage_Pred~age, groups=sex, data=plotData2p)

rxLinePlot(incwage_Pred~age, groups=sex, data=plotData2p)

These types of models are often relaxed further by allowing both the slope and intercept to vary by group:

linMod3 <- rxLinMod(incwage~sex+sex:age, data = censusWorkers,
	pweights = "perwt", cube=TRUE)
summary(linMod3)
	Call:
	rxLinMod(formula = incwage ~ sex + sex:age, data = censusWorkers,
		pweights = "perwt", cube = TRUE)

	Cube Linear Regression Results for: incwage ~ sex + sex:age
	File name:
		C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
	Probability weights: perwt
	Dependent variable(s): incwage
	Total independent variables: 4
	Number of valid observations: 351121
	Number of missing observations: 0

	Coefficients:
						Estimate Std. Error t value Pr(>|t|)     |  Counts
	sex=Male           10783.449    328.871   32.79 2.22e-16 *** | 3866542
	sex=Female         15131.422    356.806   42.41 2.22e-16 *** | 3276744
	age for sex=Male     814.948      7.888  103.31 2.22e-16 *** |        
	age for sex=Female   288.876      8.556   33.76 2.22e-16 *** |        
	---
	Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

	Residual standard error: 176300 on 351117 degrees of freedom
	Multiple R-squared: 0.07343 (as if intercept included)
	Adjusted R-squared: 0.07343
	F-statistic:  9276 on 3 and 351117 DF,  p-value: < 2.2e-16
	Condition number: 1.1764

Again getting predictions and plotting:

plotData3p <- rxPredict(linMod3, data=plotData2, outData=plotData2)
rxLinePlot(incwage_Pred~age, groups=sex, data=plotData3p)

rxLinePlot(incwage_Pred~age, groups=sex, data=plotData3p)

We could continue the process, experimenting with functional forms for age. But, since we have many observations (and therefore many degrees of freedom), we can take advantage of the F() function available in revoScaleR to let the data speak for itself. The F() function creates a factor variable from a numeric variable “on-the-fly”, creating a level for every integer value. This allows us to compute and observe the shape of the functional form using a purely dummy variable model:

linMod4 <- rxLinMod(incwage~sex:F(age), data=censusWorkers, pweights="perwt",
cube=TRUE)

This model estimated a total of 92 coefficients, all for dummy variables representing every age in the range of 20 to 65 for each sex. To visually examine the coefficients we could add observations to our plotData data frame and use rxPredict to compute predicted values for each of the 92 groups represented, but since the model contains only dummy variables in an initial cube term, we can instead use the “counts” data frame returned with the rxLinMod object:

plotData4 <- linMod4$countDF
# Convert the age factor variable back to an integer
plotData4$age <- as.integer(levels(plotData4$F.age.))[plotData4$F.age.]
rxLinePlot(incwage~age, groups=sex, data=plotData4)

rxLinePlot(incwage~age, groups=sex, data=plotData4)

Intercept-Only Models

You may have seen intercept-only models fitted with R’s lm function, where the model formula is of the form response ~ 1. In RevoScaleR these models should be fitted using rxSummary, because the intercept-only model simply returns the mean of the response. For example:

airlineDF <- rxDataStep(inData =
	file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf"))
lm(ArrDelay ~ 1, data = airlineDF)
	Call:
	lm(formula = ArrDelay ~ 1, data = airlineDF)

	  Coefficients:
	  (Intercept)  
	        11.32  

	rxSummary(~ ArrDelay, data = airlineDF)
	  Call:
	  rxSummary(formula = ~ArrDelay, data = airlineDF)

	  Summary Statistics Results for: ~ArrDelay
	  Data: airlineDF
	  Number of valid observations: 6e+05
	  Number of missing observations: 0

	     Name     Mean     StdDev   Min Max  ValidObs MissingObs
	     ArrDelay 11.31794 40.68854 -86 1490 582628   17372