Compartilhar via


rxKmeans: K-Means Clustering

Description

Perform k-means clustering on small or large data.

Usage

  rxKmeans(formula, data, 
           outFile = NULL, outColName = ".rxCluster", 
           writeModelVars = FALSE, extraVarsToWrite = NULL,
           overwrite = FALSE, numClusters = NULL, centers = NULL, 
           algorithm = "Lloyd", numStartRows = 0, maxIterations = 1000, 
           numStarts = 1, rowSelection = NULL, 
           transforms = NULL, transformObjects = NULL,
           transformFunc = NULL, transformVars = NULL, 
           transformPackages = NULL, transformEnvir = NULL,
           blocksPerRead = rxGetOption("blocksPerRead"),
           reportProgress = rxGetOption("reportProgress"), verbose = 0,
           computeContext = rxGetOption("computeContext"),
           xdfCompressionLevel = rxGetOption("xdfCompressionLevel"), ...)

 ## S3 method for class `rxKmeans':
print  (x, header = TRUE, ...)

Arguments

formula

formula as described in rxFormula.

data

a data source object, a character string specifying a .xdf file, or a data frame object.

outFile

either an RxXdfData data source object or a character string specifying the .xdf file for storing the resulting cluster indexes. If NULL, then no cluster indexes are stored to disk. Note that in the case that the input data is a data frame, the cluster indexes are returned automatically. Note also that, if rowSelection is specified and not NULL, then outFile cannot be the same as the data since the resulting set of cluster indexes will generally not have the same number of rows as the original data source.

outColName

character string to be used as a column name for the resulting cluster indexes if outFile is not NULL. Note that make.names is used on outColName to ensure that the column name is valid. If the outFile is an RxOdbcData source, dots are first converted to underscores. Thus, the default outColName becomes "X_rxCluster".

writeModelVars

logical value. If TRUE, and the output file is different from the input file, variables in the model will be written to the output file in addition to the cluster numbers. If variables from the input data set are transformed in the model, the transformed variables will also be written out.

extraVarsToWrite

NULL or character vector of additional variables names from the input data to include in the outFile. If writeModelVars is TRUE, model variables will be included as well.

overwrite

logical value. If TRUE, an existing outFile with an existing column named outColName will be overwritten.

numClusters

number of clusters k to create. If NULL, then the centersargument must be specified.

centers

a k x p numeric matrix containing a set of initial (distinct) cluster centers. If NULL, then the numClusters argument must be specified.

algorithm

character string defining algorithm to use in defining the clusters. Currently supported algorithms are "Lloyd". This argument is case insensitive.

numStartRows

integer specifying the size of the sample used to choose initial centers. If 0, (the default), the size is chosen as the minimum of the number of observations or 10 times the number of clusters.

maxIterations

maximum number of iterations allowed.

numStarts

if centers is NULL, k rows are randomly selected from the data source for use as initial starting points. The numStarts argument defines the number of these random sets that are to be chosen and evaluated, and the best result is returned. If numStarts is 0, the first k rows in the data set are used. Random selection of rows is only supported for .xdf data sources using the native file system and data frames. If the .xdf file is compressed, the random sample is taken from a maximum of the first 5000 rows of data.

rowSelection

name of a logical variable in the data set (in quotes) or a logical expression using variables in the data set to specify row selection. For example, rowSelection = "old" will use only observations in which the value of the variable old is TRUE. rowSelection = (age > 20) & (age < 65) & (log(income) > 10) will use only observations in which the value of the age variable is between 20 and 65 and the value of the log of the income variable is greater than 10. The row selection is performed after processing any data transformations (see the arguments transforms or transformFunc). As with all expressions, rowSelection can be defined outside of the function call using the expression function.

transforms

an expression of the form list(name = expression, ...) representing the first round of variable transformations. As with all expressions, transforms (or rowSelection) can be defined outside of the function call using the expression function.

transformObjects

a named list containing objects that can be referenced by transforms, transformsFunc, and rowSelection.

transformFunc

variable transformation function. See rxTransform for details.

transformVars

character vector of input data set variables needed for the transformation function. See rxTransform for details.

transformPackages

character vector defining additional R packages (outside of those specified in rxGetOption("transformPackages")) to be made available and preloaded for use in variable transformation functions, e.g., those explicitly defined in RevoScaleR functions via their transforms and transformFunc arguments or those defined implicitly via their formula or rowSelection arguments. The transformPackages argument may also be NULL, indicating that no packages outside rxGetOption("transformPackages") will be preloaded.

transformEnvir

user-defined environment to serve as a parent to all environments developed internally and used for variable data transformation. If transformEnvir = NULL, a new "hash" environment with parent baseenv() is used instead.

blocksPerRead

number of blocks to read for each chunk of data read from the data source.

reportProgress

integer value with options:

  • 0: no progress is reported.
  • 1: the number of processed rows is printed and updated.
  • 2: rows processed and timings are reported.
  • 3: rows processed and all timings are reported.

verbose

integer value. If 0, no additional output is printed. If 1, additional summary information is printed.

computeContext

a valid RxComputeContext. The RxSpark and RxHadoopMR compute contexts distribute the computation among the nodes specified by the compute context; for other compute contexts, the computation is distributed if possible on the local computer.

xdfCompressionLevel

integer in the range of -1 to 9 indicating the compression level for the output data if written to an .xdf file. The higher the value, the greater the amount of compression - resulting in smaller files but a longer time to create them. If xdfCompressionLevel is set to 0, there will be no compression and files will be compatible with the 6.0 release of Revolution R Enterprise. If set to -1, a default level of compression will be used.

...

additional arguments to be passed directly to the Revolution Compute Engine.

x

object of class rxKmeans.

logical value. If TRUE, header information is printed.

Details

Performs scalable k-means clustering using the classical Lloyd algorithm.

For reproducibility when using random starting values, you can pass a random seed by specifying seed=value as part of your call. See the Examples.

Value

An object of class "rxKmeans" which is a list with components:

cluster

A vector of integers indicating the cluster to which each point is allocated. This information is always returned if the data source is a data frame. If the data source is not a data frame and outFile is specified. i.e., not NULL, the cluster indexes are written/appended to the specified file with a column name as defined by outColName.

centers

matrix of cluster centers.

withinss

within-cluster sum of squares (relative to the center) for each cluster.

totss

total within-cluster sum of squares.

tot.withinss

sum of the withinss vector.

betweenss

between-cluster sum of squares.

size

number of points in each cluster.

valid.obs

number of valid observations.

missing.obs

number of missing observations.

numIterations

number iterations performed.

params

parameters sent to Microsoft R Services Compute Engine.

formula

formula as described in rxFormula.

call

the matched call.

Author(s)

Microsoft Corporation Microsoft Technical Support

References

Lloyd, S. P. (1957, 1982) Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128-137.

See Also

kmeans.

Examples


 # Create data
 N <- 1000 
 sd1 <- 0.3
 mean1 <- 0 
 sd2 <- 0.5
 mean2 <- 1 
 set.seed(10)
 data <- rbind(matrix(rnorm(N, sd = sd1, mean = mean1), ncol = 2),  
               matrix(rnorm(N, mean = mean2, sd = sd2), ncol = 2)) 
 colnames(data) <- c("x", "y") 
 DF <- data.frame(data)
 XDF <- paste(tempfile(), "xdf", sep=".")
 if (file.exists(XDF)) file.remove(XDF)
 rxDataStep(inData = DF, outFile = XDF)

 centers <- DF[sample.int(NROW(DF), 2, replace = TRUE),] # grab 2 random rows for starting

 # Example using an XDF file as a data source
 rxKmeans(~ x + y, data = XDF, centers = centers)

 # Example using a local data frame file as a data source
 z <- rxKmeans(~ x + y, data = DF, centers = centers)

 # Show a plot of the results
 # By design, the data in 2-space populates two groups of points centered about 
 # points (0,0) and (1,1). The spread about the mean is based on a random set of 
 # points drawn from a Gaussian distribution with standard deviations 0.3 and 0.5. 
 # As a visual, the resulting plot shows two circles drawn at the centers with radii 
 # equal to the corresponding standard deviations.
 plot(DF, col = z$cluster, asp = 1, 
      main = paste("Lloyd k-means Clustering: ", z$numIterations, "iterations"))
 symbols(mean1, mean1, circle = sd1, inches = FALSE, add = TRUE, fg = "black", lwd = 2) 
 symbols(mean2, mean2, circle = sd2, inches = FALSE, add = TRUE, fg = "red", lwd = 2)
 points(z$centers, col = 2:1, bg = 1:2, pch = 21, cex = 2) # big filled dots for centers

 # Example using randomly selected rows from data source as initial centers
 # but with seed set for reproducibility
 ## Not run:

z <- rxKmeans(~ x + y, data = DF, numClusters = 2, seed=18)
## End(Not run) 


 # Example using first rows from Spss data source as initial centers
 spssFile <- file.path(rxGetOption("sampleDataDir"),"claims.sav")                
 spssDS <- RxSpssData(spssFile, colClasses = c(cost = "integer"))                        
 resultSpss <- rxKmeans(~cost, data = spssDS, numClusters = 2, numStarts = 0)