HTCondor
HTCondor can easily be enabled on a CycleCloud cluster by modifying the "run_list" in the configuration section of your cluster definition. There are three basic components of an HTCondor cluster. The first is the "central manager" which provides the scheduling and management daemons. The second component of an HTCondor cluster is one or more schedulers from which jobs are submitted into the system. The final component is one or more execute nodes which are the hosts perform the computation. A simple HTCondor template may look like:
[cluster htcondor]
[[node manager]]
ImageName = cycle.image.centos7
MachineType = Standard_A4 # 8 cores
[[[configuration]]]
run_list = role[central_manager]
[[node scheduler]]
ImageName = cycle.image.centos7
MachineType = Standard_A4 # 8 cores
[[[configuration]]]
run_list = role[condor_scheduler_role],role[filer_role],role[scheduler]
[[nodearray execute]]
ImageName = cycle.image.centos7
MachineType = Standard_A1 # 1 core
Count = 1
[[[configuration]]]
run_list = role[usc_execute]
Importing and starting a cluster with definition in CycleCloud will yield a "manager" and a "scheduler" node, as well as one "execute" node. Execute nodes can be added to the cluster via the cyclecloud add_node
command. To add 10 more execute nodes:
cyclecloud add_node htcondor -t execute -c 10
HTCondor Autoscaling
CycleCloud supports autoscaling for HTCondor, which means that the software will monitor the status of your queue and turn on and off nodes as needed to complete the work in an optimal amount of time/cost. You can enable autoscaling for HTCondor by adding Autoscale=true
to your cluster definition:
[cluster htcondor]
Autoscale = True
HTCondor Advanced Usage
If you know the average runtime of jobs, you can define average_runtime
(in minutes) in your job. CycleCloud will use that to start the minimum number of nodes (for example, five 10-minute jobs will only start a single node instead of five when average_runtime
is set to 10).
Autoscale Nodearray
By default, HTCondor will request cores from the nodearray called 'execute'. If a job requires a different nodearray (for example if certain jobs within a workflow have a high memory requirement), you can specify a slot_type
attribute for the job. For example, adding +slot_type = "highmemory"
will cause HTCondor to request a node from the "highmemory" nodearray instead of "execute" (note that this currently requires htcondor.slot_type = "highmemory"
to be set in the nodearray's [[[configuration]]]
section). This will not affect how HTCondor schedules the jobs, so you may want to include the slot_type
startd attribute in the job's requirements
or rank
expressions. For example: Requirements = target.slot_type = "highmemory"
.
Submitting Jobs to HTCondor
The most generic way to submit jobs to an HTCondor scheduler is the command (run from a scheduler node):
condor_submit my_job.submit
A sample submit file might look like this:
Universe = vanilla
Executable = do_science
Arguments = -v --win-prize=true
Output = log/$(Cluster).$(Process).out
Error = log/$(Cluster).$(Process).err
Should_transfer_files = if_needed
When_to_transfer_output = On_exit
+average_runtime = 10
+slot_type = "highmemory"
Queue
HTCondor Configuration Reference
The following are the HTCondor-specific configuration options you can set to customize functionality:
HTCondor-Specific Configuration Options | Description |
---|---|
htcondor.agent_enabled | If true, use the condor_agent for job submission and polling. Default: false |
htcondor.agent_version | The version of the condor_agent to use. Default: 1.27 |
htcondor.classad_lifetime | The default lifetime of classads (in seconds). Default: 700 |
htcondor.condor_owner | The Linux account that owns the HTCondor scaledown scripts. Default: root |
htcondor.condor_group | The Linux group that owns the HTCondor scaledown scripts. Default: root |
htcondor.data_dir | The directory for logs, spool directories, execute directories, and local config file. Default: /mnt/condor_data (Linux), C:\All Services\condor_local (Windows) |
htcondor.ignore_hyperthreads | (Windows only) Set the number of CPUs to be half of the detected CPUs as a way to "disable" hyperthreading. If using autoscale, specify the non-hyperthread core count with the Cores configuration setting in the [[node]] or [[nodearray]] section. Default: false |
htcondor.install_dir | The directory that HTCondor is installed to. Default: /opt/condor (Linux), C:\condor (Windows) |
htcondor.job_start_count | The number of jobs a schedd will start per cycle. 0 is unlimited. Default: 20 |
htcondor.job_start_delay | The number of seconds between each job start interval. 0 is immediate. Default: 1 |
htcondor.max_history_log | The maximum size of the job history file in bytes. Default: 20971520 |
htcondor.max_history_rotations | The maximum number of job history files to keep. Default: 20 |
htcondor.negotiator_cycle_delay | The minimum number of seconds before a new negotiator cycle may start. Default: 20 |
htcondor.negotiator_interval | How often (in seconds) the condor_negotiator starts a negotiation cycle. Default: 60 |
htcondor.negotiator_inform_startd | If true, the negotiator informs the startd when it is matched to a job. Default: true |
htcondor.remove_stopped_nodes | If true, stopped execute nodes are removed from the CycleServer view instead of being marked as "down". |
htcondor.running | If true, HTCondor collector and negotiator daemons run on the central manager. Otherwise, only the condor_master runs. Default: true |
htcondor.scheduler_dual | If true, schedulers run two schedds. Default: true |
htcondor.single_slot | If true, treats the machine as a single slot (regardless of the number of cores the machine possesses). Default: false |
htcondor.slot_type | Defines the slot_type of a node array for autoscaling. Default: execute |
htcondor.update_interval | The interval (in seconds) for the startd to publish an update to the collector. Default: 240 |
htcondor.use_cache_config | If true, use cache_config to have the instance poll CycleServer for configuration. Default: false |
htcondor.version | The version of HTCondor to install. Default: 8.2.6 |
HTCondor Auto-Generated Configuration File
HTCondor has large number of configuration settings, including user-defined attributes. CycleCloud offers the ability to create a custom configuration file using attributes defined in the cluster:
Attribute | Description |
---|---|
htcondor.custom_config.enabled | If true, a configuration file is generated using the specified attributes. Default: false |
htcondor.custom_config.file_name | The name of the file (placed in htcondor.data_dir /config) to write. Default: ZZZ-custom_config.txt |
htcondor.custom_config.settings | The attributes to write to the custom config file such as htcondor.custom_config.settings.max_jobs_running = 5000 |
Note
HTCondor configuration attributes containing a . cannot be specified using this method. If such attributes are needed, they should be specified in a cookbook or a file installed with cluster-init.
CycleCloud supports a standard set of autostop attributes across schedulers:
Attribute | Description |
---|---|
cyclecloud.cluster.autoscale.stop_enabled | Is autostop enabled on this node? [true/false] |
cyclecloud.cluster.autoscale.idle_time_after_jobs | The amount of time (in seconds) for a node to sit idle after completing jobs before it is scaled down. |
cyclecloud.cluster.autoscale.idle_time_before_jobs | The amount of time (in seconds) for a node to sit idle before completing jobs before it is scaled down. |