Data Smoothing Algorithms and Analytics
Analytics is a framework for running user-defined data smoothing algorithms on any source data delivered to it. Analytics provides different plug-ins that can analyze the source data based on the user's request. A user requests the analysis of source data through workflows. Below is a sample workflow file:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>*aggregator_name*</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>*node_name*</hostname>
<data_group>*sensor_name*</data_group>
<core>*core_id*</core>
</step>
<step name = "aggregate">
<compute>average</compute>
</step>
</workflow>
</workflows>
Here, node_name, sensor_name, and core_id may each be a comma-separated list. In the above example, if the user specifies hostname as node1, data_group as coretemp, and core as core 1, core 2, then the aggregate plugin will average the coretemp values of core 1 and core 2.
The diagram below provides an overview of the Analytics framework in a cluster environment; the different analytics plug-ins are described later in this page.
The Analytics framework runs only on the aggregator node. The actions to be performed by the Analytics framework are submitted through the OCTL tool, which an administrator can run from any management node on the network. The implementation details of the OCTL tool are provided in the OCTL section of this document, and those of the Analytics framework in the Analytics Framework section.
OCTL
NOTE: Please refer to the OCTL tool wiki page for usage.
Analytics Framework
There are currently seven plug-ins supported in the Analytics framework: Filter, Aggregate, Threshold, Window, Cott, Spatial, and Genex. In addition to these plug-ins, there is an optional DB Storage attribute that allows a plug-in to log its data to the database. There are also a couple of MCA parameters that let the user decide whether raw data and event data are logged to the database. Short descriptions of all the plug-ins, the attribute, and the MCA parameters are given below.
Filter
The role of the Filter plug-in is to compare the user-requested information with the actual source data sent by the aggregator, ensuring that only the requested data is analyzed rather than any unnecessary data. After validating the data, this plug-in passes it on to the aggregate or any other downstream plug-in. The first step in a workflow is always the Filter step, which is mandatory.
Workflow Example:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>c[2:1-8]</hostname>
<data_group>coretemp</data_group>
<core>core0</core>
</step>
</workflow>
</workflows>
In the above example, the aggregator tag indicates which aggregator(s) this workflow will be submitted to. The filter workflow step has three attributes, namely hostname, data_group, and core. These attributes match the keys of the source data (in this case, the sensor). This example filters core 0's coretemp data for the nodes c01, c02, ..., c08.
Aggregate
The aggregate plugin performs running-average, minimum, and maximum aggregation operations. The running average is the average of all the data values seen up to the current one. The plugin operates on the data received from the previous workflow step. The running average is updated as follows:
A(N+1) = (N * A(N) + new_sample) / (N + 1)
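For instance, with samples 48, 50, and 52, the running average evolves as 48, 49, and 50. A minimal Python sketch of this update rule (illustrative only, not the plugin's actual code):

# Minimal sketch of the running-average update:
# A(N+1) = (N * A(N) + new_sample) / (N + 1)
def update_running_average(avg, n, new_sample):
    """Average of n+1 samples, given the average avg of the first n."""
    return (n * avg + new_sample) / (n + 1)

avg = 0.0
for n, sample in enumerate([48.0, 50.0, 52.0]):  # e.g. coretemp readings
    avg = update_running_average(avg, n, sample)
print(avg)  # 50.0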
Workflow Example:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>c01,c02</hostname>
<data_group>coretemp</data_group>
</step>
<step name = "aggregate">
<compute>average</compute>
</step>
</workflow>
</workflows>
In the above example, the coretemp sensor data of nodes c01 and c02 will be passed down to the aggregate plugin, which averages across c01 and c02.
Aggregate plugin attributes:
compute: (Required) This specifies the computation type. Currently, there are three types of computation supported, namely average, min, and max.
Threshold
The threshold plugin does level checking on data and generates a syslog/email notification for each out-of-bounds value. The user specifies high and low bounds using workflow parameters.
A typical policy includes:
- Threshold type: high or low
- Threshold value: threshold boundary of sensor measurement
- Severity: severity/priority of the RAS event triggered by the policy; severity levels follow the RFC 5424 syslog protocol
- Notification mechanism: how the event is delivered, via syslog or SMTP email
Workflow Example:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>c[2:1-8]</hostname>
<data_group>coretemp</data_group>
</step>
<step name = "threshold">
<policy>hi|100|crit|smtp,low|20|warn|syslog</policy>
</step>
</workflow>
</workflows>
The above workflow means that for values higher than 100, send an email reporting a critical event, and for values lower than 20, write a warning message to the syslog.
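Each policy is a comma-separated list of rules, and each rule carries four '|'-separated fields: threshold type, threshold value, severity, and notification mechanism. An illustrative Python sketch (not the framework's actual parser) of how such a string decomposes:

# Illustrative only: decompose a threshold policy string into its fields.
policy = "hi|100|crit|smtp,low|20|warn|syslog"
for rule in policy.split(","):
    threshold_type, value, severity, notifier = rule.split("|")
    print(threshold_type, value, severity, notifier)
# hi 100 crit smtp
# low 20 warn syslog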
Window
The window plugin calculates statistics (average, min, max, and standard deviation (sd)) over the incoming samples within a time or counter window. The left and right boundaries of the window advance (slide) by a sliding size; in the current implementation, the sliding size equals the window size, so consecutive windows do not overlap.
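As an illustration of these semantics, the hypothetical sketch below (not the plugin implementation) computes the four statistics over counter windows whose slide equals their size, i.e. consecutive non-overlapping batches of samples:

import statistics

# Hypothetical sketch: counter windows whose sliding size equals the
# window size, so each sample belongs to exactly one window.
def counter_windows(samples, win_size):
    for i in range(0, len(samples) - win_size + 1, win_size):
        window = samples[i:i + win_size]
        yield {"average": statistics.mean(window),
               "min": min(window),
               "max": max(window),
               "sd": statistics.stdev(window)}

for stats in counter_windows([47, 49, 52, 50, 51, 48, 53, 49], win_size=4):
    print(stats)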
Workflow Example:
Workflow 1:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>c02</hostname>
<data_group>coretemp</data_group>
</step>
<step name = "window">
<win_size>1h</win_size>
<compute>average</compute>
<type>time</type>
</step>
</workflow>
</workflows>
Workflow 2:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>c01</hostname>
<data_group>coretemp</data_group>
</step>
<step name = "window">
<win_size>10</win_size>
<compute>sd</compute>
<type>counter</type>
</step>
</workflow>
</workflows>
The first workflow specifies a time window of 1 hour for averaging the coretemp sensor data of node c02, while the second workflow specifies a counter window of 10 samples for computing the standard deviation of the coretemp sensor data of node c01.
Window plugin attributes:
win_size: (Required) Size of the window.
compute: (Required) The type of computation (average, min, max, sd). For now, only one computation is allowed in one workflow step.
type: (Optional) The type of the window. By default it is a time window.
type=time: the window is a time window. By default, the unit of win_size is seconds. To compute over minutes, hours, or days, the user can either set the unit with a unit attribute or convert the value to seconds; units longer than a day (e.g. weeks, months, years) are not supported. For example, win_size=3600 means a one-hour window, which is equivalent to setting win_size=1;unit=hour (see the sketch after this list).
type=counter: the window is a counter window, i.e. a number of samples. For example, win_size=200 means doing the computation over the last 200 samples.
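The time-unit equivalence described under type=time amounts to normalizing the window to seconds, as in this hypothetical sketch (not the plugin's code):

# Hypothetical sketch: a time window is normalized to seconds; units
# longer than a day are not supported.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def window_seconds(win_size, unit="second"):
    return win_size * UNIT_SECONDS[unit]

print(window_seconds(3600))       # 3600 -> a one-hour window
print(window_seconds(1, "hour"))  # 3600 -> the equivalent spelling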
Count Over Time Threshold (cott)
This analytics plugin takes the incoming absolute count data during sampling and fires an event when the calculated count delta is greater than or equal to the threshold value within the specified time window. The errcounts sensor plugin can be used as direct input to this analytics plugin.
Note that the input values for a given data label (or key) are the current absolute count, not the change in count since the last data associated with that label. If your data source reports change values, you will need an accumulator-like plugin to produce a running count to feed into this plugin.
The output of this plugin is the unmodified incoming data; the plugin only examines the data and triggers events.
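A minimal sketch of the cott check, assuming the plugin compares the newest absolute count against the oldest count still inside the time window (illustrative only):

from collections import deque

# Illustrative only: fire when the delta between the newest absolute count
# and the oldest count still inside the time window reaches the threshold
# (inclusive).
def make_cott(time_window, count_threshold):
    samples = deque()  # (timestamp, absolute_count)
    def check(timestamp, count):
        samples.append((timestamp, count))
        while timestamp - samples[0][0] > time_window:
            samples.popleft()
        return count - samples[0][1] >= count_threshold  # True -> fire event
    return check

check = make_cott(time_window=60, count_threshold=10)
print(check(0, 100))   # False
print(check(30, 109))  # False: delta of 9 within 30s
print(check(45, 110))  # True:  delta of 10 within 45s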
Sample Workflow File:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<data_group>errcounts</data_group>
</step>
<step name = "cott">
<severity>error</severity>
<fault_type>soft</fault_type>
<store_event>yes</store_event>
<notifier_action>smtp</notifier_action>
<label_mask>CPU_SrcID#*_Channel#*_DIMM#*_CE</label_mask>
<time_window>60s</time_window>
<count_threshold>10</count_threshold>
</step>
</workflow>
</workflows>
In this example, the filter returns only data from the errcounts plugin as input to the cott analytics plugin. The plugin looks for a count change of 10 within a time window of 60 seconds. When triggered, this example fires an event of error severity as a soft fault, sends an email notification, and stores the event in the event database.
Workflow Arguments:
label_mask: (Required) This specifies the data labels (keys) to match, using wildcard syntax.
count_threshold: (Required) The number of new counts within the time_window at which the event is fired (inclusive).
severity: (Optional) Defaults to error; can only be emerg, alert, crit, error, or warn.
fault_type: (Optional) Defaults to hard; can only be hard or soft.
store_event: (Optional) Defaults to yes; can only be yes or no.
notifier_action: (Optional) Defaults to none; can only be none, email, or syslog.
time_window: (Optional) Defaults to 1 second. The format is a number followed by a character denoting the unit: s for seconds, m for minutes, h for hours, d for days.
Spatial
The spatial plugin performs spatial (not temporal) data aggregations across a rack or a sub-group of nodes within a rack. The groups of nodes can be defined in the logical group configuration file. When doing the computation for a group of nodes, one sample per node is used for all the nodes.
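The per-cycle behavior can be pictured with this hypothetical sketch (it assumes samples arrive as timestamped (node, value) tuples and that the average is requested):

# Hypothetical sketch of one spatial-aggregation cycle: collect at most one
# sample per node, wait no longer than `timeout` seconds after the first
# sample arrives, then aggregate across whichever nodes have reported.
def spatial_cycle(samples, nodelist, timeout):
    """samples: iterable of (arrival_time, node, value) in arrival order."""
    collected = {}
    first_arrival = None
    for t, node, value in samples:
        if first_arrival is None:
            first_arrival = t              # the timeout clock starts here
        if t - first_arrival > timeout:
            break                          # a node may have failed; stop waiting
        if node in nodelist:
            collected.setdefault(node, value)  # keep one sample per node
        if len(collected) == len(nodelist):
            break                          # every node has reported
    return sum(collected.values()) / len(collected)

print(spatial_cycle([(0, "c01", 50), (1, "c02", 54)], ["c01", "c02"], timeout=20))  # 52.0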
Sample Workflow File:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>$rack1</hostname>
<data_group>coretemp</data_group>
</step>
<step name = "spatial">
<nodelist>$rack1</nodelist>
<compute>average</compute>
<interval>10</interval>
<timeout>20</timeout>
</step>
</workflow>
</workflows>
The above workflow averages the coretemp sensor data every 10 seconds for all the compute nodes defined in logical group rack1. See Section 3.5 for the logical group file definition.
Spatial Plugin Attributes:
nodelist: (Required) The list of nodes across which the aggregation will be conducted.
compute: (Required) The type of computation (average, min, max, sd). For now, only one computation is allowed in one workflow step.
interval: (Optional) A timing attribute in units of seconds. It indicates the sleep length between two consecutive cycles, giving the user control over the granularity of cycling when the cycling should not run as fast as possible. The default value is 60 seconds.
timeout: (Optional) A timing attribute in units of seconds. It is the maximum time to wait, after the first sample of a cycle arrives, before doing the computation. This avoids waiting endlessly when a node failure happens. The default value is 60 seconds.
General Exception (genex)
The purpose of the genex plugin is to catch any message and filter all those containing a given word or set of words. The filter criteria also include the severity assigned to the message. Matches can generate event notifications.
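A minimal sketch of the matching rule, assuming a message must match both the configured regular expression and the configured severity (illustrative only, not the plugin's code):

import re

# Illustrative only: a message generates a notification when it matches
# both the configured regex and the configured severity.
def genex_match(message, message_severity, msg_regex, severity):
    return message_severity == severity and re.search(msg_regex, message) is not None

print(genex_match("sshd: access_denied for user foo", "critical",
                  "access_denied", "critical"))  # True -> send notification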
Workflow Example:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<data_group>syslog</data_group>
</step>
<step name = "genex">
<msg_regex>access_denied</msg_regex>
<severity>critical</severity>
<notifier>smtp</notifier>
</step>
</workflow>
</workflows>
The above workflow means that messages containing the string access_denied with critical severity will trigger an e-mail reporting a critical event.
DB Storage Attribute
The role of the DB storage attribute is to store the analyzed data in the database. This attribute is optional and will be activated only if the user has requested that the data be stored in the database. Example:
<step name = "aggregate">
<compute>average</compute>
<db>yes</db>
</step>
Store raw sensor data
Logging of raw sensor data to the database is controlled by an MCA parameter from the analytics code: analytics_base_store_raw_data.
To set raw sensor data logging to true, use the command below:
# orcmd --omca store_raw_data true
To set raw sensor data logging to false, use the command below:
# orcmd --omca store_raw_data false
Store only events
Logging of event data to the database is controlled by an MCA parameter from the analytics code: analytics_base_store_event_data.
To set event data logging to true, use the command below:
# orcmd --omca store_event_data true
To set event data logging to false, use the command below:
# orcmd --omca store_event_data false
Suppress repeat
The suppress repeat MCA attribute specifies the time period during which all repeat events will be suppressed. An event is considered a repeat if its reporter, severity, type, and key value match those of an earlier event. The MCA parameter that controls repeat-event suppression is analytics_base_suppress_repeat.
To suppress repeat events for a time interval T, use the command below:
# orcmd --omca analytics_base_suppress_repeat <T>
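The suppression rule itself can be pictured with this hypothetical sketch: an event is dropped if an identical (reporter, severity, type, key) tuple was emitted less than the suppression interval ago.

# Hypothetical sketch of repeat suppression keyed on
# (reporter, severity, type, key).
def make_suppressor(interval):
    last_emitted = {}  # key tuple -> time of last emission
    def should_emit(now, reporter, severity, etype, key):
        k = (reporter, severity, etype, key)
        if k in last_emitted and now - last_emitted[k] < interval:
            return False               # repeat within the interval: suppress
        last_emitted[k] = now
        return True
    return should_emit

emit = make_suppressor(interval=1200)  # e.g. a 20-minute window
print(emit(0, "c01", "crit", "HARD_FAULT", "coretemp"))    # True
print(emit(600, "c01", "crit", "HARD_FAULT", "coretemp"))  # False: suppressed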
Workflow Example:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>c[2:1-8]</hostname>
<data_group>coretemp</data_group>
</step>
<step name = "threshold">
<policy>hi|100|crit|smtp,low|2|warn|syslog</policy>
<suppress_repeat>yes</suppress_repeat>
<category>HARD_FAULT</category>
<severity>crit</severity>
<time>20m</time>
</step>
</workflow>
</workflows>
To suppress a certain category or severity of events at a different rate, every plugin supports the following additional attributes:
suppress_repeat: (Required) Enables repeat suppression for the specific plugin. If no additional attributes are specified, all events generated by the plugin will be suppressed for the time period determined by the analytics_base_suppress_repeat MCA parameter.
category: (Optional) Event category. Allowed types are HARD_FAULT and SOFT_FAULT.
severity: (Optional) Event severity. Allowed types are emerg, alert, crit, error, warn, notice, info, and debug. All events matching this parameter or with a higher value will be suppressed at a different rate.
time: (Optional) All events that match the severity and category parameters will be suppressed for the time period specified by this parameter. All other events will be suppressed for the time period determined by the analytics_base_suppress_repeat MCA parameter.
Launch an executable for a given event
This feature allows the user to specify an executable to be launched when an event happens. The user specifies the event and the executable in the workflow file submitted through OCTL. The aggregator processes the workflow and sends a command to the scheduler to launch the executable. The executable resides on the scheduler side, and its path is visible only to the scheduler; the workflow file specifies only the exec name. It is recommended to define a logical group name for the exec and to expose only that logical group name in the workflow file.
Workflow Example:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<workflows>
<aggregator>aggregator1</aggregator>
<workflow name = "wf1">
<step name = "filter">
<hostname>c01</hostname>
<data_group>coretemp</data_group>
</step>
<step name = "threshold">
<policy>hi|30|crit|exec</policy>
<exec_name>$exec1</exec_name>
<exec_argv>argv1,argv2,argv3</exec_argv>
</step>
</workflow>
</workflows>
The above workflow means that for compute node c01, if the core temperature rises above 30, a critical event is raised and the aggregator notifies the scheduler to launch the exec with the logical group name exec1, along with the three arguments argv1, argv2, and argv3.
The fourth element of the policy in the threshold plugin is exec, meaning that an executable is launched. There are two attributes associated with exec:
exec_name: (Mandatory) The name of the executable to be launched by the scheduler. A logical group name is recommended.
exec_argv: (Optional) Comma-separated argument list for running the exec.
The logical group name of the executable should be defined using the octl grouping command before running all the daemons, for example:
$ octl grouping add exec1 hello_world
MCA Parameter of the exec Path:
At the scheduler side, an MCA parameter named event_exec_path (or the -e option) allows the scheduler to specify the path of the executables to be launched. By default, the path points to the orcm_install/bin folder. Below are examples of specifying the parameter:
$ ./orcmsched -e /usr/bin
$ ./orcmsched --omca event_exec_path /usr/bin