Data Smoothing Algorithms and Analytics

Analytics is a framework for running user-defined data smoothing algorithms on any source data that reaches the framework. Analytics provides different plug-ins that analyze the source data based on the user's request. A user requests the analysis of source data through workflows. Below is a sample workflow file:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>*aggregator_name*</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>*node_name*</hostname>
            <data_group>*sensor_name*</data_group>
            <core>*core_id*</core>
        </step>
        <step name = "aggregate">
            <compute>average</compute>
        </step>
    </workflow>
    </workflows>

Here, node_name, sensor_name, and core_id can each be a comma-separated list. In the above example, if the user specifies hostname as node1, data_group as coretemp, and core as core1, core2, then the aggregate plugin will average the coretemp values of core1 and core2.
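
A minimal sketch of that request, filling in the template above with the values just described (node1, aggregator1, and the core names are illustrative only):

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>node1</hostname>
            <data_group>coretemp</data_group>
            <core>core1,core2</core>
        </step>
        <step name = "aggregate">
            <compute>average</compute>
        </step>
    </workflow>
    </workflows>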

The diagram below provides an overview of the Analytics Framework in a cluster environment; the different analytics plug-ins are described later in this page.

The Analytics framework runs only on the aggregator node. The actions to be performed by the Analytics framework are submitted by the OCTL tool, which an administrator can run from any management node on the network. The implementation details of the OCTL tool are provided in the OCTL section of this document, and the implementation details of the Analytics framework are provided in the Analytics Framework section of this document.

Analytics Framework Overview

OCTL

OCTL Overview

OCTL Flow Chart

NOTE: Please refer to the OCTL tool wiki page for the usage.

Analytics Framework

Working of Analytics Framework

Analytics Framework Flow Chart

There are currently seven plug-ins (Filter, Aggregate, Threshold, Window, Cott, Spatial, and Genex) supported in the Analytics Framework. In addition to these plug-ins, there is an optional DB Storage Attribute that allows a plug-in to log its data into the database. There are also a couple of MCA parameters that let the user decide whether to log raw data and event data to the database. Short descriptions of all the plug-ins, the attribute, and the MCA parameters are given below.

Filter

The role of the FILTER plug-in is to compare the user-requested information with the actual source data sent by the aggregator. This ensures that only the data the user requested is analyzed, rather than any unnecessary data. After validating the data, this plug-in passes the information on to the aggregate or any other plug-in. The first workflow step in a workflow is always the Filter step, which is mandatory.

Workflow Example:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>c[2:1-8]</hostname>
            <data_group>coretemp</data_group>
            <core>core0</core>
        </step>
    </workflow>
    </workflows>

In the above example, the aggregator tag indicates which aggregator(s) this workflow will be submitted to. The filter workflow step has three attributes, namely hostname, data_group, and core. These attributes match the keys of the source data (in this case, the sensor). This example filters core 0's coretemp data for the nodes c01, c02, ..., c08.

Aggregate

The aggregate plugin performs running average, minimum, and maximum aggregation operations. The running average is the average of all the data up to the current data value. The aggregate plug-in uses the data received from the previous workflow step. The formula for the running average is as follows:

    A(N+1) = (N * A(N) + new_sample) / (N + 1)
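
For example, for an illustrative sample stream of 10, 20, 30, the running average evolves as follows:

    A(1) = 10
    A(2) = (1 * 10 + 20) / 2 = 15
    A(3) = (2 * 15 + 30) / 3 = 20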

Workflow Example:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>c01,c02</hostname>
            <data_group>coretemp</data_group>
        </step>
        <step name = "aggregate">
            <compute>average</compute>
        </step>
    </workflow>
    </workflows>

In the above example, the coretemp sensor data of nodes c01 and c02 will be passed down to the aggregate plugin to do an average across c01 and c02.

Aggregate plugin attributes:

  • compute: (Required) This specifies the computation type. Currently, there are three types of computations supported, namely: average, min and max.

Threshold

The threshold plugin does level checking on data and generates a syslog/email notification for every out-of-bounds value. The user specifies high and low bounds using workflow parameters.

A typical policy includes:

  • Threshold type: high or low
  • Threshold value: threshold boundary of sensor measurement
  • Severity: severity/priority of the RAS event that is triggered by the policy; severity level follows RFC 5424 syslog protocol
  • Notification mechanism: notification mechanism of the event, syslog or SMTP email
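
As shown in the examples on this page, each policy is written as a pipe-separated tuple of these four fields, and multiple policies are separated by commas (the bracketed repetition below is notation, not literal syntax):

    <policy>type|value|severity|notifier[,type|value|severity|notifier,...]</policy>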

Workflow Example:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>c[2:1-8]</hostname>
            <data_group>coretemp</data_group>
        </step>
        <step name = "threshold">
            <policy>hi|100|crit|smtp,low|20|warn|syslog</policy>
        </step>
    </workflow>
    </workflows>

The above workflow means that values higher than 100 trigger an email reporting a critical event, and values lower than 20 trigger a warning message written to the syslog.

Window

The window plugin calculates statistics (average, min, max, and standard deviation (sd)) over the values of the incoming samples within a time/counter window. The boundaries (left and right) of the window advance (slide) by a sliding size. In the current implementation, the sliding size equals the window size.
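
For instance, with a counter window of size 3 (and the sliding size equal to the window size, as in the current implementation), the incoming samples are grouped into non-overlapping windows and the statistic is computed once per window:

    samples:   s1 s2 s3 | s4 s5 s6 | s7 s8 s9 | ...
    window 1:  s1..s3
    window 2:  s4..s6
    window 3:  s7..s9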

Workflow Example:

Workflow 1:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>c02</hostname>
            <data_group>coretemp</data_group>
        </step>
        <step name = "window">
            <win_size>1h</win_size>
            <compute>average</compute>
            <type>time</type>
        </step>
    </workflow>
    </workflows>

Workflow 2:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>c01</hostname>
            <data_group>coretemp</data_group>
        </step>
        <step name = "window">
            <win_size>10</win_size>
            <compute>sd</compute>
            <type>counter</type>
        </step>
    </workflow>
    </workflows>

The first workflow specifies a time window of 1 hour for averaging the coretemp sensor data of node c02, while the second workflow specifies a counter window of 10 samples for computing the standard deviation of the coretemp sensor data of node c01.

Window plugin attributes:

  • win_size: (Required) size of the window
  • compute: (Required) the type of computation (average, min, max, sd). For now, only one computation is allowed in one workflow step.
  • type: (Optional) the type of the window. By default it is a time window.
    • type=time: the window is a time window. By default, the unit of win_size is seconds. To use minutes, hours, or days as the unit (units longer than a day, e.g. weeks, months, or years, are not supported), the user can either set the unit with a unit attribute or convert the value to seconds; see the sketch after this list. For example, win_size=3600 means a one-hour window, which is equivalent to setting win_size=1;unit=hour.
    • type=counter: the window is a counter window, meaning a number of samples. For example, win_size=200 means doing the computation over the last 200 samples.
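
Below is a minimal sketch of a window step that sets the unit instead of converting to seconds. It assumes the unit attribute is written as a <unit> tag inside the window step (the tag name is an assumption, not confirmed syntax); the step would be equivalent to win_size=3600 with the default unit of seconds:

    <step name = "window">
        <win_size>1</win_size>
        <unit>hour</unit>
        <compute>average</compute>
        <type>time</type>
    </step>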

Count Over Time Threshold (cott)

This analytics plugin takes the incoming absolute count data during sampling and fires an event when the calculated count delta within the specified time window is greater than or equal to the threshold value. The errcounts sensor plugin can be used as direct input to this analytics plugin.

Note that the input values for a given data label (or key) must be the current cumulative count, not the change in count since the last data associated with that label. If the source provides change values, an accumulator-like plugin is needed to produce the counts used as input to this plugin.

The output of this plugin is the unmodified incoming data; the plugin only examines the data and triggers events.

Sample Workflow File:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <data_group>errcounts</data_group>
        </step>
        <step name = "cott">
            <severity>error</severity>
            <fault_type>soft</fault_type>
            <store_event>yes</store_event>
            <notifier_action>smtp</notifier_action>
            <label_mask>CPU_SrcID#*_Channel#*_DIMM#*_CE</label_mask>
            <time_window>60s</time_window>
            <count_threshold>10</count_threshold>
        </step>
    </workflow>
    </workflows>

In this example, the filter returns only data from the errcounts plugin to use as input to the cott analytics plugin. The plugin looks for a count change of 10 within a time window of 60 seconds. When triggered, this example fires an event with error severity and a soft fault type, and it also stores the event in the event database.
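
For illustration only (the label instance and counts below are hypothetical values matching the label_mask above), suppose a single label reports the following cumulative counts within one 60-second window:

    CPU_SrcID#0_Channel#0_DIMM#0_CE:
        t =  0s    count = 100
        t = 20s    count = 104
        t = 50s    count = 111    -> delta = 111 - 100 = 11 >= 10, event fires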

Workflow Arguments:

  • label_mask: (Required) This specifies the data labels (key) to match using wildcard syntax.
  • count_threshold: (Required) This denotes the number of new counts within the time_window at or above which the event is fired (the threshold is inclusive).
  • severity: (Optional) This defaults to error and can only be emerg, alert, crit, error, warn.
  • fault_type: (Optional) This defaults to hard and can only be hard or soft.
  • store_event: (Optional) This defaults to yes and can only be yes or no.
  • notifier_action: (Optional) This defaults to none and can only be none, email, syslog.
  • time_window: (Optional) This defaults to 1 second; the format is a number followed by a character denoting the unit of the number:
    • s - seconds
    • m - minutes
    • h - hours
    • d - days

Spatial

The spatial plugin performs spatial (not temporal) data aggregations across a rack or a sub-group of nodes within a rack. The groups of nodes can be defined in the logical group configuration file. When the computation is done for a group of nodes, there is one sample per node for all the nodes in the group.

Sample Workflow File:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>$rack1</hostname>
            <data_group>coretemp</data_group>
        </step>
        <step name = "spatial">
            <nodelist>$rack1</nodelist>
            <compute>average</compute>
            <interval>10</interval>
            <timeout>20</timeout>
        </step>
    </workflow>
    </workflows>

The above workflow averages the coretemp sensor data for all the compute nodes defined in group rack1 of the logical group configuration, every 10 seconds. See Section 3.5 for the logical group file definition.
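
As a hedged example, the rack1 group referenced above might be defined with the octl grouping command before the workflow is submitted; this assumes the same grouping add syntax shown later in this page for the exec group, with an illustrative node list:

$ octl grouping add rack1 c[2:1-8]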

Spatial Plugin Attributes:

  • nodelist: (Required) The list of nodes over which the aggregation will be conducted
  • compute: (Required) the type of computation (average, min, max, sd). For now, only one computation is allowed in one workflow step.
  • interval: (Optional) A timing attribute in seconds. It indicates the sleep time between two consecutive cycles, giving the user control over the cycling granularity when the cycling should not run as fast as it can. The default value is 60 seconds.
  • timeout: (Optional) A timing attribute in seconds. It specifies the maximum time to wait, after the first sample of a cycle arrives, before doing the computation. If a node failure happens, this attribute avoids waiting endlessly. The default value is 60 seconds.

General Exception (genex)

The purpose of the genex plugin is to catch any message and filter those that contain a given word or set of words. The filter criteria are also based on the severity assigned to the message. Matches can generate event notifications.

Workflow Example:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <data_group>syslog</data_group>
        </step>
        <step name = "genex">
            <msg_regex>access_denied</msg_regex>
            <severity>critical</severity>
            <notifier>smtp</notifier>
        </step>
    </workflow>
    </workflows>

The above workflow means that messages containing the string access_denied with critical severity will trigger an e-mail reporting a critical event.

DB Storage Attribute

The role of the DB storage attribute is to store the analyzed data in the database. This attribute is optional and is activated only if the user has requested that the data be stored in the database. Example:

    <step name = "aggregate">
        <compute>average</compute>
        <db>yes</db>
    </step>

Store raw sensor data

Logging of raw sensor data to the database is controlled by an MCA parameter in the analytics code. The MCA parameter that controls raw sensor data logging to the DB is analytics_base_store_raw_data.

To set raw sensor data logging to true, use the command below:

# orcmd --omca store_raw_data true

To set raw sensor data logging to false, use the command below:

# orcmd --omca store_raw_data false

Store only events

Logging of event data to the database is controlled by an MCA parameter in the analytics code. The MCA parameter that controls event data logging to the DB is analytics_base_store_event_data.

To set event data logging to true, use the command below:

# orcmd --omca store_event_data true

To set event data logging to false, use the command below:

# orcmd --omca store_event_data false

Suppress repeat

The suppress repeat MCA parameter specifies the time period during which all repeat events will be suppressed. An event is considered a repeat if its reporter, severity, type, and key value match those of an earlier event. The MCA parameter that controls this suppression period is analytics_base_suppress_repeat.

To suppress repeat events for a time interval T, use the command below:

# orcmd --omca analytics_base_suppress_repeat <T>

Workflow Example:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>c[2:1-8]</hostname>
            <data_group>coretemp</data_group>
        </step>
        <step name = "threshold">
            <policy>hi|100|crit|smtp,low|2|warn|syslog</policy>
            <suppress_repeat>yes</suppress_repeat>
            <category>HARD_FAULT</category>
            <severity>crit</severity>
            <time>20m</time>
        </step>
    </workflow>
    </workflows>

In the example above, repeat events generated by the threshold step that match category HARD_FAULT and severity crit (or higher) are suppressed for 20 minutes, while all other repeat events from the step are suppressed for the period set by the analytics_base_suppress_repeat MCA parameter. To suppress a certain category or severity of events at a different rate, every plugin supports the following additional attributes:

  • suppress_repeat: (Required) Enables suppress repeat for a specific plugin. If no additional attributes are specified, all events generated by the plugin will be suppressed for the time period determined by the analytics_base_suppress_repeat MCA parameter.
  • category: (Optional) Event category. Allowed types are HARD_FAULT and SOFT_FAULT
  • severity: (Optional) Event severity. Allowed types are emerg, alert, crit, error, warn, notice, info, and debug. All events matching this parameter or with a higher severity will be suppressed at a different rate.
  • time: (Optional) All events that match the severity and category parameters will be suppressed for the time period specified by this parameter. All other events will be suppressed for the time period determined by the analytics_base_suppress_repeat MCA parameter.

Launch an executable for a given event

This feature allows the user to specify an executable to be launched when an event happens. The user specifies the event and the executable in a workflow file that is submitted through OCTL. The aggregator processes the workflow and sends a command to the scheduler to launch the executable. The executable is located on the scheduler side, and its path is visible only to the scheduler; the workflow file specifies only the exec name. Regarding the exec name, it is recommended to define a logical group name for the exec and to expose only the logical group name in the workflow file.

Workflow Example:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <workflows>
    <aggregator>aggregator1</aggregator>
    <workflow name = "wf1">
        <step name = "filter">
            <hostname>c01</hostname>
            <data_group>coretemp</data_group>
        </step>
        <step name = "threshold">
            <policy>hi|30|crit|exec</policy>
            <exec_name>$exec1</exec_name>
            <exec_argv>argv1,argv2,argv3</exec_argv>
        </step>
    </workflow>
    </workflows>

The above workflow means that for compute node c01, if the core temperature is higher than 30, a critical event is raised and the aggregator notifies the scheduler to launch the exec with the logical group name exec1, along with three arguments, namely argv1, argv2, and argv3.

The fourth element of the policy in the threshold plugin is exec, meaning launching an executable. There are two attributes associated with the exec:

  • exec_name: (Mandatory) The name of the executable to be launched by the scheduler. Logical group name is recommended.
  • exec_argv: (Optional) Comma separated argument list of running the exec.

The logical group name of the executable should be defined using the octl grouping command before starting the daemons, for example:

$ octl grouping add exec1 hello_world

MCA Parameter of the exec Path:

On the scheduler side, an MCA parameter named event_exec_path (also settable with the -e option) is provided to allow the scheduler to specify the path of the executables to be launched. By default, the path points to the orcm_install/bin folder. Below are examples of specifying the parameter:

$ ./orcmsched -e /usr/bin
$ ./orcmsched --omca event_exec_path /usr/bin