JARRIX SYSTEMS Pty.Ltd.

Managing Complexity through Simplicity

 
 

eEMU Message Guide

Author: Jarra Voleynik

V1.1

Table of Contents











9 Message Types

9.1 Resource ID

9.2 eEMU text file

9.3 Message types

9.4 normal message

9.5 delete message

9.6 comment message

9.7 sleep message

9.8 wakeup message

9.9 mask message

9.10 query message

9.11 count <lag> message

9.12 event message

9.13 suspend message

10 Scenarios of eEMU use
 
 
 
 
 

9 Message Types

9.1 Resource ID

Monitored resources are designated with a unique qualifier that we call an object ID. The hostname and object ID form a resource ID:

resource ID = hostname:object ID

The resource ID uniquely identifies a resource in the enterprise. It is used as a message key in the database. Messages with the same hostname and object ID as an existing message are considered updates, irrespective of what the message text is. As an example, let's take a /usr/local filesystem on a host called dumbo.company.com. The resource ID in this case is dumbo.company.com:/usr/local. To make the object ID easy to extract, it is ALWAYS specified as the first word of the message. Make sure that agents send messages with unique object IDs in them.

Let's suppose we need to monitor a process called netscape run by user jarra or victor. How do we distinguish between the two if the object ID = netscape? In that case, another qualifier is introduced as part of the object ID. There can be multiple qualifiers separated by a percent sign. Each qualifier narrows down the object ID. For example, the above netscape process will have two watchers with the following object IDs:

· netscape%jarra

· netscape%victor

In the case of log watchers monitoring log files for a pattern, a sequence number is appended as part of the object ID. For example, we can have two watchers for /var/log/messages. One scans for the word "volerror", the other for the pattern "disk error". The following are the object IDs:

· /var/log/messages%1

· /var/log/messages%2
 

(NOTE: log watcher composes a message based on the log file's contents. The actual message will tell us what is wrong, not the object ID)

In case of databases, we may go into three levels if the resource hierarchy calls for it. For example, the following is an object ID for table "SALARY" in tablespace "HR" in database "ACCOUNTS":
ACCOUNTS%HR%SALARY

Given an object ID, the message text is either a full description of a problem or it may complement the object ID to form a meaningful message.

Examples of resource ID

· Oracle database with SID = ICC_PRD, tablespace TOOLS, on host ttc3232: resource ID = ttc3232:ICC_PRD%TOOLS

· /tmp filesystem on host tcc1212.company.com.au: resource ID = tcc1212.company.com.au:/tmp

· process cron on host dumbo: resource ID = dumbo:cron
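
The naming scheme above can be sketched as a pair of small shell helpers. This is only an illustration: the function names object_id and resource_id are hypothetical and are not part of eEMU. They show how qualifiers are joined with a percent sign and how the hostname is prepended with a colon.

```shell
# Hypothetical helpers (not part of eEMU) that assemble IDs as described above.

# Join one or more qualifiers with '%' to form an object ID.
object_id() {
    oid=$1
    shift
    for q in "$@"; do
        oid="$oid%$q"
    done
    printf '%s\n' "$oid"
}

# Prepend the hostname to an object ID to form a resource ID.
resource_id() {
    host=$1
    shift
    printf '%s:%s\n' "$host" "$(object_id "$@")"
}

resource_id ttc3232 ICC_PRD TOOLS    # prints ttc3232:ICC_PRD%TOOLS
resource_id dumbo cron               # prints dumbo:cron
```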

Why is the concept of resource IDs important?
If we start integrating eEMU with a call logging system, we will find that calls need to be logged once only for a specified resource problem. Later messages for that resource need to update an existing call only (with the new message). An example may be a filesystem whose utilization goes up even after an alarm has been triggered. It is obvious that the message text will be changing depending on the utilization percentage. Our system needs to interpret all those messages as belonging to a single resource.

9.2 eEMU text file

eEMU exports its dbm database into a text file for easy processing by eEMU browsers. This text file resides in the database directory and has a .txt suffix. It is updated only if:

- a new message (resource ID) is added
- the message text of an existing message has changed
- the message severity of an existing message has changed
- the message class of an existing message has changed
- the message comment of an existing message has changed or been added

NOTE: if a message attribute changes, it is reflected by an environment variable passed to the output script. Refer to the "Message Actions" section earlier in this manual for a detailed description of the available environment variables.

Only normal messages are put in the text file since they are the only message type meant for display in the browser.

9.3 Message types

eEMU has several message types in order to provide maximum flexibility and functionality. The message type determines the way the message is handled by eEMU. Originally, eEMU started out with normal and delete messages only. Over time, more message types were added and eEMU developed its own messaging language that can be used to define very complex event scenarios. Every business has its specific event flow. This flow can be described and programmed by eEMU messages to achieve full control.

The first type of messages are for database inserts or updates. They are:
normal
sleep
mask
count
event

The second type of messages manipulates an existing message addressable by a resource ID. They are:
delete
wakeup
comment
query (does a message exist?)

The third type are command messages. They instruct eEMU to do something. They are:
query (to fetch a file)
suspend

9.4 normal message

This is the most frequently used message type. It is used in filesystem, process, swap and other agents. Time-to-live is set in order to preserve the message in the eEMU database across resource polls. If no refresh of the same resource ID is received during time-to-live, the message is removed (expired).

eEMU processing:
call input script, add/update message, log message, call output script

Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o normal -n <emu server> -p <port> -t <time-to-live> -s <severity> \
-w <password> -c <class> -m "<message>"

If the -o option is not specified, it defaults to normal.

Example:
An alarm message from host dumbo.company.com.au about object ID=/usr/local getting full:
$emsg1 -o normal -n emuserver -p 2345 -t 6m -s 1 -c /OS/UNIX/FS -w icecream -m "/usr/local is 90% full"

The agent that produces the above message should run every 5 minutes.

Use:
Any resource monitoring, such as filesystems, processes, tablespaces, mail queues. Time-to-live is set depending on the interval at which the resource is polled. Remember that time-to-live must be slightly longer than the polling interval. The polling interval is typically set in cron if the agent is invoked by cron. For batch jobs or backups, set time-to-live to -1 to force the message to stay displayed until it is manually acknowledged.

Note that if multiple messages arrive for the same resource ID, only the latest one is displayed. This is useful for message consolidation. For example, when a disk starts failing, the system sometimes produces hundreds of messages in syslog, each stating a different block number. If the disk name is selected as the object ID, eEMU consolidates all of these messages into just one. eEMU thus acts as a good console output consolidator, or a generic message consolidator.
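
The relationship between the polling interval, time-to-live and message text can be sketched for a cron-driven filesystem agent. This is a minimal sketch under stated assumptions, not an eEMU component: the function names, the threshold argument and the one-minute margin are hypothetical, and the server name, port, password and class are copied from the example above.

```shell
# Hypothetical helpers for a cron-driven filesystem agent (names are illustrative).

# Time-to-live slightly longer than the polling interval. The one-minute
# margin mirrors the 5-minute poll / 6m time-to-live example and is an assumption.
ttl_for_interval() {
    printf '%sm\n' "$(( $1 + 1 ))"
}

# Message text with the object ID (the mount point) as its first word.
fs_message() {
    printf '%s is %s%% full\n' "$1" "$2"
}

# Send a normal message only when usage is at or above the threshold.
# fs_alarm <mount point> <used %> <threshold %> <polling interval in minutes>
fs_alarm() {
    mount=$1 used=$2 threshold=$3 interval=$4
    if [ "$used" -ge "$threshold" ]; then
        emsg1 -o normal -n emuserver -p 2345 \
            -t "$(ttl_for_interval "$interval")" -s 1 -c /OS/UNIX/FS \
            -w icecream -m "$(fs_message "$mount" "$used")"
    fi
}
```

An agent run from cron every 5 minutes could then call, for example, fs_alarm /usr/local 90 85 5, which sends "/usr/local is 90% full" with a time-to-live of 6m.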

Action script:

The input and output action scripts are invoked on receipt of a normal message.

9.5 delete message

This message is used by emucleaner to delete messages whose time is up. Nevertheless, a message can be deleted from an application script as well.

eEMU processing:
call input script, log message, call delete script, delete message

Syntax:
$emsg1 -o delete -n <emu server> -p <port> -w <password> -m "<hostname>:<object ID>"

Example:
To delete a message about object ID=/usr/local that was received earlier:
$emsg1 -o delete -n emuserver -p 2345 -w icecream -m "dumbo.company.com.au:/usr/local"

Use:
The delete message is self-explanatory. It deletes the message with the specified resource ID from the database. It usually means that the problem has been fixed. eEMU browsers allow a delete message to be sent to acknowledge or fix a problem. As we will find later, delete messages are used to stop sleep messages from exploding.

Action script:

eEMU invokes the input and delete action scripts on receipt of a delete message.

9.6 comment message

For efficient and prompt communication among all the people watching eEMU messages, message annotation is indispensable.

eEMU processing:
call input script, comment specified message, log message,  call output script

Syntax:
$emsg1 -o comment -n <emu server> -p <port> -w <password> -m "<hostname>:<object ID> <comment>"

Notice that the comment freely follows the resource ID.

Example:
to send an annotation ("Administrator notified") to the previous message about /usr/local getting full on dumbo.company.com.au
$emsg1 -o comment -n emuserver -p 2345 -w icecream -m "dumbo.company.com.au:/usr/local Administrator notified"

Use:
Comments are a good way for multiple people to communicate about an existing message. A comment can be a work request number, a note about an ETA, etc.
 

Action script:

The input and output action scripts are invoked on receipt of a comment message. The comment can be forwarded to a higher level eEMU from the action script.

9.7 sleep message

A sleep message literally sleeps in the eEMU database until it is woken up by emucleaner (based on its time-to-live) or by another external event (sending the wakeup message). Once woken up, a sleep message is changed to a normal message with time-to-live set to -1 (infinity). The normal message must be manually deleted. Of course, the deletion can be automated through a script with a delete message.

eEMU processing:
call input script, add/update message, log message, call output script

Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o sleep -n <emu server> -p <port> -t <time-to-live> -s <severity> \
-w <password> -c <class> -m "<message>"

Example:
A sleep message from host dumbo.company.com.au about object ID=backup%fs. The message is sent by a filesystem backup script at the beginning of the backup. A delete message for resource ID = dumbo.company.com.au:backup%fs is sent at the end of the backup. Time-to-live is set to either the typical duration of the backup or the end time of the backup window. In our case, the backup must finish by 6:30 am.
$emsg1 -o sleep -n emuserver -p 2345 -t 06:30 -s 1 -c /OS/UNIX/BCK -w icecream -m "backup%fs failed or is running overtime"

Use:
As the above example suggests, sleep is usually used to notify us about a negative event in case a processing script crashes. It is a prediction message that explodes if the calling program doesn't get a chance to clean it up. This feature is very powerful for failure prediction. A sleep message also allows us to "schedule" an event to occur at a fixed time from now. Such a scheduled message can remove itself through the script it invokes.
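
The sleep-then-delete pattern can be sketched as a small wrapper around any job. This is a minimal sketch rather than part of eEMU: the function name guarded_run and its arguments are hypothetical, the server name, port, password and class are copied from the example above, and it assumes that passing -h makes the sleep message carry the same hostname that the later delete message names.

```shell
# Hypothetical wrapper (not part of eEMU).
# guarded_run <hostname> <object ID> <time-to-live> <command> [args...]
guarded_run() {
    host=$1 oid=$2 ttl=$3
    shift 3

    # Arm the sleep message before the work starts.
    emsg1 -h "$host" -o sleep -n emuserver -p 2345 -t "$ttl" -s 1 \
        -c /OS/UNIX/BCK -w icecream -m "$oid failed or is running overtime"

    # Run the job itself.
    "$@"
    rc=$?

    # Disarm only on success; after a failure or crash the sleep message
    # is left in place so that it wakes up and becomes visible.
    if [ "$rc" -eq 0 ]; then
        emsg1 -o delete -n emuserver -p 2345 -w icecream -m "$host:$oid"
    fi
    return "$rc"
}
```

A backup script could then be started as guarded_run dumbo.company.com.au 'backup%fs' 06:30 followed by the backup command and its arguments.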

Action script:

The input and output scripts are invoked on receipt of a sleep message.

9.8 wakeup message

This message is sent by emucleaner to turn a sleep message into a normal message so that it shows in the EMU browser.

eEMU processing:
call input script, change sleep message into normal message, log message, call output script

Syntax:
$emsg1 -o wakeup -n <emu server> -p <port> -w <password> -m "<hostname>:<object ID>"

Example:
To wake up a message with resource ID = dumbo.company.com.au:backup%fs that was received earlier:
$emsg1 -o wakeup -n emuserver -p 2345 -w icecream -m "dumbo.company.com.au:backup%fs"

Use:
Wakeups are usually sent by emucleaner when the time-to-live of a sleep message is up. Nevertheless, wakeup messages can be sent from scripts as well if the application calls for it.

Action script:

The output script is invoked on receipt of a wakeup message. In the output script, all message attributes except for message type (which stays set to wakeup) and E_COUNT (which is set to 1) are inherited from the message being woken up. This is to allow simple actioning in the output script.

9.9 mask message

Mask messages, as the name suggests, are used to mask out incoming messages for the same resource ID as the mask message. A typical scenario is suppressing the missing-process alarms that result from running a cold database backup. A mask message causes messages for its resource ID to be blocked for, or until, its time-to-live.

eEMU processing:
call input script, add/update message, log message, call output script

Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o mask -n <emu server> -p <port> -t <time-to-live> -s <severity> -w <password> -c <class> -m "<message>"

Example:
A mask message from host dumbo.company.com.au stating that an Oracle database called accounts is down. Since the backup should take a maximum of 3 hours, time-to-live is set to 3h. The mask message will be sent by an Oracle cold backup script.
$emsg1 -o mask -n emuserver -p 2345 -t 3h -s 0 -c null -w icecream -m "accounts database is down"

Notice that the severity and class are set to 0 and null, respectively. It is because they are not used in mask messages.

Use:
If, for example, a process watcher is set up on a monitored system, there may be times that we want the process watcher disabled (masked out). It may be necessary if the process is missing for a reason, such as a cold backup.

Action script:

The output action script is invoked on receipt of a mask message.

9.10 query message

This message has two applications:
1. The eEMU database is queried to find out whether a message for a given resource ID exists in the database. emsg1 returns the string "none" if the message doesn't exist, or the message type (normal, sleep, etc.) if it does exist.

2. eEMU is asked to produce a file from the out directory. emsg1 sends the file contents to its standard output. This option is great for agent configuration files distribution. On sending the file, eEMU deletes it from the out directory.

eEMU processing:
call input script, inquire on the message in the database or fetch a file in the out directory and send it to emsg1, log message, call output script

Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o query -n <emu server> -p <port> -w <password> \
-m "<hostname>:<object ID>"

$emsg1 [-h <hostname>] [-u <user>] -o query -n <emu server> -p <port> -w <password>  -m "FILE <file name>"
 

Example:
The script for job 4 sends a query message to find out whether a message about a completed job with resource ID = porky.company.com.au:job3 is in the eEMU database.
If it is, and it is of type "event", job 4 is run. Job 4 therefore starts only if job 3 completed successfully.

RET=`emsg1 -o query -n emuserver -p 2345 -w icecream -m "porky.company.com.au:job3"`
if [ $? -eq 1 ]; then
       # a return code of 1 means that the connection to EMU timed out
       exit 1
fi
if [ "$RET" = "event" ]; then
       # job 3 has completed: run job 4 here
       :
fi

emsg1 will return a "none" string if porky.company.com.au:job3 doesn't exist in the database. Otherwise, it returns the type of message it found, e.g. "event" or "normal".

Another example is the distribution of agent configuration files to a monitored host called dumbo. Configuration files are kept in one place on the management box. Each time an agent configuration file is changed, it is put in the EMU out directory (the out directory is specified in the <port>.cfg eEMU configuration file). Before each run, the agent checks whether the EMU server has a new configuration file to download. If there is none, the agent uses the old one. This scenario is good for sites without rcp capabilities that need central agent configuration file management. It is also good for nodes behind a firewall.

# Write the reply to a file first so that $? reflects emsg1 itself
# and not a later command in a pipeline.
emsg1 -o query -n emuserver -p 2345 -w icecream -m "FILE dumbo.cfg" > new_dumbo.cfg
if [ $? -eq 1 ]; then
    # a return code of 1 means that the connection to EMU timed out;
    # the connect failed, try again later
    exit 1
fi
RET=`cat new_dumbo.cfg`
if [ "$RET" != "none" ]; then
    cp new_dumbo.cfg dumbo.cfg
fi

# run agents with dumbo.cfg
 

Use:
The above examples suggest the potential of this message type. It can be used to facilitate dependency checks across multiple systems. The file to download can also be a job to run, in which case eEMU can be used as a simple scheduling system.

9.11 count <lag> message

Count messages are put in the text file only if more than the specified number of messages (the <lag> value) for the same resource ID are received in a row. This is useful in swap, CPU and memory monitoring.

eEMU processing:
call input script, add/update with respect to the lag value, log message, call output script

Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o "count <lag>" -n <emu server> -p <port> -t <time-to-live> -s <severity> \
-w <password> -c <class> -m "<message>"

Example:
A CPU agent on node porky scans CPU utilization every 5 minutes. In real life, it is not abnormal to see CPU utilization fluctuate for short periods of time. However, if it is running at 100% for 25 minutes at a stretch, it may indicate a performance problem, possibly due to a runaway process.

$emsg1 -o "count 5" -n emuserver -p 2345 -t 6m -s 1 -c /OS/UNIX/CPU -w icecream -m "CPU is 100% utilized"

Use:
Some resources we monitor may exhibit fluctuations that would normally distort detection of a real problem. Introducing a lag makes it possible to pick up cases of a long-term, continual resource problem.
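
Choosing a lag value comes down to simple arithmetic on the polling interval. The helper below is a hypothetical sketch (lag_for is not an eEMU command) that converts the length of a sustained problem into a number of consecutive polls, rounding up.

```shell
# Hypothetical helper: number of consecutive polls that span a sustained period.
# lag_for <sustained minutes> <polling interval in minutes>
lag_for() {
    # Round up so that the lag never covers less than the requested period.
    echo $(( ($1 + $2 - 1) / $2 ))
}

lag_for 25 5    # prints 5, matching the "count 5" example above
```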

Action script:

The output action script is invoked on receipt of a count message. If the message count reaches the <lag> value, the message attributes are changed in the output script as follows:

count is set to 1

message type is set to normal

If the action script is set up to simply forward normal messages or page someone if the message count equals 1, then it will work as expected for count messages as well.

9.12 event message

Event messages describe events. They are just like normal messages, with the difference that they are sent only once, on completion of an event. This event can be a batch job, a backup, etc. Time-to-live is set to how long we want eEMU to know about this event so that other scripts can query it. Event messages don't show in the eEMU browser.

eEMU processing:
call input script, add/update message, log message, call output script

Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o event -n <emu server> -p <port> -t <time-to-live> -s 0 -w <password> \
-c null -m "<object ID>"

Notice that the severity and class are set to 0 and null, respectively. It is because there is no use for them in event messages. The message text is just an object ID against which a query will be run.

Example:
an event that job 4 completed so that other dependent jobs can run.
$emsg1 -o event -n emuserver -p 2345 -t 2h -s 0 -c null -w icecream -m "job"

Use:
The event message will find best use in synchronizing events across multiple systems.
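
A dependent job on another system can wait for such an event by polling with query messages. The following is a minimal sketch under stated assumptions: the function name wait_for_event and its retry arguments are hypothetical, and it relies only on the documented behaviour that a query returns the message type or "none".

```shell
# Hypothetical helper: poll eEMU until an event message exists for a resource ID.
# wait_for_event <resource ID> <attempts> <seconds between attempts>
wait_for_event() {
    rid=$1 tries=$2 pause=$3
    while [ "$tries" -gt 0 ]; do
        mtype=$(emsg1 -o query -n emuserver -p 2345 -w icecream -m "$rid")
        if [ "$mtype" = "event" ]; then
            return 0
        fi
        tries=$(( tries - 1 ))
        # Pause only if another attempt will follow.
        if [ "$tries" -gt 0 ]; then
            sleep "$pause"
        fi
    done
    return 1
}
```

For example, wait_for_event porky.company.com.au:job3 10 60 checks once a minute, up to ten times, and returns success as soon as the event for job 3 is present.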

Action script:

The output action script is invoked on receipt of an event message.

9.13 suspend message

The suspend message suspends eEMU (puts emu to sleep) for a specified number of seconds.

eEMU processing:
call input script, log message, sleep for a specified number of seconds

Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o suspend -n <emu server> -p <port> -w <password> -m "<seconds>"

Example:
If we want to synchronize EMU on system A to EMU on system B, we need to perform a copy operation while system B is not undergoing any changes. By suspending system B for a short period of time, say 10 seconds, the database on system B can be copied to system A.

$emsg1 -o suspend -n emuserver -p 2345 -w icecream -m "10"

Action script:

No action script is invoked.

10 Scenarios of eEMU use

1. hot backup script on speedy (application stays up)

sleep message <backup failed>
backup start
.
backup end
delete message <speedy:backup>

2. cold backup script on speedy (application1 is shut down)
sleep message <backup failed>
mask message <application1>
backup start
.
backup end
delete message <application1>
delete message <speedy:backup>

3. more to be added
 
 
 


Copyright © 1999-2000 Jarrix Systems Pty. Ltd., Australia. All rights reserved.