JARRIX SYSTEMS Pty.Ltd.
eEMU Message Guide
Author: Jarra Voleynik
V1.1
resource ID = hostname:object ID
The resource ID uniquely identifies a resource in the enterprise. It
is used as a message key in the database. Messages with the same hostname
and object ID as an existing message are considered updates, irrespective
of what the message text is. As an example, let's take a /usr/local
filesystem on a host called dumbo.company.com. The resource ID in this case
is dumbo.company.com:/usr/local. To make the object ID easy to extract, it
is ALWAYS specified as the first word of the message. Make sure that agents
send messages with unique object IDs in them.
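As a minimal sketch of the rule above, the resource ID can be derived from the sending host and the first word of the message text. The function name `resource_id` is a hypothetical helper for illustration and is not part of eEMU.

```shell
# Hypothetical helper (not an eEMU command): build "hostname:object ID"
# from a hostname and a message text whose first word is the object ID.
resource_id() {
    host=$1
    msg=$2
    # ${msg%% *} keeps everything before the first space, i.e. the first word
    printf '%s\n' "$host:${msg%% *}"
}

resource_id dumbo.company.com "/usr/local is 90% full"
```

Because only the first word takes part in the key, two messages with different text but the same leading object ID map to the same resource ID.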
Let's suppose we need to monitor a process called netscape run by user jarra or victor. How do we distinguish between the two if the object ID = netscape? In that case, another qualifier is introduced as part of the object ID. There can be multiple qualifiers separated by a percent sign. Each qualifier narrows down the object ID. For example, the above netscape process will have two watchers with the following object IDs:
· netscape%jarra
· netscape%victor
In the case of log watchers monitoring log files for a pattern, a sequence number is appended as part of the object ID. For example, we can have 2 watchers for /var/log/messages. One scans for the word "volerror", the other for the pattern "disk error". The following are the object IDs:
· /var/log/messages%1
· /var/log/messages%2
(NOTE: log watcher composes a message based on the log file's contents. The actual message will tell us what is wrong, not the object ID)
In case of databases, we may go into three levels if the resource hierarchy
calls for it. For example, the following is an object ID for table "SALARY"
in tablespace "HR" in database "ACCOUNTS":
ACCOUNTS%HR%SALARY
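The qualifier levels of an object ID can be pulled apart by splitting on the percent sign. The sketch below is illustrative only; `object_levels` is a hypothetical name, and how eEMU itself parses qualifiers is not described here.

```shell
# Hypothetical helper: print each level of an object ID on its own line,
# splitting on the "%" qualifier separator.
object_levels() {
    printf '%s\n' "$1" | tr '%' '\n'
}

object_levels 'ACCOUNTS%HR%SALARY'
```

For the database example this prints the database, tablespace and table names on three separate lines.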
Given an object ID, the message text is either a full description of a problem or it may complement the object ID to form a meaningful message.
Examples of resource ID
· Oracle database with SID = ICC_PRD, tablespace TOOLS, on host ttc3232: resource ID = ttc3232:ICC_PRD%TOOLS
· /tmp filesystem on host tcc1212.company.com.au: resource ID = tcc1212.company.com.au:/tmp
· process cron on host dumbo: resource ID = dumbo:cron
Why is the concept of resource IDs important?
If we start integrating eEMU with a call logging system, we will find
that calls need to be logged once only for a specified resource problem.
Later messages for that resource need to update an existing call only (with
the new message). An example may be a filesystem whose utilization goes
up even after an alarm has been triggered. It is obvious that the message
text will be changing depending on the utilization percentage. Our system
needs to interpret all those messages as belonging to a single resource.
- new message (resource ID) is added
- message text of an existing message has changed
- message severity of an existing message has changed
- message class of an existing message has changed
- message comment of an existing message has changed or been added
NOTE: if a message attribute changes, it is reflected by an environment variable passed to the output script. Refer to the "Message Actions" section earlier in this manual for a detailed description of the available environment variables.
Only normal messages are put in the text file since they are the only message type meant for display in the browser.
The first type of messages are for database inserts or updates. They
are:
normal
sleep
mask
count
event
The second type of messages manipulates an existing message
addressable by a resource ID. They are:
delete
wakeup
comment
query (does a message exist?)
The third type are command messages. They instruct eEMU
to do something. They are:
query (to fetch a file)
suspend
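The grouping above can be written down as a small lookup. This is a sketch for reference only: `message_category` and its category labels are hypothetical names, and the rule that a query whose text starts with "FILE " is the file-fetching form is taken from the query syntax shown later in this guide.

```shell
# Hypothetical helper: print the category of an "-o" operation as grouped above.
# $1 = operation, $2 = message text (only needed to tell the two query forms apart)
message_category() {
    case "$1" in
        normal|sleep|mask|count|"count "*|event) echo "insert/update" ;;
        delete|wakeup|comment)                   echo "manipulate" ;;
        suspend)                                 echo "command" ;;
        query)
            case "${2:-}" in
                "FILE "*) echo "command" ;;      # query used to fetch a file
                *)        echo "manipulate" ;;   # query asking whether a message exists
            esac ;;
        *) echo "unknown" ;;
    esac
}

message_category query "FILE dumbo.cfg"
```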
eEMU processing:
call input script, add/update message, log message, call output script
Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o normal -n <emu server> \
-p <port> -t <time-to-live> -s <severity> \
-w <password> -c <class> -m "<message>"
If the -o option is not specified, it defaults to normal.
Example:
an alarm message from host dumbo.company.com.au about object
ID=/usr/local getting full
$emsg1 -o normal -n emuserver -p 2345 -t 6m -s 1 -c /OS/UNIX/FS \
-w icecream -m "/usr/local is 90% full"
The agent that produces the above message should run every 5 minutes.
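A filesystem agent of this kind could be sketched as follows. This is an assumption about how such an agent might be written, not the agent shipped with eEMU: `full_filesystems` is a hypothetical helper that reads `df -P` output and prints one message text per filesystem at or above a threshold, with the mount point as the first word so that it becomes the object ID.

```shell
# Hypothetical helper: read `df -P` style output on stdin and print
# "<mount point> is <N>% full" for every filesystem at or above the threshold.
# $1 = threshold in percent
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                 # "90%" -> "90"
        if (pct + 0 >= limit) print $6 " is " pct "% full"
    }'
}

# Intended use from cron every 5 minutes (flags as in the example above):
#   df -P | full_filesystems 90 | while read -r msg; do
#       emsg1 -o normal -n emuserver -p 2345 -t 6m -s 1 -c /OS/UNIX/FS \
#           -w icecream -m "$msg"
#   done
```

The sketch assumes mount points contain no spaces, since it relies on whitespace-separated columns.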
Use:
Any resource monitoring, such as filesystems, processes, tablespaces,
mail queues. Time-to-live is set depending on the interval at which the
resource is polled. Remember that time to live must be slightly longer
than the polling interval. The polling interval is typically set in cron
if the agent is invoked by cron. For batch jobs or backups, set time-to-live
to -1 to force the message to stay displayed until it is manually acknowledged.
Note that if multiple messages arrive for the same resource ID, only the
latest one is displayed. This is useful for message consolidation. For
example, when a disk starts failing, the system sometimes produces hundreds
of messages in syslog, each stating a different block number. If the disk
name is used as the object ID, eEMU consolidates all of them into a single
message. eEMU thus acts as a good console output consolidator, or a generic
message consolidator.
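The consolidation effect can be imitated outside eEMU to see what it does to a message stream. The sketch below is not eEMU's implementation; it assumes an input format of one "hostname message text" pair per line, and the disk name c0t0d0 is a made-up example.

```shell
# Illustrative only: read "hostname<space>message text" lines on stdin and
# keep the latest message per resource ID (hostname:first word of the message).
consolidate() {
    awk '{
        key = $1 ":" $2
        if (!(key in latest)) order[++n] = key
        msg = $0
        sub(/^[^ ]+ /, "", msg)      # drop the hostname, keep the message text
        latest[key] = msg
    }
    END { for (i = 1; i <= n; i++) print order[i] "\t" latest[order[i]] }'
}

printf '%s\n' \
    "dumbo c0t0d0 read error at block 1041" \
    "dumbo c0t0d0 read error at block 2270" \
    "dumbo /usr/local is 90% full" | consolidate
```

Three input lines collapse into two, one per resource ID, with the disk entry carrying the text of the most recent message.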
Action script:
The input and output action scripts are invoked on receipt of a normal message.
eEMU processing:
call input script, log message, call delete script, delete message
Syntax:
$emsg1 -o delete -n <emu server> -p <port> -w <password> -m
"<hostname>:<object ID>"
Example:
to delete a message about object ID=/usr/local that was received
earlier
$emsg1 -o delete -n emuserver -p 2345 -w icecream -m "dumbo.company.com.au:/usr/local"
Use:
Delete message is self-explanatory. It deletes a message of a specified
resource ID from the database. It usually means that the problem has been
fixed. eEMU browsers can send a delete message to acknowledge or close a problem.
As we will find later, delete messages are used to stop sleep messages
from exploding.
Action script:
eEMU invokes the input and delete action scripts on receipt of a delete message.
eEMU processing:
call input script, comment specified message, log message, call
output script
Syntax:
$emsg1 -o comment -n <emu server> -p <port> -w <password>
-m "<hostname>:<object ID> comment ....."
Notice that the comment freely follows the resource ID.
Example:
to send an annotation ("Administrator notified") to the previous message
about /usr/local getting full on dumbo.company.com.au
$emsg1 -o comment -n emuserver -p 2345 -w icecream \
-m "dumbo.company.com.au:/usr/local Administrator notified"
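Since the comment simply follows the resource ID, the two parts of a comment message text can be separated at the first space. The helpers below are a hypothetical sketch of that split and assume the resource ID itself contains no spaces.

```shell
# Hypothetical helpers: split a comment message text into the resource ID
# (first word) and the free-form comment that follows it.
comment_resource() {
    printf '%s\n' "${1%% *}"
}

comment_text() {
    case "$1" in
        *" "*) printf '%s\n' "${1#* }" ;;   # everything after the first space
        *)     printf '\n' ;;               # no comment present
    esac
}

comment_resource "dumbo.company.com.au:/usr/local Administrator notified"
comment_text     "dumbo.company.com.au:/usr/local Administrator notified"
```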
Use:
Comments are a good way for several people to communicate about
an existing message. A comment can be a work request number, a note about an ETA, and so on.
Action script:
The input and output action scripts are invoked on receipt of a comment message. The comment can be forwarded to a higher level eEMU from the action script.
eEMU processing:
call input script, add/update message, log message, call output script
Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o sleep -n <emu server>
-p <port> -t <time-to-live> -s <severity> \
-w <password> -c <class> -m "<message>"
Example:
a sleep message from host dumbo.company.com.au about object ID=backup%fs.
The message is sent by a filesystem backup script at the beginning of the
backup. A delete message for resource ID = dumbo.company.com.au:backup%fs
is sent at the end of the backup. Time-to-live is set to either a typical
duration of the backup or the end time of backup window. In our case, backup
must finish by 6:30 am.
$emsg1 -o sleep -n emuserver -p 2345 -t 06:30 -s 1 -c /OS/UNIX/BCK
-w icecream -m "backup%fs failed or is running overtime"
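The time-to-live values that appear in this guide take four forms: what appears to be a number of minutes (6m), a number of hours (3h), a clock time (06:30), and -1 for a message that stays until it is acknowledged. The sketch below only classifies these four forms; the function names are hypothetical, and eEMU may accept other forms that this guide does not show.

```shell
# Hypothetical helper: name the time-to-live form, based only on the
# examples shown in this guide.
ttl_kind() {
    case "$1" in
        -1)  echo "until acknowledged" ;;
        *:*) echo "clock time" ;;
        *m)  echo "minutes" ;;
        *h)  echo "hours" ;;
        *)   echo "unrecognized" ;;
    esac
}

# Hypothetical helper: convert a relative time-to-live (6m, 3h) to seconds.
# Returns 1 for any other form.
ttl_seconds() {
    case "$1" in
        *m) echo $(( ${1%m} * 60 )) ;;
        *h) echo $(( ${1%h} * 3600 )) ;;
        *)  return 1 ;;
    esac
}

ttl_kind 06:30
ttl_seconds 6m
```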
Use:
As the above example suggests, sleep is usually used to notify us about
a negative event in case a processing script crashes. It is a prediction
message that explodes if the calling program doesn't get a chance to clean
it up. This feature is very powerful for failure prediction. Also, a sleep
message allows us to "schedule" an event to occur at a fixed time from
now. Such a scheduled message can remove itself through the script it invokes.
Action script:
The input and output scripts are invoked on receipt of a sleep message.
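The sleep-then-delete pattern can be wrapped around any job. The following is a sketch under stated assumptions: `guarded_run` is a hypothetical wrapper, the server name, port and password are the placeholder values used in the examples of this guide, and leaving the sleep message in place when the job exits non-zero is a design choice of this sketch rather than documented behavior.

```shell
# EMSG1 can be overridden (for example with a stub) when testing.
EMSG1=${EMSG1:-emsg1}

# Hypothetical wrapper: send a sleep message, run a job, and send the
# matching delete message only if the job succeeds.
# $1 = hostname, $2 = object ID, $3 = time-to-live, $4 = class, rest = job command
guarded_run() {
    host=$1; object=$2; deadline=$3; class=$4
    shift 4
    $EMSG1 -o sleep -n emuserver -p 2345 -t "$deadline" -s 1 -c "$class" \
        -w icecream -m "$object failed or is running overtime"
    "$@" || return $?      # job failed: leave the sleep message to explode
    $EMSG1 -o delete -n emuserver -p 2345 -w icecream -m "$host:$object"
}
```

For the backup example, the call would pass dumbo.company.com.au, backup%fs, 06:30 and /OS/UNIX/BCK followed by the backup command itself.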
eEMU processing:
call input script, change sleep message into normal message, log message,
call output script
Syntax:
$emsg1 -o wakeup -n <emu server> -p <port> -w <password> -m
"<hostname>:<object ID>"
Example:
to wake up a message with resource ID = dumbo.company.com.au:backup%fs
that was received earlier:
$emsg1 -o wakeup -n emuserver -p 2345 -w icecream -m "dumbo.company.com.au:backup%fs"
Use:
Wakeups are usually sent by emucleaner when a sleep message time-to-live
is up. Nevertheless, wakeup messages can be sent from scripts as well if
the application calls for it.
Action script:
The output script is invoked on receipt of a wakeup message. In the
output script, all message attributes except for message type (which stays
set to wakeup) and E_COUNT (which is set to 1) are inherited from the message
being woken up. This is to allow simple actioning in the output script.
Mask messages, as the name suggests, are used to mask out incoming
messages for the same resource ID as the mask message. A typical scenario
is a missing-process alarm raised while a cold database backup is running.
9.9 mask message
A mask message causes messages for a resource ID to be blocked for, or until, the time-to-live.
eEMU processing:
call input script, add/update message, log message, call output script
Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o mask -n <emu server> \
-p <port> -t <time-to-live> -s <severity> \
-w <password> -c <class> -m "<message>"
Example:
a mask message from host dumbo.company.com.au that an oracle
database called accounts is down. Since the backup should take a maximum
of 3 hours, time-to-live is set to 3h. The mask message will be sent by
an oracle cold backup script.
$emsg1 -o mask -n emuserver -p 2345 -t 3h -s 0 -c null -w icecream \
-m "accounts database is down"
Notice that the severity and class are set to 0 and null, respectively. It is because they are not used in mask messages.
Use:
If, for example, a process watcher is set up on a monitored system,
there may be times that we want the process watcher disabled (masked out).
It may be necessary if the process is missing for a reason, such as a cold
backup.
Action script:
The output action script is invoked on receipt of a mask message.
2. eEMU is asked to produce a file from the out directory. emsg1 sends the file contents to its standard output. This option is well suited to distributing agent configuration files. After sending the file, eEMU deletes it from the out directory.
eEMU processing:
call input script, inquire on the message in the database or fetch
a file in the out directory and send it to emsg1, log message, call
output script
Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o query -n <emu server>
-p <port> -w <password> \
-m "<hostname>:<object ID>"
$emsg1 [-h <hostname>] [-u <user>] -o query -n <emu server>
-p <port> -w <password> -m "FILE <file name>"
Example:
Script for job4 will send a query message to find out if a message
about a completed job with resource ID = porky.company.com.au:job3 is in
the eEMU database.
If it is and it is of type "event", job 4 is run. Now it is obvious
that job 4 starts off only if job 3 completed successfully.
RET=`emsg1 -o query -n emuserver -p 2345 -w icecream -m "porky.company.com.au:job3"`
if [ $? -eq 1 ]; then
    # a return code of 1 means that the connection to EMU timed out
    exit
fi
if [ "$RET" = "event" ]; then
    # job 3 completed: run job 4 here
    :
fi
emsg1 will return a "none" string if porky.company.com.au:job3 doesn't exist in the database. Otherwise, it returns the type of message it found, e.g. "event" or "normal".
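When a dependent job should wait for its predecessor rather than give up, the query can be repeated. The sketch below is one possible way to do that; `wait_for_event` is a hypothetical helper, the attempt limit and delay are its own parameters, and the server name, port and password are the placeholders used in the example above.

```shell
# EMSG1 can be overridden (for example with a stub) when testing.
EMSG1=${EMSG1:-emsg1}

# Hypothetical helper: poll until the resource ID is reported as an "event".
# $1 = resource ID, $2 = maximum number of attempts, $3 = seconds between attempts
# Returns 0 if the event was found, 1 if attempts ran out, 2 on a timed-out connection.
wait_for_event() {
    tries=0
    while [ "$tries" -lt "$2" ]; do
        ret=`$EMSG1 -o query -n emuserver -p 2345 -w icecream -m "$1"`
        [ $? -eq 1 ] && return 2          # return code 1: connection timed out
        [ "$ret" = "event" ] && return 0
        tries=$((tries + 1))
        sleep "$3"
    done
    return 1
}
```

A job 4 script could call it with porky.company.com.au:job3 and start its work only when the function returns 0.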
Another example is the distribution of agent configuration files to a monitored host called dumbo. Configuration files are kept in one place on the management box. Each time an agent configuration file is changed, it is put in the EMU out directory (the out directory is specified in the <port>.cfg eEMU configuration file). Before each run, the agent checks whether the EMU server has a new configuration file to download. If there is none, the agent uses the old one. This scenario is good for sites without rcp capabilities that need central agent configuration file management. It is also good for nodes behind a firewall.
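# write the reply to a file first so that $? is the return code of emsg1
# (in a pipeline through tee, $? would be the return code of tee)
emsg1 -o query -n emuserver -p 2345 -w icecream -m "FILE dumbo.cfg" > new_dumbo.cfg
if [ $? -eq 1 ]; then
    # a return code of 1 means that the connection to EMU timed out
    echo "connect failed, try again later" >&2
    exit 1
fi
RET=`cat new_dumbo.cfg`
if [ "$RET" != "none" ]; then
    cp new_dumbo.cfg dumbo.cfg
fi
# run agents with dumbo.cfg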
Use:
The above examples suggest the potential of this
command. It can be used to facilitate dependency checks across multiple
systems. The file to download can be a job to run, whereby eEMU can be
used as a simple scheduling system.
eEMU processing:
call input script, add/update with respect to the lag value, log message,
call output script
Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o "count <lag>" -n <emu
server> -p <port> -t <time-to-live> -s <severity> \
-w <password> -c <class> -m "<message>"
Example:
A CPU agent on node porky scans CPU utilization every 5 minutes. In
real life, it is not abnormal to see CPU utilization fluctuate for short
periods of time. However, if it is running at 100% for 25 minutes at a stretch,
it may indicate a performance problem, possibly due to a runaway process.
$emsg1 -o "count 5" -n emuserver -p 2345 -t 6m -s 1 -c /OS/UNIX/CPU -w icecream -m "CPU is 100% utilized"
Use:
Some resources we monitor may exhibit fluctuations that would normally
obscure detection of a real problem. Introducing a lag makes it possible to
pick up cases of a long-term, continuous resource problem.
Action script:
The output action script is invoked on receipt of a count message. If the message count reaches the <lag> value, the message attributes are changed in the output script as follows:
count is set to 1
message type is set to normal
If the action script is set up to simply forward normal messages or page someone if the message count equals 1, then it will work as expected for count messages as well.
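The effect of the lag can be sketched with two small helpers. Both are hypothetical and only restate the rules above: the first reproduces the arithmetic of the CPU example (lag 5, polled every 5 minutes), the second shows the message type and count an output script would see. What the output script sees before the lag is reached, and the use of "greater than or equal" for "reaches", are assumptions of this sketch.

```shell
# Hypothetical helper: approximate minutes of continuous trouble before a
# count message is presented as a normal message.
# $1 = lag, $2 = polling interval in minutes
minutes_to_alarm() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: message type and count as seen by the output script.
# $1 = lag, $2 = count messages received so far for the resource
count_view() {
    if [ "$2" -ge "$1" ]; then
        echo "normal 1"        # lag reached: type normal, count reset to 1
    else
        echo "count $2"        # assumption: still reported as a count message
    fi
}

minutes_to_alarm 5 5
count_view 5 5
```

An output script that pages only when the type is normal and the count (E_COUNT) equals 1 would therefore stay quiet for the first four polls and page on the fifth.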
eEMU processing:
call input script, add/update message, log message, call output script
Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o event -n <emu server>
-p <port> -t <time-to-live> -s 0 -w <password> \
-c null -m "<object ID>"
Notice that the severity and class are set to 0 and null, respectively. It is because there is no use for them in event messages. The message text is just an object ID against which a query will be run.
Example:
an event that job 4 completed so that other dependent jobs can run.
$emsg1 -o event -n emuserver -p 2345 -t 2h -s 0 -c null -w icecream \
-m "job"
Use:
The event message will find best use in synchronizing events across
multiple systems.
Action script:
The output action script is invoked on receipt of an event message.
eEMU processing:
call input script, log message, sleep for a specified number of seconds
Syntax:
$emsg1 [-h <hostname>] [-u <user>] -o suspend -n <emu server>
-p <port> -w <password> -m "<seconds>"
Example:
If we want to synchronize EMU on system A to EMU on system B, we need
to perform a copy operation while system B is not undergoing any changes.
By suspending system B for a short period of time, say 10 seconds, the
database on system B can be copied to system A.
$emsg1 -o suspend -n emuserver -p 2345 -w icecream -m "10"
Action script:
No action script is invoked.
sleep message <backup failed>
backup start
.
backup end
delete message <speedy:backup>
2. cold backup script on speedy (application1 is shut down)
sleep message <backup failed>
mask message <application1>
backup start
.
backup end
delete message <application1>
delete message <speedy:backup>
3. more to be added
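The cold backup outline in scenario 2 could be turned into a script along the following lines. This is a sketch, not a supplied script: `cold_backup` and `send` are hypothetical helpers, the time-to-live of 3h and the class /OS/UNIX/BCK are borrowed from earlier examples in this guide, and the server name, port and password are the usual placeholders.

```shell
# EMSG1 can be overridden (for example with a stub) when testing.
EMSG1=${EMSG1:-emsg1}

# Hypothetical helper: send one message with the connection options filled in.
send() {
    op=$1
    shift
    $EMSG1 -o "$op" -n emuserver -p 2345 -w icecream "$@"
}

# Hypothetical sketch of scenario 2.
# $1 = hostname, rest = backup command
cold_backup() {
    host=$1
    shift
    send sleep -t 3h -s 1 -c /OS/UNIX/BCK -m "backup failed"
    send mask  -t 3h -s 0 -c null -m "application1 is down for cold backup"
    "$@" || return $?      # backup failed: keep the sleep message in place
    send delete -m "$host:application1"
    send delete -m "$host:backup"
}
```

The first word of each message text (backup, application1) is the object ID, so the two delete messages address the same resources that the sleep and mask messages created.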