Jacada: Troubleshooting Astro Issues


Summary

Top20 Incidents 06-2018 - Jacada Legacy Article 

Reference

UEM ALARMS

 

1. Fault Manager lost communication to the device. Reason

Description:

 

The Incident generated by the alarm above is one of the most common and straightforward events seen for all UEM managed devices.

In general, it shows that the UEM cannot communicate with the client agent on the device, either because the device is inaccessible or because of SNMP issues on the client device.

 

Example:

 

SZ0140246:Channel 4:Fault Manager lost communication to the device. Reason: Timeout on device response.

 

 

Actions:

 

First try to send a Ping command from the UEM device action menu.

If the device cannot be pinged, it may have failed, may have lost power, may need re-seating, or may need to be replaced.

In this scenario the Incident needs to be dispatched to the field for further investigation.

 

If the device can be pinged but UEM shows it in CommFailure, the device needs to be rediscovered in UEM.

Please escalate to Tier 2 in case none of the actions above fix the issue.
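The actions above can be summarized as a small decision function. This is only a minimal sketch for illustration: the function name, its parameters, and the returned strings are hypothetical and are not part of UEM or any Motorola tool.

```python
def triage_comm_loss(ping_ok: bool, uem_comm_failure: bool) -> str:
    """Suggest the next action for a 'Fault Manager lost communication' Incident.

    ping_ok          -- result of the Ping command sent from the UEM device action menu
    uem_comm_failure -- True if UEM shows the device in CommFailure
    """
    if not ping_ok:
        # Device unreachable: it may have failed, lost power, or need re-seating.
        return "dispatch to field"
    if uem_comm_failure:
        # Device answers the ping but UEM still shows CommFailure.
        return "rediscover device in UEM"
    # Neither documented action applies.
    return "escalate to Tier 2"
```

For example, `triage_comm_loss(True, True)` returns `"rediscover device in UEM"`. If the rediscovery does not clear the alarm, the Incident still goes to Tier 2 as described above.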

 

 UEM example of the alarm:

 

2. Site Link : DOWN, INTERFACE FAILURE

Description: 

 

Incidents of this type are generated when the Active Zone Controller indicates that it has lost one or both of its two connections with a Site Controller at a remote ISR site.

The cause may be that the site is down hard, the site is bouncing, or one of the site controllers has failed.

 

Example:

 

ZC Trunked Site Link 21.1 - The first number indicates the site number and the second number indicates the link number to the site controller.

 

Actions:

 

Check the UEM site events corresponding to the site number:

1. If UEM reports that the whole site is in CommFailure, there is a transport issue and the Incident needs to be dispatched to the backhaul provider

2. If UEM shows one of the two routers reporting WAN Slot Down, one of the links is having transport issues and should be checked with the backhaul provider

3. If UEM shows that only a specific Site Controller is in CommFailure, the Incident should be dispatched to the field

4. Please escalate to Tier 2 if none of the actions above fixes the issue

 

 

 

3. Channel : CRITICAL MALFUNCTION, LINK FAILURE

Description: 

 

The following alarm is generated by a Base Radio under one of the following conditions:

Planned or unplanned activity from the list below:

- Link Failure (the site experiences link issues)

- Preventive Maintenance

- Illegal Carrier

Or a Base Radio module may generate this Incident:

- Rectifier

- Exciter

- Battery

- Power Amplifier

- SRU port on channel bank at SAVE/Prime site

 

Example:  

 

A08D110106:Channel 7:CRITICAL MALFUNCTION, LINK FAILURE

 

Actions:

 

1. Examine all site history events in UEM

2. If the UEM reports that the whole site is in CommFailure, there is a transport issue and the Incident needs to be dispatched to the backhaul provider

3. If the UEM shows an alarm only for that specific Channel, the Incident should be dispatched to the field

4. Please escalate to Tier 2 if none of the actions above fixes the issue

 

 

4. Channel : MINOR FAILED, SCB EXTERNAL REFERENCE

Description:

 

- The TRAK 9100 uses GPS satellite signals to derive a high-precision 1PPS, 5 MPPS, or composite (1PPS + 5 MPPS) reference. These references are provided to all of the site's base radios, comparators, and controllers, so that all site devices involved in audio transmission share a common timing source (GPS)

- The RDM devices provide redundant integrated site reference distribution through two GPS units as timing reference sources to all the base radios at the site, eliminating the need for the TRAK 9100 Simulcast Site Reference (SSR) at the site

 

Actions:

 

When a device is in alarm for "Channel SCB External Reference Missing", first determine which type of device provides the Frequency Reference: RDMs or TRAK.

This can be done by checking which NTP servers are assigned to the device in alarm (Channel, Comparator, or Site Controller).

Follow the steps below:

 

    1. Open the customer's UEM (if you do not know how to open it, go to the Fundamentals script and point to "How to open UEM, find a site, locate device IP Address")

    2. Based on the Site ID provided, navigate to the Site and open the site devices list.

Now you can easily determine the type of the NTP sources configured:

 

     a. If a "GPB 8000 Reference Distribution Module" is present among the site devices (see the RDM example below), the site uses RDM as its external reference source

- RDM NTP Source example:

 

 

     b. If a "TRAK GPS - 9100" is present among the site devices (see the TRAK 9100 example below), the site uses a TRAK unit as its external reference source

- TRAK 9100 NTP Source example:

 

 

What device does the channel obtain its time reference from?

RDM - Reference Distribution Module

TRAK 9100
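The device-list check described above can be sketched in a few lines. This is a hedged illustration only: it assumes the site devices list has already been copied out of UEM as plain strings, and the function name and return values are hypothetical. The two device names are the ones quoted in this article.

```python
RDM_NAME = "GPB 8000 Reference Distribution Module"
TRAK_NAME = "TRAK GPS - 9100"


def reference_source(site_devices: list[str]) -> str:
    """Guess the external reference source from a list of site device names."""
    names = [device.lower() for device in site_devices]
    if any(RDM_NAME.lower() in name for name in names):
        return "RDM"
    if any(TRAK_NAME.lower() in name for name in names):
        return "TRAK 9100"
    # Neither device type was found in the list.
    return "unknown"
```

A list such as `["Site Controller 1", "TRAK GPS - 9100"]` (hypothetical device names) yields `"TRAK 9100"`.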

 

5. Site Link : DOWN, NO ACTIVITY RECEIVED

Description: 

 

Incidents of this type are generated when the Active Zone Controller indicates that it has lost one or both paths to a Console site or Conventional site.

The cause may be that the site is down hard, the site is bouncing, or there is a problem with the site router.

 

Example:

 

Site Link 1005.2:DOWN, NO ACTIVITY RECEIVED - The first number shows the Console site number (1005), the second number indicates the path to the site (2)

Site Link 2042.1:DOWN, NO ACTIVITY RECEIVED - The first number shows the Conventional site number (2042), the second number indicates the path to the site (1)
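To make the numbering in the examples concrete, the sketch below pulls the site number and path number out of a Site Link alarm summary. It assumes the summary always contains the text "Site Link <site>.<path>:" as in the examples above; the function name is hypothetical.

```python
import re

# Matches "Site Link 1005.2:" and captures the site number and the path number.
SITE_LINK_RE = re.compile(r"Site Link (\d+)\.(\d+):")


def parse_site_link(alarm: str) -> tuple[int, int]:
    """Return (site_number, path_number) from a Site Link alarm summary."""
    match = SITE_LINK_RE.search(alarm)
    if match is None:
        raise ValueError(f"no Site Link identifier found in: {alarm!r}")
    return int(match.group(1)), int(match.group(2))
```

Applied to the first example, `parse_site_link("Site Link 1005.2:DOWN, NO ACTIVITY RECEIVED")` returns `(1005, 2)`.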

 

 

Actions:

 

Check the UEM site events corresponding to the Console site number:

1. If the UEM reports that the whole site is in CommFailure, there is a transport issue and the Incident needs to be dispatched to the backhaul provider

2. If the UEM shows an event only for an OP Position, router, or switch, the status of that device needs to be checked:

    a. If the device is in CommFailure, dispatch to the field

    b. If the device is reachable but in alarm, engage Tier 2

3. Please escalate to Tier 2 if none of the actions above fixes the issue

 

 

UEM Alarm Example:

 

 

6. Site : NOT WIDE TRUNKING, SITE CONTROL PATH DOWN

Description: 

 

This alarm is typical when the Zone Controller(ZC) is not able to verify both connections to a remote site.

The usual reason for this incident is a problem with the backhaul transport - the UEM shows the whole site in CommFailure.

Only rarely has the cause been reported to be the core router, site router, site switch, or both site controllers.

It can also be seen during RF site maintenance.

 

 

Example:

 

Site 13:NOT WIDE TRUNKING, SITE CONTROL PATH DOWN - This message shows that the active ZC is not able to reach site 13. 

 

Actions:

 

Check the UEM site events corresponding to the site number.

1. If the ZC reports Not Wide Trunking and the whole site is in CommFailure, there is a transport issue and the Incident needs to be dispatched to the backhaul provider

2. If the site is not in maintenance and the ZC reports Not Wide Trunking but the UEM is able to see the site devices, escalate to Tier 2

 

UEM example of the alarm:

 

7. Channel : MAJOR FAILED, Rx Illegal Carrier

Description:

 

The Illegal Carrier Determination feature allows base radio channels to continue operating with system configurable levels of channel interference. If the RF Threshold Value configured is exceeded, the base radio enters the Illegal Carrier state and generates an Illegal Carrier message to Unified Event Manager (UEM).

 

Actions: 

 

1. Check UEM for alarm history:

- If the alarm occurred fewer than 3 times in 30 minutes, monitor for 30 minutes and resolve the Incident

- If it occurred 3 or more times in 30 minutes, check whether any other channels are currently failed or disabled at this site

  • If yes:

    - If more than 33% of channels are failed, escalate to the field team for investigation

    - If less than 33% of channels are failed, check after 2 hours to see if the alarm has remained clear

  • If no, check after 2 hours to see if the alarm has remained clear

- After a 2-hour check:

  • If the alarm has remained clear, follow the normal case closure procedure

  • If not, dispatch the case to the field team for investigation


3. Check for Incident history in case of chronic site issues
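The alarm-history check in step 1 can be written out as a small decision sketch. It is illustrative only and rests on two assumptions that this article does not state: a site with exactly 33% of channels failed is treated like the "less than 33%" case, and the outcome of the 2-hour recheck is handled the same way on both recheck branches. All names and return strings are hypothetical.

```python
def illegal_carrier_first_step(alarms_in_30_min: int,
                               other_channels_down: bool,
                               failed_channel_pct: float) -> str:
    """First decision for an Rx Illegal Carrier Incident, based on UEM alarm history."""
    if alarms_in_30_min < 3:
        return "monitor 30 minutes, then resolve"
    if other_channels_down and failed_channel_pct > 33:
        # More than a third of the site's channels are failed.
        return "escalate to field team"
    # Assumption: exactly 33% falls through to the recheck branch.
    return "recheck in 2 hours"


def after_two_hour_check(alarm_stayed_clear: bool) -> str:
    """Outcome of the 2-hour recheck."""
    if alarm_stayed_clear:
        return "follow normal case closure"
    return "dispatch to field team"
```

For instance, 4 alarms in 30 minutes with other channels down and 50% of channels failed gives `"escalate to field team"`, while the same alarm count with no other channels affected gives `"recheck in 2 hours"`.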

 

Rebooting the Channel

Rebooting the channel clears the alarm only until the illegal carrier threshold is met again. Restarting the channel is not recommended because it clears the base station logs.

 

The channel can be rebooted if it is believed that this will clear the alarm, but the agent must first collect the base station logs via CSS. The logs must be collected because they may be needed for further investigation.

Note: For "Illegal Carriers" alarms, rebooting the channel only masks the issue.

Please click next for the procedure on how to collect Base Station logs.

 

UEM alarm example:

 

8. Channel : CRITICAL MALFUNCTION, GEN FAILURE

Description: 

 

The following alarm is typically generated by a Base Radio at a Simulcast remote site under one of the conditions below:

Planned or unplanned activity from the list below:

- Link Failure between the Prime and Remote sites

- Activities related to site Preventive Maintenance

- Issues with the network equipment at/between the Prime and Remote sites

- Replacing or rebooting a network device at/between the Prime and Remote sites

- Replacing/Rebooting Comparator at a Prime site

- Replacing/Rebooting a GPS TRAK Unit or RDM at Remote site

- Issues with an X-Hub in a Six Pack Configuration

 

Example: 

 

SZ014012606:Channel 1:CRITICAL MALFUNCTION, GEN FAILURE

 

 

Actions: 

 

1. Check whether work was performed at the site (CRQ)

2. Examine the site's history events in UEM

3. Check which device first reported the alarm

 - If the alarm is from a router and is WAN port related, this is a transport issue and needs to be dispatched to the backhaul provider between the Prime and Remote sites

 - If the alarm is from a router but is not WAN port related, collect the details and communicate the findings to Tier 2 for further actions (dispatch to the field)

 - If the alarm is from a site switch, collect the details and communicate the findings to Tier 2 for further actions (dispatch to the field)

 

9. I/P : NICE MCC7500 Monitoring Service

If the MCC7500 is in alarm for No New VOIP Packets, or a customer calls in stating that their Logger is not recording, check the AIS to verify that it is recording.

 

Actions:

 

Please click next to take you to the AIS troubleshooting page.

 

10. Channel : OUT OF WIDE SERVICE

This alarm is reported by the Zone Controller (ZC) when a specific RF channel is not available because it is unreachable or disabled. It can also be seen during RF site maintenance.

 

Example:

 

Site 94 at zone2 Trunked Site Channel 1 OUT OF WIDE SERVICE, NOT ENABLED FROM SITE - Example of disabled Channel

Site 94 at zone2 Trunked Site Channel 1 OUT OF WIDE SERVICE,  NO RFA FROM SITE - Example when the Channel is not ready or down.
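The two reason strings in the examples above can be mapped to their meaning with a simple lookup. This is a minimal sketch that covers only the two reasons quoted in this article; the function name and the wording of the returned descriptions are hypothetical.

```python
# Reason text as it appears in the alarm summary, mapped to the meaning given above.
REASONS = {
    "NOT ENABLED FROM SITE": "channel is disabled",
    "NO RFA FROM SITE": "channel is not ready or down",
}


def explain_out_of_wide_service(alarm: str) -> str:
    """Return a short explanation for an OUT OF WIDE SERVICE alarm summary."""
    # Collapse repeated whitespace so "SERVICE,  NO RFA" still matches.
    normalized = " ".join(alarm.upper().split())
    for reason, meaning in REASONS.items():
        if reason in normalized:
            return meaning
    return "unrecognized reason"
```

Any reason text other than these two returns `"unrecognized reason"` and should be read directly in UEM.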

 

Actions:

 

Check the UEM site events corresponding to the site/channel number.

1. Check if there is ongoing site maintenance. There should be a CRQ created and the incident should be related to it

2. Check for site events pointing to transport issues. In this scenario the active ZC will show the site and all the channels Out Of Wide Service

3. When a single channel reports the alarm and there is no site maintenance, try to ping the channel from UEM

 - If the channel replies, talk to Tier 2 for further actions

 - If the channel does not reply, dispatch a shop to inspect the channel on site

 

 

UEM example of the alarm:

 

11. CCGW : DOWN, CCGW

This type of Incident is generated when the Active Zone Controller loses its path to the CCGW device.

The most common reason for this alarm is one of the following:

- Transport issues between the CCGW and the master site

- Master Site Zone Controller Swap

- All channels are disabled or not present

 

Example:

 

ccgw01.site25.zone3:DOWN, CCGW-ZC Connection Lost. - No Activity from Site 2055, Control

 

Actions:

 

1. Check if there is ongoing site maintenance. There should be a CRQ created and the incident should be related to it

2. Check in UEM whether the link to the site is bouncing, or check for any other events pointing to transport issues. In this scenario a case needs to be opened with the backhaul provider

3. When a CCGW reports an alarm and there is no site maintenance or transport issue, try to ping it from UEM:

 - If the CCGW replies, talk to Tier 2 for further actions

 - If the CCGW is unreachable, dispatch a shop to inspect it

 

12. OP : DOWN

This Incident is typical when a Console has been rebooted or has had its Elite application closed.

 

Example:

 

SZ01A41D14:OP 4:DOWN, RESET - The console has been reset.

SZ913E1D40:OP 4:DOWN, NO STATUS - The Elite application is currently down or has been recently exited/launched.

 

 

 

Actions:

 

1. Check if there is ongoing site maintenance. There should be a CRQ created and the incident should be related to it

2. Check in UEM to see if:

 - only one console reported the alarm - Try to contact the console operator/field and ask whether the console was closed/rebooted intentionally. If not, there may be an application issue that needs to be communicated to Tier 2

 - all the consoles reported the alarm - Check for other, more descriptive site events, and also check for power outages

3. Escalate to Tier 2 if the operator has confirmed the Elite application is running but UEM still shows the alarm

 

UEM Example:

 

13. Site : OUT OF WIDE SERVICE

Description: 

 

This alarm is typical when the Zone Controller(ZC) is not able to verify both connections to a remote site.

The usual reason for this incident is a problem with the backhaul transport - the UEM shows the whole site in CommFailure.

Only rarely has the cause been reported to be the core router, site router, site switch, or both site controllers.

It can also be seen due to RF site maintenance.

 

 

Example:

 

Site 13:OUT OF WIDE SERVICE, NO RFA FROM SITE - This message shows that the active ZC is not able to reach site 13.

 

Actions:

 

Check the UEM site events corresponding to the site number.

1. If the ZC reports OUT OF WIDE SERVICE and the whole site is in CommFailure, there is a transport issue and the Incident needs to be dispatched to the backhaul provider

2. If the site is not in maintenance and the ZC reports OUT OF WIDE SERVICE but the UEM is able to see the site devices, escalate to Tier 2

 

UEM example of the alarm:

 

14. Motorola IVD PDR Shared Links Down, RNG

An incident with this summary is opened when UEM generates an alarm stating that the RNG (Radio Network Gateway) cannot reach a specific remote site because the link between them went down.

The most common cause of this alarm is either that the Site Link to the remote site is down, or that the active Site Controller has a problem and cannot maintain the connection properly.

 

Example:

 

SZ0140318:Motorola IVD PDR Shared Links 1 at zone3:Down, RNG-Base Site connection failure 

 

 

Actions:

 

1. Check if there is ongoing site maintenance. There should be a CRQ created and the incident should be related to it.

2. Check UEM for site/core router WAN port events before the "RNG-Base Site connection failure" alarm appears:

 - If you see such events, this should be handled as a transport issue

 - If UEM shows the site connection is stable, check the active site controller for further events

    a. If UEM shows that the site controller is in CommFailure and it is not pingable, dispatch the Incident to the field

    b. If the site controller is reachable but shows errors, ask Tier 2 whether it should be dispatched to the field

3. Please escalate to Tier 2 if none of the actions above fixes the issue

 

UEM alarm example:

 

15. Site Link : NOT IN USE, CONSOLE DISCONNECT

This Incident is typical when the active Zone Controller has lost its connection to the console maintaining the Link-OP (the Console Site link with the Zone Controller).

This UEM alarm is triggered due to one of the following reasons:

 - Preventive Maintenance

 - Upgrading/Patching work performed on the console site

 - Transport related issues

 - The console handling the Link-OP has been rebooted, intentionally or by accident

 

Example:

 

SZ044A2D4:Site Link 1004.2:NOT IN USE, CONSOLE DISCONNECT - 10.2.233.100

 

 

Actions:

 

1. Check if there is ongoing site maintenance/upgrade/patching work. There should be a CRQ created and the incident should be related to it

2. Check all UEM events for that console site and verify whether there are transport related issues

3. Please escalate to Tier 2 if none of the actions above fixes the issue

 

UEM Example:

 

16. Channel : INITIALIZATION, RESET

A Channel : INITIALIZATION, RESET incident is created when a Base Radio has been rebooted:

 - Intentionally - By site maintenance or Software upgrade

 - When a Hardware or Software problem has occurred

 

 

Example:

 

A05AA1149:Channel 3:INITIALIZATION, RESET

 

 

Actions:

 

1. Check if there is ongoing site maintenance. There should be a CRQ created and the incident should be related to it

2. Check all UEM events for that site and verify whether there was a power outage

3. If the Base Radio is restarting continuously, dispatch the Incident to the Field

4. If the alarm was not caused by, or is not like, any of the above, please communicate your findings with Tier 2 for further actions

 

17. VPM : INITIALIZING, NO REASON

This Incident is generated during the VPM (Voice Processing Module) boot process.

The UEM alarm itself can be seen as part of a specific set of events (see the example below), right before the VPM comes up and the alarm is cleared.

In most cases it clears by itself, but if you notice that it comes back within a short period of time, further actions need to be performed.

It is usual during:

 - Preventive Maintenance - Check for CRQ created

 - Upgrading/Patching work performed on the console site - Check for CRQ created

 - An unexpected site power outage

 

If the incident is generated under conditions other than those mentioned above:

 - Most likely the device is in a never-ending boot loop and may need to be dispatched to the Field for hardware/software inspection

 

Example:

 

 

 

Actions:

 

1. Verify whether there is ongoing work on the site

2. Check the site UEM events for similar Initialization events from other VPMs/devices that point to a power issue

3. For any other scenario, please communicate the findings to Tier 2 for dispatch to a Field Technician

 

18. CCGW : MINOR MALFUNCTION, BETWEEN

This Incident is generated when a CCGW loses connectivity to some or all of its provisioned conventional channels.

It usually happens due to one of the following reasons:

 - Preventive Maintenance - Check for CRQ created

 - Upgrading/Patching work performed on that site -  Check for CRQ created

 - The channels are disabled or not available

 - The CCGW cannot reach the channels due to transport issues

 

 

Example:

 

SZ01A42D19:C3:MINOR MALFUNCTION, BETWEEN 100% and 50% (INCLUSIVE) CHANNELS AVAILABLE - More than half channels are up : Up:4 ,Partial:0 ,Dn:4from Site 2297
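The channel counts at the end of the example can be turned into a percentage to see which availability band the alarm refers to. The sketch below assumes the counts always appear as "Up:<n> ,Partial:<n> ,Dn:<n>" as in the example, and it counts partial channels in the total but not as available; how the CCGW itself treats partial channels is not stated in this article. The function name is hypothetical.

```python
import re

# Captures the three counters from text such as "Up:4 ,Partial:0 ,Dn:4from Site 2297".
COUNTS_RE = re.compile(r"Up:\s*(\d+)\s*,\s*Partial:\s*(\d+)\s*,\s*Dn:\s*(\d+)")


def channel_availability(alarm: str) -> float:
    """Return the percentage of fully available channels reported in the alarm text."""
    match = COUNTS_RE.search(alarm)
    if match is None:
        raise ValueError(f"no channel counters found in: {alarm!r}")
    up, partial, down = (int(value) for value in match.groups())
    total = up + partial + down
    if total == 0:
        raise ValueError("alarm reports zero channels")
    # Assumption: partial channels count toward the total but not as available.
    return 100.0 * up / total
```

For the example above (4 up, 0 partial, 4 down) the result is 50.0, which sits at the lower, inclusive edge of the "BETWEEN 100% and 50%" band named in the alarm.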

 

 

Actions:

 

1. Verify if there is an ongoing PM / Patching / Upgrade work at the CCGW site location and / or remote conventional channels - Check for CRQ created

2. Dispatch the Incident to a Field Technician for further investigation

 

UEM Example:

 

19. OP : CRITICAL MALFUNCTION, LINK FAULT

This Incident is created when a Console cannot verify its link with the VPM (Voice Processing Module).

It usually happens due to one of the following reasons:

 - Console Site Preventive Maintenance - Check for CRQ created

 - Console Site Upgrading/Patching work -  Check for CRQ created

 - VPM connected to that Console has been restarted

 - Bad port on the site switch where the VPM is connected

 - Bad VPM Ethernet port

 - Bad Ethernet cable between the site switch and the VPM

 - A bad VPM

 

 

Example:

 

A019D1D1:OP 3:CRITICAL MALFUNCTION, LINK FAULT

 

 

Actions:

 

1. Verify if there is an ongoing PM / Patching / Upgrade work at the Console site - Check for CRQ created

2. Check whether the UEM shows more than one console at the site reporting the same alarm. In this scenario the problem might be with the site switch. Share your findings with Tier 2 for further investigation

3. If the alarm is latched in UEM, try to ping the VPM from there:

 - If the ping is successful, talk to Tier 2 to check the VPM/Console configuration

 - If the VPM does not respond, dispatch a Field Technician to check the device on site

 

UEM Example:

 

20. ZC: Path Down, Communication Failure

An Incident with this summary is created when one or both Control Paths are not seen by the Zone Controller reporting the alarm.

It usually happens during:

 - Master Site upgrades

 - Linux Patching

 - Master Site Maintenance

 - A restart of one of the Zone Controllers - the other ZC generates the alarm showing the missing HA Path to the restarted ZC

 

Example:

 

A05AA1:zc01.zone1:Path Down, Communication Failure - 10.1.233.100

 

 

Actions:

 

1. Verify if there is an ongoing Preventive Maintenance, Patching or Upgrade work at the Master site location - Check for CRQ created

2. Analyze the UEM events sourced from the Master Site, around the time of the alarm. Check if the ZC events are isolated or something abnormal is going on with the whole site

3. Communicate the findings to Tier 2 and dispatch to the Field for local site examination. Advise the Field to check the cables and VMS Ethernet ports where Control Path 1 (VLAN 231) and Control Path 2 (VLAN 232) are connected

 

 

UEM Alarm Example:

 

21. Licensing Service at ATR or UCS

This Incident indicates that the link between the ATR and LM or between UCS and UNC is down or the servers are not responding.

 

This alarm may appear during one of the events below:

 - Master Site upgrades

 - Linux Patching

 - Master Site Maintenance

 - The LM is restarted (the ATR reports the alarm)

 - The UNC is restarted (the UCS reports the alarm)

 

Example:

 

A05AA1:Licensing Service at atr02.zone1:Link is down, Service is unreachable

SZ01CE3:Licensing Service at ucs01.ucs:Link is down, Service is unreachable

 

 

Actions:

 

1. Verify if there is an ongoing Preventive Maintenance, Patching or Upgrade work at the Master site location - Check for CRQ created

2. Analyze the UEM events sourced from the Master Site, check for events related to the following servers and devices: UCS, UNC, ATR, LM, Core Lan Switch, Gateway Routers.

 

3. This alarm should be treated as a low priority one since it does not affect the system functionality.

4. If the alarm does not clear immediately, wait for 2 hours and check it again.

 - If the alarm is still there, talk to Tier 2 during business hours for further actions

 - If the alarm is clear, resolve the Incident
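The wait-and-recheck rule in step 4 can be sketched as follows. This is a hedged illustration: the function name, parameters, and return strings are hypothetical, and holding the Incident until business hours when the alarm persists outside them is an interpretation of the step, not something this article states explicitly.

```python
def licensing_next_action(alarm_clear: bool,
                          hours_since_alarm: float,
                          business_hours: bool) -> str:
    """Suggest the next action for a Licensing Service alarm at ATR or UCS."""
    if alarm_clear:
        return "resolve"
    if hours_since_alarm < 2:
        # The 2-hour waiting period has not elapsed yet.
        return "wait and recheck"
    if business_hours:
        return "talk to Tier 2"
    # Assumption: a persistent alarm outside business hours waits for Tier 2.
    return "hold until business hours"
```

An alarm that is still active after 3 hours during business hours gives `"talk to Tier 2"`.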