Power & Utilities
Name
PG&E: Streamlining PI System O&M by Centralizing PI Message Logs
Description

At Pacific Gas and Electric [PG&E], we have streamlined our PI System operations, management, and health monitoring by centralizing all of our PI message logs in Splunk, building a series of reports and automated alerts, and creating a workflow to review error messages daily. We can now quickly and easily search all the logs for any event or series of events across multiple collectives, systems, and timeframes.
The solution helps drive our Daily Operational Review [DOR] and allows us to proactively detect issues and errors, review and optimize performance, and ensure minimal downtime for our users. Being able to search all system logs in one place helps us localize and rapidly recover from unexpected events and outages, and it streamlines the process of gathering logs for troubleshooting and technical support calls.
We are able to quickly and efficiently surface pending and critical errors across all PI Servers using a single tool and simple queries, which enhances our ability to diagnose and respond swiftly to existing and potential problems. We will also discuss the future enhancements we have planned to gain equivalent visibility into the health of PI services that use the Windows Event log, and our eventual transition to the ELK stack.
Consolidating message logs allows us to monitor for application errors and events that are otherwise not reflected in the health counters or visible to users. The added value is that we can catch critical [and pending] events, revealed only in the logs, that show PI components in a bad state even though they outwardly appear to be running. And, because we have a highly available infrastructure, individual components and services can become unresponsive or compromised while nothing appears wrong from our users' perspective. This solution also allowed us to discover a bug in the archive registration process and submit the details AVEVA needed to home in on the bug and ultimately release a fix.
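As a rough illustration of the idea described above (merging message logs from several servers into one place and surfacing errors with a simple query), the following Python sketch may help. It is not our Splunk configuration: the pipe-delimited line format, the server names, the severity labels, and every function name here are hypothetical and chosen only for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class LogEntry:
    server: str
    timestamp: datetime
    severity: str
    source: str
    message: str


def parse_line(server: str, line: str) -> Optional[LogEntry]:
    """Parse one line in an ASSUMED format: 'ISO-timestamp|severity|source|message'."""
    parts = line.strip().split("|", 3)
    if len(parts) != 4:
        return None  # skip lines that do not match the assumed format
    raw_ts, severity, source, message = parts
    try:
        timestamp = datetime.fromisoformat(raw_ts)
    except ValueError:
        return None
    return LogEntry(server, timestamp, severity.strip().capitalize(),
                    source.strip(), message.strip())


def centralize(logs_by_server: Dict[str, Iterable[str]]) -> List[LogEntry]:
    """Merge per-server log lines into one list ordered by time."""
    entries = []
    for server, lines in logs_by_server.items():
        for line in lines:
            entry = parse_line(server, line)
            if entry is not None:
                entries.append(entry)
    entries.sort(key=lambda e: e.timestamp)
    return entries


def search(entries: Iterable[LogEntry],
           severities: Iterable[str] = ("Error", "Critical"),
           start: Optional[datetime] = None,
           end: Optional[datetime] = None,
           text: Optional[str] = None) -> List[LogEntry]:
    """Filter the merged log by severity, time window, and message text."""
    wanted = set(severities)
    needle = text.lower() if text else None
    hits = []
    for e in entries:
        if e.severity not in wanted:
            continue
        if start is not None and e.timestamp < start:
            continue
        if end is not None and e.timestamp > end:
            continue
        if needle is not None and needle not in e.message.lower():
            continue
        hits.append(e)
    return hits


def daily_summary(entries: Iterable[LogEntry]) -> Counter:
    """Count entries per (server, severity), e.g. as input to a daily review."""
    return Counter((e.server, e.severity) for e in entries)
```

Calling `centralize` on a dictionary of server name to log lines, then `search` on the result, returns matching entries from every server in time order, and `daily_summary` turns those matches into per-server counts of the kind a daily review might start from.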