Key Responsibilities:
Monitoring Solution Deployment and Configuration:
- Design, implement, and configure monitoring solutions (e.g., Dynatrace, AppDynamics, Datadog, Splunk, SolarWinds) for application, infrastructure, and network monitoring.
- Set up monitoring agents, data collection, and alerting systems for critical IT assets, both on-premises and in the cloud.
- Integrate monitoring systems with existing IT operations tools (e.g., ITSM, ticketing, and incident management systems) to automate alerts and incident responses.
System and Application Monitoring:
- Develop and configure comprehensive monitoring strategies for all layers of IT systems, including applications, hardware, operating systems, databases, and networks.
- Ensure key metrics are captured, including system uptime, performance, transaction latency, resource utilization, error rates, and other business-critical indicators.
- Monitor on-premises and cloud environments (AWS, Azure, Google Cloud) as well as containerized applications (Docker, Kubernetes) to ensure optimal performance and resource utilization.
Alerting and Incident Management:
- Design and implement alerting systems based on thresholds and key performance indicators (KPIs) to notify stakeholders of potential issues.
- Investigate, troubleshoot, and resolve monitoring-related incidents, ensuring minimal system downtime and performance degradation.
- Work with operations, development, and support teams to resolve system issues and improve the effectiveness of the monitoring solution.
Data Analytics and Reporting:
- Collect and analyze monitoring data to identify trends, patterns, and potential risks to system availability and performance.
- Develop custom dashboards and reports to visualize system performance, availability, and incident trends.
- Provide actionable insights to improve system reliability and optimize resource allocation.
Continuous Improvement and Automation:
- Continuously evaluate and improve monitoring coverage, ensuring new systems, applications, and technologies are properly monitored.
- Automate monitoring tasks and workflows, including alert escalation, ticket creation, and issue resolution.
- Keep abreast of industry best practices, new monitoring tools, and emerging technologies to enhance monitoring capabilities.
Collaboration and Cross-Functional Support:
- Collaborate with infrastructure, DevOps, and security teams to ensure end-to-end visibility across the entire IT ecosystem.
- Provide training and support to internal teams on how to interpret monitoring data and respond to alerts.
- Act as a subject matter expert (SME) for monitoring systems, providing guidance on monitoring tool configuration, optimization, and troubleshooting.
Documentation and Knowledge Management:
- Document monitoring system configurations, workflows, and processes for future reference and training.
- Maintain a knowledge base of common issues, alerts, and resolutions to improve incident management efficiency.
Requirements:
- Proven experience in deploying, configuring, and managing monitoring solutions (e.g., Dynatrace, AppDynamics, Datadog, Splunk, SolarWinds).
- Hands-on experience with monitoring applications, systems, cloud environments (AWS, Azure, GCP), and containerized services (Docker, Kubernetes).
- Experience with automated monitoring and incident response workflows, including integration with ITSM or ticketing systems.
- Familiarity with application and infrastructure monitoring, including networks, servers, and storage systems.
Qualifications:
- Bachelor's or Master's degree in Computer Science or a related field.
- Strong communication skills
- Experience: 1-2 years
- Timing: Rotational shifts (morning 9 am to 5:30 pm, evening, and night)
- Working Days: Monday to Friday
Benefits:
- Market-competitive salary
- Ongoing training and professional development opportunities.
- A collaborative and dynamic work environment.
- Medical Insurance
- Employee Provident Fund
- EOBI
- Mobile SIM
Skills:
SolarWinds, Datadog, DevOps Management, Splunk