Title
AIOps Engineer
Quick Summary
OpalStack Reliability is hiring an AIOps Engineer to build intelligent operations that reduce alert noise, detect anomalies early, and trigger safe, automated remediation. You will blend observability, SRE practices, and machine learning to improve uptime, lower MTTR, and give engineers clear, actionable signals. We welcome strong graduates and early-career engineers who have shipped hands-on projects in automation, monitoring, or incident response.
Project Category or Industry
Cloud infrastructure, observability, and intelligent operations for SaaS and enterprise platforms.
Type
Full-time employment.
Experience Level
Entry to mid-level with structured mentorship and progression; experienced applicants are also welcome.
Duration
Permanent role.
Location
Remote-first with optional hybrid collaboration in Chicago and Lisbon. Maintain at least four hours of overlap with teams operating between UTC-6 and UTC+0.
Salary
USD 102,000–150,000 base depending on location and experience, plus benefits and an annual performance bonus.
Payment Mode
Monthly payroll for employees; compliant contractor arrangements are available in select countries.
Hiring Company Name
OpalStack Reliability
Required Skills or Tools
Strong Python or Go, solid understanding of Kubernetes and Linux networking, and practical experience with observability stacks and alerting. Familiarity with anomaly detection techniques, event correlation, and safe automation patterns. Clear written communication to document runbooks, standards, and change logs.
Project Description
OpalStack Reliability builds the platform and automation that keep customer-facing services healthy. As an AIOps Engineer, you will use telemetry to detect emerging incidents, apply correlation to cut through noise, and implement runbooks that remediate common failures automatically. You will partner with SRE and product teams to define service-level objectives, wire up golden signals, and ensure that alerts are specific, debuggable, and actionable.
Core Responsibilities and Expected Deliverables
Design and maintain anomaly detection, trend analysis, and change-point alerts across metrics, logs, and traces.
Implement event correlation and noise-reduction pipelines that preserve high-severity signals while eliminating flapping alerts.
Build self-healing workflows for common failure modes (restarts, rollbacks, cache warmups, traffic shaping, and feature flag toggles).
Define SLOs, error budgets, and golden signals; publish dashboards and weekly reliability reports with clear follow-ups.
Integrate AIOps insights into incident management with context-rich alerts, runbook links, and automated status updates.
Contribute to post-incident reviews and reliability roadmaps; codify learnings as tests, guardrails, and preventative automations.
Required Experience and Preferred Qualifications
Proficiency in Python or Go; strong fundamentals in Linux, containers, and Kubernetes primitives.
Hands-on experience with observability (Prometheus, Grafana, OpenTelemetry, Loki or ELK) and alert routing tools such as PagerDuty or Opsgenie.
Working knowledge of message buses and streams (Kafka/Kinesis) and data stores (Redis/PostgreSQL) for telemetry pipelines.
Preferred: basic ML for time-series and anomaly detection (Prophet, statsmodels, scikit-learn), infrastructure as code (Terraform), GitOps (Argo CD), and security controls for secrets and RBAC.
Evidence of impact through internships, open-source contributions, hackathon projects, or automation you have shipped.
Tools or Platforms to Be Used
Observability and telemetry: Prometheus, Grafana, Loki, Tempo or Jaeger; OpenTelemetry for traces and metrics.
Automation and workflows: Argo Workflows, Kubernetes Jobs/CronJobs, Terraform, GitHub Actions.
Incident and alerting: PagerDuty or Opsgenie, Alertmanager, status page tooling.
Data and analysis: Kafka for event streams, Redis for queues and rate limiting, Python ML libraries for detection models.
Language Requirement
Professional English is required. Additional languages are welcome for cross-regional collaboration.
Communication Style
Written-first culture using design docs and pull requests on GitHub; Slack for daily coordination; Zoom for stand-ups, incident drills, and post-incident reviews. Clear, accessible documentation is expected for all production changes.
Time Commitment or Working Window
Standard 40 hours per week with flexible scheduling. Maintain a predictable daily block that overlaps at least four hours with the core team between 09:00 and 17:00 in your local time. The team shares a light on-call rotation, supported by documented runbooks.
Payment Terms
Salary is paid monthly via payroll. For contractors, invoices are processed on net-30 terms upon acceptance of deliverables and timesheets.
Evaluation Criteria
Portfolio or code samples showing automation, anomaly detection, or observability improvements.
Practical exercise to build a noise-reduced alert pipeline with runbook automation and measurable MTTR gains.
Technical interview covering SLOs, alert design, Kubernetes operations, and safe remediation patterns.
Final conversation on collaboration, communication, and stakeholder management.
References may be requested.
Other Requirements
New hires sign a confidentiality agreement and follow security and data-handling policies. Lightweight time-tracking may be used for distributed coordination. Occasional on-site collaboration for reliability workshops may be required.
About OpalStack Reliability
OpalStack Reliability is a privately held reliability engineering company focused on intelligent operations for cloud-native services. Headquartered in Chicago with a distributed team across North America and Europe, we combine SRE discipline with pragmatic machine learning to deliver stable, transparent platforms. Learn more at https://www.opalstackreliability.com and reach our hiring team at careers@opalstackreliability.com.
