Title: Site Reliability Engineer (SRE)
Quick Summary
Tidecrest Reliability is hiring a Site Reliability Engineer to harden uptime, tune performance, and remove toil across containerized, multi-region cloud services. You will champion service level objectives, automate safe deployments, and improve observability so product teams can ship confidently. We welcome strong entry-level candidates with solid fundamentals alongside experienced engineers who enjoy mentorship and pragmatic operations.
Project Category or Industry
Cloud platform engineering for data-driven B2B SaaS
Type
Full-time employment
Experience Level
Entry-level to mid-level, with structured mentorship and clear growth paths for motivated recent graduates who demonstrate strong systems fundamentals
Duration
Permanent role
Location
Remote-first across the Americas, EMEA, and APAC with a minimum of 4 hours of overlap between 09:00 and 18:00 UTC; optional hub days in Seattle and Belfast
Salary
USD 75,000–115,000 base depending on location and experience, plus annual performance bonus and comprehensive benefits
Payment Mode
Monthly payroll via bank transfer; contractor arrangements available where local employment is not supported
Hiring Company Name
Tidecrest Reliability
Required Skills or Tools
Comfort with Linux, containers, and networking; experience with Kubernetes, infrastructure as code, CI/CD, and modern observability stacks; ability to write automation in a language such as Python, Go, or Bash; clear written communication and a bias toward measurable outcomes.
Project Details
Project Description
You will join the reliability group responsible for the health, performance, and delivery safety of Tidecrest's customer-facing services. The team builds paved roads for product squads (secure defaults, fast feedback, and great tooling) so features reach customers without trading off stability. The work blends greenfield automation with iterative hardening of existing systems.
Core Responsibilities and Expected Deliverables
Define and track SLIs and SLOs for critical services; drive error budget policy and reliability reviews.
Build and maintain CI/CD pipelines with progressive delivery, canary analysis, and automated rollback.
Operate and optimize Kubernetes clusters, including autoscaling, network policies, ingress, and service mesh where appropriate.
Implement end-to-end observability: metrics, logs, and traces; create actionable dashboards and alerts tied to user impact.
Reduce toil through runbooks, incident tooling, and self-service platform capabilities; champion post-incident learning.
Deliver well-scoped pull requests with rollout plans, security checks, and documentation.
Required Experience and Preferred Qualifications
Foundation in Linux internals, TCP/IP networking, DNS, and containerization.
Exposure to distributed systems concepts, caching, queues, and backpressure.
Nice to have: GitOps (Argo CD or Flux), policy as code (Open Policy Agent), cost visibility (FinOps), and chaos or load testing.
Awareness of secure supply chain practices, secrets management, and compliance frameworks such as SOC 2.
Certifications are welcome but not required.
Tools or Platforms to Be Used
Cloud: AWS or GCP (EKS/GKE, IAM, VPC, S3/GCS, CloudFront/Cloud CDN, RDS/Cloud SQL)
Orchestration and packaging: Kubernetes, Helm, Kustomize, container registries with image scanning and signing
Infrastructure as code: Terraform or Pulumi; secret management via Vault or cloud KMS
CI/CD: GitHub Actions or GitLab CI with artifact promotion and automated release gates
Observability: Prometheus, Grafana, Loki, Tempo or OpenTelemetry, Alertmanager or PagerDuty
Security: Trivy or Grype for image scanning, Sigstore/Cosign signing, baseline runtime policies
Language Requirement
English is required for daily collaboration; additional languages are a plus but not required.
Communication Style
Asynchronous-first via Slack and GitHub with weekly Zoom stand-ups, design reviews, and post-incident retrospectives. Lightweight RFCs in Notion capture decisions and trade-offs.
Time Commitment or Working Window
Approximately 40 hours per week with flexible scheduling; core collaboration windows target late morning to late afternoon UTC. Participation in a compensated on-call rotation begins after onboarding and shadowing.
Payment Terms
Monthly salary with an annual performance review and bonus eligibility. For contractors, milestone-based deliverables with biweekly invoicing and net-15 payment terms.
Evaluation Criteria
Demonstrated automation and infrastructure-as-code skills
Practical problem-solving in a time-boxed take-home focused on CI/CD and reliability trade-offs
Collaboration during a live pairing session and clarity explaining failure modes and mitigations
Communication, ownership, and reliability assessed through interviews and references
Evidence of observability-driven operations and security-minded practices
Other Requirements
Standard NDA upon offer acceptance, identity verification, and reference checks compliant with local laws. Light-touch time tracking for contractors. Adherence to our secure development lifecycle, change management, and incident response guidelines.
About the Company
Tidecrest Reliability is a remote-first engineering company that builds resilient platforms for data-intensive SaaS products. Founded in 2019, we operate small, autonomous teams that value pragmatism, measurable outcomes, and a strong reliability culture. Our engineers are distributed across North America and Europe with collaboration hubs in Seattle and Belfast. Learn more at https://tidecrestreliability.com or contact careers@tidecrestreliability.com.
