Senior DevOps Engineer
Are you a talented Senior DevOps Engineer looking for a remote job that lets you show your skills and get decent compensation? Look no further than Lemon.io, the marketplace that connects you with hand-picked startups in the US and Europe.
What we offer:
- The rate depends on your seniority level, skills and experience. We’ve already paid out over $11M to our engineers.
- No more hunting for clients or negotiating rates: let us handle the business side of things so you can focus on what you do best.
- We’ll manually find the best project for you according to your skills and preferences.
- Choose a schedule that works best for you. It’s possible to communicate async or minimally overlap with the team’s working hours.
- We respect your seniority so you can expect no micromanagement or screen trackers.
- Communicate directly with the clients. Most of them have technical backgrounds. Sounds good, yeah?
- We will support you from the time you submit the application throughout all cooperation stages.
- Most of our projects involve working in a fast-paced startup environment. We hope you like it as much as we do.
- Through our community, we will connect you with the best developers from more than 71 countries.
We have several openings for Senior DevOps Engineers, with two types of requirements.
Requirements for DevOps with Azure DevOps:
- 4+ years of experience as a DevOps Engineer
- At least 3 years of experience with Azure DevOps
- Experience in at least 2 commercial projects using Microsoft Azure or Kubernetes is required
Requirements for DevOps with AWS/GCP/SQL:
- 4+ years of experience as a DevOps Engineer
- At least 3 years of experience with AWS, GCP, or SQL
- At least 3 years of commercial experience with Python
- Experience in at least 2 commercial projects using Terraform or Kubernetes is required
- Strong technical skills: as a Senior DevOps, you are expected to be able to create projects from scratch and have a deep understanding of application architecture.
- Clear and effective communication in English: advanced ability to discuss business tasks, justify decisions, and communicate issues. Good self-presentation is also essential for upcoming client calls.
- Strong self-organizational skills: ability to work full-time remotely with no supervision.
- Reliability: we want to trust you and expect that you won’t let us or the client down.
- Adaptability and flexibility: the ability to onboard the project promptly after accepting it and start delivering results quickly.
Sound good to you? Apply now and join the Lemon.io community!
NOT YOUR TECH STACK?
We have multiple projects available for Senior Developers. If you have 4+ years of commercial software development experience and are proficient in any of the following areas and roles: AI Agent Architect, AI Automation Architect, React & Ruby, PHP & Angular, PHP & Vue, Vue & Node.js, React & .NET, Android & iOS, Angular & .NET, Angular & Node.js, Vue & .NET, Python & Vue, MLOps, React & Java, Data Science, Blockchain (Web3/Solidity/Solana), Symfony & React, Symfony & Vue, Symfony & Angular, Symfony & JavaScript & Next.js & TypeScript, Data Analysis, React & PHP, Data Engineering, AI Engineering, Data Annotation, React & Node, React & Golang, React & Python, Golang, Python & Flask, Svelte & Python, Svelte & Node, Svelte & TypeScript, Rust, Shopify & JavaScript, Vue & Nuxt, Python & Node, Angular & TypeScript, Ruby & Ruby on Rails, React Native & Ruby, React Native & Python, PHP & Laravel, .NET & C#, Java & Spring, Unreal Engine & C++, Python & LLM, Unity, Machine Learning Engineering, we’d be happy to connect and match you with a suitable project.
If your experience matches our requirements, be ready for the next steps:
- VideoAsk: watch a short video about our startup (up to 10 minutes)
- Complete your profile on our website
- Intro call
- Technical interview
- Feedback
- Magic Box (we are looking for the best project for you).
We do not provide visa assistance, and our cooperation model does not include the benefits typically offered with direct hire.
P.S. We work with developers from 71+ countries in different regions: Europe, LATAM, the U.S. (if you are an owner of W-9 ben form), Canada, Asia (Japan, Singapore, South Korea, Philippines, Indonesia), Oceania (Australia, New Zealand, Papua New Guinea), and the UK. However, we have some exceptions.
At the moment, we don’t have a legal basis to accept applicants from the following countries:
- European: Hungary, Iceland, Liechtenstein, Kosovo, Belarus, Russia, and Serbia.
- Latin America: Cuba and Nicaragua
- Most Asian countries and Africa.
We expand and shorten the list of exceptions regularly.
Backend Engineer
Paperpile runs on data at scale, with a literature database of 250M+ academic papers and a growing body of user data accumulated over more than a decade. You’ll work across the systems that ingest, process, store, and serve this data reliably: building pipelines, optimizing search, handling PDFs at scale, and exposing clean APIs.
Requirements
- Strong backend engineering background with experience building and operating data-heavy systems in production.
- Experience deploying and operating services on AWS.
- Experience designing and maintaining data ingestion pipelines handling messy, heterogeneous sources. Comfortable with web scraping and working with third-party data sources and APIs.
- Familiarity with Node.js and TypeScript. It’s fine if you come from a different background, such as Java or Python, but you should be comfortable working in this environment.
- High standards for data quality. You think carefully about correctness, deduplication, and consistency.
- Solid understanding of full-text search systems including indexing strategy, relevance tuning, and query optimization.
- Proficient in building reliable REST APIs.
More useful experience
- Familiarity with academic publishing formats and data sources (PubMed, Crossref, arXiv)
- Experience with PDF processing pipelines (extraction, transformation, storage and delivery at scale).
- Experience with LLM-based document processing or ML pipelines for extracting structured data from unstructured text.
- Large-scale web crawling and scraping.
Compensation
- Base compensation 60,000–90,000 based on the level of your experience
- Bonus/equity program.
Senior Full Stack Engineer
We use React and TypeScript across all our product lines: web app, browser extensions, iOS and Android apps (React Native), and desktop apps (Electron).
You’ll take ownership of substantial parts of our codebase, shipping polished UI and the connected backend services for new features and products.
You’ll write and maintain frontend (Playwright) and backend tests, and help keep our CI/CD pipelines healthy.
Requirements
- Deep understanding of React and a track record of building complex, production React applications
- Able to deliver pixel-perfect, production-ready code from Figma mockups on time
- Strong eye for detail and a dedication to creating fast, enjoyable user interactions
- Strong knowledge of TypeScript and its ecosystem (Babel, Webpack, Jest, Express) and underlying web technologies (HTML5, CSS3, REST APIs)
- Solid backend skills in Node.js: comfortable building and consuming APIs, implementing algorithms, and building data-heavy processing workflows.
- Experience running production backend services reliably and efficiently at scale
- Experience writing complete, robust and maintainable tests with Playwright or similar frameworks.
More useful experience
- React Native for iOS and Android
- Browser extension development (Chrome/Chromium, Safari, Firefox)
- Desktop app development on Windows and macOS (Electron and natively).
- Web crawling
- PDF and document processing
- Server operations on AWS
Compensation
- Base compensation 60,000–90,000 based on the level of your experience
- Bonus/equity program.
Senior Software Engineer
Who We Are:
Interra Health is a fast-growing healthcare technology company transforming how providers and patients navigate the prescription journey. Formed through the merger of DoseSpot, Arrive Health, and pVerify, Interra Health delivers trusted eligibility, real-time coverage and pricing insights, prescribing tools, and pharmacy transparency at the point of care, helping providers make informed decisions and patients access the right medications with greater clarity and affordability. Backed by strong market momentum and a bold vision for the future of connected care, Interra Health offers the chance to join an innovative, mission-driven team working at the intersection of software and healthcare to reduce friction, improve access, and make the healthcare experience better for everyone.
The Role:
As a Senior Software Engineer, you will play a critical role in advancing the Arrive Health Network platform, a core part of Interra Health’s ecosystem that powers real-time prescription decision-making and cost transparency at the point of care.
You will help guide the design and development of scalable, high-impact systems used by providers and care teams to deliver more accessible and affordable treatment options. This role combines hands-on engineering with cross-functional partnership: actively collaborating on complex initiatives, contributing to architectural decisions, and working closely with product, platform, and engineering teams to ensure our platform evolves with reliability and integrity at scale.
Key Responsibilities:
Design and implement complex greenfield projects used directly by providers and clinical staff
Partner with engineering and product leadership on planning, prioritization, and execution
Maintain and evolve backend systems and tools used by internal clinical and operational teams
Integrate with electronic health records (EHRs) and external partner APIs
Contribute to platform, monitoring, and infrastructure efforts in partnership with Platform Engineering (AWS, Terraform, Docker, DataDog)
Build solutions with attention to system interoperability, scalability, and long-term maintainability
Troubleshoot and resolve production issues, including participating in on-call rotations
Mentor engineers and help elevate technical standards on the team
Help on-board new engineers, teaching them how to use and monitor our pipeline
Expectations:
Contribute to team-wide technical initiatives that span multiple systems
Develop deep expertise in the healthcare and prescription coverage domain, and use that knowledge to inform architectural decisions by anticipating future needs
Maintain and extend our existing system using Agile practices including TDD, pair programming, and radically collaborative development
Identify cross-cutting problems and suggest solutions (shared services, tooling, architecture)
Actively participate in collaborative efforts across team and functional boundaries, particularly with Product, to solve shared problems and contribute to company-wide goals
Help evolve and uphold engineering standards, documentation, and team norms
Stay current with modern development practices and tooling, and contribute to evolving team workflows, such as agentic AI workflows
Close collaboration with other lead engineers and product management on planning and execution in a remote-first environment
What You’ll Bring:
Experience & Background
5+ years of software development experience (or equivalent combination of education and experience)
Strong experience building and scaling full stack applications in production environments
Experience designing distributed systems and integrating with external APIs (EHR experience a plus)
Technical Skills
Proficiency in one or more backend languages/frameworks (Kotlin/Spring, Ruby on Rails, or similar)
Experience with modern frontend frameworks (React or similar)
Familiarity with cloud infrastructure (AWS), containerization (Docker), and infrastructure as code (Terraform)
Strong understanding of CI/CD, automated testing, and service-oriented architecture
Solid working knowledge of SQL and data modeling
Increasing familiarity with AI models and tools
Ways of Working
Strong problem-solving skills and attention to detail
Ability to operate independently while collaborating effectively across teams
Clear and concise communication, especially in a distributed, async-friendly environment
A bias toward ownership, action, and continuous improvement
Comfort operating in a fast-paced, evolving environment with shifting priorities
A desire to learn and grow in a fast-paced environment
Core Competencies:
Knowledge & Application: Applies deep technical expertise to design scalable systems and solve complex problems
Complexity & Problem Solving: Contributes potential solutions to ambiguous, high-impact challenges requiring cross-system thinking
Working Conditions & Environment:
Fully remote role within the United States
Periodic travel (approximately 5%) for team meetings, customer visits, and industry events
Operates in a fast-paced, growth-oriented, PE-backed SaaS environment
Requires cross-functional collaboration across Product, Sales, and Customer Success
Remote work environment with a flexible work schedule to encourage work-life balance
Annual company offsite
Generous leave package including flexible time off policy that encourages team members to take time off to relax and recharge; plus 13 paid holidays, paid sick leave, and paid parental leave
Medical, dental, and vision insurance for you and your family, plus a company funded FSA & HSA (dependent on which medical plan you choose)
401(k) company match
One-time workspace reimbursement to help you optimize your remote workspace
Interra Health is an Equal Employment Opportunity/Affirmative Action employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, national origin, protected veteran status, disability status, sexual orientation, gender identity or expression, marital status, genetic information, or any other characteristic protected by law.
Director Global Account Management
ARE YOU INTERESTED IN JOINING AN INNOVATIVE LOGISTICS TECHNOLOGY COMPANY?
Loadsmart is a growth-stage technology company valued at over $1 billion (a true Tech Unicorn)!
We are a collection of industry veterans and user-centered engineers using innovative technology to fearlessly reinvent the future of freight by helping shippers, brokers, warehouses and carriers to move more with less.
With headquarters in Chicago and a globally distributed remote team, Loadsmart continues to attract top talent committed to driving meaningful change. We seek professionals who embody our core values: curiosity, clarity, results, commitment, and teamwork.
We are seeking an experienced and strategic Director of Global Account Management to lead and grow our Account Management Team and portfolio of key enterprise clients across global markets. Reporting to the SVP of Customer Experience, this role is responsible for building and scaling a high-performing global account management organization, driving revenue retention, expansion growth, and serving as the executive voice for our customer relationships.
Job Type: (Exempt) – U.S. Only
DEPARTMENT: Customer Success
LOCATION: Chicago, IL or remote, depending on location
WHAT YOU GET TO DO:
Lead, mentor, and develop a team of regional and senior account managers across multiple geographies, fostering a culture of accountability, customer obsession, and continuous growth.
Own the global account management strategy, including retention targets, net revenue retention (NRR), upsell/cross-sell playbooks, rules of engagement with Sales, and executive relationship programs.
Serve as an executive sponsor for a defined set of strategic global accounts, building deep C-suite and VP-level relationships in partnership with Sales.
Partner closely with Sales, Product, and Marketing to ensure a seamless customer journey from initial sale through renewal and expansion.
Develop and implement scalable processes, tools, and frameworks that improve account health, increase customer lifetime value, and reduce churn.
Analyze account performance data and market trends to inform strategy, identify risks early, and surface growth opportunities.
Collaborate with regional leaders to ensure consistent execution of account management practices across diverse global markets and customer segments.
Represent the voice of the customer internally, advocating for product improvements and service enhancements based on client feedback.
Build and present regular business reviews (QBRs/EBRs) at the executive level, both internally and with key client stakeholders.
Drive forecasting accuracy and pipeline visibility for renewal and expansion revenue.
REQUIRED QUALIFICATIONS:
8+ years of experience in account management, customer success, or enterprise sales, with at least 4 years in a leadership role managing global or multi-regional teams.
Proven track record of meeting or exceeding NRR, retention, and expansion targets in a fast-moving startup environment.
Strong executive presence with demonstrated ability to build and sustain C-suite relationships.
Experience working with large, complex enterprise accounts across multiple industries and geographies.
Excellent cross-functional collaboration skills, comfortable influencing without authority across Sales, Product, Marketing and Operations.
Data-driven mindset with proficiency in CRM platforms (Salesforce preferred) and experience using analytics to drive decisions.
Outstanding communication, negotiation, and presentation skills.
Ability to travel internationally as required (up to 20%).
Experience in logistics or global supply chain is a strong plus.
WORKING AT LOADSMART:
Competitive base salaries – we believe in rewarding top talent
Extremely competitive Equity package – become a shareholder in our company!
Loadie Time Off – PTO and sick days without a limit
Comprehensive Medical, Dental, and Vision insurance plans
401k Match
*Applicants must be currently authorized to work in the United States on a full-time basis. Loadsmart will not sponsor applicants for work visas.
At Loadsmart, we believe our biggest asset is our people. We are proud to be an equal opportunity employer, hiring and developing individuals from diverse backgrounds and experiences to add to our collaborative culture. Loadsmart treats all candidates and employees with respect and does not discriminate in our recruiting, hiring, and promoting processes, including on the basis of race, color, religion, sex, age, sexual orientation, gender identity and/or expression, national origin, veteran status, or disability.
It is the policy of Loadsmart that all offers of employment made shall be contingent upon successful completion of electronic background check(s). These checks will be job-related, consistent with business necessity and conducted by our vendor, pursuant to all applicable laws, rules, policies and procedures of our candidates’ specific locale.
Self Pay & Customer Service Specialist
The vision of Clinical Health Network for Transformation (CHN) is to support the mission and promise of Planned Parenthood to bring high-quality, affordable care to every member of our communities. CHN is a collaboration between Planned Parenthood affiliates across the United States.
CHN is looking for individuals who are committed to supporting our shared goal of strengthening and enhancing our awareness and commitment to advancing the cause of health equity in our organization.
Reporting directly to the Revenue Cycle Manager, the Self-Pay & Customer Service Specialist is responsible for managing self-pay accounts and providing exceptional customer service to patients. This role involves handling patient inquiries, setting up payment plans, and ensuring timely collection of outstanding balances while maintaining positive patient relationships.
Essential Functions
- Manage self-pay accounts by setting up payment plans and ensuring timely collection of outstanding balances
- Provide excellent customer service to patients, addressing billing and payment inquiries, and offering solutions to resolve any concerns
- Accurately post payments and adjustments from patients into the billing system
- Perform regular reconciliations of patient accounts to ensure accuracy and completeness
- Investigate and resolve any discrepancies or disputes related to patient accounts in a timely manner
- Maintain accurate and up-to-date financial records, including payment plans and collection activities
- Generate and analyze reports on self-pay accounts and customer service interactions to monitor cash flow and identify trends or issues
- Perform various clerical activities to support daily operations
- Creates and promotes a culture of continuous improvement
- Ensures compliance with all CHN and affiliate policies, as well as all state and federal regulations
- Demonstrates a commitment to CHN and Planned Parenthood’s mission related to health equity, especially centering racial equity, and a deep sense of accountability to community
- Demonstrates a commitment to learning about and enhancing practices related to racial equity and the impact of structural racism on healthcare systems
- Provides positive and developmental feedback and accountability related to practices including, but not limited to, equity
Qualifications and Experience (Required)
- Minimum of 2 years of relevant accounts receivable experience
- Previous experience using ICD-10 Medical Coding and Current Procedural Terminology (CPT)
- Knowledge of medical terminology
- Strong analytical and problem-solving abilities
- Proficiency with Microsoft software (Excel, etc.)
- Demonstrated ability to maintain a customer-centric service approach in a fast-paced environment
- Excellent written and verbal communication skills and ability to collaborate and interact with all levels within and outside of CHN if necessary
- Strong attention to detail and follow-up, and ability to multi-task in a fast-paced environment
- Demonstrated dedication to Planned Parenthood’s mission, vision, and values
- Strong interpersonal skills and the ability to build relationships with stakeholders
- Excellent time management and problem-solving skills
Qualifications and Experience (Preferred)
- Strong general technology skills: proficient use of Excel, Word, and the Windows environment; experience with Epic, eCW, NextGen, or other practice management systems a plus
- Medical Billing and Coding certification a plus
Key Requirements
- Commitment to advancing race (+) equity in one’s work: interested in expanding knowledge about the role that racial inequity plays in our society
- Awareness of multiple group identities and their dynamics, bringing a high level of self-awareness about personal identity, empathy, and humility to interpersonal interactions
- Demonstrated ability to communicate clearly and directly as well as hear and act on feedback related to identity and equity with the aim to learn
- Strong sense of accountability to equitable practices
- Understanding of the impact of identity dynamics on organizational culture
- Commitment to CHN and Planned Parenthood’s “In This Together” service ethos, workplace values, and service standards
Clinical Health Network for Transformation (CHN) is an equal employment opportunity employer. We comply with all applicable laws prohibiting discrimination based on race, color, religion, gender and gender expression/identity, age, ethnicity, national origin, ancestry, physical or mental disability, uniformed service member/veteran status, marital status, medical condition, pregnancy, sexual orientation, citizenship status, genetic information, as well as any other category protected by federal, state, or local law. We are committed to building an inclusive workplace that values racial & social justice. We strongly encourage all persons to apply, including members from all racial and ethnic groups and members of the LGBTQIA+ community.
Senior Site Reliability Engineer – AI Infrastructure
Location: Global Remote / San Francisco (Full-Time)
About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.
We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.
Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.
Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, not dissimilar to the flows of capital in the world’s financial markets.
We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.
The Role
This is not a generalist SRE role.
You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems.
We’re looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from network fabric to kernel to framework.
What You’ll Own
GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.
Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.
Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.
Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.
What We’re Looking For
GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.
Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run: NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don’t need to write the models, but you need to understand what’s happening at the systems level when a 1,000-GPU training run stalls.
Linux & Systems Internals: Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, performance profiling at the syscall and hardware level.
Kubernetes & Orchestration: Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators. Experience with Slurm or other HPC schedulers is equally valued.
Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash. You build production-grade tools and services, not just scripts. Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
Observability & Monitoring: Hands-on experience building monitoring and alerting for GPU infrastructure: not just Prometheus/Grafana basics, but GPU-specific telemetry (DCGM, nvidia-smi, fabric manager metrics) integrated into actionable dashboards.
Incident Management: Proven track record leading incident response for complex distributed systems where the failure could be in hardware, firmware, networking, drivers, orchestration, or application code, and you need to narrow it down fast.
Strong Candidates May Have
Distributed Storage: Experience with high-performance parallel file systems (VAST, Weka, Lustre, GPFS) and the checkpoint I/O and data-loading bottlenecks that come with large training runs.
Training Optimization: Experience profiling and optimizing distributed training performance: identifying stragglers, tuning collective communication strategies, improving MFU (Model FLOPs Utilization), and reducing idle GPU time across large runs.
Cluster Buildout & Hardware: Experience with physical cluster design: rack layout, power/cooling constraints, network topology design, and hardware validation/burn-in at scale.
Team Leadership: Experience leading or mentoring a team of infrastructure engineers. We’re growing and need people who raise the bar for everyone around them.
Why You’ll Love It Here
This is a high-impact, senior builder’s role. You’ll have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute. You’ll influence technical direction and help define what world-class AI infrastructure operations look like.
Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
