Site Reliability Engineer (Lisbon or Porto)
DefinedCrowd provides high-quality training data to fuel AI applications and initiatives. Our fully customizable workflows in speech, NLP, and vision are designed to help our clients reach their business goals quickly, easily and with guaranteed results.
But the power of artificial intelligence is anything but purely artificial. Because it can process and analyze millions of data points per second, AI can quickly present information to human workers in a digestible form. The combination of automation and machine learning gives human workers superpowers they have never possessed.
Through collaboration, augmented intelligence amplifies human powers and skills. This new collaborative intelligence will be capable of performing unimaginable superhuman feats. So, human roles aren’t disappearing. In fact, AI is creating new human career paths: Trainers, Explainers, and Sustainers.
We are currently looking for talented new members across the world to join this energetic, hardworking and fun team in Porto.
- Run our infrastructure with Ansible, Terraform and Kubernetes.
- Build resilient infrastructure and tooling solutions.
- Make monitoring and alerting fire on symptoms rather than on outages.
- Improve the deployment process to make it as boring as possible.
- Debug production issues across services and levels of the stack.
- Manage application configurations, load balancing, proxies and CDN.
- Maintain monitoring and metrics in Prometheus, Grafana and their integrations.
- Manage logging infrastructure.
- Disaster Recovery and High Availability strategy.
- Actively work to automatically detect potential issues in a large virtualized environment.
- Write automation scripts to auto-correct or completely prevent issues in our online services.
- Track and review changes in a highly dynamic environment.
- Perform software updates, testing, and Common Vulnerabilities and Exposures (CVE) analysis.
- Respond to security threats.
- Participate in a regular shift and on-call rotation.
- Engaging with cutting-edge technology and a fast-paced, dynamic environment.
- Health Insurance.
- Remote work friendly.
- Fresh fruit, snacks, “all you can drink” coffee/tea, and soup every Tuesday and Thursday.
- Stimulating environment committed to diversity and inclusion.
- Individual budget for training and conferences.
- MSc in Computer Science or related field.
- 3+ years of experience in DevOps/Site Reliability Engineering positions.
- Experience with designing and implementing CI/CD DevOps solutions using Azure DevOps, Git, Shell, YAML, Kubernetes and Docker.
- Expert-level scripting (e.g. Python, Bash, PowerShell).
- Strong experience with microservices platforms.
- Experience supporting mission-critical systems.
- Experience running Linux servers, Windows Servers and monitoring systems.
- Experience with RabbitMQ, Redis, Elasticsearch.
- Experience in Configuration Management with Ansible, Chef or Puppet.
- Experience in Cloud technologies and different providers (OpenStack, AWS, Google Cloud Platform, Microsoft Azure).
- Strong communication skills and ability to work effectively across multiple business and technical teams.
- Demonstrated ability to quickly and accurately troubleshoot issues.
- Knowledge of SDLC and Agile methodologies.
- Excellent English speaking and writing skills.