PhD Student, Lund University
Research fields: Computer science, Machine learning, Software engineering
After three years of working in quality assurance for embedded, web, and mobile applications, I started a Ph.D. at Lund University in 2019, in collaboration with System Verification and WASP, Sweden's largest research program in AI, autonomous systems, and software. My research concerns applied machine learning for anomaly detection in DevOps, with the main goal of identifying early signs of operations failures. My focus has therefore shifted toward data science for software reliability in operations.
Context: Anomaly detection is crucial for maintaining cloud-based software systems, as it enables early identification and resolution of unexpected failures. Given rapid and significant advances in the anomaly detection domain and the complexity of its industrial implementation, an overview of techniques that utilize real-world operational data is needed. Aim: This study aims to complement existing research with an extensive catalog of the techniques and monitoring data used for detecting anomalies affecting the performance or reliability of cloud-based software systems that have been developed and/or evaluated in a real-world context. Method: We perform a systematic mapping study to examine the literature on anomaly detection in cloud-based systems, particularly focusing on the usage of real-world monitoring data, with the aim of identifying key data categories, tools, data preprocessing, and anomaly detection techniques. Results: Based on a review of 104 papers, we categorize monitoring data by structure, types, and origins and the tools used for data collection and processing. We offer a comprehensive overview of data preprocessing and anomaly detection techniques mapped to different data categories. Our findings highlight practical challenges and considerations in applying these techniques in real-world cloud environments. Conclusion: The findings help practitioners and researchers identify relevant data categories and select appropriate data preprocessing and anomaly detection techniques for their specific operational environments, which is important for improving the reliability and performance of cloud-based systems.
With the dynamic nature of modern software development and operations environments and the increasing complexity of cloud-based software systems, traditional monitoring practices are often insufficient to identify and handle unexpected operational failures in a timely manner. To address these challenges, this paper presents the findings from a quantitative industry survey focused on the application of Machine Learning (ML) to enhance software monitoring and alert management strategies. The survey targets industry professionals, aiming to understand the current challenges and future trends in ML-driven software monitoring. We analyze 25 responses from 11 different software companies to determine whether and how ML is being integrated into their monitoring systems. Key findings reveal a growing but still limited reliance on ML to intelligently filter raw monitoring data, prioritize issues, and respond to system alerts, thereby improving operational efficiency and system reliability. The paper also discusses the barriers to adopting ML-based solutions and provides insights into the future direction of software monitoring.
Detecting failures early in cloud-based software systems is highly significant, as it can reduce operational costs, enhance service reliability, and improve user experience. Many existing approaches rely on anomaly detection in metrics or in a blend of metric and log features. However, such approaches tend to be very complex and hard to explain, and consequently non-trivial to implement and evaluate in industrial contexts. In collaboration with a case company and their cloud-based system in the domain of PIM (Product Information Management), we propose and implement autonomous monitors for proactive monitoring across multiple services of a distributed software architecture, fused with anomaly detection in performance metrics and log analysis using GPT-3. We demonstrated that operations engineers tend to be more efficient when they have access to interpretable alert notifications based on detected anomalies that contain information about implications and potential root causes. Additionally, the proposed autonomous monitors turned out to be beneficial for the timely identification and revision of potential issues before they propagate and cause severe consequences.
DevOps represents the tight connection between development and operations. To address challenges that arise on the borderline between development and operations, we conducted a study in collaboration with a Swedish company responsible for ticket management and sales in public transportation. The aim of our study was to explore and describe the existing DevOps environment, as well as to identify how the feedback from operations can be improved, specifically with respect to the alerts sent from system operations. Our study complies with the basic principles of the design science paradigm, such as understanding and improving design solutions in specific areas of practice. Our diagnosis, based on qualitative data collected through interviews and observations, shows that alert flooding is a challenge in the feedback loop, i.e., too many signals from operations create noise in the feedback loop. Therefore, we designed a solution that improves alert management by optimizing when to raise alerts and accordingly introducing a new element in the feedback loop, a smart filter. Moreover, we implemented a prototype of the proposed solution design and showed that a tighter relation between operations and development can be achieved using a hybrid method that combines rule-based analysis and unsupervised machine learning for operations data analysis.