RT Dissertation/Thesis
T1 Performance Anomaly Detection in HPC
T2 Detección de anomalías de rendimiento en HPC
A1 Halawa, Mohamed Soliman
K1 1203.04 Artificial Intelligence
K1 1209.09 Multivariate Analysis
K1 1209.15 Time Series
AB In recent years, the demand for high-performance computing (HPC) data centers has increased. An HPC system often consists of thousands of computing services. Given the high cost of setting up such systems, it is vital that the service provider use the limited data center resources as efficiently as possible and reduce the service cost to fit the “pay as you go” pricing model. As HPC systems and applications continue to grow in complexity, they become more exposed to performance problems (resource contention, software- and firmware-related problems, etc.) that can lead to premature job termination, reduced performance, and wasted compute platform resources. Continuous management of the health of such systems has a huge financial and operational impact, so it is essential for HPC operators to monitor and analyze the performance of these complex environments. Manually monitoring systems of this size and complexity is an impossible task, because thousands of computational nodes generate a huge amount of data per day in the form of resource usage metrics and other key performance indicators (KPIs).
Many visualization tools are available that monitor and collect HPC performance data that may contain evidence of anomalies, but what is lacking is an analytic engine that processes this data to identify anomalous performance activity. Performance problem management has therefore become a major task in HPC cloud environments, and it comprises three main tasks: (i) real-time detection of performance anomalies within HPC cloud data centers; (ii) identifying the root cause of these anomalies; (iii) identifying methods to prevent these anomalies from occurring. These performance problems have moved research on computational intelligence into a new era of developing tools and techniques to identify such anomalies. These tools use data analytic techniques (statistical, machine learning, time series, threshold-based, etc.) that capture information on a large number of time-varying system performance metrics and then analyze the relationships among system components and applications.
YR 2021
FD 2021-11-08
LK http://hdl.handle.net/11093/2647
UL http://hdl.handle.net/11093/2647
LA eng
DS Investigo
RD 26-sep-2023