AI/ML/DL
Tuesday, October 27

17:15 GMT

Prometheus Enabled AI Deep Observability Based on eBPF - Ivy He, Huawei Technologies Co, LTD
AI training is complex and largely opaque. While a task is running, traditional tracing tools leave monitoring blind spots, which makes debugging and tuning difficult for developers. For this reason, we chose eBPF to observe, in real time, the changes we care about, such as whether a specific kernel function is called or which short-lived processes appear. We use Prometheus to monitor the data that eBPF collects dynamically and to present it to developers. In this talk, I will share our practice of using eBPF for observability of the AI kernel. While AI training and inference tasks run, we can dynamically inject eBPF code into kernel functions to collect data, then report that data to Prometheus in a unified format for visual management. This observability work is currently at an experimental stage.
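
As a rough illustration of the "unified format" step, the sketch below renders per-function call counts in the Prometheus text exposition format. It is not the speaker's implementation: the metric name `kernel_function_calls_total`, the `function` label, and the sample counts are all hypothetical, and the counts are assumed to have already been read out of an eBPF map by some collector.

```python
from typing import Dict


def escape_label(value: str) -> str:
    # Prometheus label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(call_counts: Dict[str, int]) -> str:
    """Render call counts as Prometheus text exposition format.

    `call_counts` maps a kernel function name to the number of calls
    observed (assumed to come from an eBPF map; not collected here).
    """
    lines = [
        # Hypothetical metric name and help text, chosen for illustration.
        "# HELP kernel_function_calls_total Calls observed per kernel function.",
        "# TYPE kernel_function_calls_total counter",
    ]
    # Sort for stable output between scrapes.
    for func in sorted(call_counts):
        label = escape_label(func)
        lines.append(
            f'kernel_function_calls_total{{function="{label}"}} {call_counts[func]}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample values standing in for data read from an eBPF map.
    sample = {"vfs_read": 42, "do_sys_open": 7}
    print(render_metrics(sample), end="")
```

Serving this text from an HTTP endpoint that Prometheus scrapes is one plausible way to connect the two systems; the abstract does not say how the reporting is actually done.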

Luwei He

I am Ivy He, an open source engineer at Huawei. I have worked on open source projects related to high-performance storage and edge computing, and have contributed to SPDK, Kubernetes, Akraino, and other open source communities. Currently I am mainly engaged in open source practice in AI obs...

Tuesday October 27, 2020 17:15 - 18:05 GMT
AI/ML/DL Theater
  AI/ML/DL, AI Observability
