Tuesday, October 27 • 17:15 - 18:05
Prometheus Enabled AI Deep Observability Based on eBPF - Ivy He, Huawei Technologies Co, LTD

The AI training process is complex and opaque. When a task is running, traditional tracing tools leave monitoring blind spots, which makes debugging and tuning difficult for developers. For this reason, we chose eBPF to observe the changes we care about in real time, for example whether a specific kernel function is called or whether short-lived processes appear. We then use Prometheus to monitor the data that eBPF collects dynamically and present it to developers. In this talk, I will share our practice of using eBPF for AI kernel observability. While AI training and inference tasks are running, we can dynamically attach eBPF code to kernel functions to collect data, and report that data to Prometheus in a unified format for visual management. This observability work is currently at an experimental stage.
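
As a rough illustration of the reporting step only, the sketch below renders per-function call counts in the Prometheus text exposition format. It is not taken from the talk: the metric name `ebpf_kernel_function_calls_total`, the `function` label, and the sample kernel function names are all hypothetical, and the counts are assumed to have already been read from an eBPF map by some collector.

```python
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_call_counts(
    counts: Dict[str, int],
    metric: str = "ebpf_kernel_function_calls_total",  # hypothetical metric name
) -> str:
    """Render {kernel_function: call_count} as Prometheus text exposition lines."""
    lines = [
        f"# HELP {metric} Calls observed per kernel function (illustrative).",
        f"# TYPE {metric} counter",
    ]
    # Sort by function name so the output is stable between scrapes.
    for function, count in sorted(counts.items()):
        label = escape_label_value(function)
        lines.append(f'{metric}{{function="{label}"}} {count}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed sample data; in practice these would come from an eBPF map.
    sample = {"tcp_sendmsg": 42, "do_sys_openat2": 7}
    print(render_call_counts(sample), end="")
```

A collector could serve this text over HTTP for Prometheus to scrape; keeping one metric name with a label per kernel function is one possible way to read "a unified format", not necessarily the one used in the talk.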

Speakers
Luwei He

Open Source Engineer, HUAWEI TECHNOLOGIES CO., LTD.
I am Ivy He, an open source engineer at Huawei. I have worked on open source projects related to high-performance storage and edge computing, and have contributed to SPDK, Kubernetes, Akraino, and other open source communities. Currently I am mainly engaged in open source practice in AI obs...


Tuesday October 27, 2020 17:15 - 18:05 GMT
AI/ML/DL Theater
  AI/ML/DL, AI Observability