Reliability ML Lifecycle Management @C3 AI



Role & Duration
Lead Product Designer
4 Months
Highlights & Stage
ML / AI Developer Tool
Enterprise Product
Featured at the C3 AI Annual Conference
Shipped 06/2024
Team Structure
Leadership: Vertical C-Suite Reviews
Peers: 2 PM, 3 DS, 3 SME, 1 Eng Lead, 10+ Eng, 2 Partner Teams
Reports: 1 Product Designer
Reliability ML Lifecycle Management is an end-to-end workflow that empowers Reliability Engineers to seamlessly create, manage, and monitor machine learning models from data preparation to deployment, enhancing efficiency and scalability for enterprise AI workflows.
Background
The C3 AI Reliability application uses anomaly detection models to predict equipment failures, detect anomalies, and optimize maintenance, helping industries reduce downtime and improve asset performance. While it is recognized as one of the top predictive maintenance tools on the market, we identified the following issues in recent customer deployments.

Low Contract Renewal Rate

Low User Engagement Rate

Poor Asset Scalability
Conducting vision workshops to uncover the biggest opportunities
I partnered with my PM to co-organize 5 product vision workshops with key stakeholders, identifying the biggest product and design opportunities for the new year’s roadmap and helping the product better fit the current market.

From the product vision workshops, we learned that the biggest opportunity for improvement is in model training and deployment.
In C3 AI Reliability deployments, each AI/ML model is applied to a specific asset. When a user wants to configure and train a model for an asset, they rely heavily on C3’s field team to manually write Python code in Jupyter notebooks, a time-intensive, iterative process that requires knowledge of the physical system and involves hand-editing lines of code for each asset.

Avg. Model Deployment Time for 10 Assets
6+ FTE / Month
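For context, the per-asset notebook work looked roughly like the sketch below. It uses synthetic data and hypothetical steps purely to illustrate the manual process; it is not C3 AI’s actual API.

```python
# A minimal sketch (synthetic data, hypothetical steps) of the per-asset
# notebook work the field team had to repeat and hand-tune for each asset.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 1. "Fetch" telemetry for a single asset (stand-in for a platform query)
idx = pd.date_range("2024-01-01", periods=24 * 6 * 30, freq="10min")
raw = pd.DataFrame(
    {
        "temperature": 60 + rng.normal(0, 2, len(idx)),
        "vibration": 0.3 + rng.normal(0, 0.05, len(idx)),
        "current_draw": 12 + rng.normal(0, 1, len(idx)),
    },
    index=idx,
)

# 2. Validate that sensors are connected and reporting data
assert not raw.empty and raw.notna().mean().min() > 0.95

# 3. Prepare the dataset: resample, fill short gaps, mask out downtime
prepared = raw.resample("1h").mean().interpolate(limit=3)
prepared = prepared[prepared["current_draw"] > 1.0]  # crude downtime mask

# 4. Configure and train an anomaly detection model for this one asset
model = IsolationForest(contamination=0.01, random_state=0).fit(prepared)
scores = model.decision_function(prepared)

# ...then repeat (and hand-tune) these steps for every one of the 100 assets.
```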
Setting a vision for self-service and reduced deployment time

Final Solution
Introducing Reliability ML Lifecycle Management Module

Alex, a Tesla Reliability Engineer, is tasked with deploying predictive maintenance models across hundreds of electric motors in Tesla vehicles. To ensure operational excellence, Alex must validate sensor data for the entire fleet, prepare a detailed dataset for a specific asset, and configure models efficiently across all motors.
Step 1: Bulk Data Validation
Alex begins by validating sensor data for all motors at once using the bulk data validation feature.
Step 2: Data Preparation for a Single Asset
After completing the fleet-wide validation, Alex focuses on a specific motor that requires detailed tuning for its predictive maintenance model.
Step 3: Bulk Model Configuration
With validated data and a prepared dataset, Alex moves on to deploying models across the entire fleet using the bulk model configuration tool.

Discovery
Identifying key user personas and uncovering pain points in the ML lifecycle
With a focus on reducing ML deployment time, my team and I created a research plan to document our goals and assumptions. We investigated the domain from multiple angles: user interviews with the internal field team and external customers, as well as competitive analysis of similar products.



Imagine C3 is helping Tesla deploy models for 100 cars


Data Scientist (C3 Field Team)
Familiarity with ML & Tesla Cars (Assets)
Expert with C3’s framework for ML
Expert on Reliability ML approaches
Actively developing new models for Reliability
Not familiar with Tesla cars at all
Responsibilities
Coding in Jupyter to validate car data and confirm each car’s sensors are connected
Using Jupyter to prepare and clean data based on the Reliability Engineer’s requirements
Configuring models based on the Reliability Engineer’s requirements
Investigating issues with model configuration, training, or deployment

Reliability Engineer (Primary User)
Familiarity with ML & Tesla Cars (Assets)
Familiar with basic technical terms (“algorithm”, “dataset”)
Unfamiliar with C3’s framework for ML
Unfamiliar with coding
Expert on the sensor data for each car
Responsibilities
Define the ML objective (e.g., “I want to predict potential failures for the 100 cars in my plant”)
Tell the Data Scientist which sensors need to be included
Tell the Data Scientist which data need to be masked out of the dataset
Evaluate model performance after training
Report any issues with model configuration, training, or deployment
So, What’s the problem?

Back-and-Forth Collaboration Process
High Learning Curve
Dataset preparation is too manual and highly dependent on engineering experience
Shaping the strategy
Focus on the primary users: automate the technical ML work while empowering what users are good at
I explored ways to simplify complex information while keeping core actions clear. Tasked with designing a workflow inspired by Google Vertex AI, I recognized its broad use case might not fit Reliability Engineers. I advocated for user validation to refine the functionality based on real feedback.

Design Decision 1 - Empower & Guide
HMW simplify complicated dataset preparation while empowering users to dive deep?
Among all the dataset preparation steps, applying the data masks is the most complicated one.
Understand types of masks
I collaborated with a data scientist and 2 users from Shell, one of our biggest clients, to understand common masking strategies.
Create a quick design to validate with users
To gather early feedback, I quickly created a one-page mockup and tested it with users. A key insight was that the masking configuration process felt too overwhelming and technical, leaving most users confused. While my PM suggested adding descriptions and tooltips to guide users, I believed this approach would further increase cognitive load.



Refine the designs
After multiple design iterations and user validation, I focused on simplicity as a core design principle. By organizing input categories into clear sections and using progressive disclosure for advanced inputs, users felt more guided and confident throughout the experience.






Design Decision 2 - Empower & Guide
HMW help users effectively decide which sensors to include in the dataset?
To create a high-quality dataset, users need to select relevant sensors for each asset and ensure those sensors provide reliable data. This requires users to manually review sensor data to confirm it meets quality standards, avoiding sensors with excessive low-variance data, flatlines, outliers, or missing values.

Understand what defines good sensor data
Throughout the process, I worked closely with the data scientists to brainstorm ways to help users automatically identify sensors with good data quality. That led me to an idea: what if we could leverage AI to generate a data quality score that takes variance, flatlines, outliers, and missing data as model inputs?


Don't automate just for the sake of it
However, when I presented this concept to both users and data scientists, they raised a valid concern.

Rule-based instead of ML-based
After discussions with data scientists and developers, we decided to implement a straightforward rule-based approach, allowing users to set configurable thresholds for variance, missing data, flatlines, and outliers. This approach also gives users the flexibility to manually add or remove sensors as needed.
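To make the rule-based idea concrete, a sensor quality check along these lines might look like the sketch below. The function name and default thresholds are assumptions for illustration, not the shipped implementation.

```python
# Illustrative rule-based sensor quality check with configurable thresholds
# for variance, missing data, flatlines, and outliers (defaults are assumed).
import numpy as np
import pandas as pd

def sensor_passes(
    series: pd.Series,
    min_variance: float = 1e-3,       # reject near-constant signals
    max_missing_ratio: float = 0.10,  # reject sparse signals
    max_flatline_ratio: float = 0.20, # reject long runs of repeated values
    max_outlier_ratio: float = 0.05,  # reject noisy signals (|z-score| > 3)
) -> bool:
    """Return True if a sensor's data meets all configured quality thresholds."""
    missing_ratio = series.isna().mean()
    values = series.dropna()
    if values.empty:
        return False

    variance = values.var()
    flatline_ratio = (values.diff() == 0).mean()
    z = (values - values.mean()) / (values.std() or 1.0)
    outlier_ratio = (z.abs() > 3).mean()

    return (
        variance >= min_variance
        and missing_ratio <= max_missing_ratio
        and flatline_ratio <= max_flatline_ratio
        and outlier_ratio <= max_outlier_ratio
    )

# Example: keep only sensors that pass the configured checks
data = pd.DataFrame({
    "vibration": np.random.default_rng(0).normal(0.3, 0.05, 1000),
    "stuck_sensor": np.full(1000, 7.0),  # flatlined -> filtered out
})
selected = [col for col in data.columns if sensor_passes(data[col])]
print(selected)  # ['vibration']
```

Users can tune each threshold or manually add and remove sensors afterward, so the rules assist rather than override their judgment.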

Design Decision 3 - Automate & Streamline
HMW enable users to automate and scale the technical setup of model configurations?

Bulk select assets to start model configuration
Enable users to apply configurations in bulk to multiple assets or datasets, allowing for efficient scaling across assets.

Consider edge cases: give users the flexibility to skip steps or go back
Provide default configurations, with the flexibility to adjust them

Default Configuration
Streamlined one-click solution to minimize cognitive load

Configurable Templates
Provide pre-set templates that users can apply and modify based on typical use cases, speeding up the process while maintaining flexibility.
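Conceptually, such a template could be represented like the sketch below. The field names and default values are hypothetical and only illustrate the default-plus-override idea, not the shipped data model.

```python
# Conceptual sketch of a reusable configuration template applied in bulk.
# Field names and defaults are hypothetical, for illustration only.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelConfigTemplate:
    resample_interval: str = "1h"        # data preparation granularity
    training_window_days: int = 90       # how much history to train on
    anomaly_sensitivity: float = 0.01    # expected anomaly rate
    retraining_cadence_days: int = 30    # how often to retrain

# One default template, applied to every selected asset in one click...
default_template = ModelConfigTemplate()
fleet = [f"motor-{i:03d}" for i in range(1, 101)]
configs = {asset: default_template for asset in fleet}

# ...with the flexibility to override settings for specific assets.
configs["motor-017"] = replace(default_template, anomaly_sensitivity=0.05)
```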
Impact
Significantly Reduced FTE and Deployment Time with New Feature Release

Thanks for Stopping By
Evie Xu