Image Segmentation with Databricks
From Ingestion to Prediction
Introduction
As humans, we use our vision, touch, smell, and language to apprehend the world.
In manufacturing, for humans and machines alike, vision is essential for quality inspection, health and safety, assembly support, inventory management, anomaly detection, predictive maintenance, and more.
The same questions have always existed:
- “Does that part look good? Can I reuse it?”
- “Did I assemble this correctly?”
- “Did I apply the coating with a minimal amount of bubbles?”
These examples are from manufacturing, but the same applies to many other industries. So far, algorithms have made fantastic progress in two of our “senses”: vision and language. This blog will focus on vision.
Problem statement
The first problem is not so much about analyzing images as about collecting and organizing them. Traditional analytics systems (data warehouses) were designed to manage structured data (numbers in tables), not images. With the advent of data lakes, it became possible to store vast amounts of images cheaply, but they remained difficult to organize. That is problem 1.
The second problem is detecting anomalies or objects in images. You have to train an algorithm relying on deep neural networks, which requires a lot of data and specific technical skills. There are no cats to detect in manufacturing, so can you really copy and paste the code from the many Colabs and tutorials that teach you to detect a cat? That is problem 2.
Let’s take a Kaggle contest on computer vision and try to solve both problems with one platform: first the computer vision itself, and second the full end-to-end data pipeline, so that we can go from Kaggle to production.
The Kaggle contest we picked is about detecting boats in aerial images. Analyzing images systematically like this applies to many other use cases, from security breach detection to plant safety monitoring, especially when combined with other data such as geolocation. More examples can be found on this blog. The Kaggle contest in question can be found here: Kaggle link
The problems can be summarised in the following architecture:
How to build a data pipeline with unstructured data?
How do I manage unstructured data in an organized manner and at scale? Do I build the metastore manually? How do I ensure that my pipeline scales accordingly, and that I can rerun parts of the process if my transformation steps change? How can I handle streaming images from applications? How can those images be served to downstream users, and how can I query them?
Computer vision, yes, but I am not an expert!
How can I train an algorithm on my specific image patterns without starting from scratch, without requiring a huge amount of data to reach good accuracy, and without needing a Ph.D. to train, fine-tune and maintain it? How do I access the latest libraries? How can I train faster on a GPU without paying a heavy cost? How can I make sure all those data science experiments are tracked, so that the model can move to production through a proper CI/CD chain?
Solution
We are going to use Databricks as the underlying platform to stitch together image ingestion, data pipelining, governance, model training, experiment tracking, model registry and model serving.
Access data from Kaggle
An API key is required, as per this documentation, to download the dataset and upload it to your storage. You can set up your clusters with Databricks secrets (documentation here) to connect to the API:
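For illustration, here is a minimal sketch of that setup, assuming a secret scope named kaggle with username and key entries (all placeholder names); dbutils is the utility object available in Databricks notebooks:

```python
# A minimal sketch, assuming a secret scope named "kaggle" with "username" and "key"
# entries (placeholder names): expose the Kaggle credentials to the Kaggle SDK
# through environment variables.
import os

os.environ["KAGGLE_USERNAME"] = dbutils.secrets.get(scope="kaggle", key="username")
os.environ["KAGGLE_KEY"] = dbutils.secrets.get(scope="kaggle", key="key")
```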
Download the dataset from the notebook using:
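A minimal sketch of the download step, assuming the airbus-ship-detection competition slug and a placeholder landing path:

```python
# A minimal sketch (assumed competition slug and placeholder path): authenticate
# with the Kaggle SDK and download the competition files.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads KAGGLE_USERNAME / KAGGLE_KEY from the environment
api.competition_download_files(
    "airbus-ship-detection",            # assumed competition slug
    path="/dbfs/tmp/ship_detection",    # placeholder landing folder
)
```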
Build the data pipeline
As a first step, we put on our Data Engineer hat to build the data pipeline that will ingest, transform and store the image dataset, so that we can use it for training and business analysts can query it for insights. We are going to use the scalability of the Databricks runtime to ingest the images.
Databricks runs on Apache Spark, and images can be treated as binary content in Spark DataFrames. We use Auto Loader (see the documentation here) to ensure the ingestion happens exactly once, can be triggered in streaming or batch mode, and can scale from a few images to millions of files.
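As an illustration, here is a minimal Auto Loader sketch with placeholder paths and table names, ingesting the images as binary content into a bronze Delta table:

```python
# A minimal sketch (placeholder paths and table names): ingest raw images
# incrementally with Auto Loader as binary content and persist them to Delta.
raw_images = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("pathGlobFilter", "*.jpg")
    .load("/mnt/raw/ship_images")                                 # assumed landing folder
)

(raw_images.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/raw_images")  # placeholder checkpoint path
    .trigger(availableNow=True)                                   # process available files, then stop
    .toTable("ship_detection.bronze_images"))                     # assumed table name
```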
We can then use any library (here in Python) to extract the basic metadata needed to enrich our dataset. The main library for that is PIL/Pillow (documentation here), but other libraries could be used.
In our case, we extracted the image size, format and histogram so that we can apply quality criteria and reject any image that does not meet them.
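A sketch of that extraction, assuming the bronze table above and its content binary column; a histogram could be computed the same way with Pillow’s Image.histogram():

```python
# A minimal sketch (assumed table and column names): extract width, height and
# format from the binary image content with Pillow, through a pandas UDF.
import io
import pandas as pd
from PIL import Image
from pyspark.sql.functions import pandas_udf, col

@pandas_udf("width int, height int, format string")
def image_metadata(content: pd.Series) -> pd.DataFrame:
    meta = []
    for raw in content:
        with Image.open(io.BytesIO(raw)) as img:
            meta.append((img.width, img.height, img.format))
    return pd.DataFrame(meta, columns=["width", "height", "format"])

silver_images = (spark.read.table("ship_detection.bronze_images")
                      .withColumn("meta", image_metadata(col("content"))))
```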
Now that we have defined the transformations we want to apply to our images, we can orchestrate the pipeline with the Databricks multi-task orchestrator or Delta Live Tables (see documentation here), which bring automated cluster optimization, table performance management, lineage and observability out of the box.
Loading the raw image mask
For each image, we have a companion dataset containing all the pixels where a boat has been detected. This data is saved as a CSV file.
We’ll start by ingesting this CSV data and saving it as a raw Delta table.
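A minimal sketch, again with placeholder paths and table names:

```python
# A minimal sketch (placeholder path and table name): load the companion CSV
# (image id + detected-pixel information) into a raw Delta table.
masks_raw = (
    spark.read.format("csv")
    .option("header", "true")
    .load("/mnt/raw/ship_labels")             # assumed folder containing the Kaggle CSV
)

masks_raw.write.format("delta").mode("overwrite").saveAsTable("ship_detection.bronze_masks")
```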
Transforming masks into images
We now have all the image label information saved as an array containing the pixel coordinates where a boat has been detected.
However, ML models work with images. Therefore, we need to create a mask (an image) based on this information.
The mask will be saved as a JPG: black where there is no boat, white where a boat was detected.
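Here is a hedged sketch of that transformation, assuming a pixels array column of flattened pixel indices and a fixed 768x768 image size (both assumptions to adapt to the actual label schema):

```python
# A minimal sketch (assumed "pixels" array column and image size): render a
# black/white JPG mask and return it as binary content.
import io
import numpy as np
from PIL import Image
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BinaryType

@udf(BinaryType())
def pixels_to_mask(pixels):
    width, height = 768, 768                          # assumed image size for this dataset
    mask = np.zeros(width * height, dtype=np.uint8)
    if pixels:
        mask[np.array(pixels, dtype=np.int64)] = 255  # white = boat, black = background
    buffer = io.BytesIO()
    Image.fromarray(mask.reshape((height, width))).save(buffer, format="JPEG")
    return buffer.getvalue()

silver_masks = (spark.read.table("ship_detection.bronze_masks")
                     .withColumn("mask_content", pixels_to_mask(col("pixels"))))
```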
Joining image and mask into the gold layer
Ultimately, we can build our gold dataset by merging the mask and the initial image into a single table.
This is a simple join operation, based on the image id.
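A minimal sketch, assuming both silver tables share an image_id column (placeholder names throughout):

```python
# A minimal sketch (assumed table and join-key names): join each image with its
# mask and persist the result as the gold table.
gold = (spark.read.table("ship_detection.silver_images")
             .join(spark.read.table("ship_detection.silver_masks"), on="image_id"))

gold.write.format("delta").mode("overwrite").saveAsTable("ship_detection.gold_images_masks")
```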
We can use that table to display masks and images with a SQL command.
This is a quick way to verify that masks and images are correctly loaded.
That’s it. We have a robust pipeline with observability. We would just need to add some integration tests and we are ready to go into production!
Build the Image segmentation model
Choose the appropriate architecture
Our next step as Data Scientists is to implement an image segmentation ML model. We’ll reuse the gold table built in our previous data pipeline as the training dataset.
Image segmentation models can be complicated, but transfer learning greatly simplifies training such models on your specific dataset: we can start from a pre-trained model with a pre-defined architecture, already trained on a standard dataset, instead of training from scratch. Here we rely on the PyTorch library segmentation_models.pytorch.
Split data as train/test dataset
Now we split the images into training and test datasets, as for any ML model training.
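For example, with the placeholder gold table name used above:

```python
# A minimal sketch (assumed table name): split the gold dataset into train and test sets.
gold = spark.read.table("ship_detection.gold_images_masks")
train_df, test_df = gold.randomSplit([0.8, 0.2], seed=42)
```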
Prepare the dataset
While building an image segmentation model can be easily done, deploying such a model in production is much harder.
Delta table for DL with Petastorm
Our data is currently stored as a Delta table and available as a Spark DataFrame. However, PyTorch expects a specific data format. We need a library to create a dataset in the PyTorch format and manage caching from blob storage to the local SSD. For that, we use Petastorm, as described in this blog.
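A minimal sketch following the Petastorm Spark converter pattern, with an assumed cache location:

```python
# A minimal sketch (placeholder cache path): materialize the Spark DataFrames with
# the Petastorm Spark converter and expose them as PyTorch DataLoaders.
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")        # placeholder cache location

converter_train = make_spark_converter(train_df)
converter_test = make_spark_converter(test_df)

with converter_train.make_torch_dataloader(batch_size=32) as train_loader:
    batch = next(iter(train_loader))                      # each batch is a dict of tensors
```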
Train the base Model
The following cells implement an ML model leveraging PyTorch.
A couple of transformations are required to prepare the images for PyTorch (resizing); the rest is standard deep learning code using the PyTorch Segmentation Models library. Note that we’ll be using MLflow to log our experiment metrics automatically.
Also, while the data pipeline was built on CPUs, we will use a single-node GPU machine to accelerate model training. This is easy to do on Databricks: simply reattach the notebook to a single-node GPU cluster.
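Below is a hedged sketch of such a training loop, assuming the batches expose already decoded and resized image/mask tensors and using illustrative hyperparameters:

```python
# A minimal sketch (assumed "image"/"mask" batch keys, illustrative hyperparameters):
# a U-Net from segmentation_models.pytorch trained in a standard PyTorch loop,
# with parameters and metrics logged to MLflow.
import mlflow
import torch
import segmentation_models_pytorch as smp
from segmentation_models_pytorch.losses import DiceLoss

model = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet",
                 in_channels=3, classes=1).cuda()
loss_fn = DiceLoss(mode="binary")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

with mlflow.start_run():
    mlflow.log_params({"architecture": "Unet", "encoder": "resnet34", "lr": 1e-3})
    model.train()
    for epoch in range(5):
        with converter_train.make_torch_dataloader(batch_size=32) as loader:
            for batch in loader:
                images, masks = batch["image"].cuda(), batch["mask"].cuda()
                optimizer.zero_grad()
                loss = loss_fn(model(images), masks)
                loss.backward()
                optimizer.step()
        mlflow.log_metric("train_loss", loss.item(), step=epoch)
    mlflow.pytorch.log_model(model, "model")   # package the trained model with MLflow
```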
Fine-tune hyperparameters with Hyperopt
Our model is ready. Now, tuning such a model can be tricky: different architectures, encoders and extra hyperparameters such as the learning rate can be chosen.
Typically, we would train several models with different parameter combinations and choose the best one based on a specific metric; for image segmentation, that metric is the IoU (intersection over union). This hyperparameter search is usually done sequentially, which can take a lot of time. In our case, the Databricks runtime can distribute and parallelize the search, and we leverage the Hyperopt library to do so.
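A minimal sketch of that distributed search; train_and_evaluate is a hypothetical helper that wraps the training loop above and returns the validation IoU:

```python
# A minimal sketch (illustrative search space; train_and_evaluate is a hypothetical
# helper): distribute the hyperparameter search with Hyperopt and SparkTrials.
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

search_space = {
    "encoder": hp.choice("encoder", ["resnet34", "efficientnet-b0"]),
    "lr": hp.loguniform("lr", -8, -3),
}

def objective(params):
    iou = train_and_evaluate(params)             # hypothetical training/evaluation helper
    return {"loss": -iou, "status": STATUS_OK}   # Hyperopt minimizes, so negate the IoU

best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=16,
    trials=SparkTrials(parallelism=4),           # run 4 trials in parallel on the cluster
)
```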
Deploy our model in production
Databricks simplifies this process and accelerates the data science journey from POC to production with the help of MLflow, by providing:
- Automatic experiment tracking to keep track of progress
- Model packaging in MLflow, abstracting our ML framework
- Model registry for governance
- Batch or real-time serving (1 click deployment)
- Model Monitoring, Feature Store and ML pipelining
Our model is now trained. All we have to do is get the best model (based on the valid_per_image_iou metric) and register it in the MLflow registry as our production model. This registry serves as the reference production model for our project.
Save the best model to the registry (as a new version)
Flag this version as production-ready
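A sketch of those two steps, assuming the model was logged under the model artifact path and registered under the placeholder name ship_segmentation:

```python
# A minimal sketch (assumed artifact path and placeholder registered-model name):
# pick the best run by valid_per_image_iou, register it, and flag it as Production.
import mlflow
from mlflow.tracking import MlflowClient

best_run = mlflow.search_runs(
    order_by=["metrics.valid_per_image_iou DESC"], max_results=1
).iloc[0]

version = mlflow.register_model(f"runs:/{best_run.run_id}/model", "ship_segmentation")

MlflowClient().transition_model_version_stage(
    name="ship_segmentation", version=version.version, stage="Production"
)
```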
Our model is now deployed and flagged as production-ready! This provides model governance and simplifies and accelerates all downstream pipeline development. The model can be used in any data pipeline (DLT, batch, or real-time with Databricks Model Serving).
Let’s see how we can use it to run inferences at scale.
Inference at Scale
This is the last step of our pipeline and one of the most complicated:
- The model should be used in batch or real-time, based on the use-case
- The model should be packaged and deployed with all the dependencies
- To simplify integration, model flavour/framework should be abstracted
- Model serving monitoring is required to track model behaviour over time
- Deploying a new model in production should be 1-click, without downtime
- Model governance is required to understand which model is deployed and how it has been trained, ensure schema, etc.
Once our model is in the registry, Databricks simplifies these final steps with a 1-click deployment for:
- Batch inference (high throughput but >sec latencies)
- Real-time (ms latencies)
Environment Recreation
Handling dependencies is often tricky and time-consuming. MLflow solves this challenge by tracking the dependencies saved with the model. We can load them from the registry and pip-install them on our cluster:
Install model requirement for batch inferences
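A minimal sketch, assuming the registered model name used above:

```python
# A minimal sketch (assumed model name): fetch the requirements.txt logged with
# the model and install it on the cluster.
import mlflow

requirements_path = mlflow.pyfunc.get_model_dependencies("models:/ship_segmentation/Production")

# In a Databricks notebook, install the dependencies in a separate cell:
# %pip install -r $requirements_path
```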
Inference — batch mode
Databricks lets you run cost-effective inferences at scale in batch mode. These inferences are integrated into your data pipeline as a streaming or batch job (e.g. every hour) as new images are processed.
The first step is to retrieve our model from the MLflow registry and wrap it as a pandas_udf. Once that is done, the function can be used by an analyst or a data engineer in a pipeline to run our inferences at scale.
Run ad-hoc inferences as simple Python or SQL queries, or integrate them into the DLT pipeline.
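A sketch of the batch scoring, with the placeholder model and table names used above:

```python
# A minimal sketch (assumed model and table names): load the Production model as a
# Spark UDF and score images at scale.
import mlflow

predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/ship_segmentation/Production")

scored = (spark.read.table("ship_detection.gold_images_masks")
               .withColumn("prediction", predict_udf("content")))
display(scored)
```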
Inference — real-time with Model Serving
The second option is to use Databricks Model Serving capabilities for your applications. With 1 click, Databricks starts a serverless REST API serving the model defined in MLflow.
Open your model registry and click on Model Serving. This gives you real-time capabilities without any infrastructure setup; the serving endpoint scales up (and down to 0).
Real-time Inferences with Python REST API on our model-serving endpoint
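A hedged sketch of such a call; the workspace URL, endpoint name, token variable and payload format are all assumptions to adapt to your environment:

```python
# A minimal sketch (placeholder URL, endpoint name, token variable, and payload
# format): call the Model Serving REST endpoint from Python.
import base64
import requests

payload = {
    "dataframe_records": [
        {"content": base64.b64encode(image_bytes).decode("utf-8")}  # image_bytes: raw JPG bytes (placeholder)
    ]
}

response = requests.post(
    "https://<workspace-url>/serving-endpoints/ship_segmentation/invocations",  # placeholder URL
    headers={"Authorization": f"Bearer {databricks_token}"},                    # assumed token variable
    json=payload,
)
print(response.json())
```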
Conclusion
That’s it: we have built an end-to-end pipeline to incrementally ingest our dataset, clean it and train a deep learning model. The model is now deployed and ready for production-grade usage.
Databricks Lakehouse accelerates your team and simplifies the path to production:
- Unique ingestion and data preparation capabilities with Delta Live Tables, making data engineering accessible to all
- Ability to support all use cases, ingesting and processing structured and unstructured datasets
- Advanced ML capabilities for ML training
- MLOps coverage that lets your data science team focus on what matters (improving your business) rather than on operational tasks
- Support for all types of production deployment to cover all your use cases, without external tools
- Security and compliance covered all along, from data security (table ACL) to model governance
As a result, teams using Databricks are able to deploy advanced ML projects in a matter of weeks, from ingestion to model deployment, drastically accelerating the business.
Written by Tarik Boukherissa and Florent Brosse. Thanks to Bala Amavasai and Quentin Ambard for their help.