I trained convolutional neural networks on images of household appliances to simulate a home inspector with machine learning.
The end result is a video like this:
When you purchase a house, you pay $400-$800 for an inspection. The inspector walks through the house looking for defects, out-of-date construction techniques, environmental hazards, and the like. They then write a report, which is effectively a manual for your home.
The inspection is primarily a visual process. An experienced inspector recognizes many rare or old objects. Essentially, this process is a reverse image search – many, though not all, of the items could be recognized from photos.
When I began this project, I was unable to find a pre-made dataset to match my needs. While there are many datasets for testing machine learning, most are challenge datasets (i.e. used to test algorithms), lacking the quality or edge cases required to solve a business problem.
I built a simple app to record stock videos with WebRTC, select a tag, and upload them to S3. Files in S3 are stored in folders named for the tags (washing_machine, sump_pump, furnace, etc).
The training script downloads the videos, extracts frames with ffmpeg, applies pre-processing, then does transfer learning with Apache mxnet / gluon.
mkdir -p /data/videos
aws s3 sync --delete s3://gsieling-video-bucket /data/videos
/data/videos
/data/videos/stairs
/data/videos/stairs/VID_1.mp4
/data/videos/stairs/VID_2.mp4
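The frame extraction step can be sketched roughly as follows. This is a minimal illustration, not the actual training script: the function names, the `/data/frames` output location, and the one-frame-per-second rate are assumptions of mine. It only builds the ffmpeg command lines (using the standard `-i` and `-vf fps=N` options), so you can inspect them before running anything.

```python
from pathlib import Path


def build_ffmpeg_command(video_path, frames_root, fps=1):
    """Build an ffmpeg command that writes `fps` frames per second as JPEGs.

    The tag comes from the video's parent folder, e.g.
    /data/videos/stairs/VID_2.mp4 -> tag "stairs".
    """
    video = Path(video_path)
    tag = video.parent.name
    # One output folder per tag (hypothetical layout); the folder must
    # exist before ffmpeg runs. File names are prefixed with the video name.
    pattern = Path(frames_root) / tag / f"{video.stem}_%05d.jpg"
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}", str(pattern)]


def commands_for_tree(videos_root, frames_root, fps=1):
    """Yield one ffmpeg command per .mp4 found under videos_root/<tag>/."""
    for video in sorted(Path(videos_root).glob("*/*.mp4")):
        yield build_ffmpeg_command(video, frames_root, fps)
```

Each command can then be passed to `subprocess.run(cmd, check=True)` after creating the per-tag output folder.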
It also takes a significant amount of trial and error to set up an effective training process.
It’s important to independently debug and test each stage, to avoid costly mistakes. When I began the project, I trained an algorithm on 2-3 objects, using a pre-made model called MobileNet. This trains in minutes on a CPU, and allowed me to validate my approach. With a more robust dataset of tens of thousands of images, training on a CPU takes days.
The tags this tool learns are: breaker_box, furnace, hot_water_tank, oil_tank, pillar, smoke_detector, stairs, sump_pump, wall, washer, and water_softener. You get better results with more tags. Object classification models always output a tag (which thing it thinks is most likely) – i.e. there’s no clear way to say that it sees nothing.
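To see why a classifier always produces a tag, here is a small sketch of the final softmax step in plain Python. Even when every raw score is identical (the model has no idea), some tag still comes out on top. The `min_confidence` cut-off is an assumption of mine, one possible workaround rather than something this project does.

```python
import math

TAGS = [
    "breaker_box", "furnace", "hot_water_tank", "oil_tank", "pillar",
    "smoke_detector", "stairs", "sump_pump", "wall", "washer",
    "water_softener",
]


def softmax(scores):
    """Turn raw model scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_tag(scores, tags=TAGS, min_confidence=0.0):
    """Return (tag, probability); tag is None below the hypothetical cut-off."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None, probs[best]
    return tags[best], probs[best]
```

With all-zero scores, `top_tag` still reports the first tag at a probability of 1/11, which is exactly the "sees nothing but says something" behavior described above.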
Pre-trained models are a good starting point, and can be obtained from Apache gluon’s model zoo. These have been trained on some existing dataset of images. We just change the tags they output and tune the model to detect our tags.
There is some data setup required to do the actual training (see the full script here). For low-accuracy models and a few hundred images, this trains in minutes, which is enough to prove out that the problem is possible.
Setting up all of the dependencies turns out to be a significant amount of work (i.e. mxnet, opencv, CUDA). It is really valuable to dockerize as much as you can – NVIDIA provides a Docker runtime that helps with this. That being said, their tooling is exceptionally difficult to set up.
After training, we then run the model against all of the images in a video. Shown below is a notebook that steps through the video frame-by-frame, so we can compare the results of different models:
I chose mxnet because AWS seems to be pushing it. For instance, their DeepLens camera comes with it pre-installed.
It has bindings for multiple languages – I like having the ability to convert python code to Scala as my project matures.
MobileNet, ResNet, or VGG16?
The mxnet documentation includes a visualization of model training time vs. accuracy. Models on the outer edge of the curve express the trade-off between accuracy and training time.
When I started this project I used MobileNet, which trains quickly, but isn’t super-accurate. Once I got to the point where I wanted to produce a demo, I used other models.
It’s worth noting that these are not all trained with the same image datasets. Each model behaves differently when it glitches out (e.g. MobileNet defaulted to thinking everything was a sump pump, and could never detect a wall).
It’s also worth noting that while this chart denotes “samples per second”, the models don’t seem to learn information at the same rate – VGG16 consumed images very slowly, but seemed to still improve in accuracy very rapidly.
If you build an image dataset, you can either collect the images yourself or crawl what’s publicly available. For this project I took videos in a couple different houses.
Before landing on this approach, I evaluated the Bing Image API. While this didn’t give me great results, it did help me understand the problem I’m trying to solve much better.
If you search for a term like “cherry flooring”, you quickly see a lot of similarity in the images. Search engines try to find results closest to the search query, which doesn’t give you a lot of diversity in the results.
These also appear to be professional photos of houses staged for sale. Our use case is taking cell phone videos as you walk around a home, which is unlikely to ever produce an image of similar quality.
Images of smoke detectors show a different class of problem – these are all stock photos (i.e. no backgrounds), without much variety of angles.
These are also product photos, which suggests they are likely items currently for sale. If you buy a house, you’re more interested in products that are no longer on the market, for instance items subject to recalls.
Environmental hazards also present interesting challenges. Shown here are images of asbestos fabric, which used to be a common way to insulate pipes, as it doesn’t burn.
From a distance it just looks like a white paper / fabric. You can tell what it is from the texture, but this is unlikely to be visible in a cell phone image.
Other asbestos items can only be recognized through a product database, e.g. 9×9 tile that dates from a certain time period. But without an object of known size in the image, we don’t know how big anything is, whereas a home inspector would have a tape measure.
Fire damage presents an interesting machine learning challenge – this could be detected, but it’s really a change from the natural state of a room. Most of these images online represent total destruction, which is unlikely to be what you’d see if you bought a house that had some damage.
Unlike other scenarios, we also can’t easily generate our own training data.
A third interesting technique for generating images is to create 3-D models, then virtually photograph the objects we care about.
Shown below is an image I generated from stock videos of a sump pump.
The photogrammetry software did a great job rendering the basin around the pump, but lost the top. I found generally that it loses smooth, solid color surfaces, making it a poor choice for rendering appliances.
However, there are depth sensors on the market (e.g. the Intel Realsense devkit cameras). If these make their way into phones, it may become possible to generate higher-quality renderings.
Classification, Object Detection, or Segmentation?
One of the major choices in machine learning is what you want the machine to “learn”. Classification gives you the most likely label for each image (whether or not something is really there). Object detection gives you bounding boxes, and segmentation gives you a boundary line.
Each of those requires successively more detailed training data.
Aside from the labor cost of generating the data, there are clear reasons to prefer classification for this problem.
As seen below, many household objects will take up the entire frame – pipes, wiring, and framing. It would be difficult to draw segmentation lines around these.
A future technique that virtually photographs 3D models might make it possible to auto-generate segmentation lines. Photogrammetry tools create point clouds from images, and meshes from the point clouds – so this assumes that either the points or mesh triangles can be labelled.
Alternately, one could build and run multiple models of different types, or a model with a combined architecture – one part to do classification on things that can only be classified, and one part to produce bounding boxes where possible.
Notebooks are ideal environments for the initial phases of a machine learning project. You can construct a small dataset, write a paragraph for each step, and carefully test each piece. Because each paragraph has access to the global state, you can start running from any point.
Notebooks also allow you to build small debugging tools into your process, which we’ll discuss here. The following examples use Apache Zeppelin, which lets you log a macro that tells what kind of output your paragraph produces (%table gives you an excel-like charting tool, %html assumes the output is html, and can include base 64 encoded images).
Zeppelin also lets you create UI controls (checkboxes).
Spot checking tags
After we’ve pulled a bunch of data, we want to make sure the frames extracted from the videos are being placed in the right locations (i.e. that a folder named “washing machine” in fact has washing machines).
To check this using our notebook, we output a combobox containing a list of tags. When the notebook operator selects a tag, it posts back to the notebook and chooses 25 randomly selected images from the folder, base64 encodes them, and displays the HTML output:
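The sampling and encoding part of that paragraph could look something like this. It is a sketch under my own assumptions (the function name, the thumbnail width, and the list of accepted file extensions are invented for illustration); it returns a string starting with `%html` so Zeppelin renders it when printed.

```python
import base64
import random
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def sample_images_html(tag_dir, count=25, seed=None):
    """Pick up to `count` random images from a tag folder and inline them as HTML."""
    files = sorted(
        p for p in Path(tag_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    rng = random.Random(seed)  # pass a seed to get a repeatable sample
    chosen = rng.sample(files, min(count, len(files)))
    img_tags = []
    for path in chosen:
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        mime = "image/png" if path.suffix.lower() == ".png" else "image/jpeg"
        img_tags.append(f'<img width="160" src="data:{mime};base64,{encoded}"/>')
    return "%html " + "".join(img_tags)
```

In a notebook paragraph you would print the result for whichever tag the operator selected, e.g. `print(sample_images_html("/data/frames/washer"))` (the path here is hypothetical).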
We also want to see stats on the size of our dataset – is it balanced? Are any of the tags completely empty or lacking data?
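A balance check like this only needs a file count per tag folder. The sketch below assumes one folder of extracted frames per tag; the function names are mine. The second function formats the counts with Zeppelin's `%table` prefix (tab-separated columns, newline-separated rows) so they show up in the charting tool.

```python
from pathlib import Path


def count_frames_per_tag(frames_root, extensions=(".jpg", ".jpeg", ".png")):
    """Return {tag: number of image files} for each tag folder under frames_root."""
    counts = {}
    for tag_dir in sorted(Path(frames_root).iterdir()):
        if tag_dir.is_dir():
            counts[tag_dir.name] = sum(
                1 for p in tag_dir.iterdir() if p.suffix.lower() in extensions
            )
    return counts


def as_zeppelin_table(counts):
    """Format the counts as a Zeppelin %table: a header row, then one row per tag."""
    rows = ["tag\tframes"] + [f"{tag}\t{n}" for tag, n in counts.items()]
    return "%table " + "\n".join(rows)
```

Empty or nearly empty tags stand out immediately as zero or very short bars.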
Processed Image Preview
Before we hand images off to the machine learning algorithm, they go through a series of pre-processing steps. They must be resized down to 224 x 224. This can be done by resizing or by cropping. We also apply filters to discourage the algorithm from becoming too sensitive to lighting conditions.
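The difference between resizing and cropping is easiest to see in the arithmetic. Squashing a 1920 x 1080 frame straight to 224 x 224 distorts the aspect ratio; the alternative sketched below scales the short side to 224 and then cuts a centered square. This is an illustration of the geometry only, with a function name of my own choosing, not the pre-processing code from the project.

```python
def center_crop_box(width, height, size=224):
    """Scale so the short side equals `size`, then center-crop a square.

    Returns ((scaled_width, scaled_height), (left, top, right, bottom)).
    """
    scale = size / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)
```

For a landscape cell phone frame this keeps the middle of the picture and throws away the left and right edges, which matters if the object you care about tends to sit off-center.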
Tag an image
After the algorithm finishes we want to test some example tags on images.
A confusion matrix tells you which tags the algorithm mixes up.
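A confusion matrix is just a count of (actual tag, predicted tag) pairs. Here is a minimal plain-Python sketch; the function names are mine, and in practice a library version would do the same job.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (tags, matrix) where matrix[i][j] counts actual tags[i] predicted as tags[j]."""
    pairs = Counter(zip(actual, predicted))
    tags = sorted(set(actual) | set(predicted))
    return tags, [[pairs[(a, p)] for p in tags] for a in tags]


def most_confused(actual, predicted, top=3):
    """List the most frequent wrong (actual, predicted) pairs."""
    wrong = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return wrong.most_common(top)
```

The off-diagonal cells are the interesting ones: a large count in the furnace row, oil_tank column means furnaces are regularly being tagged as oil tanks.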
Read a TSV file
Once your training code is stable, you’ll probably want to export logs as you go, so you can train a bunch of models and do later analysis.
For instance, the below chart shows accuracy over time – this chart is generated in Apache Zeppelin by reading the TSV data from our script.
We know that’s not quite right – we should eventually see the accuracy drop as the algorithm overtrains (i.e. starts to memorize the images we’ve given).
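Reading such a log back takes only a few lines. The column names below (`epoch`, `train_acc`, `val_acc`) are assumptions for illustration, since the actual log format isn't shown here. The second helper finds the epoch where validation accuracy peaks, which is where you would expect overtraining to begin.

```python
import csv


def read_accuracy_log(lines):
    """Parse tab-separated log lines into (epoch, train_acc, val_acc) tuples.

    `lines` can be an open file or any iterable of strings; the first
    line must be the header row.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        (int(row["epoch"]), float(row["train_acc"]), float(row["val_acc"]))
        for row in reader
    ]


def best_epoch(history):
    """Return the epoch with the highest validation accuracy."""
    return max(history, key=lambda row: row[2])[0]
```

If `best_epoch` always returns the last epoch, validation accuracy never dropped, which is a hint that the validation data is too similar to the training data.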
Manual Validation (863 entries)
To confront the above issue, I decided to build a better validation dataset. This is a manually tagged collection of images, where the images are all the frames in a video of me walking around my basement with a cell phone.
Some objects show up more than others, but this helped identify issues not apparent in the “stock videos” used for training. For instance I discovered that some cell phone images were blurry, and included multiple or no objects from our dataset.
To do the tagging, I wrote a Zeppelin notebook that creates an HTML page with each image from a video, which someone manually tags and saves off.
Once we apply the over-time tracking described above, we get more realistic validation results – much lower, and we see that the validation quality does eventually start dropping, as expected.
Augmentation with mxnet filters
To improve this further, we need to ensure that the model doesn’t overfit the training dataset.
Since we’re collecting tons of images by collecting videos, one approach is to randomly apply filters to images in the dataset, preventing the training from memorizing the dataset. This leads to slower training, but higher quality generalization.
Several such filters are offered by mxnet.
For instance, the below (from the mxnet docs) shows examples of random cropping and lighting changes:
While we can train a model, none of our debugging tools so far let us see what it’s thinking.
For certain model architectures, there is a tool called gradcam that renders images of what features of an image are important for a chosen tag.
In my dataset, this image includes both an oil tank and a water softener. The model seems to home in on the oil tank – not just the boundaries, but also a label sticker on the tank:
In this image, shifted slightly left, it homes in on the water filter, perhaps because the entire object is now in-frame, suggesting we should train with more partially occluded objects:
By contrast, here’s an example that is completely wrong – it detects a furnace as an oil tank. Visually, the saliency map seems to have no focus, perhaps because the original image has motion blur. This suggests we might either want to refuse to tag images with significant motion blur, or introduce more blurry images into our training set.
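One common way to flag blurry frames is the variance of the Laplacian: sharp images have strong edges and so a high variance, while blurred ones come out low. To be clear, this is a technique I'm swapping in as an illustration, not something used in this project, and the threshold of 100 is an arbitrary placeholder you would have to tune. The sketch works on a grayscale image given as a list of rows; with OpenCV installed, `cv2.Laplacian(gray, cv2.CV_64F).var()` computes a comparable score.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbor Laplacian over the interior pixels.

    `gray` is a 2-D list of brightness values, at least 3 x 3.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_too_blurry(gray, threshold=100.0):
    """True when the frame has too little edge detail (hypothetical cut-off)."""
    return laplacian_variance(gray) < threshold
```

Frames that fail the check could be skipped at inference time rather than given a tag.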
As you’re working on a project like this, you’ll likely want to return to previous experiments.
It’s important to write code that allows you to reproduce your own results, so that you can compare current techniques to past versions of your application.
Consequently, it’s important to version data, model, and the code simultaneously. To address this, I’ve typically uploaded all three to folders in S3 when the model training completes. It’s helpful to have your training output a notebook that is set up to do inference or further training and debugging on the current version of the process.
I also build docker images containing the libraries I use, so I have control over dependencies.
Many of the tools I built for this project to understand the machine learning process are already readily available, if you follow popular ML courses, like fast.ai.
There is a library (mxboard) to produce log output suitable for tensorboard, so you can get a bunch of great debugging tools.
Machine learning projects often focus on the accuracy of algorithms, rather than solving business problems. The most helpful thing I did was to try to manually collect and tag images – that improved the quality of tags and the architecture of my approach. Using this dataset kept me honest about the quality of my results.
It’s also worth noting that image recognition can often be a supplement to, rather than a replacement for, human judgement (improving recall). Many tasks rely on a person to quickly identify images with potential issues, which are then passed to a more robust scientific test. In a home inspection, this is what you’d do if you found potential asbestos or lead and were concerned.
Using ML in an advisory-only capacity is likely the only way it is suitable for medical applications. There are a number of papers that show how object classification can be used for identifying potential trouble spots for colonoscopies. In this test, potential trouble spots might then be biopsied to check for cancer. This seems to be a way to alert a doctor to a spot they may have missed, rather than a way to automate them out of a job.
- mxboard (tensorboard integration)
Here are example videos produced by this process: