Multi-Model Machine Learning Across Traits


– Siddharth Gupta [Member of Technical Staff]

Current neural network architectures mostly learn from a single trait: most focus on either visual processing or sound recognition, among others. In reality, information arrives from many sources, and processing this multi-model information is something the human brain does with ease. We have encountered use cases where the information to be processed has multiple traits, such as video with audio or video with Doppler sounds. Does a model predict better when it is trained on multiple traits? Although the answer is intuitively clear, we will still give it a test run!

The structure of a neural network is quite analogous to the brain: multiple neurons fire signals from different areas and merge at a common point. Artificial neural networks, however, are still highly specific to a single trait. A model can be trained to recognize a person from either image or sound characteristics, but it will fail to converge when both traits are provided simultaneously. Training on a single trait also has its own weaknesses. For example, image object detection struggles with occlusion, where figures overlap, while in audio it is hard to find a pattern because the same semantic content can be produced at varying frequencies and pitches.

Today we present a multi-model neural network, an architecture that can be trained on multiple real-world inputs. To make a prediction, the network takes the results from its subnetworks and outputs the result with the highest probability. This lets the system draw on the power of multiple traits.
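The prediction step described above can be sketched in a few lines of Python. This is only one plausible reading of "the result with the highest probability", not the actual architecture: it assumes each subnetwork emits a dictionary of class probabilities, and the function name `fuse_max_probability`, the labels, and the probability values are all hypothetical.

```python
from typing import Dict, List, Tuple


def fuse_max_probability(subnet_outputs: List[Dict[str, float]]) -> Tuple[str, float]:
    """Return the (label, probability) pair with the highest probability
    found across all subnetwork outputs (an assumed fusion rule)."""
    if not subnet_outputs:
        raise ValueError("need at least one subnetwork output")
    best_label, best_prob = "", -1.0
    for probs in subnet_outputs:
        for label, p in probs.items():
            if p > best_prob:
                best_label, best_prob = label, p
    return best_label, best_prob


# Illustrative values only: the image subnetwork is unsure,
# the audio subnetwork is confident.
image_probs = {"elevator": 0.55, "no_elevator": 0.45}
audio_probs = {"elevator": 0.10, "no_elevator": 0.90}
label, prob = fuse_max_probability([image_probs, audio_probs])
```

With this rule, a confident subnetwork can override an uncertain one, which is the intuition behind combining traits.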

To test the architecture, we performed some simple experiments. We have surveillance devices installed in elevators that capture audio and video data.

Objective: To recognize whether the input is from an elevator or not.

Dataset: For this experiment, timestamped image and audio samples were collected from our test cameras installed in elevators. Image samples are .jpg files and recorded audio is in .wav format. Both the audio and video inputs are normalized. The audio input is clipped to a fixed size, and preprocessing removes background noise and periods of silence.
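As a rough illustration of the audio steps listed above, the sketch below peak-normalizes a list of samples, drops near-silent samples, and clips or zero-pads the result to a fixed length. It is a simplified stand-in, not the pipeline used here: the function name `preprocess_audio`, the threshold value, and the per-sample silence test are assumptions, and background-noise removal is not covered.

```python
from typing import List


def preprocess_audio(samples: List[float], target_len: int,
                     silence_threshold: float = 0.02) -> List[float]:
    """Normalize, remove silence, and force a fixed length (illustrative)."""
    # Peak-normalize so the loudest sample has magnitude 1.0.
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > 0:
        samples = [s / peak for s in samples]
    # Crude silence removal: drop samples below the (assumed) threshold.
    voiced = [s for s in samples if abs(s) >= silence_threshold]
    # Clip to the fixed size, zero-padding if the clip is too short.
    clipped = voiced[:target_len]
    return clipped + [0.0] * (target_len - len(clipped))


processed = preprocess_audio([0.0, 0.5, -1.0, 0.001, 2.0], target_len=4)
```

A production pipeline would more likely detect silence over short frames using signal energy rather than sample by sample.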


Fig 1: Raw sound waves (top) and a processed sound wave (bottom)

Experiment 1

In phase 1, we trained separate models for image and audio. For image recognition, we trained the state-of-the-art Inception v3 architecture with transfer learning to distinguish elevator images from non-elevator images.

For audio recognition, a WaveNet neural network was trained on the audio files to distinguish elevator sounds from non-elevator sounds.

Accuracy is calculated as the number of correct detections out of the total number of samples.

The observations are as follows:

For images: Accuracy = 81%

For sound: Accuracy = 67%
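For reference, here is a minimal sketch of an accuracy calculation, assuming accuracy means correct predictions divided by total samples. The function name `accuracy` and the toy labels are illustrative and are not the experiment's data.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example: 3 of 4 predictions match the labels.
toy_predictions = ["elevator", "no_elevator", "elevator", "elevator"]
toy_labels = ["elevator", "no_elevator", "no_elevator", "elevator"]
toy_accuracy = accuracy(toy_predictions, toy_labels)
```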

Experiment 2

In this phase, we trained a single model with ibeyonde’s multi-model architecture. Images and audio are matched by timestamp and given as input to the neural network for training. The final prediction indicates whether or not an elevator is detected.
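The timestamp mapping could look something like the sketch below, which pairs each image with the nearest audio clip within a tolerance. The matching rule, the function name `pair_by_timestamp`, the tolerance, and the timestamps are all assumptions made for illustration; the actual mapping may differ.

```python
from bisect import bisect_left
from typing import List, Tuple


def pair_by_timestamp(image_times: List[float], audio_times: List[float],
                      tolerance: float) -> List[Tuple[float, float]]:
    """Pair each image timestamp with the nearest audio timestamp,
    keeping only pairs that are at most `tolerance` seconds apart."""
    audio_sorted = sorted(audio_times)
    pairs = []
    for t in image_times:
        i = bisect_left(audio_sorted, t)
        # The nearest audio timestamp is just before or just after position i.
        candidates = audio_sorted[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda a: abs(a - t))
        if abs(nearest - t) <= tolerance:
            pairs.append((t, nearest))
    return pairs


# Illustrative timestamps in seconds; the third image has no audio nearby.
pairs = pair_by_timestamp([10.0, 20.0, 35.0], [9.5, 21.0, 50.0], tolerance=1.0)
```

Each resulting pair would then form one joint training sample, with the image going to one subnetwork and the audio clip to the other.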

Accuracy = 91.4%


The combined training in Experiment 2 produced a better result. Because the model is trained on both audio and video, at any point in time it can analyze two inputs and predict a better result than the models in Experiment 1.

This approach can be useful for many real-time applications where the judgment on a single trait is not reliable.

You can reach me at to get more details.
