Sheldon

Sheldon, my team’s project, won 1st place at DataHack 2018. Here is how we conceived and built it. If you are unfamiliar with it, DataHack is the biggest data hackathon in Israel; it takes place once a year in Jerusalem.

Amiel Meiseles
8 min read · Oct 22, 2018

The Vision

In our mind’s eye, we envisioned a product capable of providing a high-level understanding of interpersonal relationships using sensory information. Put simply, Artificial Social Intelligence. This was a lofty goal for four people with a day and a half to work, at least as of October 2018…

When we thought about how Artificial Social Intelligence could be useful in the real world, we quickly realized that it could help people with Asperger’s syndrome to interpret social nuances. The thought that in just 39 hours we could develop an MVP that could improve someone’s life really motivated us!

Why Sheldon you ask?

Sheldon, a character from The Big Bang Theory, often misses social nuance. For us, he was a comical representation of some of the very serious challenges faced by our intended end-users. After the hackathon, we came across a clip of Sheldon with a product that identifies emotions using AI (his product is surprisingly similar to ours).

If you think about it, Sheldon is just the tip of the iceberg. Full social intelligence would be beneficial in countless other ways.

So who are we?

Our team was a mix of data scientists and programmers. Philip Tannor and I are data scientists, and Gal Vinograd and Omri Kaduri are full stack developers. Together we formed Team Sheldon and won first place in both the Intel Social Good Challenge and the Datacup at DataHack 2018.

Team Sheldon

The First Steps

In the first hour of the hackathon we decided on the features we wanted to include in our app and prioritized them as follows:

  1. Face Detection
  2. Facial Emotion Recognition
  3. Identity Detection
  4. High-Level Social Pattern Recognition
  5. Vocal Emotion Recognition

From the outset it was clear to us that we would use a mobile app to showcase our algorithms and models. For this reason our experienced and skilled full stack developers were critical to the project’s success. Working as a joint team of programmers and data scientists was a new experience for all of us.

Data Collection

A quick internet search revealed that many accurate and efficient solutions for face detection existed. We therefore used an existing solution for this task and focused our efforts on the more novel features of the project.

Next, we searched for facial emotion recognition datasets. While we found some results, none quite suited our needs. We wanted raw data so that we could filter it, expand it, or train our model to recognize new emotions.

We ended up using data from a few sources:

Datasets

  • FER2013: ~29,000 pictures of faces, low quality, grayscale
  • KDEF: ~5,000 pictures of faces, same faces from different angles

Scraping

  • A series of keyword searches on Google.

Onsite manual collection

  • We took pictures at DataHack of hackers acting out the different emotions (or trying to…): ~150 pictures in total.

Data Processing

First we cropped the faces from the images using the face detection algorithm implemented in dlib. We cropped even the pre-cropped images to standardize the input to our model. Images in which no face was detected were discarded. In addition, we didn’t use any of the side-facing pictures from KDEF.

Next, we applied a grayscale transform to all of the images and resized them to 48x48 pixels, to match the pictures in FER2013.
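
The whole preprocessing step fits in a few lines. Here is a minimal sketch assuming dlib and OpenCV; the file paths, the choice of the first detected face, and the exact crop boundaries are illustrative rather than our exact code.

```python
# Minimal preprocessing sketch (illustrative, not our exact code):
# detect the face with dlib, crop it, convert to grayscale, resize to 48x48.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def preprocess(image_path, size=48):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)  # upsample once to help with small faces
    if len(faces) == 0:
        return None            # images with no detected face were discarded
    f = faces[0]
    crop = gray[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
    return cv2.resize(crop, (size, size))  # 48x48 grayscale, matching FER2013
```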

We trained our models using four basic emotions: Happy, Sad, Angry and Neutral. These emotions were present in both academic datasets.

The scraped data significantly reduced our model’s accuracy, so it wasn’t used in our final model. Some manual analysis (at about 02:30 AM…) revealed that many of the scraped images were not representative of the emotions we were training the model to recognize. For example, the search results for “sad man” after about twenty pictures looked like this:

Search Results for “sad man”

Emotion Recognition Model

We first tried transfer learning on MobileNet and XCEPTION, which did not work well for us. Our final model was a condensed version of XCEPTION.

During training we augmented the data using rotations of up to 10 degrees, horizontal flipping and ±10% zoom. We used a cross-entropy loss, reduced the learning rate when the loss didn’t improve for more than ten epochs and employed early stopping.
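
In code, the training setup looked roughly like the sketch below. Keras is an assumption here, `model` stands for the condensed XCEPTION network (not shown), `x_train`/`y_train`/`x_val`/`y_val` are the preprocessed images with one-hot labels, and the batch size, patience values and epoch count are illustrative.

```python
# Training-setup sketch (Keras is assumed; hyperparameters are illustrative).
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=10,   # rotations of up to 10 degrees
                               horizontal_flip=True,
                               zoom_range=0.1)      # roughly +/-10% zoom

model.compile(optimizer="adam",
              loss="categorical_crossentropy",      # cross-entropy loss
              metrics=["accuracy"])

callbacks = [ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10),
             EarlyStopping(monitor="val_loss", patience=30)]

model.fit_generator(augmenter.flow(x_train, y_train, batch_size=32),
                    steps_per_epoch=len(x_train) // 32,
                    validation_data=(x_val, y_val),
                    epochs=200,
                    callbacks=callbacks)
```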

Measuring Accuracy

We randomly split the FER2013 and KDEF data into training and validation sets. We used the pictures we took at DataHack as a test set.

Our final model achieved 75% accuracy and about 0.7 cross-entropy on the validation data. Here are some confusion matrices for the geekier readers:
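
For the even geekier: numbers and matrices like these can be reproduced with a few lines of scikit-learn. This is a sketch under the assumptions of the training snippet above (a trained `model` and one-hot labels), not our exact evaluation code.

```python
# Evaluation sketch (assumes scikit-learn and the trained model from above).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

probs = model.predict(x_val)           # per-class probabilities
preds = np.argmax(probs, axis=1)
truth = np.argmax(y_val, axis=1)       # labels are one-hot encoded

print("accuracy:", accuracy_score(truth, preds))
print("cross-entropy:", log_loss(truth, probs))
print(confusion_matrix(truth, preds))  # rows: true class, columns: predicted class
```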

We noticed that the sad and angry classes were significantly harder for our model to recognize than the happy and neutral classes. This is most pronounced in the test results. We discovered that often we disagreed with the labels assigned to the images representing these classes. We suspect that this is due to subtle physical differences between angry and sad expressions (especially in a single image). Another explanation is that, although the data scientists we photographed at DataHack are very talented individuals, acting is not quite their forte.

Building the App

Ideally, we would run the entire model on a dedicated device similar to Google Glass or OrCam. However, to create a first prototype, we used a smartphone as our sensor and an Azure server with a Tesla K80 GPU to do the computational heavy lifting.

Faces were identified on the phone, and a rough crop of each face was sent to the server. First, the server performed a finer-grained crop to get a tight bounding box around the face. It then attempted to identify the face using eigenfaces. Next, it ran our custom emotion recognition model on each crop. The identities and emotions were then sent back to the client to be shown on the screen.
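
A sketch of the server side is below. Flask, the endpoint name and the helper names are assumptions for illustration; the eigenfaces step is approximated with OpenCV’s EigenFaceRecognizer, and `tight_crop` stands for the dlib-based preprocessing described earlier.

```python
# Server-side sketch (Flask, endpoint and helper names are illustrative).
import cv2
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
EMOTIONS = ["Happy", "Sad", "Angry", "Neutral"]
recognizer = cv2.face.EigenFaceRecognizer_create()  # assumed trained beforehand on known faces

@app.route("/analyze", methods=["POST"])
def analyze():
    raw = np.frombuffer(request.data, dtype=np.uint8)
    rough = cv2.imdecode(raw, cv2.IMREAD_GRAYSCALE)   # rough crop sent by the phone

    face = tight_crop(rough)                          # finer-grained 48x48 crop (dlib)
    identity, _ = recognizer.predict(face)            # eigenfaces-based identity guess
    probs = emotion_model.predict(face[np.newaxis, ..., np.newaxis] / 255.0)[0]

    return jsonify({"identity": int(identity),
                    "emotion": EMOTIONS[int(np.argmax(probs))]})
```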

Transitioning from Micro to Macro

We knew we wanted to save historical information about faces and emotions the app identified. The questions were:

  • What do we want to save?
  • At what resolution?
  • How to save/index it?

We decided to save aggregated data for each “appearance” of a face (i.e. from the time a face was identified until it had been absent from the frame for a few seconds).

Some examples of aggregate data are “most common emotion”, “first emotion” and “last emotion”. Average emotion would be the key building block for generating an “emotion distribution” for each face the app recognized. This would enable the app to say “Person x is angry, but he is angry 75% of the time”.
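
A per-appearance summary can be as simple as the sketch below (function and field names are illustrative):

```python
# Sketch of the per-appearance aggregation (names are illustrative).
from collections import Counter

def summarize_appearance(emotions):
    """emotions: non-empty, ordered list of emotions predicted while the face was on screen."""
    counts = Counter(emotions)
    return {
        "first_emotion": emotions[0],
        "last_emotion": emotions[-1],
        "most_common_emotion": counts.most_common(1)[0][0],
        # per-appearance distribution; averaging these over all appearances of a
        # person gives the overall "emotion distribution" mentioned above
        "distribution": {e: c / len(emotions) for e, c in counts.items()},
    }
```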

High-Level Social Pattern Recognition

In order to identify high-level social patterns, we broke them down into basic building blocks that our model could identify. For example:

The app was able to identify the “Cheering person x up” pattern by detecting a person with a neutral expression for some time and then detecting the same person with a happy face later.

Another pattern the app was able to detect was “Insulting person x”: a person with a neutral expression followed by an angry expression.
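
With the appearance summaries above, these rules reduce to a few comparisons. A sketch, with hypothetical names:

```python
# Rule-based pattern detection sketch (names are illustrative).
def detect_patterns(appearances):
    """appearances: chronological per-appearance summaries for one person."""
    patterns = []
    for earlier, later in zip(appearances, appearances[1:]):
        if earlier["most_common_emotion"] == "Neutral":
            if later["most_common_emotion"] == "Happy":
                patterns.append("Cheering person x up")
            elif later["most_common_emotion"] == "Angry":
                patterns.append("Insulting person x")
    return patterns
```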

Challenges

One of our most significant challenges in developing the app was reducing the network latency enough to make it work in real time. We solved this by sending frames to the server intermittently and sending some of the images in low resolution. We considered other solutions, but they were infeasible within the hackathon’s time frame.

During development, we found that some misidentified emotions were bothersome. Therefore, to improve the user experience, we tweaked the app to only report an emotion if one of the classes was assigned a much higher probability than the others, indicating that the model was confident in the classification. If no class stood out, the assigned emotion was “Neutral”.
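
The confidence gate can be expressed in a couple of lines. This is a sketch; the margin value is illustrative, not the threshold we actually shipped.

```python
# Confidence-gate sketch (the margin value is illustrative).
import numpy as np

def gated_prediction(probs, emotions=("Happy", "Sad", "Angry", "Neutral"),
                     margin=0.4):
    order = np.argsort(probs)[::-1]
    top, runner_up = order[0], order[1]
    # Only report an emotion when the top class clearly dominates the runner-up.
    if probs[top] - probs[runner_up] >= margin:
        return emotions[top]
    return "Neutral"
```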

By the end of the hackathon we had a working app that was visually pleasing, worked in real-time and most importantly provided value!

Lessons learned

Scraping is rough. We assumed that scraping images from the internet wouldn’t be very difficult and would enable us to level up our model. Neither of these assumptions was correct.

Sleep is overrated. On the second night of the hackathon our team slept less than 6 hours (all four of us combined!!!).

Data Science + Programming = Success. We were able to create a common language despite our diverse backgrounds. Each team member gave it their all and we wouldn’t have succeeded otherwise.

Closing Remarks

At the end of the day, it was very gratifying to tackle and overcome big technical challenges, collaborate with experts in specialties other than our own and, at the same time, create a prototype of something to do good for the world. We even shared a lot of laughs in the process.

We would like to thank the entire DataHack 2018 staff for an amazing hackathon. The good vibes you spread throughout the event and the smoothness with which everything was run really made it an event to remember!

I would like to thank Philip Tannor for assisting in writing this and Rochelle Meiseles for editing.
