Back in 2006, we were working on a class project. The professor had asked us to implement Sentiment Analysis on Spoken Conversations or Dialogs. As you may be aware, most research on Sentiment Analysis was originally conducted on Text Datasets, e.g. online product reviews, movie reviews, and social media posts. The main challenge for this assignment, just like any other Machine Learning & Natural Language Understanding (NLU) project, was to find a good source of data that we could train and evaluate our algorithms on.
Datasets for Dialog Research:
There were some datasets available for research on spoken language systems or dialogs. These were phone conversations or call-logs where you can hear customers complaining about the products they received, or asking for a refund, technical support etc.
There were some obvious challenges, however, in using some of those datasets. First, you need to listen to hours and hours of call recordings before you can find maybe one or two examples that are clearly emotional. Most of the emotion you will find in these kinds of natural phone conversations tends to be very subtle or mildly expressed. It’s not something your algorithm can pick up very easily.
The other challenge was the annotation of the data. Most of my classmates and colleagues were manually labeling the data themselves, deciding when the customer sounds angry, frustrated etc. Not only is this process very time-consuming, it’s also very subjective. For example, if you ask 2 or 3 annotators to label the emotions in the same dialog, you will be surprised how often even such a small group struggles to come up with consistent labels.
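One common way to put a number on this kind of disagreement is Cohen’s kappa, which compares how often two annotators agree against how often they would agree by chance. The snippet below is only a minimal sketch: the function name and the emotion labels are made up for illustration, and it is not something we necessarily computed in the class project.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label independently,
    # given how often each annotator used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six customer utterances (illustrative data only).
annotator_1 = ["angry", "neutral", "neutral", "frustrated", "neutral", "angry"]
annotator_2 = ["frustrated", "neutral", "neutral", "frustrated", "angry", "angry"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1 means perfect agreement and a kappa near 0 means the annotators agree no more often than chance would predict, so a low value is a quick warning that the labels are too subjective to train on as they are.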
For any Machine Learning project, it is essential to have a good source of data with ample training examples, and to have that data annotated with correct class labels so we can train and test our algorithms. Personally, I was never very fond of manual data annotation: it’s time-consuming and very subjective. So when we started working on this class project, I knew I had to come up with a clever idea or solution to overcome some of these challenges.
Humor Analysis in Television Sitcoms:
One day, I was watching my favorite program on television, F.R.I.E.N.D.S. Somewhere at the back of my mind, I was thinking that I still had a class project to finish (remember, Sentiment Analysis in Spoken Conversations). It was in that moment that it suddenly occurred to me: wait a second, these are spoken conversations or dialogs, and humor or laughter is a kind of sentiment. So why don’t I use this dataset?
If you think about it, it actually makes perfect sense. Every time there is a joke or the actors say something funny, you can hear laughter in the background. So these dialogs are already pre-annotated or pre-labeled. I didn’t have to decide for myself what’s funny and what’s not; somebody had already put those labels in the data itself.
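To make the idea concrete, here is a minimal sketch of how such labels could be derived, assuming we already have the time span of each utterance and the start times of audience laughter. The `Utterance` class, the `label_humor` function, the 1-second window and all the timestamps are hypothetical choices for illustration, not the exact setup from the project.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    text: str
    start: float  # seconds from the beginning of the clip
    end: float


def label_humor(utterances, laughter_starts, max_gap=1.0):
    """Mark an utterance as humorous if laughter starts within max_gap seconds after it ends."""
    labeled = []
    for utt in utterances:
        is_funny = any(utt.end <= t <= utt.end + max_gap for t in laughter_starts)
        labeled.append((utt, is_funny))
    return labeled


# Placeholder dialog turns and one laughter onset (illustrative values only).
utterances = [
    Utterance("Speaker A", "setup line", 0.0, 2.0),
    Utterance("Speaker B", "punchline", 2.5, 4.0),
    Utterance("Speaker A", "follow-up line", 5.5, 7.0),
]
laughter_starts = [4.3]

labels = [is_funny for _, is_funny in label_humor(utterances, laughter_starts)]
print(labels)
```

In practice the window would need tuning, since laughter can begin before a speaker has finished the line, but the point stands: the class labels come from the recording itself rather than from a human annotator.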
Frequency & Intensity of Humor:
The other advantage of using this dataset was the frequency and intensity of humor. Unlike phone conversations or call-logs (where there are hardly 1–2 examples of sentiment in a 1 hr call recording), you can find several jokes even in a 5 min clip of F.R.I.E.N.D.S or any other sitcom for that matter. Also, these are trained professional actors who use their facial expressions and vocal intonations effectively while expressing humor. So the features you can capture (whether it’s voice or facial expressions) are often strong indicators of humor.
Humor Analysis in F.R.I.E.N.D.S (EMNLP 2006):
There is a conference in NLP called EMNLP (Empirical Methods in Natural Language Processing), which happened to have a deadline right around the time we were working on this class project. So when we had our project report ready, I decided to just go ahead and submit it there. When it was accepted for publication, I remember some of the reviewers’ comments, in which they said “this is an ingenious idea, to analyze humor in television sitcoms”.
What excited me the most about this project, however, was not that it was about F.R.I.E.N.D.S or humor in particular. It was the general idea that we can apply AI / Machine Learning algorithms to analyze the content of television programs and movies, or use that content as a source of data to train machine learning or language understanding models.
Internship at SONY, Japan (2008):
My interest in working in this field was reinforced further when I had the opportunity to do an internship at SONY, Japan just 2 years later. I was working with the research team in Tokyo on Music Recommendation. It was during this internship that I could clearly envision potential industry applications of AI / Machine Learning in the Media & Entertainment domain. I also realized that the algorithms we were building (whether a search or a recommendation engine) need not run only on desktops and servers; they can be plugged or embedded directly into our Smart TVs, Car Media Players, Gaming Consoles and other Home Entertainment devices. Today, we can see this trend towards Embedded AI, with most consumer electronics companies working to build AI / ML applications directly into home appliances and consumer electronic devices.
Early Challenges & State of the Art:
Back in 2006–2008, there were many practical challenges in implementing and executing some of these ideas. For instance, if we had wanted to capture visual features like gestures or facial expressions in F.R.I.E.N.D.S back then, it could easily have turned into a massive decade-long research project. Today, we have a number of software libraries available for Deep Learning, Computer Vision and Image Processing tasks (e.g. PyTorch, Keras, TensorFlow, OpenCV, OpenPose), as well as pre-trained models (COCO-SSD, MobileNet, PoseNet) and publicly available datasets (like ImageNet, Kinetics) that make it feasible to build prototypes or MVPs in a matter of days, or a couple of weeks at most.
On the negative side though, most data science and analytics teams I have recently spoken or collaborated with seem more enthusiastic about the technical or implementation details (the algorithm, deployment platform, engineering services, model training, parameter tuning etc.) than about the core idea or the domain itself. For me, if the idea is captivating and interesting enough, the implementation and execution often follow naturally and are just a matter of putting the pieces together. The process of compiling code from git repositories, or calling Python SDKs and REST APIs in a Jupyter Notebook, may not be fun by itself unless the problem we are trying to solve with all those libraries and tools is exciting. Conversely, even boring and tedious tasks like data annotation (mentioned earlier) can be fun if the data you are trying to label contains fashion show videos, showing beautiful models in colorful dresses walking down the Milan runways. Ask my team, I am not joking! :-)
Scope & Opportunity:
There is a variety of content out there in the form of television programs, movies and online videos that we can potentially use as a source of data in our AI / Machine Learning experiments. If you look at television programs alone, for example, we have: sports channels, news broadcasts, fashion show videos, music / dance videos, action, comedy, drama, children’s animation films, even all those documentaries we see on the National Geographic or Discovery channels. It’s pretty mind-blowing what you can potentially create with all this data!
Below are just a few examples of fun applications I have recently worked on:
- Computational Music Generation using Hand Gestures
- Sports Activity Recognition using Transfer Learning
- Object Detection to identify Animal Species in Wildlife Films
- Object Detection to identify famous Movie and Cartoon Characters
- Chroma-Key Effect to replace Background in Real-Time
- Capturing Body Movements & Postures in Dance & Action Videos
- Capturing Facial Expressions from Comedy Movies
- Mapping Facial Expressions into Animated Characters
- Artistic Style Transfer to create Visual Effects (VFX)
The above use-cases cover not only different genres of videos (e.g. sports, music, animation, comedy, wildlife documentaries etc.), but also different algorithms and techniques, such as Object Detection, Image Classification, Transfer Learning, Generative Neural Networks, Style Transfer, Video Segmentation, and Recognizing Gestures / Facial Expressions.