The Internet is flooded with instructional videos that can teach curious viewers everything from cooking the perfect pancake to performing the life-saving Heimlich Maneuver.
But pinpointing exactly when and where specific actions occur in a long video can be tedious. To streamline the process, scientists are trying to teach computers to perform this task. Ideally, the user would describe the desired action and the AI model would jump to that location in the video.
However, training machine learning models to do this typically requires large amounts of expensive, hand-labeled video data.
A new, more efficient approach from MIT researchers and the MIT-IBM Watson AI Lab uses only video and automatically generated scripts to train a model that performs this task, called spatiotemporal grounding.
The researchers teach a model to understand unlabeled videos in two distinct ways: by looking at small details to determine where objects are located (spatial information) and by looking at the bigger picture to understand when an action occurs (temporal information).
Compared with other AI approaches, their method more accurately identifies actions in long videos that contain multiple activities. Interestingly, the researchers found that training on spatial and temporal information simultaneously makes the model better at identifying each individually.
In addition to streamlining online learning and virtual training processes, this technology can also be useful in healthcare settings, for example by quickly finding key moments in videos of diagnostic procedures.
“Rather than trying to encode spatial and temporal information all at once, we think of it as two experts working on their own, which turns out to be a clearer way to encode the information. Our model, which combines these two separate branches, delivers the best performance,” says Brian Chen, lead author of a paper on the technique.
Chen, a 2023 graduate of Columbia University who conducted this research while a visiting student at the MIT-IBM Watson AI Lab, is joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and principal investigator of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Computer Vision and Pattern Recognition Conference.
Global and local learning
Researchers typically teach models to perform spatiotemporal grounding using videos in which humans annotate the start and end times of specific tasks.
Not only is this data expensive to generate, but it can be difficult for humans to know exactly what to label. If the action is “cooking pancakes,” does it begin when the cook starts mixing the batter, or when the batter is poured into the pan?
“This time it might be cooking, next time it might be fixing a car. There are many different domains in which people can annotate. But if you can learn everything without labels, this is a more general solution,” says Chen.
For their approach, the researchers use unlabeled instructional videos and the accompanying text transcripts from websites such as YouTube as training data. These require no special preparation.
They split the training process into two parts. First, they teach a machine-learning model to look at the entire video and understand what actions happen at certain times. This high-level information is called a global representation.
Second, they teach the model to focus on the specific region in the parts of the video where the action occurs. In a large kitchen, for instance, the model might only need to focus on the wooden spoon a chef is using to mix pancake batter, rather than the entire counter. This fine-grained information is called a local representation.
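The article does not include the researchers' code, but the two-branch idea can be illustrated with a minimal PyTorch-style sketch. It assumes per-frame and per-region features from a pretrained visual backbone and sentence embeddings from the transcript; the layer sizes, names, and the simple contrastive objective below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): a global (temporal) branch decides
# *when* an action happens, a local (spatial) branch decides *where*, and a
# contrastive objective ties both to the narration without manual labels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchGrounding(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Global branch: a small transformer over per-frame features.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Local branch: a projection of per-region features.
        self.spatial_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, frame_feats, region_feats, text_feats):
        # frame_feats:  (B, T, D)    one feature per frame
        # region_feats: (B, T, R, D) one feature per spatial region per frame
        # text_feats:   (B, D)       one feature per narrated sentence
        g = F.normalize(self.temporal_encoder(frame_feats), dim=-1)   # (B, T, D)
        loc = F.normalize(self.spatial_proj(region_feats), dim=-1)    # (B, T, R, D)
        t = F.normalize(self.text_proj(text_feats), dim=-1)           # (B, D)

        # Grounding scores used at inference time.
        temporal_scores = torch.einsum("btd,bd->bt", g, t)    # when does it happen?
        spatial_scores = torch.einsum("btrd,bd->btr", loc, t)  # where does it happen?

        # Clip-level embedding for the self-supervised objective: frames that
        # best match the narration dominate the pooling.
        weights = temporal_scores.softmax(dim=1)                       # (B, T)
        clip_emb = F.normalize(torch.einsum("bt,btd->bd", weights, g), dim=-1)
        return temporal_scores, spatial_scores, clip_emb, t


def contrastive_loss(clip_emb, text_emb, temperature=0.07):
    # InfoNCE: each narrated sentence should match its own clip within the
    # batch and no other -- no human start/end labels are needed.
    logits = clip_emb @ text_emb.T / temperature               # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


if __name__ == "__main__":
    model = TwoBranchGrounding()
    frames = torch.randn(4, 32, 256)      # 4 clips, 32 frames each
    regions = torch.randn(4, 32, 8, 256)  # 8 candidate regions per frame
    text = torch.randn(4, 256)            # one narrated sentence per clip
    when, where, clip_emb, t = model(frames, regions, text)
    print(contrastive_loss(clip_emb, t).item())
```

In a sketch like this, the only training signal is whether a sentence and a clip belong together, which is why no hand-labeled start and end times are required.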
The researchers incorporate additional components into their framework to mitigate misalignments that occur between the narration and the video. Perhaps the chef talks about cooking the pancake first and performs the action later.
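One plausible way to handle such misalignment, shown here purely as an assumption rather than the paper's mechanism, is to let each narrated sentence softly select the frames it actually matches within a window around its transcript timestamp:

```python
# Illustrative sketch: instead of trusting the transcript timestamp, pool the
# frames near it, weighted by how well each one matches the sentence.
import torch
import torch.nn.functional as F


def soft_align(sentence_emb, frame_embs, center, window=16, temperature=0.07):
    """
    sentence_emb: (D,)   embedding of one narrated sentence
    frame_embs:   (T, D) embeddings of all frames in the clip
    center:       int    frame index suggested by the transcript timestamp
    Returns a pooled video feature weighted toward the frames that best match
    the sentence, so narration spoken before or after the action still
    latches onto the right moment.
    """
    lo = max(0, center - window)
    hi = min(frame_embs.size(0), center + window + 1)
    candidates = F.normalize(frame_embs[lo:hi], dim=-1)          # (W, D)
    query = F.normalize(sentence_emb, dim=-1)                    # (D,)
    weights = (candidates @ query / temperature).softmax(dim=0)  # (W,)
    return weights @ candidates                                  # (D,)
```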
To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. In contrast, most AI techniques are trained using clips of just a few seconds that someone has trimmed to show only a single action.
A new benchmark
But when the researchers went to evaluate their approach, they could not find an effective benchmark for testing a model on these longer, uncut videos, so they created one.
To build their benchmark dataset, the researchers devised a new annotation technique that works well for identifying multi-step tasks. Rather than having annotators draw boxes around important objects, they ask them to mark the intersection of objects, such as the point where a blade cuts a tomato.
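For illustration, a single point annotation of this kind could be stored as simply as the record below; the field names are hypothetical, not the benchmark's actual schema.

```python
# Illustrative only: one possible record for a single point annotation.
from dataclasses import dataclass


@dataclass
class PointAnnotation:
    video_id: str       # which instructional video
    query: str          # the action being grounded, e.g. "cut the tomato"
    time_sec: float     # when the annotator saw the interaction
    x: float            # normalized horizontal position of the point (0-1)
    y: float            # normalized vertical position of the point (0-1)
    annotator_id: int   # several people may annotate the same moment


example = PointAnnotation(
    video_id="pancakes_001",
    query="pour the batter into the pan",
    time_sec=83.4,
    x=0.41,
    y=0.62,
    annotator_id=2,
)
```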
“Providing a clearer definition speeds up the annotation process, which saves labor and cost,” says Chen.
In addition, having multiple people annotate points on the same video can better capture actions that occur over time, such as milk being poured, because annotators won’t all mark exactly the same point in the flow of liquid.
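One way such multi-annotator points could be used for scoring, again stated as an assumption rather than the paper's exact metric, is to count a prediction as correct when it lands near any annotator's point for that moment:

```python
# Illustrative scoring sketch: the distance threshold and the "near any
# annotator" rule are assumptions, not the benchmark's actual metric.
import math


def point_hit(pred_xy, annotator_points, radius=0.05):
    """pred_xy and each annotator point are (x, y) in normalized coordinates.
    A prediction counts as a hit if it lands near any annotator's point,
    tolerating the natural spread in where people mark a flowing action."""
    px, py = pred_xy
    return any(math.hypot(px - ax, py - ay) <= radius for ax, ay in annotator_points)


def accuracy(predictions, all_annotations, radius=0.05):
    hits = sum(point_hit(p, pts, radius) for p, pts in zip(predictions, all_annotations))
    return hits / max(1, len(predictions))
```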
When the researchers used this benchmark to test their approach, they found it was more accurate at pinpointing actions than other AI techniques.
Their method was also better at focusing on human-object interactions. For example, if the task is “serve pancakes,” many other approaches might focus only on the key object, such as a pile of pancakes sitting on the counter. Instead, their method focused on the actual moment when a chef flips a pancake onto a plate.
“Existing approaches rely heavily on human-labeled data, so they do not scale well. This work takes a step toward solving that problem by providing a new way to localize events in space and time using the speech that naturally occurs within them. This kind of data is ubiquitous, so in theory it would be a powerful learning signal. However, it is often quite unrelated to what is shown on screen, which makes it difficult to use in machine-learning systems. This work helps address that issue, making it easier for researchers to create systems that use this form of multimodal data in the future,” says Andrew Owens, an assistant professor of electrical engineering and computer science at the University of Michigan.
Next, the researchers plan to enhance their approach so the model can automatically detect when the narration and the video are misaligned, and switch focus from one modality to the other. They also want to extend the framework to audio data, since there are usually strong correlations between actions and the sounds objects make.
“AI research has made incredible progress in creating models, like ChatGPT, that understand images. However, our progress in video understanding lags far behind. This work represents an important step in that direction,” says Kate Saenko, a professor of computer science at Boston University who was not involved in the work.
This research is funded in part by the MIT-IBM Watson AI Lab.