You've probably heard that a picture is worth a thousand words. But can a large language model (LLM) get the picture if it has never seen images before?
As it turns out, language models trained purely on text have a solid understanding of the visual world. They can write image-rendering code to generate complex scenes with intriguing objects and compositions, and even when that knowledge is applied improperly at first, they can refine their images. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) observed this when they prompted language models to revise their own code for different images: the models steadily improved their simple clip-art drawings with each query.
The visual knowledge of these language models is gained from the way concepts such as shapes and colors are described across the internet, whether in language or in code. Given an instruction like “Draw a parrot in the jungle,” the user jogs the LLM to consider what it has read in such descriptions before. To assess how much visual knowledge LLMs actually have, the CSAIL team organized a “vision checkup”: they used a “visual aptitude dataset” to test the models’ abilities to draw, recognize, and self-correct visual concepts. The researchers then collected the final drafts of these illustrations and used them to train a computer vision system that identifies the content of real photos.
“We essentially train a vision system without directly using any visual data,” said Tamar Rott Shaham, co-senior author of the study and a postdoc in MIT Electrical Engineering and Computer Science (EECS) at CSAIL. “Our team queried language models to write image-rendering code to generate data for us, and then trained the vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other mediums, such as text. To express their visual knowledge, LLMs can use code as a common ground between text and vision.”
To build this dataset, the researchers first queried the models to generate code for a variety of shapes, objects, and scenes. They then compiled that code to render simple digital illustrations, such as a row of bicycles, showing that the LLM understood spatial relationships well enough to draw the two-wheelers in a horizontal row. In another example, the model combined two random concepts to generate a car-shaped cake, and it demonstrated the ability to create visual effects by generating a glowing light bulb.
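As a rough, hypothetical sketch of this query-and-render step (not the authors' published pipeline), the snippet below assumes a stand-in query_llm helper for whichever chat API is used, asks the model for matplotlib drawing code, and executes that code to produce one synthetic illustration.

```python
# Minimal sketch of the "draw" step, assuming a hypothetical query_llm helper
# and matplotlib as the rendering backend. Illustrative only, not the paper's
# exact pipeline.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def query_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a text-only LLM and return the
    Python code it writes. Replace with a real API call."""
    raise NotImplementedError


def render_concept(concept: str, out_path: str) -> str:
    # Ask the model to express its visual knowledge as rendering code.
    code = query_llm(
        f"Write self-contained Python matplotlib code that draws {concept} "
        f"and saves the figure to '{out_path}'. Return only code."
    )
    # Execute the generated program; its saved image becomes one synthetic example.
    exec(code, {"plt": plt, "__name__": "__main__"})
    return code


# Example: render_concept("a row of bicycles", "bicycles.png")
```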
“Our research shows that when you query an LLM (without multimodal pretraining) to create an image, it knows much more than meets the eye,” says co-author Pratyusha Sharma, an EECS PhD student and CSAIL member. “Say you ask it to draw a chair. The model knows other things about this piece of furniture that may not have been rendered right away, so users can query the model to improve the visual it produces with each iteration. Surprisingly, the model can iteratively enrich the drawing by improving its rendering code to a significant extent.”
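The iterative refinement Sharma describes can be pictured as a simple text-only loop: the model is repeatedly shown its current rendering code and asked to enrich it, without ever seeing the resulting image. Below is a hedged sketch of that idea, reusing the hypothetical query_llm helper from the earlier snippet; the prompts are illustrative.

```python
def improve_drawing(concept: str, rounds: int = 3) -> str:
    """Hypothetical sketch: ask the model for drawing code, then repeatedly
    feed that code back and ask for a richer version (text in, text out)."""
    code = query_llm(
        f"Write self-contained Python matplotlib code that draws {concept}. "
        "Return only code."
    )
    for _ in range(rounds):
        # The model never sees pixels; it only revises its own program.
        code = query_llm(
            f"Here is matplotlib code that draws {concept}:\n\n{code}\n\n"
            "Improve this code so the drawing is more detailed and realistic. "
            "Return only the full, updated code."
        )
    return code
```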
The researchers collected these illustrations and used them to train a computer vision system that can recognize objects in real-world photographs, even ones it had never seen before. With this synthetic, text-generated data as its only reference point, the system outperformed counterparts trained on other procedurally generated image datasets at recognizing content in real photographs.
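As a simplified, hypothetical illustration of this last step (generic torchvision boilerplate, not the paper's actual training recipe), one could train a small classifier on a folder of LLM-rendered illustrations, one sub-folder per concept, and then evaluate it on real photos of the same concepts.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# "llm_renders/" is an assumed directory of images produced by rendering code
# like the sketches above, organized as one sub-folder per concept.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("llm_renders/", transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True)

# A small standard backbone trained from scratch on the synthetic images.
model = models.resnet18(weights=None, num_classes=len(train_set.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()

# The trained model can then be tested on natural photographs of the same concepts.
```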
The CSAIL team believes that combining an LLM’s hidden visual knowledge with the artistic capabilities of other AI tools, such as diffusion models, could also be beneficial. Systems like Midjourney lack the know-how to consistently fine-tune the details of an image, making it hard to handle requests like reducing how many cars are in a picture or placing an object behind another. If an LLM sketched out the requested change for the diffusion model beforehand, the resulting edit could be more satisfying.
The irony, as Rott Shaham and Sharma acknowledge, is that LLMs sometimes fail to recognize the very concepts they can draw. This became clear when the models misidentified human re-creations of images within the dataset. Such diverse representations of the visual world likely triggered the language models’ misconceptions.
While the models struggled to recognize these abstract depictions, they showed creativity in drawing the same concepts differently each time. When the researchers asked the LLMs to draw concepts such as strawberries and arcades multiple times, the models produced pictures from diverse angles with varying shapes and colors, hinting that they may have genuine mental imagery of visual concepts rather than merely reciting previously seen examples.
The CSAIL team believes this procedure could serve as a baseline for evaluating how well generative AI models can train computer vision systems. The researchers are also looking to expand the tasks on which they challenge language models. As for the recent study, the MIT group notes that they do not have access to the training sets of the LLMs they used, making it challenging to further investigate the origin of this visual knowledge. In the future, they plan to explore training an even better vision model by having the LLM work directly with it.
Sharma and Rott Shaham are joined on the paper by former CSAIL affiliate Stephanie Fu '22, MNG '23, and EECS PhD students Manel Baradad, Adrián Rodríguez-Muñoz '22, and Shivam Duggal, all CSAIL affiliates, as well as MIT associate professor Phillip Isola and professor Antonio Torralba. Their work was supported in part by grants from the MIT-IBM Watson AI Lab, the La Caixa Fellowship, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. They are presenting their paper this week at the IEEE/CVF Computer Vision and Pattern Recognition Conference.