Google AI Researchers Demo System That Auto-Generates Video Spots From Web Pages

Researchers at Google have started showing off demos of an AI system, known as URL2Video, that mines a web page and converts it into a short video.

The system uses AI to extract the visuals and design styles – including fonts, colours, and layouts – and generate a decent-looking, albeit simplistic, video spot from it.

The idea here is that marketing and messaging from one medium can be re-purposed for a second medium, reducing production time and costs.

From Google’s AI blog:

At Google, we’re actively exploring how people can use creativity tools powered by machine learning and computational methods when producing multimedia content, from creating music and reframing videos, to drawing and more. One creative process in particular, video production, can especially benefit from such tools, as it requires a series of decisions about what content is best suited to a target audience, how to position the available assets within the field of view, and what temporal arrangement will yield the most compelling narrative.

But what if one could leverage existing assets, such as a website, to get a jump-start on video creation? Businesses commonly host websites that contain rich visual representations about their services or products, all of which could be repurposed for other multimedia formats, such as videos, potentially enabling those without extensive resources the ability to reach a broader audience.

In “Automatic Video Creation From a Web Page”, published at UIST 2020, we introduce URL2Video, a research prototype pipeline to automatically convert a web page into a short video, given temporal and visual constraints provided by the content owner.

URL2Video extracts assets (text, images, or videos) and their design styles (including fonts, colors, graphical layouts, and hierarchy) from HTML sources and organizes the visual assets into a sequence of shots, while maintaining a look-and-feel similar to the source page. Given a user-specified aspect ratio and duration, it then renders the repurposed materials into a video that is ideal for product and service advertising.
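To make that description a little more concrete, here is a rough Python sketch of the extract-then-sequence idea. This is not Google's code, and the real pipeline also handles fonts, colors, layout and rendering, none of which is attempted here. The names (Asset, AssetExtractor, plan_shots), the tag-based priority ranking and the even split of the duration are all my own assumptions for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Asset:
    kind: str      # "heading", "text", or "image"
    content: str   # the text itself, or the image's src attribute
    priority: int  # lower number = assumed to be more prominent


class AssetExtractor(HTMLParser):
    """Collects headings, paragraphs and images from an HTML string."""

    # Assumed ranking: big headlines first, body copy last.
    PRIORITY = {"h1": 0, "h2": 1, "img": 2, "p": 3}

    def __init__(self):
        super().__init__()
        self.assets = []
        self._open_tag = None  # text-bearing tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.assets.append(Asset("image", src, self.PRIORITY["img"]))
        elif tag in self.PRIORITY:
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        text = data.strip()
        if self._open_tag and text:
            kind = "text" if self._open_tag == "p" else "heading"
            self.assets.append(Asset(kind, text, self.PRIORITY[self._open_tag]))


def plan_shots(html, duration_s, max_shots=4):
    """Return (start_s, length_s, asset) tuples that fill duration_s."""
    parser = AssetExtractor()
    parser.feed(html)
    parser.close()
    # sorted() is stable, so page order is kept among equal priorities.
    ranked = sorted(parser.assets, key=lambda a: a.priority)[:max_shots]
    if not ranked:
        return []
    per_shot = duration_s / len(ranked)  # naive: every shot gets equal time
    return [(i * per_shot, per_shot, asset) for i, asset in enumerate(ranked)]


if __name__ == "__main__":
    page = (
        "<h1>Big Sale</h1>"
        "<p>Everything half off this week.</p>"
        '<img src="shoe.png">'
        "<h2>Shop now</h2>"
    )
    for start, length, asset in plan_shots(page, duration_s=12):
        print(f"{start:4.1f}s +{length:.1f}s  {asset.kind}: {asset.content}")
```

Even a toy like this shows where the hard decisions sit: which assets make the cut, in what order, and how long each one stays on screen.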

Automating content production is not new, though this seems a different approach.

There are video automation platforms that can spit out hundreds or thousands of videos quickly, though they don’t really scan and scrape a web page. They use structured, organized data from a website, like a real estate board’s listings.

There are also plenty of CMS platforms that enable content automation and somewhat dynamic ad building based on real-time data. A template shows the message, product beauty shot and price based on the data set it is mapped to.

This Google effort is different, and quite interesting as a possible tool for digital signage networks that have a lot of messaging changes for simple material. Templates can be used to knock a lot of messages out manually, but it still takes time. This would be very fast.

The challenge I see – nerds may flag more technical challenges – is that it is a garbage in, garbage out thing. If web pages are nicely designed in terms of font, color and image choices, and there is a minimalist design approach, this can work.

But the structural design of a web page's ads needs to share some common thread with what would be seen in a video, and designers need to bear that in mind. If a headline or call to action is 13 words instead of three, will the auto-ad work? If the image is too small, too wide, whatever, is that a problem?

This effort is correctly billed as an R&D lab exercise, so Google is working the idea, not selling the dream.

Small to medium businesses may not think in terms of the structure and full breadth of the marketing materials they need. But for big agencies working on big accounts, with branding and ad campaigns that cross many platforms and have a lot of continuity, I think this could work down the road.

Says Google in its summary:

While this current research focuses on the visual presentation, we are developing new techniques that support the audio track and a voiceover in video editing. All in all, we envision a future where creators focus on making high-level decisions and an ML model interactively suggests detailed temporal and graphical edits for a final video creation on multiple platforms.

Google folks, if you stumble on this post, I reached out. No answer. I’d like to do a podcast on this.
