Creative Automation with ML

DK

Adapting state-of-the-art generative models to creative applications

March 15th, 2021

Large-scale natural language models are becoming increasingly common, and their accessibility is steadily improving. Google’s T5 and OpenAI’s GPT-3, for example, clearly demonstrate the power of transfer learning in NLP models trained on a wide variety of natural language examples. At the same time, this vast potential remains largely untapped in commercial applications. Although OpenAI’s GPT-3 API has already sparked a number of startups (copy.ai, for one), such businesses currently remain limited to “first order execution” patterns, like auto-generated social media channels and blog posts, that still require a heavy human touch before the results can be accepted.

At the same time, longer and more involved creative automation pipelines built on generative models appear much harder to design. A canonical example is a children’s illustrated book. The compelling nature of this use case is well known to any parent: kids love fairy tales, and an unlimited supply of unique bedtime stories would clearly be useful.

The baseline cost of producing an illustrated story is several weeks’ worth of work from a professional writer and an illustrator. While it might be tempting to simply offload this task to the machine, the production pipeline would have several missing parts. First, a human writer is still needed to seed the generative model with appropriate prompts, control the development of the story arc, and make sure long-term dependencies are preserved. Without this oversight, the generative model drifts away in one aspect of the plot or another. Here is an example of GPT-3 text that loses coherence in the story arc:

One night, Greta was sitting in her room, feeling sad, when suddenly the door opened and the genie appeared. “I’ve come to take you home,” he said. “You can never be ordinary again, but I can give you one more wish.” Greta thought for a moment and then said, “I wish that just for one day I would be the ugliest girl in the world.” The genie granted her wish, and Greta became the ugliest girl in the world. But Greta was happy at last.<… > Suddenly she heard a strange noise. She looked up, and there, sitting on the fence, was the ugliest girl in the world. She stared at Greta, and Greta stared back.

An attempt to auto-generate illustrations for a fairy tale without human oversight would likely fare even worse. Although models like DALL-E can produce stunningly precise drawings from a prompt, extracting key scenes from the story and maintaining a consistent drawing style (same-looking characters across a series of illustrations) remain unsolved problems.

One workaround for the consistency issue is to limit creative generation to small pieces that have no long-range dependencies. A particularly good candidate in this space is small-business advertising on social media, where budgets are small and brand-book compliance is not critical.

So let us further assume we want to auto-generate an internet ad campaign for a given product. An ad here is a tuple of components: the copy (a written tagline) and the visual (a still image or video). Generating creative copy is a problem a large NLP model like GPT-3 can generally solve: even a minimal amount of fine-tuning should be enough to translate product descriptions into short, memorable taglines, from which a customer can choose the few that capture the spirit of their product.
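As a rough illustration of this step, here is a minimal sketch that uses the OpenAI Completion API (as it was exposed in 2021) with a few-shot prompt standing in for fine-tuning; the prompt text, engine choice, and the generate_taglines helper are illustrative assumptions rather than the exact setup behind this project.

```python
# Minimal sketch of tagline generation with the OpenAI Completion API.
# The few-shot examples and engine name are illustrative assumptions.
import openai

openai.api_key = "sk-..."  # your OpenAI API key

FEW_SHOT_PROMPT = """Write a short, memorable advertising tagline for the product.

Product: handmade leather backpacks for urban commuters
Tagline: Carry the city in style.

Product: organic cold-brew coffee delivered to your door
Tagline: Wake up to something better.

Product: {description}
Tagline:"""

def generate_taglines(description, n=5):
    """Return n candidate taglines for a product description."""
    response = openai.Completion.create(
        engine="davinci",          # GPT-3 engine available in 2021
        prompt=FEW_SHOT_PROMPT.format(description=description),
        max_tokens=20,
        temperature=0.9,           # higher temperature -> more varied candidates
        n=n,
        stop=["\n"],               # stop at the end of the tagline line
    )
    return [choice["text"].strip() for choice in response["choices"]]

print(generate_taglines("specialty coffee roasted by Bad Brothers Coffee"))
```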

A deeper issue arises when creating a visual to match this tagline. The difficulty is that the scene (the content and plot of the visual) is not always well connected to the tagline, and this is where human creativity traditionally shines. Apple’s famous “1984” Super Bowl advertisement, for example, paired copy announcing the Macintosh launch (“you’ll see why 1984 won’t be like ‘1984’”) with a video of an athlete hurling a sledgehammer into a giant screen. Because the gap between text and image scene in an ad can be this wide, a naive attempt to generate the visual from the tagline typically produces an underwhelming result. This stumble is a direct consequence of the fact that taglines are usually quite abstract in semantic space: a tagline like “Nike gives you wings” could easily be interpreted as a Greek goddess handing someone wings, while the intended meaning (a company making fast running shoes for athletes) remains quite indirect. So, while a human has no problem conceiving at least some matching visual for such a tagline, the machine can easily get stuck in a too-literal interpretation of the prompt. For example, a variational autoencoder generating visuals from the semantic latent space of “Nike gives you wings” could produce a picture similar to the one below: it has some semblance of wings and a swoosh, but it does not cohere into a human-interpretable scene.

VAE interpretation of the slogan “Nike gives you wings”

Even more sophisticated generative models (like DALL-E) that could produce visually appealing results would still root them in the tagline’s semantic space and thus miss the “creative” leap of connecting text with images.

To work around this problem, we can apply two techniques. First, we can recover (tagline -> scene) mappings from the collective human intelligence captured in previously aired commercial advertisements. That is, we add an intermediate step (tagline -> existing ad -> candidate image) and hope to recover a “good enough” visual plot. Second, we can ground any semantic embeddings that partially interpret the tagline in stock photo banks, under the assumption that photographers have already taken care of scene composition, so we get human-interpretable visuals as a side effect of finding semantically relevant objects.
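Here is a minimal sketch of the first technique, under the assumption that the ad and stock-photo collections have already been embedded into a shared semantic space (for example with CLIP) and stored as L2-normalized matrices; the file names and the tagline_to_candidates helper are hypothetical.

```python
# Sketch of the two-hop lookup: tagline -> most similar existing ad -> stock photos
# that match that ad's scene. The .npy files are assumed to hold L2-normalized
# embeddings produced offline by a joint text/image encoder such as CLIP.
import numpy as np

ad_embeddings = np.load("ad_embeddings.npy")        # shape (n_ads, d)
stock_embeddings = np.load("stock_embeddings.npy")  # shape (n_photos, d)

def tagline_to_candidates(tagline_embedding, top_k=5):
    """Hop 1: tagline -> closest past ad. Hop 2: that ad's scene -> stock photos."""
    q = tagline_embedding / np.linalg.norm(tagline_embedding)
    best_ad = int(np.argmax(ad_embeddings @ q))     # hop 1
    scene = ad_embeddings[best_ad]
    scores = stock_embeddings @ scene               # hop 2
    return best_ad, np.argsort(-scores)[:top_k]
```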

To illustrate how the latter technique works, let’s find candidate photo-bank images for an auto-generated prompt. Assume we have a company called “Bad Brothers Coffee” with a rather creative generated tagline, “Coffee so good it’s almost criminal”. This is a difficult prompt to pair with a visual because we expect the image to communicate the spirit of coffee drinking while also featuring the dark notes of the tagline. Our baseline for this task is a regular text search over a free photo website (Unsplash.com):

Unsplash.com stock image search

It is not clear what algorithm Unsplash uses to annotate its images, but it is clearly not sophisticated enough to pick up the dark connotations of our prompt and simply returns stock coffee imagery. A semantic search over the Unsplash image collection with CLIP embeddings fares much better:

Unsplash.com semantic image search with CLIP embeddings

Here, the results matching the semantic embeddings are in a film noir style, which somewhat mirrors the additional semantic meaning of the tagline. But we can improve these results further. If we first pass the tagline through a VAE encoder and then search Unsplash.com for the matching embeddings, the outcome becomes even more interesting:

VAE interpretation of the slogan “Coffee so good it’s almost criminal”
Unsplash images semantically similar to the VAE-generated sample above

Therefore, we can reasonably expect that passing the tagline through intermediate embedding stages provides enough image variety to match the work of a human designer, at least at a low-fidelity level.
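To make the search mechanics concrete, here is a minimal sketch of a CLIP-based semantic query over a local folder of Unsplash images. The folder path, the query string, and encoding the collection on the fly are assumptions for illustration; in practice the image features would be precomputed, and an intermediate image (such as the VAE sample above) could be encoded with encode_image and used as the query instead of the raw tagline.

```python
# Sketch of semantic search over a local folder of Unsplash images with CLIP.
import glob
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = sorted(glob.glob("unsplash/*.jpg"))              # local copy of the collection
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # The query can be the tagline itself, or an intermediate image
    # (e.g. a VAE sample) encoded with model.encode_image instead.
    text = clip.tokenize(["Coffee so good it's almost criminal"]).to(device)
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

similarity = (image_features @ text_features.T).squeeze(1)
for idx in similarity.argsort(descending=True)[:5].tolist():
    print(paths[idx], float(similarity[idx]))
```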

Ad generation pipeline

Armed with these ideas, let us design a conceptual ad generation pipeline grounded in a free image bank and a commercial ad dataset.

Pipeline, color-coded into human and automated blocks

We start with tagline creation from the product description and use this tagline either directly (embedding it into semantic space) or indirectly (passing it through the commercial ad dataset to extract a matching scene). One interesting detail here is that most image encoders (like Microsoft Vision or OpenAI CLIP) consider BOTH the image scene and the text rendered in the image when producing the embedding. This is very helpful because a commercial ad may match the tagline in either one or both, which allows us to recover creative pieces where the image scene does not match the tagline directly.

Once we obtain the semantic encoding, we can match it against public image banks like Pexels or Unsplash. The match can be a photo, digital art, or a video (a semantic embedding for a video can be produced by averaging the embeddings of several keyframes). With this candidate royalty-free material, we can apply regular CV algorithms to crop the image and overlay it with text, the company logo, and call-to-action boxes to complete the advertisement.
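The compositing step can be approximated with standard Pillow calls. The sketch below center-crops a matched photo and overlays a tagline and logo; the file paths, font, and layout constants are illustrative assumptions rather than the project’s actual template.

```python
# Sketch of the final compositing step: crop the matched stock photo to the target
# ad size and overlay the tagline and a company logo.
from PIL import Image, ImageDraw, ImageFont, ImageOps

def compose_ad(photo_path, logo_path, tagline, size=(1080, 1080), out_path="ad.png"):
    # Crop and scale the matched photo to the target ad dimensions.
    canvas = ImageOps.fit(Image.open(photo_path).convert("RGB"), size)

    # Overlay the tagline near the bottom of the frame.
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 64)   # any available .ttf
    draw.text((60, size[1] - 180), tagline, font=font, fill="white")

    # Paste the company logo into the top-left corner (logo assumed to have alpha).
    logo = Image.open(logo_path).convert("RGBA").resize((160, 160))
    canvas.paste(logo, (40, 40), mask=logo)

    canvas.save(out_path)

compose_ad("matched_photo.jpg", "logo.png", "Coffee so good it's almost criminal")
```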

Datasets

So far we have assumed ready access to databases of commercial ads and royalty-free images, along with the corresponding feature embeddings. In reality, global search over semantic features is not readily available via APIs and requires a combination of web scraping and feature compiling. An example of this approach is given by Vladimir Haltakov’s GitHub repository. We extend this general approach by compiling commercial ads from moat.com and royalty-free images from Unsplash.

Data pre-processing steps
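One of these pre-processing steps, compiling CLIP features for every scraped image, might look roughly like the sketch below. The folder and output file names are assumptions; the resulting matrix is what the lookup sketches earlier in the post would load.

```python
# Sketch of the feature-compiling step: encode every scraped ad image once with CLIP
# and store the embedding matrix alongside the list of filenames.
import glob
import json
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = sorted(glob.glob("moat_ads/*.jpg"))      # scraped ad creatives
features = []
with torch.no_grad():
    for p in paths:
        image = preprocess(Image.open(p)).unsqueeze(0).to(device)
        f = model.encode_image(image)
        features.append((f / f.norm(dim=-1, keepdim=True)).cpu().numpy()[0])

np.save("ad_embeddings.npy", np.stack(features).astype("float32"))
with open("ad_filenames.json", "w") as fh:
    json.dump(paths, fh)
```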

As always with web scraping, one needs to be creative to avoid overloading the destination sites or hitting anti-crawler limits. Once data collection and pre-processing are complete, we move on to coding a sample app to see how these pieces work together. To maintain logical separation and speed of execution, we split the ML pieces into REST microservices that are called by the interactive app as the user progresses through the pipeline.

Interactive app architecture
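As a sketch of what one such microservice and its Streamlit caller could look like, the snippet below wraps the semantic search in a FastAPI endpoint. The endpoint name, port, and payload shape are assumptions and not the app’s actual API.

```python
# --- search_service.py: one REST microservice wrapping the semantic search ---
import json
import numpy as np
import torch
import clip
from fastapi import FastAPI
from pydantic import BaseModel

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
stock_embeddings = np.load("stock_embeddings.npy")       # precomputed, L2-normalized
stock_filenames = json.load(open("stock_filenames.json"))

app = FastAPI()

class Query(BaseModel):
    text: str
    top_k: int = 5

@app.post("/search")
def search(query: Query):
    with torch.no_grad():
        f = model.encode_text(clip.tokenize([query.text]).to(device))
        f = (f / f.norm(dim=-1, keepdim=True)).cpu().numpy()[0]
    best = np.argsort(-(stock_embeddings @ f))[: query.top_k]
    return {"matches": [stock_filenames[i] for i in best]}

# --- app.py: the Streamlit front end calls the service over REST, e.g. ---
#   import requests, streamlit as st
#   tagline = st.text_input("Chosen tagline")
#   if st.button("Find images"):
#       matches = requests.post("http://localhost:8001/search",
#                               json={"text": tagline}).json()["matches"]
#       for path in matches:
#           st.image(path)
```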

Sample App Workflow

After putting these pieces together in one Streamlit app, we can finally see how our creative automation pipeline works.

1. In the first stage, we enter the company name and a short product description to generate taglines using GPT-3. We fine-tune GPT-3 for tagline expansion using slogan examples from textart.ru:

Step 1. Tagline generation.

2. Once we choose a tagline, we can see which commercial ads are most similar to it semantically:

Semantic matches in moat.com database

Here, the right image (134_5) is just an image from a prior Sony PlayStation campaign by Honey, which is neither creative nor visually appealing. The left image (60_185), however, offers an interesting scene composition: it appears the designers at Purple chose to promote gaming with an image of a middle-aged man comfortably slouching on the floor while highly engaged in a game. This is a useful discovery because it is not very close to the product description or the tagline in any semantic sense.

3. As an illustration, let us see what we can get from a direct semantic search over the tagline and product name on Unsplash:

Unsplash images from semantic search by the tagline

As expected, the sample images we pull from Unsplash are visually appealing, but they reflect the tagline very literally and feature historical product images that are likely irrelevant to the modern Sony product line.

4. Now, let us see what we can get from Unsplash using the scene we extracted in step 2. Let us add the “interesting” Moat.com image index to the search and see how it fares:

Unsplash semantic match with a scene from moat.com

This is actually much better, and something that closely resembles what a human designer could have come up with. One may still argue that the product in the picture is not the actual console model we need to promote, but we got an engaging scene (a lady playing the console), and with a larger dataset we could have obtained more images of humans interacting with games.
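The post does not spell out how the chosen Moat.com scene is combined with the tagline query; one simple assumption is to blend the two normalized embeddings and rerun the nearest-neighbor search, roughly as follows.

```python
# Hypothetical sketch: mix the tagline embedding with the chosen ad-scene embedding
# and reuse the same nearest-neighbor lookup over the Unsplash features.
import numpy as np

def combined_query(tagline_emb, scene_emb, stock_embeddings, top_k=5):
    """Blend the tagline intent with the chosen ad scene, then rerun the search."""
    q = tagline_emb / np.linalg.norm(tagline_emb)
    s = scene_emb / np.linalg.norm(scene_emb)
    mix = (q + s) / np.linalg.norm(q + s)
    return np.argsort(-(stock_embeddings @ mix))[:top_k]
```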

Reflections and broader impacts

A wide consensus among futurists and social scientists studying AI has so far been that AI primarily threatens low-skill jobs and that the companies owning the algorithms are positioned to profit from them. These viewpoints are logical enough, but they run against much of the practical experience of working with AI. This sample project used two state-of-the-art ML models, CLIP and GPT-3, and in both cases the algorithm developer (OpenAI) does not appear to benefit as much as the potential downstream application owners (digital marketing agencies). Moreover, the largest challenge in this project was not finding the appropriate models but acquiring the datasets to operate on, because sample marketing ads and taglines had to be scraped off the Internet.

It also appears that a pipeline of ML algorithms can do well at harvesting human knowledge and displacing the work of creative-industry workers who produce social media ads. The remaining human-in-the-loop actions, selecting appropriate taglines and images, could also be automated with a sufficient number of A/B test deployments.

Therefore, the intelligent-versus-mundane dichotomy of jobs “most vulnerable to AI deployments” may be entirely false. It may turn out that AI does not align well with our understanding of “job complexity” and will shine at displacing some traditionally high-end and sought-after jobs while completely failing in occupations that are mundane but require actions that do not lend themselves to data-driven pipelines.

You are welcome to play with the app:

Streamlit app: http://54.201.80.61:8000

References:

Microsoft Vision: https://pypi.org/project/microsoftvision/

OpenAI CLIP: https://openai.com/blog/clip/

OpenAI GPT-3: https://openai.com/blog/openai-api/

OpenAI DALL-E: https://openai.com/blog/dall-e/

Unsplash: https://unsplash.com

Pexels: https://pexels.com
