
Word-As-Image for Semantic Typography

Shir Iluz∗ (Tel-Aviv University, Israel), Yael Vinker∗ (Tel-Aviv University, Israel), Amir Hertz (Tel-Aviv University, Israel),
Daniel Berio (Goldsmiths University, London), Daniel Cohen-Or (Tel-Aviv University, Israel), Ariel Shamir (Reichman University, Israel)
arXiv:2303.01818v2 [cs.CV] 6 Mar 2023

Fig. 1. A few examples of our word-as-image illustrations in various fonts and for different textual concepts. The semantically adjusted letters are created completely automatically using our method, and can then be used for further creative design as we illustrate here.

A word-as-image is a semantic typography technique where a word illustration presents a visualization of the meaning of the word, while also preserving its readability. We present a method to create word-as-image illustrations automatically. This task is highly challenging as it requires semantic understanding of the word and a creative idea of where and how to depict these semantics in a visually pleasing and legible manner. We rely on the remarkable ability of recent large pretrained language-vision models to distill textual concepts visually. We target simple, concise, black-and-white designs that convey the semantics clearly. We deliberately do not change the color or texture of the letters and do not use embellishments. Our method optimizes the outline of each letter to convey the desired concept, guided by a pretrained Stable Diffusion model. We incorporate additional loss terms to ensure the legibility of the text and the preservation of the style of the font. We show high quality and engaging results on numerous examples and compare to alternative techniques. Code will be available at our project page.

∗ Denotes equal contribution.

1 INTRODUCTION

Semantic typography is the practice of using typography to visually reinforce the meaning of text. This can be achieved through the choice of typefaces, font sizes, font styles, and other typographic elements. A more elaborate and engaging technique for semantic typography is presented by word-as-image illustrations, where the semantics of a given word are illustrated using only the graphical elements of its letters. Such illustrations provide a visual representation of the meaning of the word, while also preserving the readability of the word as a whole.

The task of creating a word-as-image is highly challenging, as it requires the ability to understand and depict the visual characteristics of the given concept, and to convey them in a concise, aesthetic, and comprehensible manner without harming legibility. It requires a great deal of creativity and design skills to integrate the chosen visual concept into the letter's shape [Lee 2011]. In Figure 2 we show some word-as-image examples created manually. For example, to create the "jazz" depiction, the designer had to first choose the visual concept that would best fit the semantics of the text (a saxophone), consider the desired font characteristics, and then choose the most suitable letter to be replaced. Finding the right visual element to illustrate a concept is ill-defined as there are countless ways to illustrate any given concept. In addition, one cannot simply copy a selected visual element onto the word – there is a need to find subtle modifications of the letters' shape.

Because of these complexities, the task of automatic creation of word-as-image illustrations was practically impossible to achieve using computers until recently. In this paper, we define an algorithm for automatic creation of word-as-image illustrations based on recent advances in deep learning and the availability of huge foundational models that combine language and visual understanding. Our resulting illustrations (see Figure 1) could be used for logo design, for signs, in greeting cards and invitations, and simply for fun. They can be used as-is, or as inspiration for further refinement of the design.

Existing methods in the field of text stylization often rely on raster textures [Yang et al. 2018], place a manually created style on top of the strokes segmentation [Berio et al. 2022], or deform the text into a pre-defined target shape [Zou et al. 2016] (see Figure 3). Only a few works [Tendulkar et al. 2019; Zhang et al. 2017] deal with semantic typography, and they often operate in the raster domain and use existing icons for replacement (see Figure 3E).

Our word-as-image illustrations concentrate on changing only the geometry of the letters to convey the meaning. We deliberately do not change color or texture and do not use embellishments.

Fig. 2. Manually created word-as-image illustrations.

This allows simple, concise, black-and-white designs that convey the semantics clearly. In addition, since we preserve the vector-based representation of the letters, this allows smooth rasterization in any size, as well as applying additional style manipulations to the illustration using colors and texture, if desired.

Given an input word, our method is applied separately for each letter, allowing the user to later choose the most likeable combination for replacement. We represent each letter as a closed vectorized shape, and optimize its parameters to reflect the meaning of the word, while still preserving its original style and design.

We rely on the prior of a pretrained Stable Diffusion model [Rombach et al. 2021] to connect between text and images, and utilize the Score Distillation Sampling approach [Poole et al. 2022] (see Section 3) to encourage the appearance of the letter to reflect the provided textual concept. Since the Stable Diffusion model is trained on raster images, we use a differentiable rasterizer [Li et al. 2020] that allows us to backpropagate gradients from a raster-based loss to the shape's parameters.

To preserve the shape of the original letter and ensure legibility of the word, we utilize two additional loss functions. The first loss regulates the shape modification by constraining the deformation to be as-conformal-as-possible over a triangulation of the letter's shape. The second loss preserves the local tone and structure of the letter by comparing the low-pass filter of the resulting rasterized letter to the original one.

We compare to several baselines, and present many results using various typefaces and a large number of concepts. Our word-as-image illustrations convey the intended concept while maintaining legibility and preserving the appearance of the font, demonstrating visual creativity.

Fig. 3. Examples of previous text stylization works – (A) Yang et al. [2018], (B) Berio et al. [2022], (C) Zhang et al. [2017], (D) Zou et al. [2016], and (E) Tendulkar et al. [2019]. Most use color and texture or copy icons onto the letters. Our work concentrates on subtle geometric shape deformations of the letters to convey the semantic meaning without color or texture (that can be added later).

2 RELATED WORK

Text Stylization. One approach to text stylization is artistic text style transfer, where the style from a given source image is migrated into the desired text (such as in Figure 3A). To tackle this task, existing works incorporate patch-based texture synthesis [Fish et al. 2020; Yang et al. 2017] as well as variants of GANs [Azadi et al. 2018; Jiang et al. 2019; Mao et al. 2022; Wang et al. 2019; Yang et al. 2022]. These works operate within the raster domain, a format that is undesirable for typographers since fonts must be scalable. In contrast, we operate on the parametric outlines of the letters, and our glyph manipulation is guided by the semantic meaning of the word, rather than a pre-defined style image.

A number of works [Ha and Eck 2018; Lopes et al. 2019; Wang and Lian 2021] tackle the task of font generation and stylization in the vector domain. Commonly, a latent feature space of font outlines is constructed, represented as outline samples [Balashova et al. 2019; Campbell and Kautz 2014] or parametric curve segments [Ha and Eck 2018; Lopes et al. 2019; Wang and Lian 2021]. These approaches are often limited to mild deviations from the input data. Other methods rely on templates [Lian et al. 2018; Suveeranont and Igarashi 2010] or on user guided [Phan et al. 2015] and automatic [Berio et al. 2022] stroke segmentation to produce letter stylization (such as in Figure 3B). However, they rely on a manually defined style, while we rely on the expressiveness of Stable Diffusion to guide the modification of the letters' shape, to convey the meaning of the provided word. In the task of calligram generation [Xu and Kaplan 2007; Zou et al. 2016] the entire word is deformed into a given target shape. This task prioritises shape over the readability of the word (see Figure 3D), and is inherently different from ours, as we use the semantics of the word to derive the deformation of individual letters.

Most related to our goal are works that perform semantic stylization of text. Tendulkar et al. [2019] replace letters in a given word with clip-art icons describing a given theme (see Figure 3E). To choose the most suitable icon for replacement, an autoencoder is used to measure the distance between the letter and icons from the desired class. Similarly, Zhang et al. [2017] replace stroke-like parts of one or more letters with instances of clip art to generate ornamental stylizations. An example is shown in Figure 3C. These approaches operate in the raster domain, and replace letters with existing icons, which limits them to a predefined set of classes present in the dataset. Our method, however, operates in the vector domain, and incorporates the expressiveness of large pretrained image-language models to create a new illustration that conveys the desired concept.
Fig. 4. More word-as-images produced by our method. Note how styles of different fonts are preserved by the semantic modification.

Large Language-Vision Models. With the recent advancement of language-vision models [Radford et al. 2021] and diffusion models [Nichol et al. 2021; Ramesh et al. 2022; Rombach et al. 2021], the field of image generation and editing has undergone unprecedented evolution. Having been trained on millions of images and text pairs, these models have proven effective for performing challenging vision related tasks such as image segmentation [Amit et al. 2021], domain adaptation [Song et al. 2022], image editing [Avrahami et al. 2022; Hertz et al. 2022; Tumanyan et al. 2022a], personalization [Gal et al. 2022, 2023; Ruiz et al. 2022], and explainability [Chefer et al. 2021]. Despite being trained on raster images, their strong visual and semantic priors have also been shown to be successfully applied to other domains, such as motion [Tevet et al. 2022], meshes [Michel et al. 2021], point clouds [Zhang et al. 2021], and vector graphics. CLIPDraw [Frans et al. 2021] uses a differentiable rasterizer [Li et al. 2020] to optimize a set of colorful curves w.r.t. a given text prompt, guided by CLIP's image-text similarity metric. Tian and Ha [2021] use evolutionary algorithms combined with CLIP guidance to create abstract visual concepts based on text. Other works [Vinker et al. 2022a,b] utilize the image encoder of CLIP to generate abstract vector sketches from images.

Diffusion models have been used for the task of text guided image-to-image translation [Choi et al. 2021; Tumanyan et al. 2022b]. In SDEdit [Meng et al. 2022], an adequate amount of noise is added to a reference image, such that its overall structure is preserved, and then the image is denoised in a reverse process with a guiding text. Pretrained diffusion models have also been used to generate 3D objects [Metzer et al. 2022; Poole et al. 2022], or vector art [Jain et al. 2022] conditioned on text.

In our work we also utilize the strong visual and semantic prior induced by a pretrained Stable Diffusion model [Rombach et al. 2021], however, for the task of semantic typography. For that purpose we add new components to the optimization process to preserve the font's style and text legibility.

3 BACKGROUND

3.1 Fonts and Vector Representation

Modern typeface formats such as TrueType [Penney 1996] and PostScript [Inc. 1990] represent glyphs using a vectorized graphic representation of their outlines. Specifically, the outline contours are typically represented by a collection of lines and Bézier or B-Spline curves. This representation allows the letters to be scaled and rasterized at any desired size, similar to other vector representations. This property is preserved by our method as our output preserves the vectorized representations of the letters.

3.2 Latent Diffusion Models

Diffusion models are generative models that are trained to learn a data distribution by the gradual denoising of a variable sampled from a Gaussian distribution.

In our work, we use the publicly available text-to-image Stable Diffusion model [Rombach et al. 2021]. Stable Diffusion is a type of latent diffusion model (LDM), where the diffusion process is done over the latent space of a pretrained image autoencoder. The encoder E is tasked with mapping an input image x into a latent vector z, and the decoder D is trained to decode z such that D(z) ≈ x.

As a second stage, a denoising diffusion probabilistic model (DDPM) [Ho et al. 2020] is trained to generate codes within the learned latent space. At each step during training, a scalar t ∈ {1, 2, ..., T} is uniformly sampled and used to define a noised latent code z_t = α_t z + σ_t ε, where ε ∼ N(0, I) and α_t, σ_t are terms that control the noise schedule, and are functions of the diffusion process time t.

The denoising network ε_θ, which is based on a UNet architecture [Ronneberger et al. 2015], receives as input the noised code z_t, the timestep t, and an optional condition vector c(y), and is tasked with predicting the added noise ε. The LDM loss is defined by:

\mathcal{L}_{LDM} = \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c(y)) \rVert_2^2 \,\right]    (1)

In Stable Diffusion, for text-to-image generation, the condition vector is the text embedding produced by a pre-trained CLIP text encoder [Radford et al. 2021]. At inference time, a random latent code z_T ∼ N(0, I) is sampled, and iteratively denoised by the trained ε_θ until producing a clean z_0 latent code, which is passed through the decoder D to produce the image x.
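To make the LDM training objective concrete, the following PyTorch sketch computes a noised latent and the loss of Eq. (1) for one training step. It is a minimal illustration under stated assumptions: the encoder, the denoiser eps_theta, and the noise-schedule tensors are stand-ins for the pretrained Stable Diffusion components, not the actual released code.

```python
# Minimal sketch of one LDM training step (Eq. 1), assuming a pretrained
# autoencoder `encoder` and a denoiser `eps_theta` with the stated interface.
import torch
import torch.nn.functional as F

def ldm_loss_step(encoder, eps_theta, x, cond, alphas, sigmas):
    """x: batch of images, cond: text-condition embeddings c(y),
    alphas/sigmas: 1D tensors holding the noise schedule for t = 1..T."""
    z = encoder(x)                                  # z ~ E(x)
    T = alphas.shape[0]
    t = torch.randint(0, T, (z.shape[0],), device=z.device)  # uniform timestep
    eps = torch.randn_like(z)                       # eps ~ N(0, I)
    a = alphas[t].view(-1, 1, 1, 1)
    s = sigmas[t].view(-1, 1, 1, 1)
    z_t = a * z + s * eps                           # noised latent code
    eps_pred = eps_theta(z_t, t, cond)              # predict the added noise
    return F.mse_loss(eps_pred, eps)                # ||eps - eps_theta(z_t, t, c(y))||^2
```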
3.3 Score Distillation

It is desirable to utilize the strong prior of pretrained large text-image models for the generation of modalities beyond rasterized images. In Stable Diffusion, text conditioning is performed via the cross-attention layers defined at different resolutions in the UNet network. Thus, it is not trivial to guide an optimization process using the conditioned diffusion model.

DreamFusion [Poole et al. 2022] proposed a way to use the diffusion loss to optimize the parameters of a NeRF model for text-to-3D generation. At each iteration, the radiance field is rendered from a random angle, forming the image x, which is then noised to form x_t = α_t x + σ_t ε. The noised image is then passed to the pretrained UNet model of Imagen [Saharia et al. 2022], which outputs the prediction of the noise ε. The score distillation loss is defined by the gradients of the original diffusion loss:

\nabla_{\phi} \mathcal{L}_{SDS} = w(t)\,\big(\epsilon_\theta(x_t, t, y) - \epsilon\big)\,\frac{\partial x}{\partial \phi}    (2)

where y is the condition text prompt, φ are the NeRF's parameters and w(t) is a constant multiplier that depends on α_t. During training, the gradients are back-propagated to the NeRF parameters to gradually change the 3D object to fit the text prompt. Note that the gradients of the UNet are skipped, and the gradients to modify the NeRF's parameters are derived directly from the LDM loss.
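The gradient of Eq. (2) can be realized without backpropagating through the UNet by treating the noise residual as a constant and applying it to the rendered image. The following hedged PyTorch sketch shows one way to do this; the frozen denoiser, the noise schedule, and the differentiable renderer are assumed inputs rather than parts of any specific released implementation.

```python
# Sketch of a score-distillation step: the UNet output is taken under no_grad,
# so gradients flow only through the differentiable rendering x(phi).
import torch

def sds_step(render, params, frozen_eps_theta, text_emb, alphas, sigmas, w):
    x = render(params)                                 # differentiable render x(phi)
    T = alphas.shape[0]
    t = torch.randint(0, T, (1,), device=x.device)
    eps = torch.randn_like(x)
    x_t = alphas[t] * x + sigmas[t] * eps              # noised image
    with torch.no_grad():                              # skip UNet gradients
        eps_pred = frozen_eps_theta(x_t, t, text_emb)
    grad = w[t] * (eps_pred - eps)                     # w(t)(eps_theta(x_t,t,y) - eps)
    # Surrogate loss whose gradient w.r.t. params equals grad * dx/dphi.
    loss = (grad.detach() * x).sum()
    loss.backward()
    return loss
```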
Fig. 5. An overview of our method. Given an input letter l_i represented by a set of control points P, and a concept (shown in purple), we optimize the new positions P̂ of the deformed letter l̂_i iteratively. At each iteration, the set P̂ is fed into a differentiable rasterizer (DiffVG, marked in blue) that outputs the rasterized deformed letter l̂_i. l̂_i is then augmented and passed into a pretrained frozen Stable Diffusion model, which drives the letter shape to convey the semantic concept using the ∇_P̂ L_LSDS loss (1). l_i and l̂_i are also passed through a low pass filter (LPF, marked in yellow) to compute L_tone (2), which encourages the preservation of the overall tone of the font style and also the local letter shape. Additionally, the sets P and P̂ are passed through a Delaunay triangulation operator (D, marked in green), defining L_acap (3), which encourages the preservation of the initial shape.

3.4 VectorFusion

Recently, VectorFusion [Jain et al. 2022] utilized the SDS loss for the task of text-to-SVG generation. The proposed generation pipeline involves two stages. Given a text prompt, first, an image is generated using Stable Diffusion (with an added suffix to the prompt), and is then vectorized automatically using LIVE [Ma et al. 2022]. This defines an initial set of parameters to be optimized in the second stage using the SDS loss. At each iteration, a differentiable rasterizer [Li et al. 2020] is used to produce a 600 × 600 image, which is then augmented as suggested in CLIPDraw [Frans et al. 2021] to get a 512 × 512 image x_aug. Then x_aug is fed into the pretrained encoder E of Stable Diffusion to produce the corresponding latent code z = E(x_aug). The SDS loss is then applied in this latent space, in a similar way to the one defined in DreamFusion:

\nabla_{\theta} \mathcal{L}_{LSDS} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_{\phi}(\alpha_t z_t + \sigma_t \epsilon, y) - \epsilon\big)\,\frac{\partial z}{\partial x_{aug}}\,\frac{\partial x_{aug}}{\partial \theta} \right]    (3)

We find the SDS approach useful for our task of producing semantic glyphs, and we follow the technical steps proposed in VectorFusion (e.g. augmentations and the added suffix).

4 METHOD

Given a word W represented as a string with n letters {l_1, ..., l_n}, our method is applied to every letter l_i separately to produce a semantic visual depiction of the letter. The user can then choose which letters to replace and which to keep in their original form.

4.1 Letter Representation

We begin by defining the parametric representation of the letters in W. We use the FreeType font library [FreeType 2009] to extract the outline of each letter. We then translate each outline into a set of cubic Bézier curves, to have a consistent representation across different fonts and letters, and to facilitate the use of diffvg [Li et al. 2020] for differentiable rasterization.

Fig. 6. Illustration of the letter's outline and control points before (left) and after (right) the subdivision process. The orange dots are the initial Bézier curve segment endpoints. The blue dots are the remaining control points, respectively before and after subdivision.

Depending on the letter's complexity and the style of the font, the extracted outlines are defined by a different number of control points. We have found that the initial number of control points affects the final appearance significantly: as the number of control points increases, there is more freedom for visual changes to occur. Therefore, we additionally apply a subdivision procedure to letters containing a small number of control points. We define a desired number of control points for each letter of the alphabet (shared across different fonts), and then iteratively subdivide the Bézier segments until reaching this target number. At each iteration, we compute the maximum arc length among all Bézier segments and split the segment with this length into two (see Figure 6). We analyse the effect of the number of control points in Section 5.3. This procedure defines a set of k_i control points P_i = {p_j}_{j=1}^{k_i} representing the shape of the letter l_i.
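The subdivision step described above can be sketched as repeated De Casteljau splitting of the longest cubic Bézier segment. The snippet below is a simplified illustration under assumptions: arc length is approximated by sampling, and the outline is a plain list of four-point segments rather than the FreeType/diffvg representation used in the paper.

```python
# Simplified sketch: repeatedly split the longest cubic Bezier segment at its
# midpoint (De Casteljau) until the outline has the target number of segments.
import numpy as np

def split_cubic(p0, p1, p2, p3, t=0.5):
    """De Casteljau split of one cubic segment into two cubics."""
    a, b, c = (1-t)*p0 + t*p1, (1-t)*p1 + t*p2, (1-t)*p2 + t*p3
    d, e = (1-t)*a + t*b, (1-t)*b + t*c
    m = (1-t)*d + t*e
    return (p0, a, d, m), (m, e, c, p3)

def approx_length(seg, samples=32):
    ts = np.linspace(0.0, 1.0, samples)[:, None]
    p0, p1, p2, p3 = seg
    pts = ((1-ts)**3)*p0 + 3*((1-ts)**2)*ts*p1 + 3*(1-ts)*(ts**2)*p2 + (ts**3)*p3
    return np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()

def subdivide_outline(segments, target_num_segments):
    segments = list(segments)                     # list of (p0, p1, p2, p3) arrays
    while len(segments) < target_num_segments:
        i = int(np.argmax([approx_length(s) for s in segments]))
        left, right = split_cubic(*segments[i])   # split the longest segment
        segments[i:i+1] = [left, right]
    return segments
```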
Fig. 7. Visual illustration of the constrained Delaunay triangulation applied to the initial shapes (left) and the resulting ones (right), for the word "pants". The ACAP loss maintains the structure of the letter after the deformation. The zoomed rectangle shows the angles for a given control point p_j.

Fig. 8. Our tone-preserving loss preserves the local tone of the font by comparing the low-pass filter of the letter images before (left) and after deformation (right). It constrains the adjusted letter not to deviate too much from the original. This example is of the letter B and the word "Bear".

4.2 Optimization

The pipeline of our method is provided in Figure 5. Since we are optimizing each letter l_i separately, for brevity, we will omit the letter index i in the following text and define the set of control points for the input letter as P.

Given P and the desired textual concept c (both marked in purple in Figure 5), our goal is to produce a new set of control points, P̂, defining an adjusted letter l̂ that conveys the given concept, while maintaining the overall structure and characteristics of the initial letter l.

We initialize the learned set of control points P̂ with P, and pass it through a differentiable rasterizer R [Li et al. 2020] (marked in blue), which outputs the rasterized letter R(P̂). The rasterized letter is then randomly augmented and passed into a pretrained Stable Diffusion [Rombach et al. 2021] model, conditioned on the CLIP embedding of the given text c. The SDS loss ∇_P̂ L_LSDS is then used as described in Section 3 to encourage R(P̂) to convey the given text prompt.

To preserve the shape of each individual letter and ensure the legibility of the word as a whole, we use two additional loss functions to guide the optimization process. The first loss limits the overall shape change by defining an as-conformal-as-possible constraint on the shape deformation. The second loss preserves the overall shape and style of the font by constraining the tone (i.e. the amount of dark vs. light areas in local parts of the shape) of the modified letter not to diverge too much from the original letter (see Section 4.3).

The gradients obtained from all the losses are then backpropagated to update the parameters P̂. We repeat this process for 500 steps, which takes ∼5 minutes to produce a single letter illustration on an RTX2080 GPU.

4.3 Loss Functions

Our primary objective of encouraging the resulting shape to convey the intended semantic concept is realized by the ∇_P̂ L_LSDS loss (described in Section 3). We observe that using ∇_P̂ L_LSDS alone can cause large deviations from the initial letter appearance, which is undesired. Hence, our additional goal is to maintain the shape and legibility of the letter R(P̂), as well as to keep the original font's characteristics. For that purpose we use two additional losses.

As-Conformal-As-Possible Deformation Loss. To prevent the final letter shape from diverging too much from the initial shape, we triangulate the inner part of the letter and constrain the deformation of the letter to be as conformal as possible (ACAP) [Hormann and Greiner 2000]. We use constrained Delaunay triangulation [Barber and Huhdanpaa 1995; Delaunay et al. 1934] on the set of control points defining the glyph. It is known that Delaunay triangulation can be used to produce the skeleton of an outline [Prasad 1997; Zou et al. 2001], so the ACAP loss also implicitly captures a skeletal representation of the letter form.

The Delaunay triangulation D(P) splits the glyph represented by P into a set of triangles. This defines a set of size m_j of corresponding angles for each control point p_j (see Figure 7). We denote this set of angles as {α_j^i}_{i=1}^{m_j}. The ACAP loss encourages the induced angles of the optimized shape P̂ not to deviate much from the angles of the original shape P, and is defined as the L2 distance between the corresponding angles:

\mathcal{L}_{acap}(P, \hat{P}) = \frac{1}{k} \sum_{j=1}^{k} \sum_{i=1}^{m_j} \big( \alpha_j^i - \hat{\alpha}_j^i \big)^2    (4)

where k = |P| and α̂ are the angles induced by D(P̂).

Tone Preservation Loss. To preserve the style of the font as well as the structure of the letter we add a local-tone preservation loss term. This term constrains the tone (the amount of black vs. white in all regions of the shape) of the adjusted letter not to deviate too much from the tone of the original font's letter. Towards this end, we apply a low pass filter (LPF) to the rasterized letter (before and after deformation) and compute the L2 distance between the resulting blurred letters:

\mathcal{L}_{tone} = \big\lVert LPF(\mathcal{R}(\hat{P})) - LPF(\mathcal{R}(P)) \big\rVert_2^2    (5)

An example of the blurred letters is shown in Figure 8; as can be seen, we use a high value of the standard deviation σ in the blurring kernel to blur out small details such as the ears of the bear.

Our final objective is then defined by the weighted average of the three terms:

\min_{\hat{P}} \;\; \nabla_{\hat{P}} \mathcal{L}_{LSDS}(\mathcal{R}(\hat{P}), c) + \alpha \cdot \mathcal{L}_{acap}(P, \hat{P}) + \beta_t \cdot \mathcal{L}_{tone}(\mathcal{R}(P), \mathcal{R}(\hat{P}))    (6)

where α = 0.5 and β_t depends on the step t as described next.
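A compact sketch of the two structure-preserving terms is given below, assuming the control points are a PyTorch tensor and a fixed triangulation (triangle index triplets) has been computed once on the original letter. The angle computation and the Gaussian low-pass filter are simplified stand-ins for the exact implementation, not the released code.

```python
# Hedged sketch of the ACAP (Eq. 4) and tone-preservation (Eq. 5) terms.
import torch
import torch.nn.functional as F

def triangle_angles(points, tris):
    """points: (k, 2) control points; tris: (m, 3) index triplets from a
    Delaunay triangulation. Returns the (m, 3) interior angles."""
    a, b, c = points[tris[:, 0]], points[tris[:, 1]], points[tris[:, 2]]
    def angle(p, q, r):                      # angle at vertex p of triangle pqr
        u, v = q - p, r - p
        cos = (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1) + 1e-8)
        return torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6))
    return torch.stack([angle(a, b, c), angle(b, c, a), angle(c, a, b)], dim=-1)

def acap_loss(P, P_hat, tris):
    """Penalize deviation of triangulation angles from the original shape."""
    return ((triangle_angles(P, tris) - triangle_angles(P_hat, tris)) ** 2
            ).sum() / P.shape[0]

def tone_loss(img, img_hat, sigma=30, ksize=101):
    """L2 distance between heavily blurred rasterizations of shape (1, 1, H, W)."""
    xs = torch.arange(ksize) - ksize // 2
    g = torch.exp(-xs.float() ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).to(img)
    def blur(x):                             # separable Gaussian low-pass filter
        x = F.conv2d(x, g.view(1, 1, 1, -1), padding=(0, ksize // 2))
        return F.conv2d(x, g.view(1, 1, -1, 1), padding=(ksize // 2, 0))
    return ((blur(img_hat) - blur(img)) ** 2).sum()
```

The full objective of Eq. (6) is then a weighted sum of the SDS gradient term and these two losses, with α = 0.5 and the step-dependent β_t defined in Section 4.4.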
Table 1. Perceptual study results. The level of concept recognizability and letter legibility are very high, and style matching of the font is well above random. The "Only SDS" results are created by removing our structure and style preserving losses.

Method      Semantics   Legibility   Font
Ours        0.8         0.9          0.51
Only SDS    0.88        0.53         0.33

Fig. 9. Word-as-images produced by our method for the word "YOGA", using eight different fonts.

4.4 Weighting

Choosing the relative weights of the three losses presented above is crucial to the appearance of the final letter. While the ∇_P̂ L_LSDS loss encourages the shape to deviate from its original appearance to better fit the semantic concept, the two terms L_tone and L_acap are responsible for maintaining the original shape. Hence, we have two competing parts in the formula, and would like to find a balance between them to maintain the legibility of the letter while allowing the desired semantic shape to change.

We find that L_tone can be very dominant. In some cases, if it is used from the beginning, no semantic deformation is performed. Therefore, we adjust the weight of L_tone to kick in only after some semantic deformation has occurred. We define β_t as follows:

\beta_t = a \cdot \exp\!\left( -\frac{(t - b)^2}{2c^2} \right)    (7)

with a = 100, b = 300, c = 30. We analyse the effect of various weightings in Section 5.3. Note that the same hyper-parameter choice works for various words, letters, and fonts.
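The step-dependent weight of Eq. (7) is a simple Gaussian bump over the iteration index; a minimal sketch with the constants quoted above:

```python
# beta_t from Eq. (7): the tone loss only "kicks in" around iteration b = 300.
import math

def beta_t(t, a=100.0, b=300.0, c=30.0):
    return a * math.exp(-((t - b) ** 2) / (2 * c ** 2))

# Example: beta_t is negligible early on and peaks at a = 100 around step 300.
print(beta_t(0), beta_t(300), beta_t(500))
```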
5 RESULTS

The robustness of our approach means it should be capable of handling a wide range of input concepts as well as supporting different font designs. Figures 1, 4, 33, 17, and more results in the supplemental file demonstrate that our approach can handle inputs from many different categories and various fonts, and that the generated results are legible and creative. Figure 9 demonstrates how the illustrations created by our method for the same word follow the characteristics of different fonts. Although the perceived aesthetics of a word-as-image illustration can be subjective, we define three objectives for an effective result: (1) it should visually capture the given semantic concept, (2) it should maintain readability, and (3) it should preserve the original font's characteristics.

We evaluate the performance of our method on a randomly selected set of inputs. We select five common concept classes – animals, fruits, plants, sports, and professions. Using ChatGPT, we sample ten random instances for each class, resulting in 50 words in total. Next, we select four fonts that have distinct visual characteristics, namely Quicksand, Bell MT, Noteworthy-Bold, and HobeauxRococeaux-Sherman. For each word, we randomly sampled one of the four fonts, and applied our method to each letter. For each word with n letters we can generate 2^n possible word-as-images, which are all possible combinations of replacements of illustrated letters. A selected subset of these results is presented in Figure 33. The results of all letters and words are presented in the supplementary material. As can be seen, the resulting word-as-image illustrations successfully convey the given semantic concept in most cases while still remaining legible. In addition, our method successfully captures the font characteristics. For example, in Figure 33, the replacements for the "DRESS" and "LION" are thin and fit well with the rest of the word. In addition, observe the serifs of the letter A used for the fin of the shark in the "SHARK" example. We further use human evaluation to validate this as described below.

5.1 Quantitative

We conduct a perceptual study to quantitatively assess the three objectives of our resulting word-as-images. We randomly select two instances from each of the resulting word-as-image illustrations for the five classes described above, and visually select one letter from each word, resulting in 10 letters in total. In each question we show an isolated letter illustration, without the context of the word. To evaluate the ability of our method to visually depict the desired concept, we present four label options from the same class, and ask participants to choose the one that describes the letter illustration best. To evaluate the legibility of the results, we ask participants to choose the most suitable letter from a random list of four letters. To assess the preservation of the font style, we present the four fonts and ask participants to choose the most suitable font for the illustration. We gathered answers from 40 participants, and the results are shown in Table 1. As can be seen, the level of concept recognizability and letter legibility are very high, and the 51% style matching of the letter illustration to the original font is well above random, which is 25%. We also test our algorithm without the two additional structure and style preserving losses (L_acap and L_tone) on the same words and letters ("Only SDS" in the table). As expected, without the additional constraints, the letter deforms significantly, resulting in higher concept recognizability but lower legibility and font style preservation. More details and examples are provided in the supplementary material.

5.2 Comparison

In the absence of a relevant baseline for comparison, we define baselines based on large popular text-to-image models. Specifically, we use (1) SD – Stable Diffusion [Rombach et al. 2021], (2) SDEdit [Meng et al. 2022], (3) DallE2 [Ramesh et al. 2022] illustrating the word, (4) DallE2+letter illustrating only the letter, and (5) CLIPDraw [Frans et al. 2021]. We applied the methods above (details can be found in the supplemental material) to three representative words – "bird", "dress", and "tulip", with the fonts Bell MT, Quicksand, and Noteworthy-Bold, respectively. The results can be seen in Figure 10.

In some cases Stable Diffusion (SD) did not manage to produce text at all (such as for the bird), and when text is produced, it is often not legible. The results obtained by SDEdit preserve the font's characteristics and the letter's legibility, but often fail to reflect the desired concept, such as in the case of the bird and the dress.
Fig. 10. Comparison to alternative methods based on large scale text-to-image models, for the word BIRD and the letter R, the word DRESS and the letter E, and the word TULIP and the letter U. On the left are the letters used as input (only for SDEdit, CLIPDraw, and ours), as well as the desired object of interest. The results from left to right are obtained using Stable Diffusion [Rombach et al. 2021], SDEdit [Meng et al. 2022], DallE2 [Ramesh et al. 2022], DallE2 with a letter specific prompt, CLIPDraw [Frans et al. 2021], and our single-letter results, as well as the final word-as-image.

Additionally, it operates in the raster domain and tends to add details on top of the letter, while our method operates directly on the vector representation of the letters with the objective of modifying their shape. DallE2 manages to reflect the visual concept, however it often fails to produce legible text. When applied with a dedicated prompt to produce the word-as-image of only one letter (fifth column), it manages to produce a legible letter, but there is less control over the output – it is impossible to specify the desired font or to control the size, position, and shape of the generated letter. Therefore, it is not clear how to combine these output illustrations into the entire word to create a word-as-image.

CLIPDraw produces reasonable results conveying the semantics of the input word. However, the results are non-smooth and the characteristics of the font are not preserved (for example, observe how the letter "E" differs from the input letter). We further examine CLIPDraw with our shape preservation losses in the next section.

5.3 Ablation

Figure 11 illustrates the impact of the letter's initial number of control points. When fewer control points are used (P_o is the original number of control points), we may get insufficient variations, such as for the gorilla. However, this can also result in more abstract depictions, such as the ballerina. As we add control points, we get more graphic results, with the tradeoff that they often deviate from the original letter. In Figure 15 we show the results of using only the ∇_P̂ L_LSDS loss. As can be seen, in that case the illustrations strongly convey the semantic concept, however at the cost of legibility. In Figure 16 we analyze the effect of the weight α applied to L_acap, ranging from 1 to 0. When L_acap is too dominant, the results may not sufficiently reflect the semantic concept, while the opposite case harms legibility. Figure 13 illustrates a change in the σ parameter of the low pass filter. When σ = 1 almost no blur is applied, resulting in a shape constraint that is too strong.

In Figure 14 we show the results of replacing the ∇_P̂ L_LSDS loss with a CLIP based loss, while using our proposed shape preservation terms. Although the results obtained with CLIP often depict the desired visual concept, we find that using Stable Diffusion leads to smoother illustrations that capture a wider range of semantic concepts. By using the hyperparameters described in the paper, we are able to achieve a reasonable balance between semantics and legibility. The parameters were determined manually based on visual assessments, but can be adjusted as needed based on the user's personal taste and goals.

Fig. 11. The effect of the initial number of control points on outputs. On the left are the input letters and the target concepts ("Ballet", "Gorilla", "Gym") used to generate the results on the right. P_o indicates the original number of control points as extracted from the font, P is the input letter with our chosen hyperparameters, and for 2 × P we increase the number of control points in P by two.

6 CONCLUSIONS

We presented a method for the automatic creation of vector-format word-as-image illustrations. Our method can handle a large variety of semantic concepts and use any font, while preserving the legibility of the text and the font's style.

There are limitations to our method. First, our method works letter by letter, and therefore, it cannot deform the shape of the entire word. In the future we can try to optimize the shape of several letters. Second, the approach works best on concrete visual concepts, and may fail with more abstract ones. This can be alleviated by optimizing the shape of letters using different concepts than the word itself. Third, the layout of letters can also be automated, for example using methods such as [Wang et al. 2022].

Our word-as-image illustrations demonstrate visual creativity and open the possibility for the use of large vision-language models for semantic typography, possibly also adding a human-in-the-loop to arrive at more synergistic design methods of ML models and humans.
Fig. 13. Altering the σ parameter of the low pass filter used in the L_tone loss, for the concepts "Bear", "Singer", and "Giraffe". On the leftmost column are the original letters and concepts used, then from left to right are the results obtained when using σ ∈ {1, 5, 30, 200}, and without L_tone.

Fig. 14. Replacing the SDS loss with a CLIP-based loss, for the concepts "Snail", "Skirt", "Socks", "Queen", and "Strawberry".

Fig. 15. The effect of using only the SDS loss, for the concepts "Cat", "Music", "Robot", "Cup", and "Hands": note how the third row simply looks like icon illustrations, while the second row still resembles legible letters.

Fig. 16. Altering the weight α of the L_acap loss, for the concepts "Bear", "Singer", and "Giraffe". On the leftmost column are the original letters and concepts used, then from left to right are the results obtained when using α ∈ {1, 0.75, 0.5, 0.25, 0}.

Fig. 12. Word-as-images produced by our method. This subset was chosen from the random set of words.

Fig. 17. Additional results produced by our method.



7 ACKNOWLEDGMENTS

We are grateful to Richard Hao Zhang for the early discussion of the text-as-image problem, to Ali Mahdavi-Amiri and Oren Katzir for reviewing earlier versions of the manuscript, and to Anran Qi for assisting in evaluating the Chinese words. This research was supported in part by the Israel Science Foundation (grants no. 2492/20 and 3441/21), Len Blavatnik and the Blavatnik family foundation, and the Tel Aviv University Innovation Laboratories (TILabs).

REFERENCES

Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. 2021. SegDiff: Image Segmentation with Diffusion Probabilistic Models. https://doi.org/10.48550/ARXIV.2112.00390
Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended Diffusion for Text-Driven Editing of Natural Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18208–18218.
Samaneh Azadi, Matthew Fisher, Vladimir G. Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-Content GAN for Few-Shot Font Style Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Salt Lake City, UT, USA, 7564–7573.
Elena Balashova, Amit H. Bermano, Vladimir G. Kim, Stephen DiVerdi, Aaron Hertzmann, and Thomas Funkhouser. 2019. Learning a Stroke-Based Representation for Fonts. Computer Graphics Forum 38, 1 (2019), 429–442.
Brad Barber and Hannu Huhdanpaa. 1995. QHull. The Geometry Center, University of Minnesota, http://www.geom.umn.edu/software/qhull (1995).
Daniel Berio, Frederic Fol Leymarie, Paul Asente, and Jose Echevarria. 2022. StrokeStyles: Stroke-Based Segmentation and Stylization of Fonts. ACM Trans. Graph. 41, 3, Article 28 (apr 2022), 21 pages. https://doi.org/10.1145/3505246
Neill D. F. Campbell and Jan Kautz. 2014. Learning a Manifold of Fonts. ACM Transactions on Graphics (TOG) 33, 4 (2014), Article 91. https://doi.org/10.1145/2601097.2601212
Hila Chefer, Shir Gur, and Lior Wolf. 2021. Transformer Interpretability Beyond Attention Visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 782–791.
Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. 2021. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. CoRR abs/2108.02938 (2021). arXiv:2108.02938 https://arxiv.org/abs/2108.02938
Boris Delaunay et al. 1934. Sur la sphère vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk 7, 793-800 (1934), 1–2.
Noa Fish, Lilach Perry, Amit Bermano, and Daniel Cohen-Or. 2020. SketchPatch: Sketch Stylization via Seamless Patch-Level Synthesis. ACM Trans. Graph. 39, 6, Article 227 (nov 2020), 14 pages. https://doi.org/10.1145/3414685.3417816
Kevin Frans, Lisa B. Soros, and Olaf Witkowski. 2021. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. arXiv preprint arXiv:2106.14843 (2021).
FreeType. 2009. FreeType library. https://freetype.org/
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. https://doi.org/10.48550/ARXIV.2208.01618
Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2023. Designing an Encoder for Fast Personalization of Text-to-Image Models. https://doi.org/10.48550/ARXIV.2302.12228
David Ha and Douglas Eck. 2018. A Neural Representation of Sketch Drawings. In Sixth International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1704.03477
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. (2022).
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. CoRR abs/2006.11239 (2020). arXiv:2006.11239 https://arxiv.org/abs/2006.11239
Kai Hormann and Günther Greiner. 2000. MIPS: An efficient global parametrization method. Technical Report. Erlangen-Nürnberg Univ (Germany) Computer Graphics Group.
Adobe Systems Inc. 1990. Adobe Type 1 Font Format. Addison Wesley Publishing Company.
Ajay Jain, Amber Xie, and Pieter Abbeel. 2022. VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models. arXiv preprint arXiv:2211.11319 (2022).
Yue Jiang, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. 2019. SCFont: Structure-Guided Chinese Font Generation via Deep Stacked Networks. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (Jul. 2019), 4015–4022. https://doi.org/10.1609/aaai.v33i01.33014015
Ji Lee. 2011. Word As Image. Adams Media, London.
Tzu-Mao Li, Michal Lukáč, Michaël Gharbi, and Jonathan Ragan-Kelley. 2020. Differentiable Vector Graphics Rasterization for Editing and Learning. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 39, 6 (2020), 193:1–193:15.
Zhouhui Lian, Bo Zhao, Xudong Chen, and Jianguo Xiao. 2018. EasyFont: A style learning-based system to easily build your large-scale handwriting fonts. ACM Transactions on Graphics (TOG) 38, 1 (2018), 1–18.
Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. 2019. A Learned Representation for Scalable Vector Graphics. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Xu Ma, Yuqian Zhou, Xingqian Xu, Bin Sun, Valerii Filev, Nikita Orlov, Yun Fu, and Humphrey Shi. 2022. Towards Layer-wise Image Vectorization. https://doi.org/10.48550/ARXIV.2206.04655
Wendong Mao, Shuai Yang, Huihong Shi, Jiaying Liu, and Zhongfeng Wang. 2022. Intelligent Typography: Artistic Text Style Transfer for Complex Texture and Structure. IEEE Transactions on Multimedia (2022), 1–15. https://doi.org/10.1109/TMM.2022.3209870
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2022. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In International Conference on Learning Representations.
Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. 2022. Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures. https://doi.org/10.48550/ARXIV.2211.07600
Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. 2021. Text2Mesh: Text-Driven Neural Stylization for Meshes. arXiv preprint arXiv:2112.03221 (2021).
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
Laurence Penney. 1996. A History of TrueType. https://www.truetype-typography.com/
Huy Quoc Phan, Hongbo Fu, and Antoni B. Chan. 2015. FlexyFont: Learning Transferring Rules for Flexible Typeface Synthesis. Computer Graphics Forum 34, 7 (2015), 245–256.
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022).
Lakshman Prasad. 1997. Morphological analysis of shapes. CNLS Newsletter 139, 1 (1997), 1997–07.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). arXiv:2103.00020 https://arxiv.org/abs/2103.00020
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2022. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. (2022).
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. https://doi.org/10.48550/ARXIV.2205.11487
Kunpeng Song, Ligong Han, Bingchen Liu, Dimitris Metaxas, and Ahmed Elgammal. 2022. Diffusion Guided Domain Adaptation of Image Generators. https://doi.org/10.48550/ARXIV.2212.04473
Rapee Suveeranont and Takeo Igarashi. 2010. Example-Based Automatic Font Generation. In Smart Graphics. Number LNCS 6133 in Lecture Notes in Computer Science. 127–138.
Purva Tendulkar, Kalpesh Krishna, Ramprasaath R. Selvaraju, and Devi Parikh. 2019. Trick or TReAT: Thematic Reinforcement for Artistic Typography. https://doi.org/10.48550/ARXIV.1903.07820
Guy Tevet, Brian Gordon, Amir Hertz, Amit H. Bermano, and Daniel Cohen-Or. 2022. MotionCLIP: Exposing human motion generation to CLIP space. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. Springer, 358–374.

Yingtao Tian and David Ha. 2021. Modern Evolution Strategies for Creativity: Fitting
Concrete Images and Abstract Concepts. arXiv:2109.08857 [cs.NE]
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. 2022a. Plug-and-Play
Diffusion Features for Text-Driven Image-to-Image Translation. https://doi.org/10.
48550/ARXIV.2211.12572
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. 2022b. Plug-and-Play
Diffusion Features for Text-Driven Image-to-Image Translation. https://doi.org/10.
48550/ARXIV.2211.12572
Yael Vinker, Yuval Alaluf, Daniel Cohen-Or, and Ariel Shamir. 2022a. CLIPascene: Scene
Sketching with Different Types and Levels of Abstraction. https://doi.org/10.48550/
ARXIV.2211.17256
Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Roman Christian Bachmann, Amit Haim
Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. 2022b. CLIPasso:
Semantically-Aware Object Sketching. ACM Trans. Graph. 41, 4, Article 86 (jul
2022), 11 pages. https://doi.org/10.1145/3528223.3530068
Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif
Rasul, Mishig Davaadorj, and Thomas Wolf. 2022. Diffusers: State-of-the-art diffu-
sion models. https://github.com/huggingface/diffusers.
Wenjing Wang, Jiaying Liu, Shuai Yang, and Zongming Guo. 2019. Typography With
Decor: Intelligent Text Style Transfer. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR).
Yizhi Wang and Zhouhui Lian. 2021. DeepVecFont: Synthesizing High-Quality Vector
Fonts via Dual-Modality Learning. ACM Transactions on Graphics 40, 6 (Dec. 2021),
1–15. https://doi.org/10.1145/3478513.3480488
Yizhi Wang, Guo Pu, Wenhan Luo, Yexin Wang, Pengfei Xiong, Hongwen Kang, and
Zhouhui Lian. 2022. Aesthetic Text Logo Synthesis via Content-Aware Layout
Inferring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). 2436–2445.
Jie Xu and Craig S. Kaplan. 2007. Calligraphic Packing. In Proceedings of Graphics
Interface 2007 on - GI ’07. ACM Press, Montreal, Canada, 43. https://doi.org/10.1145/
1268517.1268527
Shuai Yang, Jiaying Liu, Zhouhui Lian, and Zongming Guo. 2017. Awesome Typography:
Statistics-Based Text Effects Transfer. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Shuai Yang, Jiaying Liu, Wenhan Yang, and Zongming Guo. 2018. Context-Aware Un-
supervised Text Stylization. In Proceedings of the 26th ACM International Conference
on Multimedia (Seoul, Republic of Korea) (MM ’18). Association for Computing Ma-
chinery, New York, NY, USA, 1688–1696. https://doi.org/10.1145/3240508.3240580
Shuai Yang, Zhangyang Wang, and Jiaying Liu. 2022. Shape-Matching GAN++: Scale
Controllable Dynamic Artistic Text Style Transfer. IEEE Transactions on Pattern
Analysis and Machine Intelligence 44, 7 (2022), 3807–3820. https://doi.org/10.1109/
TPAMI.2021.3055211
Junsong Zhang, Yu Wang, Weiyi Xiao, and Zhenshan Luo. 2017. Synthesizing Ornamental Typefaces. Computer Graphics Forum
36, 1 (Jan. 2017), 64–75. https://doi.org/10.1111/cgf.12785
Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao,
Peng Gao, and Hongsheng Li. 2021. PointCLIP: Point Cloud Understanding by CLIP.
https://doi.org/10.48550/ARXIV.2112.02413
Changqing Zou, Junjie Cao, Warunika Ranaweera, Ibraheem Alhashim, Ping Tan, Alla
Sheffer, and Hao Zhang. 2016. Legible Compact Calligrams. ACM Transactions on
Graphics 35, 4 (July 2016), 1–12. https://doi.org/10.1145/2897824.2925887
Ju Jia Zou, Hung-Hsin Chang, and Hong Yan. 2001. Shape skeletonization by identifying
discrete local symmetries. Pattern Recognition 34, 10 (2001), 1895–1905.

SUPPLEMENTARY MATERIAL

A IMPLEMENTATION DETAILS

In this section we provide further implementation details. We intend to release the code to promote future research in this domain. Our method is based on the pre-trained v1-5 Stable Diffusion model [Rombach et al. 2021], which we use through the diffusers [von Platen et al. 2022] Python package. We optimize only the control points' coordinates (i.e. we do not modify the color, width, and other parameters of the shape). We use the Adam optimizer with β1 = 0.9, β2 = 0.9, ε = 10⁻⁶. We use learning rate warm-up from 0.1 to 0.8 over 100 iterations and exponential decay from 0.8 to 0.4 over the remaining 400 iterations, 500 iterations in total. The optimization process requires at least 10GB of memory and approximately 5 minutes to produce a single letter illustration on an RTX2080 GPU.

Before we feed the rasterized 600×600 letter image into the Stable Diffusion model, we apply random augmentations as proposed in CLIPDraw [Frans et al. 2021]: specifically, a perspective transform with a distortion scale of 0.5, applied with probability 0.7, and a random 512×512 crop. We add the suffix "a [word]. minimal flat 2d vector. lineal color. trending on artstation." to the target word W, before feeding it into the text encoder of a pretrained CLIP model.
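A sketch of the optimization schedule and augmentations described above, using PyTorch and torchvision: the warm-up/decay shape and the transform parameters follow the quoted values, while the exact scheduler implementation in the released code may differ.

```python
# Hedged sketch of the optimization schedule and CLIPDraw-style augmentations.
import torch
import torchvision.transforms as T

control_points = torch.zeros(60, 2, requires_grad=True)   # placeholder for P-hat
optimizer = torch.optim.Adam([control_points],
                             lr=1.0, betas=(0.9, 0.9), eps=1e-6)

def lr_lambda(step, warmup=100, total=500, lr0=0.1, lr_peak=0.8, lr_end=0.4):
    """Linear warm-up 0.1 -> 0.8, then exponential decay 0.8 -> 0.4."""
    if step < warmup:
        return lr0 + (lr_peak - lr0) * step / warmup
    decay = (lr_end / lr_peak) ** ((step - warmup) / (total - warmup))
    return lr_peak * decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Random augmentations applied to the 600x600 rasterized letter before SDS.
augment = T.Compose([
    T.RandomPerspective(distortion_scale=0.5, p=0.7, fill=1),
    T.RandomCrop(512),
])
```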
"Coat"

B COMPARISONS
As described in Section 5.2 we define five baselines to compare with.
In this section we provide more details about the evaluation and "Soccer"
more qualitative results. For (1) SD, we run Stable Diffusion [Rom-
bach et al. 2021] with the default hyper parameters of 50 inference
steps and a guidance scale of 7.5. We use the prompt “Word as image
of the word [word]. [font] font. minimal flat 2d vector. lineal color. "Shirt"
black and white style”.
For (2) SDEdit [Meng et al. 2022], we utilized the diffusers [von
Platen et al. 2022] implementation, using the prompt “A [word].
minimal flat 2d vector. lineal color. black and white style”, and the
"Rugby"
rasterized input letter as the reference image. We use the default
values of 50 inference steps and a guidance scale of 7.5. We use a
strength value of 0.85. The strength value determines the quantity
of noise added to the input image – a value close to 1.0 results in
higher degree of variation in the output, and vice versa. Font
We use the official website of OpenAI to run (3) DallE2 [Ramesh Rec.
et al. 2022], using the prompt “Word as image of the word [word].
Where the letter [letter] looks like a [word]. [font] font. minimal
flat 2d vector. lineal color. black and white style”. To encourage
the manipulation of a specific letter, for (4) DallE2+letter we use Fig. 19. Examples of illustrations presented in the perceptual study. Each
pair in the top part shows illustrations obtained using our proposed method
the prompt “The letter [letter] in the shape of a [word]. [font] font.
(left) and using only SDS loss (right). On the bottom is an example of an
minimal flat 2d vector. lineal color. black and white style”. For (5) illustration presented for the font recognition questions.
CLIPDraw [Frans et al. 2021], we use the author’s official imple-
mentation with the recommended hyper-parameters. Instead of
using randomly initialized strokes, we use our vectorized letter as
input, along with the prompt “A [word]. [font] font. minimal flat D ADDITIONAL RESULTS
2d vector. lineal color. black and white style”. We provide more We provide additional results of our generated word-as-images. In
comparisons to the methods described above in Figure 20. Figures 21-32 we show results of selected words and unique fonts.

"Muffin"

"Tiger"

"Octopus"

"Plant"

"Astronaut"

"Robot"

"Bunny"

"Flamingo"

"Paris"

"Owl"

"Swan"

"Mermaid"

Input SD SDEdit DallE2 DallE2+letter CLIPDraw Ours

Fig. 20. Comparison to alternative methods based on large scale text-to-image models. On the left are the letters used as input (only for SDEdit, CLIPDraw,
and ours), as well as the desired object of interest. The results from left to right obtained using Stable Diffusion [Rombach et al. 2021], SDEdit [Meng et al.
2022], DallE2 [Ramesh et al. 2022], DallE2 with a letter specific prompt, CLIPDraw [Frans et al. 2021], and our single-letter results.

In Figures 33-48 we show the results obtained for the random set of
words.

Table 2. Randomly chosen objects, letters, and fonts for the perceptual
study.

Object Letter Font


Pineapple P Noteworthy-Bold
Orange O Quicksand
Rugby Y Noteworthy-Bold
Soccer S Noteworthy-Bold
Bear B Bell MT
Lion O Quicksand
Singer N Noteworthy-Bold
Pilot P Noteworthy-Bold
Coat O HobeauxRococeaux-Sherman
Shirt S Bell MT

Fig. 21. Word-as-image illustrations created by our method.



Fig. 22. Word-as-image illustrations created by our method.

Fig. 23. Word-as-image illustrations created by our method.

Fig. 24. Word-as-image illustrations created by our method.



Fig. 25. Word-as-image illustrations created by our method.

Fig. 26. Word-as-image illustrations created by our method.



Fig. 27. Word-as-image illustrations created by our method.

Fig. 28. Word-as-image illustrations created by our method.



Fig. 29. Word-as-image illustrations created by our method.

Fig. 30. Word-as-image illustrations created by our method.



Fig. 31. Word-as-image illustrations created by our method.

Fig. 32. Word-as-image illustrations created by our method.



Fig. 33. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 34. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 35. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 36. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 37. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 38. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 39. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 40. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 41. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 42. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 43. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 44. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 45. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 46. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 47. Word-as-image illustrations created by our method for randomly chosen words.

Fig. 48. Word-as-image illustrations created by our method for randomly chosen words.
