Word-As-Image For Semantic Typography
Fig. 1. A few examples of our word-as-image illustrations in various fonts and for different textual concepts. The semantically adjusted letters are created completely automatically using our method, and can then be used for further creative design as we illustrate here.
A word-as-image is a semantic typography technique in which a word illustration presents a visualization of the meaning of the word, while also preserving its readability. We present a method to create word-as-image illustrations automatically. This task is highly challenging as it requires semantic understanding of the word and a creative idea of where and how to depict these semantics in a visually pleasing and legible manner. We rely on the remarkable ability of recent large pretrained language-vision models to distill textual concepts visually. We target simple, concise, black-and-white designs that convey the semantics clearly. We deliberately do not change the color or texture of the letters and do not use embellishments. Our method optimizes the outline of each letter to convey the desired concept, guided by a pretrained Stable Diffusion model. We incorporate additional loss terms to ensure the legibility of the text and the preservation of the style of the font. We show high-quality and engaging results on numerous examples and compare to alternative techniques.

Code will be available at our project page.

1 INTRODUCTION

Semantic typography is the practice of using typography to visually reinforce the meaning of text. This can be achieved through the choice of typefaces, font sizes, font styles, and other typographic elements. A more elaborate and engaging technique for semantic typography is presented by word-as-image illustrations, where the semantics of a given word are illustrated using only the graphical elements of its letters. Such illustrations provide a visual representation of the meaning of the word, while also preserving the readability of the word as a whole.

The task of creating a word-as-image is highly challenging, as it requires the ability to understand and depict the visual characteristics of the given concept, and to convey them in a concise, aesthetic, and comprehensible manner without harming legibility. It requires a great deal of creativity and design skill to integrate the chosen visual concept into the letter's shape [Lee 2011]. In Figure 2 we show some word-as-image examples created manually. For example, to create the "jazz" depiction, the designer had to first choose the visual concept that would best fit the semantics of the text (a saxophone), consider the desired font characteristics, and then choose the most suitable letter to be replaced. Finding the right visual element to illustrate a concept is ill-defined, as there are countless ways to illustrate any given concept. In addition, one cannot simply copy a selected visual element onto the word – there is a need to find subtle modifications of the letter's shape.

Because of these complexities, the task of automatically creating word-as-image illustrations was practically impossible to achieve using computers until recently. In this paper, we define an algorithm for the automatic creation of word-as-image illustrations based on recent advances in deep learning and the availability of huge foundational models that combine language and visual understanding. Our resulting illustrations (see Figure 1) could be used for logo design, for signs, in greeting cards and invitations, and simply for fun. They can be used as-is, or as inspiration for further refinement of the design.

Existing methods in the field of text stylization often rely on raster textures [Yang et al. 2018], place a manually created style on top of a stroke segmentation [Berio et al. 2022], or deform the text into a pre-defined target shape [Zou et al. 2016] (see Figure 3). Only a few works [Tendulkar et al. 2019; Zhang et al. 2017] deal with semantic typography, and they often operate in the raster domain and use existing icons for replacement (see Figure 3E).

Our word-as-image illustrations concentrate on changing only the geometry of the letters to convey the meaning. We deliberately do not change color or texture and do not use embellishments.
Fig. 5. An overview of our method. Given an input letter l_i represented by a set of control points P, and a concept (shown in purple), we optimize the new positions P̂ of the deformed letter l̂_i iteratively. At each iteration, the set P̂ is fed into a differentiable rasterizer (DiffVG, marked in blue) that outputs the rasterized deformed letter l̂_i. l̂_i is then augmented and passed into a pretrained frozen Stable Diffusion model that drives the letter shape to convey the semantic concept using the ∇_P̂ L_LSDS loss (1). l_i and l̂_i are also passed through a low-pass filter (LPF, marked in yellow) to compute L_tone (2), which encourages the preservation of the overall tone of the font style and also the local letter shape. Additionally, the sets P and P̂ are passed through a Delaunay triangulation operator (D, marked in green), defining L_acap (3), which encourages the preservation of the initial shape.
3.4 VectorFusion

Recently, VectorFusion [Jain et al. 2022] utilized the SDS loss for the task of text-to-SVG generation. The proposed generation pipeline involves two stages. Given a text prompt, first, an image is generated using Stable Diffusion (with an added suffix to the prompt), and is then vectorized automatically using LIVE [Ma et al. 2022]. This defines an initial set of parameters to be optimized in the second stage using the SDS loss. At each iteration, a differentiable rasterizer [Li et al. 2020] is used to produce a 600 × 600 image, which is then augmented as suggested in CLIPDraw [Frans et al. 2021] to get a 512 × 512 image x_aug. Then x_aug is fed into the pretrained encoder E of Stable Diffusion to produce the corresponding latent code z = E(x_aug). The SDS loss is then applied in this latent space, in a similar way to the one defined in DreamFusion:

\nabla_{\theta}\mathcal{L}_{LSDS} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_{\phi}(\alpha_t z_t + \sigma_t \epsilon, y) - \epsilon\big)\,\frac{\partial z}{\partial x_{aug}}\,\frac{\partial x_{aug}}{\partial \theta} \right] \qquad (3)

We find the SDS approach useful for our task of producing semantic glyphs, and we follow the technical steps proposed in VectorFusion (e.g. the augmentations and the added suffix).
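To make Eq. (3) concrete, the following PyTorch sketch shows one common way to realize an SDS step with a diffusers-style UNet and noise scheduler. The helper names (`unet`, `scheduler`, `text_emb`), the uniform weighting w(t) = 1, and the omission of classifier-free guidance are assumptions made here for brevity, not the authors' exact implementation.

```python
import torch

def sds_loss(latents, text_emb, unet, scheduler, t_range=(50, 950)):
    """Rough sketch of one SDS step (Eq. 3); not the authors' exact code.

    `latents` is the Stable Diffusion latent of the augmented raster and must
    carry gradients back to the control points through the rasterizer/encoder.
    """
    t = torch.randint(t_range[0], t_range[1] + 1, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    with torch.no_grad():
        noisy = scheduler.add_noise(latents, noise, t)               # alpha_t*z + sigma_t*eps
        eps_hat = unet(noisy, t, encoder_hidden_states=text_emb).sample
    grad = eps_hat - noise                                           # w(t) folded into the lr
    # Surrogate loss whose gradient w.r.t. `latents` equals `grad`,
    # i.e. the U-Net itself is never differentiated through:
    return (grad.detach() * latents).sum()
```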
4 METHOD

Given a word W represented as a string with n letters {l_1, ..., l_n}, our method is applied to every letter l_i separately to produce a semantic visual depiction of the letter. The user can then choose which letters to replace and which to keep in their original form.

Depending on the letter's complexity and the style of the font, the extracted outlines are defined by a different number of control points. We have found that the initial number of control points affects the final appearance significantly: as the number of control points increases, there is more freedom for visual changes to occur. Therefore, we additionally apply a subdivision procedure to letters containing a small number of control points. We define a desired number of control points for each letter of the alphabet (shared across different fonts), and then iteratively subdivide the Bézier segments until reaching this target number. At each iteration, we compute the maximum arc length among all Bézier segments and split the segment with this length into two (see Figure 6). We analyse the effect of the number of control points in Section 5.3. This procedure defines a set of k_i control points P_i = {p_j}_{j=1}^{k_i} representing the shape of the letter l_i.

Fig. 6. Illustration of the letter's outline and control points before (left) and after (right) the subdivision process. The orange dots are the initial Bézier curve segment endpoints. The blue dots are the remaining control points, respectively before and after subdivision.
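The subdivision procedure above can be sketched as follows; the segment representation (tuples of four 2D points) and the arc-length sampling density are assumptions, not the authors' exact implementation.

```python
import numpy as np

def split_bezier(p0, p1, p2, p3, t=0.5):
    """De Casteljau split of one cubic Bezier segment into two."""
    a, b, c = (1 - t) * p0 + t * p1, (1 - t) * p1 + t * p2, (1 - t) * p2 + t * p3
    d, e = (1 - t) * a + t * b, (1 - t) * b + t * c
    m = (1 - t) * d + t * e
    return (p0, a, d, m), (m, e, c, p3)

def arc_length(seg, n=32):
    """Approximate arc length of a cubic segment by dense sampling."""
    p0, p1, p2, p3 = seg
    ts = np.linspace(0.0, 1.0, n)[:, None]
    pts = ((1 - ts) ** 3) * p0 + 3 * ((1 - ts) ** 2) * ts * p1 \
        + 3 * (1 - ts) * (ts ** 2) * p2 + (ts ** 3) * p3
    return np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()

def subdivide(segments, target_points):
    """Repeatedly split the longest segment until the glyph reaches the target count."""
    while 3 * len(segments) < target_points:   # a closed chain of cubics has 3 points per segment
        i = max(range(len(segments)), key=lambda k: arc_length(segments[k]))
        left, right = split_bezier(*segments[i])
        segments[i:i + 1] = [left, right]
    return segments
```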
Our goal is to find new positions P̂ for the control points, defining an adjusted letter l̂ that conveys the given concept, while maintaining the overall structure and characteristics of the initial letter l.

We initialize the learned set of control points P̂ with P, and pass it through a differentiable rasterizer R [Li et al. 2020] (marked in blue), which outputs the rasterized letter R(P̂). The rasterized letter is then randomly augmented and passed into a pretrained Stable Diffusion [Rombach et al. 2021] model, conditioned on the CLIP embedding of the given text c. The SDS loss ∇_P̂ L_LSDS is then used as described in Section 3 to encourage R(P̂) to convey the given text prompt.

To preserve the shape of each individual letter and ensure the legibility of the word as a whole, we use two additional loss functions to guide the optimization process. The first loss limits the overall shape change by defining an as-conformal-as-possible constraint on the shape deformation. The second loss preserves the overall shape and style of the font by constraining the tone (i.e. the amount of dark vs. light areas in local parts of the shape) of the modified letter not to diverge too much from that of the original letter (see Section 4.3).

The gradients obtained from all the losses are then backpropagated to update the parameters P̂. We repeat this process for 500 steps, which takes ∼5 minutes to produce a single letter illustration on an RTX2080 GPU.
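Putting the pieces together, a condensed sketch of the per-letter optimization loop might look as follows. The helpers `rasterize`, `augment`, `encode`, `sds_loss`, `acap_loss`, `tone_loss`, and `beta` are assumed names (sketched elsewhere in this section), and the learning rate is simplified to a constant.

```python
import torch

# Sketch of the per-letter loop under the assumptions stated above.
P_hat = P.clone().detach().requires_grad_(True)     # control points being optimized
target = rasterize(P).detach()                      # original letter raster, fixed
optimizer = torch.optim.Adam([P_hat], lr=0.8)

for step in range(500):
    img = rasterize(P_hat)                          # differentiable rasterizer (e.g. DiffVG)
    z = encode(augment(img))                        # Stable Diffusion latent of the augmented raster
    loss = sds_loss(z, text_emb, unet, scheduler)   # semantic term (Section 3)
    loss = loss + 0.5 * acap_loss(P, P_hat, tris)   # alpha = 0.5
    loss = loss + beta(step) * tone_loss(target, img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```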
4.3 Loss Functions

Our primary objective, encouraging the resulting shape to convey the intended semantic concept, is realized by the ∇_P̂ L_LSDS loss (described in Section 3). We observe that using ∇_P̂ L_LSDS alone can cause large deviations from the initial letter appearance, which is undesired. Hence, our additional goal is to maintain the shape and legibility of the letter R(P̂), as well as to keep the original font's characteristics. For that purpose we use two additional losses.

As-Conformal-As-Possible Deformation Loss. To prevent the final letter shape from diverging too much from the initial shape, we triangulate the inner part of the letter and constrain the deformation of the letter to be as conformal as possible (ACAP) [Hormann and Greiner 2000]. We use constrained Delaunay triangulation [Barber and Huhdanpaa 1995; Delaunay et al. 1934] on the set of control points defining the glyph. It is known that Delaunay triangulation can be used to produce the skeleton of an outline [Prasad 1997; Zou et al. 2001], so the ACAP loss also implicitly captures a skeletal representation of the letter form. The Delaunay triangulation D(P) splits the glyph represented by P into a set of triangles. This defines a set of size m_j of corresponding angles for each control point p_j (see Figure 7). We denote this set of angles as {α_j^i}_{i=1}^{m_j}. The ACAP loss encourages the induced angles of the optimized shape P̂ not to deviate much from the angles of the original shape P, and is defined as the L2 distance between the corresponding angles:

\mathcal{L}_{acap}(P, \hat{P}) = \frac{1}{k}\sum_{j=1}^{k}\sum_{i=1}^{m_j}\left(\alpha_j^i - \hat{\alpha}_j^i\right)^2 \qquad (4)

where k = |P| and α̂ are the angles induced by D(P̂).

Fig. 7. Visual illustration of the constrained Delaunay triangulation applied to the initial shapes (left) and the resulting ones (right), for the word "pants". The ACAP loss maintains the structure of the letter after the deformation. The zoomed rectangle shows the angles for a given control point p_j.

Tone Preservation Loss. To preserve the style of the font as well as the structure of the letter, we add a local tone preservation loss term. This term constrains the tone (the amount of black vs. white in all regions of the shape) of the adjusted letter not to deviate too much from the tone of the original font's letter. Towards this end, we apply a low-pass filter (LPF) to the rasterized letter (before and after deformation) and compute the L2 distance between the resulting blurred letters:

\mathcal{L}_{tone} = \left\| LPF(R(\hat{P})) - LPF(R(P)) \right\|_2^2 \qquad (5)

An example of the blurred letters is shown in Figure 8; as can be seen, we use a high value of the standard deviation σ in the blurring kernel to blur out small details such as the ears of the bear.

Fig. 8. Our tone-preserving loss preserves the local tone of the font by comparing the low-pass filter of the letter images before (left) and after deformation (right). It constrains the adjusted letter not to deviate too much from the original. This example is of the letter B and the word "Bear".

Our final objective is then defined by the weighted average of the three terms:

\min_{\hat{P}} \; \nabla_{\hat{P}}\mathcal{L}_{LSDS}(R(\hat{P}), c) + \alpha \cdot \mathcal{L}_{acap}(P, \hat{P}) + \beta_t \cdot \mathcal{L}_{tone}(R(\hat{P}), R(P)) \qquad (6)

where α = 0.5 and β_t depends on the step t as described next.

4.4 Weighting

Choosing the relative weights of the three losses presented above is crucial to the appearance of the final letter. While the ∇_P̂ L_LSDS loss encourages the shape to deviate from its original appearance to better fit the semantic concept, the two terms L_tone and L_acap are responsible for maintaining the original shape. Hence, we have two competing parts in the formula, and we would like to find a balance between them that maintains the legibility of the letter while allowing the desired semantic shape change.

We find that L_tone can be very dominant. In some cases, if it is used from the beginning, no semantic deformation is performed.
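A minimal sketch of the two preservation losses is given below. Note two simplifications relative to the paper: the triangulation here is an unconstrained Delaunay triangulation of the control points (the paper constrains it to the glyph interior), and the angles are grouped per triangle rather than per control point as in Eq. (4).

```python
import torch
from scipy.spatial import Delaunay
from torchvision.transforms.functional import gaussian_blur

def build_triangulation(P):
    """Plain Delaunay triangulation of the control points (the paper uses a
    constrained triangulation restricted to the glyph interior)."""
    return torch.as_tensor(Delaunay(P.detach().cpu().numpy()).simplices)

def triangle_angles(pts, tris):
    """Interior angles of every triangle, shape (num_triangles, 3)."""
    a, b, c = pts[tris[:, 0]], pts[tris[:, 1]], pts[tris[:, 2]]
    def ang(u, v, w):                                   # angle at vertex u
        e1, e2 = v - u, w - u
        cos = (e1 * e2).sum(-1) / (e1.norm(dim=-1) * e2.norm(dim=-1) + 1e-8)
        return torch.acos(cos.clamp(-1.0, 1.0))
    return torch.stack([ang(a, b, c), ang(b, c, a), ang(c, a, b)], dim=-1)

def acap_loss(P, P_hat, tris):
    """Eq. (4): squared deviation of the triangulation angles after deformation."""
    return ((triangle_angles(P, tris) - triangle_angles(P_hat, tris)) ** 2).mean()

def tone_loss(img, img_hat, sigma=30.0):
    """Eq. (5): L2 between heavily blurred rasters (the low-pass filter LPF)."""
    k = 2 * int(2 * sigma) + 1                          # odd kernel covering ~±2 sigma
    return ((gaussian_blur(img, k, sigma) - gaussian_blur(img_hat, k, sigma)) ** 2).mean()
```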
Therefore, we adjust the weight of L_tone so that it kicks in only after some semantic deformation has occurred. We define β_t as follows:

\beta_t = a \cdot \exp\left(-\frac{(t - b)^2}{2c^2}\right) \qquad (7)

with a = 100, b = 300, c = 30. We analyse the effect of various weightings in Section 5.3. Note that the same hyper-parameter choice works for various words, letters, and fonts.
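Eq. (7) with the stated constants amounts to a small helper; with these values β_t is essentially zero for roughly the first 200 steps and peaks at a = 100 around step b = 300.

```python
import math

def beta(t, a=100.0, b=300.0, c=30.0):
    """Eq. (7): the tone-loss weight is a Gaussian bump centered at step b,
    so L_tone only kicks in after some semantic deformation has happened."""
    return a * math.exp(-((t - b) ** 2) / (2 * c ** 2))
```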
5 RESULTS

The robustness of our approach means it should be capable of handling a wide range of input concepts as well as supporting different font designs. Figures 1, 4, 33, 17, and more results in the supplemental file demonstrate that our approach can handle inputs from many different categories and various fonts, and that the generated results are legible and creative. Figure 9 demonstrates how the illustrations created by our method for the same word follow the characteristics of different fonts. Although the perceived aesthetics of a word-as-image illustration can be subjective, we define three objectives for an effective result: (1) it should visually capture the given semantic concept, (2) it should maintain readability, and (3) it should preserve the original font's characteristics.

Fig. 9. Word-as-images produced by our method for the word "YOGA", using eight different fonts.

We evaluate the performance of our method on a randomly selected set of inputs. We select five common concept classes – animals, fruits, plants, sports, and professions. Using ChatGPT, we sample ten random instances for each class, resulting in 50 words in total. Next, we select four fonts that have distinct visual characteristics, namely Quicksand, Bell MT, Noteworthy-Bold, and HobeauxRococeaux-Sherman. For each word, we randomly sampled one of the four fonts, and applied our method to each letter. For each word with n letters we can generate 2^n possible word-as-images, which are all possible combinations of replacements of illustrated letters. A selected subset of these results is presented in Figure 33. The results of all letters and words are presented in the supplementary material.

As can be seen, the resulting word-as-image illustrations successfully convey the given semantic concept in most cases while still remaining legible. In addition, our method successfully captures the font characteristics. For example, in Figure 33, the replacements for the "DRESS" and "LION" are thin and fit well with the rest of the word. In addition, observe the serifs of the letter A used for the fin of the shark in the "SHARK" example. We further use human evaluation to validate this as described below.

5.1 Quantitative

We conduct a perceptual study to quantitatively assess the three objectives of our resulting word-as-images. We randomly select two instances from each of the resulting word-as-image illustrations for the five classes described above, and visually select one letter from each word, resulting in 10 letters in total. In each question we show an isolated letter illustration, without the context of the word. To evaluate the ability of our method to visually depict the desired concept, we present four label options from the same class, and ask participants to choose the one that describes the letter illustration best. To evaluate the legibility of the results, we ask participants to choose the most suitable letter from a random list of four letters. To assess the preservation of the font style, we present the four fonts and ask participants to choose the most suitable font for the illustration. We gathered answers from 40 participants, and the results are shown in Table 1. As can be seen, the levels of concept recognizability and letter legibility are very high, and the 51% style matching of the letter illustration to the original font is well above random, which is 25%. We also test our algorithm without the two additional structure and style preserving losses (L_acap and L_tone) on the same words and letters ("Only SDS" in the table). As expected, without the additional constraints, the letter deforms significantly, resulting in higher concept recognizability but lower legibility and font style preservation. More details and examples are provided in the supplementary material.

5.2 Comparison

In the absence of a relevant baseline for comparison, we define baselines based on large popular text-to-image models. Specifically, we use (1) SD – Stable Diffusion [Rombach et al. 2021], (2) SDEdit [Meng et al. 2022], (3) DallE2 [Ramesh et al. 2022] illustrating the word, (4) DallE2+letter illustrating only the letter, and (5) CLIPDraw [Frans et al. 2021]. We applied the methods above (details can be found in the supplemental material) to three representative words – "bird", "dress", and "tulip", with the fonts Bell MT, Quicksand, and Noteworthy-Bold, respectively. The results can be seen in Figure 10. In some cases Stable Diffusion (SD) did not manage to produce text at all (such as for the bird), and when text is produced, it is often not legible. The results obtained by SDEdit preserve the font's characteristics and the letter's legibility, but often fail to reflect the desired concept, such as in the case of the bird and the dress.
[Figure 10 rows: the word BIRD and the letter R; the word DRESS and the letter E; the word TULIP and the letter U. Columns: Input, SD, SDEdit, DallE2, DallE2+letter, CLIPDraw, Ours.]
Fig. 10. Comparison to alternative methods based on large-scale text-to-image models. On the left are the letters used as input (only for SDEdit, CLIPDraw, and ours), as well as the desired object of interest. The results from left to right were obtained using Stable Diffusion [Rombach et al. 2021], SDEdit [Meng et al. 2022], DallE2 [Ramesh et al. 2022], DallE2 with a letter-specific prompt, CLIPDraw [Frans et al. 2021], and our single-letter results, as well as the final word-as-image.
Additionally, it operates in the raster domain and tends to add details on top of the letter, while our method operates directly on the vector representation of the letters with the objective of modifying their shape. DallE2 manages to reflect the visual concept, however it often fails to produce legible text. When applied with a dedicated prompt to produce the word-as-image of only one letter (fifth column), it manages to produce a legible letter, but there is less control over the output – it is impossible to specify the desired font or to control the size, position, and shape of the generated letter. Therefore, it is not clear how to combine these output illustrations into the entire word to create a word-as-image.

CLIPDraw produces reasonable results conveying the semantics of the input word. However, the results are non-smooth and the characteristics of the font are not preserved (for example, observe how the letter "E" differs from the input letter). We further examine CLIPDraw with our shape preservation losses in the next section.

5.3 Ablation

Figure 11 illustrates the impact of the letter's initial number of control points. When fewer control points are used (P_o is the original number of control points), we may get insufficient variations, such as for the gorilla. However, this can also result in more abstract depictions, such as the ballerina. As we add control points, we get more graphic results, with the tradeoff that they often deviate from the original letter. In Figure 15 we show the results of using only the ∇_P̂ L_LSDS loss. As can be seen, in that case the illustrations strongly convey the semantic concept, however at the cost of legibility. In Figure 16 we analyze the effect of the weight α applied to L_acap, ranging from 1 to 0. When L_acap is too dominant, the results may not sufficiently reflect the semantic concept, while the opposite case harms legibility. Figure 13 illustrates a change in the σ parameter of the low-pass filter. When σ = 1 almost no blur is applied, resulting in a shape constraint that is too strong.

In Figure 14 we show the results of replacing the ∇_P̂ L_LSDS loss with a CLIP-based loss, while using our proposed shape preservation terms. Although the results obtained with CLIP often depict the desired visual concept, we find that using Stable Diffusion leads to smoother illustrations that capture a wider range of semantic concepts. By using the hyperparameters described in the paper, we are able to achieve a reasonable balance between semantics and legibility. The parameters were determined manually based on visual assessments, but can be adjusted as needed based on the user's personal taste and goals.

Fig. 11. The effect of the initial number of control points on the outputs. On the left are the input letters and the target concepts used to generate the results on the right. P_o indicates the original number of control points as extracted from the font, P is the input letter with our chosen hyperparameters, and for 2 × P we increase the number of control points in P by a factor of two.

6 CONCLUSIONS

We presented a method for the automatic creation of vector-format word-as-image illustrations. Our method can handle a large variety of semantic concepts and use any font, while preserving the legibility of the text and the font's style.

There are limitations to our method. First, our method works letter by letter, and therefore it cannot deform the shape of the entire word. In the future we can try to optimize the shape of several letters jointly. Second, the approach works best on concrete visual concepts, and may fail with more abstract ones. This can be alleviated by optimizing the shape of letters using different concepts than the word itself. Third, the layout of the letters can also be automated, for example using methods such as [Wang et al. 2022].

Our word-as-image illustrations demonstrate visual creativity and open the possibility for the use of large vision-language models for semantic typography, possibly also adding a human in the loop to arrive at more synergistic design methods combining ML models and humans.
Fig. 12. Word-as-images produced by our method. This subset was chosen from the random set of words.

Fig. 13. Altering the σ parameter of the low-pass filter used in the L_tone loss. In the leftmost column are the original letters and concepts used; then from left to right are the results obtained when using σ ∈ {1, 5, 30, 200}, and without L_tone.

Fig. 14. Replacing the SDS loss with a CLIP-based loss.

Fig. 15. The effect of using only the SDS loss: note how the third row simply looks like icon illustrations, while the second row still resembles legible letters.

Fig. 16. Altering the weight α of the L_acap loss. In the leftmost column are the original letters and concepts used; then from left to right are the results obtained when using α ∈ {1, 0.75, 0.5, 0.25, 0}.
7 ACKNOWLEDGMENTS

We are grateful to Richard Hao Zhang for the early discussion of the text-as-image problem, to Ali Mahdavi-Amiri and Oren Katzir for reviewing earlier versions of the manuscript, and to Anran Qi for assisting in evaluating the Chinese words. This research was supported in part by the Israel Science Foundation (grants no. 2492/20 and 3441/21), Len Blavatnik and the Blavatnik family foundation, and the Tel Aviv University Innovation Laboratories (TILabs).

REFERENCES

Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. 2021. SegDiff: Image Segmentation with Diffusion Probabilistic Models. https://doi.org/10.48550/ARXIV.2112.00390

Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended Diffusion for Text-Driven Editing of Natural Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18208–18218.

Samaneh Azadi, Matthew Fisher, Vladimir G. Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-Content GAN for Few-Shot Font Style Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Salt Lake City, UT, USA, 7564–7573.

Elena Balashova, Amit H. Bermano, Vladimir G. Kim, Stephen DiVerdi, Aaron Hertzmann, and Thomas Funkhouser. 2019. Learning a Stroke-Based Representation for Fonts. Computer Graphics Forum 38, 1 (2019), 429–442.

Brad Barber and Hannu Huhdanpaa. 1995. QHull. The Geometry Center, University of Minnesota, http://www.geom.umn.edu/software/qhull (1995).

Daniel Berio, Frederic Fol Leymarie, Paul Asente, and Jose Echevarria. 2022. StrokeStyles: Stroke-Based Segmentation and Stylization of Fonts. ACM Trans. Graph. 41, 3, Article 28 (apr 2022), 21 pages. https://doi.org/10.1145/3505246

Neill DF Campbell and Jan Kautz. 2014. Learning a Manifold of Fonts. ACM Transactions on Graphics (TOG) 33, 4 (2014), Article 91. https://doi.org/10.1145/2601097.2601212

Hila Chefer, Shir Gur, and Lior Wolf. 2021. Transformer Interpretability Beyond Attention Visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 782–791.

Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. 2021. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. CoRR abs/2108.02938 (2021). arXiv:2108.02938 https://arxiv.org/abs/2108.02938

Boris Delaunay et al. 1934. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk 7, 793-800 (1934), 1–2.

Noa Fish, Lilach Perry, Amit Bermano, and Daniel Cohen-Or. 2020. SketchPatch: Sketch Stylization via Seamless Patch-Level Synthesis. ACM Trans. Graph. 39, 6, Article 227 (nov 2020), 14 pages. https://doi.org/10.1145/3414685.3417816

Kevin Frans, Lisa B Soros, and Olaf Witkowski. 2021. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. arXiv preprint arXiv:2106.14843 (2021).

FreeType. 2009. FreeType library. https://freetype.org/

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. https://doi.org/10.48550/ARXIV.2208.01618

Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2023. Designing an Encoder for Fast Personalization of Text-to-Image Models. https://doi.org/10.48550/ARXIV.2302.12228

David Ha and Douglas Eck. 2018. A Neural Representation of Sketch Drawings. In Sixth International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1704.03477

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. (2022).

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. CoRR abs/2006.11239 (2020). arXiv:2006.11239 https://arxiv.org/abs/2006.11239

Kai Hormann and Günther Greiner. 2000. MIPS: An efficient global parametrization method. Technical Report. Erlangen-Nuernberg Univ (Germany) Computer Graphics Group.

Adobe Systems Inc. 1990. Adobe Type 1 Font Format. Addison Wesley Publishing Company.

Ajay Jain, Amber Xie, and Pieter Abbeel. 2022. VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models. arXiv preprint arXiv:2211.11319 (2022).

Yue Jiang, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. 2019. SCFont: Structure-Guided Chinese Font Generation via Deep Stacked Networks. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (Jul. 2019), 4015–4022. https://doi.org/10.1609/aaai.v33i01.33014015

Ji Lee. 2011. Word As Image. Adams Media, London.

Tzu-Mao Li, Michal Lukáč, Gharbi Michaël, and Jonathan Ragan-Kelley. 2020. Differentiable Vector Graphics Rasterization for Editing and Learning. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 39, 6 (2020), 193:1–193:15.

Zhouhui Lian, Bo Zhao, Xudong Chen, and Jianguo Xiao. 2018. EasyFont: A style learning-based system to easily build your large-scale handwriting fonts. ACM Transactions on Graphics (TOG) 38, 1 (2018), 1–18.

Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. 2019. A Learned Representation for Scalable Vector Graphics. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

Xu Ma, Yuqian Zhou, Xingqian Xu, Bin Sun, Valerii Filev, Nikita Orlov, Yun Fu, and Humphrey Shi. 2022. Towards Layer-wise Image Vectorization. https://doi.org/10.48550/ARXIV.2206.04655

Wendong Mao, Shuai Yang, Huihong Shi, Jiaying Liu, and Zhongfeng Wang. 2022. Intelligent Typography: Artistic Text Style Transfer for Complex Texture and Structure. IEEE Transactions on Multimedia (2022), 1–15. https://doi.org/10.1109/TMM.2022.3209870

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2022. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In International Conference on Learning Representations.

Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. 2022. Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures. https://doi.org/10.48550/ARXIV.2211.07600

Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. 2021. Text2Mesh: Text-Driven Neural Stylization for Meshes. arXiv preprint arXiv:2112.03221 (2021).

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).

Laurence Penney. 1996. A History of TrueType. https://www.truetype-typography.com/

Huy Quoc Phan, Hongbo Fu, and Antoni B Chan. 2015. FlexyFont: Learning Transferring Rules for Flexible Typeface Synthesis. Computer Graphics Forum 34, 7 (2015), 245–256.

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022).

Lakshman Prasad. 1997. Morphological analysis of shapes. CNLS newsletter 139, 1 (1997), 1997–07.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). arXiv:2103.00020 https://arxiv.org/abs/2103.00020

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2022. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. (2022).

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. https://doi.org/10.48550/ARXIV.2205.11487

Kunpeng Song, Ligong Han, Bingchen Liu, Dimitris Metaxas, and Ahmed Elgammal. 2022. Diffusion Guided Domain Adaptation of Image Generators. https://doi.org/10.48550/ARXIV.2212.04473

Rapee Suveeranont and Takeo Igarashi. 2010. Example-Based Automatic Font Generation. In Smart Graphics. Number LNCS 6133 in Lecture Notes in Computer Science. 127–138.

Purva Tendulkar, Kalpesh Krishna, Ramprasaath R. Selvaraju, and Devi Parikh. 2019. Trick or TReAT: Thematic Reinforcement for Artistic Typography. https://doi.org/10.48550/ARXIV.1903.07820

Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. 2022. MotionCLIP: Exposing human motion generation to CLIP space. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. Springer, 358–374.
Yingtao Tian and David Ha. 2021. Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts. arXiv:2109.08857 [cs.NE]

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. 2022a. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. https://doi.org/10.48550/ARXIV.2211.12572

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. 2022b. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. https://doi.org/10.48550/ARXIV.2211.12572

Yael Vinker, Yuval Alaluf, Daniel Cohen-Or, and Ariel Shamir. 2022a. CLIPascene: Scene Sketching with Different Types and Levels of Abstraction. https://doi.org/10.48550/ARXIV.2211.17256

Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. 2022b. CLIPasso: Semantically-Aware Object Sketching. ACM Trans. Graph. 41, 4, Article 86 (jul 2022), 11 pages. https://doi.org/10.1145/3528223.3530068

Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. 2022. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers

Wenjing Wang, Jiaying Liu, Shuai Yang, and Zongming Guo. 2019. Typography With Decor: Intelligent Text Style Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Yizhi Wang and Zhouhui Lian. 2021. DeepVecFont: Synthesizing High-Quality Vector Fonts via Dual-Modality Learning. ACM Transactions on Graphics 40, 6 (Dec. 2021), 1–15. https://doi.org/10.1145/3478513.3480488

Yizhi Wang, Guo Pu, Wenhan Luo, Yexin Wang, Pengfei Xiong, Hongwen Kang, and Zhouhui Lian. 2022. Aesthetic Text Logo Synthesis via Content-Aware Layout Inferring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2436–2445.

Jie Xu and Craig S. Kaplan. 2007. Calligraphic Packing. In Proceedings of Graphics Interface 2007 - GI '07. ACM Press, Montreal, Canada, 43. https://doi.org/10.1145/1268517.1268527

Shuai Yang, Jiaying Liu, Zhouhui Lian, and Zongming Guo. 2017. Awesome Typography: Statistics-Based Text Effects Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Shuai Yang, Jiaying Liu, Wenhan Yang, and Zongming Guo. 2018. Context-Aware Unsupervised Text Stylization. In Proceedings of the 26th ACM International Conference on Multimedia (Seoul, Republic of Korea) (MM '18). Association for Computing Machinery, New York, NY, USA, 1688–1696. https://doi.org/10.1145/3240508.3240580

Shuai Yang, Zhangyang Wang, and Jiaying Liu. 2022. Shape-Matching GAN++: Scale Controllable Dynamic Artistic Text Style Transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 7 (2022), 3807–3820. https://doi.org/10.1109/TPAMI.2021.3055211

Junsong Zhang, Yu Wang, Weiyi Xiao, and Zhenshan Luo. 2017. Synthesizing Ornamental Typefaces. Computer Graphics Forum 36, 1 (Jan. 2017), 64–75. https://doi.org/10.1111/cgf.12785

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. 2021. PointCLIP: Point Cloud Understanding by CLIP. https://doi.org/10.48550/ARXIV.2112.02413

Changqing Zou, Junjie Cao, Warunika Ranaweera, Ibraheem Alhashim, Ping Tan, Alla Sheffer, and Hao Zhang. 2016. Legible Compact Calligrams. ACM Transactions on Graphics 35, 4 (July 2016), 1–12. https://doi.org/10.1145/2897824.2925887

Ju Jia Zou, Hung-Hsin Chang, and Hong Yan. 2001. Shape skeletonization by identifying discrete local symmetries. Pattern Recognition 34, 10 (2001), 1895–1905.
SUPPLEMENTARY MATERIAL

A IMPLEMENTATION DETAILS

In this section we provide further implementation details. We intend to release the code to promote future research in this domain. Our method is based on the pre-trained v1-5 Stable Diffusion model [Rombach et al. 2021], which we use through the diffusers [von Platen et al. 2022] Python package. We optimize only the control points' coordinates (i.e. we do not modify the color, width, and other parameters of the shape). We use the Adam optimizer with β1 = 0.9, β2 = 0.9, ε = 10^-6. We use a learning-rate warm-up from 0.1 to 0.8 over 100 iterations and an exponential decay from 0.8 to 0.4 over the remaining 400 iterations, 500 iterations in total. The optimization process requires at least 10GB of memory and approximately 5 minutes to produce a single letter illustration on an RTX2080 GPU.

Before we feed the rasterized 600×600 letter image into the Stable Diffusion model, we apply random augmentations as proposed in CLIPDraw [Frans et al. 2021]: specifically, a perspective transform with a distortion scale of 0.5, applied with probability 0.7, and a random 512×512 crop. We add the suffix "a [word]. minimal flat 2d vector. lineal color. trending on artstation." to the target word W, before feeding it into the text encoder of a pretrained CLIP model.

Fig. 18. Some additional examples of word-as-image applied to Chinese characters. In Chinese, a whole word can be represented by one character. Here we show, from left: bird, rabbit, cat, and surfing (the two last characters together). The complexity of the characters imposes an additional challenge for our method. This could be alleviated in the future, for example, by dividing the characters into radicals and applying the method only on parts of the character.
B COMPARISONS

As described in Section 5.2, we define five baselines to compare with. In this section we provide more details about the evaluation and more qualitative results. For (1) SD, we run Stable Diffusion [Rombach et al. 2021] with the default hyper-parameters of 50 inference steps and a guidance scale of 7.5. We use the prompt "Word as image of the word [word]. [font] font. minimal flat 2d vector. lineal color. black and white style".

For (2) SDEdit [Meng et al. 2022], we utilized the diffusers [von Platen et al. 2022] implementation, using the prompt "A [word]. minimal flat 2d vector. lineal color. black and white style", and the rasterized input letter as the reference image. We use the default values of 50 inference steps and a guidance scale of 7.5. We use a strength value of 0.85. The strength value determines the quantity of noise added to the input image – a value close to 1.0 results in a higher degree of variation in the output, and vice versa.

We use the official website of OpenAI to run (3) DallE2 [Ramesh et al. 2022], using the prompt "Word as image of the word [word]. Where the letter [letter] looks like a [word]. [font] font. minimal flat 2d vector. lineal color. black and white style". To encourage the manipulation of a specific letter, for (4) DallE2+letter we use the prompt "The letter [letter] in the shape of a [word]. [font] font. minimal flat 2d vector. lineal color. black and white style". For (5) CLIPDraw [Frans et al. 2021], we use the authors' official implementation with the recommended hyper-parameters. Instead of using randomly initialized strokes, we use our vectorized letter as input, along with the prompt "A [word]. [font] font. minimal flat 2d vector. lineal color. black and white style". We provide more comparisons to the methods described above in Figure 20.

Fig. 20. Comparison to alternative methods based on large-scale text-to-image models. On the left are the letters used as input (only for SDEdit, CLIPDraw, and ours), as well as the desired object of interest. The results from left to right were obtained using Stable Diffusion [Rombach et al. 2021], SDEdit [Meng et al. 2022], DallE2 [Ramesh et al. 2022], DallE2 with a letter-specific prompt, CLIPDraw [Frans et al. 2021], and our single-letter results.

C PERCEPTUAL STUDY

In this section, we provide more details about the perceptual study described in Section 5.1. The randomly chosen objects, fonts, and letters are shown in Table 2. A few visual examples are shown in Figure 19.

Fig. 19. Examples of illustrations presented in the perceptual study. Each pair in the top part shows illustrations obtained using our proposed method (left) and using only the SDS loss (right). On the bottom is an example of an illustration presented for the font recognition questions.

D ADDITIONAL RESULTS

We provide additional results of our generated word-as-images. In Figures 21-32 we show results of selected words and unique fonts. In Figures 33-48 we show the results obtained for the random set of words.
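As an illustration of the SDEdit baseline described in Section B above, the following hedged sketch uses the diffusers img2img pipeline; the model id and `letter_image` (a PIL image of the rasterized input letter) are assumptions.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Sketch of the SDEdit baseline (Section B); not the exact evaluation script.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

result = pipe(
    prompt="A bird. minimal flat 2d vector. lineal color. black and white style",
    image=letter_image,            # rasterized input letter used as the reference image
    strength=0.85,                 # amount of noise added to the reference
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
```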
Table 2. Randomly chosen objects, letters, and fonts for the perceptual
study.
Fig. 33. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 34. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 35. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 36. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 37. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 38. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 39. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 40. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 41. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 42. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 43. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 44. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 45. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 46. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 47. Word-as-image illustrations created by our method for randomly chosen words.
Fig. 48. Word-as-image illustrations created by our method for randomly chosen words.