
1 Introduction

To capture a thought or explain something visually, we often draw a quick doodle. Sketching by hand is still one of the most common ways to support thought processes [6] or to create new design ideas [18]. Therefore, it is an attractive idea to support sketching processes with technology.

A well-known research project for collaborative sketching is Sketch-RNN [7], where a Recurrent Neural Network is trained on a large dataset (Quickdraw) of small hand-drawn sketch samples. Given the image category (“dog”, “car”, etc.), the model is able to complete a sketch started by a user. Collabdraw [5] builds upon Sketch-RNN and explores a turn-taking way to interact with the Quickdraw dataset.

Creativity Support Tools [17] are technologies that aim to enhance a user’s creativity or support ideation processes. Sketch-based Creativity Support Tools have been explored by Davis et al. [3]. Here, a Q-Learning agent is continuously trained on the user’s pen strokes to learn their preferred drawing style. The user can also give feedback on suggestions via the drawing interface to improve upcoming suggestions.

“Cobbie” [10], a drawing robot, is another interesting example of a sketch-based Creativity Support Tool: robot and user draw together on a sheet of paper. Participants in a user study were asked to design new products. Results show that Cobbie’s drawings are often utilized as inspiration for shapes in the final product design.

This research aims to explore sketch-based Creativity Support Tools for abstract pattern drawing. A pattern is defined as a (repeated) decorative element used in design.

However, it is often difficult for designers to gather a large dataset of their own drawings to train a generative model, such as the Quickdraw dataset used for Collabdraw. The step-by-step training of a learning agent also means a large amount of additional work for a designer. Therefore, we focus on a One-Shot approach where the designer only needs to provide one or a few examples to train a generative model.

We also aim to create support tools that produce path-based rather than pixel-based data. Path-based data is important in domains like digital fabrication [21] or robotics, such as the above-mentioned drawing robot Cobbie: machines like laser cutters or pen plotters need the path information to move along these lines.

2 Model Architecture: Transformer

To train a generative model on pen strokes, we first need a neural representation for stroke paths. A sketch consists of a sequence of pen strokes, and each stroke can be described by small pen movements along straight lines. This sequence of sequences can easily be flattened into one large sequence, so learning a neural representation for sketches can be formulated as a sequence generation task. Transformer Neural Networks [19] are a novel approach to processing sequential data in machine learning. They are mainly used in Natural Language Processing for translation tasks [4] or for text generation [12]. But Transformers have also been used for sketch image recognition [23, 24], and recently Carlier et al. showed that Transformers outperform Recurrent Neural Networks for Vector Graphics [1].

When Transformers are used for sequence-to-sequence tasks, an encoder-decoder architecture is used. As we only want to generate a sequence in our setting, we only need the encoder module. At first, this seems like an unusual choice, as typically the decoder would be used for generation. But in the case of a Transformer, encoder and decoder are very similarly structured; the main difference is that the decoder receives additional input from the encoder in a translation setting. Since we cannot provide this input (we only generate and do not translate), we are left with the encoder architecture: the decoder without these additional input connections equals the encoder.

A sequence of straight pen moves will be used as input to our Transformer. However, these pen moves cannot be used as input directly and need to be converted into a vector representation first. Natural Language Processing faces a very similar problem with word sequences as neural net input; here, word embeddings [9] are used to encode words into vectors. Pen moves and the embedding process will be explained in detail in the next section.

The Transformer input also receives a mask that prevents the neural net from seeing “future” sequence elements. In addition to the embedded pen move sequence, a positional encoding is needed to provide relative positional information to the Transformer network, as it does not contain any recurrence or convolution. In our research, we use the standard positional encoding defined in [19], where the sine function is used for even dimensions and the cosine function for odd dimensions of the encoding vector:

$$\begin{aligned} PE_{(pos, 2i)}&= \sin (\frac{pos}{10000^\frac{2i}{N}})\end{aligned}$$
(1)
$$\begin{aligned} PE_{(pos, 2i+1)}&= \cos (\frac{pos}{10000^\frac{2i}{N}}) \end{aligned}$$
(2)

where pos is the position in the input sequence, i the dimension (the position in the positional encoding vector) and N equals the embedding dimension, so embedding and positional encoding can be summed. The constant 10000 was chosen in [19] so that the sinusoid wavelengths form a geometric progression from \(2\pi \) to \(10000 \cdot 2 \pi \).
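The following minimal sketch transcribes Eq. (1) and (2) into code (PyTorch is our illustrative framework choice here; the function name is not part of the original implementation):

```python
import torch

def positional_encoding(seq_len: int, n_dim: int) -> torch.Tensor:
    """Sinusoidal positional encoding following Eq. (1) and (2).

    Returns a (seq_len, n_dim) tensor that is summed with the embedded
    pen move sequence. Even dimensions use sine, odd dimensions cosine.
    Assumes an even embedding dimension (52 in our experiments).
    """
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    two_i = torch.arange(0, n_dim, 2, dtype=torch.float32)         # 2i = 0, 2, 4, ...
    div = torch.pow(torch.tensor(10000.0), two_i / n_dim)          # 10000^(2i/N)
    pe = torch.zeros(seq_len, n_dim)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe
```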

Transformer encoder layers can be stacked on top of each other any number of times. They consist of a masked Multi-Head Attention layer with normalization and a feed-forward network with normalization.

The Multi-Head Attention layer is the most important part: it consists of multiple parallel Self-Attention layers, whose output is concatenated and finalized with a linear layer. The Self-Attention layers give the neural net the ability to focus on or ignore certain elements in the sequence. Running multiple Self-Attention layers in parallel reduces the impact of bad random initialization. Recent research shows that a majority of the heads can be pruned without a significant drop in performance [20]. The Attention mechanism for each head is defined as:

$$\begin{aligned} X \cdot W_Q&= Q \end{aligned}$$
(3)
$$\begin{aligned} X \cdot W_K&= K\end{aligned}$$
(4)
$$\begin{aligned} X \cdot W_V&= V\end{aligned}$$
(5)
$$\begin{aligned} Attention(Q,K,V)&= softmax \Bigl (\frac{QK^T}{\sqrt{d_k}} \Bigr )V \in \mathbb {R}^{S \times d_v} \end{aligned}$$
(6)

Q (Query), K (Key) and V (Value) with \(Q,K \in \mathbb {R}^{S \times d_k}\) and \(V \in \mathbb {R}^{S \times d_v}\) are calculated by multiplying the embedding matrix \(X \in \mathbb {R}^{S \times N}\), which contains the embedding vectors of all input pen moves, with the trained weight matrices \(W_Q,W_K \in \mathbb {R}^{N \times d_k}\) and \(W_V \in \mathbb {R}^{N \times d_v}\), where S is the input sequence length and \(d_k\) and \(d_v\) are the attention projection dimensions.

From Q and K, a score for each pair of sequence elements is calculated and scaled by the square root of the dimension \(d_k\). With softmax, these scores are normalized so they sum up to 1. Finally, these scores are multiplied with the values V, which results in weighted values.
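A direct transcription of Eq. (3)–(6) for a single head is sketched below (illustrative only; in practice the heads run in parallel and their outputs are concatenated and passed through a linear layer):

```python
import torch

def attention_head(X, W_Q, W_K, W_V, mask=None):
    """Single attention head following Eq. (3)-(6).

    X: (S, N) embedded input sequence; W_Q, W_K: (N, d_k); W_V: (N, d_v).
    `mask` is an optional (S, S) matrix of 0 / -inf entries that hides
    "future" sequence elements before the softmax.
    """
    Q = X @ W_Q                                             # (S, d_k)
    K = X @ W_K                                             # (S, d_k)
    V = X @ W_V                                             # (S, d_v)
    scores = Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5    # (S, S)
    if mask is not None:
        scores = scores + mask
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ V                                      # weighted values, (S, d_v)
```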

After the Attention layer, the data is processed in a feed-forward neural net, and after passing all stacked encoder layers it reaches a final linear and softmax layer. These scale the output to the embedding size and yield a vector of probabilities over the embedded pen moves. From these probabilities, a single pen move is sampled via top-k sampling. Figure 1 gives a visual overview of the Transformer encoder we used; input and output vectors are visualized in Fig. 3b.
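The following sketch assembles this pipeline from PyTorch’s built-in encoder modules and reuses the positional_encoding helper above. The class name, the use of nn.TransformerEncoder instead of hand-written layers, and the defaults are our illustrative choices, not a prescription of the original implementation; softmax and top-k sampling are applied to the returned logits at generation time.

```python
import torch
import torch.nn as nn

def causal_mask(seq_len: int) -> torch.Tensor:
    """(S, S) mask with -inf above the diagonal, hiding "future" pen moves."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

class PenMoveTransformer(nn.Module):
    """Encoder-only Transformer over an embedded pen move vocabulary:
    embedding -> positional encoding -> stacked encoder layers -> linear
    projection back to one logit per embedded pen move."""

    def __init__(self, vocab_size, n_dim=52, n_heads=4, n_layers=3, ff_size=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_dim)
        layer = nn.TransformerEncoderLayer(d_model=n_dim, nhead=n_heads,
                                           dim_feedforward=ff_size)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_logits = nn.Linear(n_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (seq_len, batch) pen move indices
        x = self.embed(tokens)                               # (seq_len, batch, n_dim)
        x = x + positional_encoding(tokens.size(0), x.size(-1)).unsqueeze(1)
        h = self.encoder(x, mask=causal_mask(tokens.size(0)))
        return self.to_logits(h)         # logits over the pen move vocabulary
```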

Fig. 1.
figure 1

Transformer Encoder: input pen moves are embedded and thus transformed into a vector representation. They also receive a relative positional encoding and are then processed through multiple parallel Self-Attention layers and a feed-forward net. Encoder layers can be stacked multiple times. The output is then transformed back to the embedding size with a linear layer. Finally, a softmax layer is used to obtain the probabilities from which a pen move is sampled with top-k sampling.

In this research, we use the Adam optimizer [8] with the learning rate schedule described in [19], where the learning rate is linearly increased during the first warm-up steps and decreased thereafter.
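A sketch of this schedule is given below; the warm-up step count of 4000 and the Adam hyperparameters are the defaults from [19] and only illustrative here, as our exact settings are not part of this description.

```python
import torch

def noam_lr(step: int, n_dim: int = 52, warmup: int = 4000) -> float:
    """Learning rate factor from [19]: linear warm-up, then decay with
    the inverse square root of the step number."""
    step = max(step, 1)
    return n_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# scheduler.step()   # called once per training step
```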

3 Data Set Generation

In One- or Few-Shot learning with natural images [16], the original pixel image is transformed into multiple altered, smaller images called “patches”, which then form the training dataset. These patches are often produced by scaling, rotating, shearing or other manipulations.

Fig. 2.

Path processing from stroke to pen moves: first, a hand-drawn stroke is recorded by adding a new point to the path on every mouse move event. This point-cluttered line is then simplified by substituting points with curves. These curves are in turn flattened with a certain allowed error, and overly long sections are split into multiple smaller path segments.

To train our sketch-based generative model, we use the method proposed in our latest work [22]: The whole stroke-based image is manipulated by mirroring, rotating, scaling and translating. Because each path consists of a list of points, a path has an implicit direction. In our setting, the path direction is not important, so it can be reversed to generate new patches. After all paths have been manipulated, they are rearranged in a new order. We sort them in a greedy way by distance, so that the pen travel is as short as possible. This step is important to ensure that the next stroke will be generated close to the previous stroke. If the sorting step is not performed, strokes in the generated images appear very scattered and incoherent.
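A minimal sketch of this greedy ordering is given below (it ignores the optional path reversal and assumes each path is a list of (x, y) points):

```python
import math

def greedy_sort(paths, start=(0.0, 0.0)):
    """Order paths so that the pen travel between strokes stays short.

    Starting from `start`, repeatedly pick the remaining path whose first
    point is closest to the end point of the previously chosen path.
    """
    remaining = list(paths)
    ordered = []
    pen = start
    while remaining:
        nearest = min(remaining, key=lambda path: math.dist(pen, path[0]))
        remaining.remove(nearest)
        ordered.append(nearest)
        pen = nearest[-1]        # the pen rests at the end of the chosen path
    return ordered
```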

In our experiments, we use hand-drawn images in a 180 \(\times \) 180 unit bounding box. The images are drawn directly on a computer screen using a digitizer pen. In the drawing process, the current pen location is recorded on every frame update. This results in a very point-cluttered path, as can be seen in the example in Fig. 2. The path is then simplified by fitting Cubic Bézier Curves [13] through the points with an allowed maximum error. The algorithm used is described in [14] and the result can be seen in the second image in Fig. 2. In the next step, the simplified curves are flattened: the curves are approximated by straight lines with a given maximum error. If a line is too long, it is divided into multiple shorter lines. These resulting short straight lines form the actions our Transformer is trained on and will further be called “pen moves”. A visual representation of the stroke curve flattening and pen move generation can be seen in the last two images in Fig. 2.
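The last step, splitting overly long flattened lines, can be sketched as an even subdivision (the exact splitting rule may differ in practice; the default of 15 units is the maximum pen move length we chose, see below):

```python
import math

def split_line(p0, p1, max_len=15.0):
    """Split a straight segment into relative pen moves no longer than max_len.

    p0, p1: (x, y) end points of a flattened curve segment.
    Returns a list of relative (dx, dy) moves that together reach p1.
    """
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    n = max(1, math.ceil(math.hypot(dx, dy) / max_len))
    return [(dx / n, dy / n)] * n

# Example: a 40 unit horizontal line becomes three moves of ~13.3 units each.
# split_line((0, 0), (40, 0)) -> [(13.33, 0.0), (13.33, 0.0), (13.33, 0.0)]
```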

A pen move is defined with the following attributes:

  • Position: relative x and y coordinates to the last point.

  • Pen State: binary state if the virtual pen is drawing or only moving.

The pen state attribute makes it possible to move the virtual pen without drawing. These invisible moves are needed to end a stroke and move the pen to the beginning of the following stroke. With this pen move definition alone, the Transformer cannot indicate that the image is finished. Using an embedding as seen in Fig. 3 allows us to add special event moves like an “imageEnd” token. A path-end token is not needed, as the end of a path can be read from the pen state. Embeddings work like a lookup table where a list of predefined pen moves is mapped to vector representations, which are trained along with the Transformer. Figure 3a shows a short example of the embedding process. Because only valid pen moves are embedded, the Transformer can also only predict valid pen moves:

If the pen move attributes were used directly as Transformer input, for example in the form of (x, y, penState, imageEnd), it would allow for invalid pen moves where the pen state is 1 (line will be drawn) but the imageEnd flag is also activated.

To make the embedded pen moves cross-compatible with every input image, the maximum pen travel length for each move needs to be small and should be set to a fixed number, so that every stroke can be well approximated by one set of predefined pen moves. If all generative models share the same pen move set, they can interact with each other. If models used different pen move sets, certain pen moves might not be contained in the other model’s embedding. In this research, we chose a maximum pen move length of 15 units.
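To illustrate the idea of a fixed, shared pen move set, the following sketch builds a vocabulary of quantized moves; the grid step of 5 units and the exact quantization scheme are hypothetical, only the principle of a predefined lookup table with an added “imageEnd” token matters here.

```python
import math

def build_pen_move_vocabulary(max_len=15, step=5):
    """Fixed set of pen moves shared by all models, used as embedding lookup.

    Each entry combines a relative offset (dx, dy) within the maximum travel
    length with a pen state (0 = move only, 1 = draw); a special "imageEnd"
    token is appended. Returns the list (index -> move) and a reverse map.
    """
    vocab = []
    for dx in range(-max_len, max_len + 1, step):
        for dy in range(-max_len, max_len + 1, step):
            if (dx, dy) == (0, 0) or math.hypot(dx, dy) > max_len:
                continue
            for pen_state in (0, 1):
                vocab.append({"dx": dx, "dy": dy, "penState": pen_state})
    vocab.append({"event": "imageEnd"})
    index = {}
    for i, move in enumerate(vocab):
        index[tuple(sorted(move.items()))] = i    # move -> embedding index
    return vocab, index
```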

Fig. 3.

a) Pen moves (their object representations) are embedded. The embedding acts as a lookup table, where a list of pen moves is mapped to a list of trained vector representations. So with a known index, either the pen move object or the vector representation can be retrieved. b) The input vector to the Transformer contains a sequence of pen moves represented by their embedding indices. The vector representations are then looked up by the Transformer. The output vector has the length of the embedding size and contains a prediction probability for each embedded pen move. For greedy sampling, one searches for the highest probability value and uses its position in the output vector to look up the corresponding embedded pen move.

Fig. 4.

“Boxes” and “Spirals” dataset recorded for our experiments. Each path was assigned a random color for better distinction. (Color figure online)

4 Generative Sketches

For our experiments in co-creative drawing, we recorded two template images, shown in Fig. 4: the first image, “Boxes”, consists of rectangles that are drawn inside each other. They all have the same orientation and their lines do not cross. The second image, “Spirals”, consists of multiple spiral shapes that turn in different directions. It also contains one small circle that fills some open space. The images differ in the number of paths (11 and 6) and in path lengths.

From each image, we created a new patch set of size 500 for each epoch to prevent overfitting (the benefits of rotating patch sets have been evaluated by Wieluch and Schwenker [22]). We used the following settings for our Transformer:

  • Epochs: 200

  • Batch Size: 200

  • Sequence Length: max. template image sequence length

  • Embedding Dimension: 52

  • Encoder Layers: 3

  • Attention Heads: 4

  • Feed Forward Size: 2048

The sequence length was chosen to give the neural net the possibility to learn the whole image as context; if a shorter sequence length was chosen, the neural net often did not produce image-end tokens in our experiments. We also do not train the Transformer image by image, but instead use all 500 patch images as one long sequence and move a sliding window along this sequence. With this technique, the Transformer learns to generate a new image after another image is finished, so the input vector may contain the end of patch one and the beginning of patch two, as can be seen in Fig. 3b.
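A sketch of this sliding window setup over the concatenated patch sequences is shown below (pen moves are assumed to be already converted to embedding indices; the shifted target sequence reflects the standard next-move prediction objective, which we assume here):

```python
import torch

def sliding_windows(patch_sequences, seq_len):
    """Yield (input, target) windows from all patches joined into one stream.

    patch_sequences: list of index lists, one per patch, each ending with
    the imageEnd token. A window may span the end of one patch and the
    beginning of the next, so the model learns to start a new image.
    """
    stream = torch.tensor([idx for patch in patch_sequences for idx in patch],
                          dtype=torch.long)
    for start in range(len(stream) - seq_len):
        x = stream[start:start + seq_len]            # input pen moves
        y = stream[start + 1:start + seq_len + 1]    # moves shifted by one (targets)
        yield x, y
```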

Fig. 5.

Generated images from the “Spirals” (left) and “Boxes” (right) dataset.

Figure 5 shows two generated images from our trained models.

It is clearly visible that the generated paths capture the intention (spirals, boxes, circles) of the template image. Most shapes appear close together, the box paths are nested inside each other, and the boxes share the same orientation. This reflects the template image very well.

However, it can be seen that some paths cross each other, which should not happen as no path crosses another in the template image.

5 Co-creative Drawing

In this section, we present three co-creative drawing tools to interact with our generative models. Recordings of each interaction technique can be seen in the additional material.

5.1 Autocompletion

In this scenario, the user starts by drawing a line and the model completes the given path. The results of four such iterations can be seen in Fig. 6: the red stroke is drawn by the user and the black part is the completed path drawn by the model. On the left side the “Spirals” model is used and on the right side the “Boxes” model.

Both models complete the stroke very well. But it is also visible that unknown user input “confuses” the models, and shapes emerge that are not intended by the template image. This is best observed in the last “Spirals” image, where the user starts to draw a small circle but the model draws a rather odd loop. The boxes on the right side also tend to have unusual interlinked ends.

We expect that these unusual generated shapes will be used in creative drawing processes, as unexpected outcomes support creative thinking [11].

Fig. 6.

Iterations of user-drawn lines (red), finished by a generative model (black) in one image. The left side utilizes the “Spirals” model, the right side the “Boxes” model. Both models work well on continuing lines, though some unexpected shapes appear. (Color figure online)

5.2 Generative Stamps

In this experiment, the user suggests a point where the model should place a new stroke (in our setting by clicking via mouse). If the user does not like the created shape, the stamping process can be undone and a new shape can be generated at the same position.

This co-creative interaction loop can be used to quickly produce large pattern-filled areas by generatively “stamping” new shapes. This way, the user is in control of where the next stroke should be placed but not of what exactly is drawn. The drawing process is very quick, because the user does not need to draw and specify the beginnings of lines. Accordingly, the resulting images will look more coherent with the template image, because the user does not draw lines and the model is not “confused” by user-drawn lines. But this also results in less novelty in the generated strokes.

On the technical side, this interaction loop can be created by calculating pen moves from the last drawn stroke point to the suggested new stroke position and adding these to the neural net input sequence.
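A sketch of this step is given below, reusing the split_line helper from Sect. 3; the `quantize` callback that would snap a raw move onto the shared vocabulary is hypothetical and omitted by default.

```python
def travel_moves(last_point, target, max_len=15.0, quantize=None):
    """Non-drawing pen moves from the end of the last stroke to a clicked point.

    These moves (pen state 0) are appended to the model's input sequence so
    that the next generated stroke starts near `target`.
    """
    moves = []
    for dx, dy in split_line(last_point, target, max_len):
        move = {"dx": dx, "dy": dy, "penState": 0}     # pen up: travel only
        moves.append(quantize(move) if quantize else move)
    return moves
```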

An example of this procedure is shown in Fig. 7: The red line is already placed in the image. The red circles indicate possible positions and the corresponding black lines show the strokes that would be drawn at these suggested positions.

Fig. 7.

Generative Stamping: The user can choose a position (red circle) where the model should draw a new stroke (corresponding black line), considering the already drawn lines (red line). (Color figure online)

5.3 Suggestions

In our last experiment, we let the user draw a line and the model suggests multiple possibilities for how to continue this line. However, inference from a neural net is a deterministic process: the model gives exactly one result for a certain input. Sampling mechanisms like top-k can be used to obtain different outcomes: only the top k predictions are kept and their probabilities renormalized; according to these new probabilities, a prediction is sampled.
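A minimal top-k sampling sketch over the model’s output logits (with k=1 this reduces to greedy sampling; k=50 is the value we use for the suggestion tool in the user study below):

```python
import torch

def sample_top_k(logits, k=50):
    """Sample one pen move index from the k most probable predictions.

    logits: (vocab_size,) output of the final linear layer. The top-k
    probabilities are renormalized before sampling.
    """
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, k)
    top_probs = top_probs / top_probs.sum()          # renormalize to sum to 1
    choice = torch.multinomial(top_probs, num_samples=1)
    return top_idx[choice].item()
```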

Figure 8 shows examples of such a generative process: the currently drawn line is shown in red. The black lines are the possible continuations suggested by the model and can be accepted by clicking on the suggestion of choice.

Depending on the stroke, the number of distinguishable predictions changes. The length of the suggestions might also vary, as the model is able to predict a path end before the maximum suggestion length is reached.

This co-creative interaction loop gives the user more freedom than the generative stamping experiment, but also more control over the actual outcome. In this scenario, novel or unexpected shapes that are not directly part of the template input image will also occur.

Fig. 8.

Model-suggestion supported drawing with the “Spirals” and “Boxes” dataset: the user (red) starts to draw a line and receives suggestions by the model on how the stroke could be continued (black). (Color figure online)

6 User Study

To evaluate the three described co-creative drawing methods, we conducted a user study with 8 participants. The study was implemented as a website, so the majority of participants (five) took part remotely. Three participants conducted the study on site and were asked to verbalize their thoughts while drawing. The participants had differing levels of drawing practice, ranging from “daily” and “once a week” to “once a month” and “never”. Different input devices were also used: one half used a mouse, whereas the other half used a digitizer pen.

The study consisted of two drawing tasks and surveys in between. In the first task, the participants were asked to draw any pattern of their liking. They were allowed to use any of the three drawing tools and could also choose to draw without any AI support. Each drawing tool could be used with two template patterns: the “Spirals” and “Boxes” template images introduced in Fig. 4. No time limit was given, the drawing canvas could be cleared, and drawing steps could be undone to create an experimentation-friendly environment. When the participant was satisfied, the result was saved and the survey started.

The second task was similar, with the exception that the participant was asked to choose one of the two template images and draw a pattern in the same style. This task was introduced to evaluate whether certain co-creative drawing tools are especially useful for creating novel but style-preserving images from a template. The surveys consisted of five questions from the Creativity Support Index (CSI) [2] to evaluate Exploration, Collaboration, Engagement, Effort and Expressiveness of all three tools in both drawing tasks. Additionally, the participants were asked to describe situations in which they did and did not like to use a certain tool.

The co-creative drawing tools were configured as follows:

  • Autocomplete: the tool completes a user-drawn line in the context of the next closest line, if available (the pen moves from the next closest line and the new pen moves from the user stroke are used as the input sequence). The moves in the path are sampled in a greedy manner (\(k=1\) for top-k sampling).

  • Stamp: the tool creates a new line at a given point in the context of the next closest line, if available. The pen moves are sampled with \(k=2\) top-k sampling to give the user slightly different results for the same drawing spot.

  • Suggestion: the tool creates suggestions utilizing top-k sampling with \(k=50\) for a broad variety of suggestions. The suggested sequence has a maximum length of 10 pen moves.

6.1 Quantitative Results

The survey results are visualized as box-plots in Fig. 9: “Autocomplete” shows the highest mean score in all five CSI attributes, though we could not find a significant difference between tools in a paired test.

Pen users showed a significantly better assessment than mouse users in the collaboration aspect of the “autocomplete” drawing tool. This might be due to the fact that it is hard to draw precise lines with a mouse, especially curves. Those mouse-drawn lines might be more likely to “confuse” the trained model, and the model might then be more likely to suggest ending the user-drawn line rather than continuing it. Of course, this behaviour is perceived as uncollaborative.

We also found a significant difference between frequent drawers (once a week or more often) and non-frequent drawers in the “stamp” tool: frequent drawers rated the “stamp” tool significantly better than non-frequent drawers.

We could not find any significant differences in the tool usage perception between the free and the style preserving drawing task.

Fig. 9.

Creativity Support Index survey results for all three generative drawing tools.

Fig. 10.

Example images drawn by participants for the free drawing and the style preserving drawing task: Autocompletion (blue), Stamp (orange), Suggestion (green). (Color figure online)

6.2 Qualitative Results

In the following, we will summarize results from observations or written assessments from the questionnaires:

  • Autocomplete:

    • positive: quick, gives better suggestions than the suggestion tool. Used for: exploring shapes, idea generation, filling in details, being lazy and letting the algorithm finish, fine and detailed lines.

    • negative: shapes overlap, it added unnecessary edges (Boxes template), tool has problems with unspecific shapes.

  • Stamp:

    • positive: fun to use. Used for: filling in blank areas, loosen up the image, quick repetition of small shapes.

    • negative: only made rather small shapes, hard to control.

  • Suggestions:

    • positive: easier to use with mouse. Used for: style imitation while maintaining control, experimentation and idea generation.

    • negative: suggestions sometimes feel useless or too similar, suggested path segments too short, similar suggestions are hard to distinguish.

From observing the three on-site participants, we also found that unexpected results from the model were often used as inspiration and thus included in the drawing rather than deleted. Users also quickly recognized that the model is trained on shapes of a certain size and that a shape started larger than the training data will not be completed adequately. It was also recognized that the tools depend on previously drawn lines. This was especially clear when using the “Boxes” template: newly created boxes aligned in their orientation with previously drawn boxes. Example drawings for the free and style-preserving drawing tasks can be seen in Fig. 10.

In summary, “Autocomplete” and “Suggestions” are used in similar situations: for experimentation and idea generation. In the style-preserving task, both tools were also rated similarly useful. Negative aspects of both drawing tools could be reduced by combining them into one: the combined tool could autocomplete a user-drawn line multiple times to give more than one solution. This way, the user still has control over the look of the shape but does not need to click multiple times to complete the line.

The “Stamping” tool was rated very controversially, ranging from “did not like to use it at all” to “it was fun and useful for filling blank spots”. The rating differed significantly between frequent (positive rating) and non-frequent drawers (negative rating). The cause of these drastically differing perceptions is not clear and needs further investigation in future research.

7 Conclusion

In this research, we evaluated three co-creative drawing tools where a user collaboratively creates sketches with a generative model. The model is trained on only one template image (One-Shot).

The three tools (line autocompletion, stamping and line continuation suggestions) were evaluated in a user study utilizing the Creativity Support Index to assess exploration, collaboration, engagement, effort and expressiveness. Results showed that “autocompletion” and “suggestions” were well received and supported creative sketching and ideation. Both tools could be combined into one for future use, where multiple suggestions for line autocompletion are given; this would eliminate most negative aspects of both tools.

The “stamping” tool was rated controversially, and we found significant differences between frequent (higher ratings) and non-frequent (lower ratings) drawers. The tool was most often used for filling in blank spaces or loosening up an image.

As the model’s context view is limited by the sequence length, it would be very interesting to use hierarchical approaches in our future work, as they have been successfully used in other domains like dialogue generation [15] to provide a broader structure to the generative process. We also would like to experiment with a forced widening of the drawing area, so that the model will fill a large canvas with shapes of the input dataset and essentially create a large image from a small sample. This could be a very useful application in several design domains.