AI Writer: an Exploration of Pre-Attention Generative Capabilities
To what extent can an LSTM mock one's writing style?
Objective
Writing has been a hobby of mine for more than ten years now, covering everything from long novels to tiny tales, as well as tutorials, essays, and formal reports. However, I have always wondered what it would feel like to be a reader of my own work, which is why I wanted to create a model that could write something new the way I would. Technically speaking, this meant training a model to capture the subtle patterns of my writing style well enough to generalize them.
This project initially aimed at creating an LSTM that could write like I do. It later evolved into comparing the performance of different LSTMs, seeking to understand how various structures perform and whether there is a cap on how good their outputs can be. You can find the code I used in this Colab notebook.
The Dataset
For this project, I gathered 10 years' worth of writing data from my notes and blogs, all downloaded as HTML files (52 in total). Among the files were opinion pieces, poems, tales, tutorials, novels, and simple notes.
Data treatment
Before diving into the LSTM itself, I worked to clean the data and ensure it was appropriate for processing. This meant removing formatting marks, converting the text to lowercase, removing punctuation, and breaking paragraphs down into sentences.
(You can find the code for these steps in the Colab notebook.)
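As a rough illustration, the cleaning pipeline can look like the sketch below. It assumes BeautifulSoup for stripping the HTML and a simple regex-based sentence split; the notebook's actual implementation may differ in the details.
import re
from bs4 import BeautifulSoup

def clean_html_file(path):
    # Strips HTML tags, lowercases, splits into sentences, and removes punctuation
    with open(path, encoding='utf-8') as f:
        text = BeautifulSoup(f.read(), 'html.parser').get_text(separator=' ')
    text = text.lower()
    # Split into sentences first so the boundaries survive punctuation removal
    sentences = re.split(r'(?<=[.!?])\s+', text)
    sentences = [re.sub(r'[^\w\s]', '', s).strip() for s in sentences]
    return [s for s in sentences if s]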
Not only did this result in higher-quality data, but it also reduced the space of possibilities the model had to learn. The final dataset looked like the following:
# As I'm Brazilian, the content is in Portuguese :)
Category Content
0 tale seus olhos navegavam no espaço en...
1 tale entre risos perguntando “do que se trata...
2 tale a luz do fim de tarde, perdia-se entr...
3 tale teria passado ali dias horas ou minutos...
4 tale três águas de coco e duas noite depoi...
... ... ...
4933 poem porém fácil mesmo é morrer
4934 poem assim como uma semente plantada no inverno
4935 poem assim como um anjo nascido no inferno
4936 poem assim como o amor que não se consegue viver
4937 poem talvez morrerei sem ter a chance da verdade co...
In quantitative terms, the dataset contained 57997 words, of which 8860 were unique. There were also imbalances among the categories of writing pieces: the dataset had roughly 2.5x more novel entries than poems or tales, which in turn were about 3x more frequent than notes and tutorials. Such an imbalance can skew the model's behavior, which we need to be aware of, and it also influences how we evaluate performance, since metrics such as accuracy can become biased and uninformative.
label_counts = updated_df['Category'].value_counts()
print(label_counts)
'''
Output:
novel 2281
poem 953
tale 915
notes 335
tutorial 308
opinion 64
Name: Category, dtype: int64
'''
Model selection
To accomplish the task of text generation, I chose to build a Long Short-Term Memory (LSTM) network. LSTMs are a type of RNN (Recurrent Neural Network) designed to capture long-term dependencies in sequential data.
The way LSTMs work can be illustrated with the analogy of reading a book and trying to understand the plot: as we read the pages, we continuously update our understanding based on the current sentence and what we’ve read previously. An LSTM does a similar process but uses numerical data instead of words. As in any neural network, each layer takes in some input, applies a set of weights, and produces an output. However, in an RNN (and consequently in an LSTM), there’s a hidden state that’s passed along from one step to the next. This hidden state acts like a memory, allowing the network to consider past information while processing current input. The difference between RNNs and LSTMs is that the latter is better at handling memory, suffering less from vanishing gradients when the input becomes large.
Each time a new input is given to the model, the following process happens:
01) Deciding how much of the long-term memory to forget
The first part of an LSTM (named the Forget Gate) determines how much of the long-term memory should be remembered for the current inference. To do so, it uses the short-term memory itself (h_{t-1}) and the current input (x_t), returning a percentage (f_t) that will later be factored into the long-term memory.
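In the standard LSTM formulation, with σ denoting the sigmoid function and W_f, b_f the Forget Gate's weights and bias, this is:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)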
02) Deciding what to add to the long-term memory
Next, the LSTM combines the short-term memory with the given input to create a potential long-term memory:
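Using a tanh activation (with this gate's own weights W_C and bias b_C), the candidate memory is:
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)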
Then, it determines what percentage of this potential memory should actually be incorporated into the long-term memory. This entire process happens in what is called the Input Gate.
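As in the Forget Gate, this percentage is a sigmoid over the short-term memory and the input:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)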
Following these steps, we update the long-term memory, C_t, based on the previous memory (and the amount of it we decided to forget) and the candidate new memory (along with the amount we decided to remember):
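In equation form, with ⊙ denoting element-wise multiplication:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t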
03) Deciding what to output
Last, we output a value by first combining the short-term memory and the input, which gives us a candidate output o_t:
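This is again a sigmoid, this time with the Output Gate's weights and bias:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)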
And then we factor in our long-term memory, thus obtaining the final output. Given that this output will be the short-term memory for the next input, we call it h_t:
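The long-term memory is squashed by a tanh and scaled by the candidate output:
h_t = o_t ⊙ tanh(C_t)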
(Note: in an LSTM, everything we have just described happens inside a single cell, i.e., a single neuron.)
The weights and biases are randomly initialized and updated through backpropagation.
Data preparation for training
To create a model that can write like me, I made the simplifying assumption that the writing's category (poem, tale…) doesn’t matter, which means all of the data can be grouped together. We thus start by converting all of the text into a single string.
raw_text = updated_df['Content'].str.cat(sep=' ')
Next, we map the characters of the vocabulary to integers. Given that LSTMs are made to work with numerical data, each character in the text needs to be represented as a numerical value. This mapping allows us to process characters through the model and, later, to reverse the process and convert numerical outputs into text again.
# Creates mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
n_chars = len(raw_text)
n_vocab = len(chars)
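Since we will later need to turn the model's integer predictions back into characters, it also helps to keep the reverse mapping (the name int_to_char is just illustrative):
# Reverse mapping, used later to decode predicted integers back into characters
int_to_char = dict((i, c) for i, c in enumerate(chars))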
Now, we split the data into input-output pairs. We want the model to predict one character at a time based on the previous 100 characters. Therefore, our input will be the sequence of 100 characters from position i to i+99, and the output will be the single character at position i+100.
# Prepares the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
#Total Patterns: 322645
Last, we reshape the input into the format expected by Keras, normalize it, and one-hot encode the output into 58-dimensional vectors (the size of the vocabulary). This means that, given an input sequence, the LSTM will output a vector of probabilities over the next character.
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
Model initialization and training
Initialization
Below we initialize our base LSTM model. It has two LSTM layers with 256 neurons each, a dropout layer after each of them to prevent overfitting, and a softmax output layer at the end. Additionally, stacking two LSTM layers should increase the model's capacity to represent more complex patterns.
# Creates LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Second, we initialize an LSTM with the same structure, but more neurons. Theoretically, the greater number of neurons should allow the model to capture more information and patterns.
# Creates LSTM model
larger_model = Sequential()
larger_model.add(LSTM(768, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
larger_model.add(Dropout(0.25))
larger_model.add(LSTM(768))
larger_model.add(Dropout(0.25))
larger_model.add(Dense(y.shape[1], activation='softmax'))
larger_model.compile(loss='categorical_crossentropy', optimizer='adam')
Third, we adjust the base model to predict words instead of characters. Although this considerably increases the number of distinct tokens the model can take as input (and the number of patterns it needs to learn), it might make it easier for the model to connect words together coherently.
Notice that this requires retokenizing the data because the model takes a different input, which we do below:
# Concatenates text data
raw_text = updated_df['Content'].str.cat(sep=' ')
# Tokenizes the text into words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([raw_text])
sequences = tokenizer.texts_to_sequences([raw_text])[0]
total_words = len(tokenizer.word_index) + 1 # Adding 1 because Keras word indices start at 1 (index 0 is reserved)
# Prepares sequences of 30 words as input and one word as output
seq_length = 30
dataX = []
dataY = []
for i in range(seq_length, len(sequences)):
    seq_in = sequences[i - seq_length:i]
    seq_out = sequences[i]
    dataX.append(seq_in)
    dataY.append(seq_out)
# Converts the sequences into numpy arrays
X = np.array(dataX)
y = to_categorical(dataY, num_classes=total_words)
print("Total Sequences: ", len(dataX))
# Now, X contains sequences of 30 words, and y is the one-hot encoded output.
# These can be used for training the LSTM model.
# reshapes X to be [samples, time steps, features]
n_patterns = len(dataX)
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalizes
X = X / float(total_words)
# y was already one-hot encoded above with num_classes=total_words, so we don't
# re-encode it here (re-encoding without num_classes could yield the wrong width)
# Adjusting the model for word-level prediction
words_model = Sequential()
words_model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
words_model.add(Dropout(0.2))
words_model.add(LSTM(256)) # No return_sequences needed in the last LSTM layer
words_model.add(Dropout(0.2))
words_model.add(Dense(total_words, activation='softmax')) # Changed y.shape[1] to total_words
words_model.compile(loss='categorical_crossentropy', optimizer='adam')
Training
For the base model, we trained for 70 epochs with a batch size of 60 (meaning 60 training samples are passed through the network before the weights are updated via backpropagation). With the resources of the free Colab tier, training took 1h56min and achieved a minimum loss of 1.4405.
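A minimal sketch of this training run, assuming a Keras ModelCheckpoint callback is used to save the weights whenever the loss improves (the callback configuration and filename pattern are assumptions; the hyperparameters are the ones stated above):
from keras.callbacks import ModelCheckpoint

# Saves the weights every time the training loss improves
checkpoint = ModelCheckpoint("weights-{epoch:02d}-{loss:.4f}.hdf5",
                             monitor='loss', save_best_only=True, mode='min')
model.fit(X, y, epochs=70, batch_size=60, callbacks=[checkpoint])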
For the model with extra neurons, we managed to train for only 20 epochs before the free Colab session shut down, achieving a minimum loss of 1.2908.
Last, for the model predicting words, the training lasted 3 hours, covering 300 epochs with a batch size of 15, which yielded a 0.5208 loss.
Inference and analysis
How to measure performance in the first place?
My goal was to create a model that could understand the patterns in my writing style well enough to generalize them. This meant the model should write text that sounds like me, which is clearly a difficult thing to measure: not only because style is highly personal, but also because, unlike classification or regression tasks, there are apparently few metrics for evaluating the performance of generative models. Some of the most common ones are:
- BLEU Score: commonly used in translation tasks, it computes the similarity between the generated text and a set of reference (human-generated) texts.
- Perplexity: measures how well a model predicts a sample of text; since it is the exponential of the cross-entropy loss, it can be read directly off the training loss (see the quick check after this list).
- ROUGE: commonly used in text summarization, it evaluates the quality of summaries or generated text by measuring the overlap of n-grams (sequences of words) between the generated and the reference texts.
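As a rough sanity check rather than a proper evaluation, the per-character perplexity of the base model can be computed from its reported loss, since Keras' categorical cross-entropy is an average negative log-likelihood in nats:
import math
# A loss of 1.4405 corresponds to a per-character perplexity of about 4.2,
# versus 58 (the vocabulary size) for a model guessing uniformly at random
print(math.exp(1.4405))  # ~4.22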
As we can see, though, none of these is particularly suitable for the goal at hand. Therefore, rather than using quantitative measures, I employ a qualitative evaluation, judging how closely I think each model approximates my writing style.
Last, I created two functions (found in the Colab notebook) that generate text. One retrieves a random part of the dataset and feeds it into the model, while the other allows us to input a customized text.
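A minimal sketch of the character-level generation loop behind these functions is shown below (the function name and exact details are illustrative, and the seed is assumed to be at least seq_length characters long):
def generate_text(model, seed, n_to_generate=200):
    # Greedily generates text one character at a time from a seed string
    pattern = [char_to_int[c] for c in seed[-seq_length:]]
    result = []
    for _ in range(n_to_generate):
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        prediction = model.predict(x, verbose=0)
        index = np.argmax(prediction)      # most likely next character
        result.append(int_to_char[index])
        pattern.append(index)              # slide the 100-character window forward
        pattern = pattern[1:]
    return ''.join(result)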
Base model
The first to be analyzed was the base model that predicted characters (two LSTM layers with 256 neurons each, a dropout layer after each, and a softmax at the end).
In general, a surprising characteristic of the outputs was that words were not misspelled. Most results, however, seemed to be a mix of different pieces, blending poem-like language and tutorial-like structures that did not make sense together. Below is an example:
input: "yesterday was a beautiful day, the sea shone in the city
while the sun illuminated the hills and"
output: "the context they were in because of their position could
not be and the context that they were containing at each
position of their position the first time the mind no matter
how much the contention of their position the first time"
Model with extra neurons
Despite the additional neurons, which in theory should allow the LSTM to capture more patterns, this model performed worse than the previous one. It was capable of outputting character sequences that resemble words (placing spaces correctly, alternating between vowels and consonants), but it often misspelled them. Furthermore, the words it did generate correctly did not make sense as a whole: unlike with the previous model, neighboring words had little connection with each other.
input: "you can find the code for my application here finally we're
done I really hope this article has"
output: "gone from wood to eter the couple for the smuggler was
no longer pure swork of his app and unless some maner pue
was in his face of tm celes fmi the only thing that was inn
each new work in the part of corrado for the couple For
the smuggler it was no longer pure work of art."
Words-based model
Finally, we have the variant of the base model that predicts words rather than characters. After training for 300 epochs, the final set of weights was heavily overfit, so I instead used the weights from around epoch 90 for prediction, which had captured some patterns of my writing but were not yet copying the training data. Sampling a random part of the dataset as input:
input: "cells with content for this we will use another collection
view protocol but first we need a brief explanation imagine
we have 10000 items to display in the cv if we continued
implementing"
output: "the codes normally we would create a cell of each more than
10000 items even though there are no pain cells and eyes that
way the night of leaving and leaving was one no three no the
no strident no streets of the bar of the go what face that the
future time"
There are no misspelled words here, which is expected since the model predicts whole words rather than characters. However, the output is a mix of overfit text and hallucination. The beginning of the output is the exact continuation of the input, which comes from an iOS tutorial I once wrote; at some point, though, it switches to a nearly random string of words. This random stretch resembles some of my novels, but it does not make sense.
It is worth repeating that this result comes from the weights the model had around epoch 90. Weights from earlier epochs produced text with no meaning, and weights from later epochs produced copies of the training data due to overfitting.
Overall, the model trained on words seems to be unable to find the balance between learning my writing style, learning to generate text that makes sense, and not overfitting the training data.
Conclusion
Overall, the character-predicting LSTMs generated words correctly and hardly ever misspelled them. This is, first of all, surprising, given that these models simply predict one character at a time. What we see is that they sample letters in a way that makes sense (they do not string together 20 characters without a space, nor produce gibberish like “yzgsfat”), and these samples turn out to be words we actually understand. They even sometimes produced words that could make sense together, such as “because of their position” or “the mind becomes more than”.
However, the sentences the LSTMs built were not really coherent, and we were left with output that resembled natural language but was not. One might argue that performance could have been improved with longer training, but the results suggest there was not much room left for improvement. For the word-based model, the results were somewhat better but far from good: even when balancing underfitting and overfitting, the output did not carry much meaning and, after a few words, stopped making sense to the reader.
All in all, we see that LSTMs can only scratch the surface of text generation. These models produce outputs whose individual units make sense (characters that form real words), but they struggle to arrange those units in a meaningful way, apparently being unable to build useful sentences without overfitting the training data.
To bridge this gap, we need a model that can weigh the relevance of each word relative to every other word. That means a model with attention mechanisms, such as Transformer-based architectures, which keep the ability to process sequential inputs while modeling these relationships far more effectively.
References
Tutorial for LSTM for text generation:
How LSTMs work:
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://www.youtube.com/watch?v=YCzL96nL7j0
How to evaluate generative models: