Commit 7aaff67

add text generator tutorial
1 parent c2050cb commit 7aaff67

9 files changed: +3913 -0 lines changed
README.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - ### [Machine Learning](https://www.thepythoncode.com/topic/machine-learning)
 - ### [Natural Language Processing](https://www.thepythoncode.com/topic/nlp)
 - [How to Build a Spam Classifier using Keras in Python](https://www.thepythoncode.com/article/build-spam-classifier-keras-python). ([code](machine-learning/nlp/spam-classifier))
+- [How to Build a Text Generator using Keras in Python](https://www.thepythoncode.com/article/text-generation-keras-python). ([code](machine-learning/nlp/text-generator))
 
 - [How to Detect Human Faces in Python using OpenCV](https://www.thepythoncode.com/article/detect-faces-opencv-python). ([code](machine-learning/face_detection))
 - [Building a Speech Emotion Recognizer using Scikit-learn](https://www.thepythoncode.com/article/building-a-speech-emotion-recognizer-using-sklearn). ([code](machine-learning/speech-emotion-recognition))
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# [How to Build a Text Generator using Keras in Python](https://www.thepythoncode.com/article/text-generation-keras-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- To use the pre-trained text generator model, which was trained on the text of Alice's Adventures in Wonderland:
+```
+python generate.py --help
+```
+**Output:**
+```
+usage: generate.py [-h] [-n N_CHARS] seed
+
+Text generator that was trained on Alice's Adventures in the Wonderland book.
+
+positional arguments:
+  seed                  Seed text to start with, can be any english text, but
+                        it's preferable you take from the book itself.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -n N_CHARS, --n-chars N_CHARS
+                        Number of characters to generate, default is 200.
+```
+Generating 200 characters from a seed taken from the book:
+```
+python generate.py "down down down there was nothing else to do so alice soon began talking again " -n 200
+```
+**Output:**
+```
+Generating text: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:40<00:00, 5.02it/s]
+Generated text:
+the duchess asked to the dormouse she wanted about for the world all her life i dont know what to think that it was so much sort of mine for the world a little like a stalking and was going to the mou
+```
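The help text above corresponds to a parser with one positional argument and one optional integer flag. A minimal sketch of that interface follows, assuming the same argument names as the generation script in this commit; `build_parser` is a hypothetical wrapper, and the argument list is passed explicitly only so the example runs without a command line:

```python
import argparse


def build_parser():
    """Build a parser with one positional seed and an optional character count."""
    parser = argparse.ArgumentParser(
        description="Text generator that was trained on Alice's Adventures in the Wonderland book."
    )
    parser.add_argument("seed", help="Seed text to start with.")
    parser.add_argument(
        "-n", "--n-chars", type=int, dest="n_chars", default=200,
        help="Number of characters to generate, default is 200.",
    )
    return parser


# parse an explicit list instead of sys.argv (an assumption for illustration)
args = build_parser().parse_args(["down down down", "-n", "50"])
print(args.seed, args.n_chars)  # down down down 50
```

Omitting `-n` leaves `n_chars` at its default of 200, which matches the README example.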
Binary file not shown.
Binary file not shown.

machine-learning/nlp/text-generator/data/wonderland.txt

Lines changed: 3735 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+import numpy as np
+import pickle
+import tqdm
+from keras.models import Sequential
+from keras.layers import Dense, LSTM, Dropout, Activation
+from keras.callbacks import ModelCheckpoint
+
+# seed = "do not try to"
+
+char2int = pickle.load(open("data/wonderland-char2int.pickle", "rb"))
+int2char = pickle.load(open("data/wonderland-int2char.pickle", "rb"))
+
+sequence_length = 100
+n_unique_chars = len(char2int)
+
+# building the model
+model = Sequential([
+    LSTM(256, input_shape=(sequence_length, n_unique_chars), return_sequences=True),
+    Dropout(0.3),
+    LSTM(256),
+    Dense(n_unique_chars, activation="softmax"),
+])
+
+model.load_weights("results/wonderland-v2-0.75.h5")
+
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description="Text generator that was trained on Alice's Adventures in the Wonderland book.")
+    parser.add_argument("seed", help="Seed text to start with, can be any english text, but it's preferable you take from the book itself.")
+    parser.add_argument("-n", "--n-chars", type=int, dest="n_chars", help="Number of characters to generate, default is 200.", default=200)
+    args = parser.parse_args()
+
+    n_chars = args.n_chars
+    seed = args.seed[-sequence_length:]  # keep at most sequence_length characters
+
+    # generate n_chars characters
+    generated = ""
+    for i in tqdm.tqdm(range(n_chars), "Generating text"):
+        # make the input sequence (one-hot, right-aligned in the window)
+        X = np.zeros((1, sequence_length, n_unique_chars))
+        for t, char in enumerate(seed):
+            X[0, (sequence_length - len(seed)) + t, char2int[char]] = 1
+        # predict the next character
+        predicted = model.predict(X, verbose=0)[0]
+        # converting the vector to an integer
+        next_index = np.argmax(predicted)
+        # converting the integer to a character
+        next_char = int2char[next_index]
+        # add the character to results
+        generated += next_char
+        # shift seed and the predicted character
+        seed = seed[1:] + next_char
+
+    print("Generated text:")
+    print(generated)
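The generation loop above one-hot encodes the seed, right-aligned in a fixed window, then repeatedly appends the most likely next character. The sketch below isolates those two steps so they can be run without the trained weights. `encode_seed`, `generate`, and `toy_predict` are hypothetical names, the six-character vocabulary is made up, and `toy_predict` merely stands in for `model.predict`. Unlike the script, this version drops characters missing from the vocabulary and lets a short seed grow until it fills the window; both are assumptions, not the author's behavior.

```python
import numpy as np


def encode_seed(seed, char2int, sequence_length):
    """One-hot encode `seed`, right-aligned in a window of `sequence_length` steps."""
    # keep only characters the vocabulary knows, then the most recent ones
    known = [c for c in seed.lower() if c in char2int][-sequence_length:]
    X = np.zeros((1, sequence_length, len(char2int)), dtype=np.float32)
    offset = sequence_length - len(known)  # leading positions stay all-zero
    for t, char in enumerate(known):
        X[0, offset + t, char2int[char]] = 1
    return X


def generate(seed, n_chars, predict_fn, char2int, int2char, sequence_length):
    """Greedy decoding: always pick the highest-probability next character."""
    generated = ""
    for _ in range(n_chars):
        probs = predict_fn(encode_seed(seed, char2int, sequence_length))[0]
        next_char = int2char[int(np.argmax(probs))]
        generated += next_char
        # slide the window: keep at most the last `sequence_length` characters
        seed = (seed + next_char)[-sequence_length:]
    return generated


def toy_predict(X):
    """Stand-in for model.predict: favors the vocabulary entry after the last one seen."""
    last = int(X[0, -1].argmax())
    probs = np.zeros((1, X.shape[2]))
    probs[0, (last + 1) % X.shape[2]] = 1.0
    return probs


toy_vocab = {c: i for i, c in enumerate(" abcde")}
toy_inverse = {i: c for c, i in toy_vocab.items()}

X = encode_seed("Abe!", toy_vocab, sequence_length=5)
print(X.shape)  # (1, 5, 6)
print(generate("ab", 5, toy_predict, toy_vocab, toy_inverse, sequence_length=5))  # cde a
```

Greedy `argmax` decoding is deterministic, so the same seed always yields the same continuation.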
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+numpy
+tensorflow==1.13.1
+keras
+requests
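The generation script in this commit imports `tqdm`, which does not appear in the list above, so a fresh environment may still need it installed separately. A small sketch, using only the standard library, for checking which imported modules are missing; `missing_modules` is a hypothetical helper and not part of the repository:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# modules imported by the scripts in this commit
print(missing_modules(["numpy", "keras", "tqdm"]))
```

An empty list means every module was found; any name printed would need a `pip3 install`.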
Binary file not shown.
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+import numpy as np
+import os
+import pickle
+from keras.models import Sequential
+from keras.layers import Dense, LSTM, Dropout
+from keras.callbacks import ModelCheckpoint
+from string import punctuation
+
+# commented because already downloaded
+# import requests
+# content = requests.get("http://www.gutenberg.org/cache/epub/11/pg11.txt").text
+# open("data/wonderland.txt", "w", encoding="utf-8").write(content)
+
+# read the data
+text = open("data/wonderland.txt", encoding="utf-8").read()
+# remove caps
+text = text.lower()
+# remove punctuation
+text = text.translate(str.maketrans("", "", punctuation))
+# print some stats
+n_chars = len(text)
+unique_chars = ''.join(sorted(set(text)))
+print("unique_chars:", unique_chars)
+n_unique_chars = len(unique_chars)
+print("Number of characters:", n_chars)
+print("Number of unique characters:", n_unique_chars)
+
+# dictionary that converts characters to integers
+char2int = {c: i for i, c in enumerate(unique_chars)}
+# dictionary that converts integers to characters
+int2char = {i: c for i, c in enumerate(unique_chars)}
+
+# save these dictionaries for later generation (generate.py loads them from data/)
+pickle.dump(char2int, open("data/wonderland-char2int.pickle", "wb"))
+pickle.dump(int2char, open("data/wonderland-int2char.pickle", "wb"))
+
+# hyper parameters
+sequence_length = 100
+step = 1
+batch_size = 128
+epochs = 40
+
+sentences = []
+y_train = []
+for i in range(0, len(text) - sequence_length, step):
+    sentences.append(text[i: i + sequence_length])
+    y_train.append(text[i+sequence_length])
+print("Number of sentences:", len(sentences))
+
+# vectorization: one-hot encode each window and its next character
+X = np.zeros((len(sentences), sequence_length, n_unique_chars))
+y = np.zeros((len(sentences), n_unique_chars))
+
+for i, sentence in enumerate(sentences):
+    for t, char in enumerate(sentence):
+        X[i, t, char2int[char]] = 1
+        y[i, char2int[y_train[i]]] = 1
+
+print("X.shape:", X.shape)
+
+# building the model
+# model = Sequential([
+#     LSTM(128, input_shape=(sequence_length, n_unique_chars)),
+#     Dense(n_unique_chars, activation="softmax"),
+# ])
+
+# a better model (slower to train obviously)
+model = Sequential([
+    LSTM(256, input_shape=(sequence_length, n_unique_chars), return_sequences=True),
+    Dropout(0.3),
+    LSTM(256),
+    Dense(n_unique_chars, activation="softmax"),
+])
+
+# model.load_weights("results/wonderland-v2-2.48.h5")
+
+model.summary()
+model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
+
+if not os.path.isdir("results"):
+    os.mkdir("results")
+
+checkpoint = ModelCheckpoint("results/wonderland-v2-{loss:.2f}.h5", verbose=1)
+
+# train the model
+model.fit(X, y, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint])
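The windowing and vectorization in the training script above are easier to follow on a toy string. This sketch applies the same `sequence_length` and `step` idea with a window of 4 instead of 100. The toy text, the names `windows` and `targets`, and the `float32` dtype are assumptions made here to keep the example small:

```python
import numpy as np

text = "hello world"
sequence_length = 4
step = 1

# vocabulary: one integer per distinct character, in sorted order
unique_chars = "".join(sorted(set(text)))
char2int = {c: i for i, c in enumerate(unique_chars)}

# each window of `sequence_length` characters is paired with the character that follows it
starts = range(0, len(text) - sequence_length, step)
windows = [text[i: i + sequence_length] for i in starts]
targets = [text[i + sequence_length] for i in starts]

# one-hot encode inputs (samples, timesteps, vocabulary) and labels (samples, vocabulary)
X = np.zeros((len(windows), sequence_length, len(unique_chars)), dtype=np.float32)
y = np.zeros((len(windows), len(unique_chars)), dtype=np.float32)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        X[i, t, char2int[char]] = 1
    y[i, char2int[targets[i]]] = 1

print(windows[0], "->", targets[0])  # hell -> o
print(X.shape, y.shape)              # (7, 4, 8) (7, 8)
```

With `step = 1` consecutive windows overlap by all but one character, so the number of training samples is roughly the length of the text.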
