Machine Learning to Make Non-Fiction More Readable

View the working example here.


The Problem

I can read forum threads for hours (Hacker News, Reddit, etc.), but I want to spend that time reading real books.

The Solution

Make long-form text read more like a thread: indent each paragraph to show how it relates to the text that came before it.

I created an embedding for each paragraph using the doc2vec model from Gensim.

Document Tree

I used a tree-like structure with each paragraph as a node. If a node was more similar to the previous node than to the next node, its depth was increased and the next node was pushed down to that depth. I then looked up the tree to see whether there was a node it could branch off of instead (one it was more similar to).

for paragraph in cleaned_train_corpus[1:-1]:
    sim_prev = model.docvecs.similarity(paragraph.tags[0], paragraph.prev.tags[0])
    sim_next = model.docvecs.similarity(paragraph.tags[0], paragraph.next_.tags[0])
    if sim_prev >= sim_next:
        # Closer to the previous paragraph: nest one level deeper and
        # carry the following paragraph down to the same depth.
        paragraph.depth += 1
        paragraph.next_.depth = paragraph.depth
        # Walk up the tree looking for an ancestor that is a better
        # match than the immediate parent to branch off of instead.
        p1 = paragraph.prev
        search_depth = paragraph.depth
        while search_depth > 0:
            p2 = p1.prev
            while p2.depth - 1 >= p1.depth:
                p2 = p2.prev
            sim_1 = model.docvecs.similarity(paragraph.tags[0], p1.tags[0])
            sim_2 = model.docvecs.similarity(paragraph.tags[0], p2.tags[0])
            if sim_2 > sim_1:
                paragraph.depth = paragraph.prev.depth
                break
            p1 = p2
            search_depth -= 1
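The core nesting rule can be run end-to-end on toy data. This is a self-contained sketch: the Paragraph class and the word-overlap similarity are stand-ins for the real nodes and doc2vec scores, not the actual implementation.

```python
class Paragraph:
    def __init__(self, text):
        self.text = text
        self.depth = 0
        self.prev = None
        self.next_ = None

def link(paragraphs):
    # Wire the nodes into a doubly linked list.
    for a, b in zip(paragraphs, paragraphs[1:]):
        a.next_, b.prev = b, a
    return paragraphs

def word_overlap(a, b):
    # Stand-in similarity: Jaccard overlap of the two word sets.
    wa, wb = set(a.text.split()), set(b.text.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def assign_depths(paragraphs, similarity):
    # Core rule: a paragraph closer to its predecessor than to its
    # successor nests one level deeper and drags the successor along.
    for p in paragraphs[1:-1]:
        if similarity(p, p.prev) >= similarity(p, p.next_):
            p.depth += 1
            p.next_.depth = p.depth

paras = link([Paragraph(t) for t in [
    "cats are small animals",
    "cats like to sleep all day",
    "dogs are loyal animals",
    "dogs like to play fetch",
]])
assign_depths(paras, word_overlap)
print([p.depth for p in paras])  # [0, 1, 1, 0]
```

The second paragraph shares a word with its predecessor but none with its successor, so it nests one level deeper and pulls the third paragraph down with it.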

Text Cleaning

I removed text between brackets, which I took to be footnotes. I replaced every word that appeared only once with a special token, converted everything to lower case, and added signifiers marking capitalized words.
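The vocabulary of words seen more than once can be built with a Counter. A minimal sketch, where corpus_words is a hypothetical flat list of the corpus tokens:

```python
from collections import Counter

corpus_words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
counts = Counter(corpus_words)
# Keep only words mentioned more than once; the rest become "xxxunk".
vocab = {w for w, c in counts.items() if c > 1}
print(sorted(vocab))  # ['cat', 'the']
```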

def replace_words(word_list):
    new_list = ["xxxbos"]
    for word in word_list:
        # Split trailing punctuation off before the vocabulary lookup,
        # so that "cat." is matched as "cat".
        trailing = ""
        if word and word[-1] in '.,"':
            word, trailing = word[:-1], word[-1]
        # Replace words outside the vocabulary with a special token.
        token = word if word in vocab else "xxxunk"
        if word and word == word.title():
            # Mark capitalization and store the lower-case form.
            new_list += ["xxxcap", token.lower()]
        else:
            new_list += [token]
        if trailing == ".":
            new_list += [".", "xxxbos"]  # sentence boundary
        elif trailing == '"':
            new_list += ["xxxquote"]
    return new_list

Next Steps

I want to break large paragraphs into smaller chunks so that all the documents are a similar size.
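One possible approach is to split each long paragraph into roughly equal-sized word chunks. A sketch, with the chunk size and helper name being assumptions rather than anything implemented yet:

```python
def chunk_paragraph(words, target=100):
    # Split a long paragraph into chunks of roughly `target` words,
    # keeping the chunks as evenly sized as possible.
    n_chunks = max(1, round(len(words) / target))
    size = -(-len(words) // n_chunks)  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]

words = ["w"] * 250
chunks = chunk_paragraph(words, target=100)
print([len(c) for c in chunks])  # [125, 125]
```

Splitting into even chunks rather than fixed 100-word windows avoids a tiny leftover chunk at the end of each paragraph.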