I can read forum threads for hours (Hacker News, Reddit, etc.), but I want to spend that time reading real books.
The idea: make long-form text more like a thread. Give each paragraph an indentation level that indicates how it relates to the text that came before it.
I created an embedding for each paragraph using the doc2vec model from Gensim.
I used a tree-like structure with each paragraph as a node. If a node was more similar to the previous node than to the next one, I increased its depth and pushed the next node to that same depth. Otherwise, I searched up the tree for an ancestor the node could branch off of (one that was more similar to it than the next node was).
```python
for index, paragraph in enumerate(cleaned_train_corpus[1:-1], start=1):
    # Compare this paragraph to its neighbours.
    sim_prev = model.docvecs.similarity(paragraph.tags, paragraph.prev.tags)
    sim_next_ = model.docvecs.similarity(paragraph.tags, paragraph.next_.tags)
    if sim_prev >= sim_next_:
        # Closer to the previous paragraph: nest one level deeper
        # and carry the next paragraph along at this depth.
        paragraph.depth += 1
        paragraph.next_.depth = paragraph.depth
    else:
        # Closer to the next paragraph: walk back up the tree looking
        # for an ancestor this paragraph can branch off of.
        search_depth = paragraph.depth
        while search_depth > 0:
            p1 = paragraph.prev
            p2 = p1.prev
            while p2.depth - 1 >= p1.depth:
                p2 = p2.prev
            sim_1 = model.docvecs.similarity(paragraph.tags, p1.tags)
            sim_2 = model.docvecs.similarity(paragraph.tags, p2.tags)
            if sim_2 > sim_1:
                # The ancestor is a better fit: keep climbing.
                search_depth -= 1
            else:
                # Stay at the previous paragraph's depth and stop searching.
                paragraph.depth = paragraph.prev.depth
                search_depth = -1
```
I removed text between brackets, which I took to be footnotes. I replaced every word that appeared only once with a special token. I converted everything to lower case and added tokens marking where the uppercase letters had been.
```python
def replace_words(word_list):
    new_list = ["xxxbos"]  # beginning-of-sentence marker
    for i in word_list:
        # Replace out-of-vocabulary words with a special token.
        if i in vocab:
            token = i
        else:
            token = "xxxunk"
        if i == i.title():
            # Mark the capitalisation, then lower-case the word itself.
            new_list += ["xxxcap", token.lower()]
        elif i[-1] == ".":
            # Split off the period and start a new sentence.
            new_list += [token[0:-1], ".", "xxxbos"]
        elif i[-1] == ",":
            new_list += [token[0:-1]]
        elif i[-1] == '"':
            new_list += [token[0:-1], "xxxquote"]
        else:
            new_list += [token]
    return new_list
```
Next, I want to break large paragraphs into smaller chunks so that all the documents end up a similar size.
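One way that chunking could be sketched (the `chunk_paragraph` helper and the target size are my own, not from the project): split each token list into the smallest number of chunks that fit under a target length, and make the chunks near-equal so no chunk is left tiny.

```python
def chunk_paragraph(words, target_size=100):
    """Split a token list into near-equal chunks of at most target_size words."""
    if len(words) <= target_size:
        return [words]
    n_chunks = -(-len(words) // target_size)  # ceiling division
    chunk_len = -(-len(words) // n_chunks)    # near-equal chunk length
    return [words[i:i + chunk_len] for i in range(0, len(words), chunk_len)]
```

For example, a 250-word paragraph with a target of 100 becomes three chunks of 84, 84, and 82 words rather than two of 100 and one straggler of 50.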