Preprocess Your Text with SpaCy

Duygu ALTINOK
7 min read · Jul 15, 2018

In this post, I’ll walk you through how to preprocess your text before feeding it to statistical algorithms. Preprocessing is basically normalizing your text for further processing. One usually begins with lexical attributes, then advances through more linguistic features.

Preprocess your text to compactify the patterns

As an example, I’ll process customer e-mails in the sales domain. Imagine you work as a sales agent and want to pitch your company’s brand new product; you send e-mails and receive customer e-mails in return. Customers can write that they’d love to get to know the product, that they’re potentially interested, that they’re not interested, that their business is not related at all, or that they’re not the person responsible for this sort of communication and you should e-mail their boss or sales department instead. I’ll do the processing for English; you can visit my blog for the German counterpart, which is juuuust a bit longer 😁

I’ll use a “mail” argument for the mail itself and a “sentence” argument for an individual sentence.

I’ll extract some entities into an output dictionary and, at the same time, tokenize them. Tokenization is a good idea for a small to mid-size corpus, but if you have enough data to train a neural network you can leave entity recognition to the network.

I skipped the bcc and cc information and the message title. In a real system, you should (and will) use them for extracting entities. Here, I’ll only process the e-mail body.

OK, let’s get started:

Use UTF-8

This is in fact the biggest issue in this post. Even if the e-mails are written in English, always hold your strings in UTF-8. There can be entities, person names, or geographic names with UTF-8 characters, as well as UTF-8 punctuation characters. Always hold your strings as unicode, not str. For any library you want to use, always check for UTF-8 compatibility. This advice applies to Python 2; in Python 3 the unicode/str distinction no longer exists, and all strings are unicode.

In this post I’ll use Python 2. If you want to use Python 3, just remove the u prefixes from the front of the strings. Another option is to import unicode_literals from __future__ to make your code both Python 2 and Python 3 compatible. More details can be found at http://python-future.org/unicode_literals.html .
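For instance, this is roughly what the compatibility import looks like at the top of a module (the greeting string is just an illustration):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

greeting = "Grüße aus Köln"  # a unicode string on Python 2 as well, thanks to the import
print(type(greeting))        # <type 'unicode'> on Python 2, <class 'str'> on Python 3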

Go Over the String Only Once if Possible

This is yet another very important issue. Many text processing applications are inefficient because they go over the string many times. String algorithms can be very expensive in general, but you can still save time with some tricks. For instance, I don’t use regexes on the whole mail. I always tokenize the e-mail, then go over the tokens one by one. Something like this:

for token in email:
    if token is an email address:
        do something
    elif token is a url address:
        do something2
    elif token is a location name:
        do something3

For instance, if you use separate regexes for URLs and e-mail addresses, you backtrack over the whole e-mail twice. Don’t get me wrong, you can’t avoid regex altogether anyway, but use it wisely and ideally on individual tokens.
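Here is a minimal sketch of that loop with spaCy’s lexical attributes (like_email and like_url are built-in token attributes; the location check is left out because it needs the entity recognizer, which we’ll get to later):

def route_tokens(doc):
    # Walk the parsed e-mail once and bucket tokens by lexical attribute
    emails, urls = [], []
    for token in doc:
        if token.like_email:   # spaCy's built-in e-mail check
            emails.append(token.text)
        elif token.like_url:   # spaCy's built-in URL check
            urls.append(token.text)
    return emails, urls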

Lexical replacements

In this method, we replace some fancy UTF-8 characters with their “usual” counterparts. For instance, the horizontal ellipsis character stands in for “three dots”; most probably you want to recognize both … and ... as a potential sentence boundary. The hyphen comes in many forms as well: the long hyphen, the ordinary hyphen from the keyboard, the non-breaking hyphen, the horizontal bar… You can see all of them here: http://jkorpela.fi/dashes.html

For this sort of method, one doesn’t need any linguistic information. Just replace the fancy UTF-8 characters with their usual counterparts; unicode.replace suffices. The argument is of type unicode, and the return value is again unicode.

Let’s see some short examples:

def replace_canadian_period(mail):
    mail = mail.replace(u"\u1427", ".")
    return mail

def replace_fancy_hyphens(mail):
    hlist = [u"\u002d", u"\u058a", u"\u058b", u"\u2010", u"\u2011", u"\u2012", u"\u2013", u"\u2014", u"\u2015", u"\u2e3a", u"\u2e3b", u"\ufe58", u"\ufe63", u"\uff0d"]
    for h in hlist:
        mail = mail.replace(h, "-")
    return mail

Then your lexical preprocessor looks like:

def lexical_processor(mail):
    mail = replace_canadian_period(mail)
    mail = replace_fancy_hyphens(mail)
    return mail
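If you also want to normalize the ellipsis mentioned above, a helper in the same spirit could look like this (the function name is mine; chain it into lexical_processor just like the others):

def replace_fancy_ellipsis(mail):
    # U+2026 HORIZONTAL ELLIPSIS -> three plain dots
    mail = mail.replace(u"\u2026", "...")
    return mail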

Cut the Greetings

Opening and closing greetings can be deduced from opening and closing words and the e-mail structure. For instance, if the first line is short (one or two words) and begins with an opening greeting word or ends with a comma, most probably it is a greeting line. Some examples:

Hi Jeremy,   #Yes
I hope all is well. I'd like to inform...
Jeremy, #Yes, one word followed by a comma
We're happy to announce that...
Thank you, but we're not interested. #No, e-mail is not even multiline
Hi Jeremy, we're not interested. #There is a greeting word but the line is too long. So: no

The method would look like this:

def cut_greetings(email):
    opening_greeting = ...
    closing_greeting = ...
    return cut_email, greetings_list

That is, the method returns the cut e-mail together with a list of tuples:

Hi Jan,
I hope you're doing great. I forwarded your email to the sales department.
Best Regards,
Alicia
return: (cut_email="I hope you're doing great. I forwarded your email to the sales department",
         greetings_list=[("Hi", "opening", 0), ("Best Regards,", "closing", -2)]
        )
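Here is one possible way to fill in that skeleton with the heuristics above (a short first line that starts with a greeting word or ends with a comma, and a short closing phrase near the end). The word lists are illustrative, not exhaustive:

OPENING_WORDS = (u"hi", u"hello", u"dear")                        # illustrative list
CLOSING_WORDS = (u"best regards", u"kind regards", u"thanks")     # illustrative list

def cut_greetings(email):
    lines = email.splitlines()
    greetings_list = []
    # Opening: a short first line that starts with a greeting word or ends with a comma
    if len(lines) > 1:
        first = lines[0].strip()
        if len(first.split()) <= 2 and (first.lower().startswith(OPENING_WORDS)
                                        or first.endswith(u",")):
            greetings_list.append((first, "opening", 0))
            lines = lines[1:]
    # Closing: scan the last two lines for a short closing phrase
    for idx in (-2, -1):
        if len(lines) >= abs(idx):
            candidate = lines[idx].strip()
            if len(candidate.split()) <= 3 and candidate.lower().startswith(CLOSING_WORDS):
                greetings_list.append((candidate, "closing", idx))
                lines = lines[:idx]
                break
    cut_email = u" ".join(line.strip() for line in lines if line.strip())
    return cut_email, greetings_list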

I usually cut the greetings, leave only the e-mail body, and feed the opening/closing types as discrete covariates to the statistical algorithms. The greeting word itself, or the person name following it, is usually not very correlated with the result; however, whether greetings exist and, if so, of what type is directly correlated with the result (at least with the sentiment). Nobody would write:

Hello Mrs. Lopez,
TAKE ME OUT OF YOUR SPAM LIST OR I SUE
Kind Regards,
Adam

Extract E-mails

Up to this point, the e-mail has been a plain unicode string.

import spacy
nlp = spacy.load("en") # we load the English models once and use them many times. Don't put this line inside any method, it would dramatically reduce efficiency

def email_preprocessor(mail):
    entities = {}
    mail = lexical_processor(mail)         # mail is unicode
    mail, greetings = cut_greetings(mail)  # mail is unicode
    doc = nlp(mail)                        # doc is a spaCy Doc
    entities['emails'] = extract_emails(doc)

We construct a spaCy Doc object from our unicode e-mail. The most common way of extracting e-mail addresses is a regex. However, as I said before, the regex engine backtracks a lot. Instead we use the token attribute like_email:

def extract_emails(doc):
    resultlis = []
    for token in doc:
        if token.like_email:
            resultlis.append((token.text, token.idx, token.idx + len(token)))
    return resultlis

In the if statement, if the token is an e-mail address, we record the token, its beginning position in the sentence and its ending position in the sentence.

Let’s see what we get as output:

>>> email = u"I'm not really responsible from the sales. Please directly email either my boss at duygu@iam.uni-bonn.de or email our speaking partner Ms. Ethridge at ethridge@duygu.de."
>>> doc = nlp(email)
>>> extract_emails(doc)
[('duygu@iam.uni-bonn.de', 83, 104), ('ethridge@duygu.de', 151, 168)]

For now we extract the e-mail addresses. If you want to tokenize the sentence, just use the indices from the results list.
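A sketch of that replacement step, assuming you walk the extracted spans from right to left so that earlier character offsets stay valid (the helper name is mine; email_tok is the placeholder used below):

def tokenize_spans(mail, spans, placeholder):
    # Replace each (text, start, end) span in the raw mail with a placeholder token.
    # Go right to left so that the earlier character offsets stay valid.
    for text, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        mail = mail[:start] + placeholder + mail[end:]
    return mail

# e.g. tokenize_spans(email, entities['emails'], u"email_tok")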

After extracting and tokenizing the e-mails, your tokenized e-mail and the entities dictionary should look like:

I'm not really responsible from the sales, please directly email either my boss at email_tok or email our speaking partner Ms. Ethridge at email_tok.

{
 'email': [('duygu@iam.uni-bonn.de', 83, 104), ('ethridge@duygu.de', 151, 168)]
}

Processing and tokenizing URL addresses works the same way; use the token attribute like_url to extract the URLs instead. Determining beginning and ending positions is exactly the same.
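For completeness, the URL version is the same function with the attribute swapped (the name extract_urls is mine):

def extract_urls(doc):
    resultlis = []
    for token in doc:
        if token.like_url:
            resultlis.append((token.text, token.idx, token.idx + len(token)))
    return resultlis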

Extract Person Names

Extracting person names is almost the same. spaCy’s entity extraction scheme allows multi-word entities, hence we don’t operate on individual tokens; we operate on the document itself. doc.ents is the way to extract the entities; it is a sequence of spaCy Span objects, so multi-word names are allowed. Entities have several types; we’ll filter the results by the PERSON type.

def extract_person_names(doc):
    resultlis = [(entity.text, entity.start_char, entity.end_char)
                 for entity in doc.ents if entity.label_ == "PERSON"]
    return resultlis

One thing to be careful about is that the extracted entities do not contain title words: Ms., Mrs., Prof., Dr. etc. One can make a small list of titles and check the previous token. Personally, I include the titles so I don’t have to bother about title/gender later. Here, I’ll just keep the name for the sake of simplicity. If you happen to extract the title, deduce the gender and add them both to the result tuples.
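A minimal sketch of that title check, assuming a small hand-made list (both the list and the function name are mine): the token right before the entity is compared against the list and, if it matches, prepended to the name.

TITLES = (u"Mr.", u"Ms.", u"Mrs.", u"Prof.", u"Dr.")  # small hand-made list, extend as needed

def extract_person_names_with_titles(doc):
    resultlis = []
    for entity in doc.ents:
        if entity.label_ != "PERSON":
            continue
        text, start = entity.text, entity.start_char
        prev = doc[entity.start - 1] if entity.start > 0 else None
        if prev is not None and prev.text in TITLES:
            text = prev.text + u" " + text  # include the title in the extracted span
            start = prev.idx
        resultlis.append((text, start, entity.end_char))
    return resultlis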

Let’s add it to our mini pipeline:

import spacy
nlp = spacy.load("en") # we load the English models once and use them many times. Don't put this line inside any method, it would dramatically reduce efficiency

def email_preprocessor(mail):
    entities = {}
    mail = lexical_processor(mail)         # mail is unicode
    mail, greetings = cut_greetings(mail)  # mail is unicode
    doc = nlp(mail)                        # doc is a spaCy Doc
    entities['emails'] = extract_emails(doc)
    entities['persons'] = extract_person_names(doc)

Here’s the resulting entities dictionary and the tokenized e-mail:

I'm not really responsible from the sales, please directly email either my boss at email_tok or email our speaking partner Ms. person_tok at email_tok.

{
 'email': [('duygu@iam.uni-bonn.de', 83, 104), ('ethridge@duygu.de', 151, 168)],
 'person': [('Ethridge', 139, 147)],
}

Extract Dates

Sales or customer care e-mails can include dates that point to specific appointment times or follow-up dates. These are typical dates I come across:

I have free time tomorrow afternoon after 16.00.
Our company is restructuring now, please get back to us after 6 months.
Can we set up a call next week Thursday or Friday?

I explained how to do it with a context free parser in https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/ . This issue deserves a post of its own 😁
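If you just need a quick baseline in the meantime, spaCy’s own DATE and TIME entities can be filtered exactly like the person names; note this is not the context-free parser from the linked post, just a rough first pass:

def extract_dates(doc):
    resultlis = [(entity.text, entity.start_char, entity.end_char)
                 for entity in doc.ents
                 if entity.label_ in ("DATE", "TIME")]
    return resultlis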

Up to this point we have extracted the entities and prepared an entities dictionary. Now it’s the stemmer’s, synonym replacer’s and stopword cleaner’s turn to come into play. My blog post explains the details of stemming algorithms, efficiency concerns and corpus stop words.
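As a teaser, stopword and punctuation removal can piggyback on the same Doc object via the is_stop and is_punct token attributes; spaCy offers lemmas (lemma_) rather than stems, and synonym replacement needs its own lexicon. A minimal sketch:

def clean_and_lemmatize(doc):
    # Drop stopwords and punctuation, keep the lemmas of the remaining tokens
    return [token.lemma_ for token in doc
            if not token.is_stop and not token.is_punct]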

The rest of this telenovela 😁 will continue with the statistical models. Once you preprocess the text, you should feed it to an algorithm, right? Stay tuned for how to embed, encode, attend, classify or just run vanilla SGD 😉. In any case, be ready to train a super classifier and code a killer NLU pipeline.


Duygu ALTINOK

Senior NLP Engineer from Berlin. Deepgram member, spaCy contributor. Enjoys quality code, in love with grep. YouTube: https://www.youtube.com/c/NLPwithDuygu