Python Regular Expression Nltk Website Extraction

Hi, I have never had to deal with regex before and I'm trying to preprocess some raw text with Python and NLTK. When I tried to tokenize the document using: tokens = nltk.regexp_t

Solution 1:

In this tokenizer, regular expressions specify what the tokens you want to extract from the text look like. I'm a bit confused about which of the many regular expressions above you used, but for a very simple tokenization into non-whitespace tokens you could use:

>>> corpus = "this is a sentence. and another sentence. my homepage is http://test.com"
>>> nltk.regexp_tokenize(corpus, r"\S+")
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

which is equivalent to:

>>> corpus.split()
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

Another approach is to use the NLTK functions nltk.sent_tokenize() and nltk.word_tokenize():

>>> sentences = nltk.sent_tokenize(corpus)
>>> sentences
['this is a sentence.', 'and another sentence.', 'my homepage is http://test.com']
>>> for sentence in sentences:
...     print(nltk.word_tokenize(sentence))
...
['this', 'is', 'a', 'sentence', '.']
['and', 'another', 'sentence', '.']
['my', 'homepage', 'is', 'http', ':', '//test.com']

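Though if your text contains a lot of website URLs, this might not be the best choice, because word_tokenize() splits each URL into several tokens, as the last output line above shows. Information about the different tokenizers in NLTK can be found in the documentation for the nltk.tokenize package.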
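If you want word-level tokens but still need URLs to survive as single tokens, one option is a pattern that tries a URL alternative before falling back to words and punctuation. The following is only a rough sketch using the standard-library re module rather than NLTK; the names TOKEN_PATTERN and tokenize_keep_urls are made up for this example, and the URL branch is deliberately naive (it assumes a URL starts with http://, https:// or www. and runs until the next whitespace, so a trailing full stop would be swallowed into the URL).

```python
import re

# Hypothetical pattern for illustration. Alternatives are tried left to
# right, so the URL branch has to come before the generic word branch.
TOKEN_PATTERN = re.compile(
    r"""
    (?:https?://|www\.)\S+   # a URL, kept as one token (naive: runs to next whitespace)
    | \w+                    # a run of word characters
    | [^\w\s]                # a single punctuation character
    """,
    re.VERBOSE,
)


def tokenize_keep_urls(text):
    """Split text into words and punctuation, keeping URLs in one piece."""
    return TOKEN_PATTERN.findall(text)


corpus = "this is a sentence. and another sentence. my homepage is http://test.com"
print(tokenize_keep_urls(corpus))
```

The same idea should carry over to nltk.regexp_tokenize(), but I have only written it against plain re here, so treat that as an assumption to check.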

If you just want to extract URLs from the corpus, you could use a regular expression like this:

nltk.regexp_tokenize(corpus, r'(?:https?://|www\.)[^"\' ]+')
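A note on the parentheses in a URL pattern like this one: whether the group is capturing or non-capturing matters for what you get back. The small sketch below uses re.findall from the standard library to show the difference. As far as I know, recent NLTK versions build regexp_tokenize() on findall as well, so the same caveat likely applies there, but that depends on your NLTK version and is an assumption you should verify. The variable names are just for illustration.

```python
import re

corpus = "this is a sentence. and another sentence. my homepage is http://test.com"

# Capturing group: findall returns only the text matched by the group,
# so the rest of the URL is lost.
capturing = re.findall(r'(http://|https://|www\.)[^"\' ]+', corpus)

# Non-capturing group (?:...): findall returns the whole match.
non_capturing = re.findall(r'(?:http://|https://|www\.)[^"\' ]+', corpus)

print(capturing)      # ['http://']
print(non_capturing)  # ['http://test.com']
```

Note also that the dot after www is escaped here; an unescaped dot would match any character, not just a literal full stop.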

Hope this helps. If this was not the answer you were looking for, please try to explain a bit more precisely what you want to do and exactly how you want your tokens to look (e.g. an example of the input and the output you would like to have), and we can help you find the right regular expression.
