How To Output Nltk Chunks To File?
Solution 1:
Firstly, see this video: https://www.youtube.com/watch?v=0Ef9GudbxXY
Now for the proper answer:
import re
import io
from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser
xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."
chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)
chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent)))
           for sent in sent_tokenize(xstring)]
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')
[out]:
alvas@ubi:~$ python test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(str(chunk)+'\n\n')
TypeError: must be unicode, not str
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
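The TypeError above comes from the file object, not from NLTK: a file opened with io.open in text mode accepts only text (unicode on Python 2, str on Python 3). Here is a minimal sketch of the same behaviour on Python 3, where bytes stands in for the Python 2 str. The file name demo_outfile.txt is made up for the illustration:

```python
import io
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'demo_outfile.txt')

with io.open(path, 'w', encoding='utf8') as fout:
    fout.write(u'(Chunk electronic/JJ library/NN)\n')  # text: accepted
    try:
        fout.write(b'raw bytes')  # bytes: rejected by a text-mode file
        rejected = False
    except TypeError:
        rejected = True

with io.open(path, encoding='utf8') as fin:
    content = fin.read()
```

Only the text write reaches the file; the bytes write raises TypeError before anything is written.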
If you have to stick to python2.7:
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')
[out]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined
And strongly recommended if you must stick with py2.7:
from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')
[out]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
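If adding six as a dependency is not an option, the same idea can be sketched by hand. Treat this as a rough stand-in for six.text_type rather than a description of what six does internally. FakeChunk and the temporary file location are invented so the sketch runs without NLTK:

```python
import io
import os
import sys
import tempfile

# Pick the text type for the running interpreter (hand-rolled stand-in
# for six.text_type; `unicode` is only looked up on Python 2).
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


class FakeChunk(object):
    """Hypothetical stand-in for an NLTK chunk tree."""

    def __str__(self):
        return '(Chunk digital/JJ library/NN)'


chunked = [FakeChunk(), FakeChunk()]
path = os.path.join(tempfile.mkdtemp(), 'outfile')  # throwaway location

with io.open(path, 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk) + u'\n\n')

with io.open(path, encoding='utf8') as fin:
    written = fin.read()
```

The same file is produced on either interpreter, because the write always receives the interpreter's text type.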
Solution 2:
Your code has several problems, though the main culprit is that your for loop does not modify the contents of xstring.
I will address all the issues in your code here:
You cannot write paths like this with a single \, because \t will be interpreted as a tab character and \f as a form feed. You must double the backslashes. I know it was just an example here, but this kind of confusion often arises:
with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
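To see why the single backslashes matter, here is a small check of what Python does with the escapes. The path strings are the illustrative ones from above, not real files. The raw-string form is an alternative I am adding, not something from the question:

```python
# '\t' and '\f' are escape sequences: one character each.
broken = 'path\to\file.txt'
doubled = 'path\\to\\file.txt'
raw = r'path\to\file.txt'  # raw string: backslashes are kept literally

has_tab = '\t' in broken          # the "\t" became a tab
has_form_feed = '\x0c' in broken  # the "\f" became a form feed
same_path = doubled == raw
print(repr(broken))
```

Doubling the backslashes and using a raw string give the same path, so pick whichever reads better in your code.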
The infile.close line that follows is wrong. It does not call the close method; in fact, it does nothing at all. Furthermore, your file was already closed by the with statement. If you see this line in any answer anywhere, please downvote the answer outright with a comment saying that file.close is wrong and should be file.close().
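A quick sketch of both points, using a throwaway temporary file (the path and its contents are assumptions for the demo): the with statement closes the file on exit, and close without parentheses only refers to the method.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'file.txt')  # throwaway demo file
with open(path, 'w') as outfile:
    outfile.write('first line\nsecond line\n')

with open(path, 'r') as infile:
    xstring = infile.readlines()
    open_inside = not infile.closed  # still open inside the block

closed_after_with = infile.closed    # True: the with statement closed it

handle = open(path, 'r')
handle.close                         # no parentheses: nothing is called
still_open = not handle.closed
handle.close()                       # the actual call
closed_now = handle.closed
```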
The following should work, but be aware that it replaces every non-ASCII character with ' ', so it will break words such as naïve and café:
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
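For illustration, here is what the function does to the two words mentioned. The sample string is mine, written with \u escapes so the example does not depend on the source file's encoding:

```python
def remove_non_ascii(line):
    # Keep ASCII characters, replace everything else with a space.
    return ''.join([i if ord(i) < 128 else ' ' for i in line])


sample = u'na\u00efve caf\u00e9'   # "naïve café"
cleaned = remove_non_ascii(sample)
print(repr(cleaned))               # on Python 3: 'na ve caf '
```

Each accented letter turns into a space, so the words are split or truncated rather than transliterated.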
But here is the reason why your code fails with a Unicode exception: you are not modifying the elements of xstring at all. You do calculate the line with the non-ASCII characters removed, but that is a new value which is never stored back into the list:
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
Instead it should be:
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)
or, my preferred and more Pythonic version:
xstring = [remove_non_ascii(line) for line in xstring]
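The difference between these variants can be checked on a small list. The sample lines are invented for the demo:

```python
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])


original = [u'caf\u00e9\n', u'plain\n']

# Rebinding the loop variable: the list itself is untouched.
rebinding = list(original)
for i, line in enumerate(rebinding):
    line = remove_non_ascii(line)

# Assigning by index: the list is updated in place.
in_place = list(original)
for i, line in enumerate(in_place):
    in_place[i] = remove_non_ascii(line)

# List comprehension: builds a new, cleaned list.
rebuilt = [remove_non_ascii(line) for line in original]
```

Only the last two leave you with cleaned lines. The first loop computes them and throws them away.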
That said, these Unicode errors occur mainly because you are using Python 2.7 to handle Unicode text, something that recent Python 3 versions do far better. If you are only just starting on this task, I'd recommend upgrading to Python 3.4+ soon.
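As a rough sketch of what that looks like on Python 3: passing an encoding to open decodes the text while reading, so the non-ASCII characters usually do not need to be stripped at all. The file and its contents are made up for the demo, and utf8 is an assumption about how your input is encoded:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'file.txt')  # throwaway demo file
with open(path, 'w', encoding='utf8') as outfile:
    outfile.write('na\u00efve caf\u00e9\n')

# Decode while reading; the non-ASCII characters survive intact.
with open(path, 'r', encoding='utf8') as infile:
    xstring = infile.readlines()
```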