Extracting @mentions From Tweets Using Findall Python (giving Incorrect Results)
Solution 1:
You can use the str.findall method to avoid the for loop. Use a negative lookbehind, (?<![@\w]), in place of (^|[^@\w]), which forms an extra capture group that you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![@\w])@(\w{1,25})').apply(','.join)
df
#   text                                                mention
#0  RT @CritCareMed: New Article: Male-Predominant...  CritCareMed
#1  #CRISPR Inversion of CTCF Sites Alters Genome ...  CellCellPress
#2  RT @gvwilson: Where's the theory for software ...  gvwilson
#3  RT @sciencemagazine: What’s killing off the se...  sciencemagazine
#4  RT @MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous
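To see why the extra capture group matters, here is a minimal sketch using plain re.findall. The tweet text is a made-up example (not a row from the original data), and the variable names are only illustrative.

```python
import re

# Hypothetical tweet, loosely modelled on the rows above
tweet = "RT @MHendr1cks: Eve Marder describes a horror cc @nucAmbiguous, a@b.com"

# Two capture groups: findall returns a tuple per match,
# including the character that preceded the @
with_group = re.findall(r'(^|[^@\w])@(\w{1,25})', tweet)

# Negative lookbehind: the preceding character is checked but not captured,
# so findall returns plain strings
with_lookbehind = re.findall(r'(?<![@\w])@(\w{1,25})', tweet)

print(with_group)       # [(' ', 'MHendr1cks'), (' ', 'nucAmbiguous')]
print(with_lookbehind)  # ['MHendr1cks', 'nucAmbiguous']
```

Both patterns skip the @ in a@b.com, because it is preceded by a word character.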
Also, X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0]. A better way to iterate through a column is items (named iteritems before pandas 2.0, where it was removed):
import re
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, s in df.text.items():
    result = re.findall(r"(?<![@\w])@(\w{1,25})", s)
    print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
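The difference between a data frame slice and a single cell can be checked directly. This is a small sketch that assumes pandas is installed. The two-row frame is invented for illustration.

```python
import re
import pandas as pd

# Hypothetical two-row frame standing in for the tweet data
df = pd.DataFrame({"text": [
    "RT @CritCareMed: New Article",
    "RT @MHendr1cks: a horror story cc @nucAmbiguous",
]})

pattern = re.compile(r"(?<![@\w])@(\w{1,25})")

frame_slice = df.iloc[:1, :]  # still a DataFrame, even with one row
cell = df.text.iloc[0]        # the actual Python string in the cell

# Iterate over (index, value) pairs of the column
mentions = [",".join(pattern.findall(s)) for _, s in df.text.items()]
print(type(frame_slice).__name__, type(cell).__name__)
print(mentions)
```

Calling str() on frame_slice would also include the header and index, and long text may be shortened with "...", so a regex run on it may not see the full tweet.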
Solution 2:
While you already have your answer, you could even try to optimize the whole import process like so:
import re, pandas as pd
rx = re.compile(r'@([^:\s]+)')
with open("test.txt") as fp:
    # build [text, mentions] rows lazily, one per line of the file
    dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
    df = pd.DataFrame(dft, columns = ['text', 'mention'])
print(df)
Which yields:
text mention
0 RT @CritCareMed: New Article: Male-Predominant... CritCareMed
1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
2 RT @gvwilson: Where's the theory for software ... gvwilson
3 RT @sciencemagazine: What’s killing off the se... sciencemagazine
4 RT @MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
This might be a bit faster, as you don't need to change df once it's already constructed.
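One thing to keep in mind is that the pattern here is looser than the one in Solution 1. The sketch below compares the two on a made-up line (not from the original data). Which behaviour is preferable depends on how clean the tweets are.

```python
import re

# Hypothetical sample line with trailing punctuation and an email address
line = "thanks @bob, mail me at a@b.com"

loose = re.compile(r'@([^:\s]+)')              # pattern from Solution 2
strict = re.compile(r'(?<![@\w])@(\w{1,25})')  # pattern from Solution 1

loose_hits = loose.findall(line)
strict_hits = strict.findall(line)

print(loose_hits)   # ['bob,', 'b.com']
print(strict_hits)  # ['bob']
```

The loose pattern stops only at a colon or whitespace, so it keeps the comma and also matches the @ inside the email address.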
Solution 3:
mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)')
This is the same approach as in "Extract hashtags from columns of a pandas dataframe", but for mentions.
@.*? carries out a non-greedy match for a word starting with @
(?=\s|$) is a look-ahead for the end of the word or the end of the sentence
(?:(?<=\s)|(?<=^)) is a look-behind to ensure there are no false positives if a @ is used in the middle of a word
The look-behind asserts that either a space or the start of the sentence must precede the @ character.
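As a quick check of the pieces described above, here is a sketch that runs the same pattern through re.findall on an invented string. Note that the matches keep the leading @ and any trailing punctuation. The cleanup step at the end is an assumption about what you might want, not part of the original answer.

```python
import re

pattern = re.compile(r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)')

# Hypothetical sample: mention at the start, after a space,
# inside an email address, and at the end
sample = "@start RT @gvwilson: hi a@b.com @end"

raw = pattern.findall(sample)
print(raw)  # ['@start', '@gvwilson:', '@end']

# Optional cleanup (assumed): drop the @ and trailing punctuation
names = [m.lstrip('@').rstrip(':,.') for m in raw]
print(names)  # ['start', 'gvwilson', 'end']
```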