Removing Emojis From A String In Python
Solution 1:
On Python 2, you have to use u''
literal to create a Unicode string. Also, you should pass re.UNICODE
flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')
):
#!/usr/bin/env python
import re
text = u'This dog \U0001f602'
print(text) # with emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji
Output
This dog 😂
This dog
Note: emoji_pattern
matches only some emoji (not all). See Which Characters are Emoji.
Solution 2:
I am updating my answer to this by @jfs because my previous answer failed to account for other Unicode standards such as Latin, Greek etc. StackOverFlow doesn't allow me to delete my previous answer hence I am updating it to match the most acceptable answer to the question.
#!/usr/bin/env python
import re
text = u'This is a smiley face \U0001f602'
print(text) # with emoji
def deEmojify(text):
regrex_pattern = re.compile(pattern = "["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags = re.UNICODE)
return regrex_pattern.sub(r'',text)
print(deEmojify(text))
This was my previous answer, do not use this.
def deEmojify(inputString):
return inputString.encode('ascii', 'ignore').decode('ascii')
Solution 3:
Complete Version of remove Emojis
✍ 🌷 📌 👈🏻 🖥
import re
def remove_emojis(data):
emoj = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002500-\U00002BEF" # chinese char
u"\U00002702-\U000027B0"
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u"\U00010000-\U0010ffff"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u200d"
u"\u23cf"
u"\u23e9"
u"\u231a"
u"\ufe0f" # dingbats
u"\u3030"
"]+", re.UNICODE)
return re.sub(emoj, '', data)
Solution 4:
If you are not keen on using regex, the best solution could be using the emoji python package.
Here is a simple function to return emoji free text (thanks to this SO answer):
import emoji
def give_emoji_free_text(text):
allchars = [str for str in text.decode('utf-8')]
emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI]
clean_text = ' '.join([str for str in text.decode('utf-8').split() if not any(i in str for i in emoji_list)])
return clean_text
If you are dealing with strings containing emojis, this is straightforward
>> s1 = "Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙"
>> print s1
Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend
If you are dealing with unicode (as in the exmaple by @jfs), just encode it with utf-8.
>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog
Edits
Based on the comment, it should be as easy as:
def give_emoji_free_text(text):
return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))
Solution 5:
If you're using the example from the accepted answer and still getting "bad character range" errors, then you're probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is:
emoji_pattern = re.compile(
u"(\ud83d[\ude00-\ude4f])|" # emoticons
u"(\ud83c[\udf00-\uffff])|" # symbols & pictographs (1 of 2)
u"(\ud83d[\u0000-\uddff])|" # symbols & pictographs (2 of 2)
u"(\ud83d[\ude80-\udeff])|" # transport & map symbols
u"(\ud83c[\udde0-\uddff])" # flags (iOS)
"+", flags=re.UNICODE)
Post a Comment for "Removing Emojis From A String In Python"