Pandas Dataframe [cell=(label,value)], Split Into 2 Separate Dataframes
I found an awesome way to parse html with pandas. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes. Notice how each c
Solution 1:
you can do it this way:
DF1:
In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)
In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska \
0 Bluehead Copper Rockfish
1 Blue Tang Lingcod
2 Stoplight Parrotfish Painted Greenling
3 Bicolor Damselfish Sunflower Star
4 French Grunt Plumose Anemone
0 Hawaii Tropical Eastern Pacific \
0 Saddle Wrasse King Angelfish
1 Hawaiian Whitespotted Toby Mexican Hogfish
2 Raccoon Butterflyfish Barberfish
3 Manybar Goatfish Flag Cabrilla
4 Moorish Idol Panamic Sergeant Major
0 South Pacific Northeast US and Eastern Canada \
0 Regal Angelfish Cunner
1 Bluestreak Cleaner Wrasse Winter Flounder
2 Manybar Goatfish Rock Gunnel
3 Brushtail Tang Pollock
4 Two-spined Angelfish Grubby Sculpin
0 South Atlantic States Central Indo-Pacific
0 Slippery Dick Moorish Idol
1 Belted Sandfish Three-spot Dascyllus
2 Black Sea Bass Bluestreak Cleaner Wrasse
3 Tomtate Blacklip Butterflyfish
4 Cubbyu Clark's Anemonefish
and DF2
In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)
In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii \
08554.692184.853.285.828150.885.7379.950.285.7474.849.782.90 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada \
085.77967.4182.577.346.6275.273.926.2368.973.325.2467.972.823.70 South Atlantic States Central Indo-Pacific
079.780.1178.575.6278.573.5372.771.4465.770.2
RegEx debugging and explanation:
we basically want to remove everything, except number in parentheses:
(\d+\.*\d*)
- group(1) - it's our number
\((\d+\.*\d*)\)
- our number in parentheses
.*\((\d+\.*\d*)\).*
- the whole thing - anything before '(', '(', our number, ')', anything till the end of the cell
it will be replaced with the group(1) - our number
Post a Comment for "Pandas Dataframe [cell=(label,value)], Split Into 2 Separate Dataframes"