
Pandas Dataframe [cell=(label,value)], Split Into 2 Separate Dataframes

I found an awesome way to parse HTML with pandas. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes. Notice how each c

Solution 1:

You can do it this way.

df1, the labels with the parenthesized numbers removed:

In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)

In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska  \
0                  Bluehead                          Copper Rockfish
1                 Blue Tang                                  Lingcod
2      Stoplight Parrotfish                        Painted Greenling
3        Bicolor Damselfish                           Sunflower Star
4              French Grunt                          Plumose Anemone

0                      Hawaii Tropical Eastern Pacific  \
0               Saddle Wrasse           King Angelfish
1  Hawaiian Whitespotted Toby          Mexican Hogfish
2       Raccoon Butterflyfish               Barberfish
3            Manybar Goatfish            Flag Cabrilla
4                Moorish Idol   Panamic Sergeant Major

0              South Pacific Northeast US and Eastern Canada  \
0            Regal Angelfish                          Cunner
1  Bluestreak Cleaner Wrasse                 Winter Flounder
2           Manybar Goatfish                     Rock Gunnel
3             Brushtail Tang                         Pollock
4       Two-spined Angelfish                  Grubby Sculpin

0 South Atlantic States       Central Indo-Pacific
0         Slippery Dick               Moorish Idol
1       Belted Sandfish       Three-spot Dascyllus
2        Black Sea Bass  Bluestreak Cleaner Wrasse
3               Tomtate     Blacklip Butterflyfish
4                Cubbyu        Clark's Anemonefish

and df2, the numbers alone:

In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)

In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii  \
0                        85                                     54.6     92
1                      84.8                                     53.2   85.8
2                        81                                     50.8   85.7
3                      79.9                                     50.2   85.7
4                      74.8                                     49.7   82.9

0 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada  \
0                     85.7            79                            67.4
1                     82.5          77.3                            46.6
2                     75.2          73.9                            26.2
3                     68.9          73.3                            25.2
4                     67.9          72.8                            23.7

0 South Atlantic States Central Indo-Pacific
0                  79.7                 80.1
1                  78.5                 75.6
2                  78.5                 73.5
3                  72.7                 71.4
4                  65.7                 70.2
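
For anyone who wants to try both replacements without the original table, here is a minimal self-contained sketch. The `sample` frame is hypothetical: it pairs a few labels and numbers from the outputs above and assumes each cell looks like "label (number)". The names `sample`, `labels` and `values` are made up for this example. Since a regex replace returns strings, the sketch also converts the numbers with `astype(float)`, which is an extra step not shown in the session above.

```python
import pandas as pd

# Hypothetical sample; the exact "label (number)" cell layout is an assumption.
sample = pd.DataFrame({
    "Hawaii": ["Saddle Wrasse (92)", "Hawaiian Whitespotted Toby (85.8)"],
    "South Pacific": ["Regal Angelfish (79)", "Bluestreak Cleaner Wrasse (77.3)"],
})

# Labels: strip the optional whitespace and the parenthesized number.
labels = sample.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)

# Values: replace the whole cell with the captured number,
# then turn the resulting strings into floats.
values = sample.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True).astype(float)

print(labels)
print(values)
```

The conversion only works if every cell matched the pattern. A cell without a parenthesized number is left unchanged by the replace and would make `astype(float)` raise an error.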

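Regex explanation:

We want to remove everything from the cell except the number in parentheses, so the pattern is built up in three steps:

(\d+\.*\d*) - group 1, the number itself: one or more digits, then any number of dots, then zero or more digits. Note that \.* accepts more than one dot; \.? would be stricter, but the looser form is enough for this data.

\((\d+\.*\d*)\) - the number wrapped in literal parentheses

.*\((\d+\.*\d*)\).* - the whole cell: anything before '(', then '(', the number, ')', and anything up to the end of the cell

Because the pattern matches the entire cell, replacing the match with \1 (group 1) leaves only the number.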
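
The pattern can also be checked on a single string with Python's built-in `re` module, outside pandas. This is only a sketch: the cell texts are hypothetical, and the stricter `strict` pattern is a suggested variation rather than part of the answer above.

```python
import re

# Same pattern as used for df2.
pattern = re.compile(r'.*\((\d+\.*\d*)\).*')

cell = "Blue Tang (84.8)"            # hypothetical cell text
number = pattern.fullmatch(cell).group(1)
replaced = pattern.sub(r'\1', cell)  # whole cell replaced by group 1

# A cell with no parenthesized number does not match and stays unchanged.
untouched = pattern.sub(r'\1', "No number here")

# The number sub-pattern is loose: \.* accepts several dots in a row.
loose = re.fullmatch(r'\d+\.*\d*', '1..5')
# A stricter variant (an assumption, not from the answer) rejects that input.
strict = re.fullmatch(r'\d+(?:\.\d+)?', '1..5')

print(number, replaced, untouched, loose is not None, strict is None)
```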
