How to remove parentheses and all data within using Pandas/Python?
I have a dataframe where I want to remove all parentheses and stuff inside it.
I checked out:
How can I remove text within parentheses with a regex?
where the answer for removing the data was
re.sub(r'\([^)]*\)', '', filename)
I tried this as well as
re.sub(r'\(.*?\)', '', filename)
However, I got an error: expected a string or buffer
When I tried using the column df['Column Name'], I got: no item named 'Column Name'.
I checked the dataframe using df.head() and it showed up as a clean table with the column names exactly as I wanted them. However, when I use the re expression to remove the (stuff), it isn't recognizing the column name that I have.
I normally use
df['name'].str.replace(" ()","")
However, I want to remove the parentheses and whatever is inside them. How can I do this using either regex or pandas?
Thanks!
Here is the solution I used. Thanks for the help!
All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\(.*\)", "", regex=True)
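For context, here is a minimal, hypothetical sketch of that line in action (the frame and its values are made up for illustration):
import pandas as pd
All = pd.DataFrame({'Manufacturer Standard Name': ['Acme (US)', 'Globex (EU) Ltd']})
All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\(.*\)", "", regex=True)
print(All['Manufacturer Standard Name'].tolist())  # ['Acme ', 'Globex  Ltd'] - note the leftover spaces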
df['name'].str.replace(r"\(.*\)", "", regex=True)
You can’t run re functions directly on pandas objects; you have to apply them to each element inside the object. So Series.str.replace(r"\(.*\)", "") is just syntactic sugar for Series.apply(lambda x: re.sub(r"\(.*\)", "", x)).
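A quick, hypothetical check of that equivalence (the Series values are made up):
import re
import pandas as pd
s = pd.Series(["foo (bar)", "baz (qux) end"])
via_str_replace = s.str.replace(r"\(.*\)", "", regex=True)
via_apply = s.apply(lambda x: re.sub(r"\(.*\)", "", x))
print(via_str_replace.equals(via_apply))  # True
Calling re.sub(r"\(.*\)", "", s) on the Series itself is what produces the "expected a string or buffer" error from the question, because re.sub only accepts a single string, not a whole Series.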
If you have multiple (...) substrings in the data you should consider using either
All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\(.*?\)", "", regex=True)
or
All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\([^()]*\)", "", regex=True)
The difference is that .*? is slower and does not match line breaks, while [^()] matches any char but ( and ), is quite efficient, and does match line breaks. Given a string like (...(...), the first pattern will match the whole (...(...) but the second will only match the inner (...).
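A small, hypothetical illustration of that difference on a string with a nested ( :
import re
text = "keep (a (b) c) keep"
print(re.sub(r"\(.*?\)", "", text))     # 'keep  c) keep'   - the lazy match runs from the first ( to the first )
print(re.sub(r"\([^()]*\)", "", text))  # 'keep (a  c) keep' - only the innermost (b) is removed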
If you want to normalize all whitespace after removing these substrings, you may consider
All['Manufacturer Standard Name'] = All['Manufacturer Standard Name'].str.replace(r"\s*\([^()]*\)", "", regex=True).str.strip()
The \s*\([^()]*\) regex will match zero or more whitespace characters followed by the parenthesized substring, and then str.strip() will get rid of any leftover leading or trailing whitespace.
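A hedged sketch of that whitespace-aware variant on made-up data:
import pandas as pd
All = pd.DataFrame({'Manufacturer Standard Name': ['  Acme (US)  ', 'Globex (EU) Ltd']})
cleaned = All['Manufacturer Standard Name'].str.replace(r"\s*\([^()]*\)", "", regex=True).str.strip()
print(cleaned.tolist())  # ['Acme', 'Globex Ltd']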
NOTE on regex=True:
According to the Pandas 1.2.0 release notes:
The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).
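In other words, it is safest to pass regex=True explicitly. A small, hypothetical demonstration of the difference:
import pandas as pd
s = pd.Series(["Acme (US)"])
print(s.str.replace(r"\(.*\)", "", regex=True).tolist())   # ['Acme ']       - pattern is treated as a regex
print(s.str.replace(r"\(.*\)", "", regex=False).tolist())  # ['Acme (US)']   - pattern is treated as literal text, so nothing matches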
# removing the unwanted characters
Energy['Country'] = Energy['Country'].str.replace(r" \(.*\)", "", regex=True)
Energy['Country'] = Energy['Country'].str.replace(r"([0-9]+)$", "", regex=True)
These are ways you can also remove the unwanted characters.
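Applied to a made-up Energy frame, those two lines would behave roughly like this (the data is hypothetical):
import pandas as pd
Energy = pd.DataFrame({'Country': ['China2', 'United Kingdom (UK)', 'France']})
Energy['Country'] = Energy['Country'].str.replace(r" \(.*\)", "", regex=True)     # drop " (...)" annotations
Energy['Country'] = Energy['Country'].str.replace(r"([0-9]+)$", "", regex=True)   # drop trailing digits
print(Energy['Country'].tolist())  # ['China', 'United Kingdom', 'France']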
All the answers above seem great; however, the following links provide a better understanding.
a) https://docs.python.org/3/howto/regex.html#regex-howto
b) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
To summarize, to replace an unwanted character, you can use the pandas.DataFrame.replace function. For instance, to remove [] from a dataframe, one can do the following.
import re
p = re.compile(r'\[\]')  # regular expression for matching [] (see reference (a))
result.replace(to_replace=p, value="", inplace=False, regex=True)  # for a dataframe named result, this replaces [] with "" (see reference (b))
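A short, hypothetical usage example of that approach (the result frame and its column are invented for illustration):
import re
import pandas as pd
result = pd.DataFrame({'notes': ['empty []', 'value [] here', 'clean']})
p = re.compile(r'\[\]')
cleaned = result.replace(to_replace=p, value="", inplace=False, regex=True)
print(cleaned['notes'].tolist())  # ['empty ', 'value  here', 'clean']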