转载自
https://www.listendata.com/2019/06/python-string-functions.htmlwww.listendata.comSTRING FUNCTIONS IN PYTHON WITH EXAMPLES
This tutorial outlines various string (character) functions used in Python. To manipulate strings and character values, python has several in-built functions. It means you don't need to import or have dependency on any external package to deal with string data type in Python. It's one of the advantage of using Python over other data science tools. Dealing with string values is very common in real-world. Suppose you have customers' full name and you were asked by your manager to extract first and last name of customer. Or you want to fetch information of all the products that have code starting with 'QT'.
Table of Contents
- List of frequently used string functions
- LEFT, RIGHT and MID Functions
- Extract Words from String
- SQL LIKE Operator in Pandas DataFrame
- Find position of a particular character or keyword
- Replace substring
- Find length of string
- Convert to lowercase and uppercase
- Remove Leading and Trailing Spaces
- Convert Numeric to String
- Concatenate or Join Strings
- SQL IN Operator in Pandas
- Extract a particular pattern from string
List of frequently used string functions
The table below shows many common string functions along with description and its equivalent function in MS Excel. We all use MS Excel in our workplace and familiar with the functions used in MS Excel. The comparison of string functions in MS EXCEL and Python would help you to learn the functions quickly and mug-up before interview.
Function Description MS EXCEL FUNCTION
mystring[:N] Extract N number of characters from start of string. LEFT( )
mystring[-N:] Extract N number of characters from end of string RIGHT( )
mystring[X:Y] Extract characters from middle of string, starting from X position and ends with Y MID( )
str.split(sep=' ') Split Strings -
str.replace(old_substring, new_substring) Replace a part of text with different sub-string REPLACE( )
str.lower() Convert characters to lowercase LOWER( )
str.upper() Convert characters to uppercase UPPER( )
str.contains('pattern', case=False) Check if pattern matches (Pandas Function) SQL LIKE Operator
str.extract(regular_expression) Return matched values (Pandas Function) -
str.count('sub_string') Count occurence of pattern in string -
str.find( ) Return position of sub-string or pattern FIND( )
str.isalnum() Check whether string consists of only alphanumeric characters -
str.islower() Check whether characters are all lower case -
str.isupper() Check whether characters are all upper case -
str.isnumeric() Check whether string consists of only numeric characters -
str.isspace() Check whether string consists of only whitespace characters -
len( ) Calculate length of string LEN( )
cat( ) Concatenate Strings (Pandas Function) CONCATENATE( )
separator.join(str) Concatenate Strings CONCATENATE( )LEFT, RIGHT and MID Functions
If you are intermediate MS Excel users, you must have used LEFT, RIGHT and MID Functions. These functions are used to extract N number of characters or letters from string.
1. Extract first two characters from beginning of string
mystring = "Hey buddy, wassup?"
mystring[:2]
Out[1]: 'He'string[start:stop:step]means item start from 0 (default) through (stop-1), step by 1 (default).mystring[:2]is equivalent tomystring[0:2]mystring[:2]tells Python to pull first 2 characters frommystringstring object.- Indexing starts from zero so it includes first, second element and excluding third.
2. Find last two characters of string
mystring[-2:]The above command returnsp?.The-2starts the range from second last position through maximum length of string.
3. Find characters from middle of string
mystring[1:3]
Out[1]: 'ey'mystring[1:3]returns second and third characters.1refers to second character as index begins with 0.
4. How to reverse string?
mystring[::-1]
Out[1]: '?pussaw ,yddub yeH'-1tells Python to start it from end and increment it by 1 from right to left.
5. How to extract characters from string variable in Pandas DataFrame?
Let's create a fake data frame for illustration. In the code below, we are creating a dataframe nameddfcontaining only 1 variable calledvar1
import pandas as pd
df = pd.DataFrame({"var1": ["A_2", "B_1", "C_2", "A_2"]})
var1
0 A_2
1 B_1
2 C_2
3 A_2To deal text data in Python Pandas Dataframe, we can usestrattribute. It can be used for slicing character values.
df['var1'].str[0]In this case, we are fetching first character fromvar1variable. See the output shown below.
Output
0 A
1 B
2 C
3 AExtract Words from String
Suppose you need to take out word(s) instead of characters from string. Generally we consider one blank space as delimiter to find words from string.
1. Find first word of string
mystring.split()[0]
Out[1]: 'Hey'How it works?
split()function breaks string using space as a default separatormystring.split()returns['Hey', 'buddy,', 'wassup?']0returns first item or wordHey
2. Comma as separator for words
mystring.split(',')[0]
Out[1]: 'Hey buddy'3. How to extract last word
mystring.split()[-1]
Out[1]: 'wassup?'4. How to extract word in DataFrame
Let's build a dummy data frame consisting of customer names and call it variablecustname
mydf = pd.DataFrame({"custname": ["Priya_Sehgal", "David_Stevart", "Kasia_Woja", "Sandy_Dave"]})
custname
0 Priya_Sehgal
1 David_Stevart
2 Kasia_Woja
3 Sandy_Dave
#First Word
mydf['fname'] = mydf['custname'].str.split('_').str[0]
#Last Word
mydf['lname'] = mydf['custname'].str.split('_').str[1]Detailed Explanation
str.split( )is similar tosplit( ). It is used to activate split function in pandas data frame in Python.- In the code above, we created two new columns named
fnameandlnamestoring first and last name.
Output
custname fname lname
0 Priya_Sehgal Priya Sehgal
1 David_Stevart David Stevart
2 Kasia_Woja Kasia Woja
3 Sandy_Dave Sandy DaveSQL LIKE Operator in Pandas DataFrame
In SQL, LIKE Statement is used to find out if a character string matches or contains a pattern. We can implement similar functionality in python usingstr.contains( )function.
df2 = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "a_2"],
"var2": ["X_2", "Y_1", "Z_2", "X2"]})
var1 var2
0 AA_2 X_2
1 B_1 Y_1
2 C_2 Z_2
3 a_2 X2How to find rows containing either A or B in variable var1?
df2['var1'].str.contains('A|B')str.contains(pattern)is used to match pattern in Pandas Dataframe.
Output
0 True
1 True
2 False
3 FalseThe above command returns FALSE against fourth row as the function is case-sensitive. To ignore case-sensitivity, we can use case=False parameter. See the working example below.df2['var1'].str.contains('A|B', case=False)How to filter rows containing a particular pattern?
In the following program, we are asking Python to subset data with condition - contain character values either A or B. It is equivalent to WHERE keyword in SQL.
df2[df2['var1'].str.contains('A|B', case=False)]
Output
var1 var2
0 AA_2 X_2
1 B_1 Y_1
3 a_2 X2Suppose you want only those values thathave alphabet followed by '_'
df2[df2['var1'].str.contains('^[A-Z]_', case=False)]^is a token of regular expression which means begin with a particular item.
var1 var2
1 B_1 Y_1
2 C_2 Z_2
3 a_2 X2Find position of a particular character or keyword
str.find(pattern)is used to find position of sub-string. In this case, sub-string is '_'.
df2['var1'].str.find('_')
0 2
1 1
2 1
3 1Replace substring
str.replace(old_text,new_text,case=False)is used to replace a particular character(s) or pattern with some new value or pattern. In the code below, we are replacing _ with -- in variable var1.
df2['var1'].str.replace('_', '--', case=False)
Output
0 AA--2
1 B--1
2 C--2
3 A--2We can also complex patterns like the following program.+means item occurs one or more times. In this case, alphabet occurring 1 or more times.
df2['var1'].str.replace('[A-Z]+_', 'X', case=False)
0 X2
1 X1
2 X2
3 X2Find length of string
len(string)is used to calculate length of string. In pandas data frame, you can applystr.len()for the same.
df2['var1'].str.len()
Output
0 4
1 3
2 3
3 3To find count of occurrence of a particular character (let's say, how many time 'A' appears in each row), you can usestr.count(pattern)function.
df2['var1'].str.count('A')Convert to lowercase and uppercase
str.lower()andstr.upper()functions are used to convert string to lower and uppercase values.
#Convert to lower case
mydf['custname'].str.lower()
#Convert to upper case
mydf['custname'].str.upper()Remove Leading and Trailing Spaces
str.strip()removes both leading and trailing spaces.str.lstrip()removes leading spaces (at beginning).str.rstrip()removes trailing spaces (at end).
df1 = pd.DataFrame({'y1': [' jack', 'jill ', ' jesse ', 'frank ']})
df1['both']=df1['y1'].str.strip()
df1['left']=df1['y1'].str.lstrip()
df1['right']=df1['y1'].str.rstrip()
y1 both left right
0 jack jack jack jack
1 jill jill jill jill
2 jesse jesse jesse jesse
3 frank frank frank frankConvert Numeric to String
With the use ofstr( )function, you can convert numeric value to string.
myvariable = 4
mystr = str(myvariable)Concatenate or Join Strings
By simply using+, you can join two string values.
x = "Deepanshu"
y ="Bhalla"
x+y
DeepanshuBhallaIn case you want to add a space between two strings, you can use this -x+' '+yreturnsDeepanshu BhallaSuppose you have a list containing multiple string values and you want to combine them. You can usejoin( )function.
string0 = ['Ram', 'Kumar', 'Singh']
' '.join(string0)
Output
'Ram Kumar Singh'Suppose you want to combine or concatenate two columns of pandas dataframe.
mydf['fullname'] = mydf['fname'] + ' ' + mydf['lname']
OR
mydf['fullname'] = mydf[['fname', 'lname']].apply(lambda x: ' '.join(x), axis=1)
custname fname lname fullname
0 Priya_Sehgal Priya Sehgal Priya Sehgal
1 David_Stevart David Stevart David Stevart
2 Kasia_Woja Kasia Woja Kasia Woja
3 Sandy_Dave Sandy Dave Sandy DaveSQL IN Operator in Pandas
We can useisin(list)function to include multiple values in our filtering or subsetting criteria.
mydata = pd.DataFrame({'product': ['A', 'B', 'B', 'C','C','D','A']})
mydata[mydata['product'].isin(['A', 'B'])]
product
0 A
1 B
2 B
6 AHow to apply NOT criteria while selecting multiple values?
We can use sign~to tell python to negate the condition.
mydata[~mydata['product'].isin(['A', 'B'])]Extract a particular pattern from string
str.extract(r'regex-pattern')is used for this task.
df2['var1'].str.extract(r'(^[A-Z]_)')r'(^[A-Z]_)'means starts with A-Z and then followed by '_'
0 NaN
1 B_
2 C_
3 NaNTo remove missing values, we can usedropna( )function.
df2['var1'].str.extract(r'(^[A-Z]_)').dropna()