Module 3
Data Pre-processing and Data Wrangling
Topics Covered
Loading from CSV files, Accessing SQL databases. Cleansing Data with Python:
Stripping out extraneous information, Normalizing data and Formatting data.
Combining and Merging Data Sets – Reshaping and Pivoting – Data
Transformation – String Manipulation, Regular Expressions.
Loading from CSV Files
Pandas features a number of functions for reading tabular data as a DataFrame object, summarized below:
read_csv        Load delimited data from a file, URL, or file-like object; the default delimiter is a comma
read_table      Load delimited data; the default delimiter is a tab ('\t')
read_fwf        Read data in fixed-width column format (no delimiters)
read_clipboard  Version of read_table that reads data from the clipboard
These functions are meant to convert text data into a DataFrame. Their options fall into a few categories:
• Indexing: one or more columns can be treated as the index of the returned DataFrame, and you can choose whether to get column names from the file, from the user, or not at all.
• Type inference and data conversion: this includes user-defined value conversions and custom lists of missing value markers.
• Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
• Unclean data issues: skipping rows or a footer, comments, or other minor things like
numeric data with thousands separated by commas.
Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If the file uses a different delimiter, read_table can be used by specifying the delimiter.
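For example (ex1.csv is the file mentioned above; the pipe-delimited file name is a made-up illustration):
import pandas as pd

df = pd.read_csv('ex1.csv')                   # comma-delimited file
df2 = pd.read_table('ex1_pipe.txt', sep='|')  # any other delimiter via the sep argument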
Suppose we wanted the message column to be the index of the returned DataFrame. We can
either indicate we want the column at index 4 or named 'message' using the index_col
argument:
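For example (a sketch, assuming the file's fifth column is named 'message'):
df = pd.read_csv('ex1.csv', index_col='message')
# equivalently, by position: pd.read_csv('ex1.csv', index_col=4)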
To form a hierarchical index from multiple columns, just pass a list of column numbers or
names:
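For example (the file name and key column names here are made up):
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])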
The parser functions have many additional arguments to help you handle the wide variety of
exception file formats that occur. For example, you can skip the first, third, and fourth rows
of a file with skiprows:
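For example (the file name is made up; row numbers passed to skiprows are 0-based):
df = pd.read_csv('ex4.csv', skiprows=[0, 2, 3])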
• The na_values option can take either a list or set of strings to consider missing values:
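For example (the file name and sentinel strings are made up):
df = pd.read_csv('ex5.csv', na_values=['NULL', 'NA', '-999'])
# a dict applies different sentinels per column
sentinels = {'message': ['foo'], 'something': ['two']}
df = pd.read_csv('ex5.csv', na_values=sentinels)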
Accessing SQL Databases
• A database is a file that is organized for storing data. Most databases are organized
like a dictionary in the sense that they map from keys to values. The biggest
difference is that the database is on disk (or other permanent storage), so it persists
after the program ends. Because a database is stored on permanent storage, it can
store far more data than a dictionary, which is limited to the size of the memory in
the computer.
• Like a dictionary, database software is designed to keep the inserting and accessing
of data very fast, even for large amounts of data. Database software maintains its
performance by building indexes as data is added to the database to allow the
computer to jump quickly to a particular entry.
• There are many different database systems which are used for a wide variety of
purposes including: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite.
• While Python can work with data in SQLite database files directly, many operations can be done
more conveniently using software called the Database Browser for SQLite, which is freely
available from:
http://sqlitebrowser.org/
• Using the browser, you can easily create tables, insert data, edit data, or run simple
SQL queries on the data in the database.
Database concepts
• At first look, a database seems to be a spreadsheet consisting of multiple sheets.
The primary data structures in a database are tables, rows and columns.
• Each table may consist of n attributes (columns) and m tuples (or records).
• Every tuple gives information about one individual. Every cell (i, j) in the table
holds the value of the jth attribute for the ith tuple.
Consider the problem of storing details of students in a database table. The format may
look like –
            Roll No   Name     DOB          Marks
Student 1   1         Akshay   22/10/2001   82.5
Student 2   2         Arun     20/12/2000   81.3
...         ...       ...      ...          ...
Student m   ...       ...      ...          ...
Thus, table columns indicate the type of information to be stored, and each table row gives the
record pertaining to one student. We can create one more table, say department, consisting of
attributes like dept_id, homephno and City. This table can be related to the student table
through the Rollno stored in student and the dept_id stored in department. Thus, there is a
relationship between the two tables in a single database. Software that maintains proper
relationships between multiple tables in a single database is known as a Relational Database
Management System (RDBMS).
import sqlite3
conn = sqlite3.connect('music.db')  # create/open the database file music.db
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
conn.close()
• The connect operation makes a “connection” to the database stored in the file
music.db in the current directory. If the file does not exist, it will be created.
• A cursor is like a file handle that we can use to perform operations on the data
stored in the database. Calling cursor() is very similar conceptually to calling
open() when dealing with text files.
• Once we have the cursor, we can begin to execute commands on the contents of
the database using the execute() method, as shown in the examples below.
Example 1: Write a Python program to create a student table in the college database (with
attributes Name, USN and Marks). Perform insert, delete and retrieve operations on the
student table.
import sqlite3

conn = sqlite3.connect('college.db')
cur = conn.cursor()
print("Opened database successfully")
cur.execute('CREATE TABLE student (name TEXT, usn TEXT, marks INTEGER)')
print("Table created successfully")
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?, ?, ?)', ('akshay', '1rn16mca16', 30))
cur.execute('INSERT INTO student (name, usn, marks) VALUES (?, ?, ?)', ('arun', '1rn16mca17', 65))
print('student')
cur.execute('SELECT name, usn, marks FROM student')
for row in cur:
    print(row)
cur.execute('DELETE FROM student WHERE marks < 40')
cur.execute('SELECT name, usn, marks FROM student')
conn.commit()
cur.close()
conn.close()
Output:
Opened database successfully
Table created successfully
student
('akshay', '1rn16mca16', 30)
('arun', '1rn16mca17', 65)
Example 2: Write a Python program to create a database file (music.sqlite) and a table named
Tracks with two columns: title and plays. Also insert, display and delete the contents of the
table.
import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 200)")
cur.execute("INSERT INTO Tracks (title, plays) VALUES (?, ?)", ('My Way', 15))
conn.commit()
print('Tracks:')
cur.execute('DELETE FROM Tracks WHERE plays < 100')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)
cur.close()
conn.close()
Output
Tracks:
('Thunderstruck', 200)
Stripping out Extraneous Information
Extraneous information refers to irrelevant or unnecessary data that can clutter a dataset and
make it difficult to analyze. This could include duplicate entries, empty fields, or irrelevant
columns. Stripping out this information involves removing it from the dataset, resulting in a more
concise and manageable dataset.
To strip out extraneous information in a Pandas DataFrame, you can use various methods and
functions provided by the library. Some commonly used methods include:
• dropna( ): This method removes rows with missing values (NaN or None) from the DataFrame.
You can specify the axis (0 for rows and 1 for columns) along which the rows or columns with
missing values should be dropped.
Example:
df = df.dropna()
#This will remove all rows that contain at least one missing value.
• drop( ): The drop() method in Pandas is used to remove rows or columns from a DataFrame. It can be
used to drop a single column or multiple columns at once.
df.drop(columns, axis=1, inplace=False)
Ex:
cars2 = cars_data.drop(['Doors','Weight'],axis='columns')
• drop_duplicates( ): This method removes duplicate rows from the DataFrame. You can specify the
columns based on which the duplicates should be checked (the subset parameter).
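For example (a sketch with a made-up frame):
import pandas as pd

df = pd.DataFrame({'name': ['Akshay', 'Arun', 'Akshay'],
                   'marks': [82, 81, 82]})
df_unique = df.drop_duplicates()                  # drop fully identical rows
df_by_name = df.drop_duplicates(subset=['name'])  # check duplicates on the 'name' column only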
• loc[ ] and iloc[ ]: These indexing methods allow you to select specific rows and columns from
the DataFrame. They are used to select only the relevant data and exclude the unwanted
information.
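For example (an illustrative sketch with a made-up DataFrame):
import pandas as pd

df = pd.DataFrame({'name': ['Akshay', 'Arun', 'Bhavana'],
                   'marks': [82.5, 81.3, 90.0],
                   'city':  ['Bengaluru', 'Mysuru', 'Hubli']})
subset1 = df.loc[0:1, ['name', 'marks']]   # label-based: row labels 0-1, selected columns
subset2 = df.iloc[0:2, 0:2]                # position-based: first two rows and columns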
• Filtering: Conditional statements can be used to filter the DataFrame and select only the
rows that meet certain criteria. This allows you to remove unwanted data based on specific
conditions.
Example:
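(a sketch, reusing the made-up df from the previous example)
passed = df[df['marks'] >= 40]                               # keep only rows meeting a condition
top = df[(df['marks'] > 80) & (df['city'] == 'Bengaluru')]   # combine conditions with & (and) / | (or)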
Normalizing Data
Data normalization is the process of transforming data into a consistent format to facilitate
comparison and analysis. This may involve converting data to a common unit of measurement,
formatting dates and times consistently, or standardizing data formats. Normalization ensures
that data is comparable and can be easily processed and analysed.
Normalization is a crucial step in data preprocessing for machine learning tasks. It involves
transforming numerical features to have a mean of 0 and a standard deviation of 1. This process
ensures that all features are on the same scale, enabling efficient and accurate learning by
machine learning algorithms.
Method 1: Normalizing a NumPy array (using sklearn)
We import the required libraries, NumPy and sklearn; note that preprocessing is imported from
sklearn itself, which is why this is called the sklearn normalization method. We create a NumPy
array with some unequal integer values, call the normalize method from preprocessing, and pass
the numpy_array we just created as a parameter. From the results we can see that all the integer
data are now normalized between 0 and 1.
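A minimal sketch of this method (the array values are made up):
import numpy as np
from sklearn import preprocessing

numpy_array = np.array([2, 4, 6, 8, 10, 12])
# normalize() expects a 2-D array, so reshape the 1-D data first
normalized = preprocessing.normalize(numpy_array.reshape(1, -1))
print(normalized)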
Method 2: Normalizing a particular dataset column (using sklearn)
We can also normalize a particular dataset column, which is discussed here. We import the pandas
and sklearn libraries. We create a dummy CSV file and load it with the pandas read_csv function,
then print the CSV file we just loaded. We read a particular column of the CSV file using np.array
and store the result in value_array. We then call the normalize method from preprocessing and
pass value_array as a parameter.
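A sketch, assuming a hypothetical file dummy.csv with a numeric column named Marks (both names are made up):
import numpy as np
import pandas as pd
from sklearn import preprocessing

data = pd.read_csv('dummy.csv')          # hypothetical CSV file
print(data)
value_array = np.array(data['Marks'])    # pull one column out as a NumPy array
normalized = preprocessing.normalize(value_array.reshape(1, -1))
print(normalized)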
Method 3: Normalizing the whole dataset without converting columns to arrays (using sklearn)
In method 2 we discussed how to normalize a particular CSV file column. But sometimes we need to
normalize the whole dataset; then we can use the method below, where we normalize the whole
dataset column-wise (axis=0). If we pass axis=1, it will normalize row-wise; axis=1 is the default
value.
Here we pass the whole CSV file along with one extra parameter, axis=0, which tells the library
that the user wants to normalize the whole dataset column-wise.
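A sketch, again assuming a hypothetical all-numeric dummy.csv:
import pandas as pd
from sklearn import preprocessing

data = pd.read_csv('dummy.csv')                       # hypothetical all-numeric CSV
normalized = preprocessing.normalize(data, axis=0)    # axis=0 -> normalize each column
print(normalized)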
Method 4: Normalizing with MinMaxScaler (using sklearn)
We call MinMaxScaler from the preprocessing module and create an object (min_max_Scalar) for it.
We do not pass any parameters because we want to normalize the data between 0 and 1. But if you
want, you can add your own values, as shown in the next method.
We first read all the names of the columns for later use in displaying the results. Then we call
fit_transform on the created object min_max_Scalar and pass the CSV file into it. We get the
normalized results, which are between 0 and 1.
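A sketch, assuming the hypothetical all-numeric dummy.csv from before:
import pandas as pd
from sklearn import preprocessing

data = pd.read_csv('dummy.csv')                  # hypothetical all-numeric CSV
names = data.columns                             # keep the column names for display
min_max_Scalar = preprocessing.MinMaxScaler()    # default feature_range is (0, 1)
scaled = min_max_Scalar.fit_transform(data)
print(pd.DataFrame(scaled, columns=names))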
Method 5: Normalizing with MinMaxScaler and feature_range (using sklearn)
sklearn also provides the option to change the normalized range to whatever you want. By
default, it normalizes the values between 0 and 1. But there is a parameter called
feature_range, which can set the normalized range according to our requirements.
Here, we call MinMaxScaler from the preprocessing module and create an object
(min_max_Scalar) for it, but we also pass another parameter inside MinMaxScaler
(feature_range), whose value we set to 0 to 2. So now MinMaxScaler will normalize the data
values between 0 and 2. We first read all the names of the columns for later use in displaying
results. Then we call fit_transform on the created object min_max_Scalar and pass the CSV file
into it. We get the normalized results, which are between 0 and 2.
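A sketch continuing the previous one; only the feature_range argument changes:
min_max_Scalar = preprocessing.MinMaxScaler(feature_range=(0, 2))
scaled = min_max_Scalar.fit_transform(data)
print(pd.DataFrame(scaled, columns=names))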
Method 6: Normalizing using pandas (maximum absolute scaling)
We can also normalize the data using pandas; this approach is also very popular. Maximum
absolute scaling normalizes values between 0 and 1. We apply .max() and .abs() as shown below:
we take each column and divide its values by the column's maximum absolute value. We print the
result and confirm that the data is normalized between 0 and 1.
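A sketch of maximum absolute scaling with pandas (the numbers are made up):
import pandas as pd

data = pd.DataFrame({'marks': [20, 40, 60, 80], 'age': [21, 22, 23, 24]})
# divide every column by its maximum absolute value
for column in data.columns:
    data[column] = data[column] / data[column].abs().max()
print(data)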
Method 7: Normalizing using pandas (z-score)
The next method we are going to discuss is the z-score method. This method converts the data to
a standard distribution: it calculates each column's mean, subtracts it from the column, and
finally divides by the standard deviation. The result has mean 0 and standard deviation 1, so
most values fall roughly between -1 and 1.
We calculate the column's mean and subtract it from the column, then divide the column values
by the standard deviation, and print the standardized data.
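A sketch of the z-score approach with pandas (same made-up frame as above):
import pandas as pd

data = pd.DataFrame({'marks': [20, 40, 60, 80], 'age': [21, 22, 23, 24]})
# subtract the column mean and divide by the standard deviation
for column in data.columns:
    data[column] = (data[column] - data[column].mean()) / data[column].std()
print(data)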
One popular library is Scikit-Learn, which offers the StandardScaler class for normalization.
Here's an example of how to use StandardScaler to normalize a dataset:
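(a sketch with a made-up DataFrame)
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'age': [22, 25, 47, 52],
                   'salary': [20000, 25000, 47000, 52000]})
scaler = StandardScaler()                  # transforms each column to mean 0, std 1
scaled = scaler.fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))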
Formatting Data:
• Formatting data in Pandas involves transforming and presenting data in a structured and
readable manner. Pandas, a popular Python library for data analysis, offers various methods
and techniques to format data effectively.
• One of the key features of Pandas is its ability to handle different data types and structures.
It provides specific formatting options for each data type, ensuring that data is displayed in
a consistent and meaningful way. For example, numeric data can be formatted with specific
number of decimal places, currency symbols, or percentage signs. Date and time data can be
formatted in various formats, such as "dd/mm/yyyy" or "hh:mm:ss".
• Pandas also allows users to align data within columns, making it easier to read and compare
values. This can be achieved using the "justify" parameter, which takes values such as "left",
"right", or "center". Additionally, Pandas provides options to control the width of columns,
ensuring that data is presented in a visually appealing manner.
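A small illustrative sketch of a few formatting options (the frame and formats here are made up):
import pandas as pd

df = pd.DataFrame({'name': ['Akshay', 'Arun'],
                   'marks': [82.456, 81.3],
                   'joined': pd.to_datetime(['2001-10-22', '2000-12-20'])})
pd.options.display.float_format = '{:.2f}'.format    # two decimal places for floats
df['joined'] = df['joined'].dt.strftime('%d/%m/%Y')  # dates as dd/mm/yyyy strings
print(df.to_string(justify='right'))                 # control header justification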
Combining and Merging Data Sets
Data contained in pandas objects can be combined together in a number of built-in ways:
• pandas.merge connects rows in DataFrames based on one or more keys. This will be
familiar to users of SQL or other relational databases, as it implements database join
operations.
• combine_first instance method enables splicing together overlapping data to fill in missing
values in one object with values from another.
import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
• The example below shows a many-to-one merge situation; the data in df1 has multiple rows
labeled a and b, whereas df2 has only one row for each value in the key column.
Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection. The outer join takes the
union of the keys, combining the effect of applying both left and right joins.
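As a sketch of the expected calls (using df1 and df2 defined above):
pd.merge(df1, df2, on='key')               # inner join (default): intersection of the keys
pd.merge(df1, df2, on='key', how='outer')  # outer join: union of the keys, NaN where missing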
Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in
the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method
only affects the distinct key values appearing in the result:
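An illustrative many-to-many sketch (these frames are made up):
df_left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df_right = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
# three 'b' rows on the left x two 'b' rows on the right -> six 'b' rows in the result
pd.merge(df_left, df_right, on='key', how='inner')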
Merging on Index
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case,
you can pass left_index=True or right_index=True (or both) to indicate that the index should be
used as the merge key.
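A sketch of merging on an index (made-up frames):
left = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right = pd.DataFrame({'group_val': [3.5, 7.0]}, index=['a', 'b'])
pd.merge(left, right, left_on='key', right_index=True)   # right frame's index is its merge key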
Another kind of data combination operation is concatenation with pandas.concat. By default,
concat works along axis=0, producing another Series. If you pass axis=1, the result will instead
be a DataFrame (axis=1 is the columns):
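A sketch with two small made-up Series:
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
pd.concat([s1, s2])            # axis=0 (default): one longer Series
pd.concat([s1, s2], axis=1)    # axis=1: a DataFrame with one column per input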
Reshaping and Pivoting
There are a number of fundamental operations for rearranging tabular data. These are
alternately referred to as reshape or pivot operations.
• Using the stack method on a DataFrame pivots the columns into the rows, producing a
Series.
• From a hierarchically-indexed Series, we can rearrange the data back into a DataFrame with
unstack.
• By default the innermost level is unstacked (same with stack). You can unstack a different
level by passing a level number or name:
• Unstacking might introduce missing data if all of the values in the level aren’t found in each
of the subgroups:
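A small stack/unstack sketch (the frame is made up):
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()      # columns pivoted into rows -> Series with a MultiIndex
result.unstack()           # innermost level back into columns
result.unstack('state')    # unstack a specific level by name instead of the innermost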
• Data is frequently stored in so-called long or stacked format in relational databases like
MySQL, as a fixed schema allows the number of distinct values in the item column to increase or
decrease as data is added or deleted in the table.
• The data may not be easy to work with in long format; it is often preferable to have a
DataFrame containing one column per distinct item value, indexed by the timestamps in the date
column.
The pivot() function is used to reshape a given DataFrame organized by given index / column
values. This function does not support data aggregation; multiple values will result in a
MultiIndex in the columns.
Syntax:
DataFrame.pivot(self, index=None, columns=None, values=None)
Example:
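(an illustrative sketch with made-up long-format data)
import pandas as pd

ldata = pd.DataFrame({
    'date':  ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'item':  ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.3, 0.0, 2778.8, 2.3]})
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted)   # one column per distinct item, indexed by date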
Suppose you had two value columns that you wanted to reshape simultaneously:
• By omitting the last argument, you obtain a DataFrame with hierarchical columns:
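Continuing the sketch above with a second (made-up) value column:
ldata['value2'] = [1.2, 3.4, 5.6, 7.8]
pivoted_all = ldata.pivot(index='date', columns='item')   # values argument omitted
print(pivoted_all)   # hierarchical columns: (value, item) and (value2, item)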
Data transformation
Data transformation is the process of converting raw data into a format that is suitable for
analysis and modeling. It's an essential step in data science and analytics workflows, helping to
unlock valuable insights and make informed decisions.
A few of the data transformation mechanisms are:
• Removing Duplicates
• Replacing Values
• Renaming Axis Indexes
• Discretization and Binning
• Detecting and Filtering Outliers
• Permutation and Random Sampling
i) Removing duplicates
Duplicate rows may be found in a DataFrame using the duplicated method, which returns a
boolean Series indicating whether each row is a duplicate or not. Relatedly,
drop_duplicates returns a DataFrame containing only the rows where the duplicated array is
False (i.e., with the duplicates removed).
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
data.duplicated()
data.drop_duplicates()
Detecting and Filtering Outliers
• Suppose we wanted to find values in one of the columns exceeding one in magnitude:
• To select all rows having a value exceeding 1 or -1, we can use the any method on a
boolean DataFrame:
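A sketch with made-up normally distributed data:
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(1000, 4))
col = data[2]
col[np.abs(col) > 1]                     # values in one column exceeding 1 in magnitude
data[(np.abs(data) > 1).any(axis=1)]     # rows with any value exceeding 1 or -1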
Replacing Values
• Sometimes it is necessary to replace certain specific values (such as sentinel values) with NaN
or with some other value. This can be done using the replace method. Let's consider this Series:
• The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:
data.replace(-999, np.nan)
• If we want to replace multiple values at once, we instead pass a list and then the substitute
value:
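A sketch of replace on a small made-up Series:
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace(-999, np.nan)                 # one sentinel value
data.replace([-999, -1000], np.nan)        # several sentinels at once
data.replace([-999, -1000], [np.nan, 0])   # a different substitute for each value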
Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or mapping
of some form to produce new, differently labeled objects. The axes can also be modified in
place without creating a new data structure.
import pandas as pd
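The rename calls below assume a small DataFrame like this (the data is made up; the dict-based rename further assumes the index has been upper-cased first):
import numpy as np

data = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data.index = data.index.map(str.upper)   # index becomes OHIO, COLORADO, NEW YORK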
• To create a transformed version of a data set without modifying the original, a useful
method is rename:
data.rename(index=str.title, columns=str.upper)
• rename can be used in conjunction with a dict-like object providing new values for a subset
of the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
Discretization and Binning
• Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose we have data about a group of people in a study, and we want to group them into
discrete age buckets:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do
so, we have to use cut, a function in pandas:
import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
• The object pandas returns is a special Categorical object. We can treat it like an array of
strings indicating the bin name; internally it contains a levels array indicating the distinct
category names along with a labeling for the ages data in the labels attribute (in current
versions of pandas, these are exposed as cats.categories and cats.codes):
cats.labels
cats.levels
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
pd.value_counts(cats)
Consistent with mathematical notation for intervals, a parenthesis means that the side is
open while the square bracket means it is closed (inclusive).
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the
numpy.random.permutation function. Calling permutation with the length of the axis you
want to permute produces an array of integers indicating the new ordering:
sampler = np.random.permutation(5)
sampler
array([1, 0, 2, 3, 4])
That array can then be used in iloc-based indexing (ix in older pandas) or the take function:
df.take(sampler)
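A runnable sketch with a made-up 5 x 4 frame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
sampler = np.random.permutation(5)   # e.g. array([1, 0, 2, 3, 4])
df.take(sampler)                     # rows reordered according to sampler
df.iloc[sampler]                     # equivalent positional indexing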
Computing Indicator / Dummy Variables
Another type of transformation for statistical modeling or machine learning applications
is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a
DataFrame has k distinct values, you would derive a matrix or DataFrame containing k
columns of 1's and 0's. pandas has a get_dummies function for doing this, though devising one
yourself is not difficult. Let's return to an earlier example DataFrame:
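The calls below assume a small made-up frame like this:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})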
pd.get_dummies(df['key'])
In some cases, you may want to add a prefix to the columns in the indicator DataFrame,
which can then be merged with the other data. get_dummies has a prefix argument for
doing just this:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
String Manipulation
Python has long been a popular data munging language in part due to its ease-of-use for
string and text processing. Most text operations are made simple with the string object’s built-
in methods. For more complex pattern matching and text manipulations, regular expressions
may be needed. pandas adds to the mix by enabling you to apply string and regular
expressions concisely on whole arrays of data, additionally handling the annoyance of missing
data.
Regular Expressions
Regular expressions (regex) provide a flexible way to search for or match string patterns in
text. In Python, the built-in re module is used to work with Regular Expressions.
RegEx Functions
The re module offers a set of functions that allow us to search a string for a match.
By using these functions, we can search for the required pattern. They are as follows:
• match(): re.match() determines if the RE matches at the beginning of the string. The
method returns a match object if the search is successful. If not, it returns None.
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output:
Search successful.
• search(): The search( ) function searches the string for a match and returns a Match
object if there is a match. If there is more than one match, only the first
occurrence of the match is returned.
import re

pattern = 'Tutorials'
line = 'Python Tutorials'
result = re.search(pattern, line)
print(result)
print(result.group())

Output:
<re.Match object; span=(7, 16), match='Tutorials'>
Tutorials
• findall(): Finds all substrings where the RE matches and returns them as a list. It
searches from the start to the end of the given string and returns all occurrences of the
pattern. While searching for a pattern, it is often recommended to use re.findall(), since it
works like both re.search() and re.match().
import re

str = "The rain in Spain"
x = re.findall("ai", str)
print(x)

Output:
['ai', 'ai']
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a
special meaning:
\d          Returns a match where the string contains digits (numbers from 0-9)
\D          Returns a match where the string DOES NOT contain digits
[0-9]       Returns a match for any digit between 0 and 9
            >>> str = "8 times before 11:45 AM"
            >>> re.findall("[0-9]", str)
            ['8', '1', '1', '4', '5']
[0-5][0-9]  Returns a match for any two-digit number from 00 to 59
            >>> re.findall("[0-5][0-9]", str)
            ['11', '45']
[+]         In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means:
            return a match for any + character in the string
            >>> re.findall("[+]", str)
            []
EX:1 Search for lines that start with 'F', followed by 2 characters, followed by 'm:'
import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)

Output:
From: Bengaluru^560098
From:<[email protected]>
From: <[email protected]>
The regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”,
or “F!@m:” since the period characters in the regular expression match any character.
Ex:2 Search for lines that start with From and have an at sign
import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)

Output:
From:<[email protected]>
From: <[email protected]>

The search string ^From:.+@ will successfully match lines that start with “From:”,
followed by one or more characters (.+), followed by an at-sign.
Ex:1 Extract anything that looks like an email address from the line.
import re
s = 'A message from [email protected] to [email protected] about meeting @2PM'
lst = re.findall('\S+@\S+', s)
print(lst)
Output: ['[email protected]', '[email protected]']
Translating the regular expression, we are looking for substrings that have at least one
non-whitespace character, followed by an at-sign, followed by at least one more
non-whitespace character.
Ex:3 Search for lines that have an at sign between characters. The characters
must be a letter or number.
import re

hand = open('pattern.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('[a-zA-Z0-9]\S+@\S+[a-zA-Z]', line)
    if len(x) > 0:
        print(x)

Output:
['From:[email protected]']
['[email protected]']
Here, we are looking for substrings that start with a single lowercase letter, uppercase
letter, or number [a-zA-Z0-9], followed by one or more non-blank characters (\S+), followed by
an at-sign, followed by one or more non-blank characters (\S+), followed by an uppercase or
lowercase letter. Remember that the * or + applies to the single character immediately to the
left of the plus or asterisk.
The following examples read a file (file.txt) whose lines look like:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
Ex:1 Search for lines that start with 'X' followed by any non whitespace
characters and ':' followed by a space and any number. The number can
include a decimal.
import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.search('^X-.*: ([0-9.]+)', line)
    if x:
        print(x.group())

Output:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
• But if we want only the numbers in the above output, we could use the split() function on
the extracted string. However, it is better to refine the regular expression. To do so, we need
the help of parentheses.
When we add parentheses to a regular expression, they are ignored when matching
the string (with search()). But when we are using findall(), parentheses indicate that
while we want the whole expression to match, we are only interested in extracting a
portion of the substring that matches the regular expression.
Ex:2 Search for lines that start with 'X' followed by any non whitespace
characters and ':' followed by a space and any number. The number can include
a decimal. Then print the number if it is greater than zero.
import re

hand = open('file.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^X.*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Output:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
• Let us consider another example, assume that the file contain lines of the form:
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
If we wanted to extract all of the revision numbers (the integer number at the end of
these lines) using the same technique as above, we could write the following program:
Ex:3 Search for lines that start with 'Details: rev=' followed by numbers and '.'
Then print the number if it is greater than zero.
import re

str = "Details:http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772"
x = re.findall('^Details:.*rev=([0-9]+)', str)
if len(x) > 0:
    print(x)

Output:
['39772']
In the above example, we are looking for lines that start with Details:, followed by
any number of characters (.*), followed by rev=, and then by one or more digits. We
want to find lines that match the entire expression but we only want to extract the
integer number at the end of the line, so we surround [0-9]+ with parentheses.
Note that the expression [0-9]+ is greedy: it keeps grabbing digits until it finds any
character other than a digit, so it extracts as large a string of digits as possible.
Escape character
Characters like dot, plus, question mark, asterisk, dollar, etc. are metacharacters in
regular expressions. Sometimes we need these characters themselves as part of the
matching string. Then we need to escape them using a backslash. For example,
import re

x = 'We just received $10.00 for cookies.'
y = re.search('\$[0-9.]+', x)
print("matched string:", y.group())

Output:
matched string: $10.00
Here, we want to extract only the price $10.00. Since the $ symbol is a metacharacter, we
need to put \ before it, so that $ is treated as part of the matching string and not
as a metacharacter.
Question Bank