
Data Gathering

CSV
Opening a local .csv file

import pandas as pd
df = pd.read_csv('file_name.csv')

Opening a .csv file from a URL

import requests
from io import StringIO

url = ""
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)
df = pd.read_csv(data)
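If the server does not block pandas' default user agent, read_csv can also take the URL directly (a minimal sketch; the URL is a placeholder):

import pandas as pd

# read_csv accepts an HTTP(S) URL directly, so requests/StringIO are only
# needed when custom headers are required
df = pd.read_csv('https://example.com/file_name.csv')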

sep parameter

For a .tsv file, use sep='\t':

df = pd.read_csv('file_name.tsv', sep='\t')

If the column names are not in the file, pass them with names:

df = pd.read_csv('file_name.tsv', sep='\t', names=['---', '---'])
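For example (the column names here are made up, just to show the shape of the list):

df = pd.read_csv('file_name.tsv', sep='\t', names=['id', 'name', 'price'])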

index_col parameter (use a column as the index)

df = pd.read_csv('file_name.csv', index_col='column_name')
header parameter (use a given row as the column header)

df = pd.read_csv('file_name.csv', header=1)

usecols parameter (load only specific columns)

df = pd.read_csv('file_name.csv', usecols=['---', '---'])

skiprows/nrows parameters

df = pd.read_csv('file_name.csv', skiprows=[1, 5])
df = pd.read_csv('file_name.csv', nrows=100)  # read only the first 100 rows
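skiprows also accepts a function; a minimal sketch that keeps the header row and every second data row (handy for quickly sampling a big file):

# the callable receives each row index and returns True to skip that row;
# row 0 (the header) is kept, odd-numbered rows are skipped
df = pd.read_csv('file_name.csv', skiprows=lambda x: x > 0 and x % 2 != 0)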

encoding parameter

df = pd.read_csv('file_name.csv', encoding='')  # e.g. 'utf-8' or 'latin-1'

Skip bad lines

df = pd.read_csv('file_name.csv', on_bad_lines='skip')   # pandas >= 1.3
df = pd.read_csv('file_name.csv', error_bad_lines=False) # older pandas (deprecated)
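If the encoding is unknown, it can be guessed with the third-party chardet package (an assumption: this requires pip install chardet, and the detection is a heuristic, not a guarantee):

import chardet

with open('file_name.csv', 'rb') as f:
    result = chardet.detect(f.read(100_000))  # sample the first ~100 KB
print(result)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}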

dtype parameter (set the data type of one or more columns)

df = pd.read_csv('file_name.csv', dtype={'col_name': 'data_type'})  # e.g. 'int32'

parse_dates parameter

read_csv loads dates as object/string by default; parse_dates converts them:

df = pd.read_csv('file_name.csv', parse_dates=['column_name'])

This parses the text in that column into datetime values.
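A date column can also be converted after loading with pd.to_datetime (a minimal sketch; 'column_name' is the same placeholder as above):

df = pd.read_csv('file_name.csv')
df['column_name'] = pd.to_datetime(df['column_name'])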


converters parameter

def fun_name(value):
    if value == "value":
        return "new value"
    return value

df = pd.read_csv('file_name.csv', converters={'column_name': fun_name})
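A self-contained version of the same idea, using an in-memory CSV so it runs as-is (the data and column names are made up):

import pandas as pd
from io import StringIO

csv_text = "city,temp\nNYC,hot\nLA,cold\n"

def fix_temp(value):
    # converters receive each raw cell value of the chosen column
    if value == "hot":
        return "HOT"
    return value

df = pd.read_csv(StringIO(csv_text), converters={'temp': fix_temp})
print(df)  # the 'temp' column now contains 'HOT' and 'cold'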


na_values parameter (treat given strings as NaN)

df = pd.read_csv('file_name.csv', na_values=['---', '----'])
Loading a huge dataset

dfs = pd.read_csv('file_name.csv', chunksize=5000)  # rows per chunk
for chunk in dfs:
    print(chunk.shape)
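Chunks are usually filtered or aggregated so the whole file never sits in memory at once; a minimal sketch, assuming a numeric column named 'price':

chunks = pd.read_csv('file_name.csv', chunksize=5000)
total = sum(chunk['price'].sum() for chunk in chunks)  # aggregate chunk by chunk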
JSON

Load a local file:

import pandas as pd
df = pd.read_json('file_name.json')

From a URL:

df = pd.read_json('URL')
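When the JSON is nested, pd.json_normalize flattens it into columns (a minimal sketch with made-up records):

import pandas as pd

records = [{'id': 1, 'user': {'name': 'a'}}, {'id': 2, 'user': {'name': 'b'}}]
df = pd.json_normalize(records)  # columns: 'id' and 'user.name'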

SQL

Load the .sql dump into MySQL first, then query it from pandas.

Installing: !pip install mysql-connector-python

import mysql.connector
conn = mysql.connector.connect(host='localhost',  # or an IP address
                               user='root',
                               password='',
                               database='db_name')
df = pd.read_sql_query("SELECT * FROM table_name", conn)
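read_sql_query works with any DB-API connection, so the same call can be tried without a MySQL server; a self-contained sketch using the standard-library sqlite3 module (the table and rows are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')")
df = pd.read_sql_query("SELECT * FROM t", conn)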

Fetching Data From an API

import pandas as pd
import requests

response = requests.get(api_link)  # api_link is the API's URL
response.json()['results']  # the top-level key depends on the API, e.g. 'results'
df = pd.DataFrame(response.json()['results'])

Extract a few columns out of many:

df = pd.DataFrame(response.json()['results'])[['id', 'title', '---', '---']]

Loop over all pages:

df = pd.DataFrame()
for i in range(1, total_pages + 1):
    response = requests.get(api_link)  # format the page number into the link
    temp_df = pd.DataFrame(response.json()['results'])[['id', 'title', '---', '---']]
    df = df.append(temp_df, ignore_index=True)  # DataFrame.append was removed in pandas 2.0
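On pandas 2.0+ the append call above fails; collecting the page frames in a list and calling pd.concat once is the replacement (a sketch; api_link and total_pages are the same placeholders as above):

frames = []
for i in range(1, total_pages + 1):
    response = requests.get(api_link.format(i))  # page number formatted into the URL
    frames.append(pd.DataFrame(response.json()['results']))
df = pd.concat(frames, ignore_index=True)  # one concat is also much faster than repeated appends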

Converting the result to a CSV file

df.to_csv('file_name.csv', index=False)

The same steps apply to data fetched from a RapidAPI endpoint.
Web Scraping

import pandas as pd
import requests
from bs4 import BeautifulSoup

webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1').text
soup = BeautifulSoup(webpage, 'lxml')
# print(soup.prettify())

Text of the first h1 tag: soup.find_all('h1')[0].text


TO FIND OUT NAMES OF THE COMPANIES

for i in soup.find_all('h2'):
    print(i.text.strip())

# for i in soup.find_all('p'):
#     print(i.text.strip())
TO FIND OUT THE RATINGS
len(soup.find_all('p', class_='rating'))

TO FIND OUT THE NUMBER OF REVIEWS


len(soup.find_all('a', class_='review-count'))

If the response code is 403, send a browser User-Agent header:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
requests.get('url', headers=headers).text
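A minimal sketch using requests.Session so the header is attached to every request instead of being repeated ('url' stays a placeholder):

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
webpage = session.get('url').text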
import numpy as np

final = pd.DataFrame()

for j in range(1, 1001):
    webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page={}'.format(j)).text
    soup = BeautifulSoup(webpage, 'lxml')
    company = soup.find_all('div', class_='company-content-wrapper')

    name = []
    rating = []
    reviews = []
    ctype = []
    hq = []
    how_old = []
    no_of_employee = []

    for i in company:
        try:
            name.append(i.find('h2').text.strip())
        except:
            name.append(np.nan)

        try:
            rating.append(i.find('p', class_='rating').text.strip())
        except:
            rating.append(np.nan)

        try:
            reviews.append(i.find('a', class_='review-count').text.strip())
        except:
            reviews.append(np.nan)

        try:
            ctype.append(i.find_all('p', class_='infoEntity')[0].text.strip())
        except:
            ctype.append(np.nan)

        try:
            hq.append(i.find_all('p', class_='infoEntity')[1].text.strip())
        except:
            hq.append(np.nan)

        try:
            how_old.append(i.find_all('p', class_='infoEntity')[2].text.strip())
        except:
            how_old.append(np.nan)

        try:
            no_of_employee.append(i.find_all('p', class_='infoEntity')[3].text.strip())
        except:
            no_of_employee.append(np.nan)

    df = pd.DataFrame({'name': name,
                       'rating': rating,
                       'reviews': reviews,
                       'company_type': ctype,
                       'Head_Quarters': hq,
                       'Company_Age': how_old,
                       'No_of_Employee': no_of_employee})

    final = pd.concat([final, df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
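Two small touches worth adding to a loop like this (a sketch): pause between requests, and save the final table:

import time

for j in range(1, 1001):
    # ... scrape one page as in the loop above ...
    time.sleep(1)  # be polite: pause between requests so the site is not hammered

final.to_csv('companies.csv', index=False)  # 'companies.csv' is a made-up file name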
