
Data Gathering

CSV
Opening a local .csv file

import pandas as pd
df = pd.read_csv('file_name.csv')

Opening a .csv file from a URL

import requests
from io import StringIO

url = ""
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)
df = pd.read_csv(data)
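If the server does not block pandas' default user agent, read_csv can also take the URL directly (a minimal sketch; the URL is a placeholder):

import pandas as pd

# read_csv accepts an HTTP(S) URL directly, so requests/StringIO are only
# needed when custom headers are required
df = pd.read_csv('https://example.com/file_name.csv')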

sep parameter

For a .tsv file, use sep='\t':

df = pd.read_csv('file_name.tsv', sep='\t')

If the column names are not in the file, pass them with names:

df = pd.read_csv('file_name.tsv', sep='\t', names=['---', '---'])
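For example (the column names here are made up, just to show the shape of the list):

df = pd.read_csv('file_name.tsv', sep='\t', names=['id', 'name', 'price'])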

index_col parameter (use a column as the index)

df = pd.read_csv('file_name.csv', index_col='column_name')
header parameter (use a given row as the column header)

df = pd.read_csv('file_name.csv', header=1)

usecols parameter (load only specific columns)

df = pd.read_csv('file_name.csv', usecols=['---', '---'])

skiprows/nrows parameters

df = pd.read_csv('file_name.csv', skiprows=[1, 5])
df = pd.read_csv('file_name.csv', nrows=100)  # read only the first 100 rows
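skiprows also accepts a function; a minimal sketch that keeps the header row and every second data row (handy for quickly sampling a big file):

# the callable receives each row index and returns True to skip that row;
# row 0 (the header) is kept, odd-numbered rows are skipped
df = pd.read_csv('file_name.csv', skiprows=lambda x: x > 0 and x % 2 != 0)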

encoding parameter

df = pd.read_csv('file_name.csv', encoding='')  # e.g. 'utf-8' or 'latin-1'

Skip bad lines

df = pd.read_csv('file_name.csv', on_bad_lines='skip')   # pandas >= 1.3
df = pd.read_csv('file_name.csv', error_bad_lines=False) # older pandas (deprecated)
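If the encoding is unknown, it can be guessed with the third-party chardet package (an assumption: this requires pip install chardet, and the detection is a heuristic, not a guarantee):

import chardet

with open('file_name.csv', 'rb') as f:
    result = chardet.detect(f.read(100_000))  # sample the first ~100 KB
print(result)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}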

dtype parameter (set the data type of one or more columns)

df = pd.read_csv('file_name.csv', dtype={'col_name': 'data_type'})  # e.g. 'int32'

parse_dates parameter

read_csv loads dates as object/string by default; parse_dates converts them:

df = pd.read_csv('file_name.csv', parse_dates=['column_name'])

This parses the text in that column into datetime values.
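A date column can also be converted after loading with pd.to_datetime (a minimal sketch; 'column_name' is the same placeholder as above):

df = pd.read_csv('file_name.csv')
df['column_name'] = pd.to_datetime(df['column_name'])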


converters parameter

def fun_name(value):
    if value == "value":
        return "new value"
    return value

df = pd.read_csv('file_name.csv', converters={'column_name': fun_name})
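A self-contained version of the same idea, using an in-memory CSV so it runs as-is (the data and column names are made up):

import pandas as pd
from io import StringIO

csv_text = "city,temp\nNYC,hot\nLA,cold\n"

def fix_temp(value):
    # converters receive each raw cell value of the chosen column
    if value == "hot":
        return "HOT"
    return value

df = pd.read_csv(StringIO(csv_text), converters={'temp': fix_temp})
print(df)  # the 'temp' column now contains 'HOT' and 'cold'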


na_values parameter (treat given strings as NaN)

df = pd.read_csv('file_name.csv', na_values=['---', '----'])
Loading a huge dataset

dfs = pd.read_csv('file_name.csv', chunksize=5000)  # rows per chunk
for chunk in dfs:
    print(chunk.shape)
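Chunks are usually filtered or aggregated so the whole file never sits in memory at once; a minimal sketch, assuming a numeric column named 'price':

chunks = pd.read_csv('file_name.csv', chunksize=5000)
total = sum(chunk['price'].sum() for chunk in chunks)  # aggregate chunk by chunk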
JSON

Load a local file:

import pandas as pd
df = pd.read_json('file_name.json')

From a URL:

df = pd.read_json('URL')
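When the JSON is nested, pd.json_normalize flattens it into columns (a minimal sketch with made-up records):

import pandas as pd

records = [{'id': 1, 'user': {'name': 'a'}}, {'id': 2, 'user': {'name': 'b'}}]
df = pd.json_normalize(records)  # columns: 'id' and 'user.name'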

SQL

Load the .sql dump into MySQL first, then query it from pandas.

Installing: !pip install mysql-connector-python

import mysql.connector
conn = mysql.connector.connect(host='localhost',  # or an IP address
                               user='root',
                               password='',
                               database='db_name')
df = pd.read_sql_query("SELECT * FROM table_name", conn)
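read_sql_query works with any DB-API connection, so the same call can be tried without a MySQL server; a self-contained sketch using the standard-library sqlite3 module (the table and rows are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')")
df = pd.read_sql_query("SELECT * FROM t", conn)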

Fetching Data From an API

import pandas as pd
import requests

response = requests.get(api_link)  # api_link is the API's URL
response.json()['results']  # the top-level key depends on the API, e.g. 'results'
df = pd.DataFrame(response.json()['results'])

Extract a few columns out of many:

df = pd.DataFrame(response.json()['results'])[['id', 'title', '---', '---']]

Loop over all pages:

df = pd.DataFrame()
for i in range(1, total_pages + 1):
    response = requests.get(api_link)  # format the page number into the link
    temp_df = pd.DataFrame(response.json()['results'])[['id', 'title', '---', '---']]
    df = df.append(temp_df, ignore_index=True)  # DataFrame.append was removed in pandas 2.0
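On pandas 2.0+ the append call above fails; collecting the page frames in a list and calling pd.concat once is the replacement (a sketch; api_link and total_pages are the same placeholders as above):

frames = []
for i in range(1, total_pages + 1):
    response = requests.get(api_link.format(i))  # page number formatted into the URL
    frames.append(pd.DataFrame(response.json()['results']))
df = pd.concat(frames, ignore_index=True)  # one concat is also much faster than repeated appends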

Converting the result to a CSV file

df.to_csv('file_name.csv', index=False)

The same steps apply to data fetched from a RapidAPI endpoint.
Web Scraping

import pandas as pd
import requests
from bs4 import BeautifulSoup

webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1').text
soup = BeautifulSoup(webpage, 'lxml')
# print(soup.prettify())

Text of the first h1 tag: soup.find_all('h1')[0].text


TO FIND OUT NAMES OF THE COMPANIES

for i in soup.find_all('h2'):
    print(i.text.strip())

# for i in soup.find_all('p'):
#     print(i.text.strip())
TO FIND OUT THE RATINGS
len(soup.find_all('p', class_='rating'))

TO FIND OUT THE NUMBER OF REVIEWS


len(soup.find_all('a', class_='review-count'))

If the response code is 403, send a browser User-Agent header:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
requests.get('url', headers=headers).text
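A minimal sketch using requests.Session so the header is attached to every request instead of being repeated ('url' stays a placeholder):

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
webpage = session.get('url').text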
import numpy as np

final = pd.DataFrame()

for j in range(1, 1001):
    webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page={}'.format(j)).text
    soup = BeautifulSoup(webpage, 'lxml')
    company = soup.find_all('div', class_='company-content-wrapper')

    name = []
    rating = []
    reviews = []
    ctype = []
    hq = []
    how_old = []
    no_of_employee = []

    for i in company:
        try:
            name.append(i.find('h2').text.strip())
        except:
            name.append(np.nan)

        try:
            rating.append(i.find('p', class_='rating').text.strip())
        except:
            rating.append(np.nan)

        try:
            reviews.append(i.find('a', class_='review-count').text.strip())
        except:
            reviews.append(np.nan)

        try:
            ctype.append(i.find_all('p', class_='infoEntity')[0].text.strip())
        except:
            ctype.append(np.nan)

        try:
            hq.append(i.find_all('p', class_='infoEntity')[1].text.strip())
        except:
            hq.append(np.nan)

        try:
            how_old.append(i.find_all('p', class_='infoEntity')[2].text.strip())
        except:
            how_old.append(np.nan)

        try:
            no_of_employee.append(i.find_all('p', class_='infoEntity')[3].text.strip())
        except:
            no_of_employee.append(np.nan)

    df = pd.DataFrame({'name': name,
                       'rating': rating,
                       'reviews': reviews,
                       'company_type': ctype,
                       'Head_Quarters': hq,
                       'Company_Age': how_old,
                       'No_of_Employee': no_of_employee})

    final = pd.concat([final, df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
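Two small touches worth adding to a loop like this (a sketch): pause between requests, and save the final table:

import time

for j in range(1, 1001):
    # ... scrape one page as in the loop above ...
    time.sleep(1)  # be polite: pause between requests so the site is not hammered

final.to_csv('companies.csv', index=False)  # 'companies.csv' is a made-up file name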
