01 Gmane-Introduction - en

The document discusses spidering a gigabyte of email data from a public mailing list archive and then analyzing and visualizing that data. It describes downloading the raw data, cleaning and normalizing it into a SQLite database, and using various Python scripts to analyze and visualize the data in different ways like word clouds and timelines.



So now we're going to do our last visualization, and it's interesting that we're kind of coming full circle: we're back to email. Instead of a few thousand lines of email, we're going to do a gigabyte of email, and you're going to spider that gigabyte. Actually, if you look at the README in gmane.zip, it tells you how you can get a head start by grabbing the first 675 megabytes in one step and then filling in the rest. The idea is that there's an API out there that will give us a mailing list: a URL that we hit over and over again, changing one little number each time. We pull down that raw data, then we have an analysis and cleanup phase, and then we visualize the data in a couple of different ways.

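To make that concrete, here is a minimal sketch in Python of the "hit a URL over and over, changing one little number" idea. The base URL path and the start/stop pattern are assumptions for illustration; the real details are spelled out in the gmane.zip README.

# Minimal sketch of "hit a URL over and over, changing one little number".
# The base URL path and the start/stop pattern are assumptions for
# illustration; the real pattern is documented in the gmane.zip README.
import urllib.request

BASE_URL = 'http://mbox.dr-chuck.net/sakai.devel/'   # assumed archive path

def fetch_message(number):
    # Ask for the range [number, number + 1) to retrieve a single raw message
    url = BASE_URL + str(number) + '/' + str(number + 1)
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')

print(fetch_message(1)[:200])   # peek at the first 200 characters of message 1
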
Now, this is a lot of data, about a gigabyte, and it originally came from a place called gmane.org. We have had problems because when too many students start using gmane.org to pull the data, we actually kind of hurt their servers. They don't have rate limits; they're nice folks. If we hurt them, they're hurt, and they're not defending themselves. Where Twitter and Google can take the load, gmane.org is just some nice people providing a service, so don't bug them, don't be uncool. I've got a copy at http://mbox.dr-chuck.net that has this data, and it's on super-fast servers cached all around the world using a service called Cloudflare. So they're super awesome, and you can beat the heck out of dr-chuck.net and I guarantee you're not going to hurt it. You can't take it down; good luck trying to take it down, because it is a beast. So make sure that when you're testing you use dr-chuck.net, don't use gmane.org. Even though it would work, please don't do that. I've got my own copy. Okay, enough said.

So this is basically the data flow that's going to happen. We go to dr-chuck.net, which has all the data and an API, and the messages are simply numbered in sequence: message 1, message 2, message 3, and so on, so we always know how much we've retrieved. When the spider program starts up, it checks how much is already in the database, scans down through it, and says, okay, the next one is number 4. So it calls the API to get message number 4, brings it back, and puts it in. Then it calls the API for message number 5, 6, 7, 8, 9, 100, 150, 200, 300... and then it crashes. Again, this is a slow but restartable process: you start it back up, it figures out where it left off, and it picks up from there.

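Here is a rough sketch of that restartable loop, assuming a simple raw table keyed by message number and the same illustrative URL pattern as above; the real schema and URL handling live in the spider code described in the README.

# Sketch of the "restartable" part: on startup, ask the raw database how far
# we got, then resume from the next message number. The table layout and the
# URL pattern are illustrative assumptions.
import sqlite3
import urllib.request

BASE_URL = 'http://mbox.dr-chuck.net/sakai.devel/'   # assumed archive path

conn = sqlite3.connect('content.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Messages (id INTEGER UNIQUE, raw TEXT)')

# Find the highest message number already retrieved (0 if the table is empty)
cur.execute('SELECT MAX(id) FROM Messages')
start = (cur.fetchone()[0] or 0) + 1

for number in range(start, start + 50):               # pull the next batch of 50
    url = BASE_URL + str(number) + '/' + str(number + 1)
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode('utf-8', errors='replace')
    cur.execute('INSERT OR IGNORE INTO Messages (id, raw) VALUES (?, ?)',
                (number, raw))
    conn.commit()   # commit after every message so a crash loses almost nothing
conn.close()
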
If you're really going to spider the whole thing, I think when I spidered it the first time it took me about three days to get all of it, so it's kind of fun, right? Unless, of course, you're using a network connection you're paying for. Do not do that, because you're going to spend a lot of money on your network connection. If you're on an unlimited network, or you're at a university with a good connection, then have fun. Run it for hours and watch what it does; it just grinds and grinds and grinds.

Now, it turns out that this data is a little funky, and it's all talked about in the README. This is almost 10 years of data from the Sakai developer list, and people even change their email addresses over time, so there's a little bit of extra patch-up code, gmodel.py, that has some additional mapping data it reads; it goes through all of this stuff and cleans up the data. The raw database ends up being really large and, if you recall from the database chapter, it's not well normalized. It's just raw; it's only there for spidering and for making sure we can restart our spider.

If you want to actually make sense of this data, we clean it up by running a process that reads the raw database completely, wipes out the old cleaned-up database, and writes a whole brand-new one. If you look at the sizes, the raw database is really large and the cleaned-up one is really small. If you have the whole thing, it can take minutes to read the raw data, depending on how fast your computer is, because it's so big, and this is a good example of normalized versus non-normalized data. Let's just say it takes two minutes to write the new database, because it's reading the raw one slowly since it's not normalized. The new database is nicely normalized: it uses primary keys, foreign keys, and indexes, all the stuff we taught you in the database chapter. So it's small; if you look at the size of the file, it's got roughly the same information, but represented in a much more efficient way. This step produces index.sqlite, and then the rest of the programs read index.sqlite, because this is the cleanup phase.

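To make the normalized-versus-raw contrast concrete, here is a hypothetical pair of schemas. Treat it as an illustration of the idea, not the actual table layout in content.sqlite or index.sqlite.

# Hypothetical illustration of raw versus normalized storage. The actual
# schemas in content.sqlite and index.sqlite may differ from these.
import sqlite3

# Raw side: one wide table, the whole message kept as big text columns.
raw = sqlite3.connect('content.sqlite')
raw.execute('''CREATE TABLE IF NOT EXISTS Messages
               (id INTEGER UNIQUE, email TEXT, sent_at TEXT,
                subject TEXT, headers TEXT, body TEXT)''')
raw.close()

# Normalized side: repeated strings (senders, subjects) get their own tables
# with primary keys, and Messages just stores small integer foreign keys.
idx = sqlite3.connect('index.sqlite')
idx.executescript('''
CREATE TABLE IF NOT EXISTS Senders
    (id INTEGER PRIMARY KEY, sender TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS Subjects
    (id INTEGER PRIMARY KEY, subject TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS Messages
    (id INTEGER PRIMARY KEY, sent_at INTEGER,
     sender_id INTEGER REFERENCES Senders(id),
     subject_id INTEGER REFERENCES Subjects(id),
     body TEXT);
''')
idx.close()
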
Now, what you can do is run the spider for a while, then blow away the cleaned-up database and run the cleanup again, and that's fine, because every time the cleanup runs it throws the old version away and rebuilds it. You might look at some of the results and say, "Oh, I want to spider some more," and that's okay, because you can start the spider back up, and when you're done with however far you got, you stop it and run the cleanup again. It reads the whole raw database and rebuilds the cleaned-up one. Once the data has been processed the right way, you run gbasic.py, which dumps out some stuff, but it's really just doing database analysis. Then, if you want to visualize it as a line graph, you run gline.py, which loops through all the data and writes the data into gline.js, and you can visualize that with an HTML file and d3.js. If you want to make a word cloud, you run gword, which loops through all that data and produces some JavaScript that is then combined with some more HTML to produce a nice word cloud.

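As a sketch of that "Python writes a JavaScript data file" pattern, which is the same idea gword uses, here is a hypothetical version that counts messages per year and writes them into gline.js; the actual query, table layout, and output format in gline.py will differ.

# Sketch of "a Python script writes a JavaScript data file" that an HTML
# page using d3.js can load. The query, the table layout, and the output
# variable name are illustrative assumptions, not the exact code in gline.py.
import sqlite3

conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()

# Hypothetical query: count messages per year (sent_at stored as a Unix time)
cur.execute('''SELECT strftime('%Y', sent_at, 'unixepoch') AS year, COUNT(*)
               FROM Messages GROUP BY year ORDER BY year''')
rows = cur.fetchall()
conn.close()

with open('gline.js', 'w') as out:
    out.write('gline = [\n')
    out.write(',\n'.join("  ['%s', %d]" % (year, count) for year, count in rows))
    out.write('\n];\n')
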
The README walks you through all of this: it tells you what to run, how to run it, and roughly how long it's going to take, so you can work your way through all of these steps. So, in summary, with these three examples we're really writing somewhat more sophisticated applications. I've given you most of the source code for these applications, but you can see what a more sophisticated application looks like, and based on these you can probably build your own data-pulling and maybe even data-visualization techniques and adapt some of this stuff. So, thanks for listening, and we'll see you on the net.
