01 Gmane-Introduction - en

The document discusses spidering a gigabyte of email data from a public mailing list archive and then analyzing and visualizing that data. It describes downloading the raw data, cleaning and normalizing it into a SQLite database, and using various Python scripts to analyze and visualize the data in different ways like word clouds and timelines.



So now we're going to do our last visualization, and it's interesting that we're kind of coming full circle: we're back to email. Instead of a few thousand lines of email, we're going to do a gigabyte of email, and you're going to spider that gigabyte. Actually, if you look at the README in gmane.zip, it tells you how you can get a head start by grabbing the first 675 megabytes in one step and then filling in the rest. The idea is that there's an API out there that will give us a mailing list: a URL that we hit over and over again, changing one little number each time. We pull down that raw data, then we have an analysis and cleanup phase, and then we visualize the data in a couple of different ways.

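To make that concrete, here is a minimal sketch in Python of the "hit a URL over and over, changing one little number" idea. The base URL path and the start/stop pattern are assumptions for illustration; the real details are spelled out in the gmane.zip README.

# Minimal sketch of "hit a URL over and over, changing one little number".
# The base URL path and the start/stop pattern are assumptions for
# illustration; the real pattern is documented in the gmane.zip README.
import urllib.request

BASE_URL = 'http://mbox.dr-chuck.net/sakai.devel/'   # assumed archive path

def fetch_message(number):
    # Ask for the range [number, number + 1) to retrieve a single raw message
    url = BASE_URL + str(number) + '/' + str(number + 1)
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')

print(fetch_message(1)[:200])   # peek at the first 200 characters of message 1
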
Now, this is a lot of data, about a gigabyte, and it originally came from a place called gmane.org. We have had problems because when too many students start using gmane.org to pull the data, we actually kind of hurt their servers. They don't have rate limits; they're nice folks. If we hurt them, they're hurt, and they're not defending themselves. Where Twitter and Google can take the load, gmane.org is just some nice people providing a service, so don't bug them, don't be uncool. I've got a copy at http://mbox.dr-chuck.net that has this data, and it's on super-fast servers cached all around the world using a service called Cloudflare. So they're super awesome, and you can beat the heck out of dr-chuck.net and I guarantee you're not going to hurt it. You can't take it down; good luck trying to take it down, because it is a beast. So make sure that when you're testing you use dr-chuck.net, don't use gmane.org. Even though it would work, please don't do that. I've got my own copy. Okay, enough said.

So this is basically the data flow that's going to happen. We go to dr-chuck.net, which has all the data and an API, and the messages are simply numbered in sequence: message 1, message 2, message 3, and so on, so we always know how much we've retrieved. When the spider program starts up, it checks how much is already in the database, scans down through it, and says, okay, the next one is number 4. So it calls the API to get message number 4, brings it back, and puts it in. Then it calls the API for message number 5, 6, 7, 8, 9, 100, 150, 200, 300... and then it crashes. Again, this is a slow but restartable process: you start it back up, it figures out where it left off, and it picks up from there.

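Here is a rough sketch of that restartable loop, assuming a simple raw table keyed by message number and the same illustrative URL pattern as above; the real schema and URL handling live in the spider code described in the README.

# Sketch of the "restartable" part: on startup, ask the raw database how far
# we got, then resume from the next message number. The table layout and the
# URL pattern are illustrative assumptions.
import sqlite3
import urllib.request

BASE_URL = 'http://mbox.dr-chuck.net/sakai.devel/'   # assumed archive path

conn = sqlite3.connect('content.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Messages (id INTEGER UNIQUE, raw TEXT)')

# Find the highest message number already retrieved (0 if the table is empty)
cur.execute('SELECT MAX(id) FROM Messages')
start = (cur.fetchone()[0] or 0) + 1

for number in range(start, start + 50):               # pull the next batch of 50
    url = BASE_URL + str(number) + '/' + str(number + 1)
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode('utf-8', errors='replace')
    cur.execute('INSERT OR IGNORE INTO Messages (id, raw) VALUES (?, ?)',
                (number, raw))
    conn.commit()   # commit after every message so a crash loses almost nothing
conn.close()
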
If you're really going to spider the whole thing, I think when I spidered it the first time it took me about three days to get all of it, so it's kind of fun, right? Unless, of course, you're using a network connection you're paying for. Do not do that, because you're going to spend a lot of money on your network connection. If you're on an unlimited network, or you're at a university with a good connection, then have fun. Run it for hours and watch what it does; it just grinds and grinds and grinds.

Now, it turns out that this data is a little funky, and it's all talked about in the README. This is almost 10 years of data from the Sakai developer list, and people even change their email addresses over time, so there's a little bit of extra patch-up code, gmodel.py, that has some additional mapping data it reads; it goes through all of this stuff and cleans up the data. The raw database ends up being really large and, if you recall from the database chapter, it's not well normalized. It's just raw; it's only there for spidering and for making sure we can restart our spider.

If you want to actually make sense of this data, we clean it up by running a process that reads the raw database completely, wipes out the old cleaned-up database, and writes a whole brand-new one. If you look at the sizes, the raw database is really large and the cleaned-up one is really small. If you have the whole thing, it can take minutes to read the raw data, depending on how fast your computer is, because it's so big, and this is a good example of normalized versus non-normalized data. Let's just say it takes two minutes to write the new database, because it's reading the raw one slowly since it's not normalized. The new database is nicely normalized: it uses primary keys, foreign keys, and indexes, all the stuff we taught you in the database chapter. So it's small; if you look at the size of the file, it's got roughly the same information, but represented in a much more efficient way. This step produces index.sqlite, and then the rest of the programs read index.sqlite, because this is the cleanup phase.

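To make the normalized-versus-raw contrast concrete, here is a hypothetical pair of schemas. Treat it as an illustration of the idea, not the actual table layout in content.sqlite or index.sqlite.

# Hypothetical illustration of raw versus normalized storage. The actual
# schemas in content.sqlite and index.sqlite may differ from these.
import sqlite3

# Raw side: one wide table, the whole message kept as big text columns.
raw = sqlite3.connect('content.sqlite')
raw.execute('''CREATE TABLE IF NOT EXISTS Messages
               (id INTEGER UNIQUE, email TEXT, sent_at TEXT,
                subject TEXT, headers TEXT, body TEXT)''')
raw.close()

# Normalized side: repeated strings (senders, subjects) get their own tables
# with primary keys, and Messages just stores small integer foreign keys.
idx = sqlite3.connect('index.sqlite')
idx.executescript('''
CREATE TABLE IF NOT EXISTS Senders
    (id INTEGER PRIMARY KEY, sender TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS Subjects
    (id INTEGER PRIMARY KEY, subject TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS Messages
    (id INTEGER PRIMARY KEY, sent_at INTEGER,
     sender_id INTEGER REFERENCES Senders(id),
     subject_id INTEGER REFERENCES Subjects(id),
     body TEXT);
''')
idx.close()
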
Now, what you can do is run the spider for a while, then blow away the cleaned-up database and run the cleanup again, and that's fine, because every time the cleanup runs it throws the old version away and rebuilds it. You might look at some of the results and say, "Oh, I want to spider some more," and that's okay, because you can start the spider back up, and when you're done with however far you got, you stop it and run the cleanup again. It reads the whole raw database and rebuilds the cleaned-up one. Once the data has been processed the right way, you run gbasic.py, which dumps out some stuff, but it's really just doing database analysis. Then, if you want to visualize it as a line graph, you run gline.py, which loops through all the data and writes the data into gline.js, and you can visualize that with an HTML file and d3.js. If you want to make a word cloud, you run gword, which loops through all that data and produces some JavaScript that is then combined with some more HTML to produce a nice word cloud.

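As a sketch of that "Python writes a JavaScript data file" pattern, which is the same idea gword uses, here is a hypothetical version that counts messages per year and writes them into gline.js; the actual query, table layout, and output format in gline.py will differ.

# Sketch of "a Python script writes a JavaScript data file" that an HTML
# page using d3.js can load. The query, the table layout, and the output
# variable name are illustrative assumptions, not the exact code in gline.py.
import sqlite3

conn = sqlite3.connect('index.sqlite')
cur = conn.cursor()

# Hypothetical query: count messages per year (sent_at stored as a Unix time)
cur.execute('''SELECT strftime('%Y', sent_at, 'unixepoch') AS year, COUNT(*)
               FROM Messages GROUP BY year ORDER BY year''')
rows = cur.fetchall()
conn.close()

with open('gline.js', 'w') as out:
    out.write('gline = [\n')
    out.write(',\n'.join("  ['%s', %d]" % (year, count) for year, count in rows))
    out.write('\n];\n')
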
The README walks you through all of this: it tells you what to run, how to run it, and roughly how long it's going to take, so you can work your way through all of these steps. So, in summary, with these three examples we're really writing somewhat more sophisticated applications. I've given you most of the source code for these applications, but you can see what a more sophisticated application looks like, and based on these you can probably build your own data-pulling and maybe even data-visualization techniques and adapt some of this stuff. So, thanks for listening, and we'll see you on the net.
