0% found this document useful (0 votes)
17 views

rickwash.com_papers_cscw08-appendix_parse_all_url

The document contains a Perl script named 'parse_all_url.pl' created by Rick Wash, which is designed to download and parse pages from del.icio.us and save them as HTML files. It includes functionality to create database tables for storing site and tag information, as well as methods for processing HTML files to extract relevant data. The script provides commands for creating a database and parsing saved HTML files to populate the database with the extracted information.

Uploaded by

zackm198613
Copyright
© All Rights Reserved
Available Formats
Download as TXT or PDF, or read online on Scribd
0% found this document useful (0 votes)
17 views

rickwash.com_papers_cscw08-appendix_parse_all_url

The document contains a Perl script named 'parse_all_url.pl' created by Rick Wash, which is designed to download and parse pages from del.icio.us and save them as HTML files. It includes functionality to create database tables for storing site and tag information, as well as methods for processing HTML files to extract relevant data. The script provides commands for creating a database and parsing saved HTML files to populate the database with the extracted information.

Uploaded by

zackm198613
Copyright
© All Rights Reserved
Available Formats
Download as TXT or PDF, or read online on Scribd
You are on page 1/ 4

From: <Saved by Blink>

Snapshot-Content-Location:
https://rickwash.com/papers/cscw08-appendix/parse_all_url.txt
Subject:
Date: Thu, 13 Feb 2025 12:17:15 -0700
MIME-Version: 1.0
Content-Type: multipart/related;
type="text/html";
boundary="----MultipartBoundary--
d34iuTKpDSVocPkCb4t5DuMOZRbCF6XhEoVT21BOZO----"

------MultipartBoundary--d34iuTKpDSVocPkCb4t5DuMOZRbCF6XhEoVT21BOZO----
Content-Type: text/html
Content-ID: <[email protected]>
Content-Transfer-Encoding: binary
Content-Location: https://rickwash.com/papers/cscw08-appendix/parse_all_url.txt

<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-


8"><meta name="color-scheme" content="light dark"></head><body><pre style="word-
wrap: break-word; white-space: pre-wrap;">#!/usr/bin/perl
# Copyright (c) 2008 Rick Wash &lt;[email protected]&gt;
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
#
#
# parse_all_url.pl
# Rick Wash &lt;[email protected]&gt;
#
# Download a ton of pages from del.icio.us/URL (or from a person's delicious page)
# and save them as .html files.
# Then, you can run this file:
# - parse_all_url create &lt;name&gt;
# (creates the database tables &lt;name&gt;_site and &lt;name&gt;_tag)
# - parse_all_url &lt;name&gt;
# (parses all of the .html files in the current and all sub-directories)

use File::Find;

# Config info
$db_name = "delicious2007";
$db_user = "root";
$db_passwd = "********";

# Connect to the Database


use DBI;
$dbh = DBI-&gt;connect("dbi:mysql:$db_name", $db_user, $db_passwd);

# Create the two MySQL tables that hold the parsed del.icio.us data:
#   <sample>_site - one row per saved bookmark (site) occurrence
#   <sample>_tag  - one row per tag attached to a bookmark
# Takes the sample name used as the table-name prefix.  Table creation
# failures are now reported (the original ignored the return of do()).
sub create_tables
{
    my ($sample) = @_;

    my $site_sql = "
    CREATE TABLE `${sample}_site` (
      `id` int(11) NOT NULL auto_increment,
      `deliciousID` varchar(200) NOT NULL default '',
      `title` varchar(400) default NULL,
      `url` varchar(500) default NULL,
      `user` varchar(200) default NULL,
      `date` date default NULL,
      `position` int(11) default NULL,
      PRIMARY KEY (`id`),
      KEY `deliciousID` (`deliciousID`),
      KEY `date` (`date`),
      KEY `position` (`position`),
      KEY `user` (`user`),
      KEY `id_date` (`deliciousID`,`date`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1";

    my $tag_sql = "
    CREATE TABLE `${sample}_tag` (
      `id` int(11) NOT NULL auto_increment,
      `site_id` int(11) default NULL,
      `tag` varchar(200) default NULL,
      `position` int(11) default NULL,
      PRIMARY KEY (`id`),
      KEY `tag` (`tag`),
      KEY `deliciousID` (`site_id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1";

    $dbh->do($site_sql)
        or warn "Creating ${sample}_site failed: " . $dbh->errstr;
    $dbh->do($tag_sql)
        or warn "Creating ${sample}_tag failed: " . $dbh->errstr;
}

# Convert every three-letter English month abbreviation in the argument to
# its two-digit month number (e.g. "Jan" -> "01", "Dec '07" -> "12 '07").
#
# Fix vs. the original: the original assigned to the global $_ ("$_ = shift")
# without localizing it, clobbering the caller's current line buffer — and
# its chain of s/// operations also reset the regex capture variables.  A
# lexical copy avoids touching the caller's $_ entirely.
sub fix_month {
    my ($text) = @_;
    my %num_of = (
        Jan => '01', Feb => '02', Mar => '03', Apr => '04',
        May => '05', Jun => '06', Jul => '07', Aug => '08',
        Sep => '09', Oct => '10', Nov => '11', Dec => '12',
    );
    # One alternation is equivalent to the original twelve s///g passes:
    # the abbreviations do not overlap and no replacement creates a new one.
    $text =~ s/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/$num_of{$1}/g;
    return $text;
}

# --- Command-line handling -------------------------------------------------
# parse_all_url create <name>  : create the <name>_site/<name>_tag tables
# parse_all_url <name>         : parse .html files into those tables
my $first = shift;    # first parameter from command line

# The original proceeded with an undef sample name (yielding tables named
# "_site"/"_tag" or broken SQL); fail fast with a usage message instead.
die "usage: parse_all_url [create] <name>\n" unless defined $first;

if ($first eq "create") {
    my $name = shift;
    die "usage: parse_all_url create <name>\n" unless defined $name;
    create_tables($name);
    exit(0);
}

my $sample = $first;

# Pre-compile the two INSERT statements used by pass_two.
# (The site INSERT literal is rejoined here: the captured page had wrapped
# it across several lines.)  prepare() failures now abort instead of leaving
# an undef statement handle.
$site_sql = "INSERT INTO ${sample}_site (deliciousID, title, url, user, date, position) VALUES (?, ?, ?, ?, ?, ?)";
$site_sth = $dbh->prepare($site_sql)
    or die "Cannot prepare site INSERT: " . $dbh->errstr;

$meta_sql = "INSERT INTO ${sample}_tag (site_id, tag, position) VALUES (?, ?, ?)";
$meta_sth = $dbh->prepare($meta_sql)
    or die "Cannot prepare tag INSERT: " . $dbh->errstr;

# First pass through the file.  Find the Title, URL, and total number of posts
#
# Called once per input line with the line in $_ (set by parse_file's read
# loop).  Results are stored in the globals $site_id, $title, $url and
# $total_count, which pass_two later reads.
sub pass_one()
{
    # Get the Site ID number from the title: the trailing path component of
    # "<title>del.icio.us/url/<hash></title>".
    if (/<title>del.icio.us\/url\/([^<]*)<\/title>/)
    {
        $site_id = $1;
    }
    # The bookmark headline links to the saved URL: capture both the href
    # ($url) and the link text ($title).
    if (/<h4 class="nomb"><a href="([^\"]*)" rel="nofollow">([^<]*)<\/a><\/h4>/){
        $title = $2;
        $url = $1;
    }
    # Total number of people who bookmarked this URL; used by parse_file to
    # seed the countdown position counter for pass two.
    if (/this url has been saved by (\d*) people/) {
        $total_count = $1
    }
}

# Pass two through the file.  Find the individual posts and store them in
# the database.
#
# Called once per input line with the line in $_.  Relies on globals set in
# pass one ($site_id, $title, $url) and on the countdown counter $number set
# by parse_file.
#
# Fixes vs. the original:
#  - the sub was missing its closing brace (the next sub started while this
#    one was still open);
#  - $1/$2 from the datehead match were read *after* calling fix_month(),
#    whose substitutions reset the capture variables, so the 19xx/20xx year
#    test never saw the real year.  The captures are saved first now.
sub pass_two()
{
    # Month header, e.g. <h5 class="datehead">Jan &lsquo;07</h5>
    if (/<h5 class="datehead">(...) &lsquo;(\d\d)<\/h5>/) {
        my ($mon, $yy) = ($1, $2);   # save captures before fix_month()
        $mo = fix_month($mon);
        # Two-digit year heuristic: 90-99 => 19xx, everything else => 20xx.
        if ($yy < 90) {
            $date = "20$yy-${mo}-01 00:00:00";
        } else {
            $date = "19$yy-${mo}-01 00:00:00";
        }
    }
    if (/<li><p>by/) { # Found an individual bookmark entry
        @tags = ();
        $order = 1;
        # The user who saved the bookmark.
        if (/who" href="[^\\"]*">([^<]*)/) {
            $user = $1;
        }
        # Collect every tag link on the line, recording its position.
        while (/(to) <a href="[^"]*">([^<]*)<\/a>(.*)/) {
            push @tags, [ $2, $order ];
            $_ = $1 . $3; # Remove the matched section
            $order += 1;
        }
        $site_sth->execute($site_id, $title, $url, $user, $date, $number--)
            || warn "SQL Error Inserting into site";

        $id = $dbh->last_insert_id(undef, undef, undef, undef);

        foreach my $i (@tags) {
            ($tag, $order) = @{$i};
            $meta_sth->execute($id, $tag, $order)
                || warn "SQL Error Inserting into metadata";
        }
    }
}

# Parse one saved .html file in two passes: pass one collects the site id,
# title/URL and total bookmark count; pass two inserts each individual post
# (and its tags) into the database.
#
# Fix vs. the original: on a failed open the original warned and then read
# from a dead bareword handle; it now warns and returns.  Lexical
# filehandles replace the global INFILE.
sub parse_file {
    my ($fname) = @_;

    $number = 1;

    # First pass: loop through the file, one line at a time.
    open(my $in, "<", $fname) or do {
        warn("Cannot open $fname: $!");
        return;
    };
    while (<$in>) {
        pass_one();
    }
    close $in;

    # Positions count down from the total found in pass one.
    # NOTE(review): $total_count is a global; if a file lacks the "saved by
    # N people" line it keeps the previous file's value — confirm intended.
    $number = $total_count;

    # Second pass: insert each post into the database.
    open($in, "<", $fname) or do {
        warn("Cannot open $fname: $!");
        return;
    };
    while (<$in>) {
        pass_two();
    }
    close $in;
}

# File::Find callback: invoked with the current entry's name in $_; hands
# every *.html file to the parser and ignores everything else.
sub check_file {
    return unless /\.html$/;
    parse_file($_);
}

# Walk the current directory tree and process every saved page.
find(\&check_file, ".");
</pre></body></html>
------MultipartBoundary--d34iuTKpDSVocPkCb4t5DuMOZRbCF6XhEoVT21BOZO------

You might also like