Linux Journal March 2017
MARCH 2017 | ISSUE 275
http://www.linuxjournal.com
Since 1994: The Original Magazine of the Linux Community
BIG DATA, Hadoop and R
+
AUTOMATION TIPS for SysAdmins
A LOOK AT MANIPULATING IMAGES with ImageMagick
Cutting the Cable Cord
GEEK GUIDES
http://geekguide.linuxjournal.com
BotFactory: Automating the End of Cloud Sprawl
Author: John S. Tonello
Sponsor: BotFactory.io
Containers 101
Author: Sol Lederman
Sponsor: Puppet
COLUMNS
Reuven M. Lerner’s At the Forge: Unsupervised Learning
Dave Taylor’s Work the Shell: Image Manipulation with ImageMagick
Kyle Rankin’s Hack and /: Sysadmin 101: Automation
Shawn Powers’ The Open-Source Classroom: The Post-TV Age?
IN EVERY ISSUE
Current_Issue.tar.gz
Letters
UPFRONT
LINUX JOURNAL (ISSN 1075-3583) is published monthly by Belltown Media, Inc., PO Box 980985, Houston, TX 77098 USA.
Subscription rate is $29.50/year. Subscriptions start with the next issue.
Contributing Editors
Ibrahim Haddad • Robert Love • Zack Brown • Dave Phillips • Marco Fioretti • Ludovic Marcotte
Paul Barry • Paul McKenney • Dave Taylor • Dirk Elmendorf • Justin Ryan • Adam Monsen
Advertising
E-MAIL: [email protected]
URL: www.linuxjournal.com/advertising
0(/.%
EXT
Subscriptions
E-MAIL: [email protected]
URL: www.linuxjournal.com/subscribe
MAIL: PO Box 980985, Houston, TX 77098 USA
keep up with data explosion.
SUSE Enterprise Storage, the leading open source storage solution, is highly scalable and resilient, enabling high-end functionality at a fraction of the cost.
suse.com/storage
Data
also the Gadget Guy for LinuxJournal.com, and he has an interesting collection of vintage Garfield coffee mugs. Don’t let his silly hairdo fool you, he’s a pretty ordinary guy and can be reached via email at [email protected]. Or, swing by the #linuxjournal IRC channel on Freenode.net.
Like most fancy tech terms, “Cloud Computing” has lost its newness, and it’s now just a commodity we purchase. It’s often so much easier to provision virtual machines than it is to buy and host your own servers. Yes, there are concerns over privacy and security when your data is in the cloud. When you host in your own data center, however, there’s still the possibility of a rogue cleaning crew getting to your servers. (We’ve all seen the movies; it just takes a mop and a blue jumper to get you into the most secure data center.) Regardless of your stance on cloud computing, it’s here to stay. This month, we talk a bit about how to live in this bold new world.
Reuven M. Lerner starts things off with more
Stop
Regarding Doc Searls’ “Debugging Democracy” in the January issue: please stop printing childish personal insults in Linux Journal.
When you refer to the President-Elect as “Internet troll Donald Trump”, you are being personally insulting, childish (by playing funny games with someone’s name: “heh heh, he’s got rump in his name”) and promoting politics of hate and fear. This has no place in a technical journal and no relevance to Linux or computing in general.
You would not refer to the losing candidate as a “shrew”, nor would you play childish games with her name as a way of insulting her. In fact, you did refer to her by her correct name, without personal insult.
I’m sorry, but the election process does not need “debugging” because your favorite lost. This has happened every four years since the ratification of the US Constitution: someone wins, someone loses. It’s a direct and inevitable side effect of having one President instead of two. Every time, nearly half the voters are disappointed.
every response to my January column, both here and on our website, has been as negative as Mark’s, and for the same reasons.
Opening with that remark also failed to support the main purpose of that column, which was to call for help in rescuing journalism (and real journals such as this one) from drowning in a sea of “content”, way too much of which is crap routed by algorithms aimed by surveillance-gathered data into echo chambers of the like-minded. This has the effect of increasing enmity and blame toward those in echo chambers with opposing sympathies, which is worse than dangerous in democratic societies, because it tears apart the center spaces of basic agreement those societies require. You can see how this looks in The Wall Street Journal’s Blue Feed, Red Feed site (http://graphics.wsj.com/blue-feed-red-feed), subtitled “See Liberal Facebook and Conservative
in a variety of digital formats, including PDF, .epub, .mobi and an online digital edition, as well as apps for iOS and Android devices. Renewing your subscription, changing your email address for issue delivery, paying your invoice, viewing your account details or other subscription inquiries can be done instantly online: http://www.linuxjournal.com/subs. Email us at [email protected] or reach us via postal mail at Linux Journal, PO Box 980985, Houston, TX 77098 USA. Please remember to include your complete name and address when contacting us.
ACCESSING THE DIGITAL ARCHIVE: Your monthly download notifications will have links to the various formats and to the digital archive. To access the digital archive at any time, log in at http://www.linuxjournal.com/digital.
LETTERS TO THE EDITOR: We welcome your letters and encourage you to submit them at http://www.linuxjournal.com/contact or mail them to Linux Journal, PO Box 980985, Houston, TX 77098 USA. Letters may be edited for space and clarity.
WRITING FOR US: We always are looking for contributed articles, tutorials and real-world stories for the magazine. An author’s guide, a list of topics and due dates can be found online: http://www.linuxjournal.com/author.
For the editing needs he described, there is no excuse for not using the excellent Shotcut video editor (https://shotcut.org). Besides the fact that it has more than enough features, it’s open source and available for Linux, Windows and Mac OS.
I am just wondering what kind of excuse he will give for not adopting Shotcut for his video editing needs. I am not pressuring him with this, just playing and using this opportunity to let him know about Shotcut.
I’m not connected with the project; I just use, advocate and provide some help with my small knowledge in the Shotcut forum. See also the videos on YouTube for some help getting started.
—Luis Sismeiro
Shawn Powers replies: It’s easy to feel a bit defensive with questions
like, “what’s your excuse?”, but tone is often easy to misinterpret in
email, so I’m going to assume this was a friendly message. I don’t think
I need an excuse, because I don’t think I’ve done anything wrong. But
I’ll answer the question of “why”, because that’s a fair one.
I’ve never tried Shotcut, but now that you’ve brought it to my attention,
I’ll be giving it a try. Heck, perhaps I’ll write about it. The thing that’s
important to know though is that if it is a program that crashes or doesn’t
work well for me, I likely won’t use it.
So after all that, thank you for bringing the project to my attention. I’ll
definitely check it out!
Fake News
I read Mr. Searls’ tantrum in the January issue with great amusement. It seems he is not a fan of Donald Trump, but rather than calling for the death of the Electoral College (the other fad du jour), he says “we need to hack the news back in a logical direction and away from the fact-free, misleading and emotion-stirring ways that news is made today”. In other words, “Doc” is calling for global “fact-checking” on the internet, aka a globalized Wikipedia. And who, pray tell, will be trusted with that curation process? We don’t need to look far for an answer, because others have suggested similar things before with regard to broadcast news: https://en.wikipedia.org/wiki/Fairness_Doctrine.
So the creative chaos of the Bazaar is good for Linux, but bad in the arena of politics and news? Interesting how when liberals lose elections, their immediate instincts are to change the Constitutional process and call for clamping down on radio, TV and now the internet. The First Amendment applies only to whoever has the correct views? Wouldn’t it be better if we had an educated electorate who could smell truth from fiction on their own? But how often is critical thinking taught today? Let us take this as a teachable moment.
Mr. Searls goes on to show some pretty graphs and points out that “This kind of study does not show a mandate”. True, the election of Trump alone does not show a mandate, but what does this table show?
                2008 (DEM/REP)   2016 (DEM/REP)   Source
State Senates   28/20            13/37            https://ballotpedia.org/Partisan_composition_of_state_senates
State House     32/16            18/31            https://ballotpedia.org/Partisan_composition_of_state_houses
US House        233/198          194/241          https://ballotpedia.org/United_States_Congress_elections_
Doc Searls replies: I meant “hack” in the broad sense it has been used
here since Linux Journal began in 1994. If you want a specific definition
(or a set of them), consult the Jargon File: http://www.catb.org/jargon/html.
The Trump vs. Hillary contest was maximally interesting at the time I wrote
the column, but it was also beside the main point I tried to make: there
are dangerously dysfunctional ways our democracy now informs itself in
the networked world. “Fake news” is currently the most obvious example,
although I believe the real problems are deeper and more systematic than
that. However one looks at it, some fixing is required.
Dave Taylor replies: Interesting tweak to the script. I’ll have to send it
over to my NASA friends to QA!
But the use case that I applied to justify buying the Synology was “Synology Cloud Station”; it might have come out after you wrote your review. It is great: files on the server are not kept locked up in a special container like ownCloud, so you can seamlessly drop files into an ordinary folder on the Synology and have them replicate out to your Cloud Station clients.
This means LAN users can see files directly via Samba, AppleTalk or NFS and not have to copy them to their own hard drives, but I (working at home) also get access via synchronization. Working remotely, I can export a large format map to a PDF file, and it will be uploaded automatically via Cloud Station sync. Then my team can view the map from the Synology via LAN, file share or web server (File Station).
I love ownCloud by the way, and it will run on a Synology (I tried it out), and I use it on my own Debian server, but Cloud Station fits our use cases better.
—Brian Wilson
Shawn Powers replies: I’m fairly certain Cloud Station has been there
for a while, but I have to admit I haven’t tried it. (That will change!) My
concerns with running things on the Synology directly are all horsepower-
related. I love it for things like reverse proxying, web hosting and torrent
management. My Plex Media Server, however, I put on a separate box
because I fear the Synology wouldn’t be able to manage the transcoding.
I also share your frustration with the packages provided by Synology, but
thankfully, there are some community-maintained programs that can be
downloaded and installed.
The Geo stuff you’re doing sounds cool, by the way, and it sounds like
a perfect use case since the CPU demands aren’t too high. And thanks
again for the tip about Cloud Station; I’ll have to give it a try!
In fact, the data block should be enclosed in single quotes, not double quotes.
—Khoa Le
Shawn Powers replies: Double quotes and single quotes are often
the bane of my programming. Half the time I get errors like you
mention here, and the other half of the time I end up with output
that looks like, “Thank you $NAME, your contribution to $THING
was greatly appreciated!”
In this case, the script worked for me on Linux, and once it worked,
I didn’t try it elsewhere. I could have, as I use OS X as well, but
sadly, I didn’t. Thanks for pointing out the issue. Hopefully everyone
struggling will see this letter!
WRITE LJ A LETTER
We love hearing from our readers. Please send us your comments and feedback via http://www.linuxjournal.com/contact.
PHOTOS
Send your Linux-related photos to [email protected], and we'll publish the best ones here.
LINUX JOURNAL
on your Android device
www.linuxjournal.com/android
For more information about advertising opportunities within Linux Journal iPhone, iPad and
Android apps, contact John Grogan at +1-713-344-1956 x2 or [email protected].
diff -u
What’s New in
Kernel Development
Filesystem capabilities are supposed to be an improvement over simply
running something as the root user. The idea is that you identify the
specific special powers a program needs and then give it the ability to do
only those special powers. Unfortunately, capabilities have become very
complicated, with some individual capabilities being used to grant so
many special powers that they might as well just be the root user after all.
In particular, kernel developers who create new powers don’t always
know of which capability that power should be a part, so any given
capability can end up providing either too much or too little power to
the program.
Michael Kerrisk recently began an effort to document some basic
guidelines to help developers figure out which capability would best house
any particular new power. For example, “Don’t choose CAP_SYS_ADMIN
if you can possibly avoid it!” Apparently CAP_SYS_ADMIN has become
a huge dumping ground for powers of all sorts, falling prey to the
might-as-well-be-root syndrome.
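To make the “dumping ground” problem a little more concrete, here is a small Python sketch (an illustration only, not part of Michael’s guidelines) that decodes a capability bitmask of the kind shown in the CapEff field of /proc/<pid>/status. The bit numbers are the kernel’s; the function names, the short table of capabilities and the sample mask are assumptions made for this example.

```python
# A few capability bit numbers from the kernel's capability header.
# This table is deliberately partial; the kernel defines many more.
CAPS = {
    "CAP_CHOWN": 0,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_ADMIN": 21,
}


def decode_caps(mask_hex):
    """Return the names (from CAPS) whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return sorted(name for name, bit in CAPS.items() if mask & (1 << bit))


def effective_caps(status_text):
    """Find the CapEff line in /proc/<pid>/status text and decode it."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []


# Hypothetical mask with only bits 10 and 21 set.
print(decode_caps("0000000000200400"))
```

A program that needs only to bind a low port would ideally show CAP_NET_BIND_SERVICE and nothing else; seeing CAP_SYS_ADMIN in the list is a hint that it holds far more power than it needs.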
Unfortunately, Casey Schaufler pointed out some POSIX history that
led to poor decisions being made early on, regarding how to organize
filesystem capabilities. For example:
PENGUIN
COMPUTING
Visit: www.flaggmgmt.com/linux
Show & Conference: Flagg Management Inc, Natalia Vassilieva Ed Turkel Don Clegg Pat McGinn
353 Lexington Avenue, New York 10016 (212) 286 0333 fax: (212) 286 0086 Sr Research Manager, HPC Strategist, VP Mktg Bus Dev BBA/IB CITP VP Prod Mktg
Hewlett Packard Labs Dell EMC Supermicro (Invited) CoolIT Systems
[email protected]
Android Candy: My
World, in a Lock Screen
Non-Linux FOSS:
File Spelunking
with WinDirStat
With Linux, it’s fairly easy to find the large files on your system by doing
something like this:
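One portable way to run that kind of search, sketched here in Python rather than as a shell one-liner, is to walk a directory tree and keep the biggest files. The function name and the default count are assumptions for illustration; the same code runs on Windows as well as Linux.

```python
import heapq
import os


def largest_files(root, count=10):
    """Return up to `count` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # unreadable or vanished file; ignore it
    return heapq.nlargest(count, sizes)


if __name__ == "__main__":
    for size, path in largest_files(".", count=5):
        print(f"{size:>12} {path}")
```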
That’s where something like WinDirStat comes into play. It’s a file browser that uses incredible GUI elements to show you the files on your system, with file size shown as rectangles. Big files are shown as big rectangles, and their file types are specified by color. It’s a great visual way to sort your filesystem and get rid of (or at least find) extremely large files.
If you use Windows on a regular basis, but seem to have a shrinking hard drive, I urge you to download WinDirStat to get real-time statistics on your filesystem. It’s open source and, of course, free to
download at https://windirstat.net. —Shawn Powers
Archive
1994–2016
NOW
AVAILABLE!
SAVE $10.00
by using
discount code
2017ARCH
at checkout.
Coupon code
expires 3/28/2017
www.linuxjournal.com/archive
Jmol: Viewing
Molecules with Java
Let’s dig back into some chemistry software to see what kind of work you can do on your Linux machine. Specifically, let’s look at Jmol, a Java application that is available as both a desktop application and a web-based applet (http://jmol.sourceforge.net).
You can use Jmol to help analyze the results you get from other software packages that actually calculate the chemical effects you are researching. It can read in dozens of different file formats, and you can use it to visualize everything from small molecules to huge
macromolecules, like proteins. You also can visualize crystals and
orbitals. You even can visualize animated events, such as chemical
reactions and molecular vibrations.
Most Linux distributions should have Jmol available within their
package management repositories. For example, you can install it on
Figure 1. When you first start Jmol, you get a blank workspace ready for your work.
If you want to use the latest and greatest version, download it from the main project website. The download comes as a simple zip file containing everything you need to run Jmol. You also will need to install a Java virtual machine in order to run Jmol.
If you installed Jmol from the package manager, you probably will have a script available that will make running Jmol easier. If you install it from the binary zip file, you will need to run it manually by calling Java and using the JAR file as a command-line option.
When you first start Jmol, you’ll see a blank screen ready for input. Across the top is a series of icons allowing for easy access to the key functions available within Jmol. If you already have data files to analyze, you can use them. Otherwise, you may need some sample files in order to play with the functionality available.
The binary distribution doesn’t include any sample files in order to save
Figure 2. The basic display you get when you load a molecule is a ball and stick display.
Figure 3. When you load an animation, it starts with a static image of your molecule.
Figure 4. JSpecView is an extra tool available for looking at the spectra of molecules.
Figure 6. Jmol provides a full scripting language to help automate your analysis steps.
Figure 7. You can use Jmol to generate Gaussian input files based on your analysis.
Gaussian input file that you can then use to run further simulations of your molecule.
With these tools, you easily can share your research results with
others and build on the work you are doing. —Joey Bernard
Read a Book in the Blink of an Eye!
EDITORS’ CHOICE
I love reading. Sadly, the 24 hours I get per day seems to be inadequate
for the tasks I need to accomplish. That might change as my teenagers
turn into college kids and then begin to start families of their own. For
now, however, between drama class and basketball practice, it seems
like it takes about 30 hours to accomplish a 24-hour day. Needless to say,
Unsupervised Learning
REUVEN M. LERNER
In this article, Reuven moves from supervised learning to unsupervised learning, where you ask the computer to tell you something interesting about the data.
Reuven M. Lerner, a longtime Web developer, offers training and consulting services in Python, Git, PostgreSQL and data science. He has written two programming ebooks (Practice Makes Python and Practice Makes
data, in other words, to “learn” what the meaning of the data is, what relationships it contains, which features are of importance, and which data records should be considered to be outliers or anomalies.
Unsupervised learning also can be used for what’s known as “dimensionality reduction”, in which the model functions as a preprocessing step, reducing the number of features in order to simplify the inputs that you’ll hand to another model.
In other words, in supervised learning, you teach the computer about
your data and hope that it understands the relationships and categorization
well enough to categorize data it hasn’t seen before successfully.
In unsupervised learning, by contrast, you’re asking the computer to tell
you something interesting about the data.
This month, I take an initial look at the world of unsupervised learning. Can a computer categorize data as well as a human? How can you use Python’s scikit-learn to create such models?
Unsupervised Learning
There’s a children’s card game called Set that is a useful way to think about machine learning. Each card in the game contains a picture. The picture contains one, two or three shapes. There are several different shapes, and each shape has a color and a fill pattern. In the game, players are supposed to identify three-card groups of cards, using any one of those properties. Thus, you could create a group based on the color green, in which all cards are green in color (but contain different numbers of shapes, shapes and fill patterns). You could create a group based on the number of shapes, in which every card has two shapes, but those shapes can be of any color, any shape and any fill pattern.
The idea behind the game is that players can create a variety of different groups and should take advantage of this in order to win the game.
I often think of unsupervised learning as asking the computer to play a game of Set. You give the computer a data set and ask it to divide that large bunch of data into separate categories. The model may choose any feature, or set of features, and that might or might not be a feature that humans would consider to be important. But it will find those connections, or at least try to do so.
One of the most common machine-learning models for beginners
%pylab inline
import pandas as pd
from pandas import DataFrame, Series
from sklearn.datasets import load_iris
iris = load_iris()
df = DataFrame(iris.data, columns=iris.feature_names)
df['response'] = iris.target
Creating a Model
Once you’ve loaded the data, it’s time to create a model. You’re looking
to do what’s known as “clustering”, which means that the computer will
divide the data set into categories or clusters.
So, now what? In supervised learning, you would create a new model from a classifier and then train it using scikit-learn’s “fit” method. You then could give the trained model one or more data points and ask it to categorize those based on the model.
In unsupervised learning, it’s a bit trickier; after all, you’re asking the computer to do the categorization. If you don’t have any pre-labeled categories, it’s going to be hard to know whether your model is useful, accurate or both.
But before getting into the evaluation, let’s build a model. Sklearn comes with a number of classifiers that handle clustering. One popular classifier is known as “K-means”. In K-means clustering, the idea is that the model puts each data point inside the cluster whose mean is the closest. Thus, if there are three clusters, each cluster will contain points that are calculated to be closest. The “inertia” is a measurement of how coherent the groups are, that is, how closely associated with one another the elements that have been grouped together fit.
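To make the idea concrete, here is a minimal plain-Python sketch of the K-means loop and of inertia as the sum of squared distances from each point to its cluster’s mean. This is not scikit-learn’s implementation; the function names, the starting centers and the six sample points are assumptions chosen only to illustrate the technique.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
        else:
            new_centers.append(old)  # an empty cluster keeps its old center
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Alternate assign/update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers


def inertia(points, labels, centers):
    """Sum of squared distances from each point to its cluster's center."""
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, inertia(points, labels, centers))
```

The tighter the groups, the smaller the inertia, which is why a lower value generally indicates more coherent clusters for a given number of groups.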
I should note that because K-means uses distances to calculate how to compose a group, you likely will want all of your features to be on the same scale. In the case of the flowers, all are within the same order of magnitude. But you can imagine that if three measurements are on a small scale and a fourth is on a scale that runs into the millions, the calculations might not work out as well. For this reason, it can be a good idea to use a scaler (several of which come with sklearn) to put all of your data onto the same scale. Such scaling is often important when creating models; it helps the calculations to identify two or more items as being close by.
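As a rough illustration of what a scaler does, the following plain-Python sketch rescales every column to the range 0 to 1 (often called min-max scaling). It is a stand-in for the scalers that ship with sklearn, not a copy of them; the function name and the sample rows are assumptions.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` so its values run from 0.0 to 1.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# One feature in single digits, another in the millions.
rows = [[1.0, 1000000.0], [2.0, 2000000.0], [3.0, 3000000.0]]
print(min_max_scale(rows))
```

After scaling, both features contribute comparably to the distance calculation instead of the larger one drowning out the smaller.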
So, using Python’s scikit-learn, you can say:

from sklearn.cluster import KMeans
k = KMeans(n_clusters=3)
The above code indicates that you’re going to use the K-means algorithm. You create a new model, indicating when you do so that you want three groups.
Now, right away, you might be asking yourself how to know that there will be three categories, and the cop-out answer is that you guess. You can try different values for n_clusters and evaluate the model to see how well it does. But in many cases, you’ll have to experiment a bit.
Let’s now run K-means on the data. The X (that is, input matrix) is going to be the data frame minus the “response” column. You can create that as follows:
X = df.drop('response', axis=1)
With supervised learning, the “fit” method is the process in which you teach the model to make associations between the input matrix (X) and the output vector (y). In unsupervised learning, you’re asking the model itself to make such divisions and to create an output vector. You do this with “fit”:
k.fit(X)
k.inertia_
Again, notice the trailing underscore. The value that I get is
The inertia value isn’t on a scale; the general sense is that the lower the
# Fit one model per cluster count and record the inertia of each.
output = []
for i in range(2, 20):
    model = KMeans(n_clusters=i)
    model.fit(X)
    output.append((i, model.inertia_))

kmeans = DataFrame(output, columns=['i', 'inertia'])
Series(k.labels_).value_counts()
I get:
2 62
1 50
0 38
I should note that this now falls under the category of “semi-supervised learning”, that is, trying to see whether an unsupervised technique can achieve the same results, or at least similar results, to a previously used supervised technique.
In such a case, you can evaluate your model using not just statistical tests, but also one of the techniques I described in my previous articles on supervised learning, namely train-test-split. You use unsupervised learning on a portion of the input data and then predict on the remaining part. Comparing the model’s outputs with the expected outputs for that subset can help you evaluate and tune your model.
A Different Algorithm
But in this case, let’s try using a different model to achieve a different result, simply to see how easily sklearn allows you to try different models. One common choice in unsupervised learning is Gaussian Mixture, known in previous versions of scikit-learn as GMM. Let’s use it:
Now, let’s have the model predict with the data used to train it, which
will return a NumPy array with the categories:
model.predict(X)
How did that do? Let’s pop this data into a Pandas Series object and
then count the values:
Series(model.predict(X)).value_counts()
2 55
1 50
0 45
metrics.homogeneity_score(labels_true, labels_pred)
0.89832636726027748
metrics.completeness_score(labels_true, labels_pred)
0.90106489086402064
Hey, pretty good! Not perfect, that is, but not bad at all. And if you compare this against the K-means model:
labels_pred = k.labels_
metrics.homogeneity_score(labels_true, labels_pred)
0.75148540219883375
metrics.completeness_score(labels_true, labels_pred)
0.76498615144898152
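For readers curious about what those two scores measure, here is a plain-Python sketch based on the usual entropy definitions: homogeneity is 1 - H(C|K)/H(C) and completeness is 1 - H(K|C)/H(K), where C is the true labeling and K the predicted clustering. This is not scikit-learn’s code; the function names and the tiny example labelings are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels), using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (count / n) * math.log(count / given_counts[g])
        for (g, _t), count in joint.items()
    )


def homogeneity(labels_true, labels_pred):
    """1.0 when every predicted cluster contains a single true class."""
    h_true = entropy(labels_true)
    if h_true == 0:
        return 1.0
    return 1.0 - conditional_entropy(labels_true, labels_pred) / h_true


def completeness(labels_true, labels_pred):
    """1.0 when every true class lands in a single predicted cluster."""
    return homogeneity(labels_pred, labels_true)


# Each true class is split across two clusters: pure, but not complete.
labels_true = [0, 0, 1, 1]
labels_pred = [0, 1, 2, 3]
print(homogeneity(labels_true, labels_pred), completeness(labels_true, labels_pred))
```

Both scores ignore the actual label values, so a clustering that calls the groups 2, 0, 1 instead of 0, 1, 2 is not penalized for the renaming.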
Conclusion
In many ways, unsupervised learning is the true magic and potential in
the machine-learning world. By using computers to identify patterns and groups in your data more quickly and accurately than you could do yourself, you can start to identify and predict all sorts of things. As with
RESOURCES
I used Python (http://python.org) and the many parts of the SciPy stack (NumPy, SciPy, Pandas,
Matplotlib and scikit-learn) in this article. All are available from PyPI (http://PyPI.python.org) or
from https://www.scipy.org.
I recommend a number of resources for people interested in data science and machine learning.
I am a big fan of podcasts, and I particularly love “Partially Derivative”. Other good ones
are “Data Stories” and “Linear Digressions”. I listen to all three on a regular basis and
learn from them all.
If you’re looking to get into data science and machine learning, I recommend Kevin
Markham’s Data School (http://dataschool.org) and Jason Brownlee’s “Machine Learning
Mastery” (http://machinelearningmastery.com), where he sells a number of short and dense,
but high-quality ebooks on these subjects.
Image Manipulation with ImageMagick
DAVE TAYLOR
Dave switches gears this month and begins delving into the more functional topic of image manipulation.
on UNIX and Linux systems for a really long time. He’s the author of Learning Unix for Mac OS X and Wicked Cool Shell Scripts. You can find him on Twitter as @DaveTaylor, or reach him through his tech Q&A site: http://www.AskDaveTaylor.com.
$ convert -version
Version: ImageMagick 6.9.6-6 Q16 x86_64 2016-12-31 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2016 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules
Delegates (built-in): bzlib djvu fftw fontconfig freetype gslib jbig jng jp2 jpeg lcms ltdl lzma openexr png ps tiff webp x xml zlib
If you don’t have it installed, it can be quite a task to get it all up and running. Everything lives at http://www.imagemagick.org, which is where you want to get started.
On a Linux system, you can use the package manager of choice for your distro. You can grab a compressed tar image from the site, or you can use rpm, like this:
which you can’t do until you install Xcode (free from Apple; get it through the App Store). Once you’ve installed Xcode and MacPorts, you can install ImageMagick, and you’re good to go.
You know you’re good to go when the test command convert -version returns something meaningful. As always, when you install new software, you’ll want to log out and log in again for the PATH changes and shell command-line hash to include all the newest programs.
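If you would rather check from a script, a small Python sketch along these lines looks for the command on the PATH and returns the first line of its version output. The function name is an assumption for illustration, and newer ImageMagick 7 installations may provide a magick command instead of (or alongside) convert.

```python
import shutil
import subprocess


def imagemagick_version(command="convert"):
    """Return the first line of `<command> -version`, or None if not found."""
    path = shutil.which(command)
    if path is None:
        return None  # the command is not on the PATH
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = imagemagick_version()
    print(version if version else "ImageMagick does not appear to be installed")
```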
Among the most common formats that you’ll actually encounter in your day-to-day computer usage are the following:
ImageMagick knows oodles of other formats too, including all the major video formats (MKV, MP4, AVI, MOV). It also can convert things like EPSF (Encapsulated Postscript) and even PDF (Portable Document Format), which can be useful in specific instances.
Armed with that knowledge, conversion between image file formats is really ridiculously simple. Let’s say you want to convert an image from
Since the ImageMagick utilities are glob-aware (that is, you can use wild cards and specify multiple filenames), you also can convert a group of GIF images to JPG with the convert command or, more easily, with its cousin mogrify:
Let’s give it a whirl with a folder that contains a half-dozen GIF images, using ls to show the folder contents before and after the mogrification (is that a word?):
$ ls -s
total 272
8 add-to-google-reader.gif 24 blogger-1.gif
8 dave.gif 8 add-to-newsgator.gif
24 blogger-2.gif 176 manga.gif
16 aw-logo.gif 8 blogger-3.gif
$ mogrify -format jpg *gif
$ ls -s
total 752
8 add-to-google-reader.gif 24 blogger-1.gif
8 dave.gif 8 add-to-google-reader.jpg
112 blogger-1.jpg 8 dave.jpg
8 add-to-newsgator.gif 24 blogger-2.gif
176 manga.gif 8 add-to-newsgator.jpg
128 blogger-2.jpg 168 manga.jpg
16 aw-logo.gif 8 blogger-3.gif
24 aw-logo.jpg 24 blogger-3.jpg
Simple enough. Use convert for individual images and mogrify for bulk conversions. It’d be an easy script to differentiate between these two cases and invoke the correct command with the correct arguments too. I’ll leave that up to you.
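One possible shape for such a script, sketched in Python rather than shell, is a function that picks the command from the number of files. It only builds the argument list and leaves running it to the caller; the function name is hypothetical, and this is just one way to take up the exercise, not Dave’s own solution.

```python
import os


def conversion_command(files, fmt):
    """Build a convert command for one file, or mogrify for several."""
    if not files:
        raise ValueError("no input files given")
    if len(files) == 1:
        source = files[0]
        target = os.path.splitext(source)[0] + "." + fmt
        return ["convert", source, target]
    # mogrify writes one new file per input, named with the new extension
    return ["mogrify", "-format", fmt] + list(files)


print(conversion_command(["dave.gif"], "jpg"))
print(conversion_command(["dave.gif", "manga.gif"], "jpg"))
```

Passing the returned list to subprocess.run would execute it without involving the shell, which sidesteps quoting problems with odd filenames.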
$ file manga*
manga.gif: GIF image data, version 89a, 358 x 313
manga.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 358x313, frames 3
manga.png: PNG image data, 358 x 313, 8-bit/color RGB, non-interlaced
That’s better. It’s consistently the third parameter, which means that a simple script can strip out everything but the image dimensions:
$ for image in manga*; do identify "$image" | cut -f1,3 -d' '; done
manga.gif 358x313
manga.jpg 358x313
manga.png 358x313
Easy enough, and notice that the cut command is invoked both with a space as the field delimiter (overriding the default tab) and specifying that you want fields 1 and 3 but none of the others.
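The same field-picking can be sketched in Python for cases where you want the width and height as numbers. The sample line below is an assumed example of identify output, not copied from a real run, and like the cut version this breaks on filenames that contain spaces.

```python
def name_and_size(identify_line):
    """Return (filename, width, height) from one line of identify output."""
    fields = identify_line.split()
    name, geometry = fields[0], fields[2]  # fields 1 and 3, counting from 1
    width, height = (int(part) for part in geometry.split("x"))
    return name, width, height


# Assumed sample line in the usual "name format WxH ..." layout.
sample = "manga.gif GIF 358x313 358x313+0+0 8-bit sRGB 256c 30KB 0.000u 0:00.000"
print(name_and_size(sample))
```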
Sysadmin 101: Automation
KYLE RANKIN
Approach automation in the right way, and you might find you’ve automated yourself out of a job.
Kyle Rankin is a Sr. Systems Administrator in the San Francisco Bay Area and the author of a number of books,
mistakes, and if it’s something you do every day, eventually you even may stop paying attention to whether your task succeeded. Also, the way that you may perform a certain task might be a little bit different from how a different administrator on the team does it. By automating a task, the team can agree on the ideal way to perform it and know that when you run your automation script, it is performed the same way every single time with no skipped steps or commands run in the wrong order.
3) Automation allows everyone on the team to be productive.
With automation, you can take even a complex process and reduce it down
to a command. That command then becomes something that anyone on
the team can run, whereas the complex process may have required more
senior members of the team. For instance, if you take production software
deployment as an example, often there can be a complex arrangement of
triggering load balancer and monitoring maintenance modes, software
versions to check, mirrors to sync up, and services to restart and test. Even
though these individual steps may be mundane, combined, they become pretty
complicated and could overwhelm a junior member of the team, especially
when production uptime hangs in the balance. By automating that process,
senior administrators can put all of their expertise into creating the right
process that performs the right checks, and they can go on vacation knowing
that anyone else on the team now can perform the task the right way.
4) Automation reduces documentation workload. Often instead
of automating a task, a sysadmin team will spend time documenting a
process. There is still an important place for documentation, and in the
next section, I discuss when that makes sense and when it doesn’t. The
fact is though, if you take an entire process and put it into a single
automated task, you no longer need a full wiki page of documentation
that inevitably will become out of date, because you’ve reduced it down
to “run this command”. Because the process is now automated, you also
know the process is kept up to date; otherwise, the script wouldn’t work.
1) Routine tasks. In general, tasks that you perform frequently (at least
monthly) are good candidates for automation. The more frequent the
task, in theory, the more time-savings you would get from automating it.
Tasks that you perform only once a year may not be worth the effort to
build automation around, and instead, those are the kinds of tasks that
benefit from good documentation.
2) Repeatable tasks. If you could document a process as a series of
commands, and then copy and paste them one by one in a terminal and
the task would be complete, that’s a repeatable task that may be a good
candidate for automation. On the other hand, one-off tasks that have
custom inputs or are something you may never have to do again aren’t
worth the time and effort to automate.
3) Complex tasks. The more complex a task, the more opportunities
you have for mistakes if you do it manually. Tasks with multiple steps, in
particular steps that require you to take the output from one step and use
it as input for another, or steps that use commands with a complex string
of arguments, are all great candidates for automation.
4) Time-consuming tasks. The longer the tasks take to complete
(especially if there are periods of running a command, waiting for it to
complete, and then doing something with that command’s output), the
better a candidate it is for automation. OS installation and configuration
is a great example of this, as when you install an OS, there are periods
when you enter installation settings and periods when you wait for the
installation to complete. All of that waiting is wasted time. By automating
long-running tasks, you can go do some other work and come back to the
automation (or better, have it alert you) to see if it is complete.
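As a rough illustration of the "have it alert you" idea, here is a minimal sketch of a wrapper that runs any long task and reports how it ended. The name run_and_alert is hypothetical, and the "alert" is only a line on standard output; mail, a pager or a chat hook could replace the echo, but the column does not name a mechanism, so none is assumed here:

```shell
# run_and_alert COMMAND [ARGS...]: run a long task, then report the outcome.
# The alert is just a message on stdout in this sketch.
run_and_alert() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    if [ "$status" -eq 0 ]; then
        echo "DONE: '$*' succeeded after $((end - start))s"
    else
        echo "FAILED: '$*' exited with status $status after $((end - start))s"
    fi
    # pass the task's exit status through so callers can still test it
    return "$status"
}

run_and_alert sleep 1
```

Because the wrapper returns the task’s own exit status, it can sit inside a larger script without hiding failures.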
“in the wild” before I start automating it. I find I usually need to perform
a task a few times to understand where automation makes the most
sense, what areas of the task may require extra attention and what sorts
of variables I might encounter for the task. Otherwise, if I just charge
ahead and write a script, I may find myself rewriting it from scratch a
few weeks later, because I discover the process needs to be adapted to a
new variation of the task. If I’m not quite sure about parts of a process, I
may automate only the parts I am sure of first and get those right. Later
on when the rest of the process starts to gel in my mind, I then go back
and incorporate it into the automation I’ve already completed.
I also avoid automating tasks if I’m not sure I can do so securely.
For instance, a number of organizations are big fans of using ChatOps
(automating tasks using bots inside a chatroom) for automation. Although
I know that many bots can authenticate tasks before they perform
them, I still worry about the potential for abuse with a service that’s
usually shared across the whole company, not to mention the fact that
production changes are being triggered by a host outside the production
environment. With my current threat model, I have to maintain strict
separation between development and production environments, so having
a bot accessible to anyone in the company, or having a Jenkins continuous
integration server in the development environment performing my
production tasks, just doesn’t work. In many cases, I have fully automated
tasks up to the point that it still requires an administrator with the proper
access to go to the production environment (thereby proving that they are
authorized to be there) before they push “the button”.
The Post-TV Age?
SHAWN POWERS
I have lots of streaming packages, but I just can’t
seem to cut the cord!
Shawn Powers is the Associate Editor for Linux Journal. He’s
also the Gadget Guy for LinuxJournal.com, and he has an
interesting collection of vintage Garfield coffee mugs. Don’t
let his silly hairdo fool you, he’s a pretty ordinary guy
and can be reached via email at [email protected].
Or, swing by the #linuxjournal IRC channel on Freenode.net.
The most basic cable package from Charter (Spectrum?) costs me more
than $70 per month, and that’s without any equipment other
than a single cable card. It’s very clear why people
have been cutting the cord with cable TV companies.
But what options exist? Do the alternatives actually
cost less? Are the alternatives as good? I’ve been
trying to figure that out for a few months now, and
the results? It depends.
The idea of cord cutting isn’t new. For years, people
have been severing their ties with cable companies in
order to save money. The ever-persistent question is
this: how do the options compare?
of media in question. Services like Netflix, Amazon Prime and Hulu are
great, but they don’t provide live television. In fact, depending on the
show and service, you might need to wait until the next day or even the
end of a season before your desired shows are available. You usually get
the advantage of no commercials, but the waiting often is unbearable if
you’re into television shows that end with cliffhangers.
It is interesting, though, now that Netflix and Amazon have been so
successful with their streaming services, they’re beginning to get their
own exclusive shows. This means that not only are the shows not delayed,
but they’re also actually not available at all via cable TV. Admittedly, that
phenomenon is fairly new (only the last few years), but it makes the case
for streaming far stronger. Why pay a monthly cable bill and still not get to
watch Jessica Jones?
Also, many individual stations are starting to offer their own streaming
options, so the days of paying for cable so you can see a particular HBO
show are over. Broadcast networks are starting to offer streaming options
too, so if you’re just looking for the ability to watch particular television
shows, even paying for multiple online accounts is cheaper than paying
for cable, usually.
Figure 1. Sling TV has been around a long time, but the lack of DVR and video glitches
make it less than stellar in my experience.
PBS, FOX, but for most of the country, you get those channels only “on
demand”, which means recordings of popular shows the next day.
The technology details of Sling TV are a little confusing. If you subscribe
to the lowest tier, you can stream only one channel per account at a time.
That means if you are watching TV in your living room, you can’t watch
something else on your phone. If you subscribe to a higher-priced tier,
you can have up to three streams at once. Also, although the streams
usually are good quality, my anecdotal experience shows that there are a
few more artifacts and glitches with Sling TV than with the other options,
but nothing that makes it a showstopper. (I get glitches with my cable
television too, so nothing is perfect.)
There’s a free trial with Sling TV, so it’s worth checking out. Just be sure
to cancel it before your credit card auto-renews at the end of the trial,
unless you decide to keep it. Also, because it’s been around for a long
time, Sling TV has apps on multiple platforms. Xbox users can install Sling
TV, along with Android TV and Roku users. Like most streaming services,
Roku does a great job of staying vendor-neutral, which means it usually
can provide services regardless of who is providing them.
PlayStation Vue
PlayStation Vue is a bit more of a surprise, since Sony PlayStation
is synonymous with gaming rather than television. Its offerings are
impressive, however. The lineups are similar to Sling TV, but the
breakdowns are a little different. The lowest-price service is billed
per month, with other tiers available that add more channels.
Sony gives you a price break if you’re not in one of the cities that has
local channels available, so for me in rural Michigan, it’s cheaper than
if I lived in Chicago. That means I don’t get local channels though,
which is frustrating.
Although the slightly higher price seems frustrating, the technology
included might make up for it. Not only can you stream to five devices
simultaneously, but it also provides “Cloud DVR”, which automatically
stores recorded content for you. All you need to do is mark a program
as a favorite, and all episodes are saved for a limited number of days. It’s
not possible to schedule a timed event, but the DVR feature is extremely
nice, and it provides a far better experience than the live-only Sling TV.
Figure 2. PlayStation Vue is remarkable, until it’s not. The video quality is amazing, and
the DVR is superb. The geolocation frustrations along with PS4 console problems make it
difficult to love.
big issue I’ve had with PlayStation Vue is that it’s not possible to watch
streams from the same account on two different PlayStation consoles. I
have a console in my office and a PlayStation Pro in the living room, and it’s
not possible to watch Vue on both devices. That is particularly frustrating,
because watching on multiple Roku units works fine, but not on the actual
Sony hardware. There’s also some frustration with geolocation. Sony often
thinks I’m not home, so it limits what I can watch. I would understand if my
IP address changed, but I have a static IP address and I’m always connecting
from home. (See the notice in Figure 2.)
DirecTV Now
DirecTV Now is the new kid on the block when it comes to cable TV
streaming. The packages are similar to the other services I mentioned,
with some initial low-priced options available to entice users away. (Note,
with all these services being contract-free, the potential for moving in
order to save a few bucks is very legitimate.) DirecTV Now has similar
limitations regarding live broadcast stations (that is, at the time of this
writing, there aren’t any available), but DirecTV Now has the additional
limitation that even on-demand content from CBS isn’t available. The
kerfuffle that DirecTV and CBS have been having extends to the streaming
service as well.
Figure 3. DirecTV Now is the new kid on the block. The $35/month is a trial cost and
likely will increase before this article is published.
I haven’t personally used the DirecTV Now service, because none of my
devices currently are supported. Apple TV is its main device, and you can
get one free if you pre-pay for three months of service. I have friends
who’ve used it though, and they say the quality is very good. Like Sling
TV, however, it doesn’t currently have any DVR capability.
Since DirecTV Now is new, it’s not fair to criticize its lack of hardware
support yet. Roku streaming is slated for a coming quarter, and it’s possible
other non-competitors will get apps as well. As is usually the case,
Roku likely will be one of the premier ways to watch streaming cable
TV service, because its compatibility will allow for service-hopping
without hardware reinvestment.
Figure 4. I want to love USTVnow, and perhaps now that there is a paid service, the reliability
will improve. I just hope it’s able to keep providing live broadcast channels in the US.
Rabbit Ears
Yes, obviously using an antenna is a great way to get local television. In
fact, you can head over to http://antennaweb.org and see what channels
are available in your area and what sort of antenna you’ll need. The site
even will tell you what direction to point your antenna for the best signal.
If you’re just looking for some old-fashioned television, an antenna is
often a good option. Plus, apart from the hardware, it’s totally free.
Figure 5. “Up to 0 channels” is a sad thing to see; I hope your location is better.
The problem is, even though I live in a small city, I get exactly zero
channels from my location. That is due to geography, because I live on
the side of a hill, but nonetheless, I can’t get any channels using even a
rooftop antenna. Even if you can, however, it’s worth considering whether
that sort of system is acceptable for you. I don’t want to switch my input
source on the television every time I want to watch TV. And TiVo has
spoiled me; I want to pause live TV. It’s possible to get something like an
HD Homerun device from Silicon Dust and convert your antenna signal
into a digital stream, but integrating that into your entertainment system
is often challenging. Plus, I had so much frustration with my HD Homerun
setup in our last house that I opted to just buy a cable TV subscription.
So OTA (over the air) channels are worth checking out, and for some
people, they are more than enough. For me, however, even if I could get a
good signal, I want more.
William Gurstelle’s ReMaking History, Volume 3 (Maker Media, Inc.)
In William Gurstelle’s ReMaking History series from Maker Media,
Inc., readers get exponentially closer to the inventors who shaped
our modern world compared to other histories of technology. This
is because Gurstelle doesn’t merely tell the stories of remarkable
inventors from the past; he gets into their fascinating minds by
illustrating how to make one’s own version of the inventors’ handiwork.
The new Volume 3 of ReMaking History, bearing the subtitle Makers
of the Modern World, explores the early modern era and builds on
the earlier two volumes covering pre-modern history to the Industrial
Age. In this volume, seven inventors and their technologies, destined
to fill basements and garages everywhere, include Alessandro Volta
and electroplating; Humphrey Davy and the first electric light; George
Cayley and the aeronautical glider; the Lumiere Brothers and the movie
projector; Rudolf Diesel and the automobile engine; Hans Goldschmidt
and the thermite reaction; August Möbius and the Möbius Strip; and
Louis Poinsot and loads, moments and torques.
http://oreilly.com
William Rothwell and Nick Garner’s Certified Ethical Hacker (CEH)
Complete Video Course (Pearson IT Certification)
Watch William Rothwell and Nick Garner’s new Certified Ethical Hacker
(CEH) Complete Video Course and learn everything you need to know
to ace the CEH exam in a matter of hours. Divided into five modules
and containing a complete overview of the topics in the EC-Council
Blueprint, Rothwell and Garner’s intermediate-level video-training
course helps viewers master the essentials needed to pass the exam.
The course commences with a general overview of security essentials,
followed by an exploration of system, network and web services
security and a dive in to wireless and internet security. To test one’s
chops, the course offers quizzes, exercises and two full practice exams.
By providing the breadth of coverage necessary to learn the full security
concepts behind the CEH exam, this video course helps prepare viewers
for a career as a security professional.
http://informit.com
Briggs & Stratton 8,000 Watt Elite Series Portable Generator
with StatStation Wireless
Although Linux Journal readers might not equate Milwaukee with tech, a new
Briggs & Stratton product portends the bright future of smartened “legacy”
devices from the industrial heartland. The Wisconsin-based maker of engines
and industrial products recently announced a smarter, and “techier”, way
to produce on-demand power in the form of the new Briggs & Stratton
8,000 Watt Elite Series Portable Generator with StatStation Wireless featuring
Bluetooth technology. The accompanying StatStation app for Android and
iOS, with support from Bluetooth connectivity, provides valuable remote
visibility into key metrics, such as fuel level and remaining runtime, runtime
meter, percent of available Watt consumption, maintenance reminders,
dealer locator, reference guides and how-to videos.
http://briggsandstratton.com
Please send information about
releases of Linux-related products
to [email protected]
or New Products c/o Linux Journal,
PO Box 980985, Houston, TX 77098.
Submissions are edited for length
and content.
BIG DATA DEMONSTRATOR
USING HADOOP TO BUILD A LINUX CLUSTER FOR LOG DATA ANALYSIS USING R
THIS ARTICLE WALKS THROUGH THE STEPS TO CREATE A HADOOP LINUX
CLUSTER IN THE CLOUD AND OUTLINES HOW TO ANALYZE DEVICE LOG
DATA VIA AN EXAMPLE IN THE R PROGRAMMING LANGUAGE.
This article describes why device log data analysis is useful and briefly
introduces the involved technologies and how they fit together.
Linux is the basis for the “Infrastructure as a Service” standard
that makes the proposed solution portable between cloud providers.
Furthermore, we describe the steps you need to go through to create a
Hadoop cluster based on Linux in an Amazon cloud. The steps involve
bash/install scripts placed in a GitHub repository that allows the automatic
installation of all the necessary components and configuration.
Big Data technology and the Internet of Things (IoT) are a strong
combination. The IoT is a great source of information, and Big Data
technology allows for analysis of vast amounts of data. Possible
applications are prediction, anomaly detection and device improvement/
development. The latter is the case we have been working on in order
to investigate why devices break. We need Big Data technology, because
classical single-server approaches were unable to process the large
amounts of data fast enough for an efficient analysis work cycle.
In this article, we use an example of data analysis of device log
data to illustrate how to use a demonstrator setup to find
unknown correlations between parameters in log data. However,
in this article, we will not go into the device details, but use an
abstracted device model approach.
DEMONSTRATOR OVERVIEW
The demonstrator setup consists of selected technologies (Figure 1),
developed scripts and installation instructions. This allows for the
reproduction of the setup.
Figure 1. Selected Technologies Layer Model
The so-called cloud is a dynamic market for computer resources.
Prices are decreasing over time,
but it is far from free, and a credit card is required to get a cloud account.
The cloud is important in Big Data analysis, because we require large
computer resources only from case to case, and a standing in-house
computer cluster is in many situations too expensive, especially for
smaller organizations to begin learning Big Data analysis.
When we want to analyze data, we select a cloud provider and
create a cluster, do the analysis and then afterward we destroy the
cluster. This way, we pay only for the computer resources used during
the data analysis.
We have selected Infrastructure as a Service (IaaS) as the cloud
technology, because it allows for portability of the demonstrator
between different cloud providers.
All the cloud providers that we know of offer virtual Ubuntu Linux
machines. Ubuntu is a well known Linux distribution, which is why we
developed the demonstrator installation script for Ubuntu.
We chose to work on Amazon Web Services (AWS), since it’s a well
established, stable business with well defined and documented interfaces,
and it offers a free tier that is very convenient for development work.
Hadoop is a kind of overlay operating system for a cloud cluster of
Linux computers. It handles all the resources in the cluster and allows
programs to be executed in a distributed manner. Hadoop is written
in Java and is rather memory-consuming for smaller
jobs and testing. Hadoop has been the de facto standard for Big Data
processing for the past ten years or so. Hadoop consists of a number of
components; see Figure 2.
At the bottom is the Hadoop Distributed File System (HDFS) that allows
for high data throughput. It handles the I/O bottleneck problem when
analyzing vast amounts of data. The data is spread out over the cluster,
and the idea is that data is processed where it is stored.
YARN (Yet-Another-Resource-Negotiator) is the central component that
allocates resources to Hadoop jobs. Keep in mind that this is no trivial
task, because a Hadoop cluster may contain a large number of nodes. YARN has
built-in logic to handle node and job failure in a graceful way. Nodes may
disappear and reappear on the cluster network, but jobs must be taken
over by other nodes in the meantime.
Map-Reduce is the component that handles the parallelization of
analysis tasks. Hard disk and network issues are abstracted away
from the developer in order to allow the developer to concentrate on
developing the analysis program. The Hadoop system handles these
issues automatically. Be aware that the Map-Reduce framework enforces
parallel programming by constraining the programming model, and that can
be difficult to get used to. There is a Map function that is responsible for
importing data and converting to the internal data format (key-value);
for example, the key is the device ID, and the value is a list of temperature
data points. There is also a reduce function (one per slave node) that
processes data with a certain key. This means in our case that all data
for one device will be processed by the same slave node.
Figure 2. The Main Components of Hadoop
The Hadoop system will feed the map function with data records. It may
be line by line or file by file. The system will distribute the load by dividing
the files among the slave nodes for “map” processing. There is no direct
filesystem access. The only way to output results is to emit a key-value
pair or a list of key-value pairs. There is no shared memory between map
instances; each map-node is doing the work on its own, hence allowing
for decoupled parallel processing of data. The developer has no control
over which node will process what data. The key-values emitted by map
instances are sorted by the system according to the key. Other than this,
you cannot assume any ordering of key-value pairs.
In order to speed up execution, the map function may be used to filter
out records that are not relevant for the analysis. The Reducer function
will receive a list of key-value pairs with a certain key for processing.
When done, the function emits key-value pairs. The Hadoop system
combines all the key-value pairs from the reducers into the output.
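The map, sort and reduce flow described above can be imitated on a single machine with an ordinary shell pipeline. This is only a sketch of the programming model, not of Hadoop itself: the input format ("device_id temperature" per line), the comment-line filter and the choice of a mean as the reduce step are all assumptions made for illustration.

```shell
# map: read raw log lines "device_id temperature", filter out records that
# are not relevant (here: comment lines and malformed lines), and emit
# tab-separated key/value pairs.
map() {
    awk '!/^#/ && NF == 2 { print $1 "\t" $2 }'
}

# reduce: the input arrives sorted by key, so all values for one device are
# adjacent; emit one (device_id, mean temperature) pair per key.
reduce() {
    awk -F'\t' '
        $1 != key { if (n) printf "%s\t%.1f\n", key, sum / n
                    key = $1; sum = 0; n = 0 }
        { sum += $2; n++ }
        END { if (n) printf "%s\t%.1f\n", key, sum / n }'
}

# sort stands in for the shuffle step that groups pairs by key
printf '%s\n' 'dev2 30' '# comment' 'dev1 20' 'dev2 32' 'dev1 22' |
    map | sort | reduce
```

Neither function touches the filesystem or shares state with the other, which mirrors the constraints the framework imposes on real map and reduce functions.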
On top of the Hadoop Linux cluster, we have chosen R as the data
analysis software. R is a generic math tool that provides a fast interactive
process, which is fundamental for data analysis. R is a high-level
programming language with many extension packages. This stems from the
fact that R is open source and has a large community. Among its packages
is data mining. R has a command line that allows an interactive process
and fits well with the UNIX environment scripting.
However, R is classic single computer software; therefore, the R package
Figure 3. Map-Reduce Functions in R
you should determine what and how you want to analyze. In other
words, you define your hypothesis and write an analysis program in R.
Then create your own R Hadoop cluster. We’ll describe the details in the
next section. When your R Hadoop Linux cluster is ready, you load your
data from the external server into the HDFS and run the analysis program
using the R command prompt.
When the results are ready, you should review them and copy the
result data to a storage server. If the results are not satisfactory, you
should change the analysis program and run it again on the cluster.
Finally, when done, you should destroy the cluster, since keeping disk
and CPU allocation will cost too much if you are not using it. However,
it may be sensible to keep the master template image if you want to
do more analysis in the future.
Figure 5. On the left is a screen dump of the Launch button, and on the right is the Ubuntu
Server used.
The script will install SSH, Hadoop, R and Rmr2 on the Ubuntu server.
Hadoop is installed in the “ubuntu” user’s home directory, and the
Hadoop data files (HDFS) will be placed in /tmp. In addition, it will copy
the Hadoop configuration data (basic configuration) from the EEdigi
GitHub repo to the Hadoop installation on the server. You can find further
code comments in the script. A variation point is, for instance, that you
can comment out the R part of the script before running it and install
other analysis software, such as Python.
Next, here are the steps to create the template.
Shut down the server with
#!/bin/bash
bash /home/ubuntu/auto-slave.sh
Wait some ten minutes for all the slaves to come online. If you
want, on the master, in another console window, check the online slaves
on the master during the process:
Break auto-config.sh (press Ctrl-C to break) to stop accepting more slaves.
Append the cluster-config file to /etc/hosts, because Hadoop requires
DNS names for slaves.
Update the Hadoop slaves list with the DNS names from the
cluster-config file, but remove the internet address information.
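The last two steps can be sketched as follows. This assumes the cluster-config file holds one "address name" pair per line (which is what appending it to /etc/hosts implies); the file names, addresses and helper names are hypothetical, and the sketch works on scratch copies in a temporary directory rather than on the real /etc/hosts or the real Hadoop slaves file:

```shell
# append_hosts CONFIG HOSTS: append the address/name lines to a hosts file
append_hosts() {
    cat "$1" >> "$2"
}

# slave_names CONFIG: print only the DNS names (second column), dropping
# the internet address information
slave_names() {
    awk '{ print $2 }' "$1"
}

# scratch files standing in for the real ones (names are placeholders)
work=$(mktemp -d)
printf '%s\n' '10.0.0.11 slave1' '10.0.0.12 slave2' > "$work/cluster-config"
printf '%s\n' '127.0.0.1 localhost' > "$work/hosts"

append_hosts "$work/cluster-config" "$work/hosts"
slave_names "$work/cluster-config" > "$work/slaves"
cat "$work/slaves"
```

On a real master, the same two helpers would be pointed at /etc/hosts (with root privileges) and at the slaves file inside the Hadoop configuration directory.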
In this section, you have now, on AWS, created a Linux Hadoop cluster
based on a single master installation. This installation has all the necessary
software tools and configuration, so that you can dive in to using and
testing the cluster; see the next section.
However, we have multiplied one of the signals in one of the log files
with a sine curve. Without knowing which log, we have designed the
R program to use a correlation function to find it. Correlation (cor) is
a mathematical function that, given two signals, will output a number
between 1 and -1. Zero means that the two signals are unrelated or
not detectable by the algorithm. A value above 0.75 or below -0.75 is
considered a significant correlation.
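To make the behaviour of such a correlation function concrete, here is a small sketch of the Pearson correlation coefficient written with awk. It is not the R cor function used in the article, only a stand-in that follows the same textbook formula; the helper name and the two-column input format are assumptions:

```shell
# cor: Pearson correlation of two signals read from stdin, one "x y" pair
# per line. Prints NA when one signal is constant (zero variance).
cor() {
    awk '{ n++; sx += $1; sy += $2
           sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
         END {
             num = n * sxy - sx * sy
             den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
             if (den == 0) print "NA"
             else printf "%.4f\n", num / den
         }'
}

printf '%s\n' '1 2' '2 4' '3 6' '4 8' | cor    # second signal rises with the first
printf '%s\n' '1 8' '2 6' '3 4' '4 2' | cor    # second signal falls as the first rises
```

The first call prints 1.0000 and the second -1.0000, the two extremes of the range; real log signals land somewhere in between, which is where the 0.75 threshold comes in.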
An R program using the cor function was able to find the data log
that was multiplied with a sine curve. It is difficult to tell which one
Figure 8. Illustrations of a couple of device log files. On the left is the sine signal. On the
right of this are two log file signals: V (middle) and VV (right).
cor(v,y)
[1] 0.5779192
cor(vv,y)
[1] 0.7557141
Enter the Hadoop directory, format the HDFS and start the whole
system (Hadoop daemons on the different machines). The following will
initialize all nodes:
Then you have to generate the test data. You should download and run
the script found in the EEdigi repo on GitHub:
Next, put the test data into HDFS, and for that, you make a directory
structure as required by the analysis program:
Now the cluster should be running with data loaded. You need to
download the EEdigi analysis program and run it after you first have set
up the Rmr2 environment in an R session. The important parts of the R
code are shown in Listing 1.
Get the EEdigi data analysis program:
> R
> source("etc/hadoop/hset.R")
> source("EEdigitest.R")
> q()
Listing 1. The “mapper” takes the input files and selects key/value before emitting. “Reducer”
analyses data with the “cor” function (correlation) and emits the result as value. Before printing,
the R analysis program selects the data with high (more than 0.75) significance.
> sbin/stop-all.sh
CONCLUSION
After reading this article, you should be able to set up a Hadoop
Linux cluster in a short amount of time. We have also provided a
way for you to test the cluster using R. The open-source tool chain
fits nicely together, and you’ll be able to learn about Big Data
analysis at no cost.
With the current cloud price structures, our recommendation is to
use a cloud cluster when you have a small budget for computing
and your need for it is transient. If at some point you are using
many machines constantly for an entire month, you should consider
building a local computer cluster. While doing the project, we learned
that buying 50 strong computers as hardware would have cost a
similar amount as renting as many strong computers for one month.
When using a free tier, remember that the cloud will cost you
money over time. It is the cloud provider’s business to charge for
using their computing resources. Cloud providers have extensive
price information, and they will charge you for every use of a virtual
server, for example, storage of hard disk images and machine
templates, data out of the data center region. Even when you
use the free tier, read your bill carefully for unexpected costs due
to a tiny mistake on your side, for example, using more storage
than included in the free tier. The best you can do is to make a
small budget for cloud usage in order to prepare you and your
organization’s mindset. Then you should run trials based on your
typical analysis tasks to determine what type of machines and which
provider is most suited.
Do not use the intuitively impressive-sounding Software as a
Service (SaaS), but stick to generic virtual machines, as we used in
this article. Otherwise, you most likely will be investing in learning
a proprietary SaaS interface and will lose the ability to switch cloud
providers without high switching costs. Managing Linux machines
(servers) for short periods should not be a problem in terms of
operational costs. In our experience, in-depth knowledge about Linux
and computer hardware performance is highly relevant when using
cloud computer resources. After all, you are given complete control of
a real CPU and RAM, if only for a short period.
ACKNOWLEDGEMENTS
This work was supported by The Southern Denmark Growth Forum
and the regional EU project “IoTstyring”, which is a project about
energy efficiency in embedded control systems using Big Data and IoT
technologies.
Rune Torbensen is Postdoc at University of Southern Denmark, SDU Mechatronics, Sønderborg, Denmark
and has an IoT PhD with a focus on wireless embedded communication from Aalborg University. He
has used embedded Linux in most of his experiments during the past ten years. Recently, he became
interested in Big Data technology due to its strong relation with IoT.
Søren Top, associate Professor PhD, is a lecturer at University of Southern Denmark, SDU Mechatronics,
Sønderborg, Denmark. He has taught operating systems and embedded systems for decades, and he uses
Linux for both topics on a daily basis.
RESOURCES
Hadoop: http://hadoop.apache.org
Rmr2: https://github.com/RevolutionAnalytics/RHadoop/wiki
Integrating Web Applications with Apache
Learn how to write your own custom Apache configurations
to make your applications work the way you want.
ANDY CARLSON
When you deploy a web application, how do end users access
it? Often, web applications are set behind a gateway device
through which end users can access it. One of the popular
products to act as an application gateway on Linux is the Apache Web
Server. Although it can function as a normal web server, it also has the
ability to connect through it to other web servers.
In this article, I discuss what it takes to integrate a web application
into Apache. This includes integrating the HTTP protocol functionality,
customizing content to render properly and reusing pieces of configuration.
Once you understand those basic bits of functionality, you’ll have the tools
you need to maximize your web application’s usability. So, let’s get started.
Input:
Name: Frank Sinatra
Genre: Jazz
Name: 2Pac
Genre: Rap
Name: Reel Big Fish
Genre: Ska
Regex pattern: "^Name: "
Output:
Name: Frank Sinatra
Name: 2Pac
Name: Reel Big Fish
This example searches the input text for text that matches the
pattern "^Name: ". This pattern says, “Look for the text ‘Name: ’ at
the beginning of each line”. Since there are three lines that begin with
that text, only those three lines are returned. While “^” represents the
beginning of a line, “$” represents the end of a line. So, if you were to
apply the pattern “a$”, two lines would be returned: Frank Sinatra and
Ska. Let’s expand on that example and use the input from Example 1
with a new pattern.
Example 2:
As you can see, I’ve taken the original regex pattern and added [0-9]
to the end. This will search for a single character that can be any number
from 0 to 9, which is why “2Pac” was the only line returned. You also can
specify a range with alphabetic characters ([a-z] or [A-Z]).
Along with pattern selection, you also can do substitution with regex.
There are two formats for regex substitutions: s|pattern|replace|modifier
or s/pattern/replace/modifier. In Apache, I find it easier to use the pipe
style substitution. Example 3 uses the same input with a new pattern.
Example 3:
This pattern has a lot to dissect. One of the great features of regex
is the ability to match any character. The dot operator (.) will match any
one character. The asterisk operator (*) will match 0 or more of whatever
character or operator preceded it. Putting these two operators together
(.*) matches 0 or more of any character. Enclosing this in parentheses allows
the matched text to be represented in the replace portion of the pattern
with a variable. In this case, \1 represents the first block of text within
parentheses and \2 represents the second. The only characters that
are explicitly being matched are “Frank”. As such, the lines containing
“Frank” will be replaced with everything up to “Frank” (represented
by \1), “Dwezil” and everything following “Frank” (represented by \2).
As you can see, the entirety of the text input was sent to the output,
although modified by the pattern.
Protocol Integration
When it is decided that an application would benefit from Apache
integration, there is a high likelihood that it will reside on a separate
server from Apache. To integrate applications being accessed via HTTP
fully, any or all of these modules may be used: mod_rewrite, mod_proxy,
mod_ssl and mod_headers. Each of these modules allows you to
customize the way communication between the end user and web servers
occurs, from modifying HTTP header data to managing proxy connections
to other servers.
First, let’s look at mod_rewrite. There are a number of directives
within the mod_rewrite module, but I cover only a handful here:
RewriteEngine, RewriteCond and RewriteRule. The RewriteEngine
directive simply enables URL rewriting and is invoked as follows:
RewriteEngine on
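With the engine enabled, a RewriteRule can answer a request with a
redirect. The following is a minimal sketch, assuming the rule sits in a
virtual host (where the pattern is matched against the URL path including
its leading slash). The exact pattern and flags are assumptions:

```apache
RewriteEngine on
# Match only the exact path /google and send the browser elsewhere.
# [R] issues an external redirect (302 by default); [L] stops rule processing.
RewriteRule "^/google$" "http://www.google.com/" [R,L]
```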
In this example, when the URL of /google is accessed, the server will respond
with an HTTP redirect that will send the user to http://www.google.com.
This example will work only if the request URL is exactly equal to “/google”.
If the need is to redirect on any URL starting with “/google”, you would
define a conditional redirect using RewriteCond as follows:
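One way to write such a conditional redirect is sketched here. It assumes
a virtual host context and the default 302 status, and the quoting style is
an assumption:

```apache
RewriteEngine on
# Condition: the request path begins with /google
RewriteCond "%{REQUEST_URI}" "^/google"
# Applied only when the condition above matched; ^.*$ accepts any path
RewriteRule "^.*$" "http://www.google.com/" [R,L]
```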
The RewriteCond directive has two parts: a string value to check and a
substring to search for. In this example, you are looking in the REQUEST_URI
HTTP session variable for anything beginning with “/google”. If that condition
is met, the RewriteRule on the following line is executed. Because you are
determining the value of the target URL in the RewriteCond, the value of the
target URL in the RewriteRule is defined as "^.*$".
The examples given here are all user-facing events, like a redirect.
The RewriteRule directive also can be used to proxy requests to a server.
This is done behind the scenes, unlike an HTTP redirect, so the request
is forwarded without the user’s knowledge. A proxied request may be
configured like the example below:
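A sketch of a proxied rewrite, assuming mod_proxy and mod_proxy_http are
loaded. The /app/ prefix is hypothetical, and the back-end host and port are
borrowed from the Substitute example later in this article:

```apache
RewriteEngine on
# [P] hands the rewritten URL to mod_proxy instead of redirecting the browser,
# so the user never sees the back-end address.
RewriteRule "^/app/(.*)$" "http://back-end01.test:8080/$1" [P]
```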
The first line of the header contains the method (GET in this case)
and the URL being requested. When the server receives the request
from the client, it strips off “/home” as specified in the ProxyPass
directive and forwards the request to the back-end server. If you
want to proxy response packets as well as request packets, the
following ProxyPassReverse statement can be paired with the
previous ProxyPass statement:
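A paired configuration might look like the following sketch. The back-end
address is an assumption; only the /home prefix comes from the description
above. ProxyPassReverse rewrites the Location, Content-Location and URI
headers of back-end redirect responses so that they point at the proxy
rather than at the back end:

```apache
# Requests for /home/... are forwarded to the back end with /home removed
ProxyPass        "/home/" "http://back-end01.test:8080/"
# Redirects issued by the back end are mapped back under /home/
ProxyPassReverse "/home/" "http://back-end01.test:8080/"
```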
This example suggests that within the /home folder, there are many
sub-folders (let’s say user names) and within each of those exists a folder
named “docs”. The USERNAME/docs URL exists on docserver.test
in the root of the web server, as denoted by the / in the server URL.
The ProxyPassReverse will function in the same manner as it did in the
previous example.
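A mapping of this shape can be expressed with ProxyPassMatch, which takes a
regular expression instead of a fixed prefix. The sketch below is one
assumed way to write such a rule; only /home, docs and docserver.test come
from the text, and the capture groups are illustrative:

```apache
# $1 is the user name, $2 is the remainder of the path below docs/
ProxyPassMatch   "^/home/([^/]+)/docs/(.*)$" "http://docserver.test/$1/docs/$2"
# Map back-end redirects to the user-facing /home/ prefix
ProxyPassReverse "/home/" "http://docserver.test/"
```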
Securing websites with SSL in Apache is accomplished with mod_ssl.
Although I won’t discuss configuring SSL from the ground up, a few
directives relate to proxied SSL connections: SSLProxyCheckPeerExpire,
SSLProxyCheckPeerName and SSLProxyCheckPeerCN. It is a common
practice to use self-signed certificates on back-end servers, provided a
valid cert is in place on the user-facing server, and these directives address
common issues that can arise when using self-signed certs. Any of these
directives can have one of two arguments provided: “on” or “off”. If set to
“off”, SSLProxyCheckPeerExpire will skip checking the expiration date
on the SSL cert used on a back-end server. To avoid checking a certificate’s
common name or alternate names against the server name used to access
a back end, set SSLProxyCheckPeerName to “off”. In older versions of
Apache, you might be able to use SSLProxyCheckPeerCN set to “off”
instead of SSLProxyCheckPeerName.
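Put together, a proxied connection to a back end with a self-signed
certificate might be configured as in this sketch. SSLProxyEngine must be
on before Apache will proxy to an https:// back end; the /secure/ prefix and
port 8443 are hypothetical:

```apache
SSLProxyEngine on
# Accept a back-end certificate even if it has expired
SSLProxyCheckPeerExpire off
# Do not compare the certificate's names with the back-end host name
SSLProxyCheckPeerName off
ProxyPass        "/secure/" "https://back-end01.test:8443/"
ProxyPassReverse "/secure/" "https://back-end01.test:8443/"
```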
Along with rewriting URLs, it may be necessary to rewrite HTTP
request or response header fields. In Apache, this is done with
mod_headers. There are only two directives within this module:
Header and RequestHeader. These directives are used to modify
response and request header fields, respectively. Many actions can
be used with either of these directives, but here let’s look at the set
and edit actions. For example:
This example will add (and replace any existing) header in an HTTP
response named ReceiveTime and give it the value of the UNIX timestamp
when the request was received by the server (represented by "%t").
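A directive matching this description would be the one-line sketch below.
Per the mod_headers documentation, %t expands to "t=" followed by the time
the request was received, in microseconds since the epoch:

```apache
# Add the header, replacing any existing ReceiveTime header in the response
Header set ReceiveTime "%t"
```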
If you need to replace the value of a header that comes from a
back-end server, you would use the edit action. Consider the
following example:
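As a sketch of the edit action, the rule below rewrites a Location header
that names the back end so that it names the user-facing site instead. The
host names follow the public.test and back-end01.test convention used later
in this article, and the choice of the Location header is an assumption:

```apache
# Header edit <name> <regex> <replacement>
# edit replaces the first match in the header value; edit* replaces every match
Header edit Location "^http://back-end01\.test:8080" "https://public.test"
```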
Content Integration
Once a remote application is integrated with an Apache server from a
protocol standpoint, it may be necessary to integrate content. This will
generally manifest itself as URLs coded into HTML or JavaScript that
are specific to a back-end server and not to a user-facing server. The
basic necessity is to be able to search and replace bits of HTML or
JavaScript content so that it can render and perform correctly when
accessed through an Apache proxy. The module that accomplishes
this is mod_substitute and specifically the Substitute directive.
Substitute allows a simple regex substitute to be performed on the
payload data of an HTTP response.
Something to consider before attempting to replace text is to
account for whether the back-end web server compresses data before
sending it over the network. If it does, your Substitute statements
might not work, as it will be searching for ASCII text within binary
compressed data. To account for this, you can instruct Apache to
decompress the data, manipulate the response and then re-compress
it. This is done using the SetOutputFilter directive, which is part of
Apache core functionality. Here’s how it works:
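SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
# i = case-insensitive match; the n flag is omitted because it would treat
# the pattern as a fixed string and disable the parenthesized groups
Substitute "s|(href=\"http)(://)back-end01.test:8080|$1s$2public.test|i"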
For this example, let’s assume that the user-facing site (public.test)
runs HTTPS and the back-end server (back-end01.test) runs HTTP
on port 8080. This would be a solution if the back-end web server
returned hyperlinks that were specific to itself as opposed to the
user-facing site. In the search portion of the regex substitute, this
splits out two groups of text in parentheses: (href=\"http) and
(://). These are blocks of text that you want preserved in
the replace section of the regex. In the replace, you are inserting
an “s” after “http” and replacing the hostname:port with the
user-facing site name. After processing, the resulting string will be
href="https://public.test. This will update hyperlinks that use
“href” attributes (<a> and <link>). For <img> and <script> tags, you
could use this same Substitute statement and replace “href”
with “src”. Another consideration would be to account for double
or single quotes delimiting attribute values (href=’ vs. href=").
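Both considerations can be folded into the pattern. The sketch below is an
assumed variant of the statement above: it targets “src” attributes and uses
a character class to accept either quote style. Only the i flag is used,
because the n flag makes mod_substitute treat the pattern as a fixed string:

```apache
SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
# [\"'] matches a double or a single quote; the quote that was matched is
# preserved because it is part of the first group ($1)
Substitute "s|(src=[\"']http)(://)back-end01.test:8080|$1s$2public.test|i"
```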
Another application of Substitute is to extend the functionality
of a page without manipulating the original source code. Consider
the following example:
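One way to do this, sketched under assumptions, is to inject a script
reference just before the closing head tag of every proxied page. The
/js/extra.js path is hypothetical and would have to be served by the
user-facing site:

```apache
SetOutputFilter INFLATE;SUBSTITUTE;DEFLATE
# Replace </head> with a script tag followed by </head>
Substitute "s|</head>|<script src=\"/js/extra.js\"></script></head>|i"
```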
be a great starting point and can help you get started with accessing
your applications through Apache.
Andy Carlson has worked in IT for the past 13 years doing networking and server administration. He is
thankful to have chosen a career that he loves, grows in and learns from. He and his amazing wife have
three daughters and a son, and they currently reside in Cincinnati, Ohio. He enjoys playing the guitar and
spending time with family and friends.
RESOURCES
The following are some articles I’ve found useful along with some example Apache
configs I’ve written.
The Problem with “Content”
Real journalism is getting programmatically corrupted and harder to find.
Fortunately, there’s a fix.
DOC SEARLS
Doc Searls is Senior Editor of Linux Journal. He is also a fellow with
the Berkman Center for Internet and Society at Harvard University and the
Center for Information Technology and Society at UC Santa Barbara.
Back in the early ’00s, John Perry Barlow
(https://en.wikipedia.org/wiki/John_Perry_Barlow)
said, “I didn’t start hearing about content until
the container business felt threatened.” Linux Journal was
one of those containers; so was every other magazine,
newspaper and broadcast station. Today, those containers
are bobbing around in an ocean of “content” on the
internet. Worse, the stuff inside the containers, which we
used to call “editorial”, is now a breed of “content” too.
In the old days, editorial lived on one side of a
“Chinese wall” between itself and the publishing
side of a newspaper or magazine. The same went
for the programming and advertising sides of a
commercial broadcast station or network. The wall
was transparent, meaning it was possible for a writer,
a photographer, a newscaster or a performing artist to
see what funded the operation, but the ethical thing was to ignore what
happened on the other side of that wall. Which was easy to do, because
everything on the other side of that wall was somebody else’s job.
Today that wall has been destroyed by the imperatives of “content
production”, which is the new job of journalists and everybody else
devoted to “generating content” in maximum volumes, all the better to
attract “programmatic” advertising.
You can see the wreckage of one such wall in a January 2017 The New
York Times story titled “In New Jersey, Only a Few Media Watchdogs Are
Left” (https://www.nytimes.com/2017/01/03/nyregion/in-new-jersey-only-a-few-media-watchdogs-are-left.html?_r=0),
by David Chen. In it he writes,
“The Star-Ledger, which almost halved its newsroom eight years ago, has
mutated into a digital media company requiring most reporters to reach an
ever-increasing quota of page views as part of their compensation.”
As I explained in my January 2017 article “What We Can Do with
Ad Blocking’s Leverage”
(http://www.linuxjournal.com/content/what-we-can-do-ad-blockings-leverage),
the advertising we’re talking about
here isn’t the old Madison Avenue kind that lived on the other side
of journalism’s Chinese wall. It’s a new all-digital kind called adtech.
While adtech is called advertising and looks like advertising, it is
actually a breed of direct marketing (https://en.wikipedia.org/wiki/
Direct_marketing), a cousin of spam descended from junk mail.
Like junk mail, adtech is data-driven, wants to get personal, finds
success in tiny-percentage responses and excuses massive negative
externalities. Those include wanton and unwelcome surveillance,
annoying the crap out of people and filling the world with crap,
including fake news and fraudulent advertising.
Here’s one way to tell the difference between real advertising and
adtech, using the Star-Ledger as an example:
• Adtech wants to push ads at readers anywhere it can find them, based
on gathered intelligence, algorithms and whatever else shows up in live
auction markets for eyeballs.