Apache Sqoop: A Data Transfer Tool for Hadoop (Cloudera, Inc.)
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. This slide deck aims to familiarize the user with Sqoop and how to use it effectively in real deployments.
This document provides an overview of starting SPSS and entering data. It discusses starting SPSS, defining variables, entering data for sample students, saving data files, and loading data files. Key steps include defining variables in the Variable View window by naming them and adding value labels, entering data horizontally in the Data View window, and saving files with a .sav extension so they can be opened later in SPSS.
These are the materials used in the binary workshop of CTF for Beginners.
The files used in the workshop are available at the following link:
https://onedrive.live.com/redir?resid=5EC2715BAF0C5F2B!10056&authkey=!ANE0wqC_trouhy0&ithint=folder%2czip
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado... (Edureka!)
** Hadoop Training: https://www.edureka.co/hadoop **
This Edureka PPT on the Sqoop tutorial explains the fundamentals of Apache Sqoop. It also gives you a brief idea of the Sqoop architecture. In the end, it showcases a demo of data transfer between MySQL and Hadoop.
The following topics are covered in this video:
1. Problems with RDBMS
2. Need for Apache Sqoop
3. Introduction to Sqoop
4. Apache Sqoop Architecture
5. Sqoop Commands
6. Demo of transferring data between MySQL and Hadoop
Check our complete Hadoop playlist here: https://goo.gl/hzUO0m
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Apache Sqoop allows transferring data between structured data stores like relational databases and Hadoop. It uses MapReduce to import/export data in parallel. Sqoop can import data from databases into Hive and export data from HDFS to databases. The document provides examples of using Sqoop to import data from MySQL to Hive and export data from HDFS to MySQL. It also demonstrates creating and executing Sqoop jobs. References for more Sqoop tutorials and documentation are included.
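For readers unfamiliar with the tool, a hedged illustration of such a transfer (not taken from this document) is shown below; the host, database, credentials, table names, and HDFS directory are hypothetical placeholders.

  sqoop import --connect jdbc:mysql://dbhost/salesdb --username dbuser -P \
      --table customers --hive-import -m 4

  sqoop export --connect jdbc:mysql://dbhost/salesdb --username dbuser -P \
      --table customers_export --export-dir /user/hive/warehouse/customers -m 4

The first command imports the MySQL table into Hive; the second writes data from an HDFS directory back into MySQL. -P prompts for the password and -m sets the number of parallel map tasks used for the transfer.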
Apache Pig is a platform for analyzing large data sets using a high-level language called Pig Latin. Pig Latin scripts are compiled into MapReduce programs that process data in parallel across a cluster. Pig simplifies data analysis tasks that would otherwise require writing complex MapReduce programs by hand. Example Pig Latin scripts demonstrate how to load, filter, group, and store data.
Cloudslam09: Building a Cloud Computing Analysis System for Intrusion Detection (Wei-Yu Chen)
In order to handle the huge amount of anomaly information generated by an Intrusion Detection System (IDS), this paper presents and evaluates a log analysis system for IDS based on cloud computing techniques, named the IDS Cloud Analysis System (ICAS). To achieve this, two basic components have to be designed. The first is the regular parser, which normalizes the raw log files. The other is the Analysis Procedure, which contains the Data Mapper and the Data Reducer: the Data Mapper is designed to anatomize alert messages, and the Data Reducer aggregates and merges them. As a result, the paper shows that the performance of ICAS is suitable for analyzing and reducing large volumes of alerts.
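As a hedged sketch (not part of the paper) of the mapper/reducer split described above, the code below assumes each normalized log line carries an alert type that the mapper extracts and the reducer counts; the field layout and class names are hypothetical.

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Data Mapper: anatomize one normalized alert line and emit (alert type, 1).
  class AlertMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text alertType = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Hypothetical normalized layout: "timestamp|source|alertType|message"
      String[] fields = value.toString().split("\\|");
      if (fields.length >= 3) {
        alertType.set(fields[2]);
        context.write(alertType, ONE);
      }
    }
  }

  // Data Reducer: aggregate and merge the counts for each alert type.
  class AlertReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }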
2. Computing with big datasets is a fundamentally different challenge than doing “big compute” over a small dataset
3. Parallel and distributed computing: grid computing (MPI, PVM, Condor, ...) focuses on distributing the workload. The current problem is how to distribute the data: reading 100 GB off a single filer would leave nodes starved, so just store the data locally.
4. Distributing large amounts of data is slow and tricky: exchanging data requires synchronization, so deadlock becomes a problem; bandwidth is limited; and failovers can cause cascading failures.
5. The numbers speak for themselves. Data processed by Google every month: 400 PB ... in 2007. Max data in memory: 32 GB. Max data per computer: 12 TB. Average job size: 180 GB. Reading that much from a single device alone takes 45 minutes.
11. Hadoop provides automatic parallelization and distribution, fault tolerance, status and monitoring tools, and a clean abstraction and API for programmers.
12. Hadoop Applications (1): Adknowledge (ad network): behavioral targeting, clickstream analytics. Alibaba: processing all sorts of business data dumped out of databases and joining them together; the data are then fed into iSearch, their vertical search engine. Baidu (the leading Chinese-language search engine): Hadoop is used to analyze search logs and to do mining work on the web page database.
13. Hadoop Applications (3): Facebook: processes internal log and dimension data sources as a source for reporting/analytics and machine learning. Freestylers (image retrieval engine): uses Hadoop for image processing. Hosting Habitat: gathers software information from all clients, analyzes it, and notifies clients of software that is missing or out of date.
14. Hadoop Applications (4): IBM: Blue Cloud computing clusters. Journey Dynamics: uses Hadoop MapReduce to analyze billions of lines of GPS data and generate traffic route information. Krugle: uses Hadoop and Nutch to build a source-code search engine.
15. Hadoop Applications (5): SEDNS (Security Enhanced DNS Group): collects DNS data from around the world to explore distributed content on the network. Technical Analysis and Stock Research: analyzes stock information. University of Maryland: uses Hadoop for research on machine translation, language modeling, bioinformatics, email analysis, and image processing. University of Nebraska-Lincoln, Research Computing Facility: uses Hadoop to run roughly 200 TB of CMS analysis; the Compact Muon Solenoid (CMS) is one of the two general-purpose particle detectors of the Large Hadron Collider project at CERN, the European Organization for Nuclear Research in Switzerland.
16. Hadoop Applications (6): Yahoo!: used to support research for ad systems and web search, and uses the Hadoop platform to discover botnets that send spam. Trend Micro: filters web content such as phishing sites and malicious links.
17. MapReduce Conclusions: suitable for large-scale applications and large-scale computations; programmers only need to solve the "real" problem, leaving the infrastructure to MapReduce; MapReduce can be applied in many domains: text tokenization, indexing and search, data mining, machine learning, ...
20. Divide and Conquer. Example 4: there is a staircase of five steps, and each move climbs either one step or two; how many ways are there to climb all five? E.g. (1,1,1,1,1) or (1,2,1,1) (see the counting sketch below). Example 1: the decimal approximation method. Example 2: finding an area with the grid method. Example 3: tiling with L-shaped tiles.
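As a minimal sketch (not on the original slide), the staircase question can be answered by divide and conquer: the climbs of n steps split into those whose last move is one step and those whose last move is two steps, giving ways(n) = ways(n-1) + ways(n-2); for five steps this yields 8 ways.

  // Counts the ways to climb n steps taking 1 or 2 steps at a time.
  public class Staircase {
    static int ways(int n) {
      if (n <= 1) return 1;   // one step: only (1)
      if (n == 2) return 2;   // (1,1) or (2)
      return ways(n - 1) + ways(n - 2);   // split on the last move
    }

    public static void main(String[] args) {
      System.out.println(ways(5));   // prints 8
    }
  }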
21. Programming Model. Users implement an interface of two functions:
  map(in_key, in_value) -> list of (out_key, intermediate_value)
  reduce(out_key, list of intermediate_value) -> list of out_value
22. Map
  One-to-one mapper: let map(k, v) = emit(k.toUpper(), v.toUpper())
    ("Foo", "other") -> ("FOO", "OTHER"); ("key2", "data") -> ("KEY2", "DATA")
  Explode mapper: let map(k, v) = foreach char c in v: emit(k, c)
    ("A", "cats") -> ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s")
  Filter mapper: let map(k, v) = if (isPrime(v)) then emit(k, v)
    ("foo", 7) -> ("foo", 7); ("test", 10) -> (nothing)
23. Reduce. Example: Sum Reducer
  let reduce(k, vals) =
    sum = 0
    foreach int v in vals: sum += v
    emit(k, sum)
  ("A", [42, 100, 312]) -> ("A", 454); ("B", [12, 6, -2]) -> ("B", 16)
60. <Key, Value> Pairs. [Diagram: Map takes raw row data as input and selects a key, producing (key, value) pairs as output; Reduce takes each key with its list of values as input and produces aggregated (key, value) pairs as output.]
61. Program Prototype (v 0.20)

  Class MR {
    static public Class Mapper ... { }    // Map code (Map section)
    static public Class Reducer ... { }   // Reduce code (Reduce section)
    main() {                              // other configuration code (configuration section)
      Configuration conf = new Configuration();
      Job job = new Job(conf, "job name");
      job.setJarByClass(thisMainClass.class);
      job.setMapperClass(Mapper.class);
      job.setReducerClass(Reducer.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    }
  }
62. Class Mapper (v 0.20)

  import org.apache.hadoop.mapreduce.Mapper;

  class MyMap extends Mapper<INPUT_KEY_Class, INPUT_VALUE_Class, OUTPUT_KEY_Class, OUTPUT_VALUE_Class> {
    // global variables
    public void map(INPUT_KEY_Class key, INPUT_VALUE_Class value, Context context)
        throws IOException, InterruptedException {
      // local variables and program logic
      context.write(NewKey, NewValue);
    }
  }
63. Class Reducer (v 0.20)

  import org.apache.hadoop.mapreduce.Reducer;

  class MyRed extends Reducer<INPUT_KEY_Class, INPUT_VALUE_Class, OUTPUT_KEY_Class, OUTPUT_VALUE_Class> {
    // global variables
    public void reduce(INPUT_KEY_Class key, Iterable<INPUT_VALUE_Class> values, Context context)
        throws IOException, InterruptedException {
      // local variables and program logic
      context.write(NewKey, NewValue);
    }
  }
80. Example 4 (1)

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: hadoop jar WordCount.jar <input> <output>");
      System.exit(2);
    }
    Job job = new Job(conf, "Word Count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    CheckAndDelete.checkAndDelete(args[1], conf);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
81. Example 4 (2)

  class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emits <word, one>
      }
    }
  }

  For an input file /user/hadooper/input/a.txt containing the line "No news is a good news.", the mapper emits <no, 1>, <news, 1>, <is, 1>, <a, 1>, <good, 1>, <news, 1>.
82. Example 4 (3)

  class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values)
        sum += val.get();
      result.set(sum);
      context.write(key, result);   // emits <key, sum>
    }
  }

  Continuing the example, the reducer receives the grouped pairs <a, [1]>, <good, [1]>, <is, [1]>, <news, [1, 1]>, <no, [1]> and emits, for instance, <news, 2>.
87. Example 6 (2)

  public class WordIndex {
    public static class wordindexM extends Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        Text map_key = new Text();
        Text map_value = new Text();
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line.toLowerCase());
        while (st.hasMoreTokens()) {
          String word = st.nextToken();
          map_key.set(word);
          map_value.set(fileSplit.getPath().getName() + ":" + line);
          context.write(map_key, map_value);
        }
      }
    }

    static public class wordindexR extends Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder ret = new StringBuilder("\n");
        for (Text val : values) {
          String v = val.toString().trim();
          if (v.length() > 0)
            ret.append(v).append("\n");
        }
        context.write(key, new Text(ret.toString()));
      }
    }
88. Example 6 (2)

    public static void main(String[] args)
        throws IOException, InterruptedException, ClassNotFoundException {
      Configuration conf = new Configuration();
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length < 2) {
        System.out.println("hadoop jar WordIndex.jar <inDir> <outDir>");
        return;
      }
      Job job = new Job(conf, "word index");
      job.setJobName("word inverted index");
      job.setJarByClass(WordIndex.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setMapperClass(wordindexM.class);
      job.setReducerClass(wordindexR.class);
      job.setCombinerClass(wordindexR.class);
      FileInputFormat.setInputPaths(job, args[0]);
      CheckAndDelete.checkAndDelete(args[1], conf);
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      long start = System.nanoTime();
      job.waitForCompletion(true);
      long time = System.nanoTime() - start;
      System.err.println(time * (1E-9) + " secs.");
    }
  }
95. News! Google has been granted the MapReduce patent, named "System and method for efficient large-scale data processing" (Patent #7,650,331), claimed since 2004. It covers not a programming language but the merge methods. What does it mean for Hadoop? (2010/01/20)