Big Data
Ming-Hwa Wang, Ph.D.
COEN 242 Introduction to Big Data
Department of Computer Engineering
Santa Clara University

Introduction
• big data
• Moore's law: CPU speed doubles every 18 months; applications automatically got faster without having to change their code
• heat dissipation and signal interference stopped making individual processors faster, and the industry switched to adding more parallel CPU cores; applications needed to be modified to add parallelism
• data grows faster than Moore's law, and data is addictive
• inexpensive to collect; data explosion: sensors, cameras, public datasets, etc.
• the next generation of successful businesses will be built around data; Hal Varian said, "The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that's going to be a hugely important skill in the next generation."
• social, mobile and cloud lead us to the data-enabled world
  • digital exoskeleton or paperless office: transfers pre-existing paper processes into the computer by digitization
  • digital nervous system: a large number of inflows and outflows of data and a high level of networking
• the value of big data to an organization falls into two categories: analytical use and enabling new products
• data science: a data-driven application acquires its value from the data itself, and creates more data as a result, or a data product
  • CDDB maps the lengths of the tracks to the title
  • Google's PageRank, spelling checking, integrated voice search, tracking the progress of epidemics, etc.
  • Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know
  • Amazon saves your searches, correlates them with other users, and gives appropriate recommendations; customers generate a trail of data exhaust that can be mined and put to use
• Five Vs (volume, velocity or the rate at which data grows, variety or different data types, veracity or uncertainty of available data, and value for business): the web has people spending more time online and leaving a trail of data wherever they go; mobile applications leave an even richer data trail (geolocation, video, audio); point-of-sale devices capture all retail transactions; online retailers compile customers' clicks and provide purchase recommendations; financial traders turned systems that cope with fast-moving data to their advantage; chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3 music, GPS trails, financial data
• data needs to be stored before it can be analyzed/mined; when you can, keep everything
• streaming data (fast-moving data, or complex event processing): input data may arrive too fast to store in its entirety (hoping nothing useful is thrown away), and analysis must occur as the data streams in, e.g., mobile applications, online gaming, etc.
• big data: the information that can't be processed or analyzed using traditional processes or tools
• since the early '80s, processor speed has gone from 10 MHz to 3.6 GHz, RAM from $1,000/MB to $25/GB, and disk transfer speed to 75 MB/second
• traditional processing uses either a single powerful CPU, or parallel/distributed processing that transfers data to the CPU in order to process it, and thus has synchronization, bandwidth, temporal dependency and partial failure issues; not good for big data
• data locality: it's the program that needs to move, not the data
• traditional high-performance file servers (NetApp, EMC) have the advantages of fast random access and support for many concurrent clients, but come with a high cost per terabyte of storage
• traditional relational databases with static schemas are designed for consistency and support complex transactions, and are denormalized (to speed up querying) by time-consuming pre-materialization (to predict the queries) into an online analytic processing (OLAP) cube, but do not scale well with size
• graph database Neo4j: social network relations are a graph by nature
• most data analysis is comparative; newer non-relational databases, or NoSQL (originating from Google's BigTable and Amazon's Dynamo), store huge amounts of unstructured data and provide eventual consistency rather than absolute consistency; they provide enough structure to organize data, but do not require the exact schema of the data before storing it
• data analysis
  • huge amounts of data in different formats, possibly with errors
  • data conditioning and cleaning (80% of the total effort): get data into a state where it's usable; data is frequently missing or incongruous, and some is difficult to parse, e.g., natural language processing
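The data conditioning and cleaning step above can be sketched in a few lines of plain Java. The record format ("name,age") and the repair rules here are invented for illustration; real pipelines face far messier input:

```java
// A minimal, single-process sketch of data conditioning/cleaning: getting raw
// records into a state where they're usable. The format and rules are made up.
import java.util.AbstractMap;
import java.util.Map;

public class CleaningSketch {
    // Parse a "name,age" record; return null for records that cannot be repaired.
    static Map.Entry<String, Integer> clean(String raw) {
        if (raw == null) return null;
        String[] f = raw.trim().split("\\s*,\\s*");
        if (f.length != 2 || f[0].isEmpty()) return null;      // missing field
        try {
            int age = Integer.parseInt(f[1]);
            if (age < 0 || age > 150) return null;             // incongruous value
            return new AbstractMap.SimpleEntry<>(f[0].toLowerCase(), age);
        } catch (NumberFormatException e) {
            return null;                                       // unparseable field
        }
    }

    public static void main(String[] args) {
        String[] raw = {" Alice , 34", "Bob,abc", ",17", "Carol,200", "Dave,41"};
        int kept = 0;
        for (String r : raw) {
            Map.Entry<String, Integer> rec = clean(r);
            if (rec != null) { kept++; System.out.println(rec); }
        }
        // only "Alice" and "Dave" survive the cleaning rules
        System.out.println(kept + " of " + raw.length + " records usable");
    }
}
```

Even this toy version shows why cleaning dominates the effort: every rule (trimming, range checks, parse failures) encodes a judgment about what "usable" means for the downstream analysis.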
• agile practice: faster product cycles, close interaction between developers and customers, and testing; MapReduce enables agile data analysis and has soft real-time response
• machine learning libraries: PyBrain, Elefant, Weka, Mahout, Google Prediction API, OpenCV (for computer vision), Mechanical Turk (to develop training sets)
• statistics is the grammar of data science
  • open source R language and CRAN
• narrative-driven data analysis or data visualization: making data tell its story
  • use visualization to see how bad raw data is, and use visualization as the first step for data analysis
  • packages: GnuPlot, R, IBM's ManyEyes, Casey Reas' and Ben Fry's Processing (with animations)
• data jiujitsu: use smaller auxiliary problems to solve a large, difficult problem that appears intractable, e.g., CDDB
• the Storage, MapReduce and Query (SMAQ) stack for big data (by analogy to the LAMP stack, Linux, Apache, MySQL and PHP, for web 2.0)
• the drivetrain approach
  • define a clear objective
  • predictive modeling or optimal decision
  • simulation
  • optimization
• data scraping
  • the uber problem: challenges of data scraping: scale, metadata, data format, source domain complexity
  • tools: ScraperWiki, Junar, Mashery, Apigee, 2scale
  • legal implications of data scraping
    • copyright: the data or facts are not copyrightable, but a specific arrangement or selection of the data will be protected
    • terms of service are based in contract law, but their enforceability is a gray area, e.g., YouTube forbids a user from posting copyrighted video
    • trespass to chattels: high-volume web scraping could cause significant monetary damages (e.g., denial of service) to the sites being scraped

Hadoop
• originated at Yahoo (created by Doug Cutting); now the top-level Apache open source project Hadoop ([Link] written in Java: a computation environment built on top of a distributed clustered file system for very large-scale data operations, based on the Google File System (GFS) and the MapReduce programming paradigm
• MapReduce follows the functional programming paradigm for parallelism and locality of data access; work is broken down into mapper and reducer tasks that manipulate data stored across a cluster of servers for massive parallelism
  • easy to use: to run a MapReduce job, users just need to write map and reduce functions, and Hadoop handles all the rest (e.g., decomposing the submitted job into map/reduce tasks, scheduling tasks, monitoring tasks to ensure successful completion, and restarting tasks that fail), a tradeoff for flexibility
  • a map is a function whose input is a single aggregate and whose output is a bunch of key-value, or (k, v), pairs; wasteful of network and disk I/O, but minimizes memory usage in the mappers
  • sort and shuffle phase: the master controller sorts the key-value pairs by key, divides them among all the reduce tasks by hashing (or by alphanumeric partitioning), and shuffles all pairs with the same key to the same reducer; optionally, shuffle/sort directs pairs from a mapper to a specific reducer by grouping/sorting, or a combiner performs local aggregation of a mapper's output before sending it to the reducer
  • a reduce function takes multiple map outputs with the same key and combines their values, i.e., reduces (k, [v1, v2, …, vn]) to (k, v); if the reducer function is associative and commutative, the values can be combined in any order with the same result; when there is no more reduce work, all final key-value pairs are emitted as output, and the outputs from all reducer tasks are merged into a single file
  • efficiency:
    • partitioning/shuffling: increase parallelism and reduce data transfer by partitioning the output of the mappers into groups (buckets or regions), so multiple reducers operate on the partitions in parallel, with the final results merged together
    • combiner: much of the data transfer between mapper and reducer is repetitive (multiple key-value pairs with the same key); a combiner (in essence a reducer function) combines all the data for the same key into a single value
      • a combinable reducer: its output must match its input
      • a non-combinable reducer needs to be processed into pipelined map-reduce steps
    • maximum parallelism: use one reduce task to execute each reducer, and execute each reduce task at a different compute node
    • reduce skew: different reducers take different amounts of time, and a significant difference in the time each takes exhibits skew; we can reduce the impact by using fewer reduce tasks than there are reducers, and by using more reduce tasks than there are compute nodes
    • limit reduce tasks to prevent an intermediate file explosion
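The map, shuffle/sort, and reduce phases described above can be simulated in a single process with plain Java, using word count as the running example. The class and method names are illustrative, not part of any Hadoop API:

```java
// A minimal, single-process sketch of map -> shuffle/sort -> reduce.
import java.util.*;

public class MapReduceSketch {
    // map: emit (word, 1) for every word in a line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+"))
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // shuffle/sort: group all values with the same key; keys come out sorted
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // reduce: (k, [v1..vn]) -> (k, sum); associative and commutative,
    // so the same function could also serve as a combiner
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : new String[]{"big data is big", "data moves to code"})
            mapped.addAll(map(line));
        shuffle(mapped).forEach((k, vs) -> System.out.println(k + "\t" + reduce(k, vs)));
    }
}
```

Because this reduce function is associative and commutative, running it once per mapper as a combiner and again after the shuffle yields the same counts while shrinking the data transferred, which is exactly the efficiency argument made above.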
• Hadoop is designed to scan through large complex/structured data sets to produce its results through a highly scalable, distributed batch processing system. Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speed; it is built around a function-to-data model and is about discovery and making the once near-impossible possible; for reliability (self-healing), failures are expected, the mean time to failure (MTTF) rate associated with inexpensive hardware is well understood, and fault tolerance and fault compensation are built in; it is open source, so there is no vendor lock-in problem.
• Hadoop components (or Hadoop core):
  • Hadoop distributed file system (HDFS): data redundancy
  • Hadoop MapReduce model: programming redundancy
  • Hadoop common: a set of libraries
  • Hadoop YARN (Yet-Another-Resource-Negotiator)
• related projects (Apache projects, Apache Incubator projects, and Apache Software Foundation projects hosted on GitHub)
  • Apache Avro, Protobuffer, Thrift: for data serialization, splittable compression
  • Cassandra, HBase, MongoDB, and Hypertable: column-oriented huge databases storing key-value pairs for fast retrieval; they act as a source and a sink for MapReduce
  • Chukwa: a monitoring system for large distributed systems
  • Mahout: a machine learning library for analytical computation, collaborative filtering, user recommendations, clustering, regression, classification, pattern mining
  • Hive: ad hoc (i.e., no compilation required) SQL-like queries for data aggregation and summarization, for Hypertable
  • Pig: a high-level Hadoop parallel data-flow programming language, for Cassandra, with its User Defined Functions (UDFs)
  • Cloudera Hue (Hadoop User Experience): web console
  • Zookeeper (naming, coordination and workflow services for distributed applications) and Oozie (managing job workflow, scheduling, and dependencies)
  • Ambari (a web-based tool for provisioning, configuration, monitoring, administration, upgrade services, and deployment) and Whirr (running services on cloud platforms)
  • Sqoop (SQL to Hadoop) and Flume: for data ingestion
  • Cascading (a wrapper for Java applications) and Cascalog (a query language)
• free download
  • [Link]
  • [Link]/software/data/infosphere/biginsights/[Link]
• MapReduce programming languages
  • Java
  • scripting languages, through the Hadoop Streaming interface, for simple/short applications
  • Pig (and Pig Latin, a dataflow scripting language)
  • Hive and the Hive Query Language (HQL): run from the Hive shell, Java Database Connectivity (JDBC), or Open Database Connectivity (ODBC) via the Hive Thrift Client; enables Hadoop to operate as a data warehouse; Hive with HBase is 4-5 times slower than HDFS and is limited to storing a petabyte in HBase
  • Jaql: a functional, record-oriented, declarative query language for JavaScript Object Notation (JSON)
• data locality: MapReduce assigns workloads to the servers where the data to be processed is stored, without using a storage area network (SAN) or network attached storage (NAS)
• data redundancy for high availability, reliability, scalability and data locality: HDFS replicates each block onto three servers by default; a NameNode keeps track of all the data files in HDFS (with optionally a BackupNode to prevent a single point of failure, SPOF, because any loss in its metadata results in a permanent loss of data in the cluster)
  • IBM General Parallel File System (GPFS), based on SAN
  • IBM GPFS extended to a shared-nothing cluster (GPFS-SNC)
  • Google File System (GFS)
  • Cloudstore: an open-source DFS from Kosmix
• transparent to the application: Hadoop HDFS will contact the NameNode, find the servers that hold the data for the application task, and then send your application to run locally on those nodes
• MapReduce model: the mappers run in parallel converting input into tuples of key and value pairs, and then the reducers combine the results from the mappers to get the solutions
  • the master nodes run three daemons: the NameNode (optionally a second NameNode to prevent a single point of failure; a secondary NameNode does housekeeping for the NameNode) and the JobTracker; each slave or worker node runs two daemons: a DataNode and a TaskTracker
  • typically, a cluster with more nodes will perform better than one with fewer, faster nodes; in general, more disks are better; Hadoop nodes are typically disk- and network-bound
  • an application (run on a client machine) submits a job to a node (running the JobTracker daemon) in the Hadoop cluster; the JobTracker communicates with the NameNode, breaks the job into map and reduce tasks, and the TaskTracker schedules the tasks on the nodes in the cluster where the data resides.
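The replication arithmetic behind the default scheme above is simple: usable capacity is raw capacity divided by the replication factor, and a file occupies ceil(size / blockSize) blocks. A small sketch, assuming the 64 MB block size that was the Hadoop 1.x default (configurable via dfs.block.size):

```java
// Back-of-the-envelope capacity math for HDFS-style 3x replication.
public class HdfsCapacitySketch {
    // raw disk needed to hold dataBytes at the given replication factor
    static long rawBytesNeeded(long dataBytes, int replication) {
        return dataBytes * replication;
    }

    // number of blocks a file occupies (ceiling division)
    static long blocksForFile(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tb = 1024L * 1024 * 1024 * 1024;
        long mb = 1024L * 1024;
        // 10 TB of data at the default replication factor of 3 needs 30 TB raw
        System.out.println(rawBytesNeeded(10 * tb, 3) / tb + " TB raw");
        // a 1 GB file in 64 MB blocks occupies 16 blocks (each replicated 3x)
        System.out.println(blocksForFile(1024 * mb, 64 * mb) + " blocks");
    }
}
```

The 3x factor is why the notes stress JBOD over RAID: the redundancy is already paid for at the file-system level.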
• TaskTracker agents, a set of continually running daemons, monitor the status of each task and report to the JobTracker
• HDFS is not a POSIX-compliant file system and can't interact with the Linux or Unix file system; you should use the /bin/hdfs dfs <args> file system shell command interface, or FsShell (invoked by the Hadoop command fs)
  • HDFS shell commands: put, get, cat, chmod, chown, copyFromLocal, copyToLocal, cp, expunge, ls, mkdir, mv, rm
  • Java applications need to import the [Link] package
• no benefit from RAID, because HDFS provides built-in redundancy and RAID striping is slower than the JBOD used by HDFS (but RAID is preferred on the master node)
• no benefit from virtualization, because multiple virtual nodes per node hurt performance
• no benefit from blade servers
• use RedHat Enterprise Linux (RHEL) on master nodes and CentOS on slaves; prefer a server distribution (e.g., Ubuntu) to a desktop distribution (e.g., Fedora)
• configuring the system:
  • don't use the Linux logical volume manager (LVM) to make all your disks appear as a single volume
  • check BIOS settings (e.g., disable IDE emulation when using SATA drives)
  • test disk I/O speed with hdparm -t /dev/sda1
  • mount disks with the noatime option to avoid updating access times
  • reduce the swappiness of the system by setting [Link] to 0 or 5 in /etc/[Link]
  • increase the nofile ulimit in /etc/security/[Link] to at least 32K
  • disable IPv6, and disable SELinux
  • install and configure the ntp daemon to ensure the time on all nodes is synchronized (especially important for HBase)
• JVM recommendations
  • always use the official Oracle JDK ([Link] 1.6.0u24 or 1.6.0u26 (don't use the newest version until it is fully tested)
• Hadoop versions
  • [Link]
  • Cloudera's Distribution for Hadoop (CDH), free and Enterprise editions: [Link]
• integration with streaming data sources
  • Cloudera Flume: a distributed data collection service for flowing data into a Hadoop cluster
    • source and predefined source adapters
    • decorator: IBM Information Server
    • sink: Collector Tier Event sink, Agent Tier sink, Basic sink
  • Facebook Scribe
• NoSQL databases that contain MapReduce functionality:
  • CouchDB: semi-structured document-based storage, with JavaScript
  • MongoDB: better performance, with JavaScript
  • Riak: high availability, with JavaScript or Erlang
• integration with SQL databases
  • Sqoop, with the Java JDBC database API, imports data from relational databases into Hadoop
  • massively parallel processing (MPP) databases: Greenplum, based on the PostgreSQL DBMS; Aster Data's nCluster
  • data warehouses integrated with MapReduce (e.g., Zynga's Vertica, IBM's Netezza), or a hybrid solution: use on-demand cloud resources to supplement in-house deployments
• search with MapReduce: Solr, ElasticSearch
• initiatives
  • need for scale/interaction: interactive 5s-1m, non-interactive 1m+, batch 1h+
  • next-generation execution, YARN: improved MapReduce performance, low latency, streaming services
  • HDFS 2.0: with federation (more than one NameNode) and checkpoints
  • Ambari: for Hadoop 2.0, more scale, root cause analysis
  • Herd: streaming and ETL integration with HCatalog/Hive
  • Continuum: HDFS snapshots and mirroring for disaster recovery
  • Stinger: enhance Hive for BI use cases
  • Knox: a Hadoop gateway for security

Common MapReduce Algorithms
• basic ideas
  • the Mapper's role is filter, transform and parse (FTP) or extract, transform and load (ETL)
  • MapReduce jobs tend to be relatively short in terms of lines of code, and many MapReduce jobs use very similar code; reuse existing Mappers/Reducers with minor modifications
  • it is typical to combine multiple small MapReduce jobs together in a single workflow using Oozie
  • the following common MapReduce algorithms are the basis for more complex MapReduce jobs
• the famous "Word Count" MapReduce program (used in spell checkers, language detection and translation systems, etc.) from Cloudera:

  package org.myorg;
  import java.io.IOException;
  import java.util.*;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.conf.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.util.*;
  public class WordCount {
    public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }
    public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

• word count
  • the Mapper emits (word, 1) for each word
• sorting large data sets
  • keys are passed to the Reducer in sorted order, so MapReduce is well suited to sorting large data sets; the input contains lines with a single value
  • the Mapper is merely the identity function for the value: (k, v) → (v, _)
  • the Reducer is the identity function: (k, _) → (k, "")
  • with multiple Reducers, each Reducer's output is sorted independently; to generate a merged sort order, add a Partitioner, e.g., if key < "c" then reducer = 1, else reducer = 2
  • used as a speed test for the Hadoop framework's I/O
• searching large data sets
  • the Mapper compares each line against the pattern; if it matches, output (line, _) or (filename+line, _); otherwise, output nothing
  • the Reducer is the identity function
• secondary sort
  • the key is sorted, but we also want to sort the values within each key
  • a naïve solution loops through all values in the Reducer, keeps track of the largest value, and emits the largest value at the end
  • a better solution arranges the values for a given key to reach the Reducer in descending sorted order, and the Reducer just emits the first value
  • custom comparator classes can compare composite keys (the actual key plus the value): extend WritableComparator, override compare( ), and specify the class in the driver code with conf.setOutputKeyComparatorClass(...)
  • to pass all values for the same natural key in one call to the Reducer, use a grouping comparator class, specified with conf.setOutputValueGroupingComparator(...)
• inverted index
  • assume the input is a set of files containing lines of text; the key is the byte offset of the line, and the value is the line itself
  • the Mapper emits (word, filename) or (word, filename+line) for each word in the line
  • the Reducer is the identity function and emits (word, filename_list)
• calculating word co-occurrence
  • measure the frequency with which two words appear close to each other in a corpus of documents
  • this is at the heart of many data-mining techniques, and provides results for "people who did this, also do that", e.g., shopping recommendations, credit risk analysis, identifying 'people of interest'
  • Mapper
    map(doc_id a, doc d) {
      foreach w in d do
        foreach u near w do
          emit(pair(w, u), 1)
    }
  • Reducer
    reduce(pair p, Iterator counts) {
      s = 0
      foreach c in counts do
        s += c
      emit(p, s)
    }
• two-stage map-reduce, using a pipes-and-filters approach: the output of one stage serves as input to the next
  • reuse: the intermediate output may be useful for different outputs
  • materialized view: the saved intermediate records often represent the heaviest amount of data access
• term frequency – inverse document frequency (TF-IDF)
  • measures how important a term is in a document
  • real-world systems typically perform "stemming" on terms: removal of plurals, tense, possessives, etc.
  • TF-IDF considers both the frequency of a word (term) in a given document (TF) and the number of documents which contain the word: IDF = log(N/n), where N is the total number of documents and n is the number of documents that contain the term; TF-IDF = TF x IDF
  • stop-words: some words appear too frequently in all documents to be relevant
  • data mining application example: music recommendation system
  • term weighting function: assign a score (weight) to each term (word) in a document
  • several small jobs add up to the full algorithm: 3 MapReduce jobs
    1. compute term frequencies: Mapper: input (doc_id, contents), output ((term, doc_id), 1); Reducer: output ((term, doc_id), tf); we can add a combiner, which will use the same code as the Reducer
    2. compute the number of documents each word occurs in: Mapper: input ((term, doc_id), tf), output (term, (doc_id, tf, 1)); Reducer: output ((term, doc_id), (tf, n))
    3. compute TF-IDF: Mapper: input ((term, doc_id), (tf, n)), output ((term, doc_id), TF x IDF); Reducer: the identity function
  • working at scale: any time you need to buffer data, there is a potential scalability bottleneck
    • since we need to buffer (doc_id, tf) in the second MapReduce job, especially for very-high-frequency words, the possible solutions are to ignore those words, write out intermediate data to a file, or use another MapReduce pass
• incremental map-reduce
  • new data keeps coming in, so we need to rerun the computation to keep the output up to date
  • structure a map-reduce computation to allow incremental updates
    • only if the input data changes does the mapper need to be rerun
    • if we are partitioning the data for reduction, then any partition that's unchanged does not need to be re-reduced
    • if changes are additive, then just run the reduce with the existing result and the new additions
    • if there are destructive changes (updates, deletes), break the reduce operation into steps and recalculate only those steps whose inputs have changed, using a dependency network

MapReduce Algorithms for Relational Algebra Operations
• traditional relational databases use relational algebra:
  • the set of attributes of a relation is its schema; a relation is a table with column headers (attributes) and rows (tuples), R(A1, A2, …, An); a relation can be stored as a file in a distributed file system, and the elements of this file are the tuples of the relation
• selection:
  • σC(R): apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C
  • map function: for each tuple t in R, test if it satisfies C; if so, produce the key-value pair (t, t)
  • reduce function: identity
• projection:
  • πS(R): for some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S
  • map function: for each tuple t in R, construct a tuple t' by eliminating from t those components whose attributes are not in S, then output the key-value pair (t', t')
  • reduce function removes duplicates (combiners may be added for efficiency): turn (t', [t', t', …, t']) into (t', t')
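The selection and projection operators just described can be sketched in a single process, representing a tuple as a String array; the helper names are invented for illustration, and a real implementation would run them as Hadoop map and reduce functions:

```java
// Single-process sketch of relational selection and projection as map/reduce.
import java.util.*;
import java.util.function.Predicate;

public class RelAlgSketch {
    // selection sigma_C(R): the map keeps tuples satisfying C; reduce is identity
    static List<String[]> select(List<String[]> r, Predicate<String[]> c) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : r) if (c.test(t)) out.add(t);
        return out;
    }

    // projection pi_S(R): the map emits t' with only the attributes in S (here,
    // column indexes); the reduce step removes duplicates, mirroring
    // (t', [t', t', ...]) -> (t', t'), which a set does for us
    static Set<List<String>> project(List<String[]> r, int... cols) {
        Set<List<String>> out = new LinkedHashSet<>();
        for (String[] t : r) {
            List<String> tp = new ArrayList<>();
            for (int i : cols) tp.add(t[i]);
            out.add(tp);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> r = Arrays.asList(
            new String[]{"alice", "hadoop"}, new String[]{"bob", "hive"},
            new String[]{"carol", "hadoop"});
        System.out.println(select(r, t -> t[1].equals("hadoop")).size()); // prints 2
        System.out.println(project(r, 1)); // duplicates collapse: [[hadoop], [hive]]
    }
}
```

Note where the reduce step matters: selection never creates duplicates, so its reducer is the identity, while projection can map distinct tuples to the same t' and therefore needs the de-duplicating reduce.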
• set union, intersection, and difference for two relations R and S with the same schema:
  • union map function: turn each tuple t into a key-value pair (t, t)
  • union reduce function removes duplicates: turn (t, [t, t]) into (t, t)
  • intersection map function: turn each tuple t in R or S into a key-value pair (t, t)
  • intersection reduce function: if (t, [t, t]), produce (t, t); otherwise, produce nothing
  • difference map function: for a tuple t in R, produce the key-value pair (t, R), and for a tuple t in S, produce the key-value pair (t, S)
  • difference reduce function: if (t, [R]), produce (t, t); otherwise, produce nothing
• bag union, intersection, and difference for two relations R and S:
  • bag union: the bag of tuples in which tuple t appears the sum of the numbers of times it appears in R and S
  • bag intersection: the bag of tuples in which tuple t appears the minimum of the numbers of times it appears in R and S
  • bag difference: the bag of tuples in which the number of times a tuple t appears is equal to the number of times it appears in R minus the number of times it appears in S; a tuple that appears more times in S than in R does not appear in the difference
• natural join:
  • R ⋈ S: given two relations R and S, if two tuples agree on all the attributes that are common to the two schemas, produce a tuple that has components for each of the attributes in either schema and agrees with the two tuples on each attribute
  • map function: for each tuple (a, b) of R, produce the key-value pair (b, (R, a)); for each tuple (b, c) of S, produce the key-value pair (b, (S, c))
  • reduce function: if (b, [(R, a), (S, c)]), produce (_, (a, b, c)); otherwise, produce nothing
  • equijoin: the tuple-agreement condition involves equality of attributes from the two relations that do not necessarily have the same name
• grouping and aggregation:
  • γX(R): where X is a list of elements that are either a grouping attribute or an expression θ(A), where θ is one of the 5 aggregation operations (SUM, COUNT, AVG, MIN, and MAX) and A is an attribute not among the grouping attributes
  • for relation R(A, B, C), the map and reduce functions of γA,θ(B)(R) are:
    • map function: for each tuple (a, b, c), produce the key-value pair (a, b)
    • reduce function: apply the aggregation operator θ to the list [b1, b2, …, bn] with key a to get the result x, then produce the pair (a, x)
  • if there are several grouping attributes, then the key is the list of the values of a tuple for all these attributes; if there is more than one aggregation, then the reduce function applies each of them to the list of values associated with a given key and produces a tuple consisting of the key (including components for all grouping attributes if there is more than one), followed by the results of each of the aggregations
• examples:
  • find paths of length two in the web: a link L1 from U1 to U2, or L1(U1, U2), and a link L2(U2, U3); use a natural join of links: πU1,U3(L1 ⋈ L2)
  • compute a count of the number of friends of each user in a social network, using grouping and aggregation: γuser,COUNT(friends)(friends)

Development
• run development jobs on a virtual machine in pseudo-distributed mode (all five Hadoop daemons running on the same single-machine cluster, each in its own JVM); test before deploying to the real cluster
• compile and submit a MapReduce job
  $ javac -classpath `hadoop classpath` *.java
  $ jar cvf [Link] *.class
  $ hadoop jar [Link] WordCount <input_directory_file> <output_directory_file>
  $ hadoop fs -ls <output_directory_file>
  $ hadoop fs -cat <output_directory_file>/part-<r>-0000 | less
  $ hadoop fs -rm -r <output_directory_file>
• stop a job through the JobTracker
  $ mapred job -list
  $ mapred job -kill <job_id>

Communication Cost Model
• use communication cost to measure the efficiency of an algorithm, rather than execution time (because tasks tend to be very simple) or the cost of moving data from disk to memory (or vice versa)
• example: if relation R(A, B) has size r and S(B, C) has size s, then the join R(A, B) ⋈ S(B, C) has communication cost O(r + s)
• communication from map tasks to reduce tasks is likely to be across the interconnect of the cluster
• the computation of multiway joins as single MapReduce jobs is more efficient than the straightforward cascade of 2-way joins
  • star join: a chain store keeps a huge fact table, with each entry representing a single sale, F(A1, A2, …, An), and keeps relatively small dimension tables giving information about the participants (e.g., a customer), D(F1, B11, B12, …, B1n); analytic queries typically join the fact table with several of the dimension tables (a "star" join) and then aggregate the result
• count only input size (normally is huge) and not output size (usually is • ARRAY <primitive-type>: prices = [1000, 950, 920]
summarized or aggregated) • MAP <primitive-type, data-type>: bonus = [‘alice’: 200, ‘bob’: 150]
• two parameters • STRUCT <col-name: data-type>: address<zip:STRING, street:STRING,
• small reducer size q: force many reducers for a high degree of parallelism,
  and run reducers in memory
• replication rate r: the average communication from map tasks to reduce tasks
  per input

Hive
• MapReduce code is typically written in Java; other languages can use the Hadoop
  Streaming interface; only a few developers write good MapReduce code, but
  many other people need to analyze data, and a higher-level abstraction on top of
  MapReduce can provide the ability to query the data without needing to know
  MapReduce intimately
• Hive ([Link]) was originally developed at Facebook, and is now
  a top-level Apache Software Foundation project; it provides a very SQL-like (or
  declarative) language, and it generates MapReduce jobs that run on the Hadoop
  cluster
  • provides an SQL dialect for querying data stored in HDFS, Amazon S3, and
    databases (e.g., HBase, Cassandra)
  • makes it easier for developers to port SQL-based data warehouse
    applications to Hadoop
• Hive Query Language (HiveQL or HQL)
  • Hive translates queries to MapReduce jobs
  • Hive is not a full database and does not conform to the ANSI SQL standard;
    it does not provide record-level update, insert, delete, or transactions, and
    all you can do is generate new tables from queries or output query results to
    files
  • Hive queries have high latency (because Hadoop is a batch-oriented system);
    Hive cannot provide features required for OLTP (OnLine Transaction Processing)
    but is closer to being an OLAP (OnLine Analytic Processing) tool
  • use NoSQL for OLTP with large-scale data (e.g., HBase, Cassandra,
    DynamoDB, etc. on Hadoop, Amazon's Elastic MapReduce EMR, Elastic
    Compute Cloud EC2, etc.)
• the Hive data model
  • layers table definitions on top of data in HDFS directories/files: tables (typed
    columns, JSON-like data, e.g., arrays, struct, map, etc.), partitions, and
    buckets (hashed for sampling and join optimization)
  • primitive types: INT, TINYINT, SMALLINT, BIGINT, FLOAT, DOUBLE, BOOLEAN,
    STRING, and CDH4 added BINARY, TIMESTAMP
  • type constructors
    city: STRING>
• the Hive Metastore: Hive's Metastore is a database containing table definitions
  and other metadata
  • by default, stored locally on the client machine in a Derby database
  • if multiple people will use Hive, create a shared Metastore in a database
    server
• Hive data physical layout
  • Hive tables are stored in subdirectories in HDFS
  • actual data is stored in flat files, either SequenceFiles (control character-
    delimited text) or in an arbitrary format with the use of a custom
    serializer/deserializer (SerDe)
• the Hive Shell
  $ hive
  hive> SHOW TABLES;
  to create a Hive table:
  hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT
  DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
  hive> DESCRIBE shakespeare;
  to create a Hive external table in another directory movie:
  hive> CREATE EXTERNAL TABLE movie (id INT, name STRING, year INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION
  '/user/training/movie';
  to load data:
  hive> LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;
  data can also be loaded using Hadoop's filesystem commands:
  $ hadoop fs -put shakespeare_freq /user/hive/warehouse/shakespeare
  Sqoop's --hive-import option can import data from a database into Hive tables by
  generating the Hive CREATE TABLE statement based on the table definition in the
  RDBMS
  hive> SELECT * FROM shakespeare LIMIT 10;
  hive> SELECT * FROM shakespeare WHERE freq > 100 ORDER BY freq ASC
  LIMIT 10;
  hive> SELECT [Link], [Link], [Link] FROM shakespeare s JOIN kjv k
  ON ([Link] = [Link]) WHERE [Link] >= 5;
  to store the output results, create a new table then write into it:
  hive> INSERT OVERWRITE TABLE new_table SELECT [Link], [Link],
  [Link] FROM shakespeare s JOIN kjv k ON ([Link] = [Link]) WHERE
  [Link] >= 5;
  Hive supports manipulation of data via user-defined functions (UDFs) written in
  Java, or in any language via the TRANSFORM operator
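The TRANSFORM operator streams each query row to an external script as tab-separated text on stdin and reads tab-separated rows back from stdout. A minimal sketch of such a script in Python, assuming four tab-separated input columns ending in a Unix timestamp (the column names are illustrative):

```python
# Sketch of a Hive TRANSFORM script. Assumptions: each stdin line is one
# row of tab-separated fields, and the last field is a Unix timestamp.
import sys
from datetime import datetime, timezone

def transform_line(line):
    """Replace the trailing Unix timestamp with an ISO weekday (1=Mon..7=Sun)."""
    user_id, movie_id, rating, unixtime = line.rstrip('\n').split('\t')
    weekday = datetime.fromtimestamp(int(unixtime), tz=timezone.utc).isoweekday()
    return '\t'.join([user_id, movie_id, rating, str(weekday)])

def main():
    # Hive pipes query rows to stdin and reads transformed rows from stdout
    for line in sys.stdin:
        print(transform_line(line))

if __name__ == '__main__':
    main()
```

Hive would invoke this via USING 'python script.py', with the AS clause naming the output columns.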
  hive> INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM (user_id,
  movie_id, rating, unixtime) USING 'python weekday_mapper.py' AS
  (user_id, movie_id, rating, weekday) FROM u_data;
  hive> QUIT;
• batch mode
  $ hive -f file_to_execute
• Hive limitations
  • not all standard SQL is supported, e.g., no correlated subqueries
  • no support for UPDATE and DELETE
  • no support for INSERTing single rows
• word count in Hive
  CREATE TABLE docs (line STRING);
  LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
  CREATE TABLE word_counts AS
  SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;
• Impala
  • an open-source project created by Cloudera to facilitate interactive real-
    time queries of data in HDFS
  • Impala does not use MapReduce, but uses a language very similar to HiveQL
    and produces results 5-40x faster

Pig
• Pig ([Link]) was originally developed at Yahoo, and is now a top-
  level Apache Software Foundation project; Pig translates a high-level Pig Latin
  script run on a client machine (without shared metadata) into MapReduce jobs
  and coordinates their execution
  • MapReduce does not directly support complex N-step dataflows, and lacks
    explicit support for combined processing of multiple data sets (e.g.,
    joining/grouping datasets) and frequently-needed data manipulation
    primitives (e.g., filtering, aggregation, top-k thresholding)
  • Pig provides a very simple high-level dataflow or data description language
    (in the spirit of SQL) to encode explicit dataflow graphs with built-in
    relational-style operations, and gives sufficient control back to the
    programmer
• Pig features: joining datasets, grouping data, referring to elements by position
  instead of names, loading non-delimited data using a custom SerDe, creation of
  user-defined functions in Java, etc.
• Pig compilation and execution stages
  • parser: syntactical analysis, type checking, schema inference; outputs a
    canonical logical plan with operators arranged in a DAG
  • logical optimizer: e.g., projection pushdown
  • MapReduce compiler: converts the logical plan into MapReduce jobs
  • MapReduce optimizer: e.g., a combiner performs early partial aggregation
  • Hadoop job manager: topologically sorts the jobs, submits them to Hadoop,
    and monitors their execution status
• Pig Latin features:
  urls = LOAD 'dataset' AS (url, category, pagerank);
  groups = GROUP urls BY category;
  bigGroups = FILTER groups BY COUNT(urls) > 1000000;
  result = FOREACH bigGroups GENERATE group, top10(urls);
  STORE result INTO 'myOutput';
  • a step-by-step dataflow language where computation steps are chained
    together through the use of variables
  • the use of high-level transformations (e.g., GROUP, FILTER)
  • the ability to specify schemas as part of issuing a program
  • the use of user-defined functions (e.g., top10 above)
• data types
  • scalar types: int, long, double, chararray (string, sorted by lexicographical
    ordering), bytearray (for unknown data types and lazy conversion of types,
    sorted by byte radix ordering)
  • complex types: map (and its dereference operator #), tuple, and bag
    (collection of tuples)
  • storage function: to delimit data values and tuples when loading/storing
    data from/to a file (default: ASCII; binStorage for bytearray; user-defined
    storage functions), and use { } to encode nested complex types
  • declaring types explicitly as part of the AS clause during the LOAD
  • inferring types implicitly from operators (e.g., #) or the known return type
    of functions
• modes of user interaction
  • interactive mode through an interactive shell, Grunt
    $ pig [-help] [-h] [-version] [-i] [-execute] [-e]
    grunt> emps = LOAD 'people' AS (id, name, salary);
    grunt> rich = FILTER emps BY salary > 100000;
    grunt> srtd = ORDER rich BY salary DESC;
    grunt> STORE srtd INTO 'rich_people';
    to display on screen for debugging (do not use on a cluster):
    grunt> DUMP srtd;
    grunt> DESCRIBE bag_name;
    grunt> data1 = LOAD 'data1' AS (col1, col2, col3, col4);
    grunt> data2 = LOAD 'data2' AS (colA, colB, colC);
    grunt> jnd = JOIN data1 BY col3, data2 BY colA;
    grunt> STORE jnd INTO 'outfile';
    grunt> grpd = GROUP bag1 BY elementX;
    grunt> justnames = FOREACH emps GENERATE name;
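The GROUP/FILTER/FOREACH pipeline in the Pig Latin example above can be sketched in plain Python terms; the group-size threshold and k are illustrative, not Pig defaults:

```python
# A plain-Python sketch of the Pig Latin pipeline (GROUP ... BY, FILTER ...
# BY COUNT, FOREACH ... GENERATE top-k): group urls by category, keep only
# big groups, and emit the top-k urls by pagerank in each.
from collections import defaultdict

def top_urls_by_category(urls, min_group_size=2, k=10):
    """urls: iterable of (url, category, pagerank) tuples."""
    groups = defaultdict(list)              # GROUP urls BY category
    for url, category, pagerank in urls:
        groups[category].append((url, pagerank))
    result = {}
    for category, members in groups.items():
        if len(members) > min_group_size:   # FILTER groups BY COUNT(urls) > ...
            ranked = sorted(members, key=lambda m: m[1], reverse=True)
            result[category] = [u for u, _ in ranked[:k]]  # top-k, like top10(urls)
    return result
```

Each Pig statement maps onto one step of the function, which is the sense in which Pig Latin is a step-by-step dataflow language.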
    grunt> summedUp = FOREACH grpd GENERATE group, COUNT(bag1) AS
    elementCount;
    grunt> QUIT;
  • batch mode
    $ pig [Link]
  • embedded mode with a Java library, which allows Pig Latin commands to be
    submitted via method invocations from a Java program
• word count in Pig
  raw = LOAD 'wordcount/input' USING PigStorage AS (token:chararray);
  raw_grp = GROUP raw BY token;
  rst = FOREACH raw_grp GENERATE group, COUNT(raw) AS count;
  DUMP rst;
• choosing between Pig and Hive
  • those with an SQL background choose Hive; those without choose Pig
  • Pig deals better with less-structured data, and Hive is used to query
    structured data, so some organizations choose to use both

Integrate Hadoop into the Enterprise Workflow using Sqoop & Flume
• Hadoop can complement an enterprise workflow to handle huge query sizes and
  make ad-hoc full or partial dataset queries possible, but cannot serve interactive
  queries (Impala?) and has less powerful updates (no transactions, no modification
  of existing records)
• HDFS can crunch sequential data faster, and offload a traditional file server for
  interactive data access
• don't read from an RDBMS into a Mapper directly (as it amounts to a distributed
  denial-of-service attack, DDoS); instead, import the data (by default, as a
  comma-delimited text file) into HDFS beforehand using Sqoop
• Sqoop is an open source tool originally written at Cloudera and now a top-level
  Apache Software Foundation project
• Sqoop can import just one table, all tables in a database, or just a portion of a
  table (by WHERE clause) using a JDBC interface (for any JDBC-compatible
  database), or use a free custom connector (using a system's native protocols to
  access data for faster performance), e.g., Netezza, Teradata, Quest Oracle Database
• Sqoop syntax and examples:
  $ sqoop [import|import-all-tables|list-tables] [--connect|--
  username|--password]
  $ sqoop export [options] # send HDFS data to a RDBMS
  $ sqoop help [command]
  $ sqoop import --username fred --password derf --connect
  jdbc:mysql://[Link]/personnel --table employee --
  where "id > 1000"
  $ sqoop list-databases --connect jdbc:mysql://localhost --username
  training --password training
  $ sqoop list-tables --connect jdbc:mysql://localhost/movielens --
  username training --password training
  $ sqoop import --connect jdbc:mysql://localhost/movielens --table
  movie --fields-terminated-by '\t' --username training --password
  training
  $ hadoop fs -ls movie
  $ hadoop fs -tail movie/part-m-00000
• Flume is a distributed, reliable, available open source (initially developed by
  Cloudera) service (with goals of reliability, scalability, manageability, and
  extensibility) for efficiently moving large amounts of data as it is produced, with
  optional processing of incoming data (transformation, suppression, metadata
  enrichment), and parallel writes to multiple HDFS file formats (text,
  SequenceFile, JSON, Avro, others) by agents
• Flume's reliability
  • each Flume agent has a source, a sink, and a channel or queue between the
    source and the sink (can be in-memory only, or a durable file channel that
    will not lose data even if power is lost)
  • data transfer between agents and channels is transactional, and a failed
    data transfer makes the agent roll back and retry
  • can configure multiple agents with the same task for redundancy
• Flume's horizontal scalability
  • increase system performance linearly by adding more agents
• Flume's manageability
  • the ability to control data flows, monitor nodes, modify the settings, and
    control outputs of a large system
  • configuration can be loaded from a properties file (pushed out to each node
    by scp, Puppet, Chef, etc.) on the fly
• Flume's extensibility
  • the ability to add new functionality to the system by writing your own
    sources and sinks
• access HDFS from legacy systems with FuseDFS and HttpFS
  • FuseDFS is a user-space filesystem, and allows mounting HDFS as a regular
    filesystem
  • HttpFS provides an HTTP/HTTPS REST interface to read/write from/to HDFS,
    either from within a program or via command-line tools (curl or wget)
    $ curl [Link]

MapReduce Variations and Extensions
• two-step workflow using an acyclic flow graph to minimize communication
• recursive workflow for iterated application to a fixed point, e.g., matrix-vector
  multiplication
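The iterated matrix-vector workflow above can be sketched as a simple power iteration that reapplies the matrix until the vector stops changing (the matrix, tolerance, and iteration cap here are illustrative):

```python
# Sketch of a recursive "iterate to a fixed point" workflow: repeatedly
# apply a matrix-vector multiplication until the result converges, as in
# PageRank-style computations.
def iterate_to_fixed_point(matrix, vector, tol=1e-9, max_iters=1000):
    """Repeatedly compute matrix * vector until successive vectors agree."""
    for _ in range(max_iters):
        new_vector = [sum(row[j] * vector[j] for j in range(len(vector)))
                      for row in matrix]
        if max(abs(a - b) for a, b in zip(new_vector, vector)) < tol:
            return new_vector                 # fixed point reached
        vector = new_vector
    return vector                             # give up after max_iters
```

In a MapReduce setting each multiplication step is one job, which is why such workflows need a driver that loops and tests convergence rather than a single acyclic job graph.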
• Google Pregel and its Apache open-source equivalent Giraph: view the data as a
  graph and use checkpoints

Dryad
• the Microsoft Dryad system (active development discontinued in 2011) is a
  relatively lower-level programming model for the execution of data-parallel
  applications; it allows the developer fine control over the communication graph
  and the subroutines (in C++) that live at its vertices, and gets better performance
• the Dryad system organization: job manager (JM), name server (NS), daemons
  (D), and running vertices (V)
• channels can be implemented by files (the default), TCP pipes (both ends need to
  run at the same time), or shared memory; applications can supply their own
  serialization and deserialization routines, and the items transported via the
  channels can be either newline-terminated text strings or tuples of base types
• graph description language: operations with "acyclic" properties (i.e., scheduling
  deadlock is impossible), implemented in C++
  • basic graph object: G = <VG, EG, IG, OG>
  • replicate operator ^: C = G^k = <VG1 ⊕ … ⊕ VGk, EG1 ∪ … ∪ EGk, IG1 ∪ …
    ∪ IGk, OG1 ∪ … ∪ OGk>, where ⊕ denotes sequence concatenation
  • concatenation composition operator °: if VA and VB are disjoint at run time
    and a new directed edge set Enew connects OA to IB, then C = A ° B = <VA ⊕ VB,
    EA ∪ EB ∪ Enew, IA, OB>
  • pointwise composition operator >=:
    • if |OA| > |IB|, a single outgoing edge is created from each of A's
      outputs, assigned in round-robin to B's inputs
    • if |IB| > |OA|, a single incoming edge is created to each of B's inputs,
      assigned in round-robin from A's outputs
  • bipartite composition operator >>: A >> B forms the complete bipartite
    graph between OA and IB
  • merge operator ||: C = A || B = <VA ⊕* VB, EA ∪ EB, IA ∪* IB, OA ∪* OB>,
    where * means duplicates removed
• run-time graph refinements - partial aggregation operation: having grouped the
  inputs into k sets, the optimizer replicates the downstream vertex k times to
  allow all of the sets to be processed in parallel
• resource allocation, distribution, and scheduling: topological sorting
• fault-tolerant policy: all vertex programs are deterministic
  • report chain: vertex, daemon, job manager (JM)
  • heartbeat
  • schedule duplicate executions if vertices run too slowly

Low-Latency Cloud
• memcached: provides a general-purpose key-value store entirely in DRAM, and is
  widely used to offload back-end database systems
  • as of 2009, approximately 25% of all online data for Facebook was kept in
    main memory on memcached servers, providing a hit rate of 96.5%
  • memcached makes no durability guarantees
• RAMCloud
  • Jim Gray's 5-minute rule: pages referenced every five minutes should be
    memory resident
  • Jim Gray's 5-byte rule: spend five bytes of main memory to save one
    instruction per second
• FlashCloud replaces disks with flash memory
  • cheaper and consumes less energy than RAMCloud, but write latency is
    much higher
  • for high query rates and small data set sizes, DRAM is cheapest; for low
    query rates and large data sets, disk is cheapest; and in the middle
    ground, flash is cheapest
  • the cost of flash is limited by cost/query/sec, the cost of DRAM is limited
    by cost/query/bit; cost/query/bit is improving much more quickly than
    cost/query/sec
• RAMCloud replaces disks with DRAM, stores data in the main memory, and
  uses thousands of servers to create a large-scale storage system; disks
  are only used as backup
  • 100x-1,000x lower latency than disk-based systems and 100x-1,000x greater
    throughput; manipulate data 100x-1,000x more intensively
  • flat storage model: all objects can be accessed quickly and are not affected
    by data placement; can support random access (MapReduce for sequential
    access)
  • durability and availability by replication and buffered logging
  • high cost and high energy use per bit, not cost effective for large-scale
    media storage yet

Berkeley Data Analytics Stack (BDAS)
• supports batch, streaming, and interactive computations in a unified analytics
  stack instead of 3 separate stacks; easy to develop sophisticated algorithms
  (e.g., graph, machine learning)
• Hadoop stack
  • data processing layer: Hive, Pig, MapReduce, HBase, Storm, …
  • resource management layer: Hadoop YARN
  • storage layer: HDFS, S3, …
• BDAS stack
  • data processing layer: Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib,
    MLBase, …
  • resource management layer: Spark, Apache Mesos
  • storage layer: Tachyon, HDFS, S3, …
• Apache Mesos enables multiple frameworks to share the same cluster resources,
  allowing BDAS & Hadoop to fit together seamlessly
  • Mesosphere: a startup to commercialize Mesos
• Apache Spark: a distributed engine with fault-tolerant, efficient in-memory
  storage (100x faster than Hadoop MapReduce), and powerful programming
  models and APIs (Scala, Python, Java, with 2-5x less code)
• Tachyon: an in-memory fault-tolerant storage system that allows multiple
  frameworks to share in-memory data

Machine Learning and Mahout
• machine learning is having programs work out what to do instead of telling
  programs exactly what to do
• the three Cs for machine learning: collaborative filtering (recommendations),
  clustering, and classification; the first two are unsupervised learning, and the
  last one is supervised learning
• Mahout is a data-agnostic machine learning library written in Java, with some
  pre-built scripts to analyze data; algorithms included in Mahout include:
  • recommendation: Pearson correlation, log likelihood, Spearman
    correlation, Tanimoto coefficient, singular value decomposition (SVD),
    linear interpolation, cluster-based recommenders
  • clustering: k-means clustering, Canopy clustering, fuzzy k-means, Latent
    Dirichlet allocation (LDA)
  • classification: stochastic gradient descent (SGD), support vector machines
    (SVM), Naïve Bayes, complementary naïve Bayes, random forests
• to run Mahout's item-based recommender:
  $ mahout recommenditembased --input movierating --output recs --
  userFile users --similarityClassname SIMILARITY_LOGLIKELIHOOD
  $ hadoop fs -cat recs/part-r-00000

Semantic Web and Resource Description Format (RDF)
• RDF uses subject-property-object (SPO) triples to model the entity-relationship
  graph (a pair of entities connected by a named relationship, or an entity
  connected to the values of a named attribute; the typed nodes correspond to
  entities and the edges to relationships)
• 3 requirements:
  • a high diversity of property names for schema-less data; RDF supports not
    only SQL select-project-join queries, but also allows wildcards for property
    names
  • RDF not only does Boolean-match evaluation, but also ranks search
    results (for too-many-result situations)
  • RDF supports text search with keywords and phrases
• SPARQL
  • each triple pattern has one or two of the SPO components replaced by
    variable names (prefixed with '?'); the dots ('.') between the triple patterns
    denote logical conjunctions; the triple patterns can be augmented by keyword
    conditions (to match the specified keywords enclosed in the '{ }' pairs,
    a semantic grouping to eliminate inappropriate matches); and using the same
    variable in different triple patterns denotes join conditions
  • some triples are more important than others; use statistical weights
    (the witnesses of a given triple) to construct a non-uniform distribution
  • for example:
    Select ?x, ?m Where { ?x hasType Musician {classical, composer}
    . ?x composedSoundtrack ?m . ?m hasType Movie {gunfight, battle}
    }
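The conjunctive triple-pattern semantics above (variables prefixed with '?', shared variables expressing joins) can be sketched with a toy matcher in Python; the triples and names are illustrative, and keyword conditions and ranking are omitted:

```python
# Toy sketch of SPARQL-style matching over SPO triples: a pattern component
# starting with '?' is a variable; a shared variable across patterns is a
# join condition; the pattern list is a logical conjunction.
def match_patterns(triples, patterns):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]                              # start with one empty binding
    for pattern in patterns:
        next_bindings = []
        for env in bindings:
            for triple in triples:
                new_env = dict(env)
                ok = True
                for part, value in zip(pattern, triple):
                    if part.startswith('?'):     # variable component
                        if new_env.get(part, value) != value:
                            ok = False           # join condition violated
                            break
                        new_env[part] = value
                    elif part != value:          # constant must match exactly
                        ok = False
                        break
                if ok:
                    next_bindings.append(new_env)
        bindings = next_bindings
    return bindings

triples = [
    ("Morricone", "hasType", "Musician"),
    ("Morricone", "composedSoundtrack", "TheGoodTheBadTheUgly"),
    ("TheGoodTheBadTheUgly", "hasType", "Movie"),
]
results = match_patterns(triples, [
    ("?x", "hasType", "Musician"),
    ("?x", "composedSoundtrack", "?m"),
    ("?m", "hasType", "Movie"),
])
```

Here `results` binds ?x to the musician and ?m to the movie, mirroring the Select ?x, ?m query in the example above.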