Introduction to Data Analytics
Rich's Data Analytics Training 10/29/2022
High Level Goals for the course
Understand foundations of data analytics so that you can interpret
and communicate results and make informed decisions
Study and learn to apply common statistical methods and
machine learning algorithms to solve business problems
Learn to work with popular tools to analyze and visualize data; more
importantly, encourage consistency across departments in the
analytics and tools used
Work with the cloud for data storage and for deployment of
applications
Learn methods for mastering and applying emerging concepts and
technologies for continuous data-driven improvements to
your business processes
Transform complex analytics into routine processes
Motivation
Tremendous advances have taken place in statistical
methods and tools, machine learning and data mining
approaches, and internet-based dissemination tools for
analysis and visualization.
Many tools are open source and freely available for
anybody to use.
Is there an easy entry-point into learning these
technologies?
Can we make these tools easily accessible to the decision
makers similar to how “office” productivity software is
used?
Newer kinds of Data
New kinds of data from different sources (see p. 23 of the Data Science
book): tweets, geolocation, emails, blogs
Two major types: structured and unstructured data
Structured data: data collected and stored according to a
well-defined schema, e.g. real-time stock quotes
Unstructured data: messages from social media, news, talks,
books, letters, manuscripts, court documents, ...
“Regardless of their differences, they work in tandem in any
effective big data operation. Companies wishing to make the most
of their data should use tools that utilize the benefits of both.” 5
We will discuss methods for analyzing both structured and
unstructured data
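A minimal R sketch of the difference, using made-up values: the `quotes` data frame and the `messages` vector are hypothetical and only illustrate that structured data can be queried directly, while unstructured text first needs some structure imposed on it (here, a simple word count).

```r
# Structured data: every record follows the same schema (hypothetical quotes).
quotes <- data.frame(
  symbol = c("AAA", "BBB", "AAA"),
  price  = c(10.5, 20.0, 11.0),
  stringsAsFactors = FALSE
)
# The schema makes questions easy to ask, e.g. mean price per symbol.
mean_price <- tapply(quotes$price, quotes$symbol, mean)
print(mean_price)

# Unstructured data: free text with no schema (hypothetical messages).
messages <- c("Great service, great price", "Price too high")
# A first step is to impose structure, here by splitting into words
# and counting how often each word occurs.
words <- unlist(strsplit(tolower(messages), "[^a-z]+"))
words <- words[words != ""]
word_counts <- table(words)
print(word_counts)
```

Real text analysis would go further than word counts, but the first step usually has this shape: turn free text into something tabular.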
Top Ten Largest Databases
[Bar chart: top ten largest databases (2007), size in terabytes on a 0 to 7000 axis. Databases shown: LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate]
Ref: [Link]
Top Ten Largest Databases in 2007 vs
Facebook's cluster in 2010
[Bar chart: top ten largest databases (2007), size in terabytes on a 0 to 7000 axis (LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate), compared with Facebook's cluster: 21 petabytes in 2010]
Ref: [Link]
Data Strategy
In this era of big data, what is your data strategy?
Strategy in the simple sense of “planning for the data challenge”
It is not only about big data: it covers all sizes and forms of data
Collecting data from customers used to be an elaborate
task involving surveys and other such instruments
Nowadays data is available in abundance, thanks to
technological advances as well as social networks
Data is also generated by many of your own business
processes and applications
Data strategy means many different things: we will discuss
this next
Components of a Data Strategy [1]
Data integration
Metadata
Data modeling
Organizational roles and responsibilities
Performance and metrics
Security and privacy
Structured data management
Unstructured data management
Business intelligence
Data analysis and visualization
Tapping into social data
This course will provide training in emerging technologies, tools, environments
and APIs available for developing and implementing one or more of these
components.
Data Strategy for newer kinds of data
How will you collect data? Aggregate data? What are
your sources (e.g., social media)?
How will you store the data, and where?
How will you use the data? Analysis? Analytics?
Data mining? Pattern recognition?
How will you present or report the data to the
stakeholders and decision makers? Visualization?
Archive the data for provenance and accountability.
Tools for Analytics
Elaborate tools with nifty visualizations but expensive
licensing fees, e.g. Tableau, Tom Sawyer
Software that you can buy for data analytics, e.g. Brilig:
small, affordable but short-lived
Open source tools with sporadic support, e.g. Gephi
Open source freeware with excellent community
involvement: the R system
Some desirable characteristics of the tools: simple,
quick to apply, intuitive, useful, flat learning curve
A demo to prove this point: from data to actions/decisions
Demo: Exam1 Grade: Traditional reporting 1
Statistic       Q1      Q2      Q3      Q4      Q5      Total
Mean            16.7    13.9    9.6     18.5    13.7    72.4
Median          20.0    16.0    9.0     19.0    17.0    76.0
Mode            20.0    20.0    15.0    25.0    20.0    90.0

ver1            Q1      Q2      Q3      Q4      Q5      Total
Mean            16.0    14.2    9.6     19.4    14.0    73.2
Mean (%)        80.1%   71.1%   64.0%   77.4%   70.2%   73.2%

ver2            Q1      Q2      Q3      Q4      Q5      Total
Mean            17.3    13.6    9.7     17.6    13.3    71.5
Mean (%)        86.7%   67.8%   64.6%   70.3%   66.7%   71.5%

Rows: mean, median, and mode for questions 1 to 5 and the total; then the mean for ver1 and ver2, in points and as a percentage
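A traditional summary like the one above can be produced in a few lines of R. This is a sketch only: the `scores` data frame holds hypothetical values (not the course data), `stat_mode` is a helper defined here because base R has no built-in statistical mode, and `max_points` is an assumption inferred from the percentages in the table rather than stated in the slides.

```r
# Hypothetical per-question scores for six students (not the course data).
scores <- data.frame(
  Q1 = c(20, 18, 20, 12, 16, 20),
  Q2 = c(20, 10, 16, 14, 20, 16),
  Q3 = c(15,  9,  6,  9, 12,  9),
  Q4 = c(25, 19, 15, 20, 25, 19),
  Q5 = c(20, 17, 10, 17, 12, 17)
)
scores$Total <- rowSums(scores)

# Helper (defined here, not part of base R): most frequent value;
# on ties it returns the smallest of the tied values.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Assumed maximum points per question, inferred from the percentages above.
max_points <- c(Q1 = 20, Q2 = 20, Q3 = 15, Q4 = 25, Q5 = 20, Total = 100)

# One row per statistic, one column per question.
summary_table <- rbind(
  Mean    = sapply(scores, mean),
  Median  = sapply(scores, median),
  Mode    = sapply(scores, stat_mode),
  Percent = 100 * sapply(scores, mean) / max_points
)
print(round(summary_table, 1))
```

The table is accurate but hard to act on, which is the point the demo goes on to make with plots.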
Traditional approach 2: points vs #students
Distribution of exam1 points
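A points-versus-students view can be sketched as follows. The `exam1` values are hypothetical, and the 10-point bands are an arbitrary choice made for illustration.

```r
# Hypothetical exam totals (not the course data).
exam1 <- c(55, 62, 68, 71, 74, 76, 79, 83, 88, 90, 90, 95)

# Bin the totals into 10-point bands and count students per band.
# right = FALSE makes bands closed on the left: [50,60), [60,70), ...
band_breaks <- seq(50, 100, by = 10)
bands <- cut(exam1, breaks = band_breaks, right = FALSE,
             include.lowest = TRUE)
students_per_band <- table(bands)
print(students_per_band)

# The same bins drawn as a histogram: points on x, number of students on y.
h <- hist(exam1, breaks = band_breaks, right = FALSE,
          xlab = "Exam 1 points", ylab = "Number of students",
          main = "Distribution of exam1 points")
```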
Individual questions analyzed...
Interpretation and action/decisions
R-code
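# Read the grade data from a CSV file picked in a file-chooser dialog
data2 <- read.csv(file.choose())
# Histogram of the midterm column
exam1 <- data2$midterm
hist(exam1, col = rainbow(8))
# Box plots of every column, first with a rainbow palette, then with named colors
boxplot(data2, col = rainbow(6))
boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "sienna"))
# Redraw and keep the box statistics (one column per box, five rows per box)
fn <- boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "pink"))$stats
# Label the five summary values of the sixth box
text(5.55, fn[1, 6], paste("Minimum =", fn[1, 6]), adj = 0, cex = .7)
text(5.55, fn[2, 6], paste("LQuartile =", fn[2, 6]), adj = 0, cex = .7)
text(5.0, fn[3, 6], paste("Median =", fn[3, 6]), adj = 0, cex = .7)
text(5.55, fn[4, 6], paste("UQuartile =", fn[4, 6]), adj = 0, cex = .7)
text(5.55, fn[5, 6], paste("Maximum =", fn[5, 6]), adj = 0, cex = .7)
# Add horizontal grid lines only
grid(nx = NA, ny = NULL)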
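To see what the `$stats` matrix used for the labels contains, here is a small sketch on a hypothetical vector `total` (not the course data). Each box gets one column with five rows, running from the lower whisker end to the upper whisker end.

```r
# Hypothetical totals for nine students (not the course data).
total <- c(48, 60, 65, 70, 72, 75, 80, 88, 95)

# boxplot() returns a list; with plot = FALSE nothing is drawn.
bp <- boxplot(total, plot = FALSE)

# One column per box; rows: lower whisker, lower hinge, median,
# upper hinge, upper whisker.
stats <- bp$stats[, 1]
names(stats) <- c("Minimum", "LQuartile", "Median", "UQuartile", "Maximum")
print(stats)

# fivenum() gives the same five values when there are no outliers.
print(fivenum(total))
```

Rows 1 and 5 are the whisker ends, so they equal the true minimum and maximum only when no points are flagged as outliers; `range()` gives the extremes regardless.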
Demo Details
Grade data is stored in an Excel file, a common input format
Convert this file to CSV
Start an RStudio project
Read the CSV data (using a file chooser option) into
data2
boxplot(data2)
That is it.
You can now add legends, colors, and labels to make it
presentable.
Export the plot as an image or PDF to report the results
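The steps above can be sketched end to end as follows. To keep the example runnable without the original spreadsheet, it writes a small hypothetical `grades` table to a temporary CSV first; the file names, titles, and axis labels are illustrative assumptions.

```r
# Hypothetical grade data written to a temporary CSV so the sketch runs
# without the original Excel export.
grades <- data.frame(
  Q1 = c(20, 18, 12, 16), Q2 = c(20, 10, 14, 16), Q3 = c(15, 9, 9, 12),
  Q4 = c(25, 19, 20, 15), Q5 = c(20, 17, 17, 12)
)
csv_path <- file.path(tempdir(), "exam1.csv")
write.csv(grades, csv_path, row.names = FALSE)

# Read the CSV; in an interactive session file.choose() could supply the path.
data2 <- read.csv(csv_path)

# Export the plot as a PDF: open the device, draw, then close it.
pdf_path <- file.path(tempdir(), "exam1_boxplot.pdf")
pdf(pdf_path, width = 7, height = 5)
boxplot(data2, col = rainbow(ncol(data2)),
        main = "Exam 1 scores by question",
        xlab = "Question", ylab = "Points")
grid(nx = NA, ny = NULL)  # horizontal grid lines only
invisible(dev.off())
```

Swapping `pdf()` for `png()` produces an image file instead; in RStudio the Export menu of the Plots pane does the same job interactively.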
Format of the course
Focus on a single topic per session
Begin with general introduction to the topic
Related concepts explained
Sample problems and solutions, algorithms, methods
and hands-on exercises
Implement solutions using tools
Don’t hesitate to provide feedback or ask questions
What this course is NOT: We will NOT teach the
internals of Statistics or Machine Learning, but we will
learn how to apply and use them for data analytics
Session Format
[Diagram: Session (lecture, demos, hands-on exercises); Slide Presentation; Lab Handout; Portfolio; Projects (R-Project): Code/Program, Data, Visualization]