Uploaded by Milton Roman

BIG DATA 1

BIG DATA
What is Big Data?
What gave rise to the Big Data era?
Characteristics of Big Data
Volume
Velocity
Variety
Veracity
Valence
Value
Characteristics of Big Data: Volume
• Describe what volume of big data means and why you should care about it
• Explain why data volume is not just about storage
Volume = Size
Every minute…
204 Million emails
200,000 photos
1.8 Million likes
1.3 Million video views
72 hours of video uploads
100 MB ~= a couple of encyclopedia volumes
A DVD ~= 5 GB
1 TB ~= 300 hours of good-quality video
LHC ~= 15 PB a year
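To get a feel for these orders of magnitude, the rough figures from the slides can be combined with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal (SI) units, and the constant names are made up for illustration.

```python
# Decimal (SI) units are an assumption; the slides only give rough sizes.
GB = 10**9
TB = 10**12
PB = 10**15

DVD_BYTES = 5 * GB             # "A DVD ~= 5 GB"
VIDEO_HOURS_PER_TB = 300       # "1 TB ~= 300 hours of good-quality video"
LHC_BYTES_PER_YEAR = 15 * PB   # "LHC ~= 15 PB a year"
EMAILS_PER_MINUTE = 204_000_000

# How many DVDs would one year of LHC data fill?
dvds_per_year = LHC_BYTES_PER_YEAR // DVD_BYTES

# How many hours of video would occupy the same space?
video_hours_per_year = (LHC_BYTES_PER_YEAR // TB) * VIDEO_HOURS_PER_TB

# Scale the per-minute email rate up to a full day.
emails_per_day = EMAILS_PER_MINUTE * 60 * 24

print(dvds_per_year)         # 3000000
print(video_hours_per_year)  # 4500000
print(emails_per_day)        # 293760000000
```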
Exponential data growth!
Relevance of Volume for Us?
More data = Better safety
Challenges: Storage and more…
Storage
Data acquisition
Retrieval
Distribution
Processing
Cost
Performance
Volume
Processing Big Data
Volume = Size
Challenges
Storage
Access
Processing
Characteristics of Big Data: Velocity
• Describe the velocity of Big Data
• Add the term “real-time streaming” to your big data vocabulary
• Identify why data velocity is more relevant in today’s world than ever before
Velocity == Speed
Speed of creating data
Speed of storing data
Speed of analyzing data
Big Data
Real-time action
Late decisions
Missing opportunities
How to decide what to pack?
Use weather information from this time last year?
Use weather information from last month?
OR
Use the weather status of this week or today?
Action
Real-time Processing
Batch Processing
Collect data -> Clean data -> Feed in chunks -> Wait -> Act
Real-Time Processing
Instantly capture streaming data -> Feed in real time to machines -> Process in real time -> Act
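The difference between the two pipelines can be sketched with a running average. The example below is only an illustration: the function names and the temperature readings are hypothetical, and a real streaming system would read from a live source rather than a list.

```python
from typing import Iterable, Iterator, List


def batch_mean(readings: List[float]) -> float:
    """Batch: wait until all data is collected, then compute once."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        # An up-to-date answer is available after every reading.
        yield total / count


temperatures = [20.0, 22.0, 24.0, 26.0]  # made-up sample data

print(batch_mean(temperatures))            # 23.0
print(list(streaming_mean(temperatures)))  # [20.0, 21.0, 22.0, 23.0]
```

The batch version can only act after the wait, while the streaming version produces something to act on at every step, at the price of working with incomplete data early on.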
Batch Processing
Real-Time Processing
Incomplete
Fast
Rate needed for data-driven actions
Rate of generation and processing of data
Which path to choose?
[Figure: speed of data generation (slow to fast) versus speed of data processing (slow to fast), with four options and their relative costs: 1 = $, 2 = $$, 3 = $, 4 = $$$]
Streaming data = “what’s going on right now”
Streaming data = gets generated at varied rates
Real-time processing
Agile and adaptable business decisions
Scalability - Variety
• Describe different aspects of data variety (aka heterogeneity) related to Big Data
• Identify the challenges and opportunities resulting from data variety
Variety == Complexity
• In the past, data were confined to tables
• Today, data are more heterogeneous
Axes of Data Variety
• Structural Variety – formats and models
• Semantic Variety – how to interpret and operate on data
• Media Variety – medium in which data get delivered
• Availability Variations – real-time? Intermittent?
Variety within a Type
• Think of an email collection
• Sender, receiver, date…: well-structured (table-like part)
• Body of the email: text (unstructured)
• Attachments: multi-media
• Who-sends-to-whom: network
• A past email: semantics
• Real-time?: availability
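The email example can be made concrete with Python's standard `email` package. This is a minimal sketch: the addresses, subject, and attachment bytes are invented for illustration, and it only covers the structured, text, multi-media, and network parts of a single message.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a small made-up message so the example is self-contained.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Trip photos"
msg.set_content("Here are the photos from last week.")
msg.add_attachment(b"not a real image", maintype="image",
                   subtype="png", filename="photo.png")

parsed = message_from_bytes(msg.as_bytes(), policy=policy.default)

# Well-structured, table-like part: one row of header fields.
structured = {
    "sender": str(parsed["From"]),
    "receiver": str(parsed["To"]),
    "subject": str(parsed["Subject"]),
}

# Unstructured text: the body of the email.
body_text = parsed.get_body(preferencelist=("plain",)).get_content()

# Multi-media: attachments with their file names and media types.
attachments = [(part.get_filename(), part.get_content_type())
               for part in parsed.iter_attachments()]

# Network: a who-sends-to-whom edge.
edge = (structured["sender"], structured["receiver"])

print(structured)
print(body_text.strip())
print(attachments)
print(edge)
```

A single email already yields four differently shaped pieces of data, each of which would normally be stored and analyzed with different tools.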
Scalability Issues
• Impact of data variety
• Harder to ingest
• Difficult to create common storage
• Difficult to compare and match data across variety
• Difficult to integrate
• Management and policy challenges
Characteristics of Big Data: Veracity
After this video you will be able to:
• Describe what the veracity of Big Data stands for and why you need to care about it
• Summarize what went wrong with the Google Flu Predictor and Amazon’s Banana Slicer Reviews
• Explain two methods to overcome the Big Data quality challenges
Veracity == Quality
Validity
Volatility
Veracity == Quality
Accuracy of data
Reliability of the data source
Context within analysis
Uncertainty
When does sentiment analysis not work?
Google Flu Trends
Uncertainty
Provenance
Veracity == Quality
Accuracy of data
Reliability of the data source
Context within analysis
Uncertainty
Provenance
Characteristics of Big Data: Valence
• Describe what valence means and how it relates to other Vs of Big Data
• Recognize when valence might become a challenging issue for a big data problem
Valence == Connectedness
Valence – a Concept from Chemistry
Valence – Measure of Connectivity
• Data Connectivity: two data items are connected when they are related to each other
• Valence: fraction of data items that are connected out of the total number of possible connections
Why worry about Valence?
Valence increases over time
Makes the data connections denser
Organizational Behavior
Valence: Challenges
• More complex data exploration algorithms
• Modeling and prediction of valence changes
• Group event detection
• Emergent behavior analysis
The Sixth V: Value
Eglence Inc. Big Data Case: Catch The Pink Flamingo
Current Mission: Find Star Backs on Land
Millions of Players!
Group Name: The Super Freaks
Group Name: Scary Beasts
Potentially inaccurate user info
Game rewards
Daily User Activity
200K Twitter mentions daily
#CatchThePinkFlamingo
Strong user communities
Big Data Solutions Architect
Data Source
Machine
• User activity logs
People
• Twitter conversations
Organization
• User demographic info
• Game stats
Dimension
Volume
• Big daily workload and associated data on players and game stats
Variety
• Multiple types of data
Velocity
• Real-time analysis of usage activity
Veracity
• Demographic info not accurate
Valence
• Connections between players