BIG DATA

What is Big Data?

What gave rise to the Big Data era?

Characteristics of Big Data
• Volume
• Velocity
• Variety
• Veracity
• Valence
• Value

Characteristics of Big Data: Volume
• Describe what volume of big data means and why you should care about it
• Explain why data volume is not just about storage

Volume = Size

Every minute…
• 204 million emails
• 200,000 photos
• 1.8 million likes
• 1.3 million video views
• 72 hours of video uploads

• 100 MB ≈ a couple of volumes of an encyclopedia
• A DVD ≈ 5 GB
• 1 TB ≈ 300 hours of good-quality video
• LHC ≈ 15 PB a year

Exponential data growth!

Relevance of Volume for Us?
• More data = Better safety

Challenges: Storage and more…
• Storage
• Data acquisition
• Retrieval
• Distribution
• Processing
• Cost
• Performance

Volume: Processing Big Data
• Volume = Size
• Challenges: storage, access, processing

Characteristics of Big Data: Velocity
• Describe the velocity of Big Data
• Add the term “real-time streaming” to your big data vocabulary
• Identify why data velocity is more relevant in today’s world than ever before

Velocity == Speed
• Speed of creating data
• Speed of storing data
• Speed of analyzing data

Big Data → Real-time action
Late decisions → Missing opportunities

How to decide what to pack?
• Use weather information of last year at this time?
• Use weather information of last month?
• OR use weather status of this week or today?
Action

Real-Time Processing vs. Batch Processing

Batch Processing
Collect data → Clean data → Feed in chunks → Wait → Act

Real-Time Processing
Instantly capture streaming data → Feed in real time to machines → Process in real time → Act

[Comparison figure: Batch Processing and Real-Time Processing; labels: Incomplete, Fast]

• Rate needed for data-driven actions
• Rate of generation and processing of data

[Figure: Which path to choose? Axes: Speed of Data Generation (slow to fast) and Speed of Data Processing (slow to fast). Paths and relative cost: 1 = $, 2 = $$, 3 = $, 4 = $$$]

• Streaming data = “what’s going on right now”
• Streaming data = gets generated at varied rates
• Real-time processing → Agile and adaptable business decisions
• Scalability

Characteristics of Big Data: Variety
• Describe different aspects of data variety (aka heterogeneity) related to Big Data
• Identify the challenges and opportunities resulting from data variety

Variety == Complexity
• In the past, data were confined to tables
• Today, data are more heterogeneous

Axes of Data Variety
• Structural variety: formats and models
• Semantic variety: how to interpret and operate on data
• Media variety: medium in which data get delivered
• Availability variations: real-time? Intermittent?

Variety within a Type
Think of an email collection:
• Sender, receiver, date… → Well-structured (table-like part)
• Body of the email → Text (unstructured)
• Attachments → Multi-media
• Who-sends-to-whom → Network
• A past email → Semantics
• Real-time?
  → Availability

Scalability Issues

Impact of data variety
• Harder to ingest
• Difficult to create common storage
• Difficult to compare and match data across variety
• Difficult to integrate
• Management and policy challenges

Characteristics of Big Data: Veracity
After this video you will be able to:
• Describe what the veracity of Big Data stands for and why you need to care about it
• Summarize what went wrong with Google Flu Trends and Amazon’s Banana Slicer reviews
• Explain two methods to overcome the Big Data quality challenges

Veracity == Quality
• Validity
• Volatility
• Accuracy of data
• Reliability of the data source
• Context within analysis

Uncertainty
• When sentiment analysis doesn’t work
• Google Flu Trends

Provenance

Characteristics of Big Data: Valence
• Describe what valence means and how it relates to other Vs of Big Data
• Recognize when valence might become a challenging issue for a big data problem

Valence == Connectedness

Valence: a Concept from Chemistry

Valence: Measure of Connectivity
• Data connectivity: two data items are connected when they are related to each other
• Valence: the fraction of connections that actually exist between data items, out of the total number of possible connections

Why worry about Valence?
• Valence increases over time
• This makes the data connections denser
• Organizational behavior

Valence: Challenges
• More complex data exploration algorithms
• Modeling and prediction of valence changes
• Group event detection
• Emergent behavior analysis

The Sixth V: Value
[Figure labels: Volume, Size, Value]

Eglence Inc. Big Data Case: Catch The Pink Flamingo
• Current mission: Find Star Backs on Land
• Millions of players!
• Group Name: The Super Freaks
• Group Name: Scary Beasts
• Potentially inaccurate user info
• Game rewards
• Daily user activity
• 200K Twitter mentions daily (#CatchThePinkFlamingo)
• Strong user communities

Big Data Solutions Architect

Data Source
• Machine: user activity logs
• People: Twitter conversations
• Organization: user demographic info, game stats

Dimension
• Volume: big daily workload and associated data on players and game stats
• Variety: multiple types of data
• Velocity: real-time analysis of usage activity
• Veracity: demographic info not accurate
• Valence: connections between players
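The valence definition from the Valence slides (connections that exist, divided by connections that are possible) can be illustrated on the "connections between players" dimension. What follows is a minimal sketch under stated assumptions, not a method from the course: the player names, the chat pairs, and the `valence` helper are all hypothetical, and connections are assumed to be undirected, so n items allow n(n-1)/2 possible pairs.

```python
def valence(items, connections):
    """Return the fraction of possible pairwise connections that exist.

    Assumption: connections are undirected, so (a, b) and (b, a) count once,
    and n items allow n * (n - 1) / 2 possible connections.
    """
    items = set(items)
    # Normalize each connection to an unordered pair; skip self-links
    # and pairs that mention an item outside the collection.
    actual = {
        frozenset(pair)
        for pair in connections
        if len(set(pair)) == 2 and set(pair) <= items
    }
    possible = len(items) * (len(items) - 1) // 2
    return len(actual) / possible if possible else 0.0


# Hypothetical players and chat interactions, for illustration only.
players = ["ana", "ben", "caro", "dev"]
chats = [("ana", "ben"), ("ben", "caro"), ("ben", "ana")]

early = valence(players, chats)                    # 2 of 6 possible pairs
later = valence(players, chats + [("caro", "dev")])  # 3 of 6 possible pairs
print(f"early valence: {early:.2f}, later valence: {later:.2f}")
```

As more players interact, the same player set yields a higher valence, which is the "valence increases over time" effect the slides describe: the connection graph becomes denser and exploring it becomes more expensive.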