Census Data Analysis & Data Mining

Data Mining
María del Rosario Bruera
IBM Scholars Program

Questions and answers
Questions:
· What is the value of our customers?
· Which customers are most likely to defect?
· Which products are sold together? …
Answers:
· They are in the user's data.
· Special tools are needed to find them.

Business Intelligence
· "It is an umbrella covering a set of concepts and methodologies whose mission is to improve business decision making by relying on facts and on systems that work with facts." (Howard Dresner, Gartner Group)

B.I.: resources and tools
· Data sources: warehouses, data marts, etc.
· Data management tools
· Extraction and query tools
· Modeling tools: Data Mining

What is Data Mining? (1997)
· Data Mining is the process of exploring and analyzing data, by automatic or semi-automatic means, in order to obtain meaningful patterns and business rules.
· Michael Berry, Gordon Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. Wiley, USA.

Reflections (2000)
· … we like the notion that the patterns must be meaningful …
· If there is anything we reject, it is the phrase "by automatic or semi-automatic means"; not because it is untrue (without automation it is impossible to mine large amounts of data), but because we believe too much emphasis has been placed on automation and not enough on the exploration and analysis stages.
· Data Mining is a process.

What Data Mining is NOT
· It is not a product bought off the shelf, but a discipline that must be mastered.
· It is not an instant solution to business problems.
· It is not an end in itself, but a process that helps find solutions to business problems.

Pillars of the Data Mining process
· Data
· Algorithms and techniques
· Modeling practices

Disciplines involved
· Artificial Intelligence
· Statistics
· Decision support technologies: OLTP
· Hardware and software technologies

Historical perspective
[Figures, two slides. Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data]

Stages of the Data Mining process
· Identify the business problem
· Transform the data into information
· Act on the results
· Measure the results of the actions

The Mining Process
[Figure. Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data]

The Data Analyst
· Is the link between the information technology areas and the business areas
· Translates information requirements into questions suitable for analysis with the mining tools
· Feeds new data cleaning and data validation criteria back into the company's Data Warehouse

The Data Analyst
[Figure: the analyst placed between Information technology and Business users]

The Data Analyst
[Figure: the labels appear to read Business Goals, Task Discovery, Knowledge Discovery, Data Analysis, Data Modelling, Data Discovery, Data Warehouse, Insights, Data Cleaning, Data Transformation, Data Mining]

Required skills
· Data manipulation (SQL)
· Knowledge of mining and exploratory analysis techniques
· Communication skills and the ability to interpret business problems
· Creativity

Data Mining Team
[Figure. Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data]

Project costs
[Figure. Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data]

Where the data comes from
· Relational databases
· Data Warehouses
· Data Marts and OLAP
· Other formats: Excel, ASCII files, surveys, census data, etc.

Types of data sources
· Transactional (e.g., the operations made with a credit card)
· Relational (e.g., the structure of the products offered by the bank)
· Demographic (e.g., characteristics of the household)

The shape of the data for Data Mining
· The data is organized as a flat table made up of rows and columns.
· Rows: the unit of analysis. For example: an account, a ticket.
· Columns: the attributes of each unit of analysis. For example: frequency of credit card use.

Characteristics of data tables for Data Mining
· All the data must be in a single table or database "view".
· Each row must correspond to an instance that is relevant to the business.
· Columns with no variability must be ignored.
· Columns with a unique value for each case must be ignored (e.g., account number).

Data quality
· The success of Data Mining activities is directly related to the QUALITY of the data.
· Missing data ("missings") and out-of-range values ("outliers") must be identified.

Data quality
· It is often necessary to preprocess the data before passing it to the analysis model. Preprocessing may include transformations, reductions or combinations of the data.
· The semantics of the data should guide the choice of a suitable representation, and the merits of the chosen representation bear directly on the quality of the model and of the later results.

Problems with the data
· Too much data
  - corrupt or noisy data
  - redundant data (requires factorization)
  - irrelevant data
  - excessive amount of data (sampling)
· Too little data
  - missing attributes (missings)
  - missing values
  - small amount of data
· Fractured data
  - incompatible data
  - multiple data sources

Data preparation
[Figure: steps Select, Transform, Mine, Assimilate; labels Transformed Data, Extracted Information, Assimilated Information]

Data Warehouse
· "A Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions." (Bill Inmon)
· "A copy of transaction data specifically structured for query and analysis." (Ralph Kimball)

Data Marts
· Technically, a data mart is a subset of the DW oriented to a specific business purpose: marketing, finance, production, etc.
· The term is also used for alternatives to a corporate DW that are smaller, cheaper and quicker to implement.

Data Warehouse architecture
[Figure: operational and external data feed the DW, with Metadata; the DW is exploited through Report/Query, EIS, OLAP and Data Mining]

Tools for exploiting the DW
· Visualization tools
· Reporting
· OLAP
· Data Mining

OLAP
· On-Line Analytical Processing
· OLAP tools build multidimensional views of the DW to optimize performance.
· They are supported by DW management engines that allow these "cubes" to be built.

OLAP
· Useful and powerful tools for accessing databases and Data Warehouses and obtaining information "reports"
· OLAP technology complements Data Mining activities and goes beyond what SQL can do.

Data Mining and OLAP
· Reporting, OLAP and query tools work well for building descriptive and retrospective models, to confirm or reject the user's prior hypotheses.

Data Mining and OLAP
· Data Mining tools find non-obvious patterns in the large volumes of information in the DW and propose predictive models.

What is Statistics?
· It is the discipline that extracts general information from specific data.
· It is the study of stability in variation.
· It is the art of examining, summarizing and drawing conclusions from data.

Data Mining and Statistics
· Statistical methods are at the heart of many data mining techniques.
· Many of these techniques were originally designed for confirmatory purposes.
· Exploratory statistics emerges with the contributions of J. Tukey.

Data Mining and Statistics
· Data Mining makes no a priori assumptions about the nature of the variables or the relationships between them: normality, linearity, etc.
· For Data Mining, statistical algorithms are adapted to process large volumes of data.

Data Mining and AI
· Artificial Intelligence joins Data Mining through artificial neural networks.
· They are used to build non-linear predictive models that learn through training and that resemble models of biological neuron networks.

Neural networks
· Neural networks are suited to predictive problems.
· A problem suitable for a neural network has three characteristics:
  - The INPUTS are clearly understood
  - The OUTPUT is clearly understood
  - There are enough examples (experience) to train the network

Neural models
· The neural network does not produce explicit rules that describe the model.
· A neural model is only as good as the data set used to train the network.
· The model is static and must be updated explicitly, by adding recent examples and retraining the network, to keep it current and useful.

Neural models
· Neural models can tackle a wide variety of problems and produce good results even in complex domains with continuous and categorical variables.
· They are suited to classification and prediction tasks when the results of the model matter more than understanding how the model works.

Customer Relationship Management
· It is the process that manages the relationship between the company and its customers.
· To succeed, it is necessary to identify the customers' consumption and behavior patterns.

Data Mining - CRM
· Data Mining is used to systematize the search for predictors of customer behavior during campaign design.
· It is also applied to measure campaign results and to feed them back into the CRM.

Typical Data Mining problems
· Classification
· Estimation
· Prediction
· Grouping based on association rules
· Clustering
· Description and visualization, etc.

Clustering problem
Group customers by their indicators (R: Recency, F: Frequency, M: Monetary amount, etc.) into segments of homogeneous behavior.
Result:
· Customers: Heavy, Medium, Light, etc.
· […]% of billing is concentrated in the Heavy cluster ([…]% of customers).
· Heavy customers are married, with children, self-employed, with an income above […].

Classification problem
Classify a new customer, according to his or her sociodemographic profile, as a potential Heavy, Medium or Light customer.

Estimation problem
· Estimate the consumption of a given category of items by a group of customers in the next quarter.
· Estimate the potential LTV (Life Time Value) of a new customer.

Prediction problem
Predict customer abandonment (churning, attrition):
· For a mobile phone company
· For an AFJP (pension fund administrator)
· For a credit card

Association problem
Find the rules that determine cross-traffic between products for the customers of a bank. For example:
"When a customer becomes active in Savings Accounts, the next product in which the customer becomes active is Personal Loans. This pattern occurs in […]% of cases."

Visualization problem
Use geolocation software (GIS) to represent the distribution of customers in the area of influence of a retailer's branches.

Common problems
· Characterizing customer profiles to define up-selling and cross-selling actions
· Campaign tracking and prediction of response / non-response
· Credit card consumption basket and fraud prevention
· Churn prediction models

Common problems
· Mileage and loyalty programs
· Consolidating in-house databases with external sources
· Web mining and analysis of traffic and of e-business resource usage
· Defining sampling frames for market research and customer satisfaction surveys

Choosing the model for Data Mining
· Main goals of the Data Mining process:
  - prediction
  - description
· The method to use depends on the goals of the analysis, but also on the quality and quantity of the available data.

[Figure. Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data]

How to select a potential DM application
Practical considerations:
· Potential for significant impact: cost-benefit ratio
· There is no other alternative
· Institutional support exists
· There are no legal impediments to using the information

How to select a potential DM application
Technical considerations:
· Sufficient data is available
· The attributes are relevant
· Noise levels in the data are low
· The confidence level required of the results is specified
· Prior knowledge exists

Evaluating the models
· How well does the model fit?
· Does it describe the observed data correctly?
· How much confidence can be placed in its predictions?
· How understandable is the model?

Measures
· The agreement of a predictive model with reality is measured by the error rate, that is, the percentage of cases that were misclassified or whose prediction was incorrect.
· For this purpose, validation and testing data are set aside, and the model must be applied to them periodically as a control.

Measures
· For descriptive models, a good rule is one that gives the most understandable information with the shortest rule expression ("length").
· Ultimately, the most important measure of effectiveness is the return on investment.

A successful project
· A single project leader
· A multidisciplinary team made up of people from the IT and business areas
· The business units are involved from the start
· The IT area is involved from the start
· A small pilot project that demonstrates the advantages of Data Mining

New technologies

Web Mining
· It is the discovery of meaningful patterns from the analysis of the structure, content and usage of the Web.

Web Mining Taxonomy
· Web Mining
  - Web content
  - Web structure
  - Web usage

Web mining results
· […]% of the visitors who access www.ibm.com/redbooks also access www.ibm.com/software/data/iminer/fordata
· Entry and exit points

Web mining results
· Link analysis and sequential patterns of page links
· Segmentation of e-commerce customers
· Product basket
· etc.

Text Mining
· These are new tools for extracting information from "unstructured" documents: organizing, segmenting and indexing them.

Text Mining problems
· Automatic routing of e-mails according to their content
· Automatic classification of the documents on an intranet
· Searching for information in documents in several languages at once

Text Mining problems
· Content analysis of Web pages
· Organizing Web search services
· Extracting summary concepts from documents on the same subject

Conclusions

Why Data Mining?
· Data Mining is an effective tool for answering complex Business Intelligence questions.
· The available tools automate part of the task of finding the behavior patterns hidden in the data.
· But…

What cannot be automated (yet)
· Choosing the business problems that are candidates for Data Mining
· Identifying and collecting the data that contains the information sought
· Massaging and treating the data so that patterns can be searched for
· Designing and computing derived variables

What cannot be automated (yet)
· The action plan that, building on the results of the model, produces the ROI
· Measuring the success of the actions taken on the basis of the Data Mining results

Conclusions
· Make Data Mining a part of your business project.
· Make Data Mining part of your organization's "culture".

Examples with DB2 Intelligent Miner for Data

Techniques used
· Clustering (segmentation)
· Product basket
· Decision tree
· Neural network as a predictive model

What is "clustering"?
· It is the partition of a set of individuals into subsets that are as homogeneous as possible.
· The goal is to maximize the similarity of the individuals within a cluster and to maximize the differences between clusters.

Applications of the technique
· Database segmentation
· Fraud detection
· Defect detection

Goals
· Determine the optimal number of clusters
· Assign each individual to exactly one cluster
· Assess the impact of the variables on the formation of each cluster
· Understand the "profile" of each cluster

Similarity measures
· Categorical variables (nominal and ordinal scales): two values are similar if they are equal.
· Numeric variables (metric scales): the algorithm measures their difference in units of standard deviations.

Similarity example
Attribute        Individual 1   Individual 2   Comparison
Name             Juan           Maria          Not evaluated
Sex              M              F              Different
Marital status   C              C              Equal
Place            Cap.Fed        GBA            Different
Similarity       0.33           0.33
(One equal attribute out of the three evaluated.)

Condorcet criterion
· It is a similarity measure that varies between […] and […].
· It equals […] when the individuals are placed randomly in the clusters.
· It equals […] when all the individuals in each cluster are identical and no individuals with those characteristics lie outside that cluster.
· Usual minimum Condorcet value: […]

The problem
Segment the customer database of a credit card according to the customers' consumption indicators, in order to identify the highest-value segment.

Available data
· From the database of the customers' transactions over the last year, the following variables are obtained:
  - Card usage frequency: calculated as the mean number of days between transactions.
  - Average monthly transaction balance, in $
  - Average amount per transaction
  - Number of services paid by automatic debit
  - Sociodemographic data: sex, age, marital status, occupation, children

Data preparation
· Define the unit of analysis: account or card?
· Define what a transaction is: e.g., how are adjustments (negative amounts) treated?
· Define derived variables: in the frequency, how do automatic debits count?

Data preparation
· Describe the variables to be included in the model in order to:
  - Compute measures of location and dispersion
  - Identify skewed distributions
  - Identify missings
  - Identify incorrect or out-of-range values
  - Identify outliers

Statistics
General descriptives
[Figure: Cluster 0, 100.00% of population. Distributions of estado_civil (single, divorced/widowed, married), ocup (not working, self-employed, employee), sexo (male, female), hijos (yes, no), edad, socios, avg tckt, frecu and pesos]

Segmentation criteria
· The variables describing consumption behavior are taken as "active" variables.
· The sociodemographic attributes are taken as supplementary variables.

Census Credit Card
[Figure: clustering overview with three clusters (0, 1 and 2), each showing the distributions of sexo, estado_civil, ocup, hijos, edad, socios, frecu, pesos and avg tckt; the segment sizes shown are 55, 27 and 18]

Census Credit Card: Cluster 2
[Figure: profile of Cluster 2, 27.21% of population, with these annotations]
· Frequent use
· Self-employed
· Very high balance
· Men
· Married
· Four or more automatic debits
· With children
· Age 40-45
· Very high ticket

Pareto
[Figure: Pareto chart comparing % of accounts and % of total balance for Cluster 0, Cluster 1 and Cluster 2]

Decision trees
· These techniques are used for prediction and classification.
· The result is a set of "rules" that explain the behavior of a TARGET variable in terms of other PREDICTOR variables.
· In this example they are used to "explain" the clusters.

Algorithms
· CHAID: Chi-Squared Automatic Interaction Detection
· CRT: Classification and Regression Tree
· C[…], QUEST and others
· Intelligent Miner uses a variant of CRT

Behavior tree
[Figure: decision tree on the consumption variables]
If the customer has 4 or more automatic debits and a balance > $727, then the probability of belonging to cluster 2 is 99%.

Sociodemographic tree
[Figure: decision tree on the sociodemographic variables]

Market Basket Analysis
· The problem: find the association rules that organize the orders for extra "toppings" at a pizzeria, from the analysis of a set of sales tickets.

The Data Mining table
· Ticket id
· Product code:
  - Mushrooms
  - Pepperoni
  - Cheese
  - Beer
  - Soda
  - Other drink

Purpose of MBA
· Generate rules of the form:
  IF condition THEN result
· Example:
  IF product A and product C THEN product B

Types of rules
· Useful (actionable): rules that carry good-quality information and can be translated into business actions
· Trivial: rules already known in the business because they occur so often
· Inexplicable: arbitrary curiosities with no practical application

Problems with MBA
· Having many items in the analysis set makes computation time grow exponentially.
· Criteria are needed to select the best rules.
· Building a product taxonomy is important.

How good is a rule?
· Measures that rate a rule:
  - Support
  - Confidence
  - Lift (Improvement)

Support
· The share (%) of transactions in which the rule is found.
  E.g.: "If A then B" is present in 4000 of 10000 transactions.
  Support (A/B) = 40%

Confidence
· The number of transactions that contain the rule, relative to the number of transactions that contain the conditional clause.
  E.g.: in the previous case, if A is present in 6000 transactions (60%):
  Confidence (A/B) = 40% / 60% = 66.7%

Improvement
· The predictive power of the rule.
  Improvement = p(A/B) / (p(A) * p(B)), where p(A/B) is the share of transactions containing both A and B
  E.g.: p(A/B) = 40%; p(A) = 60%; p(B) = 30%
  Improv (A/B) = 40% / (60% * 30%) = 2.22
  Greater than 1: the rule has predictive value

Calculation example

Basic data
Mushrooms   Pepperoni   Cheese   Count
Yes         Yes         Yes        100
Yes         Yes         No         400
Yes         No          Yes        300
Yes         No          No         100
No          Yes         Yes        200
No          Yes         No         150
No          No          Yes        200
No          No          No         550
TOTAL                             2000

Rules
Rule                               Count   Support   Confidence   Lift
Mushrooms                            900      0.45
Pepperoni                            850      0.43
Cheese                               800      0.40
Mushrooms --> Pepperoni              500      0.25         0.56   1.31
Mushrooms --> Cheese                 400      0.20         0.44   1.11
Cheese --> Pepperoni                 300      0.15         0.38   0.88
Mushrooms + Pepperoni --> Cheese     100      0.05         0.20   0.50
Mushrooms + Cheese --> Pepperoni     100      0.05         0.25   0.59
Cheese + Pepperoni --> Mushrooms     100      0.05         0.33   0.74
Callouts on the rules table: "Can be discarded for low support" and "Significant rules".
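
The three measures can be recomputed directly from the basic-data counts. The following is a minimal Python sketch, assuming the eight ticket counts of the calculation example; the item names are English translations (Hongos = mushrooms, Queso = cheese) and the function names `support`, `confidence` and `lift` are illustrative, not part of Intelligent Miner.

```python
# Ticket counts from the basic-data table, keyed by the presence of
# (mushrooms, pepperoni, cheese) on the ticket. Names are illustrative.
ITEMS = ("mushrooms", "pepperoni", "cheese")
COUNTS = {
    (True, True, True): 100,
    (True, True, False): 400,
    (True, False, True): 300,
    (True, False, False): 100,
    (False, True, True): 200,
    (False, True, False): 150,
    (False, False, True): 200,
    (False, False, False): 550,
}
TOTAL = sum(COUNTS.values())  # 2000 tickets


def support(items):
    """Share of tickets that contain every item in `items`."""
    idx = [ITEMS.index(item) for item in items]
    hits = sum(n for combo, n in COUNTS.items() if all(combo[i] for i in idx))
    return hits / TOTAL


def confidence(body, head):
    """Share of the tickets containing `body` that also contain `head`."""
    return support(body + head) / support(body)


def lift(body, head):
    """Confidence of the rule relative to the overall share of `head`."""
    return confidence(body, head) / support(head)


RULES = [
    (("mushrooms",), ("pepperoni",)),
    (("mushrooms",), ("cheese",)),
    (("cheese",), ("pepperoni",)),
    (("mushrooms", "pepperoni"), ("cheese",)),
    (("mushrooms", "cheese"), ("pepperoni",)),
    (("cheese", "pepperoni"), ("mushrooms",)),
]

for body, head in RULES:
    print(
        f"{' + '.join(body)} --> {head[0]}: "
        f"support={support(body + head):.2f} "
        f"confidence={confidence(body, head):.2f} "
        f"lift={lift(body, head):.2f}"
    )
```

A lift above 1 means the head item appears more often on tickets containing the body than on tickets overall; for mushrooms --> pepperoni the sketch gives a lift of about 1.31.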

Another MBA example
· The association is between pizza toppings and drinks.
· Rule charts make it possible to identify visually the rules with good support, confidence and lift.

[Figure: rule chart]

Rules
Support (%)   Confidence (%)   Type   Lift     Rule body                             Rule head
 3.1746       80.0000          +      1.7800   [Mushrooms]+[Other drink]             ==> [Pepperoni]
16.6667       81.8200          +      1.7200   [Beer]+[Pepperoni]                    ==> [Mushrooms]
13.0688       78.4100          +      1.6500   [Beer]+[Cheese]                       ==> [Mushrooms]
16.6667       63.0000          .      1.5400   [Mushrooms]+[Pepperoni]               ==> [Beer]
29.8413       72.8700          +      1.5300   [Beer]                                ==> [Mushrooms]
29.8413       62.6700          +      1.5300   [Mushrooms]                           ==> [Beer]
13.0688       61.7500          .      1.5100   [Mushrooms]+[Cheese]                  ==> [Beer]
 9.0476       57.0000          +      1.4000   [Pepperoni]+[Cheese]                  ==> [Soda]
 3.0159       57.0000          .      1.3900   [Mushrooms]+[Pepperoni]+[Cheese]      ==> [Beer]
 6.9312       56.9600          .      1.3500   [Mushrooms]+[Soda]                    ==> [Cheese]
 9.0476       56.4400          .      1.3300   [Soda]+[Pepperoni]                    ==> [Cheese]

Web Mining
· The problem: analyze the transactions and the profile of the users of the Web site of an online retailer.

Models applied
· Association of visited pages (product basket)
· User profile (demographic clustering)
· Potential buyers (decision tree)

Page association
[Table: association rules between visited pages; the contents are not recoverable]

Clustering result: Business cluster
· Low COMMUNICATION
· High revenue
· Most are male
· High AGE
· Low FUN
· High rate in REGION 6 = Frankfurt

Clustering result: Fun cluster
· 10% of all users
· Most are female
· High COMMUNICATION
· Low revenue
· High rate in REGION 5 = Cologne
· High FUN
· Low AGE

Classification result
IF the interest in INFORMATION is very low (nearly 0) AND the interest in COMMUNICATION is high (with an access rate of at least 5) THEN the visitor will probably not buy (95.5%).

Click sequences
· In 17.2% of all transactions, the user goes to GOURMET.html and then sends two e-mails.
· In 56.9% of all transactions, the user first goes to SPORTS.html, then uses the chat as a communication medium, and finally turns his attention to Fashion.
· In 25.9% of all transactions, the user first goes to womens-fashion.html, then sends a postcard, and then goes back to womens-fashion.html.
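
The percentages quoted for click sequences are the share of transactions in which the pages occur in the stated order. The following is a rough Python sketch of that count, assuming each transaction is a list of page or action names in click order and that other clicks may occur in between; the sample sessions and the helper names `contains_in_order` and `sequence_support` are hypothetical.

```python
def contains_in_order(session, pattern):
    """True if the steps of `pattern` occur in `session` in the same order.

    Other clicks may occur between the steps (an assumption of this sketch).
    """
    remaining = iter(session)
    # `step in remaining` consumes the iterator up to the first match,
    # so each later step is searched only after the previous one.
    return all(step in remaining for step in pattern)


def sequence_support(sessions, pattern):
    """Share of sessions that contain `pattern` as an ordered sequence."""
    if not sessions:
        return 0.0
    matches = sum(1 for s in sessions if contains_in_order(s, pattern))
    return matches / len(sessions)


# Hypothetical click streams, one list per transaction.
sessions = [
    ["home.html", "GOURMET.html", "email", "email"],
    ["SPORTS.html", "chat", "fashion.html"],
    ["GOURMET.html", "email"],
    ["home.html", "SPORTS.html"],
]

# "The user goes to GOURMET.html and then sends two e-mails."
pattern = ["GOURMET.html", "email", "email"]
print(f"{sequence_support(sessions, pattern):.1%}")
```

Only the first sample session contains the page followed by two e-mails, so the pattern is counted in one of the four transactions.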

Early detection of delinquency
· The problem: identify in advance the customers most likely to fall into arrears, so that preventive collection and recovery actions can be brought forward.

Possible solutions
· Rules to identify the customer segments most prone to delinquency
· Delinquency risk scoring

Applicable models
· For the rules: classification tree
· For the scoring: neural model

Tree, 50/50 sample
[Figure: classification tree built on a 50/50 sample]

Delinquent
[Figure: profile of the delinquent segment (Mora 60 días; Región 90-98; 9.39% of population). Variables shown: VIP CUSTOMER, LATE FEES PAID 30 DAYS, OVER CREDIT LIMIT, CREDIT SCORE, CUSTOMER AGE, CREDIT LIMIT, INCOME, MEMBER (MONTHS), # PURCHASES / WEEK, CASH LIMIT, MORA 60]

Not delinquent
[Figure: profile of the non-delinquent segment (Mora 60 días; Región 0-20; 6.81% of population). Variables shown: LATE FEES PAID 30 DAYS, VIP CUSTOMER, OVER CREDIT LIMIT, CUSTOMER AGE, CREDIT SCORE, MEMBER (MONTHS), INCOME, CASH LIMIT, CREDIT LIMIT, # PURCHASES / WEEK, MORA 60]

Verification
[Figure: predicted scoring (axis from -0.2 to 1.2) by actual delinquency: NO (N = 2947) and YES (N = 258)]
The scoring predicted by the network is clearly differentiated between delinquent customers and payers.

References
· Data Mining Techniques for Marketing, Sales, and Customer Support. Michael Berry, Gordon Linoff. Wiley, USA.
· Data Mining with Neural Networks. Joseph Bigus. McGraw-Hill, USA.
· Data Mining: A Hands-On Approach for Business Professionals. Robert Groth. Prentice Hall, USA.
· Mastering Data Mining. Michael Berry, Gordon Linoff. Wiley, USA.
· Data Preparation for Data Mining. Dorian Pyle. Morgan Kaufmann Publishers, Inc., San Francisco, USA.
· Análisis Multivariante. Hair, Anderson, Tatham, Black. Prentice Hall, Madrid.
· Building Data Mining Applications for CRM. A. Berson, S. Smith, K. Thearling. McGraw-Hill.

References
· IBM:
  - www.ibm.com/software/data/iminer/fordata
  - www.dmg.org
  - www.ibm.com/redbooks
· The Data Mine: www.thedatamine.com
· KDD Mine: www.kdnuggets.com
· chb@census.com.ar