
Data Science Tutorial

Data Science Tutorial. 2017 SEI Data Science in Cybersecurity Symposium, August 10, 2017. Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA 15213. Presenters: Eliezer Kanal (Technical Manager, CERT) and Daniel DeCapria (Data Scientist, ETC). © 2017 Carnegie Mellon University. Approved for Public Release; Distribution is Unlimited.

Today's presentation: a tale of two roles. The call center manager: introduction to data science capabilities. The master carpenter: overview of the data science toolkit. ... • Works best with numeric data (usually) • Works for predicting specific numeric outcome.


Transcription of Data Science Tutorial

Slide 1: Data Science Tutorial
2017 SEI Data Science in Cybersecurity Symposium, August 10, 2017
Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA 15213
Eliezer Kanal, Technical Manager, CERT
Daniel DeCapria, Data Scientist, ETC
© 2017 Carnegie Mellon University. Approved for Public Release; Distribution is Unlimited.

Slide 2: About us
Eliezer Kanal, Technical Manager, CERT. Recent projects:
• ML-based malware classifier
• Network traffic analysis
• Cybersecurity questionnaire optimization
Daniel DeCapria, Data Scientist, ETC. Recent projects:
• Cyber risk situational dashboard
• Big Learning benchmarks

Slide 3: Today's presentation: a tale of two roles
• The call center manager: introduction to data science capabilities
• The master carpenter: overview of the data science toolkit

Slide 4: Call center manager
First day on [...]!

Slide 4 (continued): Call center manager
Goal: Reduce costs
Task: Keep calls short!
Data:
• Average call minutes (5:08) ... very long!
• Number of employees: 300
• Average calls per day: ~28,000

Slide 5: Call center manager: Gather data
Get the data!
• Where is it?
• What will you use to analyze it?
• How accurate is it?
• How complete is it?
• Is it too big to easily read?

Slide 6
Data cleaning = 90% of the work
2 weeks (10 days) = 9 cleaning, 1 analyzing

Slide 7: Cleaning the data: Structuring the data
Figure scale: more structure to less structure
Goal: Organize data in a table, where
• Column = descriptor (age, weight, height)
• Row = individual, complete records
How can you get data out of these documents?
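As a minimal sketch of the structuring step on slide 7, the snippet below pulls fields out of free-text call notes so that each note becomes one row and each field one column. The note format, the field names, and the regular expression are all hypothetical and invented for illustration; real call-center documents would need their own parsing rules.

```python
import re

# Hypothetical free-text call notes; the format is assumed for illustration.
RAW_NOTES = [
    "Agent: Beth Jones; length 4:12; solved: yes",
    "Agent: Dan Thomas; length 11:05; solved: no",
    "garbled line with no usable fields",
]

# One named group per column (descriptor) we want in the table.
NOTE_PATTERN = re.compile(
    r"Agent:\s*(?P<name>[^;]+);\s*length\s*(?P<minutes>\d+):(?P<seconds>\d{2});"
    r"\s*solved:\s*(?P<solved>yes|no)"
)


def parse_note(note):
    """Return one table row as a dict, or None if the note does not match."""
    match = NOTE_PATTERN.search(note)
    if match is None:
        return None
    return {
        "name": match["name"].strip(),
        "call_seconds": int(match["minutes"]) * 60 + int(match["seconds"]),
        "solved": match["solved"] == "yes",
    }


# Rows = individual, complete records; unparseable notes are set aside.
rows = [row for row in map(parse_note, RAW_NOTES) if row is not None]
rejected = [note for note in RAW_NOTES if parse_note(note) is None]

for row in rows:
    print(row)
print(f"{len(rejected)} note(s) could not be structured")
```

Keeping the rejected notes, rather than silently dropping them, gives a rough measure of how complete the structured table is.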

Slide 8: Cleaning the data
Even when you think your data should be clean, it might not be.

Slide 9: Cleaning the data: Call center example
Table columns: Name | Mgr | Dir | Call Length | Phone Line | Problem solved? | Comment
Table rows: cell values not recoverable; names appearing include Beth Jones, Dan Thomas, Anne Kim, Keane, Mark Ryan, Tim Wood
Column type labels: Nominal, Unstructured

Slide 10: Call center manager: Exploring data

Slide 11: Exploratory Data Analysis (EDA)
• Mean
• Median
• Standard deviation
• Histograms!
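The EDA quantities listed on slide 11 can be sketched with Python's standard `statistics` module. The call lengths below are made up for illustration and were chosen so that their mean comes to 308 seconds, the 5:08 average quoted in the slides; they are not the presenters' data.

```python
import statistics

# Hypothetical call lengths in seconds (not real call-center data).
call_seconds = [45, 60, 75, 90, 120, 150, 200, 310, 480, 1550]

mean_s = statistics.mean(call_seconds)
median_s = statistics.median(call_seconds)
stdev_s = statistics.stdev(call_seconds)  # sample standard deviation

print(f"mean   = {mean_s:.1f} s")
print(f"median = {median_s:.1f} s")
print(f"stdev  = {stdev_s:.1f} s")


def text_histogram(values, bin_width):
    """Count values per fixed-width bin; keys are each bin's lower edge."""
    counts = {}
    for value in values:
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


histogram = text_histogram(call_seconds, bin_width=300)
for lower_edge, count in histogram.items():
    print(f"{lower_edge:>5}-{lower_edge + 299:<5} {'#' * count}")
```

In this toy sample a single very long call pulls the mean up to 308 seconds while the median is only 135 seconds, which is the kind of gap a histogram makes visible at a glance.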

Slide 12: Distributions
• The majority of data will follow SOME distribution
  o Weight of all Americans: Gaussian
  o Phone call length: Exponential
• Determining the distribution is a common data science task
• Multidimensional outliers: insider threat example
Image Copyright 2001-2016 The Apache Software Foundation. See Copyright slide for more [...]

Slide 13: EDA: Smart visualizations

Slides 14-16: (no text in transcription)
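One quick, informal way to check whether call lengths look exponential is to compare sample statistics against what that distribution predicts: for an exponential distribution the median equals ln(2) times the mean, and the standard deviation equals the mean. The sketch below runs that check on simulated data; the exponential shape, the sample size, and the random seed are assumptions made for illustration, and only the 5:08 mean comes from the slides.

```python
import math
import random
import statistics

MEAN_CALL_SECONDS = 308  # 5:08, the average quoted in the slides

# Simulated data: we ASSUME an exponential distribution for illustration;
# this is not the presenters' data.
rng = random.Random(42)
simulated = [rng.expovariate(1 / MEAN_CALL_SECONDS) for _ in range(20_000)]

sample_mean = statistics.mean(simulated)
sample_median = statistics.median(simulated)
sample_stdev = statistics.stdev(simulated)

# For an exponential distribution: median = ln(2) * mean and stdev = mean.
expected_median = math.log(2) * sample_mean

print(f"mean   = {sample_mean:.1f} s")
print(f"median = {sample_median:.1f} s (exponential predicts {expected_median:.1f} s)")
print(f"stdev  = {sample_stdev:.1f} s (exponential predicts {sample_mean:.1f} s)")
```

On real data, a median far from ln(2) times the mean, or a standard deviation far from the mean, would suggest that a different distribution fits better.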

Slide 17: Brief interruption
Skeptics in the audience

Slide 18: Brief interruption
Data science helps you use data to get results. This is [...]

Slide 19: Call center manager: call duration histogram
Average (5:08)

Slide 20: Call center manager: Insights!

Slide 20 (continued): Call center manager: Insights!
Strategy update:
• Goodbye "reduce call time"
• Hello "reduce callbacks"
How to measure? Callbacks aren't currently captured.

Slide 21: Feature Engineering
Need more useful data? Create it yourself!

Slide 22: Feature Engineering
• Feature engineering: coming up with new, useful (i.e., informative) data
  o mean, sums, medians, ..., xy, sqrt(xy), etc.
• Our case:
  o # of callbacks
  o Call during peak time?
  o Overall agent performance? (combination of factors)

Slide 23: The role of listening in data science
• Data science finds hidden patterns in data
• Experts know what data & patterns are important
• Talk to subject matter experts

Slide 24: Call center manager: Predictive analytics
Can we predict staffing ...
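Two of the engineered features named on slide 22, the number of callbacks and whether a call arrived during peak time, could be derived from a raw call log roughly as follows. The log layout, the 7-day window that defines a "callback", and the 11:00-13:59 peak period are all assumptions chosen for illustration; the slides do not define them.

```python
from datetime import datetime, timedelta

# Hypothetical call log: (caller_id, agent, call start time).
CALL_LOG = [
    ("c1", "Beth", datetime(2017, 8, 1, 9, 30)),
    ("c2", "Dan", datetime(2017, 8, 1, 12, 15)),
    ("c1", "Dan", datetime(2017, 8, 2, 13, 5)),    # c1 calls again next day
    ("c3", "Beth", datetime(2017, 8, 3, 16, 40)),
    ("c1", "Beth", datetime(2017, 8, 20, 10, 0)),  # more than 7 days later
]

CALLBACK_WINDOW = timedelta(days=7)  # assumed definition of a "callback"
PEAK_HOURS = range(11, 14)           # assumed peak period: 11:00-13:59


def engineer_features(call_log):
    """Add is_callback and is_peak columns to each call record."""
    last_call_by_caller = {}
    rows = []
    for caller, agent, start in sorted(call_log, key=lambda call: call[2]):
        previous = last_call_by_caller.get(caller)
        is_callback = previous is not None and start - previous <= CALLBACK_WINDOW
        rows.append({
            "caller": caller,
            "agent": agent,
            "start": start,
            "is_callback": is_callback,
            "is_peak": start.hour in PEAK_HOURS,
        })
        last_call_by_caller[caller] = start
    return rows


features = engineer_features(CALL_LOG)
callback_count = sum(row["is_callback"] for row in features)
peak_count = sum(row["is_peak"] for row in features)
print(f"callbacks: {callback_count}, peak-time calls: {peak_count}")
```

The window length and peak hours are exactly the kind of choices where a subject matter expert's input matters more than the code.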

Slide 24 (continued): Call center manager: Predictive analytics
• ... one day ahead?
• ... one week ahead?
• ... one month ahead?
Can we determine what types of calls to ...
• ... for a product we haven't had before?
• ... for a market we've never seen before?

Slide 25: Example Predictive Analytics Questions
Predicting Current Unknowns
• Online: Which ads are malicious?
• Security: Is the bank transaction fraudulent?
• IC: Which names map to the same person (entity resolution)?
Predicting Future Events
• Retail: What will be the new trend of merchandise that a company should stock?
• Security: Where will a hacker next attack our network?
• IC: Who will become the next insider threat?
Determining Future Actions
• Sales: How can a company increase sales revenues?
• Health: What actions can be taken to prevent the spread of flu?

Slide 25 (continued): Determining Future Actions
• IC: How will a vulnerability patch affect our knowledge/preparedness for future attacks?

Slide 26: Call center manager: Predictive analytics
Many techniques available, explored in next section

Slide 27: Call center manager: Review
Steps, from first to last:
• Get data
• Clean data
• EDA, Visualization
• Interpretation
• Prediction
• Action!

Slide 28
Because we know our data, we can [...]
• ... more intelligent questions
• ... action-oriented questions

Slide 28 (continued)
• Questions that can be answered

Slide 29
This slide intentionally left blank

Slide 30: The master carpenter
"The right tool for the job"

Slide 31: Feature Engineering, Part 2
• The fuel of data science is data
• Data preparation is critical
• Data quality [...] algorithm choice
• That will come [...]
"With the wrong wood, I can make nothing"

Slide 32: Types of Machine Learning Algorithms
Classification
• Naïve Bayes
• Logistic Regression
• Decision Trees
• K-Nearest Neighbors
• Support Vector Machines
Regression
• Linear Regression
• Support Vector Machines
Clustering
• K-Means Clustering
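To make the classification family on slide 32 concrete, here is a minimal K-Nearest Neighbors sketch in plain Python: a new point receives the majority label of its k closest training examples. The features (call minutes, number of transfers), the labels, and every data point are hypothetical, and a real project would more likely use a library implementation and scale the features first.

```python
import math
from collections import Counter

# Hypothetical labeled examples: (call minutes, number of transfers) -> outcome.
TRAINING = [
    ((2.0, 0), "resolved"),
    ((3.5, 0), "resolved"),
    ((4.0, 1), "resolved"),
    ((9.0, 2), "callback"),
    ((11.0, 3), "callback"),
    ((12.5, 2), "callback"),
]


def knn_predict(point, training, k=3):
    """Label a point by majority vote among its k nearest training examples."""
    by_distance = sorted(training, key=lambda example: math.dist(point, example[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


short_call = knn_predict((3.0, 1), TRAINING)
long_call = knn_predict((10.0, 2), TRAINING)
print(f"short call -> {short_call}")
print(f"long call  -> {long_call}")
```

Regression and clustering follow the same pattern of learning from examples, except that regression predicts a number and clustering groups points without any labels.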

Slide 33: Types of Machine Learning Algorithms
Applications: Everywhere
• Banking
• Weather
• Sports scores
• Economics
• Environmental science
• Cybersecurity

Slide 34: Linear Regression: Prediction
Problem: If I have examples of X and Y, when I learn a new X, can I predict Y?

Slide 35: Linear Regression: Prediction
Solution: Find the line that is closest to every point
Said differently: Find the line for which the SUM of all (squared) errors is smallest

Slide 36: Linear Regression: Prediction
• Three dimensions, same concept
• HUNDREDS of dimensions, same concept
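The "line closest to every point" idea can be sketched in a few lines of plain Python. The usual formalization, ordinary least squares, squares each error before summing so that positive and negative errors cannot cancel; the closed-form slope and intercept below minimize that sum. The X and Y values are hypothetical examples, not data from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical examples of X and Y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, slope, intercept)

new_x = 6.0
predicted_y = slope * new_x + intercept
print(f"y = {slope:.2f} * x + {intercept:.2f}")
print(f"predicted y for x = {new_x}: {predicted_y:.2f}")
print(f"sum of squared errors: {best_sse:.3f}")
```

With more than one input the same idea carries over: the line becomes a plane or hyperplane, and the fit is usually computed with a linear algebra library rather than by hand.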

