Example: air traffic controller

Mining of Massive Datasets - The Stanford University InfoLab

MiningofMassiveDatasetsJure LeskovecStanford RajaramanMilliway LabsJeffrey D. UllmanStanford 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec,and Jeffrey D. UllmaniiPrefaceThis book evolved from material developed over several years by Anand Raja-raman and Jeff Ullman for a one-quarter course at Stanford . The courseCS345A, titled Web Mining , was designed as an advanced graduate course,although it has become accessible and interesting to advanced Jure Leskovec joined the Stanford faculty, we reorganizedthe materialconsiderably. He introduced a new course CS224W on network analysis andadded material to CS345A, which was renumbered CS246.

Preface This book evolved from material developed over several years by Anand Raja-raman and Jeff Ullman for a one-quarter course at Stanford.

Tags:

  Book, Mining, Massive, Dataset, Stanford, Mining of massive datasets

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Transcription of Mining of Massive Datasets - The Stanford University InfoLab

1 MiningofMassiveDatasetsJure LeskovecStanford RajaramanMilliway LabsJeffrey D. UllmanStanford 2010, 2011, 2012, 2013, 2014 Anand Rajaraman, Jure Leskovec,and Jeffrey D. UllmaniiPrefaceThis book evolved from material developed over several years by Anand Raja-raman and Jeff Ullman for a one-quarter course at Stanford . The courseCS345A, titled Web Mining , was designed as an advanced graduate course,although it has become accessible and interesting to advanced Jure Leskovec joined the Stanford faculty, we reorganizedthe materialconsiderably. He introduced a new course CS224W on network analysis andadded material to CS345A, which was renumbered CS246.

2 The three authorsalso introduced a large-scale data- Mining project course, book nowcontains material taught in all three the book Is AboutAt the highest level of description, this book is about data Mining . However,it focuses on data Mining of very large amounts of data, that is, data so largeit does not fit in main memory. Because of the emphasis on size, many of ourexamples are about the Web or data derived from the Web. Further, the booktakes an algorithmic point of view: data Mining is about applying algorithmsto data, rather than using data to train a machine-learning engine of somesort. The principal topics covered are:1.

3 Distributed file systems and map-reduce as a tool for creating parallelalgorithms that succeed on very large amounts of Similarity search, including the key techniques of minhashing and locality-sensitive Data-stream processing and specialized algorithms for dealing with datathat arrives so fast it must be processed immediately or The technology of search engines, including Google s PageRank, link-spamdetection, and the hubs-and-authorities Frequent-itemset Mining , including association rules, market-baskets, theA-Priori Algorithm and its Algorithms for clustering very large, high-dimensional Two key problems for Web applications: managing advertising and rec-ommendation Algorithms for analyzing and Mining the structure of very large graphs,especially social-network Techniques for obtaining the important properties of a large dataset bydimensionality reduction, including singular-value decomposition and la-tent semantic Machine-learning algorithms that can be applied to very large data, suchas perceptrons, support-vector machines, and gradient appreciate fully the material in this book , we recommend the followingprerequisites:1.

4 An introduction to database systems, covering SQL and relatedprogram-ming A sophomore-level course in data structures, algorithms, A sophomore-level course in software systems, software engineering, andprogramming book contains extensive exercises, with some for almost everysection. Weindicate harder exercises or parts of exercises with an exclamationpoint. Thehardest exercises have a double exclamation on the WebGo slides, homework assignments, project require-ments, and exams from courses related to this Automated HomeworkThere are automated exercises based on this book , using the Gradiance root-question technology, available Students mayenter a public class by creating an account at that site and enteringthe classwith code1 EDD8A1D.

5 Instructors may use the site by making an account therePREFACE vand then emailingsupport at gradiance dot comwith their login name, thename of their school, and a request to use the MMDS art is by Scott would like to thank Foto Afrati, Arun Marathe, and Rok Sosic for criticalreadings of a draft of this were also reported by Rajiv Abraham, Ruslan Aduk, ApoorvAgar-wal, Aris Anagnostopoulos, Yokila Arora, Stefanie Anna Baby, Atilla SonerBalkir, Arnaud Belletoile, Robin Bennett, Susan Biancani, Richard Boyd, AmitabhChaudhary, Leland Chen, Hua Feng, Marcus Gemeinder, Anastasios Gounar-is, Clark Grubb, Shrey Gupta, Waleed Hameid, Saman Haratizadeh, JulienHoachuck, Przemyslaw Horban, Hsiu-Hsuan Huang, Jeff Hwang, Rafi Kamal,Lachlan Kang, Ed Knorr, Haewoon Kwak, Ellis Lau, Greg Lee, David ,Ethan Lozano, Yunan Luo, Michael Mahoney, Sergio Matos, JustinMeyer,Bryant Moscon, Brad Penoff, John Phillips, Philips Kokoh Prasetyo, Qi Ge,Harizo Rajaona, Timon Ruban, Rich Seiter, Hitesh Shetty, Angad Singh, SandeepSripada, Dennis Sidharta, Krzysztof Stencel, Mark Storus, Roshan Sumbaly,Zack Taylor, Tim Triche Jr.

6 , Wang Bin, Weng Zhen-Bin, Robert West, StevenEuijong Whang, Oscar Wu, Xie Ke, Christopher Yeh, Nicolas Zhao, andZhou Jingbo, The remaining errors are ours, of D. Alto, CAMarch, 2014viPREFACEC ontents1 Data What is Data Mining ? .. Statistical Modeling .. Machine Learning .. Computational Approaches to Modeling .. Summarization .. Feature Extraction .. Statistical Limits on Data Mining .. Total Information Awareness .. Bonferroni s Principle .. An Example of Bonferroni s Principle .. Exercises for Section .. Things Useful to Know .. Importance of Words in Documents.

7 Hash Functions .. Indexes .. Secondary Storage .. The Base of Natural Logarithms .. Power Laws .. Exercises for Section .. Outline of the book .. Summary of Chapter 1 .. References for Chapter 1 .. 182 MapReduce and the New Software Distributed File Systems .. Physical Organization of Compute Nodes .. Large-Scale File-System Organization .. MapReduce .. The Map Tasks .. Grouping by Key .. The Reduce Tasks .. Combiners .. Details of MapReduce Execution .. Coping With Node Failures .. Exercises for Section .. Algorithms Using MapReduce.

8 Matrix-Vector Multiplication by MapReduce .. If the Vector v Cannot Fit in Main Memory .. Relational-Algebra Operations .. Computing Selections by MapReduce .. Computing Projections by MapReduce .. Union, Intersection, and Difference by MapReduce .. Computing Natural Join by MapReduce .. Grouping and Aggregation by MapReduce .. Matrix Multiplication .. Matrix Multiplication with One MapReduce Step .. Exercises for Section .. Extensions to MapReduce .. Workflow Systems .. Recursive Extensions to MapReduce .. Pregel .. Exercises for Section .. The Communication Cost Model.

9 Communication-Cost for Task Networks .. Wall-Clock Time .. Multiway Joins .. Exercises for Section .. Complexity Theory for MapReduce .. Reducer Size and Replication Rate .. An Example: Similarity Joins .. A Graph Model for MapReduce Problems .. Mapping Schemas .. When Not All Inputs Are Present .. Lower Bounds on Replication Rate .. Case Study: Matrix Multiplication .. Exercises for Section .. Summary of Chapter 2 .. References for Chapter 2 .. 693 Finding Similar Applications of Near-Neighbor Search .. Jaccard Similarity of Sets .. Similarity of Documents.

10 Collaborative Filtering as a Similar-Sets Problem .. Exercises for Section .. Shingling of Documents .. Choosing the Shingle Size .. Hashing Shingles .. Shingles Built from Words .. Exercises for Section .. Similarity-Preserving Summaries of Sets .. Matrix Representation of Sets .. Minhashing .. Minhashing and Jaccard Similarity .. Minhash Signatures .. Computing Minhash Signatures .. Exercises for Section .. Locality-Sensitive Hashing for Documents .. LSH for Minhash Signatures .. Analysis of the Banding Technique .. Combining the Techniques .. Exercises for Section.