Mining of Massive Datasets - Stanford

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Big data is transforming the world. Here you will learn data mining and
machine learning techniques to process large datasets and extract
valuable knowledge from them.

The book
The book is based on the Stanford Computer Science course CS246: Mining
Massive Datasets (and CS345A: Data Mining).

The book, like the course, is designed at the undergraduate computer
science level with no formal prerequisites. To support deeper
explorations, most of the chapters are supplemented with further
reading references.

The Mining of Massive Datasets book has been published by Cambridge
University Press. You can get a 20% discount by applying the code
MMDS20 at checkout.

By agreement with the publisher, you can download the book for free
from this page. Cambridge University Press does, however, retain
copyright on the work, and we expect that you will obtain their
permission and acknowledge our authorship if you republish parts or all
of it.

We welcome your feedback on the manuscript.

The 3rd edition of the book
The following is the third edition of the book. It contains new material on
Spark, TensorFlow, minhashing, community-finding, simrank, graph
algorithms, and decision trees. There is a new chapter 13, covering deep
learning.

We also offer a set of lecture slides that we use for teaching the Stanford
CS246: Mining Massive Datasets course. Note that the slides do not
necessarily cover all the material covered in the corresponding
chapters.

Chapter     Title                                   Book  Slides                            Videos
            Preface and Table of Contents           PDF
Chapter 1   Data Mining                             PDF   PDF PPT
Chapter 2   Map-Reduce and the New Software Stack   PDF   PDF PPT                           1 2 3 4 5 6 7 8
Chapter 3   Finding Similar Items                   PDF   PDF PPT                           1 2 3 4 5 6 7 8 9 10 11 12 13
Chapter 4   Mining Data Streams                     PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5
Chapter 5   Link Analysis                           PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5 6 7 8 9 10 11 12 13 14
Chapter 6   Frequent Itemsets                       PDF   PDF PPT                           1 2 3 4
Chapter 7   Clustering                              PDF   PDF PPT                           1 2 3 4 5
Chapter 8   Advertising on the Web                  PDF   PDF PPT                           1 2 3 4
Chapter 9   Recommendation Systems                  PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5
Chapter 10  Mining Social-Network Graphs            PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5 6 7 8 9 10 11 12
Chapter 11  Dimensionality Reduction                PDF   PDF PPT                           1 2 3 4 5 6 7 8 9 10 11 12
Chapter 12  Large-Scale Machine Learning            PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5 6 7 8 9 10 11 12
Chapter 13  Neural Nets and Deep Learning           PDF
            Index                                   PDF
            Errata                                  HTML

Download the latest version of the book as a single big PDF file (603
pages, 3.6 MB).

The Errata for the third edition of the book: HTML.

Download slides (PPT) in French: Chapter 4, Chapter 5, Chapter 8,
Chapter 9, Chapter 10. Courtesy of Richard Khoury.

Note to the users of provided slides: We would be delighted if you found
our material useful in giving your own lectures. Feel free to use these
slides verbatim, or to modify them to fit your own needs. PowerPoint
originals are available. If you make use of a significant portion of these
slides in your own lecture, please include this message, or a link to our
web site: http://www.mmds.org/.

Comments and corrections are most welcome. Please let us know if you
are using these materials in your course and we will list and link to your
course.

Stanford big data courses

CS246

CS246: Mining Massive Datasets is a graduate-level course that discusses
data mining and machine learning algorithms for analyzing very large
amounts of data. The emphasis is on MapReduce as a tool for creating
parallel algorithms that can process very large amounts of data.
CS341

CS341: Project in Mining Massive Data Sets is an advanced project-based
course. Students work on data mining and machine learning algorithms
for analyzing very large amounts of data. Both interesting big datasets
and computational infrastructure (a large MapReduce cluster) are
provided by course staff. Generally, students first take CS246, followed by
CS341.

CS341 is generously supported by Amazon, which gives us access to its
EC2 platform.

CS224W

CS224W: Social and Information Networks is a graduate-level course that
covers recent research on the structure and analysis of large social
and information networks and on models and algorithms that abstract
their basic properties. The class explores how to practically analyze
large-scale network data and how to reason about it through models for
network structure and evolution.

You can take Stanford courses!

If you are not a Stanford student, you can still take CS246 as well as
CS224W or earn a Stanford Mining Massive Datasets graduate certificate
by completing a sequence of four Stanford Computer Science courses. A
graduate certificate is a great way to keep the skills and knowledge in
your field current. More information is available at the Stanford Center
for Professional Development (SCPD).

Supporting materials
If you are an instructor interested in using the Gradiance Automated
Homework System with this book, start by creating an account for
yourself here. Then, email your chosen login and the request to become
an instructor for the MMDS book to support@gradiance.com. You will
then be able to create a class using these materials. Manuals explaining
the use of the system are available here.

Students who want to use the Gradiance Automated Homework System
for self-study can register here. Then, use the class token 1EDD8A1D to
join the "omnibus class" for the MMDS book. See The Student Guide for
more information.

Previous versions of the book

The 2nd edition of the book (v2.1)


The following is the second edition of the book. There are three new
chapters, on mining large graphs, dimensionality reduction, and
machine learning. There is also a revised Chapter 2 that treats
map-reduce programming in a manner closer to how it is used in practice.

Together with each chapter there is also a set of lecture slides that we use
for teaching the Stanford CS246: Mining Massive Datasets course. Note that
the slides do not necessarily cover all the material covered in the
corresponding chapters.

Chapter     Title                                   Book  Slides                            Videos
            Preface and Table of Contents           PDF
Chapter 1   Data Mining                             PDF   PDF PPT
Chapter 2   Map-Reduce and the New Software Stack   PDF   PDF PPT                           1 2 3 4 5 6 7 8
Chapter 3   Finding Similar Items                   PDF   PDF PPT                           1 2 3 4 5 6 7 8 9 10 11 12 13
Chapter 4   Mining Data Streams                     PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5
Chapter 5   Link Analysis                           PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5 6 7 8 9 10 11 12 13 14
Chapter 6   Frequent Itemsets                       PDF   PDF PPT                           1 2 3 4
Chapter 7   Clustering                              PDF   PDF PPT                           1 2 3 4 5
Chapter 8   Advertising on the Web                  PDF   PDF PPT                           1 2 3 4
Chapter 9   Recommendation Systems                  PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5
Chapter 10  Mining Social-Network Graphs            PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5 6 7 8 9 10 11 12
Chapter 11  Dimensionality Reduction                PDF   PDF PPT                           1 2 3 4 5 6 7 8 9 10 11 12
Chapter 12  Large-Scale Machine Learning            PDF   Part 1: PDF PPT; Part 2: PDF PPT  1 2 3 4 5 6 7 8 9 10 11 12
            Index                                   PDF
            Errata                                  HTML

Download the latest version of the book as a single big PDF file (511
pages, 3 MB).

Download the full version of the book with a hyper-linked table of
contents that makes it easy to jump around: PDF file (513 pages, 3.69 MB).

The Errata for the second edition of the book: HTML.

Download slides (PPT) in French: Chapter 4, Chapter 5, Chapter 8,
Chapter 9, Chapter 10. Courtesy of Richard Khoury.

Note to the users of provided slides: We would be delighted if you found
our material useful in giving your own lectures. Feel free to use these
slides verbatim, or to modify them to fit your own needs. PowerPoint
originals are available. If you make use of a significant portion of these
slides in your own lecture, please include this message, or a link to our
web site: http://www.mmds.org/.
Version 1.0
The following materials are equivalent to the published book, with errata
corrected to July 4, 2012.

Chapter Title Book


Preface and Table of Contents PDF
Chapter 1 Data Mining PDF
Chapter 2 Large-Scale File Systems and Map-Reduce PDF
Chapter 3 Finding Similar Items PDF
Chapter 4 Mining Data Streams PDF
Chapter 5 Link Analysis PDF
Chapter 6 Frequent Itemsets PDF
Chapter 7 Clustering PDF
Chapter 8 Advertising on the Web PDF
Chapter 9 Recommendation Systems PDF
Index PDF
Errata HTML

Download the book as published here (340 pages, 2 MB).
