Collective Intelligence in Action
By Satnam Alag
About this ebook
In the Web 2.0 era, leveraging the collective power of user contributions, interactions, and feedback is the key to market dominance. A new category of powerful programming techniques lets you discover the patterns, inter-relationships, and individual profiles (the collective intelligence) locked in the data people leave behind as they surf websites, post blogs, and interact with other users.
Collective Intelligence in Action is a hands-on guidebook for implementing collective intelligence concepts using Java. It is the first Java-based book to emphasize the underlying algorithms and technical implementation of vital data gathering and mining techniques like analyzing trends, discovering relationships, and making predictions. It provides a pragmatic approach to personalization by combining content-based analysis with collaborative approaches.
This book is for Java developers implementing Collective Intelligence in real, high-use applications. Following a running example in which you harvest and use information from blogs, you learn to develop software that you can embed in your own applications. The code examples are immediately reusable and give the Java developer a working collective intelligence toolkit.
Along the way, you work with a number of APIs and open-source toolkits, including text analysis and search using Lucene, web crawling using Nutch, and applying machine learning algorithms using WEKA and the Java Data Mining (JDM) standard.
Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.
Satnam Alag
Satnam Alag, PhD, is currently the Vice President of Engineering at NextBio, a vertical search engine and a Web 2.0 collaboration application for the life sciences community. He is a seasoned software professional with over fifteen years of experience in machine learning and over a decade of experience in commercial software development and management. Dr. Alag worked as a consultant with Johnson & Johnson's BabyCenter, where he helped develop their personalization engine. Prior to that, he was the Chief Software Architect at Rearden Commerce, and he began his career at GE R&D. He is a Sun Certified Enterprise Architect (SCEA) for the Java Platform. Dr. Alag earned his PhD in engineering from UC Berkeley; his dissertation was in the area of probabilistic reasoning and machine learning. He has published numerous peer-reviewed articles.
Reviews for Collective Intelligence in Action
7 ratings, 1 review
- Rating: 5 out of 5 stars. I consider this book a must for those who also read or think about reading Segaran's 'Programming Collective Intelligence'. This book's language of choice is Java and it presents important aspects of industrial-strength open source tools such as Nutch, Lucene and Hadoop. The book is very practically oriented, so do not expect to find much theory about data mining or machine learning algorithms (the author is very brief on explanations, gives good examples, and provides enough literature to dive into relevant detailed texts, articles, books, etc.). However, if you want to jump into analysing your application's unstructured data along with user analysis, build intelligent and focused crawling and searching functionalities, and finally build a recommendation system, then this book is full of technical treasures. I really liked the final chapters on building recommendation systems, and I was also glad to learn about JDM, the Java Data Mining API. Another strong point of the book is its concern for scalability and its discussions about class and database design for implementing the systems.
Book preview
Collective Intelligence in Action - Satnam Alag
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department
Manning Publications Co.
Sound View Court 3B
Greenwich, CT 06830
Fax: (609) 877-8256
Email: orders@manning.com
©2009 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15% recycled and processed without the use of elemental chlorine.
Manning Publications Co.
Sound View Court 3B
Greenwich, CT 06830
Development Editor: Jeff Bleiel
Copyeditor: Benjamin Berg
Typesetter: Gordan Salinovic
Cover designer: Leslie Haimes
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 13 12 11 10 09 08
Dedication
To my dear sons, Ayush and Shray, and my beautiful, loving, and intelligent wife, Alpana
Brief Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this book
1. Gathering data for intelligence
Chapter 1. Understanding collective intelligence
Chapter 2. Learning from user interactions
Chapter 3. Extracting intelligence from tags
Chapter 4. Extracting intelligence from content
Chapter 5. Searching the blogosphere
Chapter 6. Intelligent web crawling
2. Deriving intelligence
Chapter 7. Data mining: process, toolkits, and standards
Chapter 8. Building a text analysis toolkit
Chapter 9. Discovering patterns with clustering
Chapter 10. Making predictions
3. Applying intelligence in your application
Chapter 11. Intelligent search
Chapter 12. Building a recommendation engine
Index
List of Figures
List of Tables
List of Listings
Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this book
1. Gathering data for intelligence
Chapter 1. Understanding collective intelligence
1.1. What is collective intelligence?
1.2. CI in web applications
1.2.1. Collective intelligence from the ground up: a sample application
1.2.2. Benefits of collective intelligence
1.2.3. CI is the core component of Web 2.0
1.2.4. Harnessing CI to transform from content-centric to user-centric applications
1.3. Classifying intelligence
1.3.1. Explicit intelligence
1.3.2. Implicit intelligence
1.3.3. Derived intelligence
1.4. Summary
1.5. Resources
Chapter 2. Learning from user interactions
2.1. Architecture for applying intelligence
2.1.1. Synchronous and asynchronous services
2.1.2. Real-time learning in an event-driven system
2.1.3. Polling services for non–event-driven systems
2.1.4. Advantages and disadvantages of event-based and non–event-based architectures
2.2. Basics of algorithms for applying CI
2.2.1. Users and items
2.2.2. Representing user information
2.2.3. Content-based analysis and collaborative filtering
2.2.4. Representing intelligence from unstructured text
2.2.5. Computing similarities
2.2.6. Types of datasets
2.3. Forms of user interaction
2.3.1. Rating and voting
2.3.2. Emailing or forwarding a link
2.3.3. Bookmarking and saving
2.3.4. Purchasing items
2.3.5. Click-stream
2.3.6. Reviews
2.4. Converting user interaction into collective intelligence
2.4.1. Intelligence from ratings via an example
2.4.2. Intelligence from bookmarking, saving, purchasing items, forwarding, click-stream, and reviews
2.5. Summary
2.6. Resources
Chapter 3. Extracting intelligence from tags
3.1. Introduction to tagging
3.1.1. Tag-related metadata for users and items
3.1.2. Professionally generated tags
3.1.3. User-generated tags
3.1.4. Machine-generated tags
3.1.5. Tips on tagging
3.1.6. Why do users tag?
3.2. How to leverage tags
3.2.1. Building dynamic navigation
3.2.2. Innovative uses of tag clouds
3.2.3. Targeted search
3.2.4. Folksonomies and building a dictionary
3.3. Extracting intelligence from user tagging: an example
3.3.1. Items related to other items
3.3.2. Items of interest for a user
3.3.3. Relevant users for an item
3.4. Scalable persistence architecture for tagging
3.4.1. Reviewing other approaches
3.4.2. Recommended persistence architecture
3.5. Building tag clouds
3.5.1. Persistence design for tag clouds
3.5.2. Algorithm for building a tag cloud
3.5.3. Implementing a tag cloud
3.5.4. Visualizing a tag cloud
3.6. Finding similar tags
3.7. Summary
3.8. Resources
Chapter 4. Extracting intelligence from content
4.1. Content types and integration
4.1.1. Classifying content
4.1.2. Architecture for integrating content
4.2. The main CI-related content types
4.2.1. Blogs
4.2.2. Wikis
4.2.3. Groups and message boards
4.3. Extracting intelligence step by step
4.3.1. Setting up the example
4.3.2. Naïve analysis
4.3.3. Removing common words
4.3.4. Stemming
4.3.5. Detecting phrases
4.4. Simple and composite content types
4.5. Summary
4.6. Resources
Chapter 5. Searching the blogosphere
5.1. Introducing the blogosphere
5.1.1. Leveraging the blogosphere
5.1.2. RSS: the publishing format
5.1.3. Blog-tracking companies
5.2. Building a framework to search the blogosphere
5.2.1. The searcher
5.2.2. The search parameters
5.2.3. The query results
5.2.4. Handling the XML response
5.2.5. Exception handling
5.3. Implementing the base classes
5.3.1. Implementing the search parameters
5.3.2. Implementing the result objects
5.3.3. Implementing the searcher
5.3.4. Parsing XML response
5.3.5. Extending the framework
5.4. Integrating Technorati
5.4.1. Technorati search API overview
5.4.2. Implementing classes for integrating Technorati
5.5. Integrating Bloglines
5.5.1. Bloglines search API overview
5.5.2. Implementing classes for integrating Bloglines
5.6. Integrating providers using RSS
5.6.1. Generalizing the query parameters
5.6.2. Generalizing the blog searcher
5.6.3. Building the RSS 2.0 XML parser
5.7. Summary
5.8. Resources
Chapter 6. Intelligent web crawling
6.1. Introducing web crawling
6.1.1. Why crawl the Web?
6.1.2. The crawling process
6.1.3. Intelligent crawling and focused crawling
6.1.4. Deep crawling
6.1.5. Available crawlers
6.2. Building an intelligent crawler step by step
6.2.1. Implementing the core algorithm
6.2.2. Being polite: following the robots.txt file
6.2.3. Retrieving the content
6.2.4. Extracting URLs
6.2.5. Making the crawler intelligent
6.2.6. Running the crawler
6.2.7. Extending the crawler
6.3. Scalable crawling with Nutch
6.3.1. Setting up Nutch
6.3.2. Running the Nutch crawler
6.3.3. Searching with Nutch
6.3.4. Apache Hadoop, MapReduce, and Dryad
6.4. Summary
6.5. Resources
2. Deriving intelligence
Chapter 7. Data mining: process, toolkits, and standards
7.1. Core concepts of data mining
7.1.1. Attributes
7.1.2. Supervised and unsupervised learning
7.1.3. Key learning algorithms
7.1.4. The mining process
7.2. Using an open source data mining framework: WEKA
7.2.1. Using the WEKA application: a step-by-step tutorial
7.2.2. Understanding the WEKA APIs
7.2.3. Using the WEKA APIs via an example
7.3. Standard data mining API: Java Data Mining (JDM)
7.3.1. JDM architecture
7.3.2. Key JDM objects
7.3.3. Representing the dataset
7.3.4. Learning models
7.3.5. Algorithm settings
7.3.6. JDM tasks
7.3.7. JDM connection
7.3.8. Sample code for accessing DME
7.3.9. JDM models and PMML
7.4. Summary
7.5. Resources
Chapter 8. Building a text analysis toolkit
8.1. Building the text analyzers
8.1.1. Leveraging Lucene
8.1.2. Writing a stemmer analyzer
8.1.3. Writing a TokenFilter to inject synonyms and detect phrases
8.1.4. Writing an analyzer to inject synonyms and detect phrases
8.1.5. Putting our analyzers to work
8.2. Building the text analysis infrastructure
8.2.1. Building the tag infrastructure
8.2.2. Building the term vector infrastructure
8.2.3. Building the Text Analyzer class
8.2.4. Applying the text analysis infrastructure
8.3. Use cases for applying the framework
8.4. Summary
8.5. Resources
Chapter 9. Discovering patterns with clustering
9.1. Clustering blog entries
9.1.1. Defining the text clustering infrastructure
9.1.2. Retrieving blog entries from Technorati
9.1.3. Implementing the k-means algorithms for text processing
9.1.4. Implementing hierarchical clustering algorithms for text processing
9.1.5. Expectation maximization and other examples of clustering high-dimension sparse data
9.2. Leveraging WEKA for clustering
9.2.1. Creating the learning dataset
9.2.2. Creating the clusterer
9.2.3. Evaluating the clustering results
9.3. Clustering using the JDM APIs
9.3.1. Key JDM clustering-related classes
9.3.2. Clustering settings using the JDM APIs
9.3.3. Creating the clustering task using the JDM APIs
9.3.4. Executing the clustering task using the JDM APIs
9.3.5. Retrieving the clustering model using the JDM APIs
9.4. Summary
9.5. Resources
Chapter 10. Making predictions
10.1. Classification fundamentals
10.1.1. Learning decision trees by example
10.1.2. Naïve Bayes’ classifier
10.1.3. Belief networks
10.2. Classifying blog entries using WEKA APIs
10.2.1. Building the dataset for classifying blog entries
10.2.2. Building the classifier class
10.3. Regression fundamentals
10.3.1. Linear regression
10.3.2. Multi-layer perceptron (MLP)
10.3.3. Radial basis functions (RBF)
10.4. Regression using WEKA
10.5. Classification and regression using JDM
10.5.1. Key JDM supervised learning–related classes
10.5.2. Supervised learning settings using the JDM APIs
10.5.3. Creating the classification task using the JDM APIs
10.5.4. Executing the classification task using the JDM APIs
10.5.5. Retrieving the classification model using the JDM APIs
10.5.6. Retrieving the classification model using the JDM APIs
10.6. Summary
10.7. Resources
3. Applying intelligence in your application
Chapter 11. Intelligent search
11.1. Search fundamentals
11.1.1. Search architecture
11.1.2. Core Lucene classes
11.1.3. Basic indexing and searching via example
11.2. Indexing with Lucene
11.2.1. Understanding the index format
11.2.2. Modifying the index
11.2.3. Incremental indexing
11.2.4. Accessing the term frequency vector
11.2.5. Optimizing indexing performance
11.3. Searching with Lucene
11.3.1. Understanding Lucene scoring
11.3.2. Querying Lucene
11.3.3. Sorting search results
11.3.4. Querying on multiple fields
11.3.5. Filtering
11.3.6. Searching multiple indexes
11.3.7. Using a HitCollector
11.3.8. Optimizing search performance
11.4. Useful tools and frameworks
11.4.1. Luke
11.4.2. Solr
11.4.3. Compass
11.4.4. Hibernate search
11.5. Approaches to intelligent search
11.5.1. Augmenting search with classifiers and predictors
11.5.2. Clustering search results
11.5.3. Personalizing results for the user
11.5.4. Community-based search
11.5.5. Linguistic-based search
11.5.6. Data search
11.6. Summary
11.7. Resources
Chapter 12. Building a recommendation engine
12.1. Recommendation engine fundamentals
12.1.1. Introducing the recommendation engine
12.1.2. Item-based and user-based analysis
12.1.3. Computing similarity using content-based and collaborative techniques
12.1.4. Comparison of content-based and collaborative techniques
12.2. Content-based analysis
12.2.1. Finding similar items using a search engine (Lucene)
12.2.2. Building a content-based recommendation engine
12.2.3. Related items for document clusters
12.2.4. Personalizing content for a user
12.3. Collaborative filtering
12.3.1. k-nearest neighbor
12.3.2. Packages for implementing collaborative filtering
12.3.3. Dimensionality reduction with latent semantic indexing
12.3.4. Implementing dimensionality reduction
12.3.5. Probabilistic model–based approach
12.4. Real-world solutions
12.4.1. Amazon item-to-item recommendation
12.4.2. Google News personalization
12.4.3. Netflix and the BellKor Solution for the Netflix Prize
12.5. Summary
12.6. Resources
Index
List of Figures
List of Tables
List of Listings
Foreword
When I founded ReadWriteWeb[¹] back in April 2003, a tech news and analysis blog that is now one of the world’s top 10 blogs,[²] my goal was to explore the current era of the web. The year 2003 was a time when the effects of the dot-com meltdown were still being felt, yet there was something new stirring on the web, too. I christened my new blog Read/Write Web (the slash and space have since been dropped) because this new era of the web seemed to embody the notion that Tim Berners-Lee had when he invented the web—that it ought to be editable by anyone and that everyone contributes in some way to the web’s data.
¹http://www.readwriteweb.com/
² According to Technorati http://www.technorati.com/pop/blogs/
As Satnam Alag writes in this book, collective intelligence as a research field actually predates the web. But it was after the dot-com era had ended that we began to see evidence of collective intelligence applied to the web. In 2003 we regularly saw it in sites like Amazon, with its user reviews and recommendations, eBay with its user-driven auctions, Wikipedia with its editable encyclopedia, and Google with its mysterious PageRank algorithm for ranking the popularity of web pages.
Sometime in 2004, O’Reilly & Associates coined the term Web 2.0, which eventually gained mainstream acceptance as the term for this era of the web (just as dot-com described the previous one). A central part of the new definition was the notion of harnessing collective intelligence, in which user contributions could be valuable in aggregate if mined and utilized in some way in your web site or application.
For all the popularity of Web 2.0, it remains difficult to implement many of its principles. This is where this book comes in, because it applies mathematical formulas and examples to the notion of collective intelligence (from now on simply known as CI). After explaining how to gather data and extract intelligence on the web, in part 2 of the book Satnam instructs you on specific CI techniques such as data mining, text analysis, clustering, and predictive technology.
And, pssst, do you want to know how to build a recommendation engine? This is an area of web technology that we at ReadWriteWeb have been covering with great interest in 2008. Recommendation engines, as Satnam notes, aim to show items of interest to a user. But in our reviews of the current wave of recommendation engines, we have seen that it’s hard—very hard—to get recommendations right. Satnam shows how the leading practitioners, such as Amazon, Google News, and Netflix, build their recommendation engines. He also explains the different approaches you can take, with examples that developers can use and deploy in their own applications.
The Read/Write Web, or Web 2.0, or the Social Web, whatever you want to call it, relies on and builds value from user participation. If you’re a web developer, you’ll want to know how to use CI techniques to ensure that your web application can extract valuable data from its usage—and most importantly deliver that value right back to the users, where it belongs. This book goes a long way towards explaining how to do this.
RICHARD MACMANUS
FOUNDER/EDITOR, READWRITEWEB
Preface
"What is the virality coefficient for your application?"
This is an increasingly common question being asked of young companies as they try to raise money from venture capitalists. New products are being designed that inherently take advantage of virality within the product. Companies such as YouTube, Facebook, Ning, LinkedIn, Skype, and more have grown from zero to millions of users by leveraging the power of virality. With little or no marketing, these types of companies rely on the wisdom of crowds to spread exponentially from one user to two users, then four, then eight, and so on. A simple link in an email, which worked for Hotmail to grow its user base, may no longer be adequate for your application. Facebook and LinkedIn enable users to build their networks by sending an invitation to others to connect as friends or connections; other applications such as Skype and Jaxtr provide free services as long as you’re connecting to someone who’s already a member, thus encouraging users to register.
It wasn’t long ago when things were different. I still remember a few years back when I would ignore requests from others to connect on sites such as LinkedIn. Over a period of time, after repeatedly getting requests to connect from friends and acquaintances, I finally reached a tipping point and joined the network. The critical mass of users on the application, in addition to word-of-mouth recommendations, was good enough for me to see enough value to joining the network. Others had collectively convinced me to change my ways and join the application—this is one aspect of how collective intelligence is born and can manifest itself in your application.
Over the last few years, there’s been a quiet revolution in the way users interact. Time magazine even declared “you,” as in the collective set of users on the web, the Person of the Year for 2006. Users are no longer shy about expressing themselves. This expression may be as simple as forwarding an interesting article to a friend, rating an item, or generating new content, commonly known as user-generated content (UGC). To harness this user revolution, a new breed of applications, commonly known as user-centric applications, is being developed. Putting the user at the center of the application, leveraging social networks, and UGC are the new paradigms, and a high degree of personalization is now becoming the norm.
It’s been almost two years since I first contacted Manning with the idea of writing a book on collective intelligence. Ever since my graduate school days, I’ve been fascinated by how you can discover interesting information by analyzing data. Over the years, I was able to ground a lot of theory in the practical world, especially in the context of large-scale web applications. One thing I knew was that there wasn’t a practical book that could guide a developer through the various aspects of applying intelligence in an application. I could see a typical developer’s eyes roll when delving into the inner workings of an algorithm or applying some of the collective intelligence features. There’s immense value that an application can create by leveraging user-interaction data. As more and more companies joined the Web 2.0 parade, I wanted to write a book that would guide developers to understanding and implementing collective intelligence–related features in their applications.
It took longer to write this book than I had hoped. Most of the book was written while I was working full-time in demanding jobs. But the experience obtained by implementing these concepts in the real world provided good insight into what would be useful to others.
Remember, applications that make use of every user interaction to improve the value of the application for the user and other potential future users, and harness the power of virality, will dominate their markets. This book provides a set of tools that you’ll need to leverage the information provided by the users on your site. Whatever forms of information may be available to you, this book will guide you in harnessing the potential of your information to personalize the site for your users. Focus on the user, and you shall succeed. For collective intelligence begins with a crowd of one.
Acknowledgments
In the late seventeenth century, Sir Isaac Newton said, “If I have seen further, it is by standing on the shoulders of giants.” Similarly, if I’ve been able to finish this book, it’s with the help of a great number of people.
First, this book wouldn’t have been possible without Associate Publisher Michael Stephens. Mike’s passion and belief in the topic kept the book going. He’s an excellent mentor and guides you through good times and bad. Just like Mike, my brain now converts all text into lists of lists. It was a real privilege to work with my development editor, Jeff Bleiel. Jeff spent countless hours providing feedback, digging deeper into why things were written in a certain way, and improving the flow of the text. Thanks to Marjan Bace, Manning’s publisher, for helping fine-tune the table of contents, and for his guiding principle of keeping the book focused on new content. Special thanks to Karen Tegtmeyer for setting up and coordinating the peer reviews. And to the production team of Benjamin Berg, Katie Tennant, and Gordan Salinovic for turning my manuscript into the book that you are now holding. They spent countless hours checking and rechecking the manuscript. If you’re thinking of writing a book, you won’t find a better team than the one at Manning!
I’d like to thank all of the reviewers of my manuscript, many of whom spent large amounts of their free time on this task, for sending their excellent comments, suggestions, and criticisms. Some of the reviewers wished to remain anonymous...but here are a few I would like to acknowledge by name: Jérôme Bernard, Ryan Cox, Dave Crane, Roozbeh Daneshvar, Steve Gutz, Clint Howarth, Frank Jania, Gordon Jones, Murali Krishnan, Darren Neimke, Sumit Pal, Muhammad Saleem, Robi Sen, Sopan Shewale, Srikanth Sundararajana, and John Tyler.
Special thanks to Shiva Paranandi, for his help in reviewing the text and the code, and for his technical proofread; Brendan Murray, for his technical proofread of the first half of the book; Sean Handel, for his detailed review of and suggestions on the first four chapters; Gautam Aggarwal, for his insightful comments; Krishna Mayuram, for his review of the third chapter; Mark Hornick, specification lead of JDM, for his suggestions on JDM-related chapters; Mayur Datar of Google, for reviewing the text for the Google News Personalization section in chapter 12; Mark Hall, Lead for Pentaho’s data mining solutions (WEKA), for his comments on WEKA-related content; Shi Hui Liu, Murtaza Sonaseth, Kevin Xiao, Hector Villarreal, and the rest of the NextBio team, for their suggestions; Shahram Seyedin-Noor of NextBio, for his comments on the early chapters, encouragement, and his passionate philosophy on virality; and Ken DeLong and Mike McEvoy of BabyCenter, for their review and suggestions to improve the manuscript.
Special thanks to the awesome team at NextBio, especially the management team: Saeid Akhtari, Shahram Seyedin-Noor, Ilya Kupershmidt, and Mostafa Ronaghi, who introduced me to the field of data search and life sciences. We have a fantastic opportunity in intelligent search and user-centric applications; let’s make it happen!
This book wouldn’t have been possible without the support of a number of people whom I have worked for, including Patrick Grady, the charismatic CEO of Rearden Commerce; Michael McEvoy, CEO of QuickTrac Software; K.J., CEO 123signup.com, whom I thank for his mentorship; and Gordon Jones, SVP at TechWorks.
And finally, thanks to Richard MacManus, founder and editor of ReadWriteWeb, for taking the time to read the manuscript and write the foreword to the book.
This book took longer to finish than I had hoped, while I was working full-time. Consequently, it amounted to working all the time, even when we were on vacation. This book wouldn’t have been possible without the active support of my wife, Alpana, and sons, and also the active encouragement and support provided by our extended families. On Alpana’s side, dad diligently proofread and cheered raw early drafts; mom tried to free up my time; Rohini and Amit Verma provided constant encouragement. On my side, my mom helped in every way she could and kept me going, while my two adoring sisters, Nina and Amrita, made me feel as if I were the best writer in the world. Special thanks to Rajeev, Ankit, and Anish Suri for their encouragement.
Needless to say, this book was a nonstarter without the inspiration and support provided by Alpana, Ayush, and Shray. “Dad, how many chapters did you finish last night?” kept me going, as I didn’t want to see the disappointment in my sons’ eyes. Thank you, Alpana, for supporting me through this venture; it wouldn’t have been possible without your sacrifices. I look forward to some quality time with the family, soon.
About this book
Collective Intelligence in Action is a practical book for applying collective intelligence to real-world web applications. I cover a broad spectrum of topics, from simple illustrative examples that explain the concepts and the math behind them, to the ideal architecture for developing a feature, to the database schema, to code implementation and use of open source toolkits. Regardless of your background and nature of development, I’m sure you’ll find the examples and code samples useful. You should be able to directly use the code developed in this book. This is a practical book and I present a holistic view on what’s required to apply these techniques in the real world. Consequently, the book discusses the architectures for implementing intelligence—you’ll find lots of diagrams, especially UML diagrams, and a number of screenshots from well-known sites, in addition to code listings and even database schema designs.
There are a plethora of examples. Typically, concepts and the underlying math for algorithms are explained via examples with detailed step-by-step analysis. Accompanying the examples is Java code that demonstrates the concepts by implementing them, or by using open source frameworks.
A lot of work has been done by the open source community in Java in the areas of text processing and search (Lucene), data mining (WEKA), web crawling (Nutch), and data mining standards (JDM). This book leverages these frameworks, presenting examples and developing code that you can directly use in your Java application.
The first few chapters don’t assume knowledge of Java. You should be able to follow the concepts and the underlying math using the illustrative examples. For the later chapters, a basic understanding of Java will be helpful. The book uses a number of diagrams and screenshots to illustrate the concepts. The Resources section of each chapter contains links to other useful content.
Roadmap
Chapter 1 provides a basic introduction to the field of collective intelligence (CI). CI is an active area of research, and I’ve kept the focus on applying CI to web applications. Section 1.2.1 is a personal favorite of mine; it provides a roadmap through a hypothetical example of how you can apply CI to your application. This is a must-read, since it helps to translate CI into features in your application and puts the flow of the book in perspective. Chapter 1 should also provide you with a good overview of the three forms of intelligence: direct, indirect, and derived.
The book is divided into three parts. Part 1 deals with collecting data, both within and outside the application, to be translated into intelligence later. Chapters 2 through 4 deal with gathering information from within one’s application, while chapters 5 and 6 focus on gathering information from outside of one’s application.
Chapter 2 provides an overview of the architecture required to embed CI in your application, along with a quick overview of some of the basic concepts that are needed to apply CI. Please take some time to go through section 2.2 in detail, as a firm understanding of the concepts presented in this section will be useful throughout the book. This chapter also shows how intelligence can be derived by analyzing the actions of the user. It’s worthwhile to go through the example in section 2.4 in detail, as understanding the concepts presented there will also be useful throughout the book.
Chapter 3 continues with the theme of collecting data, this time from the user action of tagging. It provides an overview of the three forms of tags and how tagging can be leveraged. In section 3.3, we work through an example to show how tagging data can be converted into intelligence. This chapter also provides an overview of the ideal persistence architecture required to leverage tagging, and illustrates how to develop tag clouds.
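To give a flavor of the tag cloud idea mentioned above, here is a minimal sketch that maps tag frequencies onto font-size buckets by linear scaling. It isn't the implementation developed in chapter 3; the class name TagCloudSketch, the method names, and the sample counts are assumptions made purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: names and sample data are assumptions,
// not taken from the book's code.
class TagCloudSketch {

    // Maps a tag's count onto a bucket from 1 (smallest font) to numBuckets
    // (largest font) by scaling linearly between the min and max counts.
    static int fontBucket(int count, int minCount, int maxCount, int numBuckets) {
        if (maxCount == minCount) {
            return 1; // every tag is equally frequent
        }
        double scaled = (double) (count - minCount) / (maxCount - minCount);
        int bucket = 1 + (int) Math.floor(scaled * numBuckets);
        return Math.min(bucket, numBuckets); // the most frequent tag lands in the top bucket
    }

    // Builds an alphabetically ordered cloud of tag -> font bucket.
    static Map<String, Integer> buildCloud(Map<String, Integer> tagCounts, int numBuckets) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int c : tagCounts.values()) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        Map<String, Integer> cloud = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            cloud.put(e.getKey(), fontBucket(e.getValue(), min, max, numBuckets));
        }
        return cloud;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("java", 10);
        counts.put("lucene", 5);
        counts.put("weka", 1);
        System.out.println(buildCloud(counts, 3)); // prints {java=3, lucene=2, weka=1}
    }
}
```

Linear scaling is only one possible choice; when a few tags dominate, scaling the logarithm of the counts may spread the remaining tags more evenly across the buckets.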
Chapter 4 is focused on the different kinds of content that may be available in your application and how they can be used to derive intelligence. The chapter begins with providing an overview of the different architectures to embed content in your application. I also briefly discuss content that’s typically associated with CI: blogs, wikis, and message boards. Next, we work through a step-by-step example of how intelligence can be extracted from unstructured text. This is a must-read section for those who want to understand text analytics.
The next two chapters are focused on collecting data from outside of one’s application—first by searching the blogosphere and then by crawling the web.
Chapter 5 deals with building a framework to harvest information from the blogosphere. It begins with developing a generalized framework to retrieve blog entries. Next, it extends the framework to query blog-tracking providers such as Technorati, Blogdigger, Bloglines, and MSN.
Chapter 6 is focused on retrieving information from the web using web crawling. It introduces intelligent web crawling or focused crawling, along with a short discussion on dealing with hidden content. In this chapter, we first develop a simple web crawler. This exercise is useful to understand all the pieces that need to come together to build a web crawler and to understand the issues related to crawling the complete web. Next, for scalable crawling, we look at Nutch, an open source scalable web crawler.
Part 2 of the book is focused on deriving intelligence from the information collected. It consists of four chapters—an introduction to the data mining process, standards, and toolkits, and chapters on developing a text-analysis toolkit, finding patterns through clustering, and making predictions.
Chapter 7 provides an introduction to data mining: the process and the various kinds of algorithms. It introduces WEKA, the widely used open source data mining toolkit, along with the Java Data Mining (JDM) standard.
Chapter 8 develops a text analysis toolkit; this toolkit is used in the remainder of the book to convert unstructured text into a format that’s usable for the mining algorithms. Here we leverage Lucene for text processing and develop a custom analyzer to inject synonyms and detect phrases.
In chapter 9, we develop clustering algorithms, implementing both k-means and hierarchical clustering. We also look at how we can leverage WEKA and JDM for clustering. Building on the blog harvesting framework developed in chapter 5, we also illustrate how we can cluster blog entries.
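As a rough preview of the k-means idea, the following sketch alternates the two steps of the algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its points) until nothing changes. It is a deliberately simplified assumption: it clusters one-dimensional numbers from caller-supplied starting centroids, whereas chapter 9 works with blog entries. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration only: a 1-D simplification, not the book's implementation.
class KMeansSketch {

    // Runs k-means on 1-D points starting from the given centroids and
    // returns the final centroids.
    static double[] cluster(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        int k = centroids.length;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[k];
            int[] count = new int[k];
            // Assignment step: attach each point to its nearest centroid.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                        best = c;
                    }
                }
                sum[best] += p;
                count[best]++;
            }
            // Update step: move each centroid to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sum[c] / count[c];
                if (updated != centroids[c]) {
                    centroids[c] = updated;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no centroid moved
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] result = cluster(points, new double[] {1, 12}, 10);
        System.out.println(Arrays.toString(result)); // prints [2.0, 11.0]
    }
}
```

In practice the starting centroids are usually chosen at random, and the points are multi-dimensional vectors compared with a distance or similarity measure rather than a simple absolute difference.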
In chapter 10, we deal with algorithms related to making predictions. We begin with classification algorithms, such as decision trees, the Naïve Bayes classifier, and belief networks. The chapter then covers three regression algorithms for making predictions: linear regression, multi-layer perceptron, and radial basis function. It builds on the example of harvesting blog entries to illustrate how WEKA and JDM APIs can be leveraged for both classification and regression.
Part 3 consists of two chapters, which deal with applying intelligence within one’s application.
Chapter 11 deals with intelligent search. It shows how you can leverage Lucene, along with other useful toolkits and frameworks that leverage Lucene. It also covers six different approaches being taken in the area of intelligent search.
The last chapter, chapter 12, illustrates how to build a recommendation engine using both content-based and collaborative-based approaches. It also covers real-world case studies on how recommendation engines have been built at Amazon, Google News, and Netflix.
Code conventions and downloads
All source code in listings or in text is in a fixed-width font like this to separate it from ordinary text. Method and function names, object properties, XML elements, and attributes in text are presented using this same font. Code annotations accompany many of the listings, highlighting important concepts. In some cases, numbered bullets link to explanations that follow the listing.
Source code for all of the working examples in this book is available for download from www.manning.com/CollectiveIntelligenceinAction. Basic setup documentation is provided with the download.
Author Online
The purchase of Collective Intelligence in Action includes free access to a private web forum run by Manning Publications, where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/CollectiveIntelligenceinAction. This page provides information about how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct on the forum.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It isn’t a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The Author Online forum and the archives of previous discussions will be accessible from the publisher’s web site as long as the book is in print.
About the author
SATNAM ALAG, PH.D, is currently the vice president of engineering at NextBio (www.nextbio.com), a vertical search engine and a Web 2.0 user-centric application for the life sciences community. He’s a seasoned software professional with more than 15 years of experience in machine learning and over a decade of experience in commercial software development and management. Dr. Alag worked as a consultant with Johnson & Johnson’s BabyCenter, where he helped develop their personalization engine. Prior to that, he was the chief software architect at Rearden Commerce and began his career at GE R&D. He’s a Sun Certified Enterprise Architect (SCEA) for the Java Platform. Dr. Alag earned his Ph.D in engineering from UC Berkeley, and his dissertation was in the area of probabilistic reasoning and machine learning. He’s published a number of peer-reviewed articles.
About the title
By combining introductions, overviews, and how-to examples, the In Action books are designed to help learning and remembering. According to research in cognitive science, the things people remember are things they discover during self-motivated exploration.
Although no one at Manning is a cognitive scientist, we’re convinced that for learning to become permanent it must pass through stages of exploration, play, and, interestingly, retelling of what is being learned. People understand and remember new things, which is to say they master them, only after actively exploring them. Humans learn in action. An essential part of an In Action book is that it’s example-driven. It encourages the reader to try things out, to play with new code, and explore new ideas.
There is another, more mundane, reason for the title of this book: our readers are busy. They use books to do a job or solve a problem. They need books that allow them to jump in and jump out easily and learn just what they want just when they want it. They need books that aid them in action. The books in this series are designed for such readers.
About the cover illustration
The figure on the cover of Collective Intelligence in Action is captioned “Le Champenois,” a resident of the Champagne region in northeast France, best known for its sparkling white wine. The illustration is taken from a 19th century edition of Sylvain Maréchal’s four-volume compendium of regional dress customs published in France. Each illustration is finely drawn and colored by hand. The rich variety of Maréchal’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their station in life was just by their dress.
Dress codes have changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns or regions. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.
At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Maréchal’s pictures.
Part 1. Gathering data for intelligence
Chapter 1 begins the book with a brief overview of what collective intelligence is and how it manifests itself in your application. Then we move on to focus on how we can gather data from which we can derive intelligence. For this, we look at information both inside the application (chapters 2 through 4) and outside the application (chapters 5 and 6).
Chapter 2 deals with learning from the interactions of users. To get the ball rolling, we look at the architecture for embedding intelligence, and present some of the basic concepts related to collective intelligence (CI). We also cover how we can gather data from various forms of user interaction. We continue with this theme in chapter 3, which deals with tagging. This chapter contains all the information you need to build tagging-related features in your application. In chapter 4, we look at the various forms of content that are typically available in a web application and how to derive collective intelligence from it.
Next, we change our focus to collecting data from outside our application. We first deal with searching the blogosphere in chapter 5. This is followed by chapter 6, which deals with intelligently crawling the web in search of relevant content.