International Journal of Software and Informatics, ISSN 1673-7288 http://www.ijsi.org
© 2011 by ISCAS. All rights reserved. Tel: +86-10-62661040
Abstract Computer science is undergoing a fundamental change and is reshaping our un-
derstanding of the world. An important aspect of this change is the theory and applications
dealing with the gathering and analyzing of large real-world data sets. In this paper, we
introduce four research projects in which processing and interpreting large data sets is a cen-
tral focus. Innovative ways of analyzing such data sets allow us to extract useful information
that we would never have obtained from small or synthetic data sets, thus providing us with
new insights into the real world.
Key words: modern computer science; social networks; large data sets; high-dimensional
data
Hopcroft JE, Soundarajan S, Wang L. The future of computer science. Int J Software
Informatics, Vol.5, No.4 (2011): 549–565. http://www.ijsi.org/1673-7288/5/i110.htm
This research was partially supported by the U.S. Air Force Office of Scientific Research under Grant FA9550-09-1-0675.
Corresponding author: John E. Hopcroft, Email: jeh@cs.cornell.edu
Received 2010-12-05; Revised 2011-05-05; Accepted 2011-05-12.
1 Introduction
greater than can be processed by traditional means. Thus, future computer science
research and applications will be less concerned with how to make computers work
and more focused on the processing and analysis of such large amounts of data.
Consider the following example of internet search. At the beginning of the inter-
net era, users were required to know the IP address of the site to which they wished
to connect. No form of search was available. As websites proliferated, online search
services became necessary in order to make the internet navigable. The first internet
search tool was developed in 1993, and dozens more were created over the next several
years. Ask Jeeves, founded in 1996, relied partially on human editors who manually
selected the best websites to return for various search queries. Given the huge num-
ber of websites available today, such a strategy is clearly no longer feasible. Google,
founded in 1998, is a leader among today’s search engines. It relies on a search algo-
rithm that uses the structure of the internet to determine the most popular and thus,
perhaps, the most reputable websites.
However, while Google’s search engine was a major advance in search technology,
there will be more significant advances in the future. Consider a user who asks the
question, “When was Einstein born?” Instead of returning hundreds of webpages to
such a search, one might expect the answer “Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879”, along with pointers to the source from which the
answer was extracted. Other similar searches might be:
– Construct an annotated bibliography on graph theory
– Shafi Goldwasser, Silvio Micali and Charles Rackoff, “The knowledge complexity
of interactive proof systems”
– Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan and Mario Szegedy,
“Proof verification and the hardness of approximation problems”
Figure 1. The distribution of k = 13 clusters of NIPS papers[1] . The histograms of each cluster are
stacked on top of each other to show the influence of cluster popularity over time.
of large or high-dimensional data are completely different from those of small or low-
dimensional data. Heuristics and methods that were effective merely a decade ago
may already be outdated.
In this paper, we begin in Section 2 by giving an overview of several ongoing
projects dealing with the analysis of large data sets which represent the work currently
being done in a wide variety of research areas. In Section 3, we discuss some examples
of the theoretical foundation required to rigorously conduct research in these areas,
such as large graphs, high-dimensional data, sparse vectors, and so on. We conclude
in Section 4 with comments on the problems considered.
Figure 2. A sample friendship network. Vertices typically have a significant number of cut edges
the exact core from both weighted and unweighted graphs has been proved to be
NP-complete. Alternative heuristic algorithms have been developed, all of which are
capable of finding an approximate core, and their performance can be justified by the
experimental results based on various large-scale social graphs[5] . In this way, one
can obtain communities that are not only more densely connected than expected by
chance alone, but also well connected to the rest of the network.
Much of the early work on finding communities in social networks focused on par-
titioning the corresponding graph into disjoint subcomponents[3,4,6−11] . Algorithms
often required dense graphs and conductance was widely taken as the measure of the
quality of a community[2,3,12,13] . Given a graph G = (V, E), the conductance of a
subset of vertices S ⊆ V is defined as:
\[
\varphi(S) = \frac{\sum_{i \in S,\, j \notin S} A_{ij}}{\min\left\{ \sum_{i \in S,\, j \in V} A_{ij},\ \sum_{i \notin S,\, j \in V} A_{ij} \right\}},
\]
where {Aij |i, j ∈ V } are the entries of the adjacency matrix for the graph. A subset
was considered to be community-like when it had low conductance value. However, as
discussed earlier, an individual may belong to multiple communities at the same time
and will likely have more connections to individuals outside of his/her community
than inside. One approach to finding such well-defined overlapping communities is
that of Mishra et al.[14] where the concept of an (α, β)-community was introduced and
several algorithms were given for finding an (α, β)-community in a dense graph under
certain conditions. Given a graph G = (V, E) with self-loops added to all vertices, a
subset C ⊆ V is called an (α, β)-community when each vertex in C is connected to
at least β vertices of C (self-loop counted) and each vertex outside of C is connected
to at most α vertices of C (α < β).
Among the interesting questions being explored are why (α, β)-communities cor-
respond to well-defined clusters and why there is no sequence of intermediate (α, β)-
communities connecting one cluster to another. Other intriguing questions include
whether different types of social networks incorporate fundamentally different social
structures; what it is about the structure of real-world social networks that leads to
the structure of cores, as in the Twitter graph, and why some synthetic networks do
not display this structure.
By taking the intersection of a group of massively overlapping (α, β)-communities
obtained from repeated experiments, one can eliminate random factors and extract
the underlying structure. In social graphs, for large community size k, the (α, β)-
communities are well clustered into a small number of disjoint cores, and there are
no isolated (α, β)-communities scattered between these densely-clustered cores. The
number of cores decreases as k increases and becomes relatively small for large k. The
cores obtained for a smaller k either disappear or merge into the cores obtained for
a larger k. Moreover, the cores correspond to dense regions of the graph and there
are no bridges of intermediate (α, β)-communities connecting one core to another.
In contrast, the cores found in several random graph models usually have significant
overlap among them, and the number of cores does not necessarily decrease as k
increases. Extensive experiments demonstrate that the core structure displayed in
various large-scale social graphs is, indeed, due to the underlying social structure
of the networks, rather than due to high-degree vertices or to a particular degree
distribution[5,15] .
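The intersection step described above can be sketched as follows; this is only an illustration under our own assumptions (a greedy grouping rule, an arbitrary overlap threshold, and a list of vertex sets already produced by repeated (α, β)-community searches from random seeds), not the procedure used in Refs. [5, 15].

    def extract_cores(communities, overlap_threshold=0.8):
        # communities: list of vertex sets returned by repeated (alpha, beta)-community
        # searches from random seeds (the search routine itself is not shown here).
        cores = []                                   # each entry is (running intersection, group of communities)
        for comm in communities:
            placed = False
            for i, (core, group) in enumerate(cores):
                rep = group[0]                       # compare against the group's first member
                overlap = len(comm & rep) / min(len(comm), len(rep))
                if overlap >= overlap_threshold:     # massively overlapping: same cluster
                    cores[i] = (core & comm, group + [comm])   # shrink the core by intersection
                    placed = True
                    break
            if not placed:
                cores.append((set(comm), [comm]))    # start a new cluster
        return [core for core, _ in cores]

Intersecting the members of each group strips away vertices that appear only under particular random seeds, leaving the densely connected core; the remaining vertices of each community lie in the rough boundary region discussed below.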
This work opens up several new questions about the structure of large-scale social
networks, and it demonstrates the successful use of the (α, β)-community algorithm
on real-world networks for identifying their underlying social structure. Further, this
work inspires an effective way of finding overlapping communities and discovering the
underlying core structure from random perturbations. In social graphs, one would
not expect a community to have an exact boundary; thus, the vertices inside an
(α, β)-community but outside the corresponding core are actually located in the rough
boundary regions. Other open questions include how the core structure will evolve,
whether the cores correspond to the stable backbones of the network, and whether the
vertices that belong to multiple communities at the same time constitute the unstable
boundary regions of the network.
2.2 Tracking flow of ideas in scientific literature
Related to this research, content-based inferring and learning has been exten-
sively studied recently. Various methods to improve question-answer services in so-
cial networks have been proposed[16−19] . In addition, tracking popular events in social
communities can be achieved using a statistical model[20] .
general problem of link prediction was studied by Clauset et al. in Ref. [24].
2.4 Tracking bird migration in North America
Hidden Markov models (HMMs) assume a generative process for sequential data
whereby a sequence of states (i.e. a sample path) is drawn from a Markov chain in a
hidden experiment. Each state generates an output symbol from a given alphabet, and
these output symbols constitute the sequential data (i.e. observations). The classic
single path problem, solved by the Viterbi algorithm, is to find the most probable
sample path given certain observations for a given Markov model[25] .
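For reference, the classic single path problem can be solved in a few lines. The following is a standard textbook Viterbi implementation in Python with NumPy (not code from Ref. [25]); it works in log space and expects the start, transition, and emission probabilities as NumPy arrays.

    import numpy as np

    def viterbi(obs, start_p, trans_p, emit_p):
        # obs      : sequence of observation indices
        # start_p  : start_p[s]    = P(state s at time 0)
        # trans_p  : trans_p[s, t] = P(next state t | current state s)
        # emit_p   : emit_p[s, o]  = P(observing o | state s)
        n_states, T = len(start_p), len(obs)
        logv = np.empty((T, n_states))               # log-probability of the best path ending in each state
        back = np.zeros((T, n_states), dtype=int)    # back-pointers for recovering the path

        logv[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
        for t in range(1, T):
            for s in range(n_states):
                scores = logv[t - 1] + np.log(trans_p[:, s])
                back[t, s] = int(np.argmax(scores))
                logv[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])

        path = [int(np.argmax(logv[-1]))]            # best final state
        for t in range(T - 1, 0, -1):
            path.append(back[t, path[-1]])           # follow back-pointers
        return path[::-1]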
Two generalizations of the single path problem for performing collective inference
on Markov models are introduced in Ref. [25], motivated by an effort to model bird
migration patterns using a large database of static observations. The eBird database
maintained by the Cornell Lab of Ornithology contains millions of bird observations
from throughout North America reported by the general public using the eBird web
application. Recorded observations include location, date, species and number of
birds observed. The eBird data set is very rich, and the human eye can easily discern
migration patterns from animations showing the observations as they unfold over
time on a map of North America. However, the eBird data entries are static and
movement is not explicitly recorded, only the distributions at different points in time.
Conclusions about migration patterns are made by the human observer, and the goal
is to build a mathematical framework to infer dynamic migration models from the
static eBird data. Quantitative migration models are of great scientific and practical
importance. For example, this problem comes from an interdisciplinary project at
Cornell University to model the possible spread of avian influenza in North America
through wild bird migration.
The migratory behavior of a species of birds can be modeled by a single gener-
ative process that independently governs how individual birds fly between locations.
This gives rise to the following inference problem: a hidden experiment draws many
independent sample paths simultaneously from a Markov chain, and the observations
reveal collective information about the set of sample paths at each time step, from
which the observer attempts to reconstruct the paths.
Figure 4 displays the pattern of ruby-throated hummingbird migration inferred
by this model for the four weeks starting on the dates indicated. The top row shows the
distributions and migrating paths inferred by the model: grid cells colored in lighter
shades represent more birds; arrows indicate flight paths between the week shown
and the following week, with line width proportional to bird flow. The bottom row
shows the raw data for comparison: white dots indicate negative observations; black
squares indicate positive observations, with size proportional to bird count; locations
with both positive and negative observations appear in a charcoal color. This leads to
a somewhat surprising prediction that when migrating north, some hummingbirds
will fly across the Gulf of Mexico while others follow the coastline, but when flying
south, they generally stay above land. This prediction has been confirmed by work
performed by ornithologists. For example, in the summary paragraph on migration
from the Archilochus colubris species account[26] , Robinson et al. write “Many fly
across Gulf of Mexico, but many also follow coastal route. Routes may differ for
north- and southbound birds.” The inferred distributions and paths are consistent
3 Theoretical Foundation
graphs is often unknown, one way to study these networks is to consider generative
graph models instead, where a graph is constructed by adding vertices and edges in
each time interval. Although such graphs typically differ from real-world networks in
many important ways, researchers can use the similarities between the two types of
networks to gain insight into real-world data sets.
A simple but commonly used model for creating random graphs is the Erdős–Rényi model, in which an edge exists between each pair of vertices with equal probability, independent of the other edges. A more realistic model is known as the
“preferential attachment” model, in which the probability that an edge is adjacent to
a particular vertex is proportional to the number of edges already adjacent to that
vertex. In other words, a high degree vertex is likely to gain more edges than a low
degree vertex. The preferential attachment model gives rise to the power-law degree
distribution observed in many real-world graphs.
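The mechanism is easy to state in code. The sketch below is our own simplified variant of preferential attachment, in which each new vertex brings a single edge (published models typically attach m edges per step); keeping a list of edge endpoints makes degree-proportional sampling a uniform choice from that list.

    import random

    def preferential_attachment_graph(n, seed_edges=((0, 1),)):
        # Grow a graph on n vertices in which new vertices attach to existing
        # vertices with probability proportional to current degree.
        edges = [tuple(e) for e in seed_edges]
        endpoints = [v for e in edges for v in e]    # each vertex appears once per incident edge
        next_vertex = max(endpoints) + 1
        while next_vertex < n:
            target = random.choice(endpoints)        # uniform over endpoints = degree-proportional over vertices
            edges.append((next_vertex, target))
            endpoints.extend((next_vertex, target))
            next_vertex += 1
        return edges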
Another interesting feature of real-world networks is the existence of the “giant
component”. The following table describes the number of components of each size in
a protein database containing 2,730 proteins and 3,602 interactions between proteins.
As the component size increases, the number of components of that size decreases.
Thus, there are many fewer components of size four or more in this protein graph than
components of size one, two, or three. However, those smaller components contain
only 899 proteins, while the other 1,851 proteins are all contained within one giant
component of size 1,851, as shown in the following table.
Consider the Erdős–Rényi random graph model in which each edge is added
independently with equal probability. Suppose that we start with 1,000 vertices and
zero edges. Then, there are clearly 1,000 components of size one. If we add one edge,
we will have 998 components of size one and one component of size two. However, a
giant component begins to emerge as more edges are added, as shown in the following
table. The graph contains many components of small size and a giant component of
size 101. This occurs because a component is more likely to attract additional vertices
as its size increases.
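This emergence is easy to reproduce by simulation. The following Python sketch (the parameter values are illustrative, not the exact experiment described above) starts with 1,000 isolated vertices, adds uniformly random edges one at a time, and tracks the size of the largest component with a union-find structure.

    import random

    def largest_component_growth(n=1000, max_edges=1500, report_every=250):
        parent = list(range(n))                      # union-find forest over the vertices

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]        # path halving
                x = parent[x]
            return x

        history = {}
        for m in range(1, max_edges + 1):
            u, v = random.sample(range(n), 2)        # a uniformly random edge, no self-loops
            parent[find(u)] = find(v)                # merge the two components
            if m % report_every == 0:
                sizes = {}
                for x in range(n):
                    r = find(x)
                    sizes[r] = sizes.get(r, 0) + 1
                history[m] = max(sizes.values())     # size of the largest component so far
        return history

    print(largest_component_growth())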
Since random graph models mimic some vital features of real-world networks,
it is often helpful to study the processes that generate these features in random
graph models. An understanding of these processes can provide valuable insights for
analyzing real-world data sets.
By the Law of Large Numbers, the value of D²(x₁, x₂), the squared distance between two random points x₁ and x₂, is the sum of a large number of random variables and is therefore tightly concentrated about its expected value. Thus,
high-dimensional data is inherently unstable for conventional clustering algorithms.
Another interesting phenomenon in high dimensions is the volume of a unit radius
sphere. As the dimension d increases, the volume of a unit radius sphere goes to zero.
This is easily seen by the following integration:
\[
V(d) = \int_{x_1=-1}^{1} \int_{x_2=-\sqrt{1-x_1^2}}^{\sqrt{1-x_1^2}} \cdots \int_{x_d=-\sqrt{1-x_1^2-\cdots-x_{d-1}^2}}^{\sqrt{1-x_1^2-\cdots-x_{d-1}^2}} dx_d \cdots dx_1
= \int_{S^d} \int_{r=0}^{1} r^{d-1}\, dr\, d\Omega = \frac{A(d)}{d},
\]
where $A(d) = \int_{S^d} d\Omega$ is the surface area of a unit radius sphere. Consider the Gaussian integration:
\[
I(d) = \int_{x_1=-\infty}^{\infty} \int_{x_2=-\infty}^{\infty} \cdots \int_{x_d=-\infty}^{\infty} e^{-(x_1^2+\cdots+x_d^2)}\, dx_d \cdots dx_1 = \left( \int_{-\infty}^{\infty} e^{-x_1^2}\, dx_1 \right)^{d} = \pi^{d/2}.
\]
Evaluating $I(d)$ in polar coordinates gives $I(d) = A(d)\,\Gamma(d/2)/2$, so $A(d) = 2\pi^{d/2}/\Gamma(d/2)$. Therefore,
\[
V(d) = \frac{2\pi^{d/2}}{d\,\Gamma(d/2)} \qquad \text{and} \qquad \lim_{d\to\infty} V(d) \approx \lim_{d\to\infty} \frac{3^d}{d!} = 0.
\]
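A quick numerical check of this formula (Python, using the Gamma function from the standard library; the chosen dimensions are arbitrary) shows the volume peaking near d = 5 and then collapsing toward zero.

    import math

    def unit_ball_volume(d):
        # V(d) = 2 * pi^(d/2) / (d * Gamma(d/2))
        return 2 * math.pi ** (d / 2) / (d * math.gamma(d / 2))

    for d in (1, 2, 3, 5, 10, 20, 50):
        print(d, unit_ball_volume(d))
    # prints roughly 2, 3.14, 4.19, 5.26, 2.55, 0.026, 1.7e-13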
non-zero volume. This occurs when the radius of the sphere reaches √d, where d is the dimension of the space. As we continue to increase the volume of the sphere we are integrating over, the probability mass will soon stop increasing since the density function decreases exponentially fast. Therefore, even though the probability density is maximum at the origin, all of the probability mass is concentrated in a narrow annulus of radius √d.
Given two standard Gaussian processes whose centers are extremely close, each
generating random data points, we should be able to tell which Gaussian process
generated which point. In high dimensions, even though the centers are close to each
other, the two annuli will not overlap by much. An algorithm for determining which
Gaussian process generated which point is to calculate the distance between each pair
of points. Two points generated by the same Gaussian process would be a distance √(2d) apart, and the distance between two points generated by different Gaussian processes would be √(δ² + 2d), where δ is the distance between the two centers. To
see this, first consider two random points generated by the same Gaussian process.
Generate the first point and then rotate the coordinate axes such that the point is
at the North Pole. Generate the second point and this point will lie on, or very
near, the equator, since most of the surface area of a high-dimensional hyper-sphere lies close to the equator. Thus, the two points and the origin form a right triangle. Having approximated the annulus by a hyper-sphere of radius √d, the distance between the two points generated by the same Gaussian process is √(2d).
To calculate the distance between two random points generated by different Gaus-
sian processes, again generate the first point and rotate the coordinate axes such that
the point is at the North Pole. Then, generate the second point and it will lie on, or
very near, the equator of the second hyper-sphere. Thus, the distance between the two points is given by √(δ² + 2d), as illustrated in Fig. 5.
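This distance argument is easy to verify by simulation. The Python/NumPy sketch below (the dimension, center separation, and sample size are illustrative values of our own choosing) draws points from two unit-variance Gaussians whose centers are δ apart and compares the observed pairwise distances with √(2d) and √(δ² + 2d).

    import numpy as np

    rng = np.random.default_rng(0)
    d, delta, n = 500, 5.0, 200                      # dimension, center separation, points per Gaussian

    shift = np.zeros(d)
    shift[0] = delta                                 # second center lies delta away along one axis
    X = rng.standard_normal((n, d))                  # samples from the first Gaussian
    Y = rng.standard_normal((n, d)) + shift          # samples from the second Gaussian

    within = np.linalg.norm(X[: n // 2] - X[n // 2:], axis=1)   # pairs from the same Gaussian
    between = np.linalg.norm(X - Y, axis=1)                     # pairs from different Gaussians

    print("same Gaussian:       mean %.2f   sqrt(2d)         = %.2f"
          % (within.mean(), np.sqrt(2 * d)))
    print("different Gaussians: mean %.2f   sqrt(delta^2+2d) = %.2f"
          % (between.mean(), np.sqrt(delta ** 2 + 2 * d)))

The sample means land close to the two predicted values even though the centers are far closer together than the typical point norm of roughly √d ≈ 22.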
written as:
minimize ||x||0 subject to Ax = b
where the 0-norm is the number of non-zero elements of the sparse vector x. However,
the 0-norm is non-convex and optimization problems of this nature are often NP-
complete. Thus, the question remains of when the solution to the convex optimization
problem:
minimize ||x||1 subject to Ax = b
correctly recovers the vector x. Note that the above minimization problem can be
solved by linear programming.
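As an illustration of that reduction, the following Python sketch uses SciPy's linprog (the helper name l1_minimize and the problem sizes are ours, purely illustrative). The standard reformulation splits x into non-negative parts u and v with x = u − v, so that ||x||₁ = Σᵢ uᵢ + Σᵢ vᵢ becomes a linear objective subject to [A, −A][u; v] = b.

    import numpy as np
    from scipy.optimize import linprog

    def l1_minimize(A, b):
        # Solve  min ||x||_1  subject to  Ax = b  by linear programming.
        n = A.shape[1]
        c = np.ones(2 * n)                           # objective: sum of u entries plus sum of v entries
        A_eq = np.hstack([A, -A])                    # encodes A(u - v) = b
        res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
        u, v = res.x[:n], res.x[n:]
        return u - v

    # Illustrative use: recover a 3-sparse vector from 40 random linear measurements.
    rng = np.random.default_rng(1)
    n, m, k = 100, 40, 3
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    A = rng.standard_normal((m, n))
    x_hat = l1_minimize(A, A @ x_true)
    print("recovery error:", np.linalg.norm(x_hat - x_true))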
4 Conclusion
References
[1] Shaparenko B, Caruana R, Gehrke J, Joachims T. Identifying temporal patterns and key players
in document collections. Proc. of IEEE ICDM Workshop on Temporal Data Mining: Algo-
rithms, Theory and Applications (TDM-05). 2005. 165–174.
[2] Gaertler M. Clustering. Network Analysis: Methodological Foundations, 2005, 3418: 178–215.
[3] Leskovec J, Lang K, Dasgupta A, Mahoney M. Statistical properties of community structure in
large social and information networks. Proc. of 18th International World Wide Web Conference
(WWW). 2008.
[4] Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys. Rev.
E, 2004, 69(026113).
[5] Wang L, Hopcroft JE. Community structure in large complex networks. Proc. of 7th Annual
Conference on Theory and Applications of Models of Computation (TAMC). 2010.
[6] Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Phys.
Rev. E, 2004, 70(066111).
[7] Girvan M, Newman MEJ. Community structure in social and biological networks. Proc. of
Natl. Acad. Sci. USA. 2002, 99(12): 7821–7826.
[8] Newman MEJ. Detecting community structure in networks. The European Physical J. B, 2004,
38: 321–330.
[9] Newman MEJ. Fast algorithm for detecting community structure in networks. Phys. Rev. E,
2004, 69(066133).
[10] Newman MEJ. Finding community structure in networks using the eigenvectors of matrices.
Phys. Rev. E, 2006, 74(036104).
[11] Newman MEJ. Modularity and community structure in networks. Proc. of Natl. Acad. Sci.
USA. 2006, 103(23): 8577–8582.
[12] Lang K, Rao S. A flow-based method for improving the expansion or conductance of graph cuts.
Proc. of 10th International Conference Integer Programming and Combinatorial Optimization
(IPCO). 2004.
[13] Schaeffer SE. Graph clustering. Computer Science Review, 2007, 1(1): 27–64.
[14] Mishra N, Schreiber R, Stanton I, Tarjan RE. Finding strongly-knit clusters in social networks.
Internet Mathematics, 2009, 5(1–2): 155–174.
[15] He J, Hopcroft JE, Liang H, Supasorn S, Wang L. Detecting the structure of social networks
using (α, β)-communities. Proc. of 8th Workshop on Algorithms and Models for the Web Graph
(WAW). 2011.
[16] Liu Y, Bian J, Agichtein E. Predicting information seeker satisfaction in community question
answering. SIGIR, 2008.
[17] Wang K, Ming Z, Chua T. A syntactic tree matching approach to finding similar questions in
community-based qa services. SIGIR, 2009.
[18] Wang XJ, Tu X, Feng D, Zhang L. Ranking community answers by modeling question-answer
relationships via analogical reasoning. SIGIR, 2009.
[19] Yang T, Jin R, Chi Y, Zhu S. Combining link and content for community detection: a discrim-
inative approach. KDD, 2009.
[20] Lin CX, Zhao B, Mei Q, Han J. Pet: a statistical model for popular events tracking in social
communities. KDD, 2010.
[21] Soundarajan S, Hopcroft JE. Recovering social networks from contagion information. Proc. of
7th Annual Conference on Theory and Applications of Models of Computation (TAMC). 2010.
[22] Gomez-Rodriguez M, Leskovec J, Krause A. Inferring networks of diffusion and influence. Proc.
of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
2010.
[23] Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N. Cost-effective outbreak
detection in networks. Proc. of 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD), 2007.
[24] Clauset A, Moore C, Newman MEJ. Hierarchical structure and the prediction of missing links
in networks. Nature, May 2008.
[25] Sheldon DR, Saleh Elmohamed MA, Kozen D. Collective inference on Markov models for modeling bird migration. Neural Information Processing Systems (NIPS), 2007.
[26] Robinson TR, Sargent RR, Sargent MB. Ruby-throated hummingbird (Archilochus colubris).
In: Poole A, Gill F, eds. The Birds of North America, number 204. The Academy of Natural
Sciences, Philadelphia, and The American Ornithologists’ Union, Washington, D.C., 1996.
[27] Gehrke J, Ginsparg P, Kleinberg J. Overview of the 2003 KDD cup. SIGKDD Explorations,
2003, 5(2): 149–151.
[28] Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS. Fast algorithms for projected clustering. Proc. of ACM SIGMOD International Conference on Management of Data. 1999.