
Evolving the Ecosystem of Personal Behavioral Data

2017, Human–Computer Interaction

Evolving the Ecosystem of Personal Behavioral Data

Jason Stampfer Wiese
September 2015
CMU-HCII-15-108

Human-Computer Interaction Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213 USA

Committee:
Jason Hong (Co-Chair)
John Zimmerman (Co-Chair)
Anind Dey
James Landay (Stanford University)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

© 2015 Jason Wiese. All rights reserved.

Funding for this research comes from the Yahoo InMind project, the Stu Card Fellowship, a Google Faculty Research Award, NSF Grant No. DGE-0903659, and ONR N66001-12-C-4196. Any opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily reflect those of any of the sponsors.

Abstract

Personal data is everywhere. The widespread adoption of the Internet, fueled by the proliferation of smartphones and data plans, has resulted in an amazing amount of digital information about each individual. Social interactions (e.g. email, SMS, phone, Skype, Facebook), planning and coordination (e.g. calendars, TripIt, Basecamp, online to-do lists), entertainment (e.g. YouTube, iTunes, Netflix, Spotify), and commerce (e.g. online banking, credit card purchases, Amazon, Zappos, eBay) all generate personal data. This data holds promise for a breadth of new service opportunities to improve people's lives through deep personalization, through tools to manage aspects of their personal wellbeing, and through services that support identity construction. However, there is a broad gap between this vision of leveraging personal data to benefit individuals and the state of personal data today.

This thesis proposes unified personal data as a new framing for engaging with personal data. Through this framing, it synthesizes previous research on personal data and describes a generalized framework for developing applications that depend on personal data, exposing current challenges and issues. Next, it defines a set of design goals to improve the state of personal data systems today. Finally, it contributes Phenom, a software service designed to address the challenges of developing applications that rely on personal data.

Acknowledgements

I am thankful for the advice, support, insights, and feedback from my co-advisors Jason Hong and John Zimmerman. Throughout my time as their student, their patience with me and commitment to me have been steadfast. I am also grateful for the insights of my committee members Anind Dey and James Landay, and of my internship mentors Jacob Biehl, Elizabeth Churchill, A.J. Brush, and Scott Saponas. The faculty, staff, and students of the Human-Computer Interaction Institute have provided a warm and supportive community of colleagues, collaborators, and dear friends. In particular, I would like to thank Ian Li, Sauvik Das, Derek Lomas, Jenn Marlow, Jenny Olsen, Iris Howley, Scott Davidoff, Eiji Hayashi, Will Odom, and Queenie Kravitz. I am especially grateful to my Friendship House roommates Chris Harrison, Stephen Oney, and Amy Ogan. These years at Carnegie Mellon have been immeasurably better because of the three of you. Thank you to my family for always believing in me: my parents Evette and Charlie, and my sisters Sarah and Samantha. Most of all, I want to thank Eliane, my partner in life, who has contributed blood, sweat, and tears to this dissertation.
Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1 Introduction: The Challenge of Personal Data
  1.1 Research Contributions
  1.2 Dissertation Overview
2 The Landscape of Personal Data
  2.1 Personal Information Management
  2.2 User Modeling
  2.3 Recommender Systems
  2.4 Lifelogging
  2.5 Context-Aware Computing
  2.6 Personal Informatics and Quantified Self
  2.7 Computational Social Science and Data Mining To Understand Human Behavior
  2.8 Identity Interfaces: Virtual Possessions and Self-Reflection
  2.9 Sharing Context
  2.10 Privacy
  2.11 Research Examples
  2.12 Discussion
3 A Case Study: Inferring Sharing Preferences Using Communication Data
  3.1 Connecting Features of Social Relationships to Sharing Preferences
  3.2 Using Communication Data To Infer Tie-Strength
  3.3 Case Study Discussion
4 A Conceptual Framework for Personal Data
  4.1 The Ecosystem of Personal Data Today
  4.2 The Personal Data Continuum
  4.3 The Steps for Working with Personal Data
  4.4 The Conceptual Framework
  4.5 Design Goals
5 Phenom: A Service for Unified Personal Data
  5.1 System Architecture
  5.2 Evaluation: Example Applications and Queries
  5.3 Discussion
  5.4 Related Work
6 Conclusion
7 References

List of Figures

Figure 1: Personal data today is separated across the applications and services where each type of data originated (left). To unlock the full potential of personal data, it should instead be structured to prioritize the coherence of the heterogeneous data around each individual who is the subject of that data (right).

Figure 2: The conceptual framework of the steps involved in developing software that depends on personal data: data is collected from some number of data sources, the collected data is transformed into the appropriate level of abstraction or meaning for the target application, and the data is incorporated in that target application.

Figure 3: A map highlighting research domains across HCI that generate, make use of, or investigate personal data, and the interconnections between them.

Figure 4: Stage-Based Model of Personal Informatics Systems (Li et al., 2010).

Figure 5: An example of a scenario presented during one of the sessions.

Figure 6: The goal of the research in this chapter is to use communication logs (call and SMS logs) to infer sharing preferences, using tie strength as an intermediate representation. Theoretical literature supports the connection between communication logs and tie strength and also the connection between tie strength and sharing preferences. Additionally, past work has demonstrated that communication behavior corresponds to tie strength. Therefore, I focused first on the connection between tie strength and sharing preferences before attempting to replicate the prior finding connecting communication behavior to tie strength.

Figure 7: The instructions for the grouping activity.

Figure 8: Hierarchical clustering using average linkage distance. Horizontal position of the branches is directly proportional to the calculated distance between each cluster. Scenarios are shorthand for the same ones in Table 3.

Figure 9: Total number of friends within each tie strength level across all participants, separated by the number of contacts who appeared only in the contact list, only in the Facebook friends list, in both, or in neither. The data indicate that a notable number of strong ties appear only in the phonebook and not on Facebook, while few strong ties appear only on Facebook and not in the phonebook.

Figure 10: Number of friends in the mobile contact list who exchanged zero (No Comm Logs) vs. at least one (Some Comm) SMS or call with our participants (determined from call log data). There are a number of strong ties with zero communication logs in the dataset. Any classifier that is based on this communication behavior will misclassify those strong ties as weak ties. This issue is even more pronounced for medium tie-strength: nearly half of those contacts have no communication in the collected dataset.

Figure 11: A grid of six plots showing communication frequency and total talk time. The top three graphs plot each contact's aggregate call duration (y-axis) against number of calls (x-axis). The bottom three graphs plot each contact's number of SMS messages (y-axis) against number of calls (x-axis). For both top and bottom, the columns separate the contacts by tie strength group. The graphs include data for contacts with at least one call or SMS. All numbers are represented as the percentage of a participant's total communication frequency/duration.

Figure 12: Personal data today is separated across the applications and services where each type of data originated (left). To unlock the full potential of personal data, it should instead be structured to prioritize the coherence of the heterogeneous data around each individual who is the subject of that data (right).

Figure 13: The personal data continuum ranges from very low-level data (far left side), like sensor data that describes the user's behavior and surroundings, to very high-level data (far right side) that describes information about individuals that they might not even know about themselves. Information in the lower levels can often be directly sensed, but data higher on the continuum has to be provided manually or inferred from a combination of lower-level data.

Figure 14: The personal data pipeline breaks down the steps of working with personal data. At a high level, using personal data means collecting the data, inferring some meaning from that data, and then applying the data to the target application. However, these steps are deceivingly simple. In reality each of these steps is complex, with many components and a host of implicit challenges.

Figure 15: A system diagram for Phenom illustrating its different components. The Epistenet Data Store serves as a semantic knowledge base of personal data. Data providers bring personal data in from external data sources. Bots operate on the data contained within the datastore to generate inferences and abstractions. A unified querying API provides application developers with a single query interface to access the richly interconnected personal data from the datastore.

Figure 16: An example of an ontology in Epistenet. Directed edges in this graph refer to "subsumptive" relationships: a PhoneCall is a type of Communication. Attributes of a parent ontology class are also contained in the descendants of that ontology class.

List of Tables

Table 1: Data collected for each friend. Data in the top half of the table ("observable features") is data that was potentially observable by a UbiComp system or social networking site. Data in the bottom half of the table would either be inferred from the observable features or manually inputted by the user.

Table 2: Linear regression models predicting sharing and closeness (last column only), controlling for each participant. Each column is a different model, and data in the table are non-standardized β coefficients, except for R² in the last row, which can be compared across models to demonstrate the variance explained. For example, the "close" model (fourth column) includes one effect, friend closeness, and this model accounts for 63% of the variance in sharing preferences. Gray cells indicate effects that were not included for that particular model. The data indicate both that closeness is the best predictor of sharing, and that observable features can predict closeness. Significance: *p<0.05; **p<0.01; ***p<0.001.

Table 3: Summary of data for each sharing scenario, sorted by overall mean sharing. The first column reports the correlation with closeness, and all correlation coefficients are significant to p<.001. The Tukey-Kramer test compares the overall means for sharing in each scenario: scenarios that have the same letter are not significantly different from each other.

Table 4: The results of nine classifiers constructed using SMO. The prediction classes are tie-strength categories. For 2-verystrong, the medium-strong and weak tie strength classes are combined, and for 2-mediumstrong the medium-strong and very strong tie strength classes are combined.

1 Introduction

In the last decade, our society has undergone a fundamental shift in day-to-day life, starting with the widespread adoption of the Internet and rapidly accelerated by the proliferation of smartphones and data plans. For a large and growing portion of the first-world population, an incredible number of people's daily tasks are now mediated by Internet-connected computing technology: social interactions (e.g. email, SMS, phone, Skype, Facebook), planning and coordination (e.g. calendars, TripIt, Basecamp, online to-do lists), entertainment (e.g. YouTube, iTunes, Netflix, Spotify), and commerce (e.g. online banking, credit card purchases, Amazon, Zappos, eBay) are all activities that are increasingly digitally mediated. In addition, people are generating increasing amounts of files, including documents, media, and contact lists. Fueled by convenience and increased efficiency, the way people do things today is markedly different from the prior decade.
Through this lens, the massive accumulation of data that describes people's behavior in these applications and services is merely a byproduct of this major societal shift: these applications capture who their users communicate with, what their users purchase, and what content they consume. But these large caches of data are hardly a coincidental byproduct: Facebook, Google, Amazon, and Netflix each owe their continued success in large part to the massive stores of personal data they have amassed that describe their users' behaviors. These companies employ their users' data to sell advertising, recommend content, and personalize interfaces.

Often, companies use the amassed data while users remain uninvolved. Most users do not understand what data is being used, how it is being used, how they might be at risk, and how they might benefit from applications that use their data. People are understandably concerned, distrustful, and feel helpless when it comes to their data. Other than withdrawing altogether from our technology-driven society, what choice do they have? As a result, most people have a fairly distanced relationship with the data about them: the typical person has effectively no relationship with their data. The concerned person tries to minimize what is collected, to say "no" whenever offered a choice that lets them still receive service without surrendering their data. Thus, the ecosystem of personal data appears quite dysfunctional: the people who are the subject of that data have limited access to it and try to minimize its existence, while companies vie for users so that they can have unrestricted access to the data users will generate in their services.

Simultaneously, there is a sense that this data holds immense value that, when combined, could unlock an exciting new future of highly personalized, meaningful personal computing experiences. Many applications and services have begun to demonstrate the personalized, holistic, and user-centric potential that individuals' data has to offer. Personal assistants like Google Now, Siri, and Cortana use the data collected within their platforms to suggest contextually relevant information and answer queries. The Nest thermostat adapts to a user's behavior and makes adjustments auto-magically. Gmail's priority inbox feature uses a variety of heuristics, like which emails the user reads first and who the user sends emails to, in order to guess which emails the user wants to be prioritized.

Yet, these examples feel like they fall short of the real potential of personal data. Researchers motivate their papers with promises for the awesome, intelligent, personalized future of computing. Science fiction envisions personal assistants that understand complex situations (as in the film Her, 2013), learning environments that can relate lessons to our actual life experiences (as in Star Trek, 2009), and technology that can automatically assess and treat mental health conditions (as in Orson Scott Card's Ender's Game, 1985). With a little imagination, there is the clear potential for technology to support tasks that are difficult for people to do today: Where should I go on vacation? How can I live more sustainably? Who should I room with in college? What thing should I buy to make my life better? What should I do differently to be a better boss/employee/spouse/parent/friend? A future where technology can help us in these situations seems more plausible than it has ever been before.
Following the path to realizing this vision will require major advances across computer science: speech interfaces, machine learning, robotics, sensing hardware, database systems, privacy and security, and distributed systems. Furthermore, beyond computer science, much of this personalization will require domain-specific knowledge and will likely require advances in those fields as well. To be able to attempt the advances required to enable this promising future requires engaging with the present-day dysfunctional landscape of personal data characterized above, itself a daunting task. Even worse, beneath the surface of the societal and social issues surrounding personal data is a similarly dysfunctional technological landscape. Science and engineering research answers well-defined questions, but in the case of personal data, the goal state is ill-defined. Making an advance under these conditions first requires specifying a new frame for understanding what could and should be; a vision for the future of personal data.

Figure 1: Personal data today is separated across the applications and services where each type of data originated (left). To unlock the full potential of personal data, it should instead be structured to prioritize the coherence of the heterogeneous data around each individual who is the subject of that data (right).

Recent work by Pentland has proposed "a new deal on data," specifying that users should be the owners of the data that describes their own behavior (Pentland, 2009). Following this theme, Estrin has proposed a vision of "small data" wherein each individual can leverage the traces of data about them in order to build insights about themselves (Estrin, 2014). These visions offer components of an intriguing future: who should own a person's data (people should own their own data) and what people ought to do with their own data (people should be able to build insights about themselves from their own data). Building on this recent work, this thesis proposes the framing of unified personal data as an opportunity and a goal state for advancing the landscape of personal data. The unified personal data vision claims that an individual's heterogeneous personal data should be tightly integrated and represented all together on the level of the individual (Figure 1, right), rather than each user's personal data being disparate, disjoint, and siloed across each of the particular services and devices that an individual uses. This framing of unified personal data as a goal state for personal data signifies an important design contribution that advances many areas of computing.

An entire host of challenges must be overcome to bring about unified personal data. Personal data is siloed within the services and devices where it was collected. Companies independently determine what data to collect, whether or not data can be accessible outside of the service, how long data will be kept, the terms of use for the data, and what format that data can be accessed in. Even if a user has the power to grant a developer access to her data, the challenges continue: bringing data together from multiple sources, doing something to process that data (e.g. machine learning), and applying the data are all a massive undertaking. Furthermore, there is very little structure or support for this process today.
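To make the scale of that undertaking concrete, consider just the first step: bringing two sources into one timeline. The sketch below is illustrative only; the file names, formats, and field names are invented, but every real source demands this same bespoke fetch-parse-normalize work before any machine learning or application logic can begin.

```python
import csv
import json
from datetime import datetime, timezone

def load_calls(path):
    """Hypothetical phone-log export: CSV rows with epoch-second timestamps."""
    events = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            events.append({
                "when": datetime.fromtimestamp(int(row["ts"]), tz=timezone.utc),
                "kind": "call",
                "who": row["number"],
            })
    return events

def load_purchases(path):
    """Hypothetical shopping export: a JSON list with ISO-8601 timestamps
    (assumed here to include a UTC offset)."""
    with open(path) as f:
        records = json.load(f)
    return [{
        "when": datetime.fromisoformat(r["timestamp"]),
        "kind": "purchase",
        "who": r["merchant"],
    } for r in records]

# Only after each source is fetched, parsed, and normalized by hand can the
# data be treated as a single per-person timeline.
timeline = sorted(load_calls("calls.csv") + load_purchases("orders.json"),
                  key=lambda e: e["when"])
```

Each additional source repeats this work with a different authentication scheme, export format, and schema, which is why even "simple" unification projects balloon in effort.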
Advancing the state of personal data will require a fundamental shift in the way that personal data is managed. Today, personal data is stored separately by each company that collected it, and then within each application or service it is separated by user. This approach is a natural fit for "big data" analysis: a company can use the data it has amassed across all of its users to gain insights on user behavior. If the goal is to gain insights about individuals, however, the current approach is a bad fit. The amount of effort required to participate in the quantified self movement helps to illustrate just how prohibitive the current approach is: getting even a partial view of one's own data requires technical skills and a fair amount of invested time in order to write the code that brings together data from these disparate sources and does something interesting with that data. While motivated individuals are able to draw together some of their data and even generate their own insights from it (see Stephen Wolfram's http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/ and Nicholas Felton's http://feltron.com/ for two notable examples), these systems tend to be built in an ad hoc fashion (e.g. connecting to specific sets of services, designed to run in specific programming environments). Even for individuals with technical skills, these can be difficult to build on, and for those without technical knowledge most tools that bring together multiple sources of heterogeneous data are out of reach.

Though these observations might seem obvious in retrospect, they were not. In my early days as a doctoral student at Carnegie Mellon, colleagues and I would hypothesize many ideas of the form "I bet if you had [X], [Y], and [Z] data, you could infer [A]." Finally, I tried one, a comparatively simple one: "I bet if you had a person's communication logs you could infer the strength of their relationships with all of their contacts." In fact, we expected it was going to be so simple that the real research contribution would not be the relationship model; instead, the contribution was going to be "inferred relationship strength can be used to set sharing preferences." As chapter 3 details, this was in fact not a simple task, and the result left much to be desired. Furthermore, following up with more data from more sources was not feasible: the resources required to make even some simple additions were too great. Why was this so difficult, so resource intensive? What changes are necessary to improve this state of personal data? This dissertation seeks answers to those questions by stepping back to take a holistic look at the ecosystem of personal data. To accomplish this, I employ a multidisciplinary human-computer interaction approach, integrating inquiry techniques from both computer science and design to make advances in both disciplines.

Figure 2: The conceptual framework of the steps involved in developing software that depends on personal data: data is collected from some number of data sources, the collected data is transformed into the appropriate level of abstraction or meaning for the target application, and the data is incorporated in that target application.

Another important outcome of this dissertation is a conceptual framework that describes the general process that is required to develop software that depends on personal data (see Figure 2). The conceptual framework consists of two components.
The first component is a continuum of personal data from very low-level (e.g. raw sensor data) to very high-level (e.g. is the user experiencing major depression?). The second component is a set of three steps that are required to develop applications that depend on personal data. First, capture or collect the necessary personal data. Second, transform the collected personal data into the required level of abstraction or meaning for how it will be used. Third, apply the transformed data to the target application. While these three steps may seem simple on the surface, chapter 4 details a variety of challenges that reveal the complexity of engaging in this process. The conceptual framework is a useful tool for engaging a conversation around the process of developing applications with personal data, and it exposes some critical issues for working with personal data, issues which limit what is reasonably possible for researchers and application developers to accomplish today. By distilling these challenges, this dissertation proposes a set of goals for achieving the vision of unified personal data. Finally, this dissertation describes the design and implementation of Phenom, a service designed to make progress towards these goals by modularizing the process of working with personal data. Phenom dramatically reduces the effort that is required for a developer to incorporate personal data in an application. By employing a semantic data store and focusing on an integrated, flexible, and modular approach to handling personal data, Phenom radically changes how easy it is for a developer to program applications that depend on personal data, demonstrating a first step towards the vision of unified personal data.

1.1 Research Contributions

This dissertation offers the following technical and design contributions to HCI:

1. A proposal for unified personal data: a reframing of many HCI challenges, human needs, and technical opportunities that can all be advanced through a more holistic view of all of the individual data amassing around people as their personal data, which should be brought together and structured such that it works for them and remains under their control.
2. The notion of personal data as a continuum, and a conceptual framework that unpacks the implicit process involved in working with personal data.
3. A set of design goals for improving the ecosystem of personal data.
4. The design of Phenom: a service that supports software development with personal data. Phenom modularizes the collection, interconnection, processing, and querying of personal data to solve a key set of challenges involved in developing applications that use personal data.
5. The implementation of a proof of concept of Phenom, which demonstrates its viability and utility as a personal data service.

1.2 Dissertation Overview

Chapter 2 highlights many research domains that have incorporated personal data (often implicitly) in their work, including a variety of my own projects across those domains. Personal data underlies multiple threads of research, and in many cases progress in those domains appears stifled because of the limitations of the state of personal data today. Despite these encountered limitations and the commonalities across fields, it appears that no efforts have been made towards connecting these domains and engaging holistic thinking on the ecosystem of personal data.
The vision of unified personal data offers a new frame for viewing people's collections of personal data, one that offers benefits to the various research domains that have helped to define personal data and that employ it to make advances.

Chapter 3 offers a detailed case study of the process and findings of my own research to connect communication behavior to social sharing preferences. The practical challenges faced in that work highlight many of the shortcomings of engaging in research with personal data.

Chapter 4 synthesizes the landscape of personal data mapped out in the previous two chapters to engage personal data from a holistic perspective. This synthesis produces a set of general steps that are required for making use of personal data: collecting the data, making higher-level sense of the data, and applying the processed data to an application. It identifies a set of challenges and issues that inhibit work with personal data, using the framework to illustrate these challenges. Finally, it proposes a set of design goals that offer an agenda for improving the personal data ecosystem.

Chapter 5 describes Phenom, a service that I developed to support the process of developing applications based on the vision of unified personal data. Phenom addresses some of the most prominent challenges of working with personal data by offering a modular approach that separates the steps of the personal data development process. Phenom unifies personal data on the level of the individual, supporting rich interconnections in the data and reuse of components across completely independent applications.

Chapter 6 concludes the dissertation with an eye towards the future of personal data research.

2 Situating Unified Personal Data within the Landscape of Research that Leverages Personal Data

Traditionally, scientific and engineering research both focus on answering well-formed research questions; the mantra "what is your research question?" is universally familiar and relevant. And yet sometimes identifying what the question should be is itself a major research challenge. These situations can be daunting from the perspective of science and engineering research. Design research, in contrast, focuses on the search for a question that is worth answering. Design researchers refer to this as framing a goal state that supports an advance toward a preferred state of the world, and design research treats relevance and improvement to the world as the most critical criteria.

To understand this approach requires a basic understanding of the concept of design thinking. When describing design, Herbert Simon wrote, "To design is to devise courses of action aimed at changing existing situations into preferred ones" (Simon, 1969). This places design thinking in a subjective space with a focus on what might be better for the world. Rittel and Webber advanced this idea with their work on "wicked problems," large-scale social issues like urban crime that are not easily addressed through science or engineering inquiry, but that are approachable through design thinking (Rittel & Webber, 1973). These challenges cannot be accurately modeled (and thus cannot be solved by scientific or engineering methods alone) because of the conflicting perspectives of the stakeholders involved.
The complex open system of multiple stakeholders with conflicting goals and innumerable possible solutions described by Rittel and Webber is applicable when considering the ecosystem of personal data (see chapter 4 for a discussion of some of the stakeholders for personal data). They describe how design thinking makes advances on these types of problems by proposing solutions that offer a unique framing of the problem to be solved; it is only through the articulation of a solution that researchers can even know the problem they want to address. Speaking on how design functions as a reflective practice, Donald Schön argues for the importance of framing problems, and specifically that the process of design thinking is about picking a specific frame to engage (Schön, 1983). Design thinkers employ a process of reflecting-in-action and reflecting on action as they generate and assess many possible frames (i.e. the futures they might want to achieve). More recently, Kees Dorst considered different forms of reasoning (deduction, induction, and abduction) to position design thinking with respect to the kinds of reasoning frequently encountered in science and engineering (Dorst, 2011). Dorst builds on Schön's concept of framing, identifying that framing is a form of perspective-taking, with many different perspectives possible. He discusses how designers systematically cycle through many possible desired outcomes in order to discover a path forward that can resolve a problematic situation.

This thesis proposes unified personal data as a mechanism to unlock the promised future of personalized computing experiences. The articulation of this goal (a preferred future) and this mechanism (unified personal data) matches Dorst's conception of framing in design thinking. Framing the vision and the opportunity of unified personal data is a core contribution of this dissertation that unfolds through chapters 2, 3, and 4. The messy and iterative nature of this process does not easily lend itself to the serial format of this dissertation, and this chapter relies in part on the overview offered in chapter 1 to give structure to this framing.

To begin this framing, this chapter provides a broad survey of the research domains that have led to the conception of unified personal data as a solution. Across computing research, researchers have been implicitly examining the need for and benefit of personal data from a variety of perspectives over many years. Despite the common interest in personal data shared by each of these domains, and perhaps because of the lack of a broader cross-discipline unified personal data framing, the contributions between them have been mostly disconnected: progress in one personal-data-focused sub-discipline typically has little impact on the work in other sub-disciplines. Treating personal data holistically as a research community, rather than as a disconnected (or loosely connected) combination of research topics, may provide the long-term support necessary to push forward the evolution of personal data. The bulk of this chapter highlights each of these domains and the connections between them (see Figure 3), focusing on challenges that each domain has encountered related to personal data. Highlighting these challenges serves several purposes in this dissertation.
First, understanding the personal-data-related challenges in each field offers multiple perspectives that contribute to the problem framing. Additionally, understanding how each of these fields relates to personal data offers the ability to contextualize advances made in the space of personal data with respect to each discipline. The end of this chapter offers an overview of various personal-data-related research projects that I have worked on. While many of these projects also have separate research contributions of their own, in the context of this dissertation these projects can be seen as design probes. Through this lens, each of these projects has offered a different perspective towards framing the opportunity of unified personal data, which I have synthesized in chapter 4.

Figure 3: A map highlighting research domains across HCI that generate, make use of, or investigate personal data, and the interconnections between them.

2.1 Personal Information Management

The research area of Personal Information Management (PIM) studies how people acquire, organize, maintain, and retrieve the many different types of information that they use in their day-to-day lives. In many ways, the field draws inspiration from Vannevar Bush's seminal paper "As We May Think" (Bush, 1945). Bush describes the hypothetical Memex, a microfilm-based system for storing and retrieving the multitude of information that people handle throughout their lives. The first work actually referencing PIM appeared in the 1980s, evolving as a research area around the same time as HCI (see (W. Jones, 2007) for a helpful explanation of the study of PIM that offers some connections between Bush's "As We May Think" and the modern study of PIM starting in the 1980s). Jones' survey chapter of PIM (W. Jones, 2007) describes three "senses" of personal information:

1. The information people keep for their own personal use (e.g. contact lists, financial records, time-tracking logs)
2. Information about a person but possibly kept by and under the control of others (e.g. invoices from purchases made on Amazon, electronic medical records)
3. Information experienced by a person, even if this information remains outside a person's control (e.g. the news stories a person views online, the items a person browses on Amazon but does not buy)

As Jones identifies, the study of PIM primarily focuses on the first sense, but acknowledges the relevance of the second and third as well. Put another way, PIM is primarily a study of how humans use the tools available to them in order to store information that they would like to access in the future, including contacts (Whittaker, Jones, & Terveen, 2002), calendar appointments (Starner, Snoeck, Wong, & McGuire, 2004), to-do items (Bellotti, Dalal, Good, Flynn, & Bobrow, 2004), email (Ducheneaut & Bellotti, 2001), and the myriad pieces of unstructured information that we collect (Bernstein, Van Kleek, Karger, & Schraefel, 2008), along with the possibility of finding the structure in that data (Chang et al., 2013). Other recent work in PIM has explored the concept of unifying different types of heterogeneous personal information (Karger & Jones, 2006). The problems cited in this work offer support for the design goals specified in chapter 4.

PIM research offers an important component to the broader context of personal data research. First, PIM research provides an important examination of the interface between the user and the storage and retrieval of their personal data.
In PIM research, the specific interaction is about users explicitly storing pieces of their personal data for the purpose of retrieving it themselves later: there is no additional processing happening on the data while it is stored. However, PIM research can contribute insights into how best to collect hand-labeled ground truth data. This is particularly important for training models on personal data. With personal data, the ground truth labeling task is more constrained than in the general case, because the person labeling the data typically needs to be the person who that data is about.

Today, there is much more personal information for users to manage than ever before. The volume of data has grown both because there are more types of digital personal information (e.g. media collections, shopping behavior, and taxes) and because there are more data in existing channels (e.g. growing histories of email interaction, and more and more email each year). The result is that people have more personal data than ever to keep track of, and many of the "things" are stored on different third-party services. A major challenge for PIM is to provide people with easy and relevant access to their data. This dissertation focuses on offering new ways of connecting relevant data together, even across different third-party services, which is one component of the challenge facing PIM. Furthermore, this dissertation proposes a unified approach to track and store people's interactions with their data, which is another aspect of PIM.

2.2 User Modeling

Research related to the goal of creating software that responds and reacts to characteristics of the user first appeared in the 1970s in several different application domains. Some of this work had a goal of creating intelligent tutoring systems that would use the student's behavior within a tutoring system to personalize the software's behavior for that student (Burton & Brown, 1979). Other work was focused on dialog systems that would tell different things to different users based on what the software could infer the user already knew (Allen, 1979; Cohen & Perrault, 1979; Perrault, Allen, & Cohen, 1978), or that would use characteristics of the user to guess what the user's intention was (Rich, 1979a, 1979b). In these early modeling systems, the modeling components were not distinct from the rest of the application, but as the field grew, user modeling systems were made more modular. The first wave of modularization came in the form of user modeling shells embedded as part of the application. Fueled by the advent of the Internet, the next wave was server-based user models that could support multiple distributed client applications (Kobsa, 2001).

User modeling has grown with the rest of computing to include modeling in mobile and ubiquitous contexts. The info-bead user modeling approach (Dim, Kuflik, & Reinhartz-Berger, 2015) is one such system; it represents different pieces of user context as info-beads that can be connected together through info-links to form info-pendants. The complete collection of info-beads and info-pendants can be combined to form user models and group models, which can be used to personalize specific systems (a minimal sketch of this style of composition appears below).
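The sketch illustrates the composition pattern only; the class and field names here are hypothetical, and the published info-bead architecture differs in its details.

```python
from dataclasses import dataclass, field

@dataclass
class InfoBead:
    """One self-contained piece of user context (e.g. a location reading)."""
    name: str
    value: object

@dataclass
class InfoPendant:
    """Beads joined by info-links into a higher-level unit of context."""
    name: str
    beads: list = field(default_factory=list)

    def link(self, bead: InfoBead) -> "InfoPendant":
        self.beads.append(bead)  # an "info-link" modeled here as containment
        return self

# A commute pendant composed from reusable beads; the same beads could be
# linked into other pendants in other deployments.
commute = (InfoPendant("morning_commute")
           .link(InfoBead("location", "bus_61C"))
           .link(InfoBead("time_of_day", "08:15")))
print([bead.name for bead in commute.beads])  # ['location', 'time_of_day']
```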
This modular approach can enable the reuse of info-beads and info-pendants across different deployments of the info-bead user modeling approach. Though the architecture and the approach to modularity in the info-bead approach are different from those of the Phenom implementation described in chapter 5, the info-bead approach and Phenom's bots (see section 5.1) share a common inspiration: the value placed on modular components that can be used across different applications.

User modeling has also begun to expand to include the idea that user data can come from across the user's lifetime. PortMe (Kay & Kummerfeld, 2010) is a user model framework designed to support models that are based on the user's lifetime of personal data. PortMe provides an interface so that users can view and interact with details of user models that are based on their own data, and relies on the PersonisAD user model server (Assad, Carmichael, Kay, & Kummerfeld, 2007) for the underlying user model representation. This concept of holistically thinking about personal data that spans a person's entire life is core to engaging some of the fundamental issues with personal data, and it informs the design goals in chapter 4.

Where PIM research was mostly focused on a user-centric perspective of explicitly collected personal data, user modeling takes a primarily system-centric perspective, focusing instead on developing domain-specific models based on user behavior. User modeling brings to personal data research a demonstrated process for making end-to-end systems that leverage a user's behavior to model a specific item and incorporate that model into the application. Though these tend to be closed systems (i.e. data is collected from, modeled by, and applied to a single application), work in user modeling represents concrete examples of leveraging personal data to create models and apply those models to specific applications. One important challenge facing work in user modeling is deploying the models. This is essential for being able to build on the models (either in research or in commercial contexts), and also for understanding the real-world validity of the models beyond the more controlled environment of a traditional study. Phenom, the system described in this dissertation, offers an architecture that supports deploying models that depend on a user's personal data.

2.3 Recommender Systems

Recommender systems emerged in the early 1990s, growing out of user modeling into a space that was more directly focused on user experience. Specifically, early recommender systems set out to address a clear problem: as more and more people started using the Internet, the amount of content was growing considerably and information overload was setting in (Konstan & Riedl, 2012). Tapestry (Goldberg, Nichols, Oki, & Terry, 1992), the first recommender system, targeted email overload. That work also introduced the phrase collaborative filtering, which has been an essential technique used by many recommender systems. In the twenty years since, recommender systems have grown massively and are deployed across many commercial systems, offering personalized recommendations and predictions across many different domains, from media consumption (e.g. news, movies, books) to recommending social relationships (e.g. dating websites, following people on Twitter).
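To ground the term, here is a minimal sketch of user-based collaborative filtering: predict a user's rating of an item as a similarity-weighted average of other users' ratings. The ratings matrix is toy data, and treating unrated items as zeros in the similarity computation is a simplification that production systems avoid.

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two users' rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def predict(user, item):
    """Similarity-weighted mean of the ratings other users gave this item."""
    sims, vals = [], []
    for other in range(ratings.shape[0]):
        if other != user and ratings[other, item] > 0:
            sims.append(cosine_sim(ratings[user], ratings[other]))
            vals.append(ratings[other, item])
    if not sims:
        return 0.0
    return float(np.dot(sims, vals) / np.sum(sims))

print(predict(user=0, item=2))  # ~2.1 of 5: user 0's nearest neighbor disliked item 2
```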
For personal data research, recommender systems (similar to user modeling) demonstrate the potential for closed systems to leverage personal data to enable real-world personalization. One interesting dimension that recommender systems bring to the conversation around personal data is the concept of collaborative filtering: dynamically using labeled data captured from a breadth of users to predict another user's behavior or interests. This also draws together one aspect of the research on personal information management: the manual labeling of data by users. Recommender systems, particularly those based on collaborative filtering, demonstrate interactions where users can provide labels for their personal data and receive direct benefits in return for that labeling.

A key challenge facing recommender systems is to integrate different kinds of recommender approaches, including content-based approaches (i.e. using information about the content), collaborative approaches (i.e. collaborative filtering), and contextual approaches (i.e. situational information about the user) (Konstan & Riedl, 2012). This goal is an important component of the personal data vision as well. In most recommender systems, the personal data that is used by the system is personal data that was generated within the system (e.g. ratings, viewing behavior, sharing behavior). However, to fully realize the potential of contextual approaches, recommender systems will need to start to depend on data from outside of their systems as well. Work on context-aware recommender systems represents a movement in that direction (Abbar, Bouzeghoub, & Lopez, 2009; Adomavicius & Tuzhilin, 2011). Most context-aware recommender systems to date focus on immediate context, like the time of day or location (Matyas & Schlieder, 2009; Oku, Kotera, & Sumiya, 2010). Moving forward, recommender systems will need to broaden their focus to include a more holistic view of the user's data and begin to make use of logs of personal data that show routines, changes in behavior, and trends over time (Bobadilla, Ortega, Hernando, & Gutiérrez, 2013). As a result, the work of this dissertation is of direct interest to recommender systems.

2.4 Lifelogging

The research topic of lifelogging first emerged in the mid-1990s, and in many ways it started as a combination of PIM, multimedia, and ubiquitous computing, sharing the same basic PIM inspiration of Bush's Memex (Bush, 1945) but dramatically increasing the amount and kinds of data that might be captured in such a system (i.e. location, video, workstation logging) (Lamming et al., 1994). Other early work in the topic of lifelogging includes Lifestreams, which proposed a new metaphor for dynamically organizing a person's data (Freeman & Fertig, 1995; Freeman & Gelernter, 1996). While the Lifestreams work was particularly focused on documents, it sets out a list of six observations that motivated the work and that remain relevant today:

1. Storage should be transparent
2. Directories are inadequate as an organizing device
3. Archiving should be automatic
4. The system should summarize multiple related documents in a concise overview
5. Computers should make reminders convenient
6. Personal data should be accessible everywhere

With a small amount of interpretation, many of these observations remain applicable when placed in the context of today's landscape of personal data, and the spirit of the goals expressed through those observations has not been met.

Unified around the concept of "total capture," with a primary goal of serving as a memory aid, a variety of lifelogging systems appeared that focused on capturing as much data as possible about individuals' behavior (Hodges et al., 2006; Hori & Aizawa, 2003) and providing usable ways of accessing that data (Adar, Karger, & Stein, 1999; Dumais et al., 2003; Gemmell, Bell, & Lueder, 2006; Gemmell, Bell, Lueder, Drucker, & Wong, 2002). This style of lifelogging work attracted criticism and lost favor in the research world when it became apparent that the collection of these huge archives of disparate data did not lead to compelling applications (Sellen & Whittaker, 2010). Despite these criticisms, a number of commercial systems have emerged in recent years that enable the "capture" portion of lifelogging (e.g. the Narrative camera, http://getnarrative.com/, and the Saga mobile app, http://www.getsaga.com/). One notable exception to this criticism is in more specific populations where there has been demonstrated value in lifelogging, for example for people with memory impairment (Browne et al., 2011; Lee & Dey, 2008), or as a tool for helping and understanding children with autism (Marcu, Dey, & Kiesler, 2012).

In many ways, lifelogging is an attempt at finding a solution (developing the technology) without fully understanding the problem (validating the application area). Lifelogging as a research area communicates an underlying hunch that there must be value in the data that characterizes our lives. However, it lacks a clear need that collecting this data will fill. Lifelogging is a different perspective on personal data: the idea that individuals will drive the collection of their own personal data, perhaps without a specific purpose in mind. This contrasts with user modeling and recommender systems, where the user may not even know that data is being captured and used by the system. Lifelogging research is in some ways similar to PIM: a user-centric focus on the collection and retrieval of personal data. However, where lifelogging and PIM differ is in the volume and use of the data: PIM collects a comparatively small amount of data that the user expects to need later, where lifelogging collects as much data as possible, typically without a specific use in mind.

Lifelogging research faces a difficult duality. On one hand, there is a general hunch that there is value contained within personal data, and it is impossible to harness that value (or even to understand what that value is) without first collecting large amounts of data. On the other hand, lifelogging is a cautionary tale of finding a solution without knowing what problem it solves. One way for personal data research to address these issues is by making it easier for developers and researchers to experiment with different ways of finding value in personal data. Lowering the barrier to entry is likely to surface many more ideas and enable a real-world validation of their utility. This dissertation explores opportunities for reducing the burden on developers for carrying out these steps.
2.5 Context-Aware Computing

Context-aware computing is a research domain that was established in parallel with, and closely related to, ubiquitous computing in the early 1990s to develop computing systems that could capture, process, and react to a person's immediate context (Schilit, Adams, & Want, 1994). A widely used definition of context comes from (A. K. Dey, 2001): "Context is any information that can be used to characterise the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves." Dey also offers a definition of context-aware: "A system is context-aware if it uses context to provide relevant information and/or services to the user, where relevancy depends on the user's task." These definitions are quite general, and Dourish (2004) argues that full context-awareness is intractable, as the relevance of context surely changes from moment to moment.

The Context Toolkit is a very prominent piece of work in this domain (A. Dey, Salber, & Abowd, 2001). The Context Toolkit was a response to the problem that developing context-aware applications was far too difficult because the components of a context-aware system were not modular enough to facilitate reuse. This observation and the resulting requirements for dealing with context are in many ways analogous to the observations made in this dissertation with respect to personal data: developing with personal data is also very difficult, and one of the sources of that difficulty is likewise the lack of modularity and the exposure of too much complexity. (There is far too much work in the realm of context-aware computing to describe in this chapter; (Baldauf, Dustdar, & Rosenberg, 2007) and (Chen & Kotz, 2000) offer surveys of the field.)

Context-aware computing is largely motivated by the vision proposed in Weiser's "Sal" story (Weiser, 1991), in which technology is seamlessly integrated into a user's day, to the point where the technology becomes unremarkable and invisible in use. Despite the fact that an important component of realizing this vision will require longer-term knowledge, for example knowledge of a person's routine behavior (Tolmie, Pycock, Diggins, Maclean, & Karsenty, 2002), context-aware computing has traditionally focused on immediate context based on data that was collected in the short term, often based only on instantaneous sensor readings. This represents an entire category of data that describes information about a person, and thus fits into the category of personal data. From the perspective of personal data, context-aware computing research contributes technical solutions for transforming sensor data into more meaningful personal data. In this way, context-aware computing makes possible the automatic collection of new kinds of personal data, or the improvement of the accuracy or coverage of that data.
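To make that modularity argument concrete, the sketch below shows the general shape of a reusable context component: a widget that hides a raw sensor behind a subscription interface and publishes higher-level context. The class and method names are hypothetical; this is not the Context Toolkit's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reading:
    kind: str     # e.g. "gps" or "place"
    value: object

class ContextWidget:
    """Encapsulates one source of context behind a reusable interface."""
    def __init__(self):
        self._subscribers: list[Callable[[Reading], None]] = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, reading: Reading):
        for cb in self._subscribers:
            cb(reading)

class PlaceWidget(ContextWidget):
    """Abstracts raw coordinates into semantic places; apps never see lat/lon."""
    PLACES = {(40.4433, -79.9436): "work"}  # toy lookup table

    def on_gps(self, lat, lon):
        label = self.PLACES.get((round(lat, 4), round(lon, 4)), "unknown")
        self.publish(Reading("place", label))

widget = PlaceWidget()
widget.subscribe(lambda r: print(f"user is at: {r.value}"))
widget.on_gps(40.4433, -79.9436)  # -> "user is at: work"
```

Because the application only ever sees "place" readings, the underlying sensor can be swapped or improved without touching application code, which is the reuse the Context Toolkit argued for.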
Context-aware computing is largely motivated by the vision proposed in Weiser's "Sal" story (Weiser, 1991), in which technology is seamlessly integrated into a user's day, to the point where the technology becomes unremarkable and invisible in use. Although an important component of realizing this vision will require longer-term knowledge, for example knowledge of a person's routine behavior (Tolmie, Pycock, Diggins, Maclean, & Karsenty, 2002), context-aware computing has traditionally focused on immediate context based on data collected in the short term, often based only on instantaneous sensor readings. This represents an entire category of data that describes information about a person, and thus fits into the category of personal data. From the perspective of personal data, context-aware computing research contributes technical solutions for transforming sensor data into more meaningful personal data. In this way, context-aware computing makes possible the automatic collection of new kinds of personal data, or improvements to the accuracy or coverage of that data.

One important problem that the field of context awareness faces is the challenge of reconciling instantaneous data, which captures part of a person's context in the moment, with the much more expansive logs of heterogeneous data that provide the information necessary to correctly interpret that instantaneous context. To engage this challenge, context-awareness researchers will either need to access and interpret those disparate logs of heterogeneous personal data themselves, or build on the work of others who have done so. The framing of unified personal data includes the concept of collecting long-term logs of data about the user, simplifying this challenge.

2.6 Personal Informatics and Quantified Self

Personal informatics is defined as a class of systems that help people collect personally relevant information for the purpose of self-reflection and gaining self-knowledge (Li, Dey, & Forlizzi, 2010). Personal informatics has emerged as a research area simultaneously with the widespread adoption of smartphones and increased consumer interest in fitness-related wearables. At a high level, personal informatics involves two major steps: collecting data and reflecting on that data. On the collection side, personal informatics has its roots in PIM, lifelogging, and context-aware computing. On the reflection side, personal informatics has its roots in information visualization (Pousman, Stasko, & Mateas, 2007). One example of an early personal informatics system that combines these steps is the UbiFit Garden (Consolvo et al., 2008), which used mobile phones to collect and visualize physical activity information.

Figure 4: Stage-Based Model of Personal Informatics Systems (Li et al., 2010)

Li et al. (2010) proposed a stage-based model of personal informatics systems (see Figure 4), targeted toward behavior change, that identified five stages: preparation, collection, integration, reflection, and action. This work highlighted two important features of the model: the process is iterative, and barriers in earlier stages of the process (e.g. difficulty collecting data, difficulty integrating data from different sources) cascade to impact later stages, perhaps even making those later stages impossible.

The quantified self movement, a group of people interested in self-tracking, has appeared and gained some traction across the world over the last several years, with a small but loyal following. While a few notable individuals have attracted some press for reporting findings from examining their own personal data (e.g. Stephen Wolfram's "The Personal Analytics of My Life"10 and Nicholas Felton's "Feltron Annual Report"11), for the most part examining one's own data remains a fairly uncommon activity that requires the user to have logged her own data and also to have the knowledge, skills, and motivation to turn that raw data into something meaningful or consumable.

10 http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/
11 http://feltron.com/

Research in personal informatics and the growing quantified self movement demonstrate that the process of collecting personal data is a challenge, so much so that it inhibits what information can be collected even when the data already exists. Recently, there has been a sort of call to arms toward data liberation throughout the community.
For example, Deborah Estrin's concept of small data envisions a future where individuals have access to and control over all of their own data, for use however they choose (Estrin, 2014). For personal informatics to continue to grow, it needs to be easier for people to bring together and synthesize their own data, and to do so from more sources. The vision of unified personal data offers one way of accomplishing this.

2.7 Computational Social Science and Data Mining to Understand Human Behavior

With the widespread adoption of cell phones, it has become possible to collect large-scale datasets that capture the dynamics of human movement and social behavior over large populations and/or long periods of time. A major differentiator in this work is whether or not the researchers have access to individuals in the population. Typically, if the researchers have access to individuals in the population, it is because the researchers have recruited the population directly and have collected the data themselves. One early example of this style of work is the reality mining project (Eagle & Pentland, 2006). In this work, 9 months of data was collected from 100 participants, including call logs, Bluetooth proximity to other devices in the study, cell tower IDs (providing rough location), and application usage. Through the collected dataset, the researchers were able to examine individual routines, dyadic behavior across individuals, and organizational behaviors across the entire dataset. Since the time that this original data was collected, subsequent studies have been conducted to examine routines within families (Davidoff, Ziebart, Zimmerman, & Dey, 2011), location dynamics of a heterogeneous sample (Kiukkonen, Blom, Dousse, Gatica-Perez, & Laurila, 2010), shifting behaviors in a residential community (Aharony, Pan, Ip, Khayal, & Pentland, 2011), and mental health in the students of a college class (R. Wang et al., 2014). By engaging in these data collections, researchers have the opportunity to collect participant responses focused on a particular research question, which can then be used as ground truth when developing a model based on the automatically sensed data. These data collections each represent a massive effort on the part of the data collector.

An alternative approach taken by many researchers in the space of computational social science is to obtain and analyze a dataset that already exists, for example (Conti, Passarella, & Pezzoni, 2011; González, Hidalgo, & Barabási, 2008; Onnela et al., 2007; D. Wang, Pedreschi, Song, Giannotti, & Barabasi, 2011). The perspective taken by this work mimics the broader "big data" trend: using anonymized logs, typically from a single data source, to build some insight or observe a broad-scale phenomenon (Lazer et al., 2009). However, there are some major drawbacks to this approach, many of which stem from the lack of access that researchers have to the individuals whose data comprise the dataset. In particular, this lack of access means that the data is often homogeneous (e.g. only call log data, no other data), because connecting multiple sources of data would require knowing who those individuals are, linking data together, and perhaps even requesting permission to do this work.
Additionally, this style of work does not have access to explicitly provided data (e.g. survey responses), only implicitly provided data (e.g. phone call logs from a telecommunications provider). Even if individuals wanted to provide additional information to researchers, there is no mechanism by which to provide that information. As a result, researchers define proxies based on the available data to represent the desired data. However, these proxies are not necessarily validated before use, which can lead to systematic problems when interpreting the data (Wiese, Min, Hong, & Zimmerman, 2015). These challenges (lacking ground truth and a real-world understanding of what these data actually represent) limit the conclusions that can be drawn from this kind of research.

The ability to have anonymized unified data with useful ground truth labels from a large population could have a massive impact on the world. This kind of data could produce important scientific results, and also build insights about the population that can affect city planning, health, and public policy. The challenge of bringing this data together today, even for a single individual, inhibits this progress. The vision of unified personal data described in this dissertation represents an important step towards these goals by bringing together a user's heterogeneous personal data.

2.8 Identity Interfaces: Virtual Possessions and Self-Reflection

As technology has become increasingly integrated into people's everyday lives, many aspects of everyday life that were previously well established in the physical world have begun to bridge into a hybrid physical/digital space, or even a fully digital space. The effects of this shift are incredibly broad and far-reaching, and have changed the way that people communicate, collaborate, create, consume, and collect. One important effect of this shift is the transition of many different kinds of possessions that were previously physical possessions into virtual possessions (Odom, Zimmerman, & Forlizzi, 2010). This shift is significant in part because people's possessions both reflect and contribute to their identities (R. W. Belk, 1988). In contrast to material possessions, virtual possessions are placeless, spaceless, and formless (Odom, Zimmerman, & Forlizzi, 2014). These qualities affect the circumstances under which people manage their possessions, including the process of curating and archiving their possessions (Kaye et al., 2006), the process of spending time with and reflecting on their virtual possessions (Odom et al., 2010), and the legacy that they leave through their possessions (Gulotta, Odom, Forlizzi, & Faste, 2013).

Research on virtual possessions stands distinct from that in personal informatics, though they are in some ways related. Personal informatics is a user-driven, goal-oriented process for collecting personal data (often through explicit action) that describes a user's own behavior. Virtual possessions research is also focused on users interacting with their own data, but the user's motivation for interacting with this data and the provenance of this data are much more fluid components of the user's life: even without explicitly collecting virtual possessions, people have them and interact with them on a regular basis.
The metadata that captures people's interaction with their virtual possessions, and in many cases the virtual possessions themselves, are all personal data. Furthermore, this data could be used as a component of a personal informatics system. As such, many of the challenges, research questions, and exploratory systems within virtual possessions and personal informatics inform each other and the study of personal data at large. For example, the finding that fragmented virtual possessions are problematic for end users (i.e. that they are stored in different, non-compatible applications and services) (Odom et al., 2014) is an important issue for personal data at large. Similarly, the importance of and challenges with leaving a digital legacy (Gulotta et al., 2013) is an important point of consideration for the whole of an individual's personal data archives, even the aspects of that archive that might not be considered "virtual possessions."

Research on identity interfaces examines the way people relate to the personal data that has become an integral part of their everyday lives. Having personal data in a digital format, when contrasted with the previous era where this data was either a material object or did not exist, affords a different way of interacting with that data. Fragmented personal data is a major challenge for identity interfaces. While people present fragmented identities to different social groups (Farnham & Churchill, 2011), the fragmentation of personal data is service-based, not identity-based. Thus, an essential component of moving identity interfaces forward is to bring together service-fragmented data, which will enable research in this domain to continue. Unified personal data offers one solution to this very issue.

2.9 Sharing Context

The widespread adoption of the Internet over the last twenty years has brought the general public the ability to digitally share personal data socially with other people. One major reason to share personal data is to facilitate awareness across colleagues, family members, and close friends. Sideshow (Cadiz, Venolia, & Jancke, 2002), Community Bar (McEwan & Greenberg, 2005), MyVine (Fogarty, Lai, & Christensen, 2004), and ConNexus (J. Tang et al., 2001) collected some awareness information, such as IM status and calendar, and automatically shared that information with contacts in a sidebar interface on the desktop. Awarenex (J. Tang et al., 2001), ContextContacts (Oulasvirta, Raento, & Tiitta, 2005), and Connecto (Barkhuus et al., 2008) are all mobile awareness systems with various representations of location and other data, such as calendar information, ring tone profile, and Bluetooth neighbors.

Location is one type of personal data sharing that has been the subject of a great deal of research. While early location sharing was focused on instrumenting an office or workplace (Want, Hopper, Falcão, & Gibbons, 1992), subsequent studies explored location sharing with colleagues and friends (IMBuddy (Hsieh, Tang, Low, & Hong, 2007)) or with family (Whereabouts Clock (Brown et al., 2007)). Tang et al. explored location sharing from the perspective of the motivation behind sharing location data, focusing on the difference between purpose-driven sharing and social-driven sharing (K. P. Tang, Lin, Hong, Siewiorek, & Sadeh, 2010).
They found that where purpose-driven sharing typically focuses on an exact location, social-driven location sharing was more likely to favor semantic place names over specific geographic locations.

Context sharing brings the technical contributions of context-aware computing toward a user-centric space. Research on context sharing offers insights into mechanisms for sharing personal data and into how people interact with that data, both of which are important dimensions of the broader domain of personal data. Context sharing depends on bringing together the information required to make a contextual inference and then making that inference. The challenges of doing both of these steps are major, and this inhibits more complex context-sharing scenarios (e.g. the "in-common" sharing scenarios described in (Wiese, Kelley, et al., 2011)). Enabling these more complex context-sharing scenarios that depend on multiple pieces of context requires a significant development effort if developers cannot easily build on context inferences that have been developed by others. In this dissertation, Phenom offers an architecture that supports the integration and reuse of many different kinds of abstractions (including inferences) that could be made on personal data.
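As a rough sketch of the kind of reuse this enables, consider packaging each inference as a component that declares the data it consumes, so that one developer's inference (here, a semantic place label) can feed another's (an "in-common" check). The names and interfaces below are hypothetical illustrations, not Phenom's actual design, which is described in later chapters.

    from typing import Dict, List

    class Inference:
        """A reusable abstraction over personal data: declares the data it
        consumes and produces a higher-level value that other inferences
        can build on (hypothetical interface, not Phenom's actual design)."""

        requires: List[str] = []

        def infer(self, data: Dict) -> object:
            raise NotImplementedError

    class SemanticPlace(Inference):
        requires = ["location"]  # consumes a raw data stream

        def infer(self, data: Dict) -> str:
            # Stand-in for a real place model learned from location history.
            return "work" if data["location"] == "office" else "elsewhere"

    class InCommon(Inference):
        # Consumes the output of another developer's inference rather than
        # re-deriving it from raw logs.
        requires = ["semantic_place:user_a", "semantic_place:user_b"]

        def infer(self, data: Dict) -> bool:
            return data["semantic_place:user_a"] == data["semantic_place:user_b"]

    place = SemanticPlace()
    in_common = InCommon().infer({
        "semantic_place:user_a": place.infer({"location": "office"}),
        "semantic_place:user_b": place.infer({"location": "office"}),
    })
    print(in_common)  # True: both users are at the same semantic place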
2.10 Privacy

Privacy is perhaps the most dominant topic when it comes to collecting, using, and sharing personal data. Academic discussions about data protection and personal privacy date back to the late 1960s and early 1970s and have expanded considerably in scope since then. A major challenge in privacy research, and in designing privacy-sensitive systems, is that expectations and perceptions of privacy co-evolve with technology (Iachello & Hong, 2007) and vary across individuals, whose opinions may also change over time (M. S. Ackerman, Cranor, & Reagle, 1999; Westin, 2001). Thus, protecting users' privacy is a moving target, and solutions must allow for some dimension of control across individuals. Furthermore, privacy is often viewed as a tradeoff, for example trading off the risks and benefits of disclosing some information, or the tradeoff between privacy and the public interest (Iachello & Hong, 2007). The disclosure of personal data can offer benefits (either tangible or intangible) for the people who disclose the data and also for the companies that hold the data, but it can also be costly for either or both parties (Brandimarte & Acquisti, 2012).

Personal data privacy has been a particularly active topic in recent public discourse, in large part because of the exposure of the mass data collection of the NSA's PRISM program12, but also because of discomfort caused by behavioral advertising and a rash of recent data breaches. A major concern within this space is that even in cases where individuals are explicitly paying attention to the permissions that they are granting to the applications and services that they use, they often do not fully understand the permissions that they are approving (Kelley et al., 2012).

One way of thinking about user-centric personal data privacy research is through the following categories:

• Understanding the potential privacy risks of disclosing personal data, especially cases where disclosing some data inadvertently leaks other data (e.g. (Acquisti & Gross, 2009))
• Understanding the potential benefits of disclosing personal data (e.g. (Lindqvist, Cranshaw, Wiese, Hong, & Zimmerman, 2011))
• Helping users to understand the meaning behind different privacy options and the tradeoffs of granting or denying access to different kinds of data (e.g. (Kelley, Bresee, Cranor, & Reeder, 2009))
• Providing users with interfaces that enable them to easily express their privacy preferences (e.g. (Klemperer et al., 2012))
• Guaranteeing enforcement of a user's privacy preferences (e.g. (Yang, Yessenov, & Solar-Lezama, 2012))

12 http://www.washingtonpost.com/investigations/us-intelligence-mining-data-from-nine-us-internet-companies-in-broad-secret-program/2013/06/06/3a0c0da8-cebf-11e2-8845-d970ccb04497_story.html

Privacy is an extremely important, and also extremely challenging, topic in user-centric research on personal data, and it must be an integral part of ongoing research on personal data. One major challenge in implementing usable privacy controls is the lack of continuity in setting those controls. Users are forced to specify privacy settings in a fragmented way, specifying these preferences per application. While in some cases this granular control might be desirable, in many cases users would benefit from a simpler, unified interface for specifying these controls. The vision of unified personal data offers the possibility of this kind of unified interface for considering privacy concerns and specifying privacy preferences in a coherent way.

2.11 Research Examples

The previous sections have offered brief overviews of the breadth of research subdisciplines that contribute to a broader understanding of personal data. This section highlights research projects related to personal data that I have engaged in across a variety of these research areas. These projects demonstrate in more detail how research in some of these different domains connects to personal data. Additionally, these projects have served as research probes that have greatly contributed to my task of framing the opportunity of unified personal data.

2.11.1 Personal Information Management: The Contact List Name Field

I examined contact lists with an initial goal of leveraging the structured data within the contact list entries of users' smartphones to infer aspects of the user's relationship with her contacts (Wiese, Hong, & Zimmerman, 2014). To understand the feasibility of this, I collected the contact lists of 54 participants, containing 35,599 contacts. However, to my surprise, 67% of the contact entries that I collected contained either no contact information or only an email address. Most of the remaining 33% of contacts contained only one piece of information, usually a phone number. The majority of contact list fields were unused. Despite the apparent lack of information contained in these lists, a deeper exploration of the content uncovered more subtle structures within the data. Analysis of the contact name field yielded twelve distinct and unexpected naming strategies. This analysis of contact lists from a broad range of 54 participants found that those lists were used in surprising ways and revealed consistent patterns. The behaviors we identified present both a challenge and an opportunity: though usage patterns prevent simple automated approaches for data mining or contact-list merging, they also suggest alternative directions for data mining to understand the behavior of individuals and their relationships with others.
More broadly, the results of this work point to a mismatch between the expected use and actual use of the contact list, a very common interface for interacting with personal data.

2.11.2 Context Awareness: Inferring Phone Placement

One example of context awareness from my own work is using sensors to infer the placement of a device (Wiese, Saponas, & Brush, 2013). Enabling phones to infer whether they are currently in a pocket, in a purse, or on a table facilitates a range of new interactions, from placement-dependent notification settings to preventing "pocket dialing." Phone placement may not seem to be personal data at first glance, but over time phone placement data can be used to characterize the behavior of individuals.

In this work I collected two weeks of accelerometer data from 32 participants' personal mobile devices. Using the experience sampling method (ESM), participants recorded how their devices were being stored in situ. To evaluate algorithms for inferring the placement or proprioception of the phone, I built and evaluated models using features from the in-situ accelerometer data. These models achieve accuracies of 85% for two different two-class models (Enclosed vs. Out and On Person vs. Not) and 75% for a four-class model (Pocket, Bag, Out, Hand). I also explored opportunities to improve the accuracy of the accelerometer-only models, using prototype sensors that leverage capacitive sensing (previously unexplored for this task), multi-spectral properties, and light/proximity sensing. I compared data gathered with these sensors in a laboratory setting, with the resulting models achieving top accuracy levels of 85% to 100%.

This work represents one example of developing a context-aware component of an application. To the extent that a smartphone is associated with a primary user, the place where the user puts her phone is a form of personal data: it describes something about the user's behavior. Furthermore, over time, logs of this data can reveal trends that might offer even more information about the user's behavior.
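The pipeline behind such models can be sketched briefly. The fragment below is only a shape-of-the-solution illustration, with synthetic data, invented window features, and an off-the-shelf classifier; the actual feature set and models are described in (Wiese, Saponas, & Brush, 2013).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def window_features(accel: np.ndarray) -> np.ndarray:
        """Summarize one window of 3-axis accelerometer samples (n x 3).
        Illustrative features only; the study's actual feature set differed."""
        magnitude = np.linalg.norm(accel, axis=1)
        return np.concatenate([
            accel.mean(axis=0),        # per-axis mean
            accel.std(axis=0),         # per-axis variation
            [magnitude.mean(), magnitude.std()],
        ])

    # Synthetic stand-in for ESM-labeled windows ("Pocket", "Bag", "Out", "Hand").
    rng = np.random.default_rng(0)
    labels = ["Pocket", "Bag", "Out", "Hand"]
    X = np.array([
        window_features(rng.normal(loc=i, scale=0.5 + i, size=(128, 3)))
        for i, _ in enumerate(labels) for _ in range(50)
    ])
    y = np.repeat(labels, 50)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(clf, X, y, cv=5).mean())  # four-class accuracy estimate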
2.11.3 Lifelogging and Identity Interfaces: Evaluating Applications that Make Use of Long-Term Location History

A major shortcoming of early lifelogging research was the general lack of applications for the large logs of personal data being collected. The process of testing application ideas and finding value can be a difficult one, but it is an important component of human-centered computing.

Figure 5: An example of a scenario presented during one of the sessions.

In one example from my own work, I developed a set of scenarios that illustrated some potential use cases for applying histories of a user's location, and conducted a needs validation session following guidelines from the "speed dating" design technique (Davidoff, Lee, Dey, & Zimmerman, 2007). Needs validation uses storyboards of different scenarios, in our case depicting different concepts for location histories and future location (see Figure 5), to provide participants with many quick views of possible futures. During a session, a researcher presents the storyboards one at a time to an individual or to a small group of participants. The researcher then follows the storyboard with a lead question that focuses the discussion on the underlying need and away from the specific way the technology in the storyboard shows the need being addressed. By presenting the participants with storyboards showing people like themselves in situations that seem common, this method helps participants draw on their own experience as they visit and reflect on an imagined near future. Participants were invited to share their reactions to the storyboards and to address the corresponding questions, which asked them about specific ways that their own experiences had led them to a need similar to the one addressed in the scenario. Furthermore, participants were told not to think about the technology that would be used to implement these scenarios, but simply to assume that the technology could work.

I brainstormed 36 scenarios, which I refined down to 18 based on redundancy of the underlying need being addressed and on how convincing each scenario was. Once the set was finalized, I thematically clustered the scenarios based on content. These clusters, and results from the needs validation, are described here:

• Icebreaking (3 scenarios): These scenarios describe situations where location history is used to strengthen existing relationships or build new ones. Participants strongly identified with the needs implicit in the icebreaking scenarios (e.g. needing a good topic of conversation when talking with a new person). However, participants were also concerned that using a conversational aid that supplies conversational topics might be awkward, make them look socially inept, or even come across as "stalkerish." Participants felt that this kind of technology would be best suited for professions where social relationships are brief (e.g. nurse, taxi driver), so that building commonality earlier is better. Recent work has continued to explore this concept of computer-supported icebreaking (Nguyen, Nguyen, Iqbal, & Ofek, 2015), though not necessarily using location history.

• Future – intersections and obstacles (3 scenarios): These scenarios address different aspects that affect one's plans in the short or long term, including things that might inhibit those plans or that one might want to incorporate into them. There are fewer future-location scenarios than location-history scenarios because of differences in how easily that data can be obtained accurately, which affects technical feasibility. Participants reflected a clear need for monitoring how different logistics might affect their future plans, and they also strongly identified with being able to take advantage of serendipitous overlap with friends whom they had not seen for a while. One concern expressed by some participants was that they take pride in "having it together" and being prepared for different situations, and they feared this kind of technology might diminish that.

• Identifying a person by time and place (2 scenarios): These scenarios explore the idea that there are some situations where one would want to contact the people who were around them in a particular place at some time in the past, but does not have contact information for them. Scenarios here provide functionality that is otherwise not available in the real world.
One place where this functionality is partially available is Craigslist "Missed Connections"13, which allows people to post a message in the hope that the person they saw at a particular place will see it. Participants were not at all interested in the scenario supporting missed connections; however, they were much more interested in situations where they had spoken with somebody but had not exchanged contact information. Also, when the scenario motivation was functional rather than social (e.g. who left the meeting room messy, or who saw the car accident), it was no longer a problem if they had not spoken. However, one important issue here was that participants did not want the barrier for a stranger to contact them to be too low.

• Personal traces (7 scenarios): These scenarios build on data from interviews with early adopters, which suggest that there is value in having access to one's own location history. Participants' responses to these scenarios were in many ways neutral: they had no problem with the scenarios, but they also were not really sure how much value the scenarios offered. This differed from the early adopters, who had expressed that these kinds of scenarios were a major motivation for their use of location-logging applications. One explanation for this disparity is that the real value in these scenarios comes from being able to see your own data, so speed dating might not be the best way to evaluate them, because participants are not looking at their own data. Additionally, these scenarios are usually easier to implement: they depend only on the location logs of an individual, while many of the other scenarios would require wide adoption in order to realize their potential.

• Mining existing social networks for location overlap (3 scenarios): These scenarios use existing social connections to share information, experience, or interest in a place. They differ from icebreaking in that their primary goal is not to strengthen the relationship, though that could certainly be a byproduct. Participants expressed a strong desire for these scenarios, particularly social place recommendations. Participants expressed that this would reduce the need to read and write reviews (i.e. if you see that somebody you know has been to a place, you can just ask them). Additionally, discussion of these scenarios revealed an additional unmet need: the need to identify previously unknown common interests with friends.

13 http://pittsburgh.craigslist.org/i/personals?category=mis

The results from this work demonstrate the importance of creating low-fidelity prototypes of the future, particularly where personal data is concerned. At a high level, this exercise demonstrated that location history logs have the potential to offer real value to consumers. Even today, six years after that study was run, location data is mostly used in the form of "present location," with very little use of past or planned/projected future location. With longer location logs and better ways of managing this data, a lot more seems possible. On the other hand, there were numerous issues and concerns associated with many of these scenarios that reveal the challenge and the complexity of working with personal data.
2.11.4 Identity Interfaces: Mailing Archived Emails as Postcards14

14 Work done in collaboration with Jennifer Olson, Dan Tasse, David Gerritsen, Tatiana Vlahovic, William Odom, and John Zimmerman

Recent research speculates that changes to the form or behavior of virtual things might increase people's perceptions of their value (Odom, Zimmerman, & Forlizzi, 2011). To investigate this further, we designed and deployed a technology probe that radically altered the form and presentation of potentially valuable elements within people's massive email archives by sending them physical postcards of email snippets. We interviewed participants, probing to understand the properties of cards that did and did not encourage self-reflection, a behavior shown to be associated with value creation (Odom et al., 2011) and one that reflects the meaningfulness of an item.

For the technology probe, we created a piece of software to extract potentially meaningful snippets from a person's archive based on several heuristics, and chose photos from Google Image Search using the generated snippet as a search term. We conducted the study over a three-month period. We sent a postcard (with the image and the snippet) to each participant at a random interval of between 7 and 10 days, so that they were likely to receive each new card on a different day of the week. We conducted three in-home interviews with each participant: at the beginning of the study, one month in, and three months in.
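The dissertation does not enumerate the snippet-selection heuristics, so the sketch below is purely illustrative of what such heuristics might look like: invented scoring rules (emotional words, message age, postcard-friendly length) rather than the probe's actual logic.

    import re
    from datetime import datetime
    from typing import List, Tuple

    def score_snippet(sentence: str, sent_date: datetime, now: datetime) -> float:
        """Heuristically score how evocative a sentence might be. These
        rules are invented for illustration; the probe's actual heuristics
        are not enumerated in this dissertation."""
        score = 0.0
        if re.search(r"\b(remember|love|miss|congrat|happy|funny)\b", sentence, re.I):
            score += 2.0                                # emotionally loaded words
        score += min(sentence.count("!"), 2) * 0.5      # enthusiasm markers
        years_old = (now - sent_date).days / 365.0
        score += min(years_old, 5) * 0.3                # older messages may surprise
        if not (20 <= len(sentence) <= 140):
            score -= 1.0                                # postcard-sized snippets only
        return score

    def pick_snippets(messages: List[Tuple[str, datetime]], k: int = 3) -> List[str]:
        now = datetime(2012, 1, 1)  # hypothetical study date
        ranked = sorted(messages, key=lambda m: score_snippet(m[0], m[1], now), reverse=True)
        return [text for text, _ in ranked[:k]]

    print(pick_snippets([
        ("Do you remember that night at the lake?!", datetime(2006, 7, 4)),
        ("Please find the attached invoice.", datetime(2011, 3, 2)),
    ]))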
During the interviews, the postcards caused participants to reflect on events, people, and humorous memories or jokes that were related to the snippets. However, many cards also left participants feeling bemused or disinterested because they could not place the card in context. In these instances participants looked to the images on the postcards, but because of the loose coupling between snippet and image, these were not helpful aids. Most participants also felt uncertain about where to place a card after it arrived. In contrast with email, when the snippet arrived as a material postcard, the fact that the message was in their hand forced them to evaluate it from a new perspective in order to determine its next location in the world.

This technology probe offers evidence for a variety of insights that further our understanding of how people relate to their personal data. At a high level, participants had clearly not thought about or engaged with their vast email archives. Even email, among the most accessible forms of personal data for end users, remains a mostly untapped archive full of rich memories. Another insight offered by this work concerns the fragmented state of personal data today, brought into sharp relief by how difficult it was to contextualize the email snippets. If personal data were less fragmented, it might have been easier to select meaningful snippets, and we could have better contextualized the snippets for users. Perhaps the postcard photos could themselves have had more contextual meaning, even coming directly from the participants' photo archives. Finally, this technology probe demonstrates strong limitations on the ability to interpret personal data. Many systems attempt to draw insights from personal data automatically. In this probe, even the participants, who should theoretically be the gold standard for interpreting information from their own email archives, often struggled to interpret those archives. Moving forward, to realize value by interpreting personal data, the people who are the subject of that data must be involved.

2.11.5 Context Sharing: Facilitating Workplace Awareness through myUnity

I deployed myUnity, a cross-platform awareness system designed to support awareness, communication, and collaboration in an office environment (Wiese, Biehl, Turner, van Melle, & Girgensohn, 2011). Where previous systems were platform-centric, myUnity supported both mobile and desktop environments, both for sensing information and for presenting that information. myUnity brings together personal data from disparate sources, including vision-based office activity, mobile phone location, desktop location via the network, calendar, IM presence, and phone call status. The myUnity server aggregated these disparate data sources into a higher-level abstraction of "presence," which served as a proxy for how accessible an individual in the system was. Results from a deployment of myUnity highlighted the value of connecting to multiple data sources and of automating the sharing process.

From the perspective of personal data, there are several valuable takeaways from this work. First, the value of a service (in the case of myUnity, awareness) can be amplified by including multiple data sources rather than a single one. Furthermore, even simple processing to make a higher-level inference (e.g. presence) can be very useful for helping people find value in the data. Finally, much of the value in myUnity came from the automated nature of the sharing process.
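To illustrate what such a higher-level abstraction can look like, the sketch below collapses several awareness signals into a single presence label using an ordered set of rules. The rules and signal names are invented for illustration; they are not myUnity's actual aggregation logic.

    from typing import Dict, Optional

    def infer_presence(signals: Dict[str, Optional[str]]) -> str:
        """Collapse heterogeneous awareness signals into one presence level.
        A rule-ordering sketch only; not myUnity's actual aggregation."""
        if signals.get("phone_call") == "active":
            return "busy: on a call"
        if signals.get("office_activity") == "person detected":
            return "in office"
        if signals.get("desktop_network") == "on campus":
            return "on site, away from desk"
        if signals.get("calendar") == "in meeting":
            return "in a meeting"
        if signals.get("mobile_location"):
            return "remote (" + signals["mobile_location"] + ")"
        return "unavailable"

    print(infer_presence({
        "phone_call": None,
        "office_activity": "person detected",  # vision-based sensing
        "desktop_network": "on campus",
        "calendar": None,
        "mobile_location": "downtown",
    }))  # prints "in office": the highest-priority matching rule wins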
2.11.6 Context Sharing: Understanding User Motivations for Sharing Location Using Foursquare

We examined social location sharing on the check-in-based location sharing site Foursquare (Lindqvist et al., 2011). Foursquare is typically considered to be the first successful social location-sharing service. In this work we conducted interviews with early adopters and deployed two surveys to understand the reasons why people use Foursquare. One major goal of this work was to gain perspective on the factors that led to the success of this site in the domain of location sharing, where many others had previously failed.

We found a variety of reasons why people used Foursquare: to have fun and earn badges, to facilitate social connection, to discover new places, and to keep track of where they had been previously. While it seems that over time the novelty of the social gaming wears off, the value of the other motivations persisted. Perhaps most notable among these findings is that there was not one specific "killer app" that led to the success of Foursquare where so many location-based applications had previously failed. Instead, there were a variety of reasons that people were using Foursquare, which combined to make the site successful. This combination of motivations seems to have helped Foursquare overcome the difficult chicken-and-egg problem that plagues many services: a service can offer an exciting new feature once it has built up users and data, but it can only build up the user base if it offers value to users to begin with. This is a challenge that extends beyond location data and applies more generally to applications and services that depend on personal data.

2.11.7 Privacy: Understanding Privacy Preferences by Investigating Self-Censorship

I have explored users' privacy decisions by investigating their decisions not to post content on Facebook, a phenomenon we termed self-censorship (Sleeper et al., 2013). In this work we asked participants to take note of when they considered sharing a piece of content on Facebook but instead decided not to. Participants were instructed to send a quick text message whenever this occurred, with a few words to describe the situation (Brandt, Weiss, & Klemmer, 2007), and then to complete nightly surveys that described the situations in more detail. We then conducted and coded semi-structured interviews with participants. The findings of this work indicated that in many cases, participants chose not to share content because it would have required too much effort to specify the subset of people with whom they wanted to share. Instead of Facebook's manual list-based sharing controls, participants wanted to be able to specify a target sharing group more dynamically, using factors such as life facet (specific work/school, family), demographics (age, gender, geography, race), tie strength, and the person's relationship with the post (i.e. will this person be interested?).

This work has broader implications for personal data. First, if privacy controls are inadequate for capturing users' preferences within a service, this may lead to decreased usage of the service, and hence less overall value for the user. This may be especially significant for new and less established applications and services, which depend on a critical mass of engaged users in order to succeed. The second implication is that personal data could make possible the kind of sharing controls specified above. Specifically, the dynamic factors that specify the target sharing groups refer either to information about the person being shared with, or to the relationship between that person and the sharer. These are both types of personal data, and thus a system with access to this personal data might enable a new class of privacy controls.
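A sketch helps show why these factors map naturally onto personal data. Below, a sharing audience is expressed as a predicate over relationship features rather than a hand-maintained list; the data model and field names are hypothetical illustrations, not a design proposed by the study itself.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Contact:
        name: str
        life_facets: List[str]   # e.g. ["work", "family"]; illustrative field
        age: int
        tie_strength: float      # e.g. inferred closeness on a 0-1 scale

    # A dynamic audience is a rule evaluated per post, not a static list.
    def audience(rule: Callable[[Contact], bool], contacts: List[Contact]) -> List[str]:
        return [c.name for c in contacts if rule(c)]

    contacts = [
        Contact("Alice", ["work"], 34, 0.9),
        Contact("Bob", ["family"], 61, 0.8),
        Contact("Carol", ["work", "hobby"], 29, 0.3),
    ]

    # "Close work contacts roughly my age" expressed once, reused per post.
    print(audience(lambda c: "work" in c.life_facets
                   and c.tie_strength > 0.5 and 25 <= c.age <= 45, contacts))
    # prints ['Alice']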
2.12 Discussion

This chapter has offered a brief overview of a variety of research domains that relate to personal data. Particularly striking is the broad variety of perspectives involved: active research engages personal data across many different domains, often with a tenuous or even totally absent connection between those domains. This lack of coherence across personal data research is counterproductive, and even potentially harmful. Research in one domain wrestles with issues and challenges that have already been explored in a different domain. Today the canon of personal data research is extremely scattered; if a researcher wanted to think about her research project holistically with respect to personal data, it would be very difficult even to know where to start. How does a particular piece of research relate to the broader landscape of personal data? What are the major issues that exist around personal data? Who are the stakeholders involved? What solutions already exist to the challenges that I am encountering?

Answering questions like these requires considering personal data as its own topic, from a holistic perspective. In the early days of research that involved personal data, before smartphones and the Internet, a research system could be completely self-contained. In this way, research with personal data was simpler then than it is today. For example, user modeling researchers could collect data within the context of their particular experimental system, and that data was sufficient for pushing the field forward. Users did not have other data that could potentially be added to the system; it simply did not exist yet. One research system was unlikely to need the data from a different system, and the participants in an experiment with one system were unlikely to be participants with another system anyway. In the case of context-aware systems, researchers could focus on only one type of data: the data in their own systems. Furthermore, they could focus only on immediate context: data that had been sensed immediately or in the short term.

Current research with personal data has stagnated in large part because it continues to follow these trends. Researchers use data from one or two sources. They often have to collect the data themselves using ad-hoc, one-off systems designed specifically for their study. When the study is done, the infrastructure used to collect that data, and the application that was built on top of the personal data, all die with that particular study. Where this was once an acceptable and reasonable approach to conducting research, it no longer makes sense. Today, people have large amounts of personal data built up across a variety of applications, devices, and services, and failure to take advantage of this is at best a missed opportunity. Across the board, work on personal data is pushing up against the limitations of this approach: virtual possessions, context sharing, personal informatics, and user modeling are all dealing with various formulations of the problem that there is no uniform way of accessing or working with personal data.

The solution to these challenges will not come from a single research project or line of inquiry. The space of personal data is complex and multifaceted, and advancing the way that it is handled today will require a holistic approach with advances across many disciplines. In short, personal data needs to be established as a separate research area, bringing together researchers of many backgrounds to focus on and solve the challenges present in this very important aspect of computer science. A holistic, multidisciplinary approach to personal data will lead to stronger research contributions across the board, and establishing standard tools, approaches, and protocols for working with personal data will benefit the entire research community.

3 A Case Study: Inferring Sharing Preferences Using Communication Data

A central claim of this dissertation is that working with personal data is an unnecessarily arduous process. The previous chapter lays the groundwork for this at a high level: the current ad-hoc process of working with personal data inhibits many research areas. Working with personal data is challenging for a wide variety of reasons, and efforts to improve the state of personal data require an understanding of these challenges.
How, specifically, does the current ecosystem of personal data inhibit research and application development? Building this understanding is itself a challenging task. Application developers and their companies make many decisions (often implicitly) throughout the software development process, and even before it begins, based on myriad issues and considerations around personal data. Even with complete access to the entire software development process, there is no expedient way to capture those decisions in order to understand these challenges. Exploring this question from the perspective of research offers a different view, with some tradeoffs. Research applications can ignore or temporarily solve issues around developing with personal data (e.g. user adoption, privacy concerns), which distances them from the reality of deploying a production system. At the same time, research is often intended to push the boundaries of what is possible, which can offer a stronger perspective on how the current state of personal data limits the imagined future of a research vision.

This chapter documents my work to infer social sharing preferences using people's communication history. In the context of this thesis, the ability to infer social sharing preferences from automatically collected communication logs is an example of translating low-level personal data into a higher-level understanding of the user (i.e. with whom is she willing to share sensitive personal information?). The process documented here offers numerous concrete examples of the challenges of working with personal data.

On a functional level, the ability to infer sharing preferences from automatically collected data offers real value for users. It is often reported that people are unlikely to adjust default privacy settings, sometimes even choosing not to share at all rather than adjusting their sharing preferences (Sleeper et al., 2013). Automatically inferred sharing preferences have the potential to avoid sharing too much or too little information, both of which can be harmful for users: while over-sharing can annoy others, cause embarrassment, or even lead to job loss, under-sharing has social consequences as well, including missed opportunities for connection and social support.

Figure 6: The goal of the research in this chapter is to use communication logs (call and SMS logs) to infer sharing preferences, using tie strength as an intermediate representation. Theoretical literature supports the connection between communication logs and tie strength, and also the connection between tie strength and sharing preferences. Additionally, past work has demonstrated that communication behavior corresponds to tie strength. Therefore, I focused first on the connection between tie strength and sharing preferences before attempting to replicate the prior finding connecting communication behavior to tie strength.

The insight that communication behavior has the potential to predict sharing preferences is based on a combination of two different findings in the HCI and social science literature. The first finding connects communication behavior with the social science construct of tie strength (informally, the strength of the relationship between two people): more communication between two people indicates a stronger tie between them (Granovetter, 1973). This finding applies across all communication between two people, including in-person communication.
Not only does social science theory support the connection between tie strength and amount of communication, but recent work has demonstrated this connection in social media (Gilbert & Karahalios, 2009). Furthermore, a number of recent research projects have used communication frequency as a direct proxy for tie strength (Conti et al., 2011; Miritello et al., 2013; Onnela et al., 2007; D. Wang et al., 2011). Automatic detection of strong ties has many potential benefits. Social support from strong ties has been associated with mediating the occurrence and severity of depression (N. Lin & Dean, 1984), as well as with finding employment after losing a job (Burke & Kraut, 2013). Automatic detection of strong ties could also be useful for a variety of user interface personalization tasks: determining notification preferences, sorting contact lists, or setting sharing preferences.

The second finding relevant to communication behavior and sharing preferences comes from the theoretical literature on sharing. Belk distinguishes two sharing motives. When "sharing in," people share things with people they feel close to, or desire to feel closer to, as a way of strengthening the relationship. "Sharing out" involves interactions with people outside of close social boundaries and is generally more like gift-giving or commodity exchange (R. Belk, 2010). However, unlike tie strength and communication, the HCI literature had not explored the connection between tie strength and sharing preferences.

With the connection between communication and tie strength already established in the literature, this chapter demonstrates a connection between features of social relationships and users' preferences for sharing different kinds of personal information (Section 3.1). However, using phone and SMS logs as communication data, this work could not predict the entire chain (going from communication behavior to tie strength to sharing preferences; Section 3.2). Specifically, phone and SMS log data was not sufficient to accurately predict strong ties. Altogether, this process highlights many of the challenges and complications inherent in working with personal data.
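Before turning to the studies, it is worth making concrete what "using communication logs" means in practice. The sketch below aggregates a call/SMS log into the kind of per-contact frequency features that prior work has used as direct proxies for tie strength. The log format and feature names are invented for illustration; the features actually evaluated in this chapter are described in Section 3.2.

    from collections import defaultdict
    from typing import Dict, List, Tuple

    def communication_features(log: List[Tuple[str, str, int]]) -> Dict[str, Dict[str, float]]:
        """Aggregate (contact, channel, seconds_or_length) records into
        per-contact features. Illustrative only; see Section 3.2 for the
        features actually used in this work."""
        features: Dict[str, Dict[str, float]] = defaultdict(lambda: {
            "call_count": 0, "call_minutes": 0.0, "sms_count": 0})
        for contact, channel, amount in log:
            if channel == "call":
                features[contact]["call_count"] += 1
                features[contact]["call_minutes"] += amount / 60.0
            elif channel == "sms":
                features[contact]["sms_count"] += 1
        return features

    log = [("mom", "call", 1200), ("mom", "sms", 40), ("plumber", "call", 90)]
    feats = communication_features(log)
    # Naive proxy: rank contacts by total interactions. The rest of this
    # chapter shows why such proxies should be validated before use.
    print(sorted(feats, key=lambda c: feats[c]["call_count"] + feats[c]["sms_count"],
                 reverse=True))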
3.1 Connecting Features of Social Relationships to Sharing Preferences

The study presented in this section explores salient features of interpersonal relationships that predict a user's preference for sharing personal information, such as location, proximity to another person, and activity. Specifically, this study examines the association between several factors (e.g. collocation frequency, communication frequency, closeness, and social group) and preferences for sharing specific kinds of information. In this online study, participants provided basic demographic information and a list of friends. They then associated each friend with relevant social groups, rated their perception of closeness with each friend (tie strength), and stated a willingness to share information with each individual in 21 different sharing scenarios.

3.1.1 Method

To recruit participants, I posted ads on several nationwide online bulletin boards and through two study recruiting websites. Prospective participants were selected based on several criteria:

• Age (20-50): This age range includes different life stages, especially with respect to being a parent or child within an immediate family.
• Occupation (non-student): Students were excluded because they do not easily allow distinctions between work and school groups.
• Social network membership (members of Facebook with at least 50 Facebook friends): This was a source for generating friends' names for the study. Additionally, membership in a social networking site indicates that participants are more likely to want to share information about themselves with people they know, allowing us to observe differences in their sharing preferences (as opposed to a person who does not want to share at all).
• Mobile device usage (must have a smartphone): Participants with smartphones were more likely to understand the potential value and risks of the sharing scenarios.

Participants were compensated $20 for completing task 1, and $60 for completing tasks 2 and 3 (listed below, and described in more detail in the following sections). The data collection took place online, and participants were given two weeks to complete all parts of the study. Participants completed three distinct activities:

1. Generating a list of friends
2. Describing each friend in terms of closeness and affiliation with different groups
3. Stating willingness to share different kinds of information with each friend

Generating lists of participants' friends

To ensure that participants would answer questions about friends who varied in social group and in closeness, I asked participants to provide two lists. The first list was intended to target potential strong ties, and was generated from categories that I derived from qualitative work on relationships (McCarty, 2002; Spencer & Pahl, 2006). The categories were:

• People you currently live with (5 people maximum)
• Immediate family members (5)
• Extended family members (10)
• People you work with (10)
• People you are close to (10)
• People you do hobbies or activities with (10)

I instructed participants to avoid duplicates. The second list consisted of all of their Facebook friends. I provided participants with instructions on how to download this information from Facebook. The final friend list included everyone from the first list (typically fewer than 40 people), plus a random sampling from the Facebook friend list to reach 70 total friend names. Each list was checked for duplicates and for names that the participant did not recognize. If any were found, they were replaced with randomly selected names from the Facebook friend list. This final list of 70 names is referred to as the "friend list."

Figure 7: The instructions for the grouping activity.

Describing each relationship

Next, participants provided information about their relationship with each person on their friend list. The complete list of data collected per friend is in Table 1. I organized this information into two categories: data that would be easily observable from within a UbiComp system or social networking site, and data that would require more work, either to infer from observable features or for the user to express manually. Participants indicated tie strength by answering the question "How close do you feel to this person?" on a 1-5 Likert scale. This approach is similar to the one taken in work by McCarty (2002).
Table 1: Data collected for each friend. Data in the top half of the table ("observable features") is data that is potentially observable by a UbiComp system or social networking site. Data in the bottom half of the table would either be inferred from the observable features or manually inputted by the user.

Observable features:
• Friend sex: Male/Female
• Friend age: Rounded to nearest year
• Years known: Rounded to nearest year
• Frequency seen: Likert 0-7: less than yearly (0), yearly, yearly-monthly, monthly, monthly-weekly, weekly, weekly-daily, daily (7)
• Frequency communicated with electronically: Likert 0-7 (same scale)

Non-observable features:
• Closeness (strength of tie): Likert 1-5: very distant (1), distant, neither distant nor close, close, very close (5)
• Group: Participant-dependent; however, each group was put in a pre-specified category

Next, participants detailed their mutual affiliations with each friend by placing them into groups. The interface (see Figure 7) allowed participants to create personalized groups. In addition, it required them to classify each group into one of 12 predetermined categories: neighborhood, religious, immediate family, extended family, family friend, know through somebody else, work, school, hobby, significant other, trip/travel group, and other. I developed these categories based on a combination of literature sources (McCarty, 2002) and data from previous work on grouping friends in social network sites (Kelley, Brewer, Mayer, Cranor, & Sadeh, 2011). I instructed participants to indicate at least one group affiliation for each friend, and encouraged them to indicate multiple group affiliations when relevant. For example, if a participant and her friend went to college together, and they both attend or attended the same church, the participant would place that friend in two groups. The result is a set of affiliations and, for each affiliation, all of the people on the friend list who are associated with it.

Sharing scenarios

Finally, participants indicated their willingness to share information with each friend in the context of 21 different information-sharing scenarios (see Table 3). To develop the final list of scenarios, I first brainstormed over 100 different UbiComp scenarios in which individuals could share information such as location, activity, calendar, history, and photos. I grouped the scenarios into 11 categories based on the type of information being shared. I assembled these scenarios in a survey and posted it on Amazon's Mechanical Turk, with two questions for each scenario:

• How often do you currently share this information (whether with one person or with many people): never, seldom, sometimes, frequently, constantly
• How useful is it to you to share this information with somebody you know, answering for maximum usefulness: totally useless, somewhat useless, neither useless nor useful, somewhat useful, totally useful

I used the results from this survey as a guide to reduce the list of 100 scenarios down to 21 specific scenarios. The survey results allowed me to pick scenarios involving information that respondents found more useful to share. Further, among the information that would be useful to share, I selected for a range of current sharing practices, including a mix of information that people currently do and do not share. The resulting list fit into five different categories: current personal location (7), personal location history (5), calendar and location plans (7), communication activity (1), and social graph information (1). See Table 3 for the final set of scenarios used.
For each of the 21 scenarios, I asked participants to indicate their willingness to share information with each of their 70 friends using a 5-point Likert scale (labels: 1-definitely not, 3-no preference, 5-definitely). I adapted this method from past work (Olson, Grudin, & Horvitz, 2005).

3.1.2 Findings

Forty-two participants completed the study. Their occupations ranged from education and engineering to administration and legal. I eliminated three problematic respondents who each showed no variance for more than 65 of their 70 friends; that is, each individual friend received the same rating for every sharing scenario. These participants appeared to have simply rated the sharing scenarios as quickly as possible. Of the remaining 39 participants, 28 were female and 11 male, with ages ranging from 21 to 49 (M=29.8, SD=6.4).

Table 2: Linear regression models predicting sharing and closeness (last column only), controlling for each participant. Each column is a different model, and data in the table are non-standardized β coefficients, except for R2 in the last row, which can be compared across models to show the variance explained. For example, the "close" model (fourth column) includes one effect, friend closeness, and this model accounts for 63% of the variance in sharing preferences. Gray cells indicate effects that were not included for that particular model. The data indicate both that closeness is the best predictor of sharing and that observable features can predict closeness. Significance: *p<0.05; **p<0.01; ***p<0.001

Modeling sharing preferences

Differences in participants' mean sharing answers indicated a range of individual privacy/sharing preferences (M = 2.83 out of 5, where 5 is "definitely willing to share this information with this person"; SD = 0.66). To address the question of which relationship characteristics predict sharing preferences, I conducted a mixed-model analysis of variance predicting sharing as the outcome variable (see Table 2; note that the variables "user age" and "user sex" refer to the study participants). This analysis accounts for the non-independence of observations within each participant. Running this analysis with different models allows for an exploration of which combinations of independent variables explain the most variance in participants' sharing preferences.

All of the regressions in Table 2 were done at a per-friend level of analysis: the models use the mean sharing value across all scenarios for each friend (n=2730) as the dependent variable, and the features that described each relationship were effects in the models. The models included the participant as a random effect to account for the non-independence of ratings within each participant. The first column of Table 2 shows means and standard deviations for all continuous effects in the model.

The second column in Table 2 (model name = user) is a model that has no effects except for the effect of the participant (which accounts for individual differences among participants). The result shows that some of the variance relates to individual differences, indicating general predispositions toward sharing (R2 = 0.36). Models that additionally accounted for participant-level effects of sex and age performed poorly.
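As an illustration, this kind of analysis can be expressed with a mixed-effects model in Python's statsmodels (the dissertation does not state which statistics package was used); the file, variable, and column names below are assumptions.

import pandas as pd
import statsmodels.formula.api as smf

# df: one row per (participant, friend), n = 2730; `sharing` is the mean
# sharing rating across the 21 scenarios for that friend.
df = pd.read_csv("sharing_by_friend.csv")

# The "user" model: intercept only, with the participant as a random
# effect to capture individual differences.
user_model = smf.mixedlm("sharing ~ 1", df, groups=df["participant"]).fit()

# The "close" model: closeness as the sole fixed effect.
close_model = smf.mixedlm("sharing ~ closeness", df,
                          groups=df["participant"]).fit()
print(close_model.summary())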
Modeling sharing preferences with non-observable features

The third, fourth, and fifth columns in Table 2 show models whose effects include only the non-observable data. For these analyses, I sorted group categories into the three descriptive "life modes" (family, work, and social) identified by Ozenc and Farnham (2011), which they suggest are the primary areas of a person's life. Closeness by itself turns out to be a very strong predictor of sharing preferences (model name = close, R2 = 0.63), with each 1-point gain in closeness accounting for a 10% increase in the sharing outcome. This means that a friend at closeness 5 (top closeness) is 40% more likely to be shared with than a friend at closeness 1 (bottom closeness). The regression that had only life modes as a predictor did not account for as much of the variance as closeness alone did (model name = mode, R2 = 0.48), with membership in the family, work, and social modes accounting for a 12%, 3%, and 3% increase in likelihood to share, respectively (note that all friends were categorized into at least one of these modes). This means that just knowing which of these categories a contact is in is not particularly helpful for predicting sharing preferences. Finally, adding life mode to closeness resulted in only a slight increase in performance over closeness alone (model name = non obs, R2 = 0.65), and in a loss of significance for the "social" and "work" effects: closeness and family were all that mattered in this model, with participants being more likely to share with contacts they are closer to and with contacts who are family members.

Modeling sharing preferences with observable features

The previous section discussed models based on relationship features, like closeness, that are not immediately observable. How well do observable features predict sharing? These observable features (see Table 2) include friend age, friend sex, years known, frequency seen, and frequency communicated with. I call these features observable because current UbiComp systems are capable of capturing them from existing social network data, or from sensor and communication logs. As such, by testing these features, I can evaluate how well a fully automated system might perform at predicting sharing preferences.

This model performed well (model name = obs, R2 = 0.57), though still not as well as the model with just closeness. Significant effects included friend age (0.2% less likely to share per year), frequency seen (1.4% more likely to share per point increase), frequency communicated with (3.6% more likely to share per point increase), and years known (0.6% increase per year known). The only feature that was not predictive was friend sex.

The model also included four interactions. First, I included the interaction between participant and friend sex and the interaction between participant and friend age to see if homophily accounted for sharing preferences (are people more likely to share with others of the same gender, or with others who are closer in age?), but neither of these interactions was significant. The next interaction was between years known and participant age, which I included because I hypothesized that the proportion of a person's life for which they have known another person might be a useful indicator. This did have a very small effect, indicating that how long a friend had been known mattered more for younger participants.
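The observable-features model can be expressed in the same framework; a sketch of the formula with the four interactions just described follows (the `:` syntax denotes an interaction; variable names remain illustrative, and df is the hypothetical per-friend DataFrame from the previous sketch).

import statsmodels.formula.api as smf

obs_formula = (
    "sharing ~ friend_age + friend_sex + years_known"
    " + freq_seen + freq_comm"
    " + user_sex:friend_sex"    # sex homophily
    " + user_age:friend_age"    # age homophily
    " + user_age:years_known"   # share of life the friend has been known
    " + freq_seen:freq_comm"    # collocation vs. mediated communication
)
obs_model = smf.mixedlm(obs_formula, df, groups=df["participant"]).fit()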
Finally, the model included an interaction between frequency seen and frequency communicated with. I hypothesized that some strong ties are communicated with much more often than they are seen (e.g., family who do not live nearby); similarly, some weak ties are seen often but not communicated with particularly frequently (e.g., one might see coworkers often, but not communicate with them outside of work). This interaction was also significant, revealing that communication is a stronger indicator of willingness to share when collocation is less frequent.

Modeling sharing preferences with observables and non-observables

The next model includes both observable and non-observable features. This model (model name = obs+close, R2 = 0.65) includes all of the observable features (and the interactions described in the previous section), and also includes closeness. This model explains 65% of the variance in sharing preferences, an improvement over the 57% explained by the model that included only observable features, without closeness. Closeness has nearly the same effect in this model as it does in the closeness-only model, with each point in closeness increasing the likelihood to share by 8.8%. Frequency seen is no longer significant in this model, nor is the interaction between frequency seen and frequency communicated with. Additionally, frequency communicated with has less of an effect in the model (0.8% more likely to share per point increase, down from 3.6%).

The final model (model name = all) added life mode to the obs+close model described above. Including all features in the model led to almost no difference in the variance explained (R2 = 0.66, compared with R2 = 0.65 for the obs+close model), and the model effects were nearly identical to those in the previous model. A model that kept all 12 group categories distinct, instead of grouping them into the 3 life modes, was comparable (R2 = 0.67). Overall, the models with closeness explained more of the variance in sharing preferences than any of the models without closeness, and adding closeness results in the loss of significance for other effects in the model.

Predicting closeness using observables

Since closeness is such a predictive feature, it is worth examining how well the observable features of each relationship predict closeness. I used the same approach as before, a mixed-model analysis of variance controlling for participant as a random effect, but this time with closeness as the outcome. I included all observable effects from the other models. This model was quite effective (R2 = 0.70, last column of Table 2). Significant effects in this model included friend age (0.2% less close per year), frequency seen (4.2% closer per point increase), frequency communicated with (6.8% closer per point increase), and years known (0.6% closer per year). The interaction between frequency seen and frequency communicated with was also significant, showing that communication has a much stronger effect when collocation is infrequent. The interaction between participant age and years known was significant with a small effect, as before. Friend sex, the participant-age by friend-age interaction, and the participant-sex by friend-sex interaction were not significant.

Table 3: Summary of data for each sharing scenario, sorted by overall mean sharing.
The first column reports the correlation with closeness; all correlation coefficients are significant at p<.001. The Tukey-Kramer test compares the overall means for sharing in each scenario: scenarios that share the same letter are not significantly different from each other.

Figure 8: Hierarchical clustering using average linkage distance. The horizontal position of the branches is directly proportional to the calculated distance between each cluster. Scenario names are shorthand for the same ones in Table 3.

Willingness to share across different scenarios

The previous analyses examined sharing preferences in general and found that participants were more willing to share with closer ties. Are there differences in how well closeness predicts sharing across the different sharing scenarios? Is closeness a strong predictor for certain scenarios only, or for all scenarios? Correlations between closeness and willingness to share are significant for all of the sharing scenarios, with Pearson's correlation values ranging from r=0.25 to r=0.53, all p<0.001 (see Table 3 for all values).

By asking about sharing across 21 different scenarios, I was able to investigate differences in sharing as a function of scenario type. Willingness to share was significantly and positively correlated across all pairs of scenarios (r=0.40 to 0.96, Cronbach's α = 0.97). I examined these similarities further by performing a hierarchical cluster analysis using the average linkage distance formula, a standard technique for examining groupings among items, which Olson et al. also used in their analysis of privacy and sharing (Olson et al., 2005). I chose mean sharing per level of closeness as the input because of the strength of closeness in explaining the variance of sharing responses.

The dendrogram in Figure 8 shows the clusters. The horizontal scale for the dendrogram is linearly related to the cluster distance at each point where a pair of clusters was merged. For example, in the middle of the dendrogram, "hist:common hist" and "hist:I've been where you are" were more closely clustered than the next two, "hist:everywhere traveled" and "loc:on vacation"; this is indicated by the horizontal distance, with the first cluster formed closer to the right side than the second one. Note that the scenario names are shorthand for the scenarios in Table 3.

The three clusters in the dendrogram can be roughly labeled as categories of scenarios: 1) scenarios with information about something that the participant and friend have in common (see Figure 8 top, e.g., loc:within 1 mile); 2) location-history-related scenarios (see Figure 8 middle, e.g., hist:everywhere traveled); and 3) scenarios that reveal sensitive information (see Figure 8 bottom, e.g., loc:always).

To ensure that the means for willingness to share were in fact significantly different across clusters, I performed a Tukey-Kramer HSD across all of the means (see Table 3; there was no significant difference between scenarios connected by the same letter). This revealed 13 groups (some of which overlap) of scenarios with no mean difference. Table 3 shows that the seven highest-mean sharing scenarios all involve sharing personal information that has something in common with the friend's information, for example shared calendar events or location proximity with the friend.
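For readers who want to reproduce this style of analysis, here is a minimal sketch with SciPy, assuming a long-format DataFrame of individual ratings with scenario, closeness, and sharing columns (names are assumptions; the study's actual tooling is not specified).

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# One row per scenario, one column per closeness level (1-5); each cell is
# the mean willingness to share for that scenario at that closeness level.
profile = df.pivot_table(index="scenario", columns="closeness",
                         values="sharing", aggfunc="mean")

Z = linkage(profile.values, method="average")   # average linkage distance
dendrogram(Z, labels=profile.index.tolist(), orientation="left")
plt.tight_layout()
plt.show()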
3.1.3 Discussion

The main focus of this study was to understand which of the collected features are most useful for predicting individual sharing preferences, with the ultimate goal of being able to automatically predict sharing preferences from that information. The results show that the simple 1-5 Likert scale for closeness was clearly the most useful feature for predicting sharing, outperforming grouping and all other models that do not include closeness.

Despite the relative success of closeness as a predictor when compared with life modes, the literature has favored privacy controls that focus on grouping (Danezis, 2009; Fang & LeFevre, 2010; S. Jones & O'Neill, 2010). In addition, commercial OSNs all seem either to provide grouping controls (e.g., Facebook and LinkedIn) or to require users to specify sharing preferences on a per-friend basis (e.g., Google Hangouts' "send my location" feature). While a grouping paradigm does not prevent individuals from constructing groups based on closeness, it may be more useful to explicitly ask users to do so.

One advantage of using closeness to aid in the specification of sharing controls is that closeness is ordinal; providing the closeness for two friends also indicates whether one is closer than the other. In contrast, a weakness of group-based privacy controls is that there is no natural ordering between groups; they are nominal. The ordinal nature of closeness can be useful for expressing privacy controls, as users could simply express "don't share with anybody below medium closeness" (closeness = 3). Closeness can also support tiered rules, such as "closest friends (5) can always see my location, medium-close friends (3 and 4) can only check up to twice a day, nobody else (1 and 2) can see it without requesting."

Additionally, closeness is a useful intermediate between communication frequency and sharing controls because it offers intelligibility to users: a user can understand that a sharing preference was specified based on closeness, and can even fix incorrect inferences of closeness. This also benefits any other applications that use closeness: a corrected closeness rating improves all of them.

3.1.4 Limitations

One limitation of this data is that it is entirely self-reported. Additionally, further work is required to demonstrate the real-world application of these findings. By conducting the study online and anonymously, experimenter effects were likely minimized. Furthermore, individual self-report data is the ground truth for some measures, such as felt closeness. However, some of the participants' answers may have been idealized responses (e.g., reporting calls to some people more frequent than they actually are), or participants may have been unable to answer thoughtfully for every sharing scenario (e.g., one cannot fully consider every place one has ever been).

3.2 Using Communication Data To Infer Tie Strength

According to social science theory, features of communication such as frequency of contact (Granovetter, 1973) and communication reciprocity (Friedkin, 1980) are reliable proxies for tie strength, and these have been increasingly used as proxies for tie strength in the research literature.
Following the finding of the previous study that self-reported tie strength predicts sharing preferences, the goal of this next study was to connect communication behavior to sharing preferences, using automatically inferred tie strength as an intermediate step in that chain. Since communication behavior should predict tie strength, and tie strength was just shown to predict sharing preferences, the results of this study were expected to be straightforward. Instead, the main result was surprising: communication behavior was not a reliable predictor of tie strength, particularly for strong ties.

3.2.1 Method

How well can tie strength be inferred from contacts, call logs, and SMS logs? These data sources can be found on nearly every smartphone, and I chose them to validate an assumption in the research community that communication frequency and duration from these channels can work as an effective proxy for the strength of a relationship. Further, I planned to use the inferred tie strength to predict sharing preferences. I collected data from participants' Android smartphones and asked them to manually categorize and rate their relationships with individual contacts as ground truth for tie strength.

Participants

I recruited 40 participants (13 male and 27 female) living throughout the United States by posting ads in several places: on Craigslist in 6 major US cities, on a nationwide site for recruiting study participants, on a website for posting social relationship research studies, and through a participant pool at our university. Participants met three selection criteria. First, to avoid privacy concerns with minors, participants had to be at least 18. Second, to focus on people who could benefit from a more computationally sophisticated representation of relationships, participants had to use Facebook and have at least 50 friends through the service. Third, to ensure a sufficient amount of log data, participants had to have used the same Android phone for at least six months prior to the study. 55% of the participants were students (graduate or undergraduate), 35% were employed in a variety of professions, and 10% were unemployed. Participant ages ranged from 19 to 50 years (mean = 28.0 years, σ = 8.9). Participants were instructed to complete the ground truthing within two weeks, and were compensated $80 USD. Of the 40 participants, four were excluded from the analysis: each had fewer than two weeks of data and fewer than 100 phone calls. Findings are based on the remaining 36 participants.

Procedure

Participants downloaded an Android app that copied their contact list, call log, and SMS log to a database file. Participants then uploaded this file, in addition to their Facebook friends list, to the study server through a custom website designed for this study. The entire study was conducted through this website. Participants could stop and resume whenever they wanted, and were given two weeks to complete the entire process. Android phones by default limit the call log to the last 500 calls and typically cap stored SMS messages at 200 per contact. This resulted in broad differences in how many days the logs represented (range: 21-369; median: 80; mean: 108). Participants' contact and Facebook lists were much too long for participants to completely ground truth.
Through pilot testing, I found 70 contacts to be a reasonable number for participants to rate before the task became overly burdensome; to maximize participant retention, participants were therefore asked to rate 70 contacts. The vast majority of any individual's contacts will be weak ties; however, for this study it was necessary to collect information on strong ties as well. To ensure that strong ties were included in the list of 70 contacts, participants generated a list of contacts that fit specific social categories, regardless of whether those contacts appeared in the phone contact list or Facebook list. Participants listed five people in each of the following categories: immediate family, extended family, people they live with, coworkers, people they feel close to, and people they do hobbies with. Past qualitative work suggests these categories will contain an individual's strong ties (McCarty, 2002; Spencer & Pahl, 2006; Wiese, Kelley, et al., 2011). This process resulted in approximately 25 unique names per participant (some names were repeated across the categories). In addition, each participant's top 15 contacts with the highest communication frequency for calls, SMS, and Facebook were included in the list. The characteristics of the contacts on the list allow for an examination of the assumption that communication is a direct proxy for tie strength: participants provided ground truth data for all of their high-communication contacts, and also for all of their self-reported strong ties. If call and SMS communication were a perfect proxy for tie strength, these two groups would be the same.

The final list of 70 contacts was composed of the category list and the frequency list, after removing duplicate names. In cases where this process yielded fewer than 70 contacts, I added randomly selected contacts from the participant's phone contact list and Facebook friend list. Afterward, participants manually inspected the list for duplicates, since automatic detection using contact names alone does not reliably identify all duplicates (Wiese et al., 2014). This process repeated until each participant had a list of 70 distinct contact names (hereafter called the 70-person list).

Participants provided demographics for each contact in the 70-person list, such as sex, age, and relationship duration. Participants also answered four questions about their relationship with each contact, adapted from (Marin & Hampton, 2007):

1. How close do you feel to this person?
2. How strongly do you agree with the statement "I talk with this person about important matters"?
3. How strongly do you agree with the statement "I would be willing to ask this person for a loan of $100 or more"?
4. How strongly do you agree with the statement "I enjoy interacting with this person socially"?

Participants answered questions using a discrete 5-point scale, following previous work on tie strength (J. M. Ackerman, Kenrick, & Schaller, 2007; Burke, 2011; Cummings, Lee, & Kraut, 2006; Roberts & Dunbar, 2011). I used a discrete rather than continuous scale to reduce cognitive load and fatigue: participants provided a large amount of data for many contacts, and a continuous slider may have been an additional burden. To protect privacy, I did not collect the content of SMS messages. However, I did collect descriptive information such as email domain name, the first six digits of phone numbers, and city/state/zip code.
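The kind of redaction this implies can be sketched as follows; the record format and field names are assumptions for illustration, not the app's actual schema.

def redact_sms(record):
    # Keep only coarse descriptors; the message body is deliberately dropped.
    phone = record.get("phone", "")
    email = record.get("email", "")
    return {
        "contact_id": record["contact_id"],
        "timestamp": record["timestamp"],
        "direction": record["direction"],            # sent vs. received
        "phone_prefix": phone[:6],                   # first six digits only
        "email_domain": email.split("@")[-1] if email else "",
        "city_state_zip": record.get("city_state_zip", ""),
    }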
3.2.2 Dataset

The dataset consisted of logs for 24,370 phone contacts, 16,940 calls, 63,893 SMS messages, and 1,853 MMS messages. Note that Android phones can be set to automatically sync the phonebook with online contact lists (e.g., Gmail and Facebook), so phonebooks may have included these contacts in addition to ones entered manually.

Figure 9: Total number of friends within each tie strength level across all participants, separated by whether contacts appeared only in the Facebook friends list, in both Facebook and the phonebook, only in the phonebook, or in neither. The data indicate that a notable number of strong ties appear only in the phonebook and not in Facebook, but few strong ties appear only in Facebook and not in the phonebook.

3.2.3 Tie Strength and Basic Properties of the Dataset

As a first step in exploring the validity of using information available on a smartphone (contact list, call logs, and SMS logs) to infer tie strength, I analyzed participants' answers to the four tie strength questions (questions 1-4 listed in the procedure section). The questions were highly reliable (α = 0.91), so I added all four responses together to form a scale; this is a standard practice that increases the reliability of a measure (Gliem & Gliem, 2003). Using the scale, I generated a ranked list of each participant's contacts based on relationship strength.

Next, I partitioned each participant's contacts into three levels of tie strength. I explored several approaches for identifying these levels. An assessment of the distribution of Z-scores from the combined tie strength metric, both across all participants and per participant, revealed no obvious gaps in ratings on which I could split strong and weak ties. Instead, I based these levels on previous work by Zhou et al., which finds that "rather than a single or a continuous spectrum of group sizes, humans spontaneously form groups of preferred sizes organized in a geometrical series approximating 3–5, 9–15, 30–45, etc." (Zhou, Sornette, Hill, & Dunbar, 2005). They found that the top group represents a person's closest relationships (support group), and the second group represents the next closest set of relationships (sympathy group). The larger groups of 50 and 150 people are considered to be less stable, and are referred to as clans or regional groupings. In constructing each participant's 70-person list, I took multiple steps to increase the likelihood of capturing many of a participant's closest contacts. Therefore, since the 70-person list likely included the majority of a participant's strong ties, I assigned the contacts into their respective groups based on the numbers from Zhou et al. By identifying relative tie strength for contacts within each participant, instead of setting absolute ratings as cutoff points, I normalized out individual differences between participants (e.g., a tendency for some participants to use 3 as the baseline and others to use 1, or a participant's negative reaction to a particular question).
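A minimal sketch of this scale construction and within-participant ranking, using pandas; the file and column names are assumptions, and the Cronbach's alpha implementation follows the standard formula.

import pandas as pd

def cronbach_alpha(items):
    # items: DataFrame with one column per scale item, one row per contact.
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    scale_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / scale_variance)

ratings = pd.read_csv("tie_strength_ratings.csv")
items = ratings[["close", "important_matters", "loan", "enjoy_social"]]
print(cronbach_alpha(items))   # 0.91 in this dataset

# Sum the four items into a single scale and rank contacts within each
# participant (rank 1 = strongest tie), normalizing out individual baselines.
ratings["tie_scale"] = items.sum(axis=1)
ratings["rank"] = (ratings.groupby("participant")["tie_scale"]
                   .rank(ascending=False, method="min"))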
I partitioned each contact list into three groups:

• strong tie: the top group (rank 1-4)
• medium tie: the middle group (rank 5-19)
• weak tie: the remaining contacts

In cases where multiple contacts tied for a rank, all of those contacts were assigned to the same tie strength level, resulting in a slight variation in group sizes per participant. With these tie strength groupings, I began to investigate communication patterns as a proxy for tie strength. First, I discuss simple features and their relationship to the tie strength groupings. Next, I describe machine learning models for inferring these tie strength levels.

Contact Source and Tie Strength

The properties of the 70-person list allow me to estimate an upper bound for the percentage of a user's close contacts who could be detected from the two contact sources: only Facebook, only the phone contact list, or both. As Figure 9 shows, overall 99% of people on the 70-person list appeared in either a phonebook or Facebook list (range: 95-100%; median: 100%). Overall, 19% of contacts existed only in the phonebook (range: 4-57%; median: 18%); 29% were only in Facebook (range: 0-56%; median: 31%); and 51% were in both (range: 20-90%; median: 52%).

Looking across the tie strength categories reveals distinctive trends. I used Spearman's rho (ρ) to measure the non-parametric correlations between tie strength group and presence in the phonebook and Facebook friend list. Being a Facebook-only contact was negatively correlated with tie strength (ρ=-0.32, p < 0.001). Being a phonebook-only contact was not correlated with tie strength (ρ=0.03, n.s.), although percentage-wise, more of the closer contacts were only in the phonebook. Being a phonebook-and-Facebook contact was positively correlated with tie strength (ρ=0.27, p < 0.001).

The red points in Figure 9 represent the 21 people who were in neither the phonebook nor the Facebook list. They were people whom participants identified as immediate and extended family members, housemates or roommates, or people they worked with, felt close to, or did hobbies with. The orange points in Figure 9 represent Facebook-only contacts, and the blue points represent phonebook-only contacts. 29% of contacts would be missed if using a phonebook-only list to classify tie strength, and 19% would be missed if using a Facebook-only list. Both a Facebook-only and a contact-list-only approach would miss some strong ties; however, the Facebook-only approach would miss a notably larger number of strong ties (29% vs. 4%).

Figure 10: Number of friends in the mobile contact list who exchanged zero (No Comm Logs) vs. at least one (Some Comm) SMS or call with our participants (determined from call log data). There are a number of strong ties with zero communication logs in the dataset; any classifier based on this communication behavior will misclassify those strong ties as weak ties. The issue is even more pronounced for medium tie strength: nearly half of those contacts have no communication in the collected dataset.

Tie Strength and Phone/SMS Communication

To establish an upper bound for the accuracy of inferring tie strength from phone and SMS communication, I divided the phonebook contacts into two groups by communication history (none vs. some).
A reasonable baseline expectation would be that contacts with no communication history would be weak ties. Figure 10 shows that most contacts with at least one communication in the dataset have higher levels of tie strength. Additionally, as the tie strength level increases, the percentage of contacts with some communication with the participant also increases (ρ=0.35, p < 0.0001). Still, several contacts with strong tie strength have no communication history in the dataset; attempts to classify tie strength using only call and SMS data could not correctly classify these contacts. Having at least one communication in the call and SMS logs increases the likelihood of a contact having higher tie strength. However, this is not an absolute rule: there are counter-examples in both directions, strong ties without communication history and weak ties with it.

Figure 11: A grid of six plots showing communication frequency and total talk time. The top 3 graphs plot each contact's aggregate call duration (y-axis) against number of calls (x-axis). The bottom 3 graphs plot each contact's number of SMS messages (y-axis) against number of calls (x-axis). For both top and bottom, the columns separate the contacts by tie strength group. The graphs include data for contacts with at least one call or SMS. All numbers are represented as the percentage of a participant's total communication frequency/duration.

Next, I explored the relationship of communication frequency and duration to tie strength. Figure 11 shows six plots in a grid, with each dot representing a contact in the dataset. The graphs in the top row show aggregate call duration (y-axis) against the total number of calls (x-axis) for each contact. The bottom row shows the total number of SMS messages (y-axis) against the total number of calls (x-axis) for each contact. Each column indicates the contact's ground-truth tie strength level. Both aggregate duration and frequency are represented as a percentage of the participant's total call duration or number of calls/SMSs.

I expected some close contacts (appearing in the two graphs in the right column) to stand out with long call durations (high y-axis value), and others to stand out with high frequency (high x-axis value), when compared with medium tie strength contacts (middle column) or low tie strength contacts (left column). For example, a person might call an old friend infrequently, but chat for a while each time. Conversely, one might regularly make short calls to a roommate to coordinate. As expected, contacts with more frequent or longer-duration communications were more often in the higher tie strength levels. Number of calls, duration of calls, and number of SMS messages are all positively correlated with tie strength (ρ = 0.42, 0.43, and 0.20, all p < 0.0001). Surprisingly, many people at all tie strength levels had very little communication. Weak ties generally had few calls and short durations. For strong ties, the ranges increase for number and duration of calls, but with a clump of few-and-short contacts.

Summary of Simple Features

This section established a basic upper bound of accuracy for inferring tie strength from smartphone communication logs.
The data show that using Facebook as the only data source would miss 29% of strong ties, either because those contacts are not Facebook friends or because they do not use Facebook at all. Next, there are some strong ties without any record of communication in the phone logs. Finally, while high communication frequency and call duration can help indicate strong ties, low frequency and duration are not clear indications of weak ties. These trends are consistent with tie strength theory: more communication on more channels indicates a strong tie. However, this dataset has a number of counterexamples, pointing to critical challenges for automatically inferring tie strength from communication behavior.

3.2.4 Classifying Tie Strength

While the above findings already indicate significant issues with using call and SMS logs to indicate tie strength, perhaps a combination of features more subtle than frequency and duration might indicate tie strength. To explore this prospect, I developed several machine learning models to classify tie strength based on call and SMS log data.

Features Used for Inferring Models

I defined a total of 153 machine learning features: 17 from the contact list, 66 from call logs, 36 from SMS logs, and 34 from combined calls and SMS. These features are based on (Min, Wiese, Hong, & Zimmerman, 2013), and more details on the specific features can be found in that paper. The features include:

• Intensity and regularity: The number and duration of communications has been used to infer tie strength in past work (Hill & Dunbar, 2003; Roberts & Dunbar, 2011). I modeled this factor using features such as the total number and total duration of calls.
• Temporal tendency: In their friends-acquaintances work, Eagle, Pentland, and Lazer (2009) observed temporal tendencies in contacting people, for example, calling particular contacts at different times of day and days of the week.
• Channel selection and avoidance: People favor a certain communication medium based on the person they are communicating with (Mesch, 2009). I modeled this using features such as the ratio between SMS messages and phone calls.
• Maintenance cost: Roberts and Dunbar (2011) found that people apply different amounts of effort in maintaining different kinds of relationships, measuring this effort by the time since last contact. To model maintenance cost, I used the number of communications in the past two weeks (a short-term view) and in the past three months (a longer-term view).

A sketch of how features in these families might be computed appears below.

Inferring Tie Strength Using Communication Logs

Using all of the features described above, how well can a model infer tie strength? The nature of tie strength poses a challenge for building this model. Tie strength could be treated as a numeric class value based on the answers to the tie strength questions. However, the difference between a rating of 1 and 2 is not necessarily equal to the difference between a rating of 2 and 3. Additionally, early iterations that treated tie strength as a continuous value tended to push scores toward the middle, with very few people classified as weak ties. Therefore, I used the tie strength levels of very strong tie, medium strong tie, and weak tie as nominal class values in these models. I evaluated the models using the Weka Toolkit's ("Weka 3: Data Mining Software in Java") implementation of a support vector machine (SMO).
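Before turning to evaluation, here is the promised sketch of per-contact feature computation in the four families above, using pandas. The column names and exact feature definitions are illustrative assumptions; the actual 153 features are detailed in (Min, Wiese, Hong, & Zimmerman, 2013).

import pandas as pd

def comm_features(calls, sms, now):
    # calls/sms: DataFrames with (at least) contact, timestamp, and duration
    # columns; timestamps are pandas datetimes. Names are illustrative.
    idx = pd.Index(sorted(set(calls["contact"]) | set(sms["contact"])),
                   name="contact")
    f = pd.DataFrame(index=idx)
    # Intensity and regularity
    f["n_calls"] = calls.groupby("contact").size()
    f["call_duration"] = calls.groupby("contact")["duration"].sum()
    f["n_sms"] = sms.groupby("contact").size()
    # Temporal tendency: share of calls placed in the evening
    evening = calls[calls["timestamp"].dt.hour >= 18]
    f["evening_call_ratio"] = evening.groupby("contact").size() / f["n_calls"]
    # Channel selection: preference for SMS over calls
    f["sms_ratio"] = f["n_sms"] / (f["n_sms"] + f["n_calls"])
    # Maintenance cost: communication in the last two weeks and three months
    for label, delta in [("2wk", pd.Timedelta(weeks=2)),
                         ("3mo", pd.Timedelta(days=90))]:
        recent = calls[calls["timestamp"] > now - delta]
        f["n_calls_" + label] = recent.groupby("contact").size()
    return f.fillna(0)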
I conducted a leave-one-participant-out cross-validation (each fold contained data from one participant). This prevents any anomalies within a particular participant's data from causing a performance overestimate. I trained 9 models, varying two aspects of the input data. First, I varied what the model was classifying (first column of Table 4):

• 3-class: classifies as very strong, medium strong, or weak
• 2-verystrong: binary classifier that combines medium strong and weak ties into one class, with very strong as the other class
• 2-mediumstrong: binary classifier that combines very strong and medium strong ties into one class, with weak ties as the other class

Table 4: The results of 9 classifiers constructed using SMO. The prediction classes are tie-strength categories. For 2-verystrong, the medium strong and weak tie strength classes are combined, and for 2-mediumstrong, the medium strong and very strong tie strength classes are combined.

I also varied the input data for the classifier (second column of Table 4):

• all includes all contacts on the 70-person list
• contactlist includes only contacts from the 70-person list who appear in the user's phonebook (see Figure 9)
• somecomm includes only contacts from the 70-person list with at least one logged SMS or call (see Figure 10)

Classification results vary considerably (Table 4), ranging from 46.28% (κ=0.179) to 91.55% (κ=0.361). The Kappa statistic measures the agreement between predicted and observed categorizations, correcting for agreement that occurs by chance.

These results reveal clear trends. First, within each of the class conditions, classifiers perform best for all, second best for contactlist, and worst for somecomm. Figures 9, 10, and 11 provide some insight into these results: most of the contacts who are not in the contact list (and thus excluded from the contactlist models) or who have no communication history (and thus excluded from the somecomm models) are not strong ties, and are therefore easier to classify. As a result, the models that include them perform better. The most successful class condition is 2-verystrong, followed by 2-mediumstrong; 3-class performs the worst. This is typical of multi-class models, which usually take a performance hit compared to binary classifiers.

More often than not, the models classified strong ties incorrectly: they were more likely to classify a strong tie as a weak tie than as a strong tie (in Table 4, the recall values for the strong tie class are the percentage of strong ties correctly classified, and they are under 50% for all but the 2-mediumstrong models). Also, about half of the ties classified as strong were actually not strong (in Table 4, the precision value for strong ties is the percentage of contacts classified as strong ties who actually were strong ties; it is under 55% for six of the nine class conditions). The plots in Figure 11 offer insight into these errors.
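For reference before the error analysis, the leave-one-participant-out protocol can be sketched with scikit-learn; the linear SVC here is a stand-in for Weka's SMO, and X, y, and groups are assumed arrays of features, tie-strength labels, and participant identifiers.

from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score

# X: per-contact feature matrix (e.g., from comm_features above);
# y: nominal tie-strength classes; groups: participant id per contact.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pred = cross_val_predict(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
print(accuracy_score(y, pred), cohen_kappa_score(y, pred))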
The misclassifications described above emphasize the weakness of using call and SMS logs to infer tie strength, and thus the problem with using those logs as direct proxies for tie strength. This result is even more pronounced in the recall values for the strong tie class of the 2-verystrong models in Table 4: the 2-verystrong-all model, which has the best overall accuracy, correctly detects only a third of strong ties.

3.2.5 Error Analysis

Participant Interviews

Motivated by the particularly low recall of the very strong tie class in these models, I conducted semi-structured interviews with 7 of the participants. For each participant, I selected 5 to 10 contacts they had labeled as strong ties that were misclassified as weak ties (58 contacts total). I focused on this type of misclassification based on an error analysis of the data. In the error analysis, I referenced tie strength theory to consider communication expectations for medium and weak ties. People do not only communicate with strong ties, so the presence of some communication with weak ties is reasonable. However, if participants had had more communication with more of their strong ties, the model would have been better able to distinguish between strong and weak ties. This led me to focus on very strong ties with little or no communication (who were misclassified as weak ties), rather than weak ties with some communication (who were misclassified as very strong ties).

Interviews took place over the phone, lasted about 30 minutes, and were recorded to facilitate note taking. I asked participants open-ended questions about the nature of their relationship and communication with each selected contact:

• When and how did you meet this person?
• What led to this being a close relationship?
• Has anything changed between the time that you became close and now?
• Was there anything different about the channels that you used to communicate with this person, or the frequency of that communication, between then and now?

I iteratively coded participants' responses about each contact for themes that would provide insight into the misclassifications. Several themes surfaced that help explain the discrepancy between communication frequency and tie strength. I present them in two categories: Communication Channel and Relationship Evolution.

Communication Channel

We used to talk on the phone more when we first became close (7 of 58 contacts). In these cases, participants indicated that they used to speak on the phone more frequently, but do so less frequently now, mostly just to catch up. In some cases, this seemed to be a result of a change in life stage (either for the user or for their contact) and/or a change in geographic location, replicating findings from prior work (Spencer & Pahl, 2006). For example, one participant complained that he used to keep up with a friend much more regularly before that friend got married, and now they hardly speak at all. Changes in life stage and geography are discussed further in the Relationship Evolution section below. Other contacts in this category appear to be in relationships in decline, where the feeling of closeness lingers. One participant spoke about reaching out to a friend multiple times without reciprocity: "I'd like to be friends, but it doesn't work unless we both put in the effort."

In-person communication (11 of 58 contacts).
Participants also identified contacts with whom they mostly interacted in person. A contact's close proximity to the home seems to play an important role in tie strength. One participant described talking to her neighbor opportunistically, when they see each other. Another detailed how she spoke with her 11-year-old son regularly, just not over the phone. Three participants described friends from classes and their dorm with whom they spoke when they saw each other. Extended family often fell into this category: many participants reported primarily speaking with parents, siblings, and other family members in person. In one case, a participant reported going to her parents' house a couple of times per month, but mostly not calling her dad on the phone. In these cases, a lack of communication logs did not mean a lack of effort in maintaining the relationship. In discussing these contacts, some participants specifically mentioned making an effort to travel once a year to see each other, or making a special effort to get together when they do happen to be in the same place.

Other communication channels (25 of 58 contacts). For some strong ties, participants noted that they communicate regularly, but not via phone calls or SMS. For several participants, communication with a contact happened almost exclusively on Facebook. Other participants used instant messenger, email, Skype, or SMS replacements such as WhatsApp to stay in touch with close contacts.

Relationship Evolution

Different location or different life stage (27 of 58 contacts). When asked what was different about their relationship between when they became close and now, many participants responded immediately that either they or their contact had moved. As in the literature (Spencer & Pahl, 2006), participants said that with the change in geography, the communication frequency had changed, but not the perception of closeness. The move was often triggered by a change in life stage (e.g., going to college, graduating, getting a new job). However, even without a move, a significant life stage change could trigger a communication change on its own (e.g., getting married or having a child).

Family is close regardless of communication (17 of 58 contacts). Many of the misclassified contacts were family members. Several participants described specific familial relationships from the perspective of obligation, which hinted at a greater underlying complexity. For example, one participant said that she refused to take her grandmother's phone calls, stating that she calls too frequently and repeats herself. Yet the participant still reported feeling very close to her grandmother. Another participant, the mother of an 11-year-old, said "of course I am close to him," but that it is not necessary for them to talk on the phone. Another participant said her uncle was "definitely close, but he's different from the other close people. He's that really strict uncle that wants to tell me how to live my life, so I don't talk to him too much, maybe every couple months."

Interview Summary

These interviews highlight the limited effectiveness of the tie strength models. One issue that limits the effectiveness of these models is the way that relationships change over time. In particular, the circumstances under which two people became close are not necessarily the current circumstances of the relationship, even if the two people remain close.
Since the communication logs only capture relatively recent behavior, they do not contain the data that would indicate a strong long-term relationship. The other main factor that limits these models' effectiveness is that much interpersonal interaction occurs outside of phone calls and text messages, including communication in other media as well as face-to-face communication. Call- and SMS-based models do not account for these interactions.

3.2.6 Discussion

This section (3.2) investigates the growing practice of using communication frequency and duration as a proxy for social tie strength. While social psychology theory holds frequent and long-duration communication across all channels as an indicator of strong ties, the research community has used behavior across a few communication channels, over relatively short time windows, as a tie strength proxy. This study examined whether the call and SMS logs stored on a smartphone hold enough information to infer tie strength.

Communication Is an Indicator of Tie Strength, But…

These results support the tie strength theory literature, showing a strong relationship between tie strength and communication patterns (Gilbert & Karahalios, 2009; Roberts & Dunbar, 2011). Higher levels of communication frequency, call duration, and, in particular, communication initiated by the phone's owner are all indicators of a strong tie. However, when operationalizing this theory with call and SMS logs, the signal is very noisy. Low levels of communication do not accurately identify weak ties: participants had many strong ties whom they rarely called or texted.

The interviews probing strong ties with little communication revealed several explanations for this pattern, each of which poses fundamental challenges for inferring tie strength. First, a person's communication via phone and SMS does not capture all of their communications. Interactions happen through many other channels (e.g., Skype, instant messenger, landline phones), in some cases replacing communication via phone or SMS. Second, face-to-face communication remains a primary form of communication for some very close contacts, but capturing this kind of communication is difficult with current technology. Third, strong ties may form in an earlier life stage and persist across stages even as communication frequency diminishes.

Even if one could capture data across multiple channels, and do so for long periods of time, it is not clear that this would be sufficient to improve the models of tie strength. A breadth of recent and highly cited research has assumed that call and SMS behavior is a good proxy for tie strength (Conti et al., 2011; Miritello et al., 2013; Onnela et al., 2007; D. Wang et al., 2011). These contributions do not attempt to identify all strong ties exhaustively; rather, they identify only strong ties who use a specific communication channel. The contactlist and somecomm datasets best match this task. The models for these datasets produce similar errors, and also indicate that communication frequency and duration are an incomplete signal for determining tie strength. While theory supports the relationship between communication frequency and duration and tie strength (Hill & Dunbar, 2003), these communications should not be operationalized only through the call and SMS logs stored on a person's phone.
Alternatives for Identifying Tie Strength

Researchers looking for a way to separate strong ties from weak ties need to consider alternatives to using short-term communication logs from one or two channels, such as those available on today's smartphones. One alternative is to collect data from more communication channels. This approach has several challenges. First, beyond the most popular additional sources (e.g., email, Facebook), researchers are likely to face diminishing returns when adding data sources: some people use Skype, while others use Google Hangouts; similarly, there are many text message replacement apps (e.g., WhatsApp, GroupMe, Kik). The number of communication channels is growing, people have different preferences for which channels they use and for what purposes, and people switch between services based on fads or on what services their friends are using. Second, many of these services offer no API for accessing log data. Third, correctly linking contact identities across multiple communication sources is nontrivial and error-prone.

Another way of augmenting this process while still using communication data to separate strong and weak ties is to use much more data: data that extends back to when close relationships first began, which could be on the order of years or even decades. Since this data does not exist for current close relationships, the only way to evaluate this method would be to start collecting the data now and see whether it predicts the presence of strong ties that may only form several years from now. Current data collection and retention practices are not conducive to long-term data collection. For example, Android devices by default store only the last 500 calls and 200 SMS messages. Furthermore, there are no standard APIs to access one's data, and no unified structures for storing user data and maintaining history as users change devices and services. For work on long-term communication history to be possible, these practices will have to change.

Investigating message content might also help to improve the separation of strong and weak ties. It is possible that, in cases where there is some communication, the content of communication with strong ties differs from that with weak ties in a systematic way. A drawback to this approach, and the reason I did not explore this avenue, is that many people are uncomfortable with the privacy implications of granting content-level access to calls and SMS.

Another approach is to differentiate relationship-maintenance communications with strong ties (which can be infrequent but very important) from other types of communication. One way to do so is to see whom a person calls or visits when traveling (factoring in time of day to differentiate between a likely work contact and a social contact). Another way might be to use age or the inferred life stage of individuals and incorporate that into tie strength models. For instance, college students, 40-year-old parents, and senior citizens likely have different kinds of people among their strong ties. This method would require much deeper investigation into how people's friendships change over time and how life stage affects these relationships.

The most reliable option for distinguishing strong and weak ties is to include users in the process, through interviews (Spencer & Pahl, 2006) or a survey (as I did).
Some research has considered computer-supported tools for collecting this kind of data (Ricken, Schuler, Grandhi, & Jones, 2010). The primary challenge here is that, even when labeling is efficient, this approach still requires the time and effort of the user.

The primary drawback to all of these approaches is that they require data that is hard to obtain. In general, researchers who use communication frequency as a tie strength proxy do so because it is easily available. Many of the research datasets being analyzed were collected and anonymized for a different purpose, often by a third party such as a telecommunications company. Researchers using such datasets cannot collect more data and have no access to the actual participants. Furthermore, many of these datasets contain data from far too many users for a non-automated approach to be possible.

Using Communication Frequency as Tie Strength

Researchers will likely continue to use communication frequency as a tie strength proxy because, with the rise of smartphones, the log data is increasingly available. Here, I offer some implications for those who make this choice. Researchers should carefully consider how the imperfect proxy of communication frequency for tie strength limits the strength of their claims. A strong tie might have some in-channel communication (meaning that they would be included in the experiment), but may still have less communication in that channel than some weak ties. Does this hurt the strength of a claim made on that data? It will depend on the claims being made, and on the extent to which those claims rely on a clear separation between strong and weak ties.

One solution for researchers in these situations is to modify their claims so that, instead of relating claims to tie strength, they relate the claims directly to communication frequency. For example, the existing work that equates tie strength with communication frequency (Conti et al., 2011; Miritello et al., 2013; Onnela et al., 2007; D. Wang et al., 2011) makes valuable contributions. However, its findings are explained directly in terms of tie strength, which overestimates the reliability of inferring tie strength from communication frequency and can impair the reader's ability to correctly interpret the findings. If tie strength is important to an argument, researchers should also explain how they believe tie strength and communication frequency are related within their dataset, and should explicitly identify communication frequency as a limited proxy.

This work has not yet explored the possibility of systematic per-user differences based on demographics, behavioral characteristics, or life stage that may affect classification accuracy in separating strong ties from weak ties. If any such effects exist, they may affect the claims that can be drawn from using communication frequency to classify tie strength. Similarly, communication frequency may be useful for detecting other dimensions of interpersonal relationships. In turn, the influence of per-user differences and of other dimensions of personal relationships may sharpen the definition of tie strength and the understanding of its nuances as a concept.
3.3 Case Study Discussion

The goal of the studies presented in sections 3.1 and 3.2 was to use the communication behaviors of users with their contacts to predict their preferences for sharing different kinds of information with those contacts, using tie strength as an intermediate representation. At the outset, this logical chain seemed well supported by theory, especially since communication behavior is known to predict tie strength. However, operationalizing this theory revealed fundamental challenges for working with personal data.

First, obtaining the participants' communication data was a difficult process. It required me to write custom applications to scrape the data from participants' devices, plus additional effort to gather and process the data from Facebook. Collecting data from more sources would have significantly increased the complexity of this task. The process of transforming this data so that it could be used in the machine learning models also required linking the data across the two different data providers (phone and Facebook) and merging duplicate contacts. Since simple name matching did not identify all duplicates, merging duplicates required significant manual effort, both by the researchers and by the participants.

Further, the type and amount of data available varied widely by participant. Android restrictions limited the total number of call logs (and SMS logs for some participants as well). For other participants the data was limited by how recently they had obtained their current phone, because call and SMS logs are not automatically synced from an old device. In some cases this meant the timespan of the dataset did not cover enough time to make the tie-strength inferences. Additionally, the irregularity of the contact list entries and the incompleteness of the contact data made it practically impossible to use data from the contact lists for any of the model's features at all.

Improving these models by including communication logs from additional data sources is not trivial. The effort needed to collect, link, and merge the additional data sources would be equal to or greater than the effort needed for the original data, especially since applications for scraping data are not re-usable across data sources. In the two studies presented in this chapter, manual steps were required of both the researcher and the participant.

Finally, the models developed here still offer potential value for inferring tie strength in noise-tolerant applications where the cost of an incorrect inference is small. Unfortunately, even in this case, deploying these models remains a significant challenge. This is in part because the entire machine learning pipeline here was static and one-off. Formatting the data, extracting features from the data, collecting ground truth, and classifying instances from the dataset all happened with limited automation to complete this research project. Furthermore, there is no mechanism to deploy inference models. Deploying this model (e.g. as an Android library) would require significant additional development effort to automate this process and generate a usable API. Additionally, deploying a machine learning model of tie strength raises privacy concerns. In particular, inferring tie strength information for contacts requires permission to access the user's communication metadata.
One solution would be to require that any applications using the library declare these permissions in their manifest. However, this may give developers pause: for some applications, adding permission declarations for call logs, SMS logs, and the contact list in order to obtain tie strength may not be worth the additional scrutiny of a user. Some users may ultimately choose not to download an application that accesses too much data (J. Lin et al., 2012). It seems unnecessary for an application to require these permissions if all it is accessing is the resulting tie-strength inference.

As a case study, this work illustrates some of the challenges in using personal data to make high-level inferences: the availability of the raw data, the limited effectiveness of automated approaches for identifying duplicates, and the dispersion of one behavior across many channels (in this case, communication across phone calls, SMS, and channels not captured in the study). Most importantly, personal data was found to be an unreliable indicator for a higher-level inference even in an area where strong theoretical work already existed. These challenges point to deeper issues that affect the way that personal data can be used. While they are illustrated here with communication logs and inferences of tie strength, these challenges are not unique to this specific area: researchers and developers in many situations are sure to encounter the same issues.

4 A Conceptual Framework for Personal Data

The previous chapters have offered a broad set of insights that speak to the complexity of personal data from the perspectives of end users, researchers, and application developers. This chapter begins by examining the ecosystem of personal data today from a macro level, identifying breakdowns between stakeholders. Next, the chapter offers a synthesis of issues highlighted here and in previous chapters to extract more general systemic issues with the ecosystem of personal data. This synthesis leads to a conceptual framework for understanding the breadth of personal data, a range of applications that could use that data, and the process for working with that data. The conceptual framework (described in section 4.4) consists of two components. The first component is a continuum of personal data (described in section 4.2) from very low-level (e.g. raw sensor data) to very high-level (e.g. is the user experiencing major depression?). The second component is a set of three steps that are required to develop applications that depend on personal data (described in section 4.3). This framework serves as a boundary object to facilitate shared understanding of the domain of personal data and the process of working with that data to serve some client application. Finally, this chapter offers some design goals for improving the ecosystem of personal data from its current state to address the many issues highlighted throughout this thesis.

Figure 12: Personal data today is separated across the applications and services where each type of data originated (left). To unlock the full potential of personal data, it should instead be structured to prioritize the coherence of the heterogeneous data around each individual who is the subject of that data (right).
4.1 The ecosystem of personal data today

Despite its "personal" nature, personal data today is organized and stored separately within each service or application where that data was collected, rather than all of an individual's personal data being stored together. This "siloed" approach to storing personal data introduces significant problems. At a high level, these problems are:

• Provides poor service: Each service has an incomplete view of the user, which limits the service's offerings. It is also impossible for the user to manage the access and usage of their data across these distributed silos.

• Facilitates customer lock-in: Users are bound to the services that hold their data, and leaving these services would cause the user to lose all of the value that they get from their data. For example, if a user wanted to stop using Netflix and start using Amazon Instant Video, he would have to leave behind the value of recommendations based on his viewing history.

• Chicken-and-egg: New services that rely on rich personal data are subject to a chicken-and-egg problem for procuring that data: the service is not valuable without the data, but the data is hard or impossible to obtain without the user using the service.

• Ground truth data labels: If a user tracks her location using one service and labels "home" and "work" in that service, those ground truth labels do not propagate to other services that the user wants to have access to that information (for example, a direction-finding service). This diminishes the value for the user of providing ground truth and further increases the challenges of leveraging personal data.

For an independent application developer to incorporate personal data into a new application today, she must follow all of the steps outlined below in section 4.3, facing many decisions along the way. In all but the most trivial straw-man examples, following this process requires a significant investment in development resources. Under this system, many applications of personal data are simply not feasible, or are even impossible. To make matters worse, because each developer solves these challenges on their own, these bespoke solutions are unlikely to be reusable by other developers. This further impairs a successful outcome when working with personal data.

4.1.1 Stakeholders

To fully understand the current state of personal data, it is important to acknowledge the different stakeholders and their goals. Together, these stakeholders and their goals form an ecology of personal data. Considering the entire ecosystem is helpful for understanding why things are the way they are today, and also what kinds of effects any changes in this ecology might have. The stakeholders include:

Data-logging companies and service providers: Products, services, and applications that generate logs of user data. This can include large companies that have many products (e.g. Apple, which includes iOS, OSX, Apple-written applications, iCloud, iTunes, and the iOS App Store). This can also include small companies and companies with fewer products (e.g. Dropbox, Netflix). Truly any service that a person uses has the ability to collect rich data on that person's actions.

Data-consuming applications and services: These are services that take the user's personal data and apply it in some way to provide a service to the user. In many cases, one organization is both of these first two stakeholders (data consumer and data logger).
For example, Netflix collects a user's viewing data and uses that data to make recommendations. However, services may also make use of many different kinds of personal data from a multitude of different data loggers.

End users: Everybody, as individuals. These are the people whom the data is about and the people who are using these applications and services.

4.1.1.1 Relationship Between Data-Logging Services and Users

Data-logging services would not be meaningful without people who use those services. This relationship is mutually beneficial: users get to use the product or service, and the services get the data that describes the individual's usage of the service. The data that users generate while using a service is inherently under the direct control of the data-generating service. Data-generators decide which data to collect and not to collect, how long to store collected data, and how accessible that data is to users and third parties. Services typically do not give a user complete access to the data that has been collected about them. Even in notable situations where data is made easily accessible to end users (like Google Takeout15), there is still valuable data that is not included, but that the company still collects (e.g. a Google user's search history and Chrome browsing history).

15 https://www.google.com/settings/takeout

A user's data is often essential to the business model of a data logger. In many cases, services are provided to the user for free or at a price that is below what it costs the service provider to provide that service. This is made possible by selling information gleaned from the user's data (i.e. market insights), or by incorporating that data into a service (e.g. facilitating targeted advertising by employing the user's data). In other cases, the user's data is what differentiates a service to make it more attractive to a user (e.g. Netflix relies on a user's ratings and viewing history). As a result, some service providers are likely to behave in ways that make users unlikely to switch service providers. One way that service providers can do this is by locking a user into their particular service, by withholding access to the user's data or limiting portability to different services. Finally, privacy issues for users arise when data-generators capture data that users did not want to be captured and/or did not know was being captured.

4.1.1.2 Relationship Between Data-Logging Services and Data-Consuming Services

As mentioned above, in the current ecology of personal data, an individual service is often both a data-generating service and a data-consuming service. For example, the dialer application on an Android smartphone, as a data-generator, collects data about whom the user calls. As a data-consumer, the dialer application shows the user whom they have called most frequently and most recently. However, despite sometimes being the same entity, data-consuming services are distinct from data-logging services because data-consumers may also want to access data from a data-provider that is a different service. For example, the Android dialer may want to also use data from Facebook and Skype to show people that the user communicates with frequently across different communication media. In some cases, data-generators charge data-consumers money for access to a user's data.
For example, Facebook, Google, and many smartphone applications make a lot of money by leveraging a user's data to deliver targeted advertising. Here, privacy issues can arise for users when data is made accessible to data-consumers without the knowledge or explicit consent of the user.

4.1.1.3 Relationship Between Data-Consuming Services and Users

As mentioned above, many data-consumers are companies that are consuming the data that they created themselves as data-generators. Privacy issues arise when data-consumers use data in a way that a user did not intend, or when the data reveals (either directly or indirectly) information that the user did not want revealed. One example is the recent story of a teenage girl who received advertising in the mail for maternity clothing and cribs based on her shopping habits, even though she had not told her father that she was pregnant16.

In some cases, a user might wish to provide their own data to a data-consumer. For example, a user may want to use their data from one service to help personalize a different service. Another possibility is that a user might want to donate their data to researchers who are going to analyze it. In general, users are fairly limited in their ability to do this. There is typically very little communication or interaction between data-consumers from different organizations (except in formal business relationships, like the relationship between Facebook and advertisers described above). So if the user provides some information to one data-consumer (e.g. "this location is my home"), the user probably needs to provide that information separately to other data-consumers, even if they were okay with that information being used by those other data-consumers as well.

4.1.2 Summary

The way that these stakeholders interact today is troubling and leaves much to be desired. Data-loggers wield a lot of power in deciding what data is collected, how it is stored, and who has access to it. With this power also comes the responsibility of maintaining the user's trust and obeying laws. Data-consuming services want to provide their users the best possible service, which can rely in part on access to the user's data. Users want to receive the best services possible, but also want to be comfortable with how their data is being used, which requires a combination of transparency and trust. Finally, all stakeholders are seeking to minimize costs, even at the expense of other stakeholders (e.g. storing data and providing it through an API can cost money, so loggers may avoid it).

Examining the ecosystem of personal data in this way highlights clear problems with how personal data is managed today. The current ecosystem stifles innovation, facilitates lock-in, and offers a sub-optimal user experience. Considering these issues holistically offers the potential to improve personal data for all stakeholders.

4.2 The Personal Data Continuum

In this thesis, data is considered personal data if it describes something, anything, about an individual person: her behavior, her interests, her social relationships. So far in this thesis, personal data has been discussed as a single concept: either something is personal data or it is not. However, to fully engage with how personal data is collected, stored, and used, it is useful to think about different kinds of personal data and how they are related to each other.
16 http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=7

This section develops the idea that personal data can be thought of as falling somewhere along a continuum. This continuum is a core component of the conceptual framework. It ranges from very low-level data (e.g. a log of accelerometer data, latitude and longitude coordinates, audio levels) to extremely high-level data (e.g. my behaviors that do not support sustainability, the set of skills that I don't have which would be most beneficial to learn, an inference of the state of my mental health). Personal data can exist at various points along the continuum. The personal data continuum is intended to be continuous rather than discrete; however, it is also important to keep in mind that the continuum is a conceptual tool, not an absolute dimension. The personal data continuum described here is one of two components (the other being the steps introduced in section 4.3) of the conceptual framework for working with personal data (described in section 4.4). The following examples offer additional perspective on various points along the personal data continuum.

Figure 13: The personal data continuum ranges from very low-level data (far left side), like sensor data that describes the user's behavior and surroundings, to very high-level data (far right side) that describes information about individuals that they might not even know about themselves. Information at the lower levels can often be directly sensed, but data higher on the continuum has to be provided manually or inferred from a combination of lower-level data.

4.2.1 Points along the continuum

Low-level data

Low-level data is often sensor data, such as that produced by an accelerometer, light sensor, temperature sensor, or microphone, but it might also be log data like keypresses or mouse movements. One characteristic of this low-level data is that it typically does not mean very much if a human looks at it on its own. For example, the readings from an accelerometer mean very little to a human without additional processing to interpret the data, usually as a time series.

It is not always immediately obvious which low-level data is personal data and which is not. For example, accelerometer data from a smartphone can vary in its status as personal data: if a user is in possession of their smartphone, then the accelerometer data describes something about the user's behavior. However, if the user has lent their smartphone to somebody else, the accelerometer is no longer generating personal data for the owner, because that data does not describe anything about that person; instead, the accelerometer is now generating personal data about the person currently in possession of the phone.

Person-Readable Logs

At this slightly higher level, personal data begins to have clearer meaning. The kind of data that exists at this level is often logs of user behavior from different applications: phone call logs, text messages, emails, browsing history, sleep logs, physical activity logs, purchase history, a log of places visited, and media consumption history (music, news stories, TV shows, movies). The list of data that roughly fits into this category is very long. This category is what most people likely think of when they think of personal data.
In general, the common concerns about personal data, privacy, and control relate to this kind of data. The data collected as a part of the NSA's infamous PRISM program17 generally fits into this category.

Personal Inferences

At the level of personal inferences, personal data is less about individual moments in time and more about general, higher-level knowledge about an individual: the kinds of things that change over the course of weeks, months, or even years. For example, social relationship data such as tie strength and life facet (Farnham & Churchill, 2011) from the previous chapters are examples of personal inferences. Other examples of these kinds of inferences might include how physically fit the user is, how well she has been sleeping, or how social she has been. This level is completely removed from the set of things that can be instantaneously observed and automatically collected by a computer system.

Holistic Understanding

This category represents the upper limit of personal data. These are very high-level inferences that describe things about individuals that they might not even know about themselves. It is easier to think about the items in this category in terms of questions: Am I becoming depressed? What should I be when I grow up? What skill should I learn? How can I live more sustainably? What item should I purchase to make my life better? These are all questions that would require an incredible amount of data to answer. They require more information than personal data alone, but personal data is a very important component of their answers.

17 The PRISM program is a clandestine surveillance program run by the NSA that collected large amounts of internet communications. For more information see: http://www.washingtonpost.com/investigations/us-intelligence-mining-data-from-nine-us-internet-companies-in-broad-secret-program/2013/06/06/3a0c0da8-cebf-11e2-8845-d970ccb04497_story.html

One vision of the far-off future of personalized computing systems imagines that technology might be able to answer these questions for users automatically, or at least help lead people to these answers themselves.
Each of these items informs a the top-level inference of depressed/not depressed. Going another level lower, each of those items can be broken down into lower-level data that can generate output that answers those questions. For example, some data that might feed into how social a person has been could include data about how much time they have spent talking on the phone, how many text messages and emails they have exchanged, how many total people they have spoken on the phone with, what percentage of their time they’ve spent speaking with other people face-toface. It would be easy to brainstorm many other kinds of data here. To infer that the patient has been experiencing high stress, a model might again employ features of the patient’s communication behavior, stress indicators in their speech patterns, perhaps even the content of communication exchanges. The model might also take into account how full the patient’s calendar is and how many of the events are routine. Finally, at the lowest level, a variety of sensors provide data that inform the higherlevel types of data. For example, accelerometer and gyroscope data can be used to infer what kinds of activities the individual has been engaging in. Those sensors, 77 Chapter 4: A Conceptual Framework for Personal Data combined with the microphone and light sensor can be used to infer sleeping behavior (Min et al., 2014). This single example of detecting depression in an individual has demonstrated aspects of personal data all along the continuum, from the very lowest levels of personal data as sensor logs through several layers of inferences up to a high-level inference of detecting depression. 4.3 The steps for working with personal data Figure 14: The personal data pipeline breaks down the steps of working with personal data. At a high level, using personal data means collecting the data, inferring some meaning from that data, and then applying the data to the target application. However, these steps are deceivingly simple. In reality each of these steps is complex with many components and a host of implicit challenges. The illustration of inferring depression in the previous section, as well as the case study of inferring tie strength and sharing preferences in chapter 3 are both examples that offer some insights into the process of working with personal data with the ultimate goal of applying it to some target domain. This section expands on these to establish generalized steps that capture the process of applying personal data to an application, leveraging the continuum from the previous section. Together, the continuum and the steps described in this section combine to form the conceptual framework of personal data. At a high level, the steps are: 1. Collect the personal data from one or more sources where the data has been recorded and stored. 2. Transform the collected data on the continuum from the point where it was when collected to the point where it needs to be in order to be applied to a particular target application. 3. Use the transformed personal data in the target application. 78 Chapter 4: A Conceptual Framework for Personal Data It is easy to look at that list and infer that this is a simple process: each step is short and concise. However, this process is actually much more complicated than what it would seem at first glance. 4.3.1 Collecting the Data The first step of the process is to collect the source personal data. 
4.3 The steps for working with personal data

Figure 14: The personal data pipeline breaks down the steps of working with personal data. At a high level, using personal data means collecting the data, inferring some meaning from that data, and then applying the data to the target application. However, these steps are deceptively simple. In reality, each of these steps is complex, with many components and a host of implicit challenges.

The illustration of inferring depression in the previous section, as well as the case study of inferring tie strength and sharing preferences in chapter 3, are both examples that offer some insight into the process of working with personal data with the ultimate goal of applying it to some target domain. This section expands on these to establish generalized steps that capture the process of applying personal data to an application, leveraging the continuum from the previous section. Together, the continuum and the steps described in this section combine to form the conceptual framework of personal data. At a high level, the steps are:

1. Collect the personal data from one or more sources where the data has been recorded and stored.

2. Transform the collected data on the continuum from the point where it was when collected to the point where it needs to be in order to be applied to a particular target application.

3. Use the transformed personal data in the target application.

It is easy to look at that list and infer that this is a simple process: each step is short and concise. However, this process is actually much more complicated than it would seem at first glance.

4.3.1 Collecting the Data

The first step of the process is to collect the source personal data. This data tends to be towards the lower parts of the continuum, but it might be anywhere along the continuum. This can include collecting data directly from sensors, from usage logs, or from other systems that have already processed the data in some way. Collecting this data really breaks down into several steps:

1. Choosing services that allow programmatic access to user data: The discussion of stakeholders highlighted the power that data-loggers have in this ecosystem: they are the gatekeepers to personal data, so if they don't collect the desired data or don't provide programmatic access to it, nothing else can be done.

2. Authenticating the user: Most personal data is protected through some authentication mechanism, and the user must authenticate in order to provide access to the developer.

3. Obtaining permission: The application or service that is collecting the data must obtain permission from the user to access the data. This can take many forms. On Android, this is done at install time. If connecting to a REST API (e.g. Fitbit, email, etc.), then the user must authenticate and grant permission to access the desired data at runtime.

4. Representing the data: In almost all cases, different data sources represent their data in different schemas, even when the underlying type of data is similar.

5. Linking the data together: When combining data from multiple sources, making use of the data often means linking it together in some way. For example, when collecting different kinds of communication data, it is often necessary to connect the communications based on the person that the communication was with (a sketch of this step follows this list).

6. Cleaning the data: In some cases, data from one data source may duplicate the data from a different source. The complexity here can vary considerably. For example, Gmail offered the ability to archive Google Talk conversations within Gmail. So, collecting data from Gmail as well as Google Talk would result in double-counting those communications for users who had enabled that archiving feature. In other cases, individual pieces of data, or all data from a data source, may be biased or incorrect in some way.

Whether a developer thinks about these steps consciously or just implicitly, each of these steps is essential for collecting personal data. Furthermore, the complexity increases considerably when collecting data from multiple sources, whether they are different data sources for the same type of data or for different types of data. For many developers, the challenges present here severely limit what they do with personal data. They may support fewer sources, or they may choose not to attempt an ambitious idea because of these limitations. Furthermore, decisions made at this stage, whether purposeful or implicit, will affect what can be done with the data later on, and also the ease with which additional data sources can be added in the future.
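To make the linking step concrete, here is a minimal Java sketch of matching records from two sources by a normalized phone number. The normalization rule is a deliberately naive, US-centric assumption for illustration only; real identity linking, as chapter 3 showed, requires much more than this.

import java.util.HashMap;
import java.util.Map;

public class ContactLinker {
    /** Strip formatting so "+1 (412) 555-0100" and "412-555-0100" match. */
    static String normalizePhone(String raw) {
        String digits = raw.replaceAll("[^0-9]", "");
        // Naive assumption: an 11-digit number starting with 1 has a US country code.
        return (digits.length() == 11 && digits.startsWith("1"))
                ? digits.substring(1)
                : digits;
    }

    public static void main(String[] args) {
        // Key contacts from one source (e.g. the call log) by normalized number.
        Map<String, String> contacts = new HashMap<>();
        contacts.put(normalizePhone("+1 (412) 555-0100"), "Contact from call log");

        // A record from a second source (e.g. SMS) links despite different formatting.
        System.out.println(contacts.get(normalizePhone("412-555-0100")));
    }
}

Even this tiny example hides assumptions (country codes, extensions, shared phone numbers) that break down in practice, which is why the duplicate-merging in chapter 3 ultimately required manual effort.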
4.3.2 Transforming the Data

The next step of the process is to transform the data from the point on the continuum where it was collected to the point on the continuum where it needs to be in order to apply it in the application. Again, there are multiple steps involved here:

1. Deciding what the target data is: There are many possible ways to abstract and transform the data, and there are tradeoffs between them (e.g. does the application need a representation of tie strength, or just communication frequency? Communication frequency is much more explicit and easier to obtain than tie strength, but for a particular application tie strength might be the right level of abstraction). Is the target numerical? Nominal?

2. Deciding the transformation mechanism: Is the transformation going to be machine learning-based? Rule-based? A mathematical transformation? Part of this step will depend on the resources available to the developer. Does the developer know how to apply machine learning? Does the developer have a way of collecting the ground truth data that will be necessary to train machine learning models?

3. Assembling the input for the transformation: This step involves preparing the source data. One aspect of this step is strongly related to how the data was collected: is it in a format where it is easy to prepare for input, or does it require additional processing? If the transformation involves machine learning, the developer needs to determine the feature set and calculate the features. The developer must also consider what will happen for radically different inputs (e.g. if there is no data available from a particular data source for a particular user, or if the data is too sparse or covers too short a period of time for a particular user).

4. Collecting training data: Having a good dataset is key to developing a good machine learning algorithm. In the case of personal data, it can be very difficult to assemble that data: it requires collecting personal data from many users, trying to broadly cover the spectrum of possible inputs in order to produce robust models.

5. Collecting labeled ground truth: Related to collecting the training data, the developer must have labels for that training data in order to construct models. However, unlike many other machine learning problems, the effort of providing these ground truth labels cannot necessarily be shifted to paid laborers (e.g. crowd workers). Instead, with personal data, the user often must label their own data, because they are the only person who knows what the label is. For example, in the tie strength and sharing models of chapter 3, the only person who could possibly answer is the user.

The transformation step is again a complex step that will have implications for how data will be applied in the last step, and also for how easy it will be to maintain the code and implement changes in the future.
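As a minimal illustration of steps 1 and 2, the Java sketch below chooses a nominal target (a coarse tie-strength label), uses a rule-based transformation rather than machine learning, and includes a contingency for the sparse-data case raised in step 3. The thresholds are invented for illustration and are not values from the studies in chapter 3.

public class TieStrengthRule {
    enum Tie { STRONG, WEAK, UNKNOWN }

    /**
     * Rule-based transformation from a communication-frequency feature
     * to a nominal tie-strength label.
     */
    static Tie classify(double callsPerMonth, int monthsOfData) {
        if (monthsOfData < 3) {
            // Contingency: too little history to transform reliably.
            return Tie.UNKNOWN;
        }
        return callsPerMonth >= 4.0 ? Tie.STRONG : Tie.WEAK;
    }
}

A machine-learning-based mechanism would replace the hand-written threshold with a trained model, which is exactly where the training-data and ground-truth steps above come into play.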
4.3.3 Using the Data

With the data transformed, the final step of the process is to actually apply the newly transformed data in the application. Again, this seems like it should be straightforward, but again there is the potential for complexity:

1. Integrating the data into the application: Figuring out how to present the data to the user requires consideration. Will the user be able to understand the transformed data? Do they need to understand it? Will users feel that a particular transformation is sensitive or invasive in some way? If the transformation involved some uncertainty, how is that uncertainty handled by the application and/or represented to the user? Does the system simply display the data to the user, or does it personalize or automate some behavior based on the data?

2. Handling incorrect inferences: How does the application handle incorrect inferences? Does it allow the user to correct them? Are these changes stored? Are the changes used to retrain the model? (A sketch of one way to handle corrections follows this list.)

3. Offering transparency and control to the user: How is the resulting data used in the application? Does the user have the ability to know how their data is being used? Is the data being shared with third parties? Is there a data retention policy? Can the user change, hide, or remove data? Can the user change the application behavior that is associated with the underlying data?
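The following minimal Java sketch (hypothetical names, not from any system in this thesis) shows one simple answer to the questions in step 2: user corrections are stored as explicit labels that override the model's inference and can later serve as retraining data.

import java.util.HashMap;
import java.util.Map;

public class CorrectionStore {
    // Explicit, user-provided labels, keyed by the contact they describe.
    private final Map<String, String> userLabels = new HashMap<>();

    /** Store a user's correction so it persists and can feed retraining. */
    public void recordCorrection(String contactId, String correctedLabel) {
        userLabels.put(contactId, correctedLabel);
    }

    /** Prefer the user's explicit label over the model's inference. */
    public String resolve(String contactId, String inferredLabel) {
        return userLabels.getOrDefault(contactId, inferredLabel);
    }
}

Even this simple policy raises the transparency questions in step 3: the user should be able to see which labels they have provided, and to remove them.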
4.4 The Conceptual Framework

Together, the continuum of personal data and the steps that are required to incorporate personal data in an application, which have been described above, form the components of a conceptual framework (Figure 14). This framework captures and makes explicit the otherwise implicit realities of applying personal data to an application. The framework serves as a boundary object to support reflection and discourse on the process of working with personal data. One thing that this framework makes particularly salient is the amount of effort that is currently required for a developer to incorporate personal data into an application.

In some ways, this framework looks similar to a more general data analytics pipeline; however, many of the specifics are notably distinct between data analytics in general and personal data in particular. For example, the pipeline described by Fisher, DeLine, Czerwinski, and Drucker (2012) defines five steps that describe the process of working with big data: acquire data, choose architecture, shape data into architecture, code/debug, and reflect. The first three of these steps map to the first step described here (collecting the data), code/debug corresponds to the second step of transforming the data, and reflect is one way of completing the third step of applying the data. However, while these steps correspond abstractly, the specifics of the two processes have important differences that make them distinct.

Acquiring a big data dataset involves identifying an existing dataset (e.g. from an online repository). This is a static step: it happens once. When developing with personal data, the data that is being acquired is specific to each user. Thus, the data is not acquired all at once; it is instead acquired at application runtime for each individual user, and is typically continuously collected over time. Individual users can either grant or deny access to their data. Furthermore, collecting data oftentimes requires collecting hand-labeled ground truth from users, which is completely outside the requirements of a big data pipeline. Other aspects of this step are more similar, such as linking together data from different data sources and working with different schemas.

Coding and debugging with big data is largely focused on issues of scale (e.g. writing parallelizable code and abstracting away the cloud). There are other tradeoffs in this step as well, such as the tradeoff between doing manual operations on the data versus scripting. When developing a personal data application, the challenges are very different. Developers need to consider what transformation is taking place on the data. They do not have the ability to access each user's data while developing the transformation. Instead, they need to prepare for contingencies ahead of time (e.g. how will the transformation behave with small amounts of data, large amounts of data, sparse data, dense data, etc.). Another important dimension of personal data is that the same behavior can mean different things for different users. Will the developer support user-specific transformations (e.g. personalized models)?

Finally, the step of reflecting with big data is a discrete step, iterative with the step of coding and debugging, in which analysts reflect on transformed big data in order to build insights into that data. By contrast, personal data is applied directly in an application that faces the end user, who is the subject of that data. Even in the case that the data is being used to support the user in reflecting on her own data, there are differences in this step (e.g. this data is personally meaningful, and the user can identify errors and fill in holes in the data). Beyond reflection, personal data can be used to enable new applications or support personalization, which goes beyond the structure of a traditional data analytics pipeline. The personal nature of this data has implications for the way that users relate to the data. Overall, personal data has many differences from a traditional data analytics pipeline, and two main differences stick out in particular. First, with traditional data analytics it is possible to follow the process with a series of manual steps: the start-to-end pipeline does not need to be fully automated. Second, the individually meaningful nature of personal data has the potential to impact all steps of the personal data framework.

4.5 Design Goals

I have synthesized the insights and challenges throughout this document to establish a set of design goals aimed at improving the state of personal data. These goals are a combination of insights gathered from the broader landscape of personal data work within the research community from chapter 2, the case study of my own experience developing with personal data from chapter 3, and reflection on the current state of personal data and the process of working with it from chapter 4.

4.5.1 Minimize redundant effort required of developers

This is a very broad goal that should be further broken down to highlight the many different places where redundant effort is currently required:

• Authenticating to multiple APIs
• Working with non-standard data formats
• Gathering or collecting the data
• Cleaning the data
• Linking data together so that it is easy to query
• Developing and testing useful abstractions or inference models to transform the data so that it can be used in an application

These are all tasks that could be simplified or completely removed from the responsibility of an individual application developer. This will make the development process easier and significantly lower the bar for innovating in this space. End users will benefit from more services, better services, and less effort required to fix their own data. This design goal is directly inspired by chapter 3: the entire project would have been much more straightforward without all of the complexity involved in bringing the data together.

Developing inferences can be a major barrier for developers. Even if the developer is skilled at applying machine learning (e.g. extracting features, selecting an algorithm, tuning parameters), these tasks still require time and effort (Patel et al., 2010). Furthermore, collecting training data and ground truth labels can be an even more difficult challenge. Essentially, to address this goal, the process of developing a reliable model should be made separate from the process of deploying that model in a target application.
One component of addressing this design goal is to have a set of inferences that are general enough that they can be used across multiple applications. For example, there are a number of ways that tie strength could be applied across a variety of applications.

4.5.2 Organize data by individual, not by service

Today, data is siloed in each application or service. This makes it easy to build applications that are based on their own data, but makes it much more difficult to offer an integrated and consistent user experience across multiple applications. Offering such an integrated experience should be quick and easy for developers. It is easy to see why personal data today is organized by application or service: there is value for a service in having this data across all of its users, and it is the simplest thing to do. Even if a service wanted to make it easy for users to store their data at the level of the individual, there is no infrastructure for this today. Where would this data be stored? Who would protect the data? Who would pay the costs associated with storing this data? This concept has been proposed before (Want et al., 2002), but the infrastructure for it is simply not there today. This is a major challenge across many of the research domains described in chapter 2, and the research projects in chapter 3 offer specific insights into how much of a challenge and a frustration this siloed approach is for developers and researchers. This goal also addresses the need cited in (Karger & Jones, 2006): personal information must be defragmented (i.e. unified and linked together) in order for individuals to realize the full potential of their data.

4.5.3 Support connections within the data

It should be easy to access data that is related to a particular piece of data, and easy to jump between related pieces of data. For example, a piece of information such as the most recent phone call can have connections to other items that overlap in time, such as a calendar appointment, or where the user was when she made the call. It can also be related to whom the call was with. There are many cases where these rich interconnections within the data are particularly useful. In the case of the tie strength model in chapter 3, rich interconnections in the data would have simplified the process of calculating the features for the model. Well-connected data dramatically simplifies applications. It is also useful for episodic memory queries (e.g. "who did I call last time I was in San Francisco?"), personal-informatics-style data exploration (e.g. "do I spend more time on the phone when I'm at home or travelling?"), specifying complex rules (e.g. in end-user programming environments), and applications such as Autobiographical Authentication (Das, Hayashi, & Hong, 2013).

4.5.4 Limit unnecessary disclosure

One way of minimizing opportunities for personal data to be exploited is to follow the privacy maxim of limiting unnecessary disclosure (Romanosky, Acquisti, Hong, Cranor, & Friedman, 2006). In the ideal world, the only data that an application would access would be the data that it needs. For example, if an application only needs to know how many phone calls the user has made over a certain period of time, the application should not have access to the individual phone call logs, only to the count of logs over the specified time period.
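The following minimal Java sketch (a hypothetical interface, not Phenom's actual API) illustrates this idea: the application is given only the aggregate it needs, while the raw call logs stay inside the service.

public interface CallStatistics {
    /**
     * Returns only the number of calls between fromMs and toMs;
     * the individual call log entries are never exposed to the caller.
     */
    int countCalls(long fromMs, long toMs);
}

An application granted access to this interface alone can compute what it needs without ever being able to enumerate whom the user called.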
If the system can guarantee that only a specific set of data was accessed, it will be better able to support the user in controlling and limiting data access.

4.5.5 Offer users transparency

Offer as much transparency as possible when it comes to what personal data is used, specifically how that data is used, what (if anything) is stored, and what is shared or transmitted. One way of accomplishing this is by giving examples that demonstrate what can be done. Today, many privacy policies include language that is so general that it hardly communicates anything at all. Part of this design goal is to be as specific and clear as possible.

4.5.6 Offer users choices and control, while specifying reasonable defaults

Hand-in-hand with transparency, systems that handle personal data should also offer users choices and control. In many cases, personal data is a component of the economics that enable a particular service to operate (often through revenue generated from behavioral advertising), and a user's privacy decision might affect the viability of that service. In these cases, a combination of transparency (e.g. letting the user know that the service can be offered free or subsidized because of access to this data) and choices (e.g. the user could pay for the service instead of providing their data) offers more flexibility to the concerned consumer. There is also room to innovate directly in the space of the privacy and sharing mechanisms that are provided. Chapter 2 offers several examples of this, in the form of expressing preferences that depend on "in common" information (Wiese, Kelley, et al., 2011), or easier ways of partitioning the target audience (Sleeper et al., 2013). A broad and growing literature makes it clear that designing privacy and sharing controls is incredibly difficult, and in many cases people do not even understand what privacy settings mean (Kelley et al., 2012; Liu, Gummadi, Krishnamurthy, & Mislove, 2011). This is far from a solved problem. Thus, it is not enough to offer control; services should also specify reasonable defaults. Finally, another component of this design goal is to offer users the ability to improve the service they are receiving (e.g. by providing ground truth or resolving duplicates).

5 Phenom: A Service for Unified Personal Data

Approaching the ecosystem of personal data from a user-centered perspective represents a significant shift from how personal data is handled today. Chapter 4 offered insights into how personal data is handled today, some of the issues with the current state of personal data, and design goals for improving this state. Achieving these goals is a long-term agenda that will require buy-in from many different stakeholders. There are many opportunities for improving the ecosystem of personal data, and there are many prospects for employing personal data to improve the way that users interact with their technology. This chapter describes the design and implementation of Phenom, a prototype service for managing personal data. Phenom incorporates several key ideas that represent a significant advance in the way that personal data is handled. While many issues remain to be addressed before the vision set forth in chapter 4 is achieved, Phenom is a proof of concept that represents an important step towards this goal.
The name Phenom comes from phenomenology, a field of study which "set out to explore how people experience the world – how we progress from sense-impressions of the world to understandings and meanings" (Dourish, 2001).

5.1 System Architecture

The main philosophy behind Phenom is that personal data is managed centrally in a single application-agnostic service, rather than in each independent application or service. This approach offers several key benefits:

1. A single API for an application developer to work with. The developer only has to authenticate the user to a single API, and the developer only has to work with a single format for representing the data.

2. Linking data together, correcting bad data or mistakes, and removing duplicates can all be done once, and the results will be reflected everywhere.

3. Inferences and models can be developed and improved centrally, and the benefits can be had by all applications.

4. Operations on personal data can happen within the service, making it easier to constrain what data a client application has access to.

5. A user can specify privacy preferences in a centralized service with a familiar user interface, rather than in each client application.

Figure 15: A system diagram for Phenom illustrating its different components. The Epistenet data store serves as a semantic knowledge base of personal data. Data providers bring personal data in from external data sources. Bots operate on the data contained within the datastore to generate inferences and abstractions. A unified querying API provides application developers with a single query interface to access the richly interconnected personal data from the datastore.

At a high level, the Phenom architecture is composed of several key components:

• Data providers are responsible for connecting to a data source, retrieving new data from that source, and storing it in the internal datastore.

• The data store contains the rich interconnected data that has been brought in from the providers. The datastore contains objects of many different types, with attributes that can include references to other objects. Object types are defined in a semantic tree, where the children of a type contain all of the attributes of their parent (but may contain additional attributes as well). The data store also contains inference data and ground truth data.

• Bots perform operations on the data in the semantic data store, for example simple "housekeeping" operations, model-based inferences, and heuristic-based inferences.

• The API offers the flexibility of SQL-like queries over the personal data store, and is particularly focused on the needs associated with querying personal data, such as connecting across multiple data types and working with timestamps and aggregates. The Java API offers a simplified abstraction over the much more complicated structure of the underlying datastore while preserving query flexibility.

The remainder of this section offers a more in-depth discussion of the components that together form Phenom.

5.1.1 Epistenet: A Semantic Data Store

Epistenet, the semantic data store, is a core component of Phenom [XXX tech report]. Epistenet is a system that I co-developed with Sauvik Das, who led the development of this component. One major challenge when developing with personal data is that data from different sources are completely separate.
For example, different data sources use different schemas. To address this issue, Epistenet offers a unified internal format for storing personal data with a source-agnostic schema, support for connections between different objects, and a hierarchical ontology that specifies subsumption relationships between different data types.

To enable this source-agnostic schema and rich interconnections between data, every piece of personal data in Epistenet is represented as an object with some number of attributes associated with that object. For example, a PhoneCall is one type of personal data captured by Phenom. Epistenet represents the PhoneCall object with the following attributes: Direction (incoming or outgoing), Duration, and AlterAddress (phone number).

The underlying data schema of this implementation is very flexible. Data is stored in an SQLite database where EpistenetObjects are stored in one table, and all Attributes are stored in a table that contains the name of the Attribute, a reference to the EpistenetObject that it is associated with, and the value. This simple database schema offers a flexible framework for objects to be represented within Epistenet.
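The sketch below illustrates this two-table design as it might be created on Android's SQLite. The table and column names here are assumptions made for illustration; the actual Epistenet schema may differ in its details.

import android.database.sqlite.SQLiteDatabase;

public class SchemaSketch {
    static void createTables(SQLiteDatabase db) {
        // One row per piece of personal data (an EpistenetObject).
        db.execSQL("CREATE TABLE objects ("
                + "id INTEGER PRIMARY KEY)");
        // One row per attribute: the attribute's name, a reference back to
        // the object it describes, and the value itself.
        db.execSQL("CREATE TABLE attributes ("
                + "object_id INTEGER REFERENCES objects(id), "
                + "name TEXT, "
                + "value TEXT)");
    }
}

Because every attribute lives in one generic table, a new object type can be added without any schema migration: it is just a new set of attribute names.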
Another issue for working with personal data is that data from different sources may be interconnected in many ways, but because the data comes from different sources, those interconnections are difficult to leverage. Sometimes these interconnections are across semantically disjoint data (e.g. a user's presence at a physical location and a cell phone call might be connected through their timestamps, even though the types of data are completely different). In other cases, types of data have a more direct semantic relationship (e.g. phone calls and SMS messages are both types of communication).

Epistenet also has infrastructure to support these kinds of connections. Each EpistenetObject is associated with an OntologyClass that identifies its type. PhoneCall is one example of an OntologyClass, so every EpistenetObject that represents a log of a phone call is associated with the PhoneCall ontology class. Ontology classes are key to a very important feature of Epistenet: the ability to maintain connections between data that is semantically related. Continuing the phone call example, a related type of data is a text message. Text messages and phone calls are both types of communication, and because of this similarity they share some attributes in common, such as Direction and AlterAddress, but not others, like Duration. To capture this semantic relationship, Epistenet has a hierarchical representation of ontology classes. In the example, both the PhoneCall and SMSMessage ontology classes are children of the Communication ontology class (see Figure 16). The Communication ontology class defines the common attributes of Direction and AlterAddress; the PhoneCall and SMSMessage ontology classes inherit those attributes, and can define their own additional attributes as well.

Figure 16: An example of an ontology in Epistenet. Directed edges in this graph refer to "subsumptive" relationships. So, a PhoneCall is a type of Communication. Attributes of a parent ontology class are also contained in the descendants of that ontology class.

As a result, a query to Epistenet for objects of a particular OntologyClass can specify whether the objects that are returned should only be the objects that are concretely associated with that OntologyClass (e.g. only objects that are explicitly identified as Communication, not PhoneCall and SMSMessage, which are children of the Communication ontology class), or whether it should also include objects whose concrete type is a child OntologyClass (e.g. a query for the Communication ontology class would also return objects that were defined as PhoneCalls or SMSMessages). Epistenet refers to these relationships as identity (i.e. only objects that are the concrete type specified in the query) and subsumption (i.e. all objects in the ontology subtree of the specified type).

This semantic linkage is powerful. Inserting a PhoneCall object into the datastore means that, automatically through the ontology relationship, it is also represented as a Communication object, and it ties the object to a CommunicationHandle (through the AlterAddress), which is linked to a Contact ontology class object, which is subsumed by the Person ontology class. This structure allows for very rich, flexible queries, enabling a single query to automatically incorporate the data from different but semantically related data types. For example, it would be very simple to query for a list of the 10 contacts that a user had communicated with most recently across all communication media (section 5.1.3 describes the specifics of executing queries such as these using the API).

Defining a new Ontology Class

The following steps document the process of defining a new ontology class within Epistenet:

1. Determine where in the ontology the new class should be added. For example, if we are adding a provider for phone call logs, the place to put the PhoneCall ontology class would be Communication → PhoneCall. An ontology class could also be at the root.

2. Declare the new ontology class in the ontology.config file. This includes specifying the name and a version number (here it is 2). This file also specifies the ontology:

   SMSMessage,1
   + Phonecall,2
   …
   …
   Communication,Textbased,1
   Textbased,Email,1
   + Communication,Phonecall,2

3. Declare a new class that includes the attributes that will be associated with the Phonecall ontology class in the ontologyclasses namespace:

   public class Phonecall extends Communication {
       public static final Attribute DURATION =
               new Attribute("Duration", AttributeValueType.INTEGER);

       public static Attribute[] getAttributes() {
           return ArrayUtils.addAll(Communication.getAttributes(),
                   new Attribute[]{DURATION});
       }
   }

4. Declare Phonecall in the OntologyClass.java enum. The number just needs to be unique within the enum:

   Phonecall("Phonecall", 6) {
       @Override
       public Attribute[] attributes() {
           return Phonecall.getAttributes();
       }
   },

After completing these steps, a new Phonecall ontology class will exist in Epistenet with all of the attributes associated with the Communication ontology class, in addition to a "Duration" attribute.

Reference Attributes

The example above offers a view into the general process for defining new OntologyClass types, but there are a few more details involved when defining some kinds of attributes. Many attributes are similar to "Duration" in the example above: they represent a basic value such as Integer, Double, Timestamp, or String.
However, some attributes actually reference an Epistenet object. These are ReferencesAttributes. In the phone call example, one example of a ReferencesAttribute is the phone number. The reason this attribute did not appear in the Phonecall class definition above is that it is inherited from the Communication ontology class. In fact, this is even more complicated because at the level of the Communication ontology class, the identifier isn't necessarily a phone number: it could also be an email address or a screen name. Thus, the definition for Communication includes this:

    public static final ReferencesAttribute ALTER_ADDRESS =
        new ReferencesAttribute("AlterAddress", OntologyClass.CommunicationHandle);

This "AlterAddress" ReferencesAttribute in Communication references a CommunicationHandle ontology class, which has the ontology class children PhoneNumber and EmailAddress. This preserves the overall structure of the Communication ontology class (i.e. communication happens with other people), but also maintains the differences between different kinds of handles (i.e. phone numbers and email addresses).

The Communication ontology class also includes a reference to the Person ontology class. In this case, however, it is not possible to use a ReferencesAttribute, because the communication is only indirectly tied to a person through the communication handle. Instead, the IndirectReferencesAttribute is a sort of ghost reference that references the target ontology class (i.e. Person) through an attribute of an intermediate ontology class to which this ontology class has a direct reference (i.e. the Person attribute of the AlterAddress, which is a direct reference from Communication):

    public static final IndirectReferencesAttribute PERSON =
        new IndirectReferencesAttribute(
            "Person",
            Communication.ALTER_ADDRESS,
            CommunicationHandle.PERSON);

One issue with ReferencesAttribute and IndirectReferencesAttribute is that they are unidirectional. In this case, the PhoneCall ontology class has attributes referencing AlterAddress and Person, but Person does not have an attribute referencing PhoneCall. ReverseReferencesAttribute solves this problem by using the existing attribute connection to make the relationship bi-directional. For example, the following attribute is defined in the Person ontology class:

    public static final ReverseReferencesAttribute PHONE_CALLS =
        new ReverseReferencesAttribute(
            "PhoneCalls",
            OntologyClass.Phonecall,
            Phonecall.PERSON);

This is a sort of convenience attribute: no additional data is stored in Epistenet, but when querying the Phenom API, a ReverseReferencesAttribute behaves the same as a ReferencesAttribute by reversing the direction of the original ReferencesAttribute.

Together, these mechanisms make it very easy to interact with the data in Epistenet. For example, perhaps when a phone call is initially recorded, Phenom does not know to whom the phone number belongs. With this approach, when the connection is later made between the phone number and the person, all of the data is instantly updated for free.

5.1.2 Data Providers

Today, working with personal data means that any application that wants to use the data from a data source needs to individually connect to all of the data sources to access that data. This process is often painful and involves a fair amount of boilerplate code. Phenom removes this responsibility from each developer by doing this work only once.
Data providers are the component of Phenom that brings in raw personal data from any external source—for example, system content providers such as SMS logs, hardware sensors such as the accelerometer, and third-party applications such as WhatsApp. The call log, for example, is a data provider that contributes objects of the "PhoneCall" ontology class, while a web browser would contribute objects of the "SiteVisit" ontology class. Data providers are responsible for connecting to the external data source, mapping the data from that source to an ontology class within Phenom, and avoiding creating duplicate data. Data providers are polled at configurable intervals so that they can aggregate new data. While providers themselves are not a novel contribution, they are an essential component of Phenom.

Defining a New Provider

Creating a data provider involves only a small amount of overhead beyond the boilerplate code that is required to query the data source and extract data from it.

1. Add a line to providers.config with a name for the provider, the number of milliseconds between polls, and the Java classname for the provider:

    …
    sms_logs,14400000,SMSLogsProvider
    + call_logs,14400000,CallLogsProvider
    …

2. Implement the CallLogsProvider class in the providers namespace. The key aspect of implementing the provider is implementing the poll method:

    public class CallLogsProvider extends Provider {
        public static final long PERSISTENCE = Long.MAX_VALUE;
        public static final String PERMISSION = "android.permission.READ_CALL_LOG";
        public static final String PROVIDER_NAME = "call_logs";

        public void poll() {
            …

Within this method, the main steps are to:

a. Poll the data source for new data:

    EpistenetAdapter adapter = this.getAdapter();
    long lastUpdated = adapter.getLastUpdatedTimeForProvider(
        this.getProviderName());
    Cursor c = this.mContext.getContentResolver().query(
        CallLog.Calls.CONTENT_URI,
        new String[] { Calls.DATE, Calls.NUMBER, Calls.DURATION,
                       Calls.TYPE, Calls.CACHED_NAME },
        Calls.DATE + " > " + String.valueOf(lastUpdated),
        null, null);

b. Cycle through the new data to create new objects:

    if (c != null && c.getCount() > 0) {
        while (c.moveToNext()) {
            long oid = adapter.createObject(PERSISTENCE);

c. Associate all of the relevant attributes with each object:

    adapter.createAttribute(
        Phonecall.DURATION,
        c.getString(c.getColumnIndex(Calls.DURATION)), oid);
    adapter.createAttribute(
        Phonecall.DIRECTION,
        this.getStringifiedType(
            c.getInt(c.getColumnIndex(Calls.TYPE))), oid);
    adapter.createAttribute(
        Phonecall.ALTER_NAME,
        UtilityFuncs.coalesceString(
            c.getString(c.getColumnIndex(Calls.CACHED_NAME)), "Unknown"), oid);

    // Code to create the phone number object if necessary
    long[] numberCreated = adapter.createObjectIfDoesNotExist(
        new String[]{ PhoneNumber.HANDLE.getSelectName() },
        new String[]{ UtilityFuncs.formatPhoneNumber(
            c.getString(c.getColumnIndex(Calls.NUMBER))) },
        -1);
    if (numberCreated[0] > 0) {
        adapter.createAttribute(PhoneNumber.HANDLE,
            UtilityFuncs.formatPhoneNumber(
                c.getString(c.getColumnIndex(Calls.NUMBER))),
            numberCreated[1]);
        adapter.createObjectOntologyLink(numberCreated[1],
            adapter.getIDsForOntologyClassNames(
                OntologyClass.PhoneNumber.className()).get(
                    OntologyClass.PhoneNumber.className()));
    }
    adapter.createAttribute(
        Phonecall.ALTER_ADDRESS,
        String.valueOf(numberCreated[1]), oid);
d. Finally, add in the meta attribute, link the object to the rest of the ontology, and release resources:

            long metaAttribute = this.createMetaAttribute(oid,
                c.getLong(c.getColumnIndex(Calls.DATE)));
            this.createObjectOntologyLinks(oid);
        }
    }
    this.closeCursor(c);

Omitting a few accessors and utility methods, this is all that is required to create the provider for phone logs. The flexibility of this approach allows much more complex providers to be implemented if necessary.

5.1.3 API for Querying Unified Personal Data

For Phenom to be effective, it needs to enable developers to access the rich, interconnected personal data that it contains in a simple and flexible manner. This requires designing and deploying an API that supports the ontological structure and flexible attribute schema of the rest of Phenom, and that provides a single unified environment for accessing objects of an arbitrary ontology class.

The Phenom API provides a unified interface for accessing the personal data stored within Phenom. It is accessible to client applications through a lightweight Android Library Project. Client applications use the library to query the API; the library binds to the Phenom service, which runs in a separate process on the phone. Results are then returned to the client application through a callback mechanism.

Specifying a query

Specifying a query to Phenom requires defining one or more Filter objects. A Filter can be very simple. For example, the following Filter specifies that the duration field should be returned for all phone calls:

    Filter phonecallFilter = new Filter(OntologyClass.Phonecall)
        .projection(Phonecall.DURATION);

While filters can also be more complex than this, the basic idea is the same. To create a Filter, specify the OntologyClass whose objects the Filter should return. After this, the Filter object behaves like a builder. Options for the filter include:

• Simple constraints on the OntologyClass's attributes (e.g. lessThan, greaterThan, equal, notEqual, inSet, notInSet, inRange)
• Limit the number of results returned
• Specify the sort order based on an attribute
• Specify an SQL-style "group by" on an Attribute
• Specify the projection of attributes to be returned; attributes not explicitly included here will not be returned
• Constrain through join: essentially allowing a compound query to be specified based on a ReferencesAttribute that connects this OntologyClass with a different OntologyClass

In all of these cases, when an Attribute is required as a parameter, any attribute can be used. In particular, Phenom provides support for three additional non-concrete attribute types not previously discussed: AggregateAttribute, TimepartAttribute, and ReferencesAggregateAttribute.

AggregateAttributes perform the same behavior as SQL aggregates (i.e. sum, average, count, min, max, and group concat). An AggregateAttribute must be based on a concrete attribute, and can be obtained by calling the asAggregate() function of Attribute. For example:

    Phonecall.DURATION.asAggregate(AggregateType.SUM);

The resulting AggregateAttribute can be used anywhere a concrete attribute would normally be used. It is important to note that the "group by" for a filter using aggregates should be specified; otherwise the ID attribute is used as the grouping attribute by default.
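To illustrate (an example of ours, assuming a groupBy builder method corresponding to the "group by" option listed above), the following Filter would count the phone calls associated with each communication handle:

    Filter callCounts = new Filter(OntologyClass.Phonecall)
        .projection(Phonecall.ALTER_ADDRESS,
                    Phonecall.ID.asAggregate(AggregateType.COUNT))
        .groupBy(Phonecall.ALTER_ADDRESS);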
TimepartAttributes make it easy to extract information from a timestamp and use it within a query. For example, this attribute extracts the year number and month number from a timestamp and returns them in a single attribute:

    PlaceVisit.TIMESTAMP.asTimePart(TimePart.YEAR, TimePart.MONTH)

It is also possible to get an AggregateAttribute of a TimepartAttribute, which allows for easy querying of aggregated statistics based on time. For example, the following attribute would give the most recent month and year of a PlaceVisit:

    PlaceVisit.TIMESTAMP.asTimePart(TimePart.YEAR, TimePart.MONTH)
        .asAggregate(AggregateType.MAX);

Finally, ReferencesAggregateAttribute makes it easy to get aggregate information about the attributes of an object referenced by a ReferencesAttribute. For example, Person.PHONE_CALLS.getReferencesAggregate(Phonecall.DURATION.asAggregate(AggregateType.SUM)) yields, for each person, the total duration of phone calls with that person; the tie strength bot in section 5.1.4 uses exactly this construction.

Together, this simple query interface enables a rich set of queries to be made to Phenom. For example, the following query returns statistics about the 10 places where the user has spent the largest amount of time, including:

• latitude and longitude
• the total amount of time spent there
• the average length of a stay
• the most recent year and month that the user was there

    Filter placeVisitsFilter = new Filter(OntologyClass.PlaceVisit)
        .projection(
            PlaceVisit.DURATION.asAggregate(AggregateType.SUM),
            PlaceVisit.DURATION.asAggregate(AggregateType.AVERAGE),
            PlaceVisit.TIMESTAMP.asTimePart(TimePart.YEAR, TimePart.MONTH)
                .asAggregate(AggregateType.MAX))
        .orderBy(PlaceVisit.DURATION.asAggregate(AggregateType.SUM), false);

    Filter placesFilter = new Filter(OntologyClass.Place)
        .projection(Place.LATITUDE, Place.LONGITUDE)
        .constrainThroughJoin(placeVisitsFilter, Place.VISITS)
        .limit(10);

This is a prime example of a query that would be much more difficult, or even impossible, to execute without Phenom.

Handling a Query Result

Queries to Phenom are executed asynchronously, so handling a result from Phenom involves implementing a callback. It is easy to implement this callback as an anonymous class, similar to the way that click handlers are often implemented:

    mApiClient.sendQuery(placesFilter, new PhenomCallback() {
        @Override
        public void onSuccess(ArrayList<PhenomObject> objs) {
            processResults(objs);
        }
    });

Queries are returned as a list of PhenomObjects, which offers a basic structure for representing query results. Essentially, each PhenomObject represents an object of the OntologyClass specified in the query. Attributes of the ontology class that were specified in the Filter's projection can be accessed by calling the get method for the corresponding type of the attribute.

5.1.4 Bots

The raw personal data gathered and stored within Phenom is useful in its own right, but often the real value of this aggregated personal data comes from additional processing or inferences done on top of the raw data. In some cases it makes sense for individual developers to do this additional processing, but in many cases multiple developers can make use of the same processing work. For example, as discussed in chapter 3, tie strength can be useful to a variety of applications. Bots are the component that offers this functionality in Phenom.

Bots carry out worker functionality on the Epistenet semantic data store, following a blackboard architecture where the Epistenet datastore is the blackboard. Bots are somewhat similar to Providers in that they are polled on a fixed schedule and do work on the contents of the semantic data store. However, instead of inserting new data into the datastore, bots operate on the existing data.
This can include maintenance tasks such as removing duplicated data or identifying connections across multiple kinds of data. However, the more exciting use of Bots is to generate inferences and abstractions based on the existing data within Epistenet. For example, the "home_labeler" bot uses some basic heuristics to label places that the user calls home or has called home in the past, and the "strong_tie" bot uses communication behavior to infer some of a user's strong ties and label those contacts as strong ties in Epistenet. In these examples, there is an Attribute associated with the Place and Contact ontology classes, respectively, for each of these inferences. When a bot has made an inference, it simply adds or updates the corresponding attribute.

Defining a new Bot

There are only a few steps required to create a new Bot.

1. Add a line to bots.config with a name for the bot, the number of milliseconds between polling times, the Java classname for the bot, and the version number of the config file in which this bot was added:

    …
    significant_places,86400000,SignificantPlaceBot,1
    + tie_strength_bot,86400000,TieStrengthBot,2

2. Implement the TieStrengthBot class in the bots namespace. The key aspect of implementing the bot is implementing the poll method:

    public class TieStrengthBot extends Bot {
        private static final String BOT_NAME = "tie_strength_bot";

        public TieStrengthBot(Context c) { super(c); }

        @Override
        public String getBotName() { return BOT_NAME; }

        @Override
        public String getPermission() { return "phenom.permissions.tie_strength"; }

        @Override
        public void poll() {
            …

Implementing this method depends on the specific functionality of the bot. In the case of the tie strength bot, the steps are to:

a. Query Epistenet for the relevant data:

    // A few aliases for readability
    ReferencesAggregateAttribute smsCount =
        Person.SMS_MESSAGES.getReferencesAggregate(
            SMSMessage.ID.asAggregate(AggregateType.COUNT));
    ReferencesAggregateAttribute callCount =
        Person.PHONE_CALLS.getReferencesAggregate(
            Phonecall.ID.asAggregate(AggregateType.COUNT));
    ReferencesAggregateAttribute callDuration =
        Person.PHONE_CALLS.getReferencesAggregate(
            Phonecall.DURATION.asAggregate(AggregateType.SUM));

    Filter personList = new Filter(OntologyClass.Person)
        .projection(Person.ID, Person.NAME, smsCount, callCount, callDuration);
    ArrayList<PhenomObject> allPeople =
        getAdapter().doPhenomQuery(personList);

b. Calculate tie strength based on the specified heuristics:

    int maxDur = 0;
    int maxCallCt = 0;
    int maxSMSCt = 0;
    for (PhenomObject person : allPeople) {
        maxSMSCt = Math.max(maxSMSCt, person.getInt(smsCount, 0));
        maxCallCt = Math.max(maxCallCt, person.getInt(callCount, 0));
        maxDur = Math.max(maxDur, person.getInt(callDuration, 0));
    }
    // Guard against division by zero when the user has no logged data
    maxDur = Math.max(maxDur, 1);
    maxCallCt = Math.max(maxCallCt, 1);
    maxSMSCt = Math.max(maxSMSCt, 1);
    for (PhenomObject person : allPeople) {
        // Cast to double so integer division does not truncate each ratio
        double closeness = ((double) person.getInt(callDuration, 0) / maxDur
            + (double) person.getInt(callCount, 0) / maxCallCt
            + (double) person.getInt(smsCount, 0) / maxSMSCt) / 3;
        getAdapter().createOrUpdateAttribute(Person.TIE_STRENGTH,
            Double.toString(closeness), person.getLong(Person.ID));
    }

In this case, creating a bot is as simple as that; there are no other steps. Of course, bots can be much more complex, for example actively retraining a model based on new data labels from the user.
5.2 Evaluation: Example Applications and Queries

This section offers two examples of applications that Phenom makes particularly simple, where previously they would have been much more complicated, perhaps impossible. These examples offer a basic evaluation that demonstrates the value Phenom provides.

5.2.1 Bootstrapping Users' Interests from Location Data

The first example is an approach intended to solve the "cold start" problem that occurs when a user begins using a new service that tries to personalize content within the application (e.g. a personalized news reading application). One innovative approach to solving this problem is to use the user's location history to identify significant places that the user has been. Specifically, a location history can be used to identify unique places that a user has visited, and how recently and how frequently the user has been there. With this information, it is possible to look up additional information about a location, like what type of a location it is and any identifying characteristics of the location (e.g. a user who frequents a rock gym is likely interested in climbing). This data can then be used to generate an interest profile for the user. Even if the results are only partially correct, this approach is still better than completely random data, or no data at all.

Phenom Implementation

One of the bots implemented in Phenom is a SignificantPlaces bot, which uses a few different heuristics to identify places that are significant to the user based on her location history. With the significant places bot implemented, this particular application is fairly straightforward in Phenom:

    mApiClient = new ApiClient(this);
    Filter placesFilter = new Filter(OntologyClass.Place)
        .projection(Place.LATITUDE, Place.LONGITUDE)
        .notEqual(Place.SIGNIFICANT_PLACE, "NULL");
    mApiClient.sendQuery(placesFilter, new PhenomCallback() {
        @Override
        public void onSuccess(ArrayList<PhenomObject> objs) {
            getTagsFromFlickr(objs);
        }
    });

Upon receiving the callback from Phenom, the application can cycle through the significant locations and query a third-party API for tags associated with those locations. In this example, the application connects to Flickr to retrieve the annotated tags from geotagged photos, but an implementation might also use data from Foursquare, Yelp, or Google Maps. Next, the application computes TF-IDF over the words from the tags; the result is a word vector that offers some clues to the user's interests, which should support article selection better than a random baseline.
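As an illustration of this TF-IDF step, the following sketch (ours, not part of Phenom) treats each place's tag list as one document and scores tags so that those distinctive to the user's places rank highest:

    import java.util.*;

    // Score each tag by TF-IDF across the per-place tag lists
    static Map<String, Double> tfidf(List<List<String>> tagDocs) {
        // Document frequency: in how many places' tag lists does each tag appear?
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : tagDocs)
            for (String tag : new HashSet<>(doc))
                docFreq.merge(tag, 1, Integer::sum);

        Map<String, Double> scores = new HashMap<>();
        int n = tagDocs.size();
        for (List<String> doc : tagDocs) {
            // Term frequency within this place's tag list
            Map<String, Integer> tf = new HashMap<>();
            for (String tag : doc)
                tf.merge(tag, 1, Integer::sum);
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        return scores; // the user's provisional interest vector
    }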
This is a great example of the value offered by Phenom: there were very few steps involved in getting the needed data in a usable format. Phenom obviates the need to write custom code to gather and store location data. Furthermore, it makes it easier to retrieve location data based on different qualities of the data points (e.g. a window of time, a particular city). This offers an abstraction similar to the one provided by modern GUI toolkits, which give developers a lot of support for developing the graphical interface portions of an application: each developer still needs to write the business logic and functionality of the application, but they do not need to be concerned with the specific implementation of the GUI components (e.g. standard appearance of widgets, event stream). Similarly, Phenom obviates the boilerplate code that each developer would otherwise need to write, raising the abstraction level to a point where developers can focus on the business logic and functionality specific to their application.

Non-Phenom Implementation

Without Phenom, the first step to implementing this example is to access enough of a user's location history to identify the user's significant places. Possibilities include:

1. Asking the user to upload their location history (e.g. from Google Location History, which provides users with that data but does not offer an API)
2. Collecting the data from the API of a service that the user already uses (e.g. Moves or Foursquare)
3. Collecting the user's location automatically within the application over a period of time, until enough data has been collected that significant places are salient

Each of these options has drawbacks. Options 1 and 2 are service-dependent in a way that excludes users who do not use those services. Option 3 includes all users, but requires running on the user's device long enough to accumulate the data from which significant places can be determined. Developers who use Phenom are not exposed to this challenge, because the data has already been brought together through data providers, which can cover all three of these alternatives in a way that supports reuse across applications. Without Phenom, however, a developer has to cobble together a solution that is likely to exclude more users from the feature.

In this case, the goal was to eliminate the cold-start problem, so option 3 does not work. From a development perspective, option 2 seems easier than option 1, though it has the drawback of only collecting location check-ins rather than all location data. First is pseudocode for accessing a user's checkins:

    Intent intent = FoursquareOAuth.getConnectIntent(context, CLIENT_ID);
    startActivityForResult(intent, REQUEST_CODE_FSQ_CONNECT);
    …
    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        switch (requestCode) {
            case REQUEST_CODE_FSQ_CONNECT:
                AuthCodeResponse codeResponse =
                    FoursquareOAuth.getAuthCodeFromResult(resultCode, data);
                Intent tokenIntent = FoursquareOAuth.getTokenExchangeIntent(
                    context, CLIENT_ID, CLIENT_SECRET, codeResponse.getCode());
                startActivityForResult(tokenIntent,
                    REQUEST_CODE_FSQ_TOKEN_EXCHANGE);
                break;
            case REQUEST_CODE_FSQ_TOKEN_EXCHANGE:
                AccessTokenResponse tokenResponse =
                    FoursquareOAuth.getTokenFromResult(resultCode, data);
                checkins = retrieveCheckinData(tokenResponse.getAccessToken());
                break;
        }
    }
    …
    private Checkin[] retrieveCheckinData(String accessToken) {
        FoursquareApi api = new FoursquareApi(
            "ClientID", "ClientSecret", "CallbackURL",
            accessToken, new IOHandler());
        Result<CheckinGroup> result = api.usersCheckins(
            null, 1000, 0, Long.MIN_VALUE, Long.MAX_VALUE);
        return result.getResult().getItems();
    }

At this point, we have access to the user's checkins. The next step is to process the checkins in a way that surfaces "significant places". Where developers using Phenom can make use of the existing "significant places" bot, here the developer would need to determine those significant places independently. For simplicity, significant places here can be the user's most frequently checked-in places.
We might also want other information to be included here, such as the places where the user has spent the longest duration; however, because we are using checkins, this information is not available. Pseudocode for counting visit frequency follows:

    HashMap<Location, Integer> visitCount = new HashMap<>();
    for (Checkin c : checkins) {
        Integer count = visitCount.get(c.getLocation());
        if (count == null) count = 0;
        visitCount.put(c.getLocation(), ++count);
    }
    visitCount.sortByValue(); // Implemented elsewhere
    getTagsFromFlickr(visitCount);

Comparing Implementations

Even with this relatively basic task, these two implementations demonstrate several ways that Phenom offers value compared to the non-Phenom implementation. The most obvious difference is the number of lines of code: notably fewer for Phenom. This is possible because the code for gathering and storing locations, as well as for calculating significant places, is code that many applications can share.

Even more value for the developer comes from the modularity behind Phenom. Specifically, the Phenom-based implementation above will instantly be able to take advantage of any improvements made to earlier parts of the process without changing any lines of code (e.g. collecting data from more sources, automatically collecting location data even before this application was installed, an improved algorithm for detecting significant places, or user-provided ground truth on which places are or are not significant). This means that the developer could deploy her application and not need to make any changes in order to receive these benefits.

By contrast, for the non-Phenom implementation, the developer had to choose which data to include in the process. Adding another data source to the existing one means more coding for the developer, both for accessing the new data and for integrating it; adding two new data sources is twice as much work. Furthermore, the non-Phenom implementation is unlikely to receive corrections to significant place labels from the user. If the developer wanted this information, she would need to implement a mechanism for the user to provide it. Even then, the likelihood of a user providing feedback for use in a single application seems low.

One drawback to the existing implementation of Phenom is that the codebase (i.e. for providers, bots, and the schema) is managed centrally: there is no immediate mechanism for a developer to add a new data provider or to contribute her own bot. The most immediate way to address this is to run Phenom as an open source project, where individual developers could submit pull requests for the changes that they would like to make.

5.2.2 Ordering Contacts Based On Tie Strength

The next example again demonstrates something that would require many more steps to complete without the assistance of Phenom. In this example, we make use of the TieStrengthBot described earlier in this chapter.
    Filter contactTieStrengthFilter = new Filter(OntologyClass.Person)
        .projection(Person.NAME, Person.TIE_STRENGTH)
        .orderBy(Person.TIE_STRENGTH, false);
    mApiClient.sendQuery(contactTieStrengthFilter, new PhenomCallback() {
        @Override
        public void onSuccess(ArrayList<PhenomObject> objs) {
            setContactOrder(objs);
        }
    });

After retrieving the ordered list of contacts, the application can use that information to determine which contacts to show more prominently. One interesting place where this could be applied is an email application: the inbox might first group emails by day, but within each day show emails from the people the user is closest to first, and other emails below.

Non-Phenom Implementation

Without Phenom, the first step to implementing this example is to get programmatic access to the user's call and SMS logs. For this example, we are trying to calculate, for each contact, the number of phone calls in the call log, the number of SMS messages in the SMS log, and the total duration of calls in the call log. There are two main approaches for doing this, and each has tradeoffs.

One alternative is to take a more SQL-centric approach to calculating the communication statistics. This involves making SQL group-by queries that group the communication log tables by contact and use SQL aggregates (i.e. COUNT() and SUM(duration)). This solution might typically require the fewest lines of code, but because of the structure of the Android Content Providers, this process has several problems. First, there is no way to join between the CallLog provider and the Contacts provider. Doing a GROUP BY on the phone number column is tempting, but the same phone number might be represented by different strings, even on the same device (e.g. dashes, parentheses, leading country code). The best available key is the cached lookup URI that the CallLog provider stores for each contact, which could be used in the GROUP BY clause; however, this information is not guaranteed to be updated as contact records change. Finally, the Android Content Provider API does not support GROUP BY anyway, so this approach is simply not possible.

The other approach is to calculate those statistics in the Java code of the application. This approach requires writing much more code, but will also be more precise and reliable. Furthermore, if the developer wanted to add some other data that was not already in an SQLite database or Android content provider (e.g. data from a REST API), those calculations would need to be done in code anyway. In any case, we must pursue the second approach because the first is simply not possible with the current implementation of Android.
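For illustration only, here is roughly the query the blocked SQL-centric approach would express if the call log were directly queryable as an SQLite table (it is not: applications only reach it through a ContentResolver, which exposes no GROUP BY; table and column names here are hypothetical, following the constants used below):

    // Hypothetical; cannot actually be run against the call log from an app.
    Cursor stats = db.rawQuery(
        "SELECT cached_lookup_uri, COUNT(*) AS call_count,"
        + " SUM(duration) AS total_duration"
        + " FROM calls GROUP BY cached_lookup_uri", null);

The Java-based implementation, which is possible, follows.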
    HashMap<String, Integer> callCount = new HashMap<>();
    HashMap<String, Integer> callDuration = new HashMap<>();
    HashMap<String, Integer> smsCount = new HashMap<>();
    int maxCallCount = 0;
    int maxCallDuration = 0;
    int maxSmsCount = 0;

    Cursor callCursor = this.mContext.getContentResolver().query(
        CallLog.Calls.CONTENT_URI,
        new String[] { Calls.DATE, Calls.NUMBER, Calls.DURATION,
                       Calls.CACHED_NAME, Calls.CACHED_LOOKUP_URI },
        null, null, null);
    while (callCursor != null && callCursor.moveToNext()) {
        String lookupURI = callCursor.getString(
            callCursor.getColumnIndex(Calls.CACHED_LOOKUP_URI));

        int count = 0;
        if (callCount.get(lookupURI) != null) count = callCount.get(lookupURI);
        callCount.put(lookupURI, ++count);
        maxCallCount = Math.max(maxCallCount, count);

        int duration = 0;
        if (callDuration.get(lookupURI) != null)
            duration = callDuration.get(lookupURI);
        duration += callCursor.getInt(
            callCursor.getColumnIndex(Calls.DURATION));
        callDuration.put(lookupURI, duration);
        maxCallDuration = Math.max(maxCallDuration, duration);
    }
    callCursor.close();

    Cursor smsInboxCursor = this.mContext.getContentResolver().query(
        Sms.Inbox.CONTENT_URI,
        new String[] { Sms.DATE, Sms.ADDRESS },
        null, null, null);
    Cursor smsSentCursor = this.mContext.getContentResolver().query(
        Sms.Sent.CONTENT_URI,
        new String[] { Sms.DATE, Sms.ADDRESS },
        null, null, null);
    for (Cursor c : new Cursor[] { smsInboxCursor, smsSentCursor }) {
        while (c != null && c.moveToNext()) {
            String phoneNumber = c.getString(
                c.getColumnIndex(Sms.ADDRESS));
            /* Note that the method below doesn't exist, so it must be
             * implemented, and requires a call to the contacts provider */
            String lookupURI = getLookupUriForNumber(phoneNumber);
            int count = 0;
            if (smsCount.get(lookupURI) != null) count = smsCount.get(lookupURI);
            smsCount.put(lookupURI, ++count);
            maxSmsCount = Math.max(maxSmsCount, count);
        }
    }
    smsInboxCursor.close();
    smsSentCursor.close();

The code above produces the call counts, total call durations, and SMS counts for each contact. There are a couple of inconsistencies to note between the APIs for calls and SMS. First, SMS messages are split into different tables depending on whether they are incoming, outgoing, drafts, etc.; the developer needs to know to query both the inbox and the sent messages. Additionally, the SMS Content Provider does not provide the cached lookup URI, so that information has to be retrieved manually from the Contacts Content Provider.

The next step is to calculate a tie strength score for each contact:

    HashMap<String, Double> tieStrength = new HashMap<>();
    for (Entry<String, Integer> e : callCount.entrySet()) {
        String lookupURI = e.getKey();
        int calls = e.getValue();
        int duration = callDuration.get(lookupURI);
        int sms = 0;
        Integer val = smsCount.remove(lookupURI);
        if (val != null) sms = val;
        // Cast to double so integer division does not truncate each ratio
        double tieStrengthVal = ((double) calls / maxCallCount
            + (double) duration / maxCallDuration
            + (double) sms / maxSmsCount) / 3;
        tieStrength.put(lookupURI, tieStrengthVal);
    }
    // For the remaining SMS counts, where a contact didn't have any calls
    for (Entry<String, Integer> e : smsCount.entrySet()) {
        String lookupURI = e.getKey();
        int sms = e.getValue();
        double tieStrengthVal = ((double) sms / maxSmsCount) / 3;
        tieStrength.put(lookupURI, tieStrengthVal);
    }

The last step is to sort the HashMap by its values, so that the highest tie strength values are at the top of the list.
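A minimal sketch of that final sort (our addition, using Java 8's Map.Entry.comparingByValue):

    // Rank contacts so the strongest ties come first
    List<Entry<String, Double>> ranked =
        new ArrayList<>(tieStrength.entrySet());
    ranked.sort(Entry.<String, Double>comparingByValue().reversed());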
Comparing Implementations

Again, as with the previous example, it is clear that the Phenom implementation is much simpler for an application developer. The Phenom solution is more robust as well: it does not rely on the cached contact information or the phone's contact list. This also means that adding additional communication data (e.g. emails, social network messages, or instant messaging) would be easier with Phenom than in the custom implementation. Finally, the Phenom implementation can automatically, and for free, take advantage of any ground truth data provided by the user (i.e. fixing incorrectly labeled contacts whose real tie strength does not match what was calculated).

The reason this approach works for Phenom is that many applications can make use of communication metadata and of tie strength (e.g. contact ordering, notification prioritization, personal informatics). Additionally, both are potentially useful as input to even higher-level inferences (e.g. mental health, social support, busyness). Thus, the code that supports this in Phenom is valuable because it can be reused across a variety of applications, eliminating the need for any of those developers to redo the common steps of these processes.

5.3 Discussion

Phenom is a proof-of-concept system that demonstrates the possibility and the power of an integrated service for managing personal data on the level of the individual rather than on the level of a company or data source. The architecture of Phenom described in this implementation organizes the personal data process into a modular set of reusable components that are flexible enough to store arbitrary types of personal data, support the linkages between personal data regardless of whether they come from the same or different sources, generate inferences and abstractions on the data, and provide access to that data through a unified API. As a result, individual applications do not need to solve the issues and challenges associated with storing personal data; those responsibilities can be delegated to Phenom and solved once.

5.3.1 Reflecting On Design Goals

Section 4.5 laid out an ambitious set of design goals targeted at addressing the issues and challenges associated with the current ecosystem of personal data. While no single developer, system, or approach could possibly address the entirety of these goals singlehandedly, the implementation of Phenom described in this chapter represents an important step towards reaching them. Phenom speaks directly to some of these goals, and indirectly to others. Each of the goals is discussed in turn below.

Minimize redundant effort required of developers

Phenom dramatically reduces the net effort required of developers to make use of personal data, and each of the components of Phenom contributes to this. Through the abstraction of Data Providers, Phenom simplifies the process of working with multiple APIs: the only developer who needs to be concerned with the structure of a data source's API is the developer who implements the data provider for that source. The Epistenet data store supports rich interconnection between data, both homogeneous, through the semantic network of OntologyClasses, and heterogeneous, through the use of ReferencesAttributes. The combination of these two types of interconnection is especially powerful.
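For instance, the following sketch (ours, written against the Filter API from section 5.1.3, and assuming a groupBy builder method and a TIMESTAMP meta attribute on Communication analogous to PlaceVisit.TIMESTAMP) spans both kinds of interconnection at once:

    // Subsumption: Communication covers PhoneCall, SMSMessage, and so on.
    // Reference: Communication is tied to Person through its handle.
    Filter recentContacts = new Filter(OntologyClass.Communication)
        .projection(Communication.PERSON,
                    Communication.TIMESTAMP.asAggregate(AggregateType.MAX))
        .groupBy(Communication.PERSON)
        .orderBy(Communication.TIMESTAMP.asAggregate(AggregateType.MAX), false)
        .limit(10);

This is the query promised in section 5.1.1: the ten contacts the user has communicated with most recently, regardless of medium.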
Finally, the API offers even more value to developers by simplifying operations that would otherwise be complicated and would require a deeper understanding of the underlying implementation.

Phenom's Bots support the reuse of machine learning by enabling the modular deployment of models that generate inferences and abstractions based on the contents of Epistenet. This represents a good first step, but more can be done to further streamline the process of developing machine learning models. Part of that opportunity lies in better mechanisms for collecting ground truth and retraining models. Another opportunity is to provide better support for the process of coming up with an initial model; with the current design of Phenom this remains a challenge, because Phenom does not offer developers who are implementing bots any way to access the personal data of individuals, even if a user is willing to offer their data.

The real value of Phenom on this design goal is visible through the two examples highlighted in section 5.2. The amount of code required, the complexity, and the potential for errors were all dramatically better with Phenom than without it. The Phenom solutions are easier, more robust, and more open to future additions and improvements to the process.

Organize data by individual, not by service

Phenom addresses this goal very directly: in Phenom, the top level of organization for personal data is the user, not the service or data source that the data came from.

Support connections within the data

Again, Phenom directly engages the goal of supporting connections within the data. The main limitations here now lie in the hierarchical organization of the ontology, and in the choice of which ReferencesAttributes to associate with a particular OntologyClass.

Limit unnecessary disclosure

Phenom's API easily supports many queries that would previously have required the developer to have access to copious amounts of raw personal data. Not only does this capability help to minimize redundant effort by developers, it also lays the groundwork for a system that offers users strong guarantees about how much data developers are accessing. While Phenom does not fully implement such a system, it is now conceivable to do so.

Offer users transparency, and offer users choices and control, while specifying reasonable defaults

These final two personal data design goals remain mostly untouched by Phenom, and are ripe for implementation and further work.

5.3.2 Next Steps

Phenom is a big idea, and it represents a major shift in the approach to handling personal data. However, as the previous section suggested, a number of important aspects of Phenom will need further development in order to realize its full potential.

Privacy

Without question, privacy is the aspect of Phenom that requires the most attention. However, developing a strong approach to privacy here is a large topic that will require significant additional work. One approach to handling privacy in Phenom is to simply continue to enforce the Android permissions framework that is already in use on the platform.
This implementation would involve tracking the permissions that were required to obtain all of the data used in the specification of a given query, and ensuring that the client application has declared all of those permissions in its own AndroidManifest.xml file. This approach is in some ways the most obvious, and probably the simplest to implement as well; however, it has several problems. First, querying for something like tie strength would require Android's permissions for contacts, call logs, and SMS logs, even though neither the call logs nor the SMS logs are directly accessible by the developer, and it would be very difficult to infer much meaning from the value produced by the tie strength bot (except to infer that the user has shared no communication with a particular contact). This approach is suboptimal because it does not allow for stronger guarantees on what data an application is or is not accessing, which is an important aspect of the design goals. The next problem is that some data that Phenom aggregates, or will be able to aggregate in the future, does not have an Android permission (and did not come from Android in the first place). In these cases, a different permission approach would be necessary, because developers would not have an existing permission to declare for those data types.

Another approach for handling privacy controls in Phenom is to define custom permissions for new types of data (e.g. a permission for tie strength, a permission for accessing aggregated statistics on calls, etc.). This approach is a variation of the previous one that at least offers some solutions to the aforementioned issues. While it is more plausible, it suffers from an unfortunate tradeoff: in order to guarantee minimum access, it will likely result in an explosion of new permissions covering every possible permutation of the combinations of data that might be accessed. This would be unwieldy for developers and would certainly make the development process more complex. Even more troubling, users are likely to be overwhelmed or confused by the explosion of new preferences, which could result in the average user paying less attention to privacy preferences. Studies have already shown that users often do not understand the existing permissions (Kelley et al., 2012).

Finally, a more progressive approach would be to rethink the permissions system more holistically. One idea in this direction is tiered permissions. The idea behind tiered permissions is that some kinds of data are considered less sensitive than others, and so permission to access that data should be presented differently to the user. For example, something as simple as how recently the user made a phone call, how many contacts are in the contact list, or the average number of text messages the user sends per month are all likely to be perceived as less sensitive, so these items might appear at a lower tier. By contrast, the exact location of the user's home and work, her entire call log, or the amount of money in her bank account might be considered more sensitive items and belong in the top tier. There are probably kinds of data that belong in between these two as well; for example, tie strength seems like it might fall between the two extremes.
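Purely as an illustration (ours; Phenom does not implement tiered permissions, and all names here are hypothetical), such tiers might be declared in a config file following the same convention as providers.config and bots.config, with one line per data item and a tier number where 1 is least sensitive:

    phonecall_recency,1
    contact_count,1
    monthly_sms_average,1
    tie_strength,2
    home_location,3
    full_call_log,3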
The idea with this approach is that lower tier items would require less confirmation from the user to access, while higher tier items might be presented especially prominently to encourage the user to be cautious. This tiered approach feels promising, but it also introduces its own challenges. For example, deciding what belongs in each tier requires non-trivial effort, and adding additional bots or data sources will require even more of these decisions. Finally, there is the question of this approach's vulnerability to inference attacks. For example, the tiered permissions model may determine that simply knowing whether or not the user is currently at home is less sensitive than the exact location of the user's home. However, if the application can gain access to the user's current location in some other way, then the developer still has access to the more sensitive information. In this particular example, perhaps Phenom could check that the application has not declared permissions to access Android's location APIs. However, there are many different combinations of inference attacks that might occur, and protecting against all of them seems intractable. Even the combination of different data items accessed within Phenom might change something that was otherwise not very sensitive into something very sensitive. For example, it has been demonstrated that access to an individual's date of birth and place of birth (both fairly innocuous facts on their own) can be exploited to guess the considerably more sensitive information of the individual's Social Security number (Acquisti & Gross, 2009). Ultimately, the challenges of creating usable interfaces for managing privacy and security are so difficult that an entire research area has developed to understand and address them. This topic is a rich space for future work.

Ground Truth and Mediation

Beyond privacy, there are a number of opportunities to push Phenom forward. First is providing users with opportunities to correct incorrect inferences and to provide ground truth data that helps improve inference mechanisms. It is conceivable that individuals would be willing to provide better labels for their own data if in exchange they receive better service. Existing examples of this include Netflix asking users to rate more movies so that they get better recommendations, and Gmail Priority Inbox asking users to select which items should be moved to the Priority Inbox and which should be removed. There are many opportunities for individual applications to encourage this type of labeling, and Phenom should provide a process for integrating with it. Along the same lines, as discussed in the previous section, there are opportunities to further simplify the process of developing machine learning models and improving those models after they have been deployed.

Externally Defined Providers

Next, it would be very useful for Phenom to accept incoming data from external data providers. This feature would allow client applications that otherwise do not want to expend resources to provide and maintain an API to still contribute their data to Phenom, and thus offer access to and control of that data to the user. This will lead to other important challenges to consider.
For example, what if the data that an application wants to contribute to Phenom does not fit into the existing ontology, or requires an additional attribute? In the way Phenom is implemented today, those decisions are made statically at compile time. In the future, however, the ontology definition and the attributes associated with a particular ontology class could be dynamically defined. Such an implementation would need a centralized component for handling the definition of the ontology; otherwise, in a decentralized version, a developer could never depend on what ontology is implemented on a particular device, which is problematic from a development perspective.

Architecture

The topic of a centralized component for storing a dynamically changing ontology leads to a broader discussion of the particular architecture in which Phenom is implemented today. Phenom is quite decentralized in its current implementation, with an instance of Phenom running on each user's phone. This has a variety of tradeoffs:

• Individuals may feel more secure that their data is physically in their control on their own device. The reverse perspective is that a smartphone is much easier to physically steal than data stored in the cloud.

• There is no centralized cost for owning and maintaining servers, including processing power, storage capacity, and electricity. This means it might be easier to spark adoption of Phenom because there is no cost barrier to starting to use it. The reverse perspective is that resources on a smartphone are certainly limited: storage space, processing power, and battery. If Phenom really became popular, its impact on the device's resources might become more salient to the user.

• Because Phenom is decentralized, there is no real support for non-phone applications to gain access to Phenom. This could become an issue, in particular if part of the value of Phenom is to offer a consistent user experience across all of the applications that an individual uses. This would be less of a problem if we collectively adopted a computing architecture more akin to the proposed Personal Server (Want et al., 2002), but given the widespread success of mobile data and cloud computing, changing our computing infrastructure to support the idea of a Personal Server seems unlikely.

With a centralized architecture, the issues and concerns would be reversed. A third option to consider is a hybrid architecture, with some components of Phenom centralized and others decentralized. Such an approach might begin to offer the benefits of each approach while minimizing the drawbacks. One example of this idea of a hybrid decentralized platform is the social network platform Diaspora (https://diasporafoundation.org/): any individual can host their own server (called a pod), and pods can connect with each other, but physical control of the servers and the personal data is decentralized. Ultimately, a hybrid architecture would represent a massive undertaking, but it may also offer the most promise for deploying the ideas behind Phenom out into the real world.

5.4 Related Work

Aspects of the approach that Phenom takes to handling personal data are related to a variety of projects in the space of HCI and mobile computing.
While Chapter 2 gave a much broader overview of work related to personal data, this section focuses on systems that may appear similar or related to Phenom.

The Context Toolkit (A. Dey et al., 2001) is a software framework for making software context-aware. In the Context Toolkit, data is collected from sensors by context widgets that separate the data that was collected from the specific complexity of how it was collected; interpreters raise the level of abstraction of the data within each sensor; aggregators bring together related contextual information from different sensors; services trigger actions based on the data; and discoverers maintain a registry of what capabilities exist in the framework. Phenom was inspired in part by the Context Toolkit. The most obvious difference between the two systems is that the Context Toolkit is designed to work with very low-level, sensor-based personal data. By contrast, Phenom is not designed to handle very low-level sensor data; it is much more focused on accepting the output of a system such as the Context Toolkit.

Following on from the Context Toolkit, a number of frameworks and tools have been developed that further expand its underlying idea. In particular, following the creation and widespread adoption of the Android operating system, a handful of tools have emerged that offer a unified framework for interacting with a phone's contextual data, whose definition has in some instances been expanded beyond hardware sensors to include data from "software sensors" and even humans. These systems include the AWARE framework (Ferreira, Kostakos, & Dey, 2015), ohmage (Ramanathan et al., 2012), and the Funf Open Sensing Framework (Aharony et al., 2011). While the specific details of their implementations vary, the basic structure is fairly similar across all of these systems. All three have a strong focus on collecting data in the context of a study: they all include a backend server component and tools for researchers to collect and analyze data from participants. In addition, all three offer a library that contains the core components for developers to integrate the framework into their own applications.

These systems do share some aspects of similarity with Phenom: they all run on Android, they all collect personal data, and in some cases (particularly with AWARE) there is some effort to raise the level of abstraction of the data beyond the level at which it was collected. However, Phenom stands distinct from these systems. Perhaps most distinct is the combination of Phenom's semantic data store and Phenom's API. None of the three systems mentioned above supports linking and interconnection of data across different data types; instead, they all expose the underlying personal data through a very thin API layer. Phenom's API offers the ability to easily specify complex cross-data-type queries. Furthermore, the ontological hierarchy in Epistenet offers additional power and flexibility in working with the data. Finally, Phenom's framing as a service for managing the breadth of personal data is distinct from those presented in the above systems, where the focus is more directly on the information that is available on the phone.
One proposed idea is that the phone's operating system should be responsible for collecting and making inferences from contextual data (Chu, Kansal, Liu, & Zhao, 2011). This offers a different perspective on collecting personal data from a smartphone. For example, operating system-level support for collecting context could provide unified support for collecting user behavior within applications (Fernandes, Riva, & Nath, 2015). This is exactly the kind of data that Phenom is suited to collect: information about what users do when they are in applications could dramatically improve the amount of data from which Phenom can make personal inferences. Again, the level of inference described in this work is at the lower levels of contextual inference, whereas Phenom is positioned to make higher-level inferences.

A number of personal data stores have been proposed over the years, with various architectures, access mechanisms, and privacy controls (Bell, 2001; Cáceres, Cox, Lim, Shakimov, & Varshavsky, 2009; de Montjoye, Shmueli, Wang, & Pentland, 2014; "Higgins Personal Data Service," n.d.; Hong & Landay, 2004; Mun et al., 2010; Want et al., 2002). The motivations behind these systems echo each other: offering users ownership and control over their personal data, and strongly emphasizing privacy. Echoing the points above, Phenom's approach to storing, interconnecting, and querying the data makes it distinct from these other systems. Furthermore, Phenom's bots offer additional functionality for making inferences and abstractions internally in the system. The most similar related work is openPDS (de Montjoye et al., 2014), which includes a component called SafeAnswers. SafeAnswers essentially offers functionality complementary to bots. However, in the SafeAnswers model, individual developers are responsible for writing the code that will be run in the system, and the only data released to the developer is the answer to the question. By contrast, Phenom's Bots are intended to be highly reusable, application-agnostic modules. Furthermore, because the output of bots is also stored in the semantic data store, Bot output can be easily and flexibly combined with other parts of the user's data, a functionality not supported by the SafeAnswers architecture.

Recently, both Google and Apple have released platforms (Fit ("Google Fit," n.d.) and HealthKit ("Apple HealthKit," n.d.), respectively) that share some aspects with Phenom: they are intended to collect fitness and health data from arbitrary applications, store that data in a data-centric format rather than a source-centric one, and then make that data available to other applications with the user's permission. The approach taken here is in some ways closer to Phenom's ontology-class-driven semantic approach to organizing data. However, these systems are restricted to health data, they lack the ability to generate inferences within the system, and they do not provide the interconnected API querying facilities that Phenom offers.

6 Conclusion

Personal data today is abundant, and there remains enormous potential for it to grow, both in the breadth of sources captured and in the duration of time captured. Applications that make use of this data are limited only by our own creativity.
6 Conclusion

Personal data today is abundant, and there remains enormous potential for it to grow, both in the breadth of sources captured and in the duration of time captured. Applications that make use of this data are limited only by our own creativity.

But between the vast amounts of personal data and the functional applications that the data enables lies an orthogonal set of challenges. The ecosystem of personal data was not purposefully designed with the goal of unlocking the full potential of a collected and quantified world. In fact, it seems that nobody has approached personal data from a holistic perspective.

This dissertation explores a holistic view of personal data. The broad survey of computer science research in Chapter 2 reveals multiple domains where an integrated approach to personal data is key to advancing the state of the art in that discipline. The case study in Chapter 3 demonstrates the practical challenges and issues that transform the simple steps of a research project into a resource-intensive distraction from the main goal of the work. Chapter 4 explores the ecosystem of personal data: What does it look like today? What is wrong with it? What would improve it? It introduces a conceptual framework for thinking about the process of working with personal data, consisting of a continuum of abstraction levels of personal data and three steps necessary for working with it; using the frame of unified personal data simplifies many of the challenges involved in this process. Chapter 5 demonstrates a proof-of-concept service for unified personal data that offers a single, user-centric data store of richly interconnected personal data.

This dissertation offers the following technical and design contributions to HCI:

1. A proposal for unified personal data: a reframing of many HCI challenges, human needs, and technical opportunities that can all be advanced by more holistically viewing all of the individual data amassing around people as their personal data that should work for them.
2. The notion of personal data as a continuum, and a conceptual framework that unpacks the implicit process involved in working with personal data.
3. A set of design goals for improving the ecosystem of personal data.
4. The design of Phenom: a service that supports software development with personal data. Phenom modularizes the collection, interconnection, processing, and querying of personal data to solve a key set of challenges involved in developing applications that use personal data.
5. The implementation of a proof of concept of Phenom, which demonstrates its viability and utility as a personal data service.

Personal data is only in its beginnings as a research domain. If researchers from many disciplines are going to continue to employ personal data to make advances in their own fields, it is imperative that we establish this multidisciplinary domain. The possibility of a world where unified personal data can be used to enable powerful and complex applications is very real; however, many important and interconnected questions remain in personal data research. What economic model will enable companies to maintain their value and competitive advantage while also enabling end users fair access to their data? What software architecture offers the best compromise across these concerns? What access mechanisms will offer an effective balance between privacy and utility?

Even beyond research, as a society we will need to answer a set of questions that we might not be ready for. Who “owns” my personal data? Is ownership even the most applicable concept? Does an individual have a right to access their own data? A right to demand that it is collected? A right to demand that it is deleted?
A right to stop it from being deleted? In the context of these questions, Phenom is a software artifact that makes it possible to engage them, explore potential solutions, and continue to evolve the ecosystem of personal data.

7 References

Abbar, S., Bouzeghoub, M., & Lopez, S. (2009). Context-aware recommender systems: A service-oriented approach. In VLDB PersDB Workshop (pp. 1–6).
Ackerman, J. M., Kenrick, D. T., & Schaller, M. (2007). Is friendship akin to kinship? Evolution and Human Behavior, 28(5).
Ackerman, M. S., Cranor, L. F., & Reagle, J. (1999). Privacy in e-commerce: Examining user scenarios and privacy preferences. In Proceedings of the 1st ACM Conference on Electronic Commerce (pp. 1–8).
Acquisti, A., & Gross, R. (2009). Predicting Social Security numbers from public data. Proceedings of the National Academy of Sciences of the United States of America, 106(27), 10975–10980. doi:10.1073/pnas.0904891106
Adar, E., Karger, D., & Stein, L. A. (1999). Haystack: Per-user information environments. In Proceedings of the Eighth International Conference on Information and Knowledge Management (pp. 413–422). New York, NY, USA: ACM. doi:10.1145/319950.323231
Adomavicius, G., & Tuzhilin, A. (2011). Context-aware recommender systems. In Recommender Systems Handbook (pp. 217–253). Springer.
Aharony, N., Pan, W., Ip, C., Khayal, I., & Pentland, A. (2011). Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive and Mobile Computing, 7(6), 643–659. doi:10.1016/j.pmcj.2011.09.004
Allen, J. F. (1979). A plan-based approach to speech act recognition.
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental Disorders (DSM-5). American Psychiatric Pub.
Apple HealthKit. (n.d.). Retrieved August 10, 2015, from https://developer.apple.com/healthkit/
Assad, M., Carmichael, D., Kay, J., & Kummerfeld, B. (2007). PersonisAD: Distributed, active, scrutable model framework for context-aware services. Pervasive Computing, 55–72.
Baldauf, M., Dustdar, S., & Rosenberg, F. (2007). A survey on context-aware systems. International Journal of Ad Hoc and Ubiquitous Computing, 2(4), 263–277.
Barkhuus, L., Brown, B., Bell, M., Sherwood, S., Hall, M., & Chalmers, M. (2008). From awareness to repartee: Sharing location within social groups. In Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems (CHI ’08).
Belk, R. (2010). Sharing. Journal of Consumer Research, 36(5), 715–734. doi:10.1086/612649
Belk, R. W. (1988). Possessions and the extended self. The Journal of Consumer Research, 15(2), 139–168.
Bell, G. (2001). A personal digital store. Communications of the ACM, 44(1), 86–91.
Bellotti, V., Dalal, B., Good, N., Flynn, P., & Bobrow, D. (2004). What a to-do: Studies of task management towards the design of a personal task list manager. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Bernstein, M., Van Kleek, M., Karger, D., & Schraefel, M. (2008). Information scraps: How and why information eludes our personal information management tools. ACM Transactions on Information Systems (TOIS), 26(4).
Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems survey. Knowledge-Based Systems, 46, 109–132.
Brandimarte, L., & Acquisti, A. (2012). The economics of privacy. In The Oxford Handbook of the Digital Economy. Oxford University Press.
Brandt, J., Weiss, N., & Klemmer, S. (2007). txt 4 l8r: Lowering the burden for diary studies under mobile conditions. In CHI ’07 Extended Abstracts on Human Factors in Computing Systems.
Brown, B., Taylor, A., Izadi, S., Sellen, A., Kaye, J., & Eardley, R. (2007). Locating family values: A field trial of the Whereabouts Clock. In Ubiquitous Computing (UbiComp ’07).
Browne, G., Berry, E., Kapur, N., Hodges, S., Smyth, G., Watson, P., & Wood, K. (2011). SenseCam improves memory for recent events and quality of life in a patient with memory retrieval difficulties. Memory, 19(7), 713–722.
Burke, M. (2011). Reading, writing, relationships: The impact of social network sites on relationships and well-being. Carnegie Mellon University.
Burke, M., & Kraut, R. (2013). Using Facebook after losing a job: Differential benefits of strong and weak ties. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (pp. 1419–1430).
Burton, R. R., & Brown, J. S. (1979). An investigation of computer coaching for informal learning activities. International Journal of Man-Machine Studies, 11(1), 5–24. doi:10.1016/S0020-7373(79)80003-6
Bush, V. (1945, July). As we may think. The Atlantic Monthly. doi:10.1145/227181.227186
Cáceres, R., Cox, L., Lim, H., Shakimov, A., & Varshavsky, A. (2009). Virtual individual servers as privacy-preserving proxies for mobile devices. In Proceedings of the 1st ACM Workshop on Networking, Systems, and Applications for Mobile Handhelds (pp. 37–42).
Cadiz, J., Venolia, G., & Jancke, G. (2002). Designing and deploying an information awareness interface. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work.
Chang, K. S.-P., Myers, B. A., Cahill, G. M., Simanta, S., Morris, E., & Lewis, G. (2013). Improving structured data entry on mobile devices. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (pp. 75–84). New York, NY, USA: ACM. doi:10.1145/2501988.2502043
Chen, G., & Kotz, D. (2000). A survey of context-aware mobile computing research.
Chu, D., Kansal, A., Liu, J., & Zhao, F. (2011). Mobile apps: It’s time to move up to CondOS. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems (p. 16).
Cohen, P. R., & Perrault, C. R. (1979). Elements of a plan-based theory of speech acts. Cognitive Science, 3(3), 177–212. doi:10.1016/S0364-0213(79)80006-3
Consolvo, S., McDonald, D. W., Toscos, T., Chen, M. Y., Froehlich, J., Harrison, B., … Landay, J. A. (2008). Activity sensing in the wild: A field trial of UbiFit Garden. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’08) (pp. 1797–1806). doi:10.1145/1357054.1357335
Conti, M., Passarella, A., & Pezzoni, F. (2011). A model for the generation of social network graphs. In 2011 IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks. IEEE. doi:10.1109/WoWMoM.2011.5986141
Cummings, J. N., Lee, J. B., & Kraut, R. (2006). Communication technology and friends during the transition from high school to college. In Computers, Phones, and the Internet: Domesticating Information Technology.
Danezis, G. (2009). Inferring privacy policies for social networking services. In Conference on Computer and Communications Security (pp. 5–10).
Das, S., Hayashi, E., & Hong, J. I. (2013). Exploring capturable everyday memory for autobiographical authentication. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp ’13) (p. 211). New York, NY, USA: ACM Press. doi:10.1145/2493432.2493453
Davidoff, S., Lee, M. K., Dey, A. K., & Zimmerman, J. (2007). Rapidly exploring application design through speed dating. LNCS.
Davidoff, S., Ziebart, B. D., Zimmerman, J., & Dey, A. K. (2011). Learning patterns of pick-ups and drop-offs to support busy family coordination. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1175–1184).
de Montjoye, Y.-A., Shmueli, E., Wang, S. S., & Pentland, A. S. (2014). openPDS: Protecting the privacy of metadata through SafeAnswers. PLoS ONE, 9(7), e98790. doi:10.1371/journal.pone.0098790
Dey, A. K. (2001). Understanding and using context. Personal and Ubiquitous Computing, 5(1), 4–7. doi:10.1007/s007790170019
Dey, A., Salber, D., & Abowd, G. (2001). A conceptual framework and a toolkit for supporting the rapid prototyping of context-aware applications. Human-Computer Interaction, 16(2–4), 97–166.
Dim, E., Kuflik, T., & Reinhartz-Berger, I. (2015). When user modeling intersects software engineering: The info-bead user modeling approach. User Modeling and User-Adapted Interaction, 25(3), 189–229. doi:10.1007/s11257-015-9159-1
Dorst, K. (2011). The core of “design thinking” and its application. Design Studies, 32(6), 521–532.
Doryab, A., Min, J. K., Wiese, J., Zimmerman, J., & Hong, J. I. (2014). Detection of behavior change in people with depression. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence.
Dourish, P. (2001). Seeking a foundation for context-aware computing. Human-Computer Interaction, 16(2–4), 229–241.
Dourish, P. (2004). What we talk about when we talk about context. Personal and Ubiquitous Computing. doi:10.1007/s00779-003-0253-8
Ducheneaut, N., & Bellotti, V. (2001). E-mail as habitat: An exploration of embedded personal information management. Interactions, 8(5), 30–38. doi:10.1145/382899.383305
Dumais, S., Cutrell, E., Cadiz, J. J., Jancke, G., Sarin, R., & Robbins, D. C. (2003). Stuff I’ve seen: A system for personal information retrieval and re-use. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 72–79).
Eagle, N., & Pentland, A. (2006). Reality mining: Sensing complex social systems. Personal and Ubiquitous Computing, 10(4), 255–268.
Eagle, N., Pentland, A., & Lazer, D. (2009). Inferring social network structure using mobile phone data. PNAS, 106(36).
Estrin, D. (2014). Small data, where n = me. Communications of the ACM, 57(4), 32–34.
Fang, L., & LeFevre, K. (2010). Privacy wizards for social networking sites. In Proceedings of the 19th International Conference on World Wide Web (pp. 351–360).
Farnham, S. D., & Churchill, E. F. (2011). Faceted identity, faceted lives. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work (CSCW ’11) (p. 359). doi:10.1145/1958824.1958880
Fernandes, E., Riva, O., & Nath, S. (2015). My OS ought to know me better: In-app behavioural analytics as an OS service. In 15th Workshop on Hot Topics in Operating Systems (HotOS XV).
Ferreira, D., Kostakos, V., & Dey, A. K. (2015). AWARE: Mobile context instrumentation framework. Frontiers in ICT, 2. doi:10.3389/fict.2015.00006
Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012). Interactions with big data analytics. Interactions, 19(3), 50–59. doi:10.1145/2168931.2168943
Fogarty, J., Lai, J., & Christensen, J. (2004). Presence versus availability: The design and evaluation of a context-aware communication client. International Journal of Human-Computer Studies.
Freeman, E., & Fertig, S. (1995). Lifestreams: Organizing your electronic life. In AAAI Fall Symposium: AI Applications in Knowledge Navigation and Retrieval (pp. 38–44).
Freeman, E., & Gelernter, D. (1996). Lifestreams: A storage model for personal data. ACM SIGMOD Record, 25(1), 80–86.
Friedkin, N. (1980). A test of structural features of Granovetter’s strength of weak ties theory. Social Networks, 2(4), 411–422.
Gemmell, J., Bell, G., & Lueder, R. (2006). MyLifeBits. Communications of the ACM, 49(1), 88–95. doi:10.1145/1107458.1107460
Gemmell, J., Bell, G., Lueder, R., Drucker, S., & Wong, C. (2002). MyLifeBits: Fulfilling the Memex vision. In Proceedings of the Tenth ACM International Conference on Multimedia (pp. 235–238). New York, NY, USA: ACM. doi:10.1145/641007.641053
Gilbert, E., & Karahalios, K. (2009). Predicting tie strength with social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 211–220).
Gliem, J. A., & Gliem, R. R. (2003). Calculating, interpreting, and reporting Cronbach’s alpha reliability coefficient for Likert-type scales.
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–70. doi:10.1145/138859.138867
González, M., Hidalgo, C., & Barabási, A. (2008). Understanding individual human mobility patterns. Nature.
Google Fit. (n.d.). Retrieved August 10, 2015, from https://developers.google.com/fit/?hl=en
Granovetter, M. (1973). The strength of weak ties. The American Journal of Sociology.
Gulotta, R., Odom, W., Forlizzi, J., & Faste, H. (2013). Digital artifacts as legacy: Exploring the lifespan and value of digital data. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1813–1822). New York, NY, USA: ACM. doi:10.1145/2470654.2466240
Higgins Personal Data Service. (n.d.). Retrieved August 10, 2015, from http://www.eclipse.org/higgins/
Hill, R. A., & Dunbar, R. I. M. (2003). Social network size in humans. Human Nature, 14, 53–72.
Hodges, S., Williams, L., Berry, E., Izadi, S., Srinivasan, J., Butler, A., … Wood, K. (2006). SenseCam: A retrospective memory aid. In UbiComp 2006: Ubiquitous Computing (pp. 177–193). Springer.
Hong, J., & Landay, J. (2004). An architecture for privacy-sensitive ubiquitous computing. In The Second International Conference on Mobile Systems, Applications, and Services (MobiSys 2004) (pp. 177–189).
Hori, T., & Aizawa, K. (2003). Context-based video retrieval system for the life-log applications. In Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval (pp. 31–38). New York, NY, USA: ACM. doi:10.1145/973264.973270
Hsieh, G., Tang, K. P., Low, W. Y., & Hong, J. I. (2007). Field deployment of IMBuddy: A study of privacy control and feedback mechanisms for contextual instant messengers. In The Ninth International Conference on Ubiquitous Computing (UbiComp 2007).
Iachello, G., & Hong, J. (2007). End-user privacy in human-computer interaction. Foundations and Trends in Human-Computer Interaction. doi:10.1561/1100000004
Jones, S., & O’Neill, E. (2010). Feasibility of structural network clustering for group-based privacy control in social networks. In Symposium on Usable Privacy and Security (SOUPS 2010).
Jones, W. (2007). Personal information management. Annual Review of Information Science and Technology, 41(1), 453–504.
Karger, D. R., & Jones, W. (2006). Data unification in personal information management. Communications of the ACM, 49(1), 77–82.
Kay, J., & Kummerfeld, B. (2010). PortMe: Personal lifelong user modelling portal. School of Information Technologies, University of Sydney.
Kaye, J. “Jofish,” Vertesi, J., Avery, S., Dafoe, A., David, S., Onaga, L., … Pinch, T. (2006). To have and to hold: Exploring the personal archive. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 275–284). New York, NY, USA: ACM. doi:10.1145/1124772.1124814
Kelley, P. G., Bresee, J., Cranor, L. F., & Reeder, R. W. (2009). A nutrition label for privacy. In Proceedings of the 5th Symposium on Usable Privacy and Security (p. 4).
Kelley, P. G., Consolvo, S., Cranor, L. F., Jung, J., Sadeh, N., & Wetherall, D. (2012). A conundrum of permissions: Installing applications on an Android smartphone. In Financial Cryptography and Data Security (pp. 68–79). Springer.
Kelley, P. G., Brewer, R., Mayer, Y., Cranor, L. F., & Sadeh, N. (2011). An investigation into Facebook friend grouping. In Proceedings of INTERACT 2011.
Kiukkonen, N., Blom, J., Dousse, O., Gatica-Perez, D., & Laurila, J. (2010). Towards rich mobile phone datasets: Lausanne data collection campaign. In Proc. ICPS, Berlin.
Klemperer, P., Liang, Y., Mazurek, M., Sleeper, M., Ur, B., Bauer, L., … Reiter, M. (2012). Tag, you can see it!: Using tags for access control in photo sharing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 377–386). New York, NY, USA: ACM. doi:10.1145/2207676.2207728
Kobsa, A. (2001). Generic user modeling systems. User Modeling and User-Adapted Interaction, 11(1–2), 49–63.
Konstan, J. A., & Riedl, J. (2012). Recommender systems: From algorithms to user experience. User Modeling and User-Adapted Interaction, 22(1–2), 101–123.
Lamming, M., Brown, P., Carter, K., Eldridge, M., Flynn, M., Louie, G., … Sellen, A. (1994). The design of a human memory prosthesis. The Computer Journal, 37(3), 153–163.
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., … Van Alstyne, M. (2009). Computational social science. Science, 323(5915), 721–723. doi:10.1126/science.1167742
Lee, M. L., & Dey, A. K. (2008). Lifelogging memory appliance for people with episodic memory impairment. In Proceedings of the 10th International Conference on Ubiquitous Computing (pp. 44–53).
Li, I., Dey, A., & Forlizzi, J. (2010). A stage-based model of personal informatics systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 557–566).
Lin, J., Amini, S., Hong, J. I., Sadeh, N., Lindqvist, J., & Zhang, J. (2012). Expectation and purpose: Understanding users’ mental models of mobile app privacy through crowdsourcing. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (pp. 501–510).
Lin, N., & Dean, A. (1984). Social support and depression. Social Psychiatry and Psychiatric Epidemiology, 19(2), 83–91.
Lindqvist, J., Cranshaw, J., Wiese, J., Hong, J., & Zimmerman, J. (2011). I’m the mayor of my house: Examining why people use foursquare - a social-driven location sharing application. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2409–2418). New York, NY, USA: ACM. doi:10.1145/1978942.1979295
Liu, Y., Gummadi, K. P., Krishnamurthy, B., & Mislove, A. (2011). Analyzing Facebook privacy settings: User expectations vs. reality. In Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference (pp. 61–70). New York, NY, USA: ACM. doi:10.1145/2068816.2068823
Marcu, G., Dey, A. K., & Kiesler, S. (2012). Parent-driven use of wearable cameras for autism support: A field study with families. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (pp. 401–410). New York, NY, USA: ACM. doi:10.1145/2370216.2370277
Marin, A., & Hampton, K. N. (2007). Simplifying the personal network name generator. Field Methods.
Matyas, C., & Schlieder, C. (2009). A spatial user similarity measure for geographic recommender systems. In GeoSpatial Semantics (pp. 122–139). Springer.
McCarty, C. (2002). Structure in personal networks. Journal of Social Structure, 3(1).
McEwan, G., & Greenberg, S. (2005). Supporting social worlds with the Community Bar. In Proceedings of the 2005 International ACM SIGGROUP Conference on Supporting Group Work.
Mesch, G. S. (2009). Social context and communication channels choice among adolescents. Computers in Human Behavior, 25, 244–251.
Min, J.-K., Doryab, A., Wiese, J., Amini, S., Zimmerman, J., & Hong, J. I. (2014). Toss ’n’ turn: Smartphone as sleep and sleep quality detector. In Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems (CHI ’14) (pp. 477–486). New York, NY, USA: ACM Press. doi:10.1145/2556288.2557220
Min, J.-K., Wiese, J., Hong, J. I., & Zimmerman, J. (2013). Mining smartphone data to classify life-facets of social relationships. In Proceedings of CSCW ’13. New York, NY, USA. doi:10.1145/2441776.2441810
Miritello, G., Moro, E., Lara, R., Martínez-López, R., Belchamber, J., Roberts, S. G. B., & Dunbar, R. I. M. (2013). Time as a limited resource: Communication strategy in mobile phone networks. Social Networks, 35(1), 89–95. doi:10.1016/j.socnet.2013.01.003
Mun, M., Hao, S., Mishra, N., Shilton, K., Burke, J., Estrin, D., … Govindan, R. (2010). Personal data vaults: A locus of control for personal data streams. In Proceedings of the 6th International Conference (p. 17).
Nguyen, T. T., Nguyen, D. T., Iqbal, S. T., & Ofek, E. (2015). The known stranger: Supporting conversations between strangers with personalized topic suggestions. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (pp. 555–564). New York, NY, USA: ACM. doi:10.1145/2702123.2702411
Odom, W., Zimmerman, J., & Forlizzi, J. (2010). Virtual possessions. In Proceedings of the 8th ACM Conference on Designing Interactive Systems (DIS ’10) (p. 368). New York, NY, USA: ACM Press. doi:10.1145/1858171.1858240
Odom, W., Zimmerman, J., & Forlizzi, J. (2011). Teenagers and their virtual possessions: Design opportunities and issues. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1491–1500). New York, NY, USA: ACM. doi:10.1145/1978942.1979161
Odom, W., Zimmerman, J., & Forlizzi, J. (2014). Placelessness, spacelessness, and formlessness: Experiential qualities of virtual possessions. In Proceedings of the 2014 Conference on Designing Interactive Systems (pp. 985–994). New York, NY, USA: ACM. doi:10.1145/2598510.2598577
Oku, K., Kotera, R., & Sumiya, K. (2010). Geographical recommender system based on interaction between map operation and category selection. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems (pp. 71–74).
Olson, J., Grudin, J., & Horvitz, E. (2005). A study of preferences for sharing and privacy. In Conference on Human Factors in Computing Systems.
Onnela, J.-P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., … Barabási, A.-L. (2007). Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences of the United States of America, 104(18), 7332–7336. doi:10.1073/pnas.0610245104
Oulasvirta, A., Raento, M., & Tiitta, S. (2005). ContextContacts: Re-designing SmartPhone’s contact book to support mobile awareness and collaboration. In Proceedings of the 7th International Conference on Human Computer Interaction with Mobile Devices & Services (MobileHCI ’05).
Ozenc, F. K., & Farnham, S. D. (2011). Life “modes” in social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Patel, K., Bancroft, N., Drucker, S. M., Fogarty, J., Ko, A. J., & Landay, J. (2010). Gestalt: Integrated support for implementation and analysis in machine learning. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (pp. 37–46).
Pentland, A. (2009). Reality mining of mobile communications: Toward a new deal on data. The Global Information Technology Report 2008–2009, 1981.
Perrault, C. R., Allen, J. F., & Cohen, P. R. (1978). Speech acts as a basis for understanding dialogue coherence. In Proceedings of the 1978 Workshop on Theoretical Issues in Natural Language Processing (pp. 125–132). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/980262.980282
Pousman, Z., Stasko, J. T., & Mateas, M. (2007). Casual information visualization: Depictions of data in everyday life. IEEE Transactions on Visualization and Computer Graphics, 13(6), 1145–1152.
Ramanathan, N., Alquaddoomi, F., Falaki, H., George, D., Hsieh, C., Jenkins, J., … Estrin, D. (2012). ohmage: An open mobile system for activity and experience sampling. In 2012 6th International Conference on Pervasive Computing Technologies for Healthcare (pp. 203–204).
Rich, E. (1979a). Building and exploiting user models. In Proceedings of the 6th International Joint Conference on Artificial Intelligence - Volume 2 (pp. 720–722).
Rich, E. (1979b). User modeling via stereotypes. Cognitive Science, 3(4), 329–354.
Ricken, S. T., Schuler, R. P., Grandhi, S. A., & Jones, Q. (2010). TellUsWho: Guided social network data collection. In 2010 43rd Hawaii International Conference on System Sciences (pp. 1–10). IEEE. doi:10.1109/HICSS.2010.365
Rittel, H. J., & Webber, M. (1973). Dilemmas in a general theory of planning. Policy Sciences, 4(2), 155–169. doi:10.1007/BF01405730
Roberts, S. G. B., & Dunbar, R. I. M. (2011). Communication in social networks: Effects of kinship, network size, and emotional closeness. Personal Relationships.
Romanosky, S., Acquisti, A., Hong, J. I., Cranor, L. F., & Friedman, B. (2006). Privacy patterns for online interactions. In The 11th European Conference on Pattern Languages of Programs (EuroPLoP 2006).
Saeb, S., Zhang, M., Karr, C. J., Schueller, S. M., Corden, M. E., Kording, K. P., & Mohr, D. C. (2015). Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: An exploratory study. Journal of Medical Internet Research, 17(7), e175. doi:10.2196/jmir.4273
Schilit, B., Adams, N., & Want, R. (1994). Context-aware computing applications. In Mobile Computing Systems and Applications.
Schön, D. A. (1983). The reflective practitioner: How professionals think in action (Vol. 5126). Basic Books.
Sellen, A. J., & Whittaker, S. (2010). Beyond total capture: A constructive critique of lifelogging. Communications of the ACM, 53(5), 70–77. doi:10.1145/1735223.1735243
Simon, H. A. (1969). The sciences of the artificial. Cambridge, MA.
Sleeper, M., Balebako, R., Das, S., McConahy, A. L., Wiese, J., & Cranor, L. F. (2013). The post that wasn’t: Exploring self-censorship on Facebook. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (pp. 793–802). ACM.
Spencer, L., & Pahl, R. E. (2006). Rethinking friendship: Hidden solidarities today. Princeton University Press.
Starner, T. E., Snoeck, C. M., Wong, B. A., & McGuire, R. M. (2004). Use of mobile appointment scheduling devices. In CHI ’04 Extended Abstracts on Human Factors in Computing Systems (pp. 1501–1504).
Tang, J., Yankelovich, N., Begole, J., Van Kleek, M., Li, F., & Bhalodia, J. (2001). ConNexus to Awarenex: Extending awareness to mobile users. In ACM Conference on Human Factors in Computing Systems (CHI 2001), CHI Letters 3(1).
Tang, K. P., Lin, J., Hong, J. I., Siewiorek, D. P., & Sadeh, N. (2010). Rethinking location sharing: Exploring the implications of social-driven vs. purpose-driven location sharing.
Tolmie, P., Pycock, J., Diggins, T., Maclean, A., & Karsenty, A. (2002). Unremarkable computing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. doi:10.1145/503376.503448
Wang, D., Pedreschi, D., Song, C., Giannotti, F., & Barabási, A.-L. (2011). Human mobility, social ties, and link prediction. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (p. 1100). doi:10.1145/2020408.2020581
Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., … Campbell, A. T. (2014). StudentLife: Assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 3–14).
Want, R., Hopper, A., Falcão, V., & Gibbons, J. (1992). The active badge location system. ACM Transactions on Information Systems. doi:10.1145/128756.128759
Want, R., Pering, T., Danneels, G., Kumar, M., Sundar, M., & Light, J. (2002). The personal server: Changing the way we think about ubiquitous computing. In UbiComp 2002: Ubiquitous Computing (pp. 194–209). Springer.
Weiser, M. (1991). The computer for the 21st century. Scientific American, 265(3), 94–104. doi:10.1145/329124.329126
Weka 3: Data Mining Software in Java. (n.d.).
Westin, A. (2001). Opinion surveys: What consumers have to say about information privacy. Prepared witness testimony, The House Committee on Energy and Commerce.
Whittaker, S., Jones, Q., & Terveen, L. (2002). Contact management. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work (CSCW ’02) (p. 216). New York, NY, USA: ACM Press. doi:10.1145/587078.587109
Wiese, J., Biehl, J. T., Turner, T., van Melle, W., & Girgensohn, A. (2011). Beyond “Yesterday’s Tomorrow”: Towards the design of awareness technologies for the contemporary worker. In Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services (pp. 455–464). New York, NY, USA: ACM. doi:10.1145/2037373.2037441
Wiese, J., Hong, J. I., & Zimmerman, J. (2014). Challenges and opportunities in data mining contact lists for inferring relationships. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
Wiese, J., Kelley, P. G., Cranor, L. F., Dabbish, L., Hong, J. I., & Zimmerman, J. (2011). Are you close with me? Are you nearby?: Investigating social groups, closeness, and willingness to share. In Proceedings of the 13th International Conference on Ubiquitous Computing (UbiComp ’11). New York, NY, USA. doi:10.1145/2030112.2030140
Wiese, J., Min, J.-K., Hong, J. I., & Zimmerman, J. (2015). “You never call, you never write”: Call and SMS logs do not always indicate tie strength. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (pp. 765–774). New York, NY, USA: ACM. doi:10.1145/2675133.2675143
Wiese, J., Saponas, T. S., & Brush, A. J. B. (2013). Phoneprioception: Enabling mobile phones to infer where they are kept. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2157–2166). New York, NY, USA: ACM. doi:10.1145/2470654.2481296
Yang, J., Yessenov, K., & Solar-Lezama, A. (2012). A language for automatically enforcing privacy policies. In ACM SIGPLAN Notices (Vol. 47, pp. 85–96).
Zhou, W. X., Sornette, D., Hill, R. A., & Dunbar, R. I. M. (2005). Discrete hierarchical organization of social group sizes. In Proceedings of the Royal Society of London B: Biological Sciences (Vol. 272).