IDS UNIT-1
1 Finance:
Predictive Modeling: Building models to forecast market events and trends, which help in
making better investment or loan decisions.
Fraud Detection and Risk Management: Using data to detect fraud and reduce financial
risk, especially in lending. By analyzing customer data such as spending habits and credit
history, data scientists can assess the likelihood of loan defaults or creditworthiness.
Credit Scoring: Financial institutions use machine learning algorithms to evaluate a
customer’s creditworthiness based on their past financial behavior, ultimately determining
loan amounts and interest rates.
An example is Lending Club, an online marketplace that connects borrowers with investors.
Lending Club uses predictive models built from historical loan data to identify risky loan
applicants and reduce defaults. Data scientists apply various machine learning algorithms to
create such models.
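As a minimal, hypothetical sketch of how such a default-prediction model could be built (the toy data, feature names, and use of scikit-learn's logistic regression are illustrative assumptions, not Lending Club's actual system):

# Hypothetical sketch: predicting loan default from two borrower features.
# Assumes Python with pandas and scikit-learn installed; the data is made up.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

loans = pd.DataFrame({
    "income":       [35, 80, 22, 60, 45, 95, 28, 70],        # annual income in $1000s
    "credit_score": [580, 720, 540, 690, 610, 750, 560, 700],
    "defaulted":    [1, 0, 1, 0, 1, 0, 1, 0],                 # 1 = loan defaulted
})

X = loans[["income", "credit_score"]]
y = loans["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Estimated probability of default for a new applicant
new_applicant = pd.DataFrame({"income": [50], "credit_score": [640]})
print("default probability:", model.predict_proba(new_applicant)[0][1])

In practice such a model would be trained on thousands of historical loans and many more features, but the workflow (assemble labeled data, fit a classifier, score new applicants) is the same.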
2 Public Policy:
Public policy is the creation and implementation of laws, regulations, and policies to
address societal problems for the benefit of citizens. Social sciences like economics,
political science, and sociology play a key role in shaping public policy.
Data science helps governments and agencies understand citizen behaviours in areas
such as traffic, public transportation, social welfare, and community wellbeing. By
analysing large datasets, data scientists can provide insights to inform better public
policy decisions.
The availability of open data has made it easier to obtain valuable information for policy
analysis. Examples of open data repositories include:
Data.gov (US government), with over 200,000 datasets covering diverse topics.
City of Chicago’s open data portal, which organizes data in categories like
administration, finance, and sanitation.
NYC OpenData, offering datasets by city agency, including over 400 datasets related
to city government.
One notable initiative using data science for public policy is the Data Science for
Social Good project, which brings together data scientists from around the world to
work on projects that address societal issues.
This includes challenges like estimating the size of refugee camps in war zones or
developing systems to use data for social good. The project typically holds events in
June each year.
3 Politics:
Political Process: Politics involves electing officials who create and implement policies to
govern a state, with government finances often supported by taxes.
Data Science in Politics: The integration of data science into political campaigns has grown
significantly. Notable examples include President Obama's 2008 campaign, where data
science played a crucial role in Internet-based efforts, and Donald Trump’s 2016 campaign,
which effectively used social media for targeted voter outreach.
Voter Targeting and Social Media: Data science has improved voter targeting models,
increasing voter participation. Trump's 2016 campaign used social media data to tailor
messages to individuals, while studies of Twitter content revealed differences in messaging
strategies between Trump and Hillary Clinton, such as the emphasis on masculine vs.
feminine traits and the use of user-generated content.
Ethical Concerns: The use of data science in politics also brings ethical concerns, highlighted
by the Cambridge Analytica scandal, where data from 87 million Facebook users was
exploited for political ad targeting. This incident raised questions about privacy, which is an
ongoing issue in data use for political purposes.
4 Healthcare
Evolution of Healthcare Data: The healthcare industry has always collected data, but the
amount of information now available is vast and diverse. This includes biological data such
as gene expression, DNA sequences, proteomics, and metabolomics, in addition to clinical
data, electronic health records (EHRs), and medical claims.
Clinical Data Integration: Data scientists combine clinical trial data with real-world
observations from practicing physicians, enabling more personalized healthcare. This
approach helps healthcare professionals identify effective treatments for patients and
understand health outcomes on a larger scale.
Personal Health Management: Wearable devices like Fitbit have revolutionized personal
health tracking. These devices collect detailed health data, including heart rate, blood glucose
levels, sleep patterns, stress levels, and brain activity. Such tools empower individuals to
track and manage their health more effectively.
Research and Long-Term Monitoring: Wearable health trackers have become essential in
health research, allowing for long-term monitoring of physical activity and health behaviors.
For example, a study using Fitbit devices tracked physical activity among overweight,
postmenopausal women over 16 weeks, demonstrating the effectiveness of self-monitoring in
maintaining health behaviors.
5 Urban Planning
Transforming Urban Planning: Urban planning is undergoing a shift, with data science
playing a key role in improving the quality of life and urban systems. This shift is driven by
the growing availability of data and new computational methods that can be applied to
understand and improve cities.
UrbanCCD Initiatives: The Urban Center for Computation and Data (UrbanCCD) at the
University of Chicago is leading efforts to address the rapid growth of cities. By integrating
advanced computational methods, the center aims to improve city design and operations,
bringing together experts in various fields to tackle urban challenges like inefficient
transportation and overcrowded slums.
Challenges of Urban Expansion: As cities grow quickly, traditional urban design tools
become insufficient. The UrbanCCD highlights the need for advanced computational
resources to anticipate the impact of urban expansion and find solutions to related problems,
such as poverty, health issues, and environmental concerns.
Smaller-Scale Data Solutions: Smaller cities and initiatives are also leveraging data science.
For example, chicagoshovels.org offers a "Plow Tracker" to help residents track snowplows
in real time and organize snow removal efforts. This platform also provides real-time bus
arrival information for residents. Similarly, Boston’s Office of New Urban Mechanics has
created various apps, such as the SnowCOP app, to improve public services during
snowstorms and other city operations.
Data for Local Services: In Jackson, Michigan, data is used to track water usage and identify
potentially abandoned homes, demonstrating how even smaller cities are harnessing data for
better municipal services and resource management. The potential applications of data
science in urban planning and local government services are extensive.
6 Education
Challenges of Technology in Education: Joel Klein, former Chancellor of New York Public
Schools, argues that simply providing computers to students does not necessarily improve
education. While technology plays an important role in education, its impact depends on how
it is integrated into the learning process.
Teachers as Data Scientists: In a data-rich future classroom, teachers will rely heavily on data
analysis to assess student performance. Automated tools will provide insights into each
student's progress, allowing for more personalized instruction.
Big Data in Education: Big data offers significant potential for improving education by
providing insights into student performance and learning techniques. Teachers will be able to
analyze a wide range of student behaviors and actions, such as reading duration, resource
usage, and mastery of key concepts, enabling them to adopt more effective teaching methods
tailored to each student’s needs.
7 Libraries
Role of Librarians in Data Science: Jeffrey M. Stanton highlights the overlap between the
tasks of data science professionals and librarians. In the future, librarians will play a crucial
role in helping individuals find, analyze, and understand diverse data sources, thus becoming
key figures in knowledge creation and resource accessibility.
Data Organization and Metadata: Mark Bieraugel advocates for librarians to take an active
role in organizing big datasets. This includes creating taxonomies, designing metadata
schemes, and systematizing retrieval methods, which will help make large datasets more
accessible and useful.
Statistics vs. Data Science: The distinction between statistics and data science lies in
modern computing. Statistics was originally developed to solve data problems in pre-
computer times, like testing agricultural fertilizers or estimating small sample
accuracy. Data science, on the other hand, addresses modern challenges such as
managing large datasets, writing code for data manipulation, and visualizing data.
Statistics as a Subset of Data Science: Andrew Gelman, a statistician at Columbia
University, argues that statistics is a subset of data science and not necessarily the
most important part. He highlights that the administrative aspects of data science—
such as data harvesting, processing, storing, and cleaning—are more central than
traditional statistical methods.
Essential Data Science Skills: Nathan Yau, a statistician and data visualizer,
identifies three key skills for data scientists:
1. Statistical and Machine Learning Knowledge: Understanding basic
statistics and machine learning concepts to avoid errors like confusing
correlation with causation.
2. Computer Science Proficiency: The ability to manipulate large datasets using
programming languages like R or Python.
3. Data Visualization and Communication: The ability to present data and
analysis in a clear and meaningful way to people who may not be familiar with
complex data.
Computer Science Contributions to Data Science: Computer science plays a critical role in
data science through various techniques and methods. Key contributions include:
Database Systems: Systems that manage both structured and unstructured data, enabling
efficient data analysis.
Visualization Techniques: Methods that help people understand and interpret complex data.
Algorithms for Complex Data: Algorithms that allow for faster computation and processing
of complex and diverse datasets.
Mutual Support Between Data Science and Computer Science: Data science and
computer science are closely linked and support each other. Many of the techniques used in
data science, such as machine learning algorithms, pattern recognition, and data visualization,
have been developed within computer science.
• Engineering in various fields (chemical, civil, computer, mechanical, etc.) has created
demand for data scientists and data science methods.
• Engineers constantly need data to solve problems. Data scientists have been called
upon to develop methods and techniques to meet these needs.
• Data science has benefitted from new software and hardware developed via
engineering, such as the CPU (central processing unit) and GPU (graphic processing
unit) that substantially reduce computing time.
• The construction industry has changed drastically over the last few decades due to the
use of technology.
• Now it is possible to use “smart” building techniques that are rooted in collecting and
analyzing large amounts of heterogeneous data.
• It has become possible to estimate the likely cost of construction from the unit price
that contractors are likely to bid for a specific item, like a guardrail, given the
contractor's location, time of year, total contract value, relevant cost indices, etc.
• From 3D printing of models that can help predict the weak spots in construction, to
use of drones in monitoring the building site during the actual construction phase, all
these technologies generate volumes of data that need to be analyzed to engineer the
construction design and activity.
• Business analytics (BA) refers to the skills, technologies, and practices for continuous
iterative exploration and investigation of past and current business performance to
gain insight and be strategic.
• And that is where data science comes in. To fulfill the requirements of BA, data
scientists are needed for statistical analysis, to help drive successful decision-making.
• By partnering with data scientists, social scientists use machine learning to address
complex policy questions.
• Another example is using image and text recognition to analyze cultural changes over
time through historical photos or archived social media posts.
• Computational social science raises ethical issues about how data is used, especially
when it impacts people’s lives.
• Information science, which often intersects with computing and informatics, supports
fields where data usage and management are key, such as library science and
information technology.
• The goal is to cover how people study, access, use, and produce information in
different contexts, highlighting two complementary perspectives:
• The system perspective helps users observe, analyze, and interpret data.
• The human perspective focuses on making data useful for individual purposes.
• "Usefulness" is a criterion for evaluating how well data (information) helps a user
accomplish a specific task.
• Relevance varies by user context: e.g., general users may find broad information on
coffee health effects useful, while a dietitian may require more specific data.
• Scholars in information science tend to combine the user side and the system side to
understand how and why data is generated and the information they convey, given a
context. This is then often connected to studying people's behaviors.
Example
• For instance, information scientists may collect log data of one’s browser activities to
understand one’s search behaviors (the search terms they use, the results they click,
the amount of time they spend on various sites, etc.).
• This could allow them to create better methods for personalization and
recommendation.
Computational Thinking
• Essential skills include reading, writing, and thinking, which everyone should have
regardless of gender, profession, or discipline.
• Computational thinking is becoming a necessary skill for everyone, not just computer
scientists.
• Defined by Jeannette Wing as using abstraction and decomposition to approach large,
complex tasks or design complex systems.
• It is an iterative process based on the following three stages (illustrated in the short
sketch after this list):
1. Problem formulation (abstraction)
2. Solution expression (automation)
3. Solution execution and evaluation (analysis).
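As a small illustration of these three stages (the word-counting task is an assumed example, not one from the text), a short Python sketch:

# Stage 1 – Problem formulation (abstraction): "Which word occurs most often?"
# is abstracted as: split the text into words, count each word, pick the maximum.

# Stage 2 – Solution expression (automation): express that abstraction as code.
from collections import Counter

def most_common_word(text):
    words = text.lower().split()
    counts = Counter(words)
    return counts.most_common(1)[0]

# Stage 3 – Solution execution and evaluation (analysis): run it and inspect the result.
sample = "data science uses data to answer questions about data"
print(most_common_word(sample))   # ('data', 3)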
R: Used primarily for statistical analysis and data visualization. Simple syntax similar to
Python, making it beginner-friendly.
SQL: Useful for handling large datasets that cannot be stored in simple files. Allows
interaction with databases to manage and analyze data beyond local storage limitations.
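A minimal sketch of this kind of database interaction, using Python's built-in sqlite3 module (the database file, table, and column names are hypothetical):

# Assumes Python 3; sqlite3 is part of the standard library.
import sqlite3

conn = sqlite3.connect("customers.db")        # creates the file if it does not exist
cur = conn.cursor()

# Create a small table and insert a couple of rows
cur.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, age INTEGER)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("Ryan", 33), ("Samantha", 32)])
conn.commit()

# SQL lets us filter and aggregate without loading the whole dataset into memory
for row in cur.execute("SELECT name, age FROM customers WHERE age > 30"):
    print(row)

conn.close()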
Comparison of Python vs. Java (Example):
• A "Hello, World" program in Java requires multiple steps (writing, compiling, and
running), which is more complex.
• The same program in Python can be written and executed in a few lines (as shown in
the sketch below), demonstrating Python's simplicity.
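For reference, the complete Python version is shown here (the Java version would additionally need a class with a main method, compilation with javac, and a separate run step):

# hello.py — run directly with:  python hello.py
print("Hello, World")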
Additional Tools:
UNIX:
– Basic knowledge of UNIX can help in data processing and performing tasks
without extensive coding.
– Useful for day-to-day data handling and file manipulation.
• Python and R support various packages for advanced tasks such as machine learning,
which can be easily imported for specific functions.
Data Storage:
• Simple text files (e.g., CSV format) are often used for data, but SQL databases are
preferred for handling larger datasets and remote data access.
Data: Introduction
Effective data analysis begins with gathering and sorting relevant data, relying on appropriate
information sources. Previously, different forms of data were introduced, including numerical
data (like height and weight), multimedia data (like photos), and open government datasets.
These data types can be stored in various locations, from personal devices to large data
warehouses.
Data Types
Structured data is highly organized, easily stored in databases, and can be quickly retrieved
through simple searches. It includes labeled values, such as numbers with specific tags.
Unstructured data, by contrast, lacks an organized format, making it harder to categorize and
search.
Structured Data:
Structured data is the primary focus for exercises in this book due to its organized nature and
defined labels. Examples, like height and weight, illustrate structured data since values are
assigned to specific fields (e.g., "60" for height and "120" for weight).
Structured data can include various types, not limited to numbers. It can also contain text,
Boolean values, and categorical data, such as age, income, housing type, employment status,
and marital status.
Unstructured data
Consider, for example, the following natural-language passage, which conveys a finding but
has no predefined fields or labels:
“It was found that a female with a height between 65 inches and 67 inches had an IQ of 125–
130. However, it was not clear looking at a person shorter or taller than this observation if the
change in IQ score could be different, and, even if it was, it could not be possibly concluded
that the change was solely due to the difference in one’s height.”
Natural Language vs. Structured Format: Unstructured data, such as natural language,
aligns with human communication but not with machine readability, which favors
structured formats.
Example - Email as Unstructured Data: Emails, while organized by users, are
inherently unstructured as they cover multiple topics and lack a strict, consistent
format.
Data Collections
1 Open Data
Open data refers to data that is freely available to the public, without restrictions from
copyright, patents, or control mechanisms. It allows unrestricted use and sharing of
information.
Various entities, including local and federal governments, NGOs, and academic
institutions, lead open data initiatives.
Examples include the US Government's open data repositories and Project Open Data,
developed by the White House in 2013 to promote open data use.
The US Government’s policy encourages agencies to treat data as an asset and release
it in a public, open, and usable format whenever possible.
1. Public: Agencies should prioritize openness, with exceptions only for privacy,
confidentiality, security, or legal reasons.
2. Accessible: Data should be available in open, modifiable, and machine-readable
formats, with no discrimination in access or use.
3. Described: Data must be accompanied by robust metadata and documentation to
provide context, limitations, and processing guidance.
4. Reusable: Data should be licensed openly, allowing unrestricted reuse.
5. Complete: Data should be published in primary form with high granularity, and
derived data must reference primary sources.
6. Timely: Data should be released promptly, considering the needs of users and the
value of timely access.
7. Managed Post-Release: A contact person should be designated to assist users and
address concerns regarding compliance with open data standards.
2 Social Media Data
Social media platforms offer valuable data for research and marketing, thanks to the APIs
they provide for accessing structured data.
Understanding APIs
The Facebook Graph API is widely used for tasks such as developing applications,
studying human behavior, and tracking disaster responses.
Social media platforms like Yelp release datasets to promote research in fields like
photo classification, natural language processing, sentiment analysis, and graph
mining.
Interested individuals can explore research opportunities through platforms like the
Yelp dataset challenge.
Future Learning
This method of collecting data will be revisited in later chapters, providing deeper
insights into data collection via APIs.
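A minimal sketch of collecting data through a web API with Python's requests package is shown here; the endpoint, query parameters, and API key are hypothetical placeholders, not the actual Facebook Graph or Yelp APIs:

# Hypothetical sketch: fetching structured (JSON) data from a web API.
# Assumes the `requests` package is installed; the URL and parameters are placeholders.
import requests

response = requests.get(
    "https://api.example.com/v1/posts",          # hypothetical endpoint
    params={"query": "coffee", "limit": 10},     # hypothetical query parameters
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
response.raise_for_status()          # stop early if the request failed

posts = response.json()              # most APIs return JSON, parsed into Python objects
for post in posts.get("data", []):
    print(post)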
3. Multimodal Data
The increasing number of connected devices in the Internet of Things (IoT) generates
vast amounts of data, often beyond traditional forms like numbers and text.
IoT data can include multimodal (different forms) and multimedia (different media)
data, such as images, music, sounds, gestures, body posture, and spatial usage.
Data collected from IoT devices can be categorized into two types: structured data
(organized and labeled) and unstructured data (without defined labels).
One key application is the analysis of brain imaging data, where multimodal datasets,
like EEG, MEG, and fMRI, are used to study brain activity.
Statistical parametric mapping (SPM) is a technique used in this field to analyze
differences in brain activity during functional neuroimaging experiments.
Data Formats
CSV (Comma-Separated Values): A plain-text format in which each record is one line and
field values are separated by commas.
Example:
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
Advantages: Compatible with spreadsheet programs like Excel and Google Sheets.
Easy to read and share.
Disadvantages: Commas within data values can cause issues, requiring special handling
(e.g., enclosing the field in quotes or escaping the commas), as the reading sketch below
illustrates.
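For instance, a minimal sketch of reading the example above with Python's built-in csv module (assuming the rows are saved in a file named treatment.csv):

# Assumes the example above is saved as "treatment.csv" with the header row
# treat,before,after,diff. The csv module handles quoted fields, so a comma inside
# a value does not break parsing as long as that field is enclosed in quotes.
import csv

with open("treatment.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["treat"], int(row["diff"]))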
TSV (Tab-Separated Values): A plain-text format in which field values are separated by tab
characters.
Example:
Name Age Address
Ryan 33 1115 W Franklin
Paul 25 Big Farm Way
Jim 45 W Main St
Samantha 32 28 George St
Advantages: Less likely to conflict with data, as the tab character is rarely used in
text fields.
Disadvantages: Less common than CSV, so specific tools may be required to read it.
XML (eXtensible Markup Language): A flexible markup language for storing and transporting
data, both human-readable and machine-readable.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="information science" cover="hardcover">
<title lang="en">Social Information Seeking</title>
<author>Chirag Shah</author>
<year>2017</year>
<price>62.58</price>
</book>
<book category="data science" cover="paperback">
<title lang="en">Hands-On Introduction to Data Science</title>
<author>Chirag Shah</author>
<year>2019</year>
<price>50.00</price>
</book>
</bookstore>
Advantages: Portable, flexible, and widely used for data exchange between different
systems.
Disadvantages: Requires parsing software to process, as it’s not directly viewable or
usable without transformation.
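As an example of that parsing step, a small sketch using Python's built-in xml.etree.ElementTree module on the bookstore document above (assuming it is saved as bookstore.xml):

# Assumes the XML example above is saved as "bookstore.xml".
import xml.etree.ElementTree as ET

tree = ET.parse("bookstore.xml")
root = tree.getroot()                           # the <bookstore> element

for book in root.findall("book"):
    title = book.find("title").text
    price = float(book.find("price").text)
    print(book.get("category"), "-", title, "-", price)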
RSS (Really Simple Syndication): An XML-based format used to deliver regularly updated
information, such as news articles, blog posts, or other web updates.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Dr. Chirag Shah’s Home Page</title>
<link>http://chiragshah.org/</link>
<description> Chirag Shah’s webhome</description>
<item>
<title>Awards and Honors</title>
<link>http://chiragshah.org/awards.php</link>
<description>Awards and Honors Dr. Shah received</description>
</item>
</channel>
</rss>
Advantages: Provides real-time updates to users, ideal for dynamic content like news
or blogs.
Disadvantages: Less useful for static content, requires an RSS reader or aggregator
for users to access the updates.
JSON (JavaScript Object Notation): A lightweight data-interchange format that is easy for
humans and machines to read and write. It is commonly used for web data exchange.
Example:
{
"name": "John",
"age": 25,
"state": "New Jersey"
}
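A minimal sketch of reading and writing such a record with Python's built-in json module:

import json

# Parse a JSON string into a Python dictionary
record = json.loads('{"name": "John", "age": 25, "state": "New Jersey"}')
print(record["age"])                       # 25

# Serialize a Python dictionary back to a JSON string
print(json.dumps({"name": "John", "age": 26}, indent=2))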
Data Pre-processing
Data in the real world is often dirty; that is, it is in need of being cleaned up before it can be
used for a desired purpose. This is often called data pre-processing. What makes data “dirty”?
Here are some of the factors that indicate that data is not clean or ready to process:
• Incomplete. When some of the attribute values are lacking, certain attributes of interest are
lacking, or attributes contain only aggregate data.
• Noisy. When data contains errors or outliers. For example, some of the data points in a
dataset may contain extreme values that can severely affect the dataset’s range.
• Inconsistent. Data contains discrepancies in codes or names. For example, if the "Name"
column in employee registration records contains values other than alphabetical letters, or
if records do not start with a capital letter, discrepancies are present.
1. Data Cleaning
Data cleaning involves preparing data by identifying and correcting or removing errors,
inconsistencies, or irrelevant information to improve its quality for analysis. There are
various methods for cleaning data, depending on the nature of the issues at hand.
a. Data Munging
Data munging, also known as data wrangling, is the process of converting data into a format
that is easier for computers to process and analyze. Often, raw data is unorganized or poorly
formatted, making it difficult to work with. The transformation process can be done
manually, automatically, or semi-automatically, depending on the complexity of the task.
b. Handling Missing Data
Missing data occurs when some values are absent, which can be due to various reasons such
as data collection issues, system errors, or incomplete data entry. For example, some
customers may not have home phone numbers, or the data collection process may have
missed certain fields.
c. Smoothing Noisy Data
Noisy data refers to corrupted or erroneous values caused by issues such as faulty data
collection instruments or technology limitations. This can result in inconsistent or imprecise
data, which can skew results.
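A brief sketch of handling both issues with pandas (the column names, values, and chosen strategies are illustrative assumptions):

# Illustrative sketch: filling missing values and capping an extreme (noisy) value.
# Assumes pandas and numpy are installed.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],                      # one missing value
    "income": [40_000, 52_000, 48_000, 1_000_000, np.nan],   # one extreme value, one missing
})

# Missing data: fill numeric gaps with the column median (one of several possible strategies)
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Noisy data: cap extreme values at the 95th percentile to limit their influence
df["income"] = df["income"].clip(upper=df["income"].quantile(0.95))

print(df)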
2. Data Integration
Data integration involves combining data from various sources to create a unified and
cohesive dataset. This process allows for more efficient and effective data analysis by
consolidating information into a single, consistent format.
1. Combine Data from Multiple Sources: The first step is to merge data from different
sources, such as databases or files, into a single storage system (e.g., one database or a
unified file).
2. Schema Integration: This involves combining the metadata from different sources,
aligning the structure and meaning of the data across multiple datasets to ensure
consistency.
3. Detect and Resolve Data Value Conflicts: Conflicts may arise when data from
different sources represent the same real-world entity with different attributes or
values. Examples include:
o Attribute Conflicts: Different sources may use different attributes to describe
the same entity.
o Unit Conflicts: Different sources may use different units, such as metric vs.
British units.
4. Address Redundant Data: Redundancy can occur when integrating data from
multiple sources. Common cases include:
o Naming Differences: The same attribute may have different names across
sources.
o Derived Attributes: Some attributes may be calculated in one table (e.g.,
annual revenue) but stored as a separate field in another.
o Detection via Analysis: Techniques such as correlation analysis may help
identify and handle redundant data (see the sketch following this list).
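A small sketch of these steps with pandas (the two sources, column names, and the pound-to-kilogram conversion are assumptions made for illustration):

# Illustrative sketch: merging two hypothetical sources and resolving naming/unit conflicts.
import pandas as pd

# Source A records weight in kilograms; source B calls the same attribute "wt_lbs" and uses pounds.
source_a = pd.DataFrame({"customer_id": [1, 2, 3], "weight_kg": [70.0, 82.5, 60.0]})
source_b = pd.DataFrame({"customer_id": [1, 2, 3], "wt_lbs": [154.3, 181.9, 132.3],
                         "annual_income": [40_000, 55_000, 38_000]})

# Resolve the unit conflict (pounds -> kilograms) and the naming conflict (rename the column)
source_b["weight_kg_b"] = source_b["wt_lbs"] * 0.4536

# Schema integration: merge the two sources on their shared key
merged = source_a.merge(source_b[["customer_id", "weight_kg_b", "annual_income"]],
                        on="customer_id")

# Redundancy detection: a correlation near 1 suggests the two weight columns duplicate each other
print(merged["weight_kg"].corr(merged["weight_kg_b"]))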
3. Data Transformation
Data must be transformed so it is consistent and readable (by a system). The following five
processes may be used for data transformation. For the time being, do not worry if these seem
too abstract. We will revisit some of them in the next section as we work through an example
of data pre-processing.
One of these processes is normalization, in which values are scaled to fall within a small,
specified range. Some of the techniques used for accomplishing normalization (briefly
sketched below, but not covered in detail here) are:
a. Min–max normalization.
b. Z-score normalization.
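A brief sketch of both techniques on a small made-up list of values:

# Illustrative sketch of min–max and z-score normalization (numpy assumed installed).
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min–max normalization: rescale the values to the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: center at 0 with unit standard deviation
z_scores = (values - values.mean()) / values.std()

print(min_max)     # [0.   0.25 0.5  0.75 1.  ]
print(z_scores)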
4. Data Reduction
Data reduction produces a reduced representation of the dataset that is much smaller in
volume but still fulfills the task requirements. Essentially, it reduces the dimensionality of the
data while retaining the relevant information needed for the analysis.
1. Data Cube Aggregation: Data is aggregated along one or more dimensions (for
example, daily records rolled up into monthly summaries), producing a smaller dataset
that still supports the analysis task.
2. Dimensionality Reduction: Unlike data cube aggregation, dimensionality reduction
aims to reduce data by analyzing its structure. Each dimension or column in a dataset
is treated as a "feature," and the goal is to identify which features can be removed or
combined. This process helps to eliminate redundancy and create composite features
that better represent the data. Common strategies for dimensionality reduction (one of
which, PCA, is sketched after this list) include:
o Sampling
o Clustering
o Principal Component Analysis (PCA)
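As a sketch of one of these strategies, PCA with scikit-learn on randomly generated data (the data and the choice of two components are illustrative assumptions):

# Illustrative sketch: reducing four correlated features to two principal components.
# Assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Build four features, two of which are noisy copies of the others (i.e., redundant)
X = np.hstack([base, base + rng.normal(scale=0.05, size=(100, 2))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)             # 100 rows, now only 2 composite features

print(X.shape, "->", X_reduced.shape)        # (100, 4) -> (100, 2)
print(pca.explained_variance_ratio_)         # most variance is captured by the 2 components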
5. Data Discretization
Discretization is the process of converting continuous data into discrete, more manageable
parts. This is particularly useful when dealing with numerical data that is difficult to analyze
in its raw form. The main goal is to simplify the data without losing its essential meaning,
and it can also be seen as a form of data reduction.
Achieving Discretization
Discretization is typically done by dividing the range of continuous attributes into intervals.
For example:
Temperature could be split into categories such as cold, moderate, and hot.
Stock prices could be categorized as above or below market valuation.
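A short sketch of this kind of binning with pandas.cut (the bin edges and labels are illustrative choices):

# Illustrative sketch: binning continuous temperatures (°F) into discrete categories.
import pandas as pd

temps = pd.Series([31, 45, 62, 75, 88, 96])
categories = pd.cut(temps,
                    bins=[-float("inf"), 50, 77, float("inf")],
                    labels=["cold", "moderate", "hot"])
print(categories.tolist())   # ['cold', 'cold', 'moderate', 'moderate', 'hot', 'hot']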