IDS UNIT-1
1 Finance:
Predictive Modeling: Building models to forecast market events and trends, which help in
making better investment or loan decisions.
Fraud Detection and Risk Management: Using data to detect fraud and reduce financial
risk, especially in lending. By analyzing customer data such as spending habits and credit
history, data scientists can assess the likelihood of loan defaults or creditworthiness.
Credit Scoring: Financial institutions use machine learning algorithms to evaluate a
customer’s creditworthiness based on their past financial behavior, ultimately determining
loan amounts and interest rates.
An example is Lending Club, an online marketplace that connects borrowers with investors.
Lending Club uses predictive models built from historical loan data to identify risky loan
applicants and reduce defaults. Data scientists apply various machine learning algorithms to
create such models.
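As a minimal, hypothetical sketch of how such a default-prediction model could be built (the toy data, feature names, and use of scikit-learn's logistic regression are illustrative assumptions, not Lending Club's actual system):

# Hypothetical sketch: predicting loan default from two borrower features.
# Assumes Python with pandas and scikit-learn installed; the data is made up.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

loans = pd.DataFrame({
    "income":       [35, 80, 22, 60, 45, 95, 28, 70],        # annual income in $1000s
    "credit_score": [580, 720, 540, 690, 610, 750, 560, 700],
    "defaulted":    [1, 0, 1, 0, 1, 0, 1, 0],                 # 1 = loan defaulted
})

X = loans[["income", "credit_score"]]
y = loans["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Estimated probability of default for a new applicant
new_applicant = pd.DataFrame({"income": [50], "credit_score": [640]})
print("default probability:", model.predict_proba(new_applicant)[0][1])

In practice such a model would be trained on thousands of historical loans and many more features, but the workflow (assemble labeled data, fit a classifier, score new applicants) is the same.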
2 Public Policy:
Public policy is the creation and implementation of laws, regulations, and policies to
address societal problems for the benefit of citizens. Social sciences like economics,
political science, and sociology play a key role in shaping public policy.
Data science helps governments and agencies understand citizen behaviours in areas
such as traffic, public transportation, social welfare, and community wellbeing. By
analysing large datasets, data scientists can provide insights to inform better public
policy decisions.
The availability of open data has made it easier to obtain valuable information for policy
analysis. Examples of open data repositories include:
Data.gov (US government), with over 200,000 datasets covering diverse topics.
City of Chicago’s open data portal, which organizes data in categories like
administration, finance, and sanitation.
NYC OpenData, offering datasets by city agency, including over 400 datasets related
to city government.
One notable initiative using data science for public policy is the Data Science for
Social Good project, which brings together data scientists from around the world to
work on projects that address societal issues.
This includes challenges like estimating the size of refugee camps in war zones or
developing systems to use data for social good. The project typically holds events in
June each year.
3 Politics:
Political Process: Politics involves electing officials who create and implement policies to
govern a state, with government finances often supported by taxes.
Data Science in Politics: The integration of data science into political campaigns has grown
significantly. Notable examples include President Obama's 2008 campaign, where data
science played a crucial role in Internet-based efforts, and Donald Trump’s 2016 campaign,
which effectively used social media for targeted voter outreach.
Voter Targeting and Social Media: Data science has improved voter targeting models,
increasing voter participation. Trump's 2016 campaign used social media data to tailor
messages to individuals, while studies of Twitter content revealed differences in messaging
strategies between Trump and Hillary Clinton, such as the emphasis on masculine vs.
feminine traits and the use of user-generated content.
Ethical Concerns: The use of data science in politics also brings ethical concerns, highlighted
by the Cambridge Analytica scandal, where data from 87 million Facebook users was
exploited for political ad targeting. This incident raised questions about privacy, which is an
ongoing issue in data use for political purposes.
4 Healthcare
Evolution of Healthcare Data: The healthcare industry has always collected data, but the
amount of information now available is vast and diverse. This includes biological data such
as gene expression, DNA sequences, proteomics, and metabolomics, in addition to clinical
data, electronic health records (EHRs), and medical claims.
Clinical Data Integration: Data scientists combine clinical trial data with real-world
observations from practicing physicians, enabling more personalized healthcare. This
approach helps healthcare professionals identify effective treatments for patients and
understand health outcomes on a larger scale.
Personal Health Management: Wearable devices like Fitbit have revolutionized personal
health tracking. These devices collect detailed health data, including heart rate, blood glucose
levels, sleep patterns, stress levels, and brain activity. Such tools empower individuals to
track and manage their health more effectively.
Research and Long-Term Monitoring: Wearable health trackers have become essential in
health research, allowing for long-term monitoring of physical activity and health behaviors.
For example, a study using Fitbit devices tracked physical activity among overweight,
postmenopausal women over 16 weeks, demonstrating the effectiveness of self-monitoring in
maintaining health behaviors.
5 Urban Planning
Transforming Urban Planning: Urban planning is undergoing a shift, with data science
playing a key role in improving the quality of life and urban systems. This shift is driven by
the growing availability of data and new computational methods that can be applied to
understand and improve cities.
UrbanCCD Initiatives: The Urban Center for Computation and Data (UrbanCCD) at the
University of Chicago is leading efforts to address the rapid growth of cities. By integrating
advanced computational methods, the center aims to improve city design and operations,
bringing together experts in various fields to tackle urban challenges like inefficient
transportation and overcrowded slums.
Challenges of Urban Expansion: As cities grow quickly, traditional urban design tools
become insufficient. The UrbanCCD highlights the need for advanced computational
resources to anticipate the impact of urban expansion and find solutions to related problems,
such as poverty, health issues, and environmental concerns.
Smaller-Scale Data Solutions: Smaller cities and initiatives are also leveraging data science.
For example, chicagoshovels.org offers a "Plow Tracker" to help residents track snowplows
in real time and organize snow removal efforts. This platform also provides real-time bus
arrival information for residents. Similarly, Boston’s Office of New Urban Mechanics has
created various apps, such as the SnowCOP app, to improve public services during
snowstorms and other city operations.
Data for Local Services: In Jackson, Michigan, data is used to track water usage and identify
potentially abandoned homes, demonstrating how even smaller cities are harnessing data for
better municipal services and resource management. The potential applications of data
science in urban planning and local government services are extensive.
6 Education
Challenges of Technology in Education: Joel Klein, former Chancellor of New York Public
Schools, argues that simply providing computers to students does not necessarily improve
education. While technology plays an important role in education, its impact depends on how
it is integrated into the learning process.
Teachers as Data Scientists: In a data-rich future classroom, teachers will rely heavily on data
analysis to assess student performance. Automated tools will provide insights into each
student's progress, allowing for more personalized instruction.
Big Data in Education: Big data offers significant potential for improving education by
providing insights into student performance and learning techniques. Teachers will be able to
analyze a wide range of student behaviors and actions, such as reading duration, resource
usage, and mastery of key concepts, enabling them to adopt more effective teaching methods
tailored to each student’s needs.
7 Libraries
Role of Librarians in Data Science: Jeffrey M. Stanton highlights the overlap between the
tasks of data science professionals and librarians. In the future, librarians will play a crucial
role in helping individuals find, analyze, and understand diverse data sources, thus becoming
key figures in knowledge creation and resource accessibility.
Data Organization and Metadata: Mark Bieraugel advocates for librarians to take an active
role in organizing big datasets. This includes creating taxonomies, designing metadata
schemes, and systematizing retrieval methods, which will help make large datasets more
accessible and useful.
Statistics vs. Data Science: The distinction between statistics and data science lies in
modern computing. Statistics was originally developed to solve data problems in pre-
computer times, like testing agricultural fertilizers or estimating small sample
accuracy. Data science, on the other hand, addresses modern challenges such as
managing large datasets, writing code for data manipulation, and visualizing data.
Statistics as a Subset of Data Science: Andrew Gelman, a statistician at Columbia
University, argues that statistics is a subset of data science and not necessarily the
most important part. He highlights that the administrative aspects of data science—
such as data harvesting, processing, storing, and cleaning—are more central than
traditional statistical methods.
Essential Data Science Skills: Nathan Yau, a statistician and data visualizer,
identifies three key skills for data scientists:
1. Statistical and Machine Learning Knowledge: Understanding basic
statistics and machine learning concepts to avoid errors like confusing
correlation with causation.
2. Computer Science Proficiency: The ability to manipulate large datasets using
programming languages like R or Python.
3. Data Visualization and Communication: The ability to present data and
analysis in a clear and meaningful way to people who may not be familiar with
complex data.
Computer Science Contributions to Data Science: Computer science plays a critical role in
data science through various techniques and methods. Key contributions include:
Database Systems: Systems that manage both structured and unstructured data, enabling
efficient data analysis.
Visualization Techniques: Methods that help people understand and interpret complex data.
Algorithms for Complex Data: Algorithms that allow for faster computation and processing
of complex and diverse datasets.
Mutual Support Between Data Science and Computer Science: Data science and
computer science are closely linked and support each other. Many of the techniques used in
data science, such as machine learning algorithms, pattern recognition, and data visualization,
have been developed within computer science.
• Engineering in various fields (chemical, civil, computer, mechanical, etc.) has created
demand for data scientists and data science methods.
• Engineers constantly need data to solve problems. Data scientists have been called
upon to develop methods and techniques to meet these needs.
• Data science has benefitted from new software and hardware developed via
engineering, such as the CPU (central processing unit) and GPU (graphic processing
unit) that substantially reduce computing time.
• The construction industry has changed drastically over the last few decades due to the
use of technology.
• Now it is possible to use “smart” building techniques that are rooted in collecting and
analyzing large amounts of heterogeneous data.
• It has become possible to estimate the likely cost of construction from the unit price
that contractors are likely to bid for a specific item, like a guardrail, given the
contractor's location, time of year, total contract value, relevant cost indices, etc.
• From 3D printing of models that can help predict the weak spots in construction, to
use of drones in monitoring the building site during the actual construction phase, all
these technologies generate volumes of data that need to be analyzed to engineer the
construction design and activity.
• Business analytics (BA) refers to the skills, technologies, and practices for continuous
iterative exploration and investigation of past and current business performance to
gain insight and be strategic.
• And that is where data science comes in. To fulfill the requirements of BA, data
scientists are needed for statistical analysis, to help drive successful decision-making.
• By partnering with data scientists, social scientists use machine learning to address
complex policy questions.
• Another example is using image and text recognition to analyze cultural changes over
time through historical photos or archived social media posts.
• Computational social science raises ethical issues about how data is used, especially
when it impacts people’s lives.
• Information science, which often intersects with computing and informatics, supports
fields where data usage and management are key, such as library science and
information technology.
• The goal is to cover how people study, access, use, and produce information in
different contexts, highlighting two complementary perspectives:
• The system perspective helps users observe, analyze, and interpret data.
• The human perspective focuses on making data useful for individual purposes.
• "Usefulness" is a criterion for evaluating how well data (information) helps a user
accomplish a specific task.
• Relevance varies by user context: e.g., general users may find broad information on
coffee health effects useful, while a dietitian may require more specific data.
• Scholars in information science tend to combine the user side and the system side to
understand how and why data is generated and the information they convey, given a
context. This is then often connected to studying people's behaviors.
Example
• For instance, information scientists may collect log data of one’s browser activities to
understand one’s search behaviors (the search terms they use, the results they click,
the amount of time they spend on various sites, etc.).
• This could allow them to create better methods for personalization and
recommendation.
Computational Thinking
• Essential skills include reading, writing, and thinking, which everyone should have
regardless of gender, profession, or discipline.
• Computational thinking is becoming a necessary skill for everyone, not just computer
scientists.
• Defined by Jeannette Wing as using abstraction and decomposition to approach large,
complex tasks or design complex systems.
• It is an iterative process based on the following three stages (illustrated in the short
sketch after this list):
1. Problem formulation (abstraction)
2. Solution expression (automation)
3. Solution execution and evaluation (analysis).
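As a small illustration of these three stages (the word-counting task is an assumed example, not one from the text), a short Python sketch:

# Stage 1 – Problem formulation (abstraction): "Which word occurs most often?"
# is abstracted as: split the text into words, count each word, pick the maximum.

# Stage 2 – Solution expression (automation): express that abstraction as code.
from collections import Counter

def most_common_word(text):
    words = text.lower().split()
    counts = Counter(words)
    return counts.most_common(1)[0]

# Stage 3 – Solution execution and evaluation (analysis): run it and inspect the result.
sample = "data science uses data to answer questions about data"
print(most_common_word(sample))   # ('data', 3)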
R: Used primarily for statistical analysis and data visualization. Simple syntax similar to
Python, making it beginner-friendly.
SQL: Useful for handling large datasets that cannot be stored in simple files. Allows
interaction with databases to manage and analyze data beyond local storage limitations.
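A minimal sketch of this kind of database interaction, using Python's built-in sqlite3 module (the database file, table, and column names are hypothetical):

# Assumes Python 3; sqlite3 is part of the standard library.
import sqlite3

conn = sqlite3.connect("customers.db")        # creates the file if it does not exist
cur = conn.cursor()

# Create a small table and insert a couple of rows
cur.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, age INTEGER)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("Ryan", 33), ("Samantha", 32)])
conn.commit()

# SQL lets us filter and aggregate without loading the whole dataset into memory
for row in cur.execute("SELECT name, age FROM customers WHERE age > 30"):
    print(row)

conn.close()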
Comparison of Python vs. Java (Example):
• A "Hello, World" program in Java requires multiple steps (writing, compiling, and
running), which is more complex.
• The same program in Python can be written and executed in a few lines (as shown in
the sketch below), demonstrating Python's simplicity.
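For reference, the complete Python version is shown here (the Java version would additionally need a class with a main method, compilation with javac, and a separate run step):

# hello.py — run directly with:  python hello.py
print("Hello, World")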
Additional Tools:
UNIX:
– Basic knowledge of UNIX can help in data processing and performing tasks
without extensive coding.
– Useful for day-to-day data handling and file manipulation.
• Python and R support various packages for advanced tasks such as machine learning,
which can be easily imported for specific functions.
Data Storage:
• Simple text files (e.g., CSV format) are often used for data, but SQL databases are
preferred for handling larger datasets and remote data access.
Data: Introduction
Effective data analysis begins with gathering and sorting relevant data, relying on appropriate
information sources. Previously, different forms of data were introduced, including numerical
data (like height and weight), multimedia data (like photos), and open government datasets.
These data types can be stored in various locations, from personal devices to large data
warehouses.
Data Types
Structured data is highly organized, easily stored in databases, and can be quickly retrieved
through simple searches. It includes labeled values, such as numbers with specific tags.
Unstructured data, by contrast, lacks an organized format, making it harder to categorize and
search.
Structured Data:
Structured data is the primary focus for exercises in this book due to its organized nature and
defined labels. Examples, like height and weight, illustrate structured data since values are
assigned to specific fields (e.g., "60" for height and "120" for weight).
Structured data can include various types, not limited to numbers. It can also contain text,
Boolean values, and categorical data, such as age, income, housing type, employment status,
and marital status.
Unstructured data
Consider, for example, the following natural-language passage, which conveys a finding but
has no predefined fields or labels:
“It was found that a female with a height between 65 inches and 67 inches had an IQ of 125–
130. However, it was not clear looking at a person shorter or taller than this observation if the
change in IQ score could be different, and, even if it was, it could not be possibly concluded
that the change was solely due to the difference in one’s height.”
Natural Language vs. Structured Format: Unstructured data, such as natural language,
aligns with human communication but not with machine readability, which favors
structured formats.
Example - Email as Unstructured Data: Emails, while organized by users, are
inherently unstructured as they cover multiple topics and lack a strict, consistent
format.
Data Collections
1 Open Data
Open data refers to data that is freely available to the public, without restrictions from
copyright, patents, or control mechanisms. It allows unrestricted use and sharing of
information.
Various entities, including local and federal governments, NGOs, and academic
institutions, lead open data initiatives.
Examples include the US Government's open data repositories and Project Open Data,
developed by the White House in 2013 to promote open data use.
The US Government’s policy encourages agencies to treat data as an asset and release
it in a public, open, and usable format whenever possible.
1. Public: Agencies should prioritize openness, with exceptions only for privacy,
confidentiality, security, or legal reasons.
2. Accessible: Data should be available in open, modifiable, and machine-readable
formats, with no discrimination in access or use.
3. Described: Data must be accompanied by robust metadata and documentation to
provide context, limitations, and processing guidance.
4. Reusable: Data should be licensed openly, allowing unrestricted reuse.
5. Complete: Data should be published in primary form with high granularity, and
derived data must reference primary sources.
6. Timely: Data should be released promptly, considering the needs of users and the
value of timely access.
7. Managed Post-Release: A contact person should be designated to assist users and
address concerns regarding compliance with open data standards.
2 Social Media Data
Social media platforms offer valuable data for research and marketing, thanks to the APIs
they provide for accessing structured data.
Understanding APIs
The Facebook Graph API is widely used for tasks such as developing applications,
studying human behavior, and tracking disaster responses.
Social media platforms like Yelp release datasets to promote research in fields like
photo classification, natural language processing, sentiment analysis, and graph
mining.
Interested individuals can explore research opportunities through platforms like the
Yelp dataset challenge.
Future Learning
This method of collecting data will be revisited in later chapters, providing deeper
insights into data collection via APIs.
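A minimal sketch of collecting data through a web API with Python's requests package is shown here; the endpoint, query parameters, and API key are hypothetical placeholders, not the actual Facebook Graph or Yelp APIs:

# Hypothetical sketch: fetching structured (JSON) data from a web API.
# Assumes the `requests` package is installed; the URL and parameters are placeholders.
import requests

response = requests.get(
    "https://api.example.com/v1/posts",          # hypothetical endpoint
    params={"query": "coffee", "limit": 10},     # hypothetical query parameters
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
response.raise_for_status()          # stop early if the request failed

posts = response.json()              # most APIs return JSON, parsed into Python objects
for post in posts.get("data", []):
    print(post)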
3. Multimodal Data
The increasing number of connected devices in the Internet of Things (IoT) generates
vast amounts of data, often beyond traditional forms like numbers and text.
IoT data can include multimodal (different forms) and multimedia (different media)
data, such as images, music, sounds, gestures, body posture, and spatial usage.
Data collected from IoT devices can be categorized into two types: structured data
(organized and labeled) and unstructured data (without defined labels).
One key application is the analysis of brain imaging data, where multimodal datasets,
like EEG, MEG, and fMRI, are used to study brain activity.
Statistical parametric mapping (SPM) is a technique used in this field to analyze
differences in brain activity during functional neuroimaging experiments.
Data Formats
CSV (Comma-Separated Values): A plain-text format in which each record is one line and
field values are separated by commas.
Example:
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
Advantages: Compatible with spreadsheet programs like Excel and Google Sheets.
Easy to read and share.
Disadvantages: Commas within data values can cause issues, requiring special handling
(e.g., enclosing the field in quotes or escaping the commas), as the reading sketch below
illustrates.
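For instance, a minimal sketch of reading the example above with Python's built-in csv module (assuming the rows are saved in a file named treatment.csv):

# Assumes the example above is saved as "treatment.csv" with the header row
# treat,before,after,diff. The csv module handles quoted fields, so a comma inside
# a value does not break parsing as long as that field is enclosed in quotes.
import csv

with open("treatment.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["treat"], int(row["diff"]))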
TSV (Tab-Separated Values): A plain-text format in which field values are separated by tab
characters.
Example:
Name Age Address
Ryan 33 1115 W Franklin
Paul 25 Big Farm Way
Jim 45 W Main St
Samantha 32 28 George St
Advantages: Less likely to conflict with data, as the tab character is rarely used in
text fields.
Disadvantages: Less common than CSV, so specific tools may be required to read it.
XML (eXtensible Markup Language): A flexible markup language for storing and transporting
data, both human-readable and machine-readable.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="information science" cover="hardcover">
<title lang="en">Social Information Seeking</title>
<author>Chirag Shah</author>
<year>2017</year>
<price>62.58</price>
</book>
<book category="data science" cover="paperback">
<title lang="en">Hands-On Introduction to Data Science</title>
<author>Chirag Shah</author>
<year>2019</year>
<price>50.00</price>
</book>
</bookstore>
Advantages: Portable, flexible, and widely used for data exchange between different
systems.
Disadvantages: Requires parsing software to process, as it’s not directly viewable or
usable without transformation.
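As an example of that parsing step, a small sketch using Python's built-in xml.etree.ElementTree module on the bookstore document above (assuming it is saved as bookstore.xml):

# Assumes the XML example above is saved as "bookstore.xml".
import xml.etree.ElementTree as ET

tree = ET.parse("bookstore.xml")
root = tree.getroot()                           # the <bookstore> element

for book in root.findall("book"):
    title = book.find("title").text
    price = float(book.find("price").text)
    print(book.get("category"), "-", title, "-", price)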
RSS (Really Simple Syndication): An XML-based format used to deliver regularly updated
information, such as news articles, blog posts, or other web updates.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Dr. Chirag Shah’s Home Page</title>
<link>http://chiragshah.org/</link>
<description> Chirag Shah’s webhome</description>
<item>
<title>Awards and Honors</title>
<link>http://chiragshah.org/awards.php</link>
<description>Awards and Honors Dr. Shah received</description>
</item>
</channel>
</rss>
Advantages: Provides real-time updates to users, ideal for dynamic content like news
or blogs.
Disadvantages: Less useful for static content, requires an RSS reader or aggregator
for users to access the updates.
JSON (JavaScript Object Notation): A lightweight data-interchange format that is easy for
humans and machines to read and write. It is commonly used for web data exchange.
Example:
{
"name": "John",
"age": 25,
"state": "New Jersey"
}
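A minimal sketch of reading and writing such a record with Python's built-in json module:

import json

# Parse a JSON string into a Python dictionary
record = json.loads('{"name": "John", "age": 25, "state": "New Jersey"}')
print(record["age"])                       # 25

# Serialize a Python dictionary back to a JSON string
print(json.dumps({"name": "John", "age": 26}, indent=2))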
Data Pre-processing
Data in the real world is often dirty; that is, it is in need of being cleaned up before it can be
used for a desired purpose. This is often called data pre-processing. What makes data “dirty”?
Here are some of the factors that indicate that data is not clean or ready to process:
• Incomplete. When some of the attribute values are lacking, certain attributes of interest are
lacking, or attributes contain only aggregate data.
• Noisy. When data contains errors or outliers. For example, some of the data points in a
dataset may contain extreme values that can severely affect the dataset’s range.
• Inconsistent. Data contains discrepancies in codes or names. For example, if the "Name"
column in employee registration records contains values other than alphabetical letters, or
if records do not start with a capital letter, discrepancies are present.
1. Data Cleaning
Data cleaning involves preparing data by identifying and correcting or removing errors,
inconsistencies, or irrelevant information to improve its quality for analysis. There are
various methods for cleaning data, depending on the nature of the issues at hand.
a. Data Munging
Data munging, also known as data wrangling, is the process of converting data into a format
that is easier for computers to process and analyze. Often, raw data is unorganized or poorly
formatted, making it difficult to work with. The transformation process can be done
manually, automatically, or semi-automatically, depending on the complexity of the task.
b. Handling Missing Data
Missing data occurs when some values are absent, which can be due to various reasons such
as data collection issues, system errors, or incomplete data entry. For example, some
customers may not have home phone numbers, or the data collection process may have
missed certain fields.
c. Smoothing Noisy Data
Noisy data refers to corrupted or erroneous values caused by issues such as faulty data
collection instruments or technology limitations. This can result in inconsistent or imprecise
data, which can skew results.
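A brief sketch of handling both issues with pandas (the column names, values, and chosen strategies are illustrative assumptions):

# Illustrative sketch: filling missing values and capping an extreme (noisy) value.
# Assumes pandas and numpy are installed.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],                      # one missing value
    "income": [40_000, 52_000, 48_000, 1_000_000, np.nan],   # one extreme value, one missing
})

# Missing data: fill numeric gaps with the column median (one of several possible strategies)
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Noisy data: cap extreme values at the 95th percentile to limit their influence
df["income"] = df["income"].clip(upper=df["income"].quantile(0.95))

print(df)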
2. Data Integration
Data integration involves combining data from various sources to create a unified and
cohesive dataset. This process allows for more efficient and effective data analysis by
consolidating information into a single, consistent format.
1. Combine Data from Multiple Sources: The first step is to merge data from different
sources, such as databases or files, into a single storage system (e.g., one database or a
unified file).
2. Schema Integration: This involves combining the metadata from different sources,
aligning the structure and meaning of the data across multiple datasets to ensure
consistency.
3. Detect and Resolve Data Value Conflicts: Conflicts may arise when data from
different sources represent the same real-world entity with different attributes or
values. Examples include:
o Attribute Conflicts: Different sources may use different attributes to describe
the same entity.
o Unit Conflicts: Different sources may use different units, such as metric vs.
British units.
4. Address Redundant Data: Redundancy can occur when integrating data from
multiple sources. Common cases include:
o Naming Differences: The same attribute may have different names across
sources.
o Derived Attributes: Some attributes may be calculated in one table (e.g.,
annual revenue) but stored as a separate field in another.
o Detection via Analysis: Techniques such as correlation analysis may help
identify and handle redundant data (see the sketch following this list).
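A small sketch of these steps with pandas (the two sources, column names, and the pound-to-kilogram conversion are assumptions made for illustration):

# Illustrative sketch: merging two hypothetical sources and resolving naming/unit conflicts.
import pandas as pd

# Source A records weight in kilograms; source B calls the same attribute "wt_lbs" and uses pounds.
source_a = pd.DataFrame({"customer_id": [1, 2, 3], "weight_kg": [70.0, 82.5, 60.0]})
source_b = pd.DataFrame({"customer_id": [1, 2, 3], "wt_lbs": [154.3, 181.9, 132.3],
                         "annual_income": [40_000, 55_000, 38_000]})

# Resolve the unit conflict (pounds -> kilograms) and the naming conflict (rename the column)
source_b["weight_kg_b"] = source_b["wt_lbs"] * 0.4536

# Schema integration: merge the two sources on their shared key
merged = source_a.merge(source_b[["customer_id", "weight_kg_b", "annual_income"]],
                        on="customer_id")

# Redundancy detection: a correlation near 1 suggests the two weight columns duplicate each other
print(merged["weight_kg"].corr(merged["weight_kg_b"]))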
3. Data Transformation
Data must be transformed so it is consistent and readable (by a system). The following five
processes may be used for data transformation. For the time being, do not worry if these seem
too abstract. We will revisit some of them in the next section as we work through an example
of data pre-processing.
One of these processes is normalization, in which values are scaled to fall within a small,
specified range. Some of the techniques used for accomplishing normalization (briefly
sketched below, but not covered in detail here) are:
a. Min–max normalization.
b. Z-score normalization.
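A brief sketch of both techniques on a small made-up list of values:

# Illustrative sketch of min–max and z-score normalization (numpy assumed installed).
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min–max normalization: rescale the values to the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: center at 0 with unit standard deviation
z_scores = (values - values.mean()) / values.std()

print(min_max)     # [0.   0.25 0.5  0.75 1.  ]
print(z_scores)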
4. Data Reduction
Data reduction produces a reduced representation of the dataset that is much smaller in
volume but still fulfills the task requirements. Essentially, it reduces the dimensionality of the
data while retaining the relevant information needed for the analysis.
1. Data Cube Aggregation: Data is aggregated along one or more dimensions (for
example, daily records rolled up into monthly summaries), producing a smaller dataset
that still supports the analysis task.
2. Dimensionality Reduction: Unlike data cube aggregation, dimensionality reduction
aims to reduce data by analyzing its structure. Each dimension or column in a dataset
is treated as a "feature," and the goal is to identify which features can be removed or
combined. This process helps to eliminate redundancy and create composite features
that better represent the data. Common strategies for dimensionality reduction (one of
which, PCA, is sketched after this list) include:
o Sampling
o Clustering
o Principal Component Analysis (PCA)
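As a sketch of one of these strategies, PCA with scikit-learn on randomly generated data (the data and the choice of two components are illustrative assumptions):

# Illustrative sketch: reducing four correlated features to two principal components.
# Assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Build four features, two of which are noisy copies of the others (i.e., redundant)
X = np.hstack([base, base + rng.normal(scale=0.05, size=(100, 2))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)             # 100 rows, now only 2 composite features

print(X.shape, "->", X_reduced.shape)        # (100, 4) -> (100, 2)
print(pca.explained_variance_ratio_)         # most variance is captured by the 2 components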
5. Data Discretization
Discretization is the process of converting continuous data into discrete, more manageable
parts. This is particularly useful when dealing with numerical data that is difficult to analyze
in its raw form. The main goal is to simplify the data without losing its essential meaning,
and it can also be seen as a form of data reduction.
Achieving Discretization
Discretization is typically done by dividing the range of continuous attributes into intervals.
For example:
Temperature could be split into categories such as cold, moderate, and hot.
Stock prices could be categorized as above or below market valuation.
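A short sketch of this kind of binning with pandas.cut (the bin edges and labels are illustrative choices):

# Illustrative sketch: binning continuous temperatures (°F) into discrete categories.
import pandas as pd

temps = pd.Series([31, 45, 62, 75, 88, 96])
categories = pd.cut(temps,
                    bins=[-float("inf"), 50, 77, float("inf")],
                    labels=["cold", "moderate", "hot"])
print(categories.tolist())   # ['cold', 'cold', 'moderate', 'moderate', 'hot', 'hot']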