Module-1:
Introduction to Data Science
Introduction: Definition of Data Science- Big Data and Data Science hype – and
getting past the hype - Datafication - Current landscape of perspectives -
Statistical Inference - Populations and samples – Statistical modeling,
probability distributions, fitting a model – Over fitting. Basics of R:
Introduction, R- Environment Setup, Programming with R, Basic Data Types.
Data Science is the combination of: statistics, mathematics, programming, and problem-
solving; capturing data in ingenious ways; the ability to look at things differently; and the
activity of cleansing, preparing, and aligning data. This umbrella term includes various
techniques that are used when extracting insights and information from data.
Big Data refers to significant volumes of data that cannot be processed effectively with the
traditional applications that are currently used. The processing of big data begins with raw
data that isn’t aggregated and is most often impossible to store in the memory of a single
computer.
Data Analytics is the science of examining raw data to reach certain conclusions. Data
analytics involves applying an algorithmic or mechanical process to derive insights and
running through several data sets to look for meaningful correlations. It is used in several
industries, which enables organizations and data analytics companies to make more informed
decisions, as well as verify and disprove existing theories or models. The focus of data
analytics lies in inference, which is the process of deriving conclusions that are solely based
on what the researcher already knows.
Big Data and Data Science Hype
Why does the hype around Big Data and data science deserve some skepticism?
1. There's a lack of definitions around the most basic terminology. Why do people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous, they're well-nigh meaningless.
2. There’s a distinct lack of respect for the researchers in academia and industry labs who
have been working on this kind of stuff for years, and whose work is based on decades
(in some cases, centuries) of work by statisticians, computer scientists,
mathematicians, engineers, and scientists of all types. From the way the media
describes it, machine learning algorithms were just invented last week and data was
never “big” until Google came along. This is simply not the case. Many of the methods
and techniques we’re using—and the challenges we’re facing now—are part of the
evolution of everything that’s come before. This doesn’t mean that there’s not new and
exciting stuff going on, but we think it’s important to show some basic respect for
everything that came before.
3. The hype is crazy—people throw around tired phrases straight out of the height of the
pre-financial crisis era like “Masters of the Universe” to describe data scientists, and
that doesn’t bode well. In general, hype masks reality and increases the noise-to-signal
ratio. The longer the hype goes on, the more many of us will get turned off by it, and
the harder it will be to see what’s good underneath it all, if anything.
4. Statisticians already feel that they are studying and working on the “Science of Data.”
That’s their bread and butter. Maybe you, dear reader, are not a statistician and don’t
care, but imagine that for the statistician, this feels a little bit like how identity theft
might feel for you. Although we will make the case that data science is not just a
rebranding of statistics or machine learning but rather a field unto itself, the media
often describes data science as if it's simply statistics or machine learning in the
context of the tech industry.
5. People have said to us, “Anything that has to call itself a science isn’t.” Although there
might be truth in there, that doesn’t mean that the term “data science” itself represents
nothing, but of course what it represents may not be science but more of a craft.
Why Now?
We have massive amounts of data about many aspects of our lives, and, simultaneously, an
abundance of inexpensive computing power. Shopping, communicating, reading news,
listening to music, searching for information, expressing our opinions—all this is being
tracked online, as most people know.
What people might not know is that the “datafication” of our offline behavior has
started as well, mirroring the online data collection revolution (more on this later).
It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals,
bioinformatics, social welfare, government, education, retail, and the list goes on. There is a
growing influence of data in most sectors and most industries. In some cases, the amount of
data collected might be enough to be considered “big data”; in other cases, it’s not.
But it’s not only the massiveness that makes all this new data interesting (or poses
challenges). It’s that the data itself, often in real time, becomes the building blocks of data
products. On the Internet, this means Amazon recommendation systems, friend
recommendations on Facebook, film and music recommendations, and so on. In finance, this
means credit ratings, trading algorithms, and models. In education, this is starting to mean
dynamic personalized learning and assessments coming out of places like Knewton and Khan
Academy. In government, this means policies based on data.
We’re witnessing the beginning of a massive, culturally saturated feedback loop
where our behavior changes the product and the product changes our behavior. Technology
makes this possible: infrastructure for large-scale data processing, increased memory, and
bandwidth, as well as a cultural acceptance of technology in the fabric of our lives. This
wasn’t true a decade ago.
Considering the impact of this feedback loop, we should start thinking seriously about
how it's being conducted, along with the ethical and technical responsibilities of the people
involved in the process.
Datafication
An article titled "The Rise of Big Data" discusses the concept of datafication; its
example is how we quantify friendships with "likes": it's the way everything we do, online or
otherwise, ends up recorded for later examination in someone's data storage units. Or maybe
multiple storage units, and maybe also for sale.
The authors define datafication as a process of "taking all aspects of life and turning them into data."
As examples, they mention that “Google’s augmented-reality glasses datafy the gaze. Twitter
datafies stray thoughts. LinkedIn datafies professional networks.”
Datafication is an interesting concept and led us to consider its importance with
respect to people’s intentions about sharing their own data.
We are being datafied, or rather our actions are, and when we “like” someone or
something online, we are intending to be datafied, or at least we should expect to be. But
when we merely browse the Web, we are unintentionally, or at least passively, being datafied
through cookies that we might or might not be aware of. And when we walk around in a
store, or even on the street, we are being datafied in a completely unintentional way, via
sensors, cameras, or Google Glass.
The Current Landscape of Perspectives
So what is data science? One well-known answer comes from Mike Driscoll: data science is
not merely hacking, because when hackers finish debugging their Bash one-liners and Pig
scripts, few of them care about non-Euclidean distance metrics. And data science is not
merely statistics, because when statisticians finish theorizing the perfect model, few could
read a tab-delimited file into R if their job depended on it.
Data science is the civil engineering of data. Its acolytes possess a practical
knowledge of tools and materials, coupled with a theoretical understanding of what’s
possible. Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010,
shown in Figure 1-1.
The job title "data scientist" itself was coined around 2008 at LinkedIn and Facebook, where
a hybrid set of skills was needed to make sense of user data. Once this became a pattern, it
deserved a name. And once it got a name, everyone and their mother wanted to be one.
Both LinkedIn and Facebook are social network companies. Oftentimes a description or
definition of a data scientist includes hybrid statistician, software engineer, and social scientist.
This made sense in the context of companies where the product was a social product and still
makes sense when we’re dealing with human or user behavior.
Figure 1-2. Rachel’s data science profile, which she created to illustrate trying to
visualize oneself as a data scientist; she wanted students and guest lecturers to “riff” on
this—to add buckets or remove skills, use a different scale or visualization method, and
think about the drawbacks of self-reporting.
Students in the class sketched their own data science profiles on index cards; we taped the
index cards to the blackboard and got to see how everyone else thought of themselves. There
was quite a bit of variation, which is cool—lots of people in the class were coming from
social sciences, for example. Where is your data science profile at the moment, and where
would you like it to be in a few months, or years?
As we mentioned earlier, a data science team works best when different skills (profiles) are
represented across different people, because nobody is good at everything. It makes us
wonder if it might be more worthwhile to define a “data science team”—as shown in Figure
1-3— than to define a data scientist.
Figure 1-3. Data science team profiles can be constructed from data scientist profiles;
there should be alignment between the data science team profile and the profile of the
data problems they try to solve.
Just for comparison, check out what Harlan Harris recently did related to the field of data
science: he took a survey and used clustering to define subfields of data science, which gave
rise to Figure 1-4.
Figure 1-4. Harlan Harris’s clustering and visualization of subfields of data science
from Analyzing the Analyzers (O’Reilly) by Harlan Harris, Sean Murphy, and Marck
Vaisman based on a survey of several hundred data science practitioners in mid-2012.
In Academia
The reality is that currently, no one calls themselves a data scientist in academia, except to
take on a secondary title for the sake of being a part of a “data science institute” at a
university, or for applying for a grant that supplies money for data science research.
Instead, let’s ask a related question: who in academia plans to become a data
scientist? There were 60 students in the Intro to Data Science class at Columbia. When
Rachel proposed the course, she assumed the makeup of the students would mainly be
statisticians, applied mathematicians, and computer scientists. Actually, though, it ended up
being those people plus sociologists, journalists, political scientists, biomedical informatics
students, students from NYC government agencies and nonprofits related to social welfare,
someone from the architecture school, others from environmental engineering, pure
mathematicians, business marketing students, and students who already worked as data
scientists. They were all interested in figuring out ways to solve important problems, often of
social value, with data.
For the term “data science” to catch on in academia at the level of the faculty, and as a
primary title, the research area needs to be more formally defined. Note there is already a rich
set of problems that could translate into many PhD theses.
Here’s a stab at what this could look like: an academic data scientist is a scientist,
trained in anything from social science to biology, who works with large amounts of data,
and must grapple with computational problems posed by the structure, size, messiness, and
the complexity and nature of the data, while simultaneously solving a real-world problem.
The case for articulating it like this is as follows: across academic disciplines, the
computational and deep data problems have major commonalities. If researchers across
departments join forces, they can solve multiple real-world problems from different domains.
In Industry
What do data scientists look like in industry? It depends on the level of seniority and whether
you’re talking about the Internet/online industry in particular. The role of data scientist need
not be exclusive to the tech world, but that’s where the term originated; so for the purposes of
the conversation, let us say what it means there.
A chief data scientist should be setting the data strategy of the company, which
involves a variety of things: setting everything up from the engineering and infrastructure for
collecting data and logging, to privacy concerns, to deciding what data will be user-facing,
how data is going to be used to make decisions, and how it’s going to be built back into the
product. She should manage a team of engineers, scientists, and analysts and should
communicate with leadership across the company, including the CEO, CTO, and product
leadership.
She’ll also be concerned with patenting innovative solutions and setting research
goals. More generally, a data scientist is someone who knows how to extract meaning from
and interpret data, which requires both tools and methods from statistics and machine
learning, as well as being human. She spends a lot of time in the process of collecting,
cleaning, and munging data, because data is never clean. This process requires persistence,
statistics, and software engineering skills—skills that are also necessary for understanding
biases in the data, and for debugging logging output from code.
Once she gets the data into shape, a crucial part is exploratory data analysis, which
combines visualization and data sense. She’ll find patterns, build models, and algorithms—
some with the intention of understanding product usage and the overall health of the product,
and others to serve as prototypes that ultimately get baked back into the product. She may
design experiments, and she is a critical part of data-driven decision making. She'll
communicate with team members, engineers, and leadership in clear language and with data
visualizations so that even if her colleagues are not immersed in the data themselves, they
will understand the implications.
Statistical Inference
The first thing every data scientist should do once they've gotten data in hand for any
data-related project is exploratory data analysis (EDA).
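As a minimal sketch of what that first EDA pass might look like in R (using the built-in mtcars data set so the example is self-contained; the choice of variables is arbitrary):

# A first look at a data set: structure, summaries, and a couple of plots
data(mtcars)                          # built-in example data set
str(mtcars)                           # variable names and types
summary(mtcars)                       # min, max, quartiles, and means
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")    # one variable's distribution
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "mpg")   # relationship between two variables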
When you're developing your skill set as a data scientist, certain foundational pieces need to
be in place first: statistics, linear algebra, and some programming. Even once you have those
pieces, part of the challenge is that you will be developing several skill sets in parallel—data
preparation and munging, modeling, coding, visualization, and communication—that are
interdependent. As we progress, we need to start somewhere, and we will begin by getting
grounded in statistical inference. It's always helpful to go back to fundamentals and remind
ourselves what statistical inference and statistical thinking are all about. Further still, in the
age of Big Data, classical statistical methods need to be revisited and reimagined in new
contexts.
The world we live in is complex, random, and uncertain. At the same time, it’s one big data-
generating machine. As we commute to work on subways and in cars, as our blood moves
through our bodies, as we’re shopping, emailing, procrastinating at work by browsing the
Internet and watching the stock market, as we’re building things, eating things, talking to our
friends and family about things, while factories are producing products, this all at least
potentially produces data.
Data represents the traces of the real-world processes, and exactly which traces we gather are
decided by our data collection or sampling method. After separating the process from the data
collection, we can see clearly that there are two sources of randomness and uncertainty.
Probability theory offers a mathematical model for both uncertainty and randomness.
A real-world process is described by one or more variables, and a model of the process is
defined by a function of those variables:
Model = f(x), or f(x, y, z) for a multivariable process.
At least initially, the function is unknown and the model is unclear. Typically, our task is to
come up with the model, given the data.
The process of going from the world to the data, and then from the data back to the world, is
the field of statistical inference.
More precisely, statistical inference is the discipline that concerns itself with the development
of procedures, methods, and theorems that allow us to extract meaning and information from
data that has been generated by stochastic (random) processes.
These days we often work with new kinds of data, not just the traditional numerical and
categorical kinds, for example:
• Network data
• Sensor data
• Images
These new kinds of data require us to think more carefully about what sampling means in
these contexts.
Modeling
A model is our attempt to understand and represent the nature of reality through a particular
lens, be it architectural, biological, or mathematical.
y = f (x)
A model is an artificial construction where all extraneous detail has been removed or
abstracted. Attention must always be paid to these abstracted details after a model has been
analyzed to see what might have been overlooked.
Statistical Modeling
Before you get too involved with the data and start coding, it’s useful to draw a picture of
what you think the underlying process might be with your model. What comes first? What
influences what? What causes what? What’s a test of that?
But different people think in different ways. Some prefer to express these kinds of
relationships in terms of math.
The mathematical expressions will be general enough that they have to include parameters,
but the values of these parameters are not yet known.
In mathematical expressions, the convention is to use Greek letters for parameters and Latin
letters for data. So, for example, if you have two columns of data, x and y, and you think
there’s a linear relationship, you’d write down
y = β0 + β1 x.
You don't know what β0 and β1 are in terms of actual numbers yet, so they're the parameters.
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows,
showing how things affect other things or what happens over time. This gives them an
abstract picture of the relationships before choosing equations to express them.
So try writing down a linear function (more on that in the next unit). When you write it down,
you force yourself to think: does this make any sense? If not, why? What would make more
sense? You start simply and keep building it up in complexity, making assumptions, and
writing your assumptions down. You can use full-blown sentences if it helps—
e.g., "I assume that my users naturally cluster into about five groups, because when I hear the
sales rep talk about them, she has about five different types of people she talks about." Then
take your words and try to express them as equations and code. Some of the building
blocks of these models are probability distributions.
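For illustration only, here is a small R sketch of turning such an assumption into code: it simulates data from the assumed linear relationship y = β0 + β1 x with normally distributed noise (the parameter values 2 and 0.5 are made up, not taken from any real data):

set.seed(42)
n     <- 100
beta0 <- 2                        # assumed intercept (illustrative value)
beta1 <- 0.5                      # assumed slope (illustrative value)
x     <- runif(n, min = 0, max = 10)
y     <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = 1)   # linear signal plus noise
plot(x, y)                        # does the assumed linear shape look plausible?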
Probability distributions
Probability distributions are the foundation of statistical models. When we get to linear
regression and Naive Bayes, you will see how this happens in practice. One can take multiple
semesters of courses on probability theory, and so it’s a tall challenge to condense it down for
you in a small section.
Back in the day, before computers, scientists observed real-world phenomena, took
measurements, and noticed that certain mathematical shapes kept reappearing. The classical
example is the height of humans, which follows a normal distribution: a bell-shaped curve,
also called a Gaussian distribution, named after Gauss.
Other common shapes have been named after their observers as well (e.g., the Poisson
distribution and the Weibull distribution), while other shapes such as Gamma distributions or
exponential distributions are named after associated mathematical objects.
Natural processes tend to generate measurements whose empirical shape could be
approximated by mathematical functions with a few parameters that could be estimated from
the data.
Not all processes generate data that looks like a named distribution, but many do. We can use
these functions as building blocks of our models. It’s beyond the scope of the book to go into
each of the distributions in detail, but we provide them in Figure 1.4 as an illustration of the
various common shapes, and to remind you that they only have names because someone
observed them enough times to think they deserved names. There is actually an infinite
number of possible distributions.
For example, the normal distribution with mean μ and standard deviation σ has the density
p(x) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) ).
The parameter μ is the mean and median and controls where the distribution is centered
(because this is a symmetric distribution), and the parameter σ controls how spread out the
distribution is. This is the general functional form, but for specific real-world phenomena
these parameters take actual numerical values, which we can estimate from the data.
The normal, uniform, Cauchy, t, F, chi-square, exponential, Weibull, and lognormal
distributions, among others, are described by continuous density functions. A function p(x)
can serve as the probability density of a random variable x if it maps every value of x to a
non-negative real number and the area under the curve (the integral of p(x) over its whole
range) equals 1, which is what allows it to be interpreted as probability.
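As a small illustration in R (the values μ = 170 and σ = 10 are invented, loosely evoking human heights in centimeters): draw a sample from a normal distribution, compare it with the theoretical density, and confirm that the density integrates to 1.

mu    <- 170                                    # illustrative mean
sigma <- 10                                     # illustrative standard deviation
heights <- rnorm(1000, mean = mu, sd = sigma)   # sample from N(mu, sigma)
hist(heights, probability = TRUE, breaks = 30)  # empirical, bell-shaped histogram
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE)   # theoretical density on top
integrate(dnorm, -Inf, Inf, mean = mu, sd = sigma)   # area under the curve: 1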
Fitting a model
Fitting a model means that you estimate the parameters of the model using the observed data.
You are using your data as evidence to help approximate the real-world mathematical process
that generated the data. Fitting the model often involves optimization methods and
algorithms, such as maximum likelihood estimation, to help get the parameters.
In fact, when you estimate the parameters, they are actually estimators, meaning they
themselves are functions of the data. Once you fit the model, you actually can write it as
y = 7.2+4.5x,
for example, which means that your best guess is that this equation or functional form
expresses the relationship between your two variables, based on your assumption that the data
followed a linear pattern.
Fitting the model is when you start actually coding: your code will read in the data, and you’ll
specify the functional form that you wrote down on the piece of paper. Then R or Python will
use built-in optimization methods to give you the most likely values of the parameters given
the data.
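A minimal sketch of model fitting in R, assuming simulated data whose true relationship happens to be y = 7.2 + 4.5x plus noise; lm() estimates the parameters by least squares:

set.seed(1)
x <- runif(50, 0, 10)
y <- 7.2 + 4.5 * x + rnorm(50, sd = 2)   # pretend this is the observed data
fit <- lm(y ~ x)                          # specify the functional form y = b0 + b1*x
coef(fit)                                 # the estimators of beta0 and beta1

The estimates will be close to, but not exactly, 7.2 and 4.5, because they are functions of the particular (noisy) sample.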
Overfitting
Overfitting occurs when a statistical model fits its training data too closely, capturing noise
rather than the underlying pattern. When this happens, the model cannot perform accurately
on unseen data, defeating its purpose. In other words, overfitting means that you used a
dataset to estimate the parameters of your model, but your model isn't that good at capturing
reality beyond your sampled data.
You might know this because you have tried to use it to predict labels for another set of data
that you didn’t use to fit the model, and it doesn’t do a good job, as measured by an
evaluation metric such as accuracy.
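A rough sketch of how this shows up in practice, using nothing beyond base R: fit a simple and an overly flexible model on a training subset and compare their errors on held-out data (the degree-12 polynomial is an arbitrary choice, picked only to exaggerate the effect):

set.seed(7)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 1)             # truly linear process plus noise
d <- data.frame(x = x, y = y)
train <- sample(30, 20)                          # 20 points for fitting, 10 held out
simple  <- lm(y ~ x,           data = d[train, ])
complex <- lm(y ~ poly(x, 12), data = d[train, ])    # far too flexible for 20 points
rmse <- function(m, newdata) sqrt(mean((newdata$y - predict(m, newdata))^2))
rmse(simple,  d[train, ]);  rmse(complex, d[train, ])   # complex model wins on training data
rmse(simple,  d[-train, ]); rmse(complex, d[-train, ])  # but typically loses on unseen data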
Basics of R:
R programming introduction: R is a scripting language for statistical data manipulation,
statistical analysis, graphics representation and reporting. It was inspired by, and is mostly
compatible with, the statistical language S developed by AT&T. R was created by Ross Ihaka
and Robert Gentleman at the University of Auckland, New Zealand, and is currently
developed by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and looping as
well as modular programming using functions. R allows integration with the procedures
written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.
Features of R:
R is a programming language and software environment for statistical analysis, graphics
representation and reporting.
The following are the important features of R:
• R is a well-developed, simple, and effective programming language which includes
conditionals, loops, user-defined recursive functions, and input and output facilities.
• R has an effective data handling and storage facility.
• It incorporates features found in object-oriented and functional programming languages.
• R provides a suite of operators for calculations on arrays, lists, vectors, and matrices.
• R provides a large, coherent, and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on the
computer or in printed output.
• R provides basic data types including numeric, integer, character, logical, and complex.
• R provides powerful data structures, including vectors, matrices, lists, arrays, data
frames, and classes.
• Because R is open source software, it's easy to get help from the user community.
Also, a lot of new functions are contributed by users, many of whom are prominent
statisticians.
In conclusion, R is one of the world's most widely used statistical programming languages.
R Command Prompt: Once you have the R environment set up, it's easy to start the R
command prompt by typing the following command at your operating system prompt –
C:\Users\CSELAB>R
This will launch R interpreter and you will get a prompt > where you can start typing your
program as
follows −
> myString <- "WELCOME R PROGRAMMING"
> print(myString)
[1] "WELCOME R PROGRAMMING"
Here the first statement defines a string variable myString and assigns it the string
"WELCOME R PROGRAMMING"; the next statement uses print() to display the value
stored in the variable myString.
Object orientation: can be explained by example. Consider statistical regression. When you
perform a regression analysis with other statistical packages, such as SAS or SPSS, you get a
mountain of output on the screen. By contrast, if you call the lm() regression function in R,
the function returns an object containing all the results—the estimated coefficients, their
standard errors, residuals, and so on. You then pick and choose, programmatically, which
parts of that object to extract. You will see that R‘s approach makes programming much
easier, partly because it offers a certain uniformity of access to data.
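A small sketch of this idea, using R's built-in mtcars data set (the choice of variables is arbitrary):

fit <- lm(mpg ~ wt, data = mtcars)    # regress fuel efficiency on car weight
class(fit)                            # "lm": the result is an object, not printed output
coef(fit)                             # pick out just the estimated coefficients
head(residuals(fit))                  # or just the residuals
summary(fit)$coefficients             # coefficient table including standard errors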
Functional Programming:
As is typical in functional programming languages, a common theme in R programming is
avoidance of explicit iteration. Instead of coding loops, you exploit R‘s functional features,
which let you express iterative behavior implicitly. This can lead to code that executes much
more efficiently, and it can make a huge timing difference when running R on large data sets.
The functional programming nature of the R language offers many advantages:
• Clearer, more compact code
• Potentially much faster execution speed
• Less debugging, because the code is simpler
• Easier transition to parallel programming.
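A small sketch of the contrast, using a toy matrix of random numbers: the explicit loop and the functional versions compute the same column means, but the latter are more compact (and colMeans() is a vectorized built-in that is typically much faster):

m <- matrix(rnorm(1e6), ncol = 10)

# Explicit loop
out <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  out[j] <- mean(m[, j])
}

# Functional style: the iteration is expressed implicitly
out2 <- apply(m, 2, mean)
out3 <- colMeans(m)                  # vectorized built-in

all.equal(out, out2)                 # TRUE: same result, less code
all.equal(out2, out3)                # TRUE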
Comments: Comments are like helping text in your R program; they are ignored by the
interpreter while executing your actual program. A single-line comment is written using # at
the beginning of the statement, as follows:
# My first program in R Programming
R does not support multi-line comments, but you can use a trick such as the following −
if(FALSE){
"This is a demo for multi-line comments and it should be put inside either a single
OR double quote"
}
myString <- "Hello, World!"
print(myString)
Variables in R
Variables are used to store data whose value can be changed according to our need. The
unique name given to a variable (as well as to functions and objects) is called an identifier.
Valid identifiers in R
total, Sum, .fine.with.dot, this_is_acceptable, Number5
Invalid identifiers in R
tot@l, 5um, _fine, TRUE, .0ne
Best Practices
Earlier versions of R used the underscore (_) as an assignment operator, so the period (.) was
used extensively in variable names having multiple words. Current versions of R support the
underscore as a valid identifier character, but it is good practice to use the period as a word
separator. For example, a.variable.name is preferred over a_variable_name; alternatively, we
could use camel case, as in aVariableName.
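A few illustrative assignments showing these conventions (all names here are made up):

total <- 110                    # the most common assignment operator
Sum = 25                        # '=' also assigns at the top level
200 -> upper.limit              # rightward assignment, with a period-separated name
aVariableName <- "camel case"   # camel-case alternative
print(total + Sum)              # [1] 135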
Constants in R
Constants, as the name suggests, are entities whose value cannot be altered. Basic types of
constant are numeric constants and character constants.
Numeric Constants
All numbers fall under this category. They can be of type integer, double or complex. This
can be checked with the typeof() function. Numeric constants followed by L are regarded as
integer and those followed by i are regarded as complex.
>typeof(5)
[1] "double"
>typeof(5L)
[1] "integer"
>typeof(5i)
[1] "complex"
Character Constants: Character constants can be represented using either single quotes (')
or double quotes (") as delimiters.
> 'example'
[1] "example"
>typeof("5")
[1] "character"
Built-in Constants
Some of the built-in constants defined in R, along with their values, are shown below.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"
 [7] "July"      "August"    "September" "October"   "November"  "December"
> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
But it is not good to rely on these, as they are implemented as variables whose values can be
changed.
> pi
[1] 3.141593
> pi <- 56
> pi
[1] 56
Characteristics of Big Data
Big Data is commonly described in terms of three characteristics: volume, velocity, and
variety. It also comes from several sources and in many different types, and it is not always
easy to combine all of the tools you need to work with these different styles of data.
1. Volume
A run-of-the-mill PC may have had 10 gigabytes of storage in 2000. Today, Facebook is
regularly ingesting 500 terabytes of new data, and a Boeing 737 can generate 240 terabytes
of flight data on a single trip across the US. Smartphones, and the sensors embedded in
everyday objects, continuously produce billions of new, constantly refreshed data streams
containing human, environmental, and other information, including video.
2. Velocity
• Clickstreams and ad impressions capture consumer activity at millions of events per second.
• High-frequency stock trading algorithms reflect market movements within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Networks and sensors produce huge amounts of real-time log data.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
3. Variety
• Big Data is not just numbers, dates, and strings. It is also geospatial data, 3D data, audio
and video, and unstructured text, including log files and web documents.
• Traditional database systems were built to handle smaller volumes of structured data,
with fewer changes and a predictable, consistent structure.
• Big Data analysis involves data of many different kinds.
8) HPCC
9) Storm
10) Apache SAMOA
11) Talend
12) Rapidminer
13) Qubole
14) Tableau
15) R
Module-1: Questions