Module-1:
Introduction to Data Science
Introduction: Definition of Data Science- Big Data and Data Science hype – and
getting past the hype - Datafication - Current landscape of perspectives -
Statistical Inference - Populations and samples – Statistical modeling,
probability distributions, fitting a model – Over fitting. Basics of R:
Introduction, R- Environment Setup, Programming with R, Basic Data Types.
Data Science is the combination of: statistics, mathematics, programming, and problem-
solving; capturing data in ingenious ways; the ability to look at things differently; and the
activity of cleansing, preparing, and aligning data. This umbrella term includes various
techniques that are used when extracting insights and information from data.
Big Data refers to significant volumes of data that cannot be processed effectively with the
traditional applications that are currently used. The processing of big data begins with raw
data that isn’t aggregated and is most often impossible to store in the memory of a single
computer.
Data Analytics is the science of examining raw data to reach certain conclusions. Data
analytics involves applying an algorithmic or mechanical process to derive insights and
running through several data sets to look for meaningful correlations. It is used in several
industries, which enables organizations and data analytics companies to make more informed
decisions, as well as verify and disprove existing theories or models. The focus of data
analytics lies in inference, which is the process of deriving conclusions that are solely based
on what the researcher already knows.
Big Data and Data Science Hype
Why does the hype around Big Data and data science deserve some skepticism?
1. There's a lack of definitions around the most basic terminology. Why do people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous, they're well-nigh meaningless.
2. There’s a distinct lack of respect for the researchers in academia and industry labs who
have been working on this kind of stuff for years, and whose work is based on decades
(in some cases, centuries) of work by statisticians, computer scientists,
mathematicians, engineers, and scientists of all types. From the way the media
describes it, machine learning algorithms were just invented last week and data was
never “big” until Google came along. This is simply not the case. Many of the methods
and techniques we’re using—and the challenges we’re facing now—are part of the
evolution of everything that’s come before. This doesn’t mean that there’s not new and
exciting stuff going on, but we think it’s important to show some basic respect for
everything that came before.
3. The hype is crazy—people throw around tired phrases straight out of the height of the
pre-financial crisis era like “Masters of the Universe” to describe data scientists, and
that doesn’t bode well. In general, hype masks reality and increases the noise-to-signal
ratio. The longer the hype goes on, the more many of us will get turned off by it, and
the harder it will be to see what’s good underneath it all, if anything.
4. Statisticians already feel that they are studying and working on the “Science of Data.”
That’s their bread and butter. Maybe you, dear reader, are not a statistician and don’t
care, but imagine that for the statistician, this feels a little bit like how identity theft
might feel for you. Although we will make the case that data science is not just a
rebranding of statistics or machine learning but rather a field unto itself, the media
often describes data science as if it's simply statistics or machine learning in the
context of the tech industry.
5. People have said to us, “Anything that has to call itself a science isn’t.” Although there
might be truth in there, that doesn’t mean that the term “data science” itself represents
nothing, but of course what it represents may not be science but more of a craft.
Why Now?
We have massive amounts of data about many aspects of our lives, and, simultaneously, an
abundance of inexpensive computing power. Shopping, communicating, reading news,
listening to music, searching for information, expressing our opinions—all this is being
tracked online, as most people know.
What people might not know is that the “datafication” of our offline behavior has
started as well, mirroring the online data collection revolution (more on this later).
It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals,
bioinformatics, social welfare, government, education, retail, and the list goes on. There is a
growing influence of data in most sectors and most industries. In some cases, the amount of
data collected might be enough to be considered “big data”; in other cases, it’s not.
But it’s not only the massiveness that makes all this new data interesting (or poses
challenges). It’s that the data itself, often in real time, becomes the building blocks of data
products. On the Internet, this means Amazon recommendation systems, friend
recommendations on Facebook, film and music recommendations, and so on. In finance, this
means credit ratings, trading algorithms, and models. In education, this is starting to mean
dynamic personalized learning and assessments coming out of places like Knewton and Khan
Academy. In government, this means policies based on data.
We’re witnessing the beginning of a massive, culturally saturated feedback loop
where our behavior changes the product and the product changes our behavior. Technology
makes this possible: infrastructure for large-scale data processing, increased memory, and
bandwidth, as well as a cultural acceptance of technology in the fabric of our lives. This
wasn’t true a decade ago.
Considering the impact of this feedback loop, we should start thinking seriously about
how it's being conducted, along with the ethical and technical responsibilities of the people
involved in the process.
Datafication
An article titled "The Rise of Big Data" discusses the concept of datafication; its
example is how we quantify friendships with "likes": it's the way everything we do, online or
otherwise, ends up recorded for later examination in someone's data storage units. Or maybe
multiple storage units, and maybe also for sale.
The authors define datafication as a process of "taking all aspects of life and turning them into data."
As examples, they mention that “Google’s augmented-reality glasses datafy the gaze. Twitter
datafies stray thoughts. LinkedIn datafies professional networks.”
Datafication is an interesting concept and led us to consider its importance with
respect to people’s intentions about sharing their own data.
We are being datafied, or rather our actions are, and when we “like” someone or
something online, we are intending to be datafied, or at least we should expect to be. But
when we merely browse the Web, we are unintentionally, or at least passively, being datafied
through cookies that we might or might not be aware of. And when we walk around in a
store, or even on the street, we are being datafied in a completely unintentional way, via
sensors, cameras, or Google Glass.
The Current Landscape of Perspectives
So what is data science? One well-known answer comes from Mike Driscoll: data science is
not merely hacking, because when hackers finish debugging their Bash one-liners and Pig
scripts, few of them care about non-Euclidean distance metrics. And data science is not
merely statistics, because when statisticians finish theorizing the perfect model, few could
read a tab-delimited file into R if their job depended on it.
Data science is the civil engineering of data. Its acolytes possess a practical
knowledge of tools and materials, coupled with a theoretical understanding of what’s
possible. Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010,
shown in Figure 1-1.
The job title "data scientist" itself was coined around 2008 at LinkedIn and Facebook, where
a hybrid set of skills was needed to make sense of user data. Once this became a pattern, it
deserved a name. And once it got a name, everyone and their mother wanted to be one.
Both LinkedIn and Facebook are social network companies. Oftentimes a description or
definition of a data scientist includes hybrid statistician, software engineer, and social scientist.
This made sense in the context of companies where the product was a social product and still
makes sense when we’re dealing with human or user behavior.
Figure 1-2. Rachel’s data science profile, which she created to illustrate trying to
visualize oneself as a data scientist; she wanted students and guest lecturers to “riff” on
this—to add buckets or remove skills, use a different scale or visualization method, and
think about the drawbacks of self-reporting.
Students in the class sketched their own data science profiles on index cards; we taped the
index cards to the blackboard and got to see how everyone else thought of themselves. There
was quite a bit of variation, which is cool—lots of people in the class were coming from
social sciences, for example. Where is your data science profile at the moment, and where
would you like it to be in a few months, or years?
As we mentioned earlier, a data science team works best when different skills (profiles) are
represented across different people, because nobody is good at everything. It makes us
wonder if it might be more worthwhile to define a “data science team”—as shown in Figure
1-3— than to define a data scientist.
Figure 1-3. Data science team profiles can be constructed from data scientist profiles;
there should be alignment between the data science team profile and the profile of the
data problems they try to solve.
Just for comparison, check out what Harlan Harris recently did related to the field of data
science: he took a survey and used clustering to define subfields of data science, which gave
rise to Figure 1-4.
Figure 1-4. Harlan Harris’s clustering and visualization of subfields of data science
from Analyzing the Analyzers (O’Reilly) by Harlan Harris, Sean Murphy, and Marck
Vaisman based on a survey of several hundred data science practitioners in mid-2012.
In Academia
The reality is that currently, no one calls themselves a data scientist in academia, except to
take on a secondary title for the sake of being a part of a “data science institute” at a
university, or for applying for a grant that supplies money for data science research.
Instead, let’s ask a related question: who in academia plans to become a data
scientist? There were 60 students in the Intro to Data Science class at Columbia. When
Rachel proposed the course, she assumed the makeup of the students would mainly be
statisticians, applied mathematicians, and computer scientists. Actually, though, it ended up
being those people plus sociologists, journalists, political scientists, biomedical informatics
students, students from NYC government agencies and nonprofits related to social welfare,
someone from the architecture school, others from environmental engineering, pure
mathematicians, business marketing students, and students who already worked as data
scientists. They were all interested in figuring out ways to solve important problems, often of
social value, with data.
For the term “data science” to catch on in academia at the level of the faculty, and as a
primary title, the research area needs to be more formally defined. Note there is already a rich
set of problems that could translate into many PhD theses.
Here’s a stab at what this could look like: an academic data scientist is a scientist,
trained in anything from social science to biology, who works with large amounts of data,
and must grapple with computational problems posed by the structure, size, messiness, and
the complexity and nature of the data, while simultaneously solving a real-world problem.
The case for articulating it like this is as follows: across academic disciplines, the
computational and deep data problems have major commonalities. If researchers across
departments join forces, they can solve multiple real-world problems from different domains.
In Industry
What do data scientists look like in industry? It depends on the level of seniority and whether
you’re talking about the Internet/online industry in particular. The role of data scientist need
not be exclusive to the tech world, but that’s where the term originated; so for the purposes of
the conversation, let us say what it means there.
A chief data scientist should be setting the data strategy of the company, which
involves a variety of things: setting everything up from the engineering and infrastructure for
collecting data and logging, to privacy concerns, to deciding what data will be user-facing,
how data is going to be used to make decisions, and how it’s going to be built back into the
product. She should manage a team of engineers, scientists, and analysts and should
communicate with leadership across the company, including the CEO, CTO, and product
leadership.
She’ll also be concerned with patenting innovative solutions and setting research
goals. More generally, a data scientist is someone who knows how to extract meaning from
and interpret data, which requires both tools and methods from statistics and machine
learning, as well as being human. She spends a lot of time in the process of collecting,
cleaning, and munging data, because data is never clean. This process requires persistence,
statistics, and software engineering skills—skills that are also necessary for understanding
biases in the data, and for debugging logging output from code.
Once she gets the data into shape, a crucial part is exploratory data analysis, which
combines visualization and data sense. She’ll find patterns, build models, and algorithms—
some with the intention of understanding product usage and the overall health of the product,
and others to serve as prototypes that ultimately get baked back into the product. She may
design experiments, and she is a critical part of data-driven decision making. She'll
communicate with team members, engineers, and leadership in clear language and with data
visualizations so that even if her colleagues are not immersed in the data themselves, they
will understand the implications.
Statistical Inference
The first thing every data scientist should do once they've gotten data in hand for any
data-related project is exploratory data analysis (EDA).
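As a minimal sketch of what that first EDA pass might look like in R (using the built-in mtcars data set so the example is self-contained; the choice of variables is arbitrary):

# A first look at a data set: structure, summaries, and a couple of plots
data(mtcars)                          # built-in example data set
str(mtcars)                           # variable names and types
summary(mtcars)                       # min, max, quartiles, and means
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")    # one variable's distribution
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "mpg")   # relationship between two variables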
When you're developing your skill set as a data scientist, certain foundational pieces need to
be in place first: statistics, linear algebra, and some programming. Even once you have those
pieces, part of the challenge is that you will be developing several skill sets in parallel—data
preparation and munging, modeling, coding, visualization, and communication—that are
interdependent. As we progress, we need to start somewhere, and we will begin by getting
grounded in statistical inference. It's always helpful to go back to fundamentals and remind
ourselves what statistical inference and statistical thinking are all about. Further still, in the
age of Big Data, classical statistical methods need to be revisited and reimagined in new
contexts.
The world we live in is complex, random, and uncertain. At the same time, it’s one big data-
generating machine. As we commute to work on subways and in cars, as our blood moves
through our bodies, as we’re shopping, emailing, procrastinating at work by browsing the
Internet and watching the stock market, as we’re building things, eating things, talking to our
friends and family about things, while factories are producing products, this all at least
potentially produces data.
Data represents the traces of the real-world processes, and exactly which traces we gather are
decided by our data collection or sampling method. After separating the process from the data
collection, we can see clearly that there are two sources of randomness and uncertainty.
Probability theory offers a mathematical model for both uncertainty and randomness.
A real-world process is described by one or more variables, and a model of the process is
defined by a function of those variables:
Model = f(x), or f(x, y, z) for a multivariable process.
At least initially, the function is unknown and the model is unclear. Typically, our task is to
come up with the model, given the data.
The process of going from the world to the data, and then from the data back to the world, is
the field of statistical inference.
More precisely, statistical inference is the discipline that concerns itself with the development
of procedures, methods, and theorems that allow us to extract meaning and information from
data that has been generated by stochastic (random) processes.
These days we often work with new kinds of data, not just the traditional numerical and
categorical kinds, for example:
• Network data
• Sensor data
• Images
These new kinds of data require us to think more carefully about what sampling means in
these contexts.
Modeling
A model is our attempt to understand and represent the nature of reality through a particular
lens, be it architectural, biological, or mathematical.
y = f (x)
A model is an artificial construction where all extraneous detail has been removed or
abstracted. Attention must always be paid to these abstracted details after a model has been
analyzed to see what might have been overlooked.
Statistical Modeling
Before you get too involved with the data and start coding, it’s useful to draw a picture of
what you think the underlying process might be with your model. What comes first? What
influences what? What causes what? What’s a test of that?
But different people think in different ways. Some prefer to express these kinds of
relationships in terms of math.
The mathematical expressions will be general enough that they have to include parameters,
but the values of these parameters are not yet known.
In mathematical expressions, the convention is to use Greek letters for parameters and Latin
letters for data. So, for example, if you have two columns of data, x and y, and you think
there’s a linear relationship, you’d write down
y = β0 + β1 x.
You don't know what β0 and β1 are in terms of actual numbers yet, so they're the parameters.
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows,
showing how things affect other things or what happens over time. This gives them an
abstract picture of the relationships before choosing equations to express them.
So try writing down a linear function (more on that in the next unit). When you write it down,
you force yourself to think: does this make any sense? If not, why? What would make more
sense? You start simply and keep building it up in complexity, making assumptions, and
writing your assumptions down. You can use full-blown sentences if it helps—
e.g., "I assume that my users naturally cluster into about five groups, because when I hear the
sales rep talk about them, she has about five different types of people she talks about." Then
take your words and try to express them as equations and code. Some of the building
blocks of these models are probability distributions.
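For illustration only, here is a small R sketch of turning such an assumption into code: it simulates data from the assumed linear relationship y = β0 + β1 x with normally distributed noise (the parameter values 2 and 0.5 are made up, not taken from any real data):

set.seed(42)
n     <- 100
beta0 <- 2                        # assumed intercept (illustrative value)
beta1 <- 0.5                      # assumed slope (illustrative value)
x     <- runif(n, min = 0, max = 10)
y     <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = 1)   # linear signal plus noise
plot(x, y)                        # does the assumed linear shape look plausible?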
Probability distributions
Probability distributions are the foundation of statistical models. When we get to linear
regression and Naive Bayes, you will see how this happens in practice. One can take multiple
semesters of courses on probability theory, and so it’s a tall challenge to condense it down for
you in a small section.
Back in the day, before computers, scientists observed real-world phenomena, took
measurements, and noticed that certain mathematical shapes kept reappearing. The classical
example is the height of humans, which follows a normal distribution: a bell-shaped curve,
also called a Gaussian distribution, named after Gauss.
Other common shapes have been named after their observers as well (e.g., the Poisson
distribution and the Weibull distribution), while other shapes such as Gamma distributions or
exponential distributions are named after associated mathematical objects.
Natural processes tend to generate measurements whose empirical shape could be
approximated by mathematical functions with a few parameters that could be estimated from
the data.
Not all processes generate data that looks like a named distribution, but many do. We can use
these functions as building blocks of our models. It’s beyond the scope of the book to go into
each of the distributions in detail, but we provide them in Figure 1.4 as an illustration of the
various common shapes, and to remind you that they only have names because someone
observed them enough times to think they deserved names. There is actually an infinite
number of possible distributions.
For example, the normal distribution with mean μ and standard deviation σ has the density
p(x) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) ).
The parameter μ is the mean and median and controls where the distribution is centered
(because this is a symmetric distribution), and the parameter σ controls how spread out the
distribution is. This is the general functional form, but for specific real-world phenomena
these parameters take actual numerical values, which we can estimate from the data.
The normal, uniform, Cauchy, t, F, chi-square, exponential, Weibull, and lognormal
distributions, among others, are described by continuous density functions. A function p(x)
can serve as the probability density of a random variable x if it maps every value of x to a
non-negative real number and the area under the curve (the integral of p(x) over its whole
range) equals 1, which is what allows it to be interpreted as probability.
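As a small illustration in R (the values μ = 170 and σ = 10 are invented, loosely evoking human heights in centimeters): draw a sample from a normal distribution, compare it with the theoretical density, and confirm that the density integrates to 1.

mu    <- 170                                    # illustrative mean
sigma <- 10                                     # illustrative standard deviation
heights <- rnorm(1000, mean = mu, sd = sigma)   # sample from N(mu, sigma)
hist(heights, probability = TRUE, breaks = 30)  # empirical, bell-shaped histogram
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE)   # theoretical density on top
integrate(dnorm, -Inf, Inf, mean = mu, sd = sigma)   # area under the curve: 1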
Fitting a model
Fitting a model means that you estimate the parameters of the model using the observed data.
You are using your data as evidence to help approximate the real-world mathematical process
that generated the data. Fitting the model often involves optimization methods and
algorithms, such as maximum likelihood estimation, to help get the parameters.
In fact, when you estimate the parameters, they are actually estimators, meaning they
themselves are functions of the data. Once you fit the model, you actually can write it as
y = 7.2+4.5x,
for example, which means that your best guess is that this equation or functional form
expresses the relationship between your two variables, based on your assumption that the data
followed a linear pattern.
Fitting the model is when you start actually coding: your code will read in the data, and you’ll
specify the functional form that you wrote down on the piece of paper. Then R or Python will
use built-in optimization methods to give you the most likely values of the parameters given
the data.
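A minimal sketch of model fitting in R, assuming simulated data whose true relationship happens to be y = 7.2 + 4.5x plus noise; lm() estimates the parameters by least squares:

set.seed(1)
x <- runif(50, 0, 10)
y <- 7.2 + 4.5 * x + rnorm(50, sd = 2)   # pretend this is the observed data
fit <- lm(y ~ x)                          # specify the functional form y = b0 + b1*x
coef(fit)                                 # the estimators of beta0 and beta1

The estimates will be close to, but not exactly, 7.2 and 4.5, because they are functions of the particular (noisy) sample.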
Overfitting
Overfitting occurs when a statistical model fits its training data too closely, capturing noise
rather than the underlying pattern. When this happens, the model cannot perform accurately
on unseen data, defeating its purpose. In other words, overfitting means that you used a
dataset to estimate the parameters of your model, but your model isn't that good at capturing
reality beyond your sampled data.
You might know this because you have tried to use it to predict labels for another set of data
that you didn’t use to fit the model, and it doesn’t do a good job, as measured by an
evaluation metric such as accuracy.
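A rough sketch of how this shows up in practice, using nothing beyond base R: fit a simple and an overly flexible model on a training subset and compare their errors on held-out data (the degree-12 polynomial is an arbitrary choice, picked only to exaggerate the effect):

set.seed(7)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 1)             # truly linear process plus noise
d <- data.frame(x = x, y = y)
train <- sample(30, 20)                          # 20 points for fitting, 10 held out
simple  <- lm(y ~ x,           data = d[train, ])
complex <- lm(y ~ poly(x, 12), data = d[train, ])    # far too flexible for 20 points
rmse <- function(m, newdata) sqrt(mean((newdata$y - predict(m, newdata))^2))
rmse(simple,  d[train, ]);  rmse(complex, d[train, ])   # complex model wins on training data
rmse(simple,  d[-train, ]); rmse(complex, d[-train, ])  # but typically loses on unseen data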
Basics of R:
R programming introduction: R is a scripting language for statistical data manipulation,
statistical analysis, graphics representation and reporting. It was inspired by, and is mostly
compatible with, the statistical language S developed by AT&T. R was created by Ross Ihaka
and Robert Gentleman at the University of Auckland, New Zealand, and is currently
developed by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and looping as
well as modular programming using functions. R allows integration with the procedures
written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.
Features of R:
R is a programming language and software environment for statistical analysis, graphics
representation and reporting.
The following are the important features of R:
• R is a well-developed, simple, and effective programming language which includes
conditionals, loops, user-defined recursive functions, and input and output facilities.
• R has an effective data handling and storage facility.
• It incorporates features found in object-oriented and functional programming languages.
• R provides a suite of operators for calculations on arrays, lists, vectors, and matrices.
• R provides a large, coherent, and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on the
computer or in printed output.
• R provides basic data types including numeric, integer, character, logical, and complex.
• R provides powerful data structures, including vectors, matrices, lists, arrays, data
frames, and classes.
• Because R is open source software, it's easy to get help from the user community.
Also, a lot of new functions are contributed by users, many of whom are prominent
statisticians.
In conclusion, R is one of the world's most widely used statistical programming languages.
R Command Prompt: Once you have the R environment set up, it's easy to start the R
command prompt by typing the following command at your operating system prompt –
C:\Users\CSELAB>R
This will launch R interpreter and you will get a prompt > where you can start typing your
program as
follows −
> myString <- "WELCOME R PROGRAMMING"
> print(myString)
[1] "WELCOME R PROGRAMMING"
Here the first statement defines a string variable myString and assigns it the string
"WELCOME R PROGRAMMING"; the next statement uses print() to display the value
stored in the variable myString.
Object orientation: can be explained by example. Consider statistical regression. When you
perform a regression analysis with other statistical packages, such as SAS or SPSS, you get a
mountain of output on the screen. By contrast, if you call the lm() regression function in R,
the function returns an object containing all the results—the estimated coefficients, their
standard errors, residuals, and so on. You then pick and choose, programmatically, which
parts of that object to extract. You will see that R‘s approach makes programming much
easier, partly because it offers a certain uniformity of access to data.
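A small sketch of this idea, using R's built-in mtcars data set (the choice of variables is arbitrary):

fit <- lm(mpg ~ wt, data = mtcars)    # regress fuel efficiency on car weight
class(fit)                            # "lm": the result is an object, not printed output
coef(fit)                             # pick out just the estimated coefficients
head(residuals(fit))                  # or just the residuals
summary(fit)$coefficients             # coefficient table including standard errors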
Functional Programming:
As is typical in functional programming languages, a common theme in R programming is
avoidance of explicit iteration. Instead of coding loops, you exploit R‘s functional features,
which let you express iterative behavior implicitly. This can lead to code that executes much
more efficiently, and it can make a huge timing difference when running R on large data sets.
The functional programming nature of the R language offers many advantages:
• Clearer, more compact code
• Potentially much faster execution speed
• Less debugging, because the code is simpler
• Easier transition to parallel programming.
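A small sketch of the contrast, using a toy matrix of random numbers: the explicit loop and the functional versions compute the same column means, but the latter are more compact (and colMeans() is a vectorized built-in that is typically much faster):

m <- matrix(rnorm(1e6), ncol = 10)

# Explicit loop
out <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  out[j] <- mean(m[, j])
}

# Functional style: the iteration is expressed implicitly
out2 <- apply(m, 2, mean)
out3 <- colMeans(m)                  # vectorized built-in

all.equal(out, out2)                 # TRUE: same result, less code
all.equal(out2, out3)                # TRUE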
Comments: Comments are like helping text in your R program; they are ignored by the
interpreter while executing your actual program. A single-line comment is written using # at
the beginning of the statement, as follows:
# My first program in R Programming
R does not support multi-line comments, but you can use a trick such as the following −
if(FALSE){
"This is a demo for multi-line comments and it should be put inside either a single
OR double quote"
}
myString <- "Hello, World!"
print(myString)
Variables in R
Variables are used to store data whose value can be changed according to our need. The
unique name given to a variable (as well as to functions and objects) is called an identifier.
Valid identifiers in R
total, Sum, .fine.with.dot, this_is_acceptable, Number5
Invalid identifiers in R
tot@l, 5um, _fine, TRUE, .0ne
Best Practices
Earlier versions of R used the underscore (_) as an assignment operator, so the period (.) was
used extensively in variable names having multiple words. Current versions of R support the
underscore as a valid identifier character, but it is good practice to use the period as a word
separator. For example, a.variable.name is preferred over a_variable_name; alternatively, we
could use camel case, as in aVariableName.
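A few illustrative assignments showing these conventions (all names here are made up):

total <- 110                    # the most common assignment operator
Sum = 25                        # '=' also assigns at the top level
200 -> upper.limit              # rightward assignment, with a period-separated name
aVariableName <- "camel case"   # camel-case alternative
print(total + Sum)              # [1] 135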
Constants in R
Constants, as the name suggests, are entities whose value cannot be altered. Basic types of
constant are numeric constants and character constants.
Numeric Constants
All numbers fall under this category. They can be of type integer, double or complex. This
can be checked with the typeof() function. Numeric constants followed by L are regarded as
integer and those followed by i are regarded as complex.
>typeof(5)
[1] "double"
>typeof(5L)
[1] "integer"
>typeof(5i)
[1] "complex"
Character Constants: Character constants can be represented using either single quotes (')
or double quotes (") as delimiters.
> 'example'
[1] "example"
>typeof("5")
[1] "character"
Built-in Constants
Some of the built-in constants defined in R, along with their values, are shown below.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"
 [7] "July"      "August"    "September" "October"   "November"  "December"
> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
But it is not good to rely on these, as they are implemented as variables whose values can be
changed.
> pi
[1] 3.141593
> pi <- 56
> pi
[1] 56
Characteristics of Big Data
Big Data is commonly described in terms of three characteristics: volume, velocity, and
variety. It also comes from several sources and in many different types, and it is not always
easy to combine all of the tools you need to work with these different styles of data.
1. Volume
A run-of-the-mill PC may have had 10 gigabytes of storage in 2000. Today, Facebook is
regularly ingesting 500 terabytes of new data, and a Boeing 737 can generate 240 terabytes
of flight data on a single trip across the US. Smartphones, and the sensors embedded in
everyday objects, continuously produce billions of new, constantly refreshed data streams
containing human, environmental, and other information, including video.
2. Velocity
• Clickstreams and ad impressions capture consumer activity at millions of events per second.
• High-frequency stock trading algorithms reflect market movements within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Networks and sensors produce huge amounts of real-time log data.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
3. Variety
• Big Data is not just numbers, dates, and strings. It is also geospatial data, 3D data, audio
and video, and unstructured text, including log files and web documents.
• Traditional database systems were built to handle smaller volumes of structured data,
with fewer changes and a predictable, consistent structure.
• Big Data analysis involves data of many different kinds.
8) HPCC
9) Storm
10) Apache SAMOA
11) Talend
12) Rapidminer
13) Qubole
14) Tableau
15) R
Module-1: Questions