Data Science Unit 1 Notes


CS3352 FOUNDATIONS OF DATA SCIENCE

UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the
model– presenting findings and building applications - Data Mining - Data Warehousing – Basic
Statistical descriptions of Data

INTRODUCTION
Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains.
Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from structured and unstructured data.

DATA SCIENCE AND OTHERS


Various disciplines in data science are: Statistics, Big data analytics, Database
management, Machine learning, Data mining and Artificial intelligence.
Tools: Hadoop, Pig, Spark, R, Python, and Java.
Python is a great language for data science because it has many data science libraries available,
and it’s widely supported by specialized software.

BENEFITS AND USES OF DATA SCIENCE AND BIG DATA

Data science and big data are used almost everywhere in both commercial and
noncommercial settings.
Commercial companies in almost every industry use data science and big data to gain
insights into their customers, processes, staff, competition, and products. Many companies use
data science to offer customers a better user experience, as well as to cross-sell, up-sell, and
personalize their offerings. A good example of this is Google AdSense, which collects data from
internet users so relevant commercial messages can be matched to the person browsing the
internet.
Financial institutions use data science to predict stock markets, determine the risk of
lending money, and learn how to attract new clients for their services.
Governmental organizations are also aware of data’s value. It can help in extracting
useful information and knowledge from large volume of data in order to improve government
decision making.
Nongovernmental organizations (NGOs) are also no strangers to using data. They use
it to raise money and defend their causes. The World Wildlife Fund (WWF), for instance,
employs data scientists to increase the effectiveness of their fundraising efforts. Many data
scientists devote part of their time to helping NGOs, because NGOs often lack the resources to
collect data and employ data scientists. DataKind is one such data scientist group that devotes its
time to the benefit of mankind.
Universities use data science in their research but also to enhance the study experience of
their students. The rise of massive open online courses (MOOC) produces a lot of data, which
allows universities to study how this type of learning can complement traditional classes.
MOOCs are an invaluable asset if you want to become a data scientist and big data professional,
so definitely look at a few of the better-known ones: Coursera, Udacity, and edX. The big data
and data science landscape changes quickly, and MOOCs allow you to stay up to date by
following courses from top universities. If you aren’t acquainted with them yet, take time to do
so now; you’ll come to love them as we have.

FACETS OF DATA
In data science and big data you’ll come across many different types of data, and each of them
tends to require different tools and techniques. The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Let’s explore all these interesting data types.
 Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record.
As such, it’s often easy to store structured data in tables within databases or Excel files (figure
1.1). SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases. You may also come across structured data that might give you a hard time
storing it in a traditional relational database. Hierarchical data such as a family tree is one such
example.
The world isn’t made up of structured data, though; it’s imposed upon it by humans and
machines. More often, data comes unstructured.

An Excel table is an example of structured data.
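As a small illustration, a structured table like the one above can be loaded and queried in Python.
This is only a sketch: it assumes the pandas library and a hypothetical customers.xlsx file with
name, country, and age columns.

import pandas as pd

# Load a structured table (hypothetical file and column names).
customers = pd.read_excel("customers.xlsx")   # columns: name, country, age

# Because every record follows the same data model, filtering is straightforward.
adults_in_india = customers[(customers["country"] == "India") &
                            (customers["age"] >= 18)]
print(adults_in_india.head())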


 Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-
specific or varying. One example of unstructured data is your regular email. Although email
contains structured elements such as the sender, title, and body text, it’s a challenge to find the
number of people who have written an email complaint about a specific employee because so
many ways exist to refer to a person, for example. The thousands of different languages and
dialects out there further complicate this.
A human-written email, as shown, is also a perfect example of natural language data.

Email is simultaneously an example of unstructured data and natural language data.


 Natural language
Natural language is a special type of unstructured data; it’s challenging to process because it
requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained in one
domain don’t generalize well to other domains. Even state-of-the-art techniques aren’t able to
decipher the meaning of every piece of text. This shouldn’t be a surprise though: humans
struggle with natural language as well. It’s ambiguous by nature. The concept of meaning itself
is questionable here. Have two people listen to the same conversation. Will they get the same
meaning? The meaning of the same words can vary when coming from someone upset or joyous.
 Machine-generated data
Machine-generated data is information that’s automatically created by a computer, process,
application, or other machine without human intervention. Machine-generated data is becoming
a major data resource and will continue to do so. Wikibon has forecast that the market value of
the industrial Internet (a term coined by Frost & Sullivan to refer to the integration of complex
physical machinery with networked sensors and software) will be approximately $540 billion in
2020. IDC (International Data Corporation) has estimated there will be 26 times more connected
things than people in 2020. This network is commonly referred to as the internet of things.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and
telemetry (figure 1.3)
Example of machine-generated data
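As a rough illustration, a single web server log line (one common kind of machine-generated data)
can be parsed with Python's standard re module. The log format shown here is an assumed,
simplified Apache-style line, not taken from the text above.

import re

# A simplified, assumed Apache-style access log line.
line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Extract client IP, timestamp, request, status code, and response size.
pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(.*?)" (\d{3}) (\d+)'
match = re.match(pattern, line)
if match:
    ip, timestamp, request, status, size = match.groups()
    print(ip, status, size)   # 127.0.0.1 200 2326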
 Graph-based or network data
“Graph data” can be a confusing term because any data can be shown in a graph. “Graph” in this
case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects. Graph or network data is, in short, data that
focuses on the relationship or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graphical data.
Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two
people.
Examples of graph-based data can be found on many social media websites (figure 1.4). For
instance, on LinkedIn you can see who you know at which company. Your follower list on
Twitter is another example of graph-based data. The power and sophistication come from
multiple, overlapping graphs of the same nodes. For example, imagine the connecting edges here
showing “friends” on Facebook. Imagine another graph with the same people that connects
business colleagues via LinkedIn. Imagine a third graph based on movie interests on Netflix.
Overlapping the three different-looking graphs makes more interesting questions possible.
Friends in a social network are an example of graph-based data.
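A minimal sketch of graph-based data in plain Python: friendships stored as an adjacency list,
plus a breadth-first search for the shortest path between two people. The names and connections
are invented for illustration.

from collections import deque

# A tiny social network as an adjacency list (hypothetical people).
friends = {
    "Ann":   ["Bob", "Carol"],
    "Bob":   ["Ann", "Dave"],
    "Carol": ["Ann", "Dave"],
    "Dave":  ["Bob", "Carol", "Eve"],
    "Eve":   ["Dave"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns the shortest chain of friends."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(friends, "Ann", "Eve"))   # ['Ann', 'Bob', 'Dave', 'Eve']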

 Audio, image, and video


Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for
computers. MLBAM (Major League Baseball Advanced Media) announced in 2014 that they’ll
increase video capture to approximately 7 TB per game for the purpose of live, in-game
analytics. High-speed cameras at stadiums will capture ball and athlete movements to calculate
in real time, for example, the path taken by a defender relative to two baselines.
Recently a company called DeepMind succeeded at creating an algorithm that’s capable of
learning how to play video games. This algorithm takes the video screen as input and learns to
interpret everything via a complex process of deep learning. It’s a remarkable feat that prompted
Google to buy the company for their own Artificial Intelligence (AI) development plans. The
learning algorithm takes in data as it’s produced by the computer game; it’s streaming data.
 Streaming data
While streaming data can take almost any of the previous forms, it has an extra property. The
data flows into the system when an event happens instead of being loaded into a data store in a
batch. Although this isn’t really a different type of data, we treat it here as such because you need
to adapt your process to deal with this type of information.
Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock
market.
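A minimal sketch of the batch-versus-streaming difference in Python: each event is handled as it
arrives. The event source here is a generator that simulates a feed; it is not a real Twitter or
market feed.

import itertools
import random
import time

def event_stream():
    # Simulate events arriving one by one (a stand-in for a real feed).
    while True:
        yield {"user": random.choice(["ann", "bob"]), "clicks": random.randint(1, 5)}
        time.sleep(0.1)

# Process each event when it happens instead of loading a batch first.
running_total = 0
for event in itertools.islice(event_stream(), 10):   # handle the first 10 events
    running_total += event["clicks"]
    print(event["user"], "clicked; running total =", running_total)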
DATA SCIENCE PROCESS
Overview of the data science process
Following a structured approach to data science helps you to maximize your chances of success
in a data science project at the lowest cost. It also makes it possible to take up a project as a
team, with each team member focusing on what they do best. Take care, however: this approach
may not be suitable for every type of project or be the only way to do good data science. The
typical data science process consists of six steps through which you’ll iterate

The data science process


The following list is a short introduction; each of the steps will be discussed in greater depth
throughout this chapter.

1 The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project
this will result in a project charter.

2 The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is
data in its raw form, which probably needs polishing and transformation before it becomes
usable.

3 Now that you have the raw data, it’s time to prepare it. This includes transforming the data
from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and
correct different kinds of errors in the data, combine data from different data sources, and
transform it. If you have successfully completed this step, you can progress to data visualization
and modeling.
4 The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.

5 Finally, we get to the sexiest part: model building (often referred to as “data modeling”
throughout this book). It is now that you attempt to gain the insights or make the predictions
stated in your project charter. Now is the time to bring out the heavy guns, but remember
research has taught us that often (but not always) a combination of simple models tends to
outperform one complicated model. If you’ve done this phase right, you’re almost done.

6 The last step of the data science process is presenting your results and automating the analysis, if
needed. One goal of a project is to change a process and/or make better decisions. You may still
need to convince the business that your findings will indeed change the business process as
expected. This is where you can shine in your influencer role. The importance of this step is
more apparent in projects on a strategic and tactical level. Certain projects require you to perform
the business process over and over again, so automating the project will save time.
In reality you won’t progress in a linear way from step 1 to step 6. Often you’ll regress and
iterate between the different phases.

STEP 1: DEFINING RESEARCH GOALS AND CREATING A PROJECT CHARTER


A project starts by understanding the what, the why, and the how of your project. What does the
company expect you to do? And why does management place
such a value on your research? Is it part of a bigger strategic picture or a “lone wolf” project
originating from an opportunity someone detected? Answering these three questions (what, why,
and how) is the goal of the first phase, so that everybody knows what to do and can agree on
the best course of action.
The outcome should be a clear research goal, a good understanding of the context, well-defined
deliverables, and a plan of action with a timetable.
Spend time understanding the goals and context of your research
An essential outcome is the research goal that states the purpose of your assignment in a clear
and focused manner. Understanding the business goals and context is critical for project success.
Continue asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits in the bigger picture, appreciate how your research is
going to change the business, and understand how they’ll use your results.
Create a project charter
Clients like to know upfront what they’re paying for, so after you have a good understanding of
the business problem, try to get a formal agreement on the deliverables. All this information is
best collected in a project charter. For any significant project this would be mandatory.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline
Your client can use this information to make an estimation of the project costs and the data and
people required for your project to become a success.

STEP 2: RETRIEVING DATA


The next step in data science is to retrieve the required data. Sometimes you need to go into the
field and design a data collection process yourself, but most of the time you won’t be involved in
this step. Many companies will have already collected and stored the data for you, and what they
don’t have can often be bought from third parties. Don’t be afraid to look outside your
organization for data, because more and more organizations are making even high-quality data
freely available for public and commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a database. The
objective now is acquiring all the data you need. This may be difficult, and even if you succeed,
data is often like a diamond in the rough: it needs polishing to be of any use to you.
 Start with data stored within the company
The data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals. The primary goal of a
database is data storage, while a data warehouse is designed for reading and analyzing that data.
A data mart is a subset of the data warehouse and geared toward serving a specific business unit.
While data warehouses and data marts are home to preprocessed data, data lakes contain data in
its natural or raw format. But the possibility exists that your data still resides in Excel files on the
desktop of a domain expert.
 Don’t be afraid to shop around
If data isn’t available inside your organization, look outside your organization’s walls. Many
companies specialize in collecting valuable information. For instance, Nielsen and GFK are well
known for this in the retail industry. Other companies provide data so that you, in turn, can
enrich their services and ecosystem. Such is the case with
Twitter, LinkedIn, and Facebook. Although data is considered an asset more valuable than oil by
certain companies,
more and more governments and organizations share their data for free with the world. This data
can be of excellent quality; it depends on the institution that creates and manages it. The
information they share covers a broad range of topics such as the number of accidents or amount
of drug abuse in a certain region and its demographics.
This data is helpful when you want to enrich proprietary data but also convenient when training
your data science skills at home.

A list of open-data providers that should get you started


 Do data quality checks now to prevent problems later
Expect to spend a good portion of your project time doing data correction and cleansing,
sometimes up to 80%. The retrieval of data is the first time you’ll inspect the data in the data
science process. Most of the errors you’ll encounter during the data gathering phase are easy to
spot, but being too careless will make you spend many hours solving data issues that could have
been prevented during data import.
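A few quick checks run right after import help catch such issues early. This sketch assumes the
pandas library and a hypothetical raw.csv file.

import pandas as pd

raw = pd.read_csv("raw.csv")        # hypothetical raw extract

raw.info()                          # column types and non-null counts
print(raw.describe())               # value ranges quickly reveal impossible values
print(raw.isnull().sum())           # missing values per column
print(raw.duplicated().sum())       # number of fully duplicated rows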

STEP 3: CLEANSING, INTEGRATING, AND TRANSFORMING DATA


The data received from the data retrieval phase is likely to be “a diamond in the rough.” Your
task now is to sanitize and prepare it for use in the modeling and reporting phase.
 Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in
your data so your data becomes a true and consistent representation of the processes it originates
from.
By “true and consistent representation” we imply that at least two types of errors exist. The first
type is the interpretation error, such as when you take the value in your data for granted, like
saying that a person’s age is greater than 300 years. The second type of error points to
inconsistencies between data sources or against your company’s standardized values. An
example of this class of errors is putting “Female” in one table and “F” in another when they
represent the same thing: that the person is female. Another example is that you use Pounds in
one table and Dollars in another.
 DATA ENTRY ERRORS
Data collection and data entry are error-prone processes. They often require human intervention,
and because humans are only human, they make typos or lose their concentration for a second
and introduce an error into the chain. But data collected by machines or computers isn't free
from errors either. Some errors arise from human sloppiness, whereas others are due to machine
or hardware failure.
 IMPOSSIBLE VALUES AND SANITY CHECKS
Sanity checks are another valuable type of data check. Here you check the value against
physically or theoretically impossible values such as people taller than 3 meters or someone with
an age of 299 years. Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
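The same rule can be applied to a whole column at once. This is only a sketch, assuming the
pandas library and an invented table with an age column.

import pandas as pd

people = pd.DataFrame({"name": ["Ann", "Bob", "Carol"],
                       "age":  [34, 299, 58]})          # 299 is clearly impossible

# Flag rows that violate the sanity rule 0 <= age <= 120.
violations = people[~people["age"].between(0, 120)]
print(violations)                                       # shows Bob with age 299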

 OUTLIERS
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations. The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on
the upper side when a normal distribution is expected. The normal distribution, or Gaussian
distribution, is the most common distribution in natural sciences.
Distribution plots are helpful in detecting outliers and helping you understand the variable.
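A short sketch of a simple outlier check in Python: inspect the minimum and maximum, then flag
values more than three standard deviations from the mean. The data is invented (eleven ordinary
amounts plus one obvious data-entry error) and pandas is assumed.

import pandas as pd

amounts = pd.Series([2100, 2300, 2250, 2400, 2200, 2350,
                     2150, 2450, 2300, 2250, 2200, 98000])

print("min:", amounts.min(), "max:", amounts.max())

# Values more than 3 standard deviations from the mean are possible outliers.
z_scores = (amounts - amounts.mean()) / amounts.std()
print(amounts[z_scores.abs() > 3])                      # flags the 98000 entry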

 Correct errors as early as possible


A good practice is to mediate data errors as early as possible in the data collection chain and to
fix as little as possible inside your program while fixing the origin of the problem.
■ Data errors may point to defective equipment, such as broken transmission lines and defective
sensors.
■ Data errors can point to bugs in software or in the integration of software that may be critical
to the company.

 Combining data from different data sources (INTEGRATION)


Your data comes from several different places, and in this substep we focus on integrating these
different sources. Data varies in size, type, and structure, ranging from databases and Excel files
to text documents.
 THE DIFFERENT WAYS OF COMBINING DATA
You can perform two operations to combine information from different data sets. The first
operation is joining: enriching an observation from one table with information from another
table. The second operation is appending or stacking: adding the observations of one table to
those of another table.
 JOINING TABLES
Joining tables allows you to combine the information of one observation found in one table with
the information that you find in another table.
Let’s say that the first table contains information about the purchases of a customer and the other
table contains information about the region where your customer lives. Joining the tables allows
you to combine the information so that you can use it for your model.
To join tables, you use variables that represent the same object in both tables, such as a date, a
country name, or a Social Security number. These common fields are known as keys. When
these keys also uniquely define the records in the table they are called primary keys. One table
may have buying behavior and the other table may have demographic information on a person.
Joining two tables on the Item and Region keys
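A minimal sketch of such a join in Python, assuming the pandas library; the tables and the
customer_id key below are invented to mirror the purchase/region example rather than the Item
and Region keys of the figure.

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2, 3],
                          "amount":      [55, 120, 30]})
regions   = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region":      ["North", "South", "North"]})

# Enrich each purchase with the customer's region using the shared key.
enriched = purchases.merge(regions, on="customer_id", how="left")
print(enriched)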

 APPENDING TABLES
Appending or stacking tables is effectively adding observations from one table to another table.
One table contains the observations from the month January and the second table contains
observations from the month February. The result of appending these tables is a larger one with
the observations from January as well as February. The equivalent operation in set theory would
be the union, and this is also the command in SQL, the common language of relational databases.
Other set operators are also used in data science, such as set difference and intersection.
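The corresponding append/stack operation, again only sketched with pandas and invented
January/February observations:

import pandas as pd

january  = pd.DataFrame({"customer_id": [1, 2], "amount": [55, 120]})
february = pd.DataFrame({"customer_id": [1, 3], "amount": [70, 30]})

# Stack the February observations under the January ones (like UNION ALL in SQL).
both_months = pd.concat([january, february], ignore_index=True)
print(both_months)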

 Transforming data
Certain models require their data to be in a certain shape. Now that you’ve cleansed and
integrated the data, this is the next task you’ll perform: transforming your data so it takes a
suitable form for data modeling.
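Two common transformations, sketched with numpy and pandas on invented data: a log transform
to tame a long-tailed numeric variable, and dummy variables so that a categorical column can
enter a model.

import numpy as np
import pandas as pd

sales = pd.DataFrame({"revenue": [120, 3400, 56000],
                      "channel": ["web", "store", "web"]})

# Log transform compresses the long tail of a skewed numeric variable.
sales["log_revenue"] = np.log(sales["revenue"])

# Dummy (one-hot) variables turn a categorical column into model-ready numbers.
sales = pd.get_dummies(sales, columns=["channel"])
print(sales)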

STEP 4: EXPLORATORY DATA ANALYSIS


During exploratory data analysis you take a deep dive into the data. Information becomes much
easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain an
understanding of your data and the interactions between variables. This phase is about exploring
data, so keeping your mind open and your eyes peeled is essential during the exploratory data
analysis phase. The goal isn’t to cleanse the data, but it’s common that you’ll still discover
anomalies you missed before, forcing you to take a step back and fix them.
The visualization techniques you use in this phase range from simple line graphs and histograms
to more complex diagrams.

Bar chart

Line chart
Distribution chart
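A small sketch of one such exploratory plot, assuming the matplotlib library and invented
measurements:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=1000)   # invented measurements

# Histogram: a quick look at the distribution of a single variable.
plt.hist(values, bins=30)
plt.title("Distribution of the variable")
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()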
STEP 5: BUILD THE MODELS
Using machine learning and statistical techniques to achieve your project goal.
With clean data in place and a good understanding of the content, you’re ready to build models
with the goal of making better predictions, classifying objects, or gaining an understanding of the
system that you’re modeling. The techniques you’ll use now are borrowed from the field of
machine learning, data mining, and/or statistics.
Building a model is an iterative process. The way you build your model depends on whether you
go with classic statistics or the somewhat more recent machine learning school, and the type of
technique you want to use. Either way, most models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison.
 Model and variable selection
You’ll need to select the variables you want to include in your model and a modeling technique.
Your findings from the exploratory analysis should already give a fair idea of what variables will
help you construct a good model. Many modeling techniques are available, and choosing the
right model for a problem requires judgment on your part. You’ll need to consider model
performance and whether your project meets all the requirements to use your model, as well as
other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
■ Does the model need to be easy to explain?
When the thinking is done, it’s time for action.
 Model execution
Once you’ve chosen a model you’ll need to implement it in code.
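What "implementing it in code" can look like is sketched below, assuming the scikit-learn library
and invented data; the actual technique depends on the project charter.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented data: predict y from a single feature X.
X = np.arange(100).reshape(-1, 1)
y = 3.0 * X.ravel() + np.random.normal(scale=5.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)              # step 2: execution of the model
print("R^2 on held-out data:", model.score(X_test, y_test))   # step 3: diagnosis on new data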
STEP 6: PRESENTING FINDINGS AND BUILDING APPLICATIONS ON TOP OF
THEM
After you’ve successfully analyzed the data and built a well-performing model, you’re ready to
present your findings to the world. This is an exciting part; all your hours of hard work have paid
off and you can explain what you found to the stakeholders.

Presenting your results to the stakeholders and industrializing your analysis process for repetitive
reuse and integration with other tools.
WHAT IS DATA MINING?
Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer; a more appropriate name would be knowledge mining, which emphasizes
extracting knowledge from large amounts of data. It is the
computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems. The
overall goal of the data mining process is to extract information from a data set and transform it
into an understandable structure for further use. The key properties of data mining are
 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large datasets and databases

1. The Scope of Data Mining


Data mining derives its name from the similarities between searching for valuable
business information in a large database — for example, finding linked products in gigabytes of
store scanner data — and mining a mountain for a vein of valuable ore. Both processes require
either sifting through an immense amount of material, or intelligently probing it to find exactly
where the value resides. Given databases of sufficient size and quality, data mining technology
can generate new business opportunities by providing these capabilities:

o Automated prediction of trends and behaviors. Data mining automates the process of
finding predictive information in large databases. Questions that traditionally required
extensive hands-on analysis can now be answered directly from the data — quickly. A
typical example of a predictive problem is targeted marketing. Data mining uses data on
past promotional mailings to identify the targets most likely to maximize return on
investment in future mailings. Other predictive problems include forecasting bankruptcy
and other forms of default, and identifying segments of a population likely to respond
similarly to given events.

o Automated discovery of previously unknown patterns. Data mining tools sweep


through databases and identify previously hidden patterns in one step. An example of
pattern discovery is the analysis of retail sales data to identify seemingly unrelated
products that are often purchased together. Other pattern discovery problems include
detecting fraudulent credit card transactions and identifying anomalous data that could
represent data entry keying errors.
2. Tasks of Data Mining
Data mining involves six common classes of tasks (a short code sketch of one task follows the list):
o Anomaly detection (Outlier/change/deviation detection) – The identification of
unusual data records, that might be interesting or data errors that require further
investigation.
o Association rule learning (Dependency modelling) – Searches for relationships
between variables. For example, a supermarket might gather data on customer purchasing
habits. Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
o Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.

o Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
o Regression – attempts to find a function which models the data with the least error.
o Summarization – providing a more compact representation of the data set, including
visualization and report generation.
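A brief sketch of the clustering task from the list above, assuming the scikit-learn library and
invented two-dimensional data:

import numpy as np
from sklearn.cluster import KMeans

# Invented data: two loose groups of points.
points = np.array([[1, 2], [1, 4], [2, 3],
                   [9, 9], [10, 11], [11, 10]])

# Ask for two clusters without telling the algorithm which point belongs where.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)   # e.g. [0 0 0 1 1 1]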

3. Architecture of Data Mining

A typical data mining system may have the following major components.
Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies used to
organize attributes or attribute values into different levels of abstraction. Knowledge such as user
beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may
also be included. Other examples of domain knowledge are additional interestingness constraints
or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern Evaluation Module:


This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use interestingness
thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the implementation of the data mining method
used. For efficient data mining, it is highly recommended to push the evaluation of pattern
interestingness as deep as possible into the mining process so as to confine the search to only the
interesting patterns.

User interface:
This module communicates between users and the data mining system, allowing the user
to interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.

DATA WAREHOUSE:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse
can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Data Warehouse Design Process: A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both.

A Three Tier Data Warehouse Architecture:

Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g.,
to merge similar data from different sources into a unified format), as well as load and refresh
functions to update the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS and allows
client programs to generate SQL code to be executed at a server. Examples of gateways include
ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database)
by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse
and its contents.

Tier-2: The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
A multidimensional OLAP (MOLAP) model is a special-purpose server that directly
implements multidimensional data and operations.

Tier-3: The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
BASIC STATISTICAL DESCRIPTION OF DATA

Statistics is the branch of mathematics dealing with the collection, analysis, interpretation, and
presentation of masses of numerical data.

 Descriptive Statistics

Descriptive statistics provides us with tools—tables, graphs, averages, ranges,


correlations—for organizing and summarizing the inevitable variability in collections of actual
observations or scores.
Examples are:
1. A tabular listing, ranked from most to least.
2. A graph showing the annual change in global temperature during the last 30 years.
3. A report that describes the average difference in grade point average (GPA) between
college students.

 Inferential Statistics

Statistics also provides tools—a variety of tests and estimates—for generalizing beyond
collections of actual observations. This more advanced area is known as inferential statistics.
For example:
An assertion about the relationship between job satisfaction and overall happiness.

Different statistical methods:


MODE
MEDIAN
MEAN
WHICH AVERAGE?

 MODE
 The mode reflects the value of the most frequently occurring score.
 Set of numbers that appears most often.
Ex: Determine the mode of the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
(here, 63 occurs most often).

More Than One Mode


Bimodal: Distributions with two obvious peaks, even though they are not exactly the
same height, are referred to as bimodal.
Multimodal: Distributions with more than two peaks are referred to as multimodal.
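A quick check of the retirement-age example with Python's standard statistics module:

from statistics import mode, multimode

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

print(mode(ages))        # 63, the most frequent value
print(multimode(ages))   # [63]; a bimodal list would return two values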

 MEDIAN
 The median reflects the middle value when observations are ordered from least to
most.
 The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
 MEAN
 The mean is the most common average, one you have doubtless calculated many times.
 The mean is found by adding all scores and then dividing by the number of scores.
That is, mean = (sum of all scores) / (number of scores).

Two types of MEAN


Sample Mean
 A subset of scores.
 The balance point for a sample, found by dividing the sum for the values of all scores in
the sample by the number of scores in the sample.

Formula for Sample Mean

It’s usually more efficient to substitute symbols for words in statistical formulas,
including the word formula given above for the mean. When symbols are used, X̄ (read “X-bar”)
designates the sample mean, and the formula becomes

X̄ = ΣX / n

which reads: “X-bar equals the sum of the variable X divided by the sample size n.” [Note that the
uppercase Greek letter sigma (Σ) is read as the sum of, not as sigma.]

Formula for Population Mean


Population Mean (μ)
The balance point for a population, found by dividing the sum for all scores in the
population by the number of scores in the population.

The formula for the population mean differs from that for the sample mean only because
of a change in some symbols. In statistics, Greek symbols usually describe population
characteristics, such as the population mean, while English letters usually describe sample
characteristics, such as the sample mean. The population mean is represented by μ (pronounced
“mu”), the lowercase Greek letter m for mean:

μ = ΣX / N

where the uppercase letter N refers to the population size. Otherwise, the calculations are the
same as those for the sample mean.
Sample question: All 57 residents in a nursing home were surveyed to see how many meals they
eat per day.
1 meal (2 people)
2 meals (7 people)
3 meals (28 people)
4 meals (12 people)
5 meals (8 people)
What is the population mean for the number of meals eaten per day?
μ = [(1×2) + (2×7) + (3×28) + (4×12) + (5×8)] / 57
  = (2 + 14 + 84 + 48 + 40) / 57
  = 188 / 57
  = 3.29 (approximately 3.3)
The population mean is 3.3.
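The same calculation written out in Python (standard library only):

# Number of meals per day and how many residents reported each.
meals  = [1, 2, 3, 4, 5]
counts = [2, 7, 28, 12, 8]

total_meals     = sum(m * c for m, c in zip(meals, counts))   # 188
population_size = sum(counts)                                 # 57

mu = total_meals / population_size
print(round(mu, 1))   # 3.3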

 WHICH AVERAGE
If Distribution Is Not Skewed
When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.

If Distribution Is Skewed
 Positively skewed distribution
In a positively skewed distribution, the mean is greater than the median and the distribution is
skewed in the positive direction (right side).
 Negatively skewed distribution
In a negatively skewed distribution, the mean is less than the median and the distribution is
skewed in the negative direction (left side).
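A tiny numeric illustration of these rules, with invented values whose single large observation
creates a positive skew:

from statistics import mean, median

incomes = [20, 22, 23, 25, 24, 21, 300]   # one long right tail

print(mean(incomes))     # about 62.1, pulled toward the tail
print(median(incomes))   # 23, so mean > median in a positively skewed distribution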
