
Introduction to Data Science

Dr Marcin Maleszka
Wroclaw University of Science and Technology, Poland
for
International University, Vietnam
Introduction assignment (for class no. 3)
• Using any methods try to find info on me:
• What is my home address / what do I drive (plates number)?
• Alternatively: how would you do this for a person in Vietnam?
• Could the methods used be automated for a large number of people?
• A common method for a single approach (security specialist) would be to ask me or
someone else. In Data Science we need to look for data about thousands/millions of
people. Finding me is an example, we need a method to find a TYPE of person.
• When doing any assignments remember to:
• Give your name / student ID
• Provide final answer and steps to solution
• Be brief but precise
• This task will outline one of the first problems a Data Scientist
encounters – where to get the data!
What is „Data”?
• Organization by complexity of concepts:
• Data – raw numbers
• Information – interpretation added
• Knowledge – rules added / pattern extracted
• (Wisdom? Trust? Intelligence?)

• For purposes of Data Science any of those may be the input


• Most often, as in other places, it will be „raw” data
• The result will be often knowledge, but sometimes information
What is Data Science?
• The methods to extract useful information and knowledge from data,
but mostly:
• unexpected patterns
• aggregations (visualisations)
• representations (models)
• Data Science operates on the level of Data Lake
• It takes into account all possible sources in all possible situations
• Many „classic” fields are nowadays treated as Data Science
• Statistics
• Data Mining
• Some tools from areas of Machine Learning and Artificial Intelligence
• Graph analysis (Social Networks)
What is this course
• We will briefly follow most common tools of Data Science, mostly
from the point of view of statistics and data mining.
• We will visit Big Data, machine learning, AI, graph methods and others
• Literature:
• Slides! on Blackboard
• Murtaza Haider, Getting Started with Data Science. IBM Press, 1st edition, 2015.
• Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques. 3rd edition, 2011.
• Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining of Massive Datasets. 2nd edition, 2014.
Grading

• 40% Final Exam (live in IU)


• 30% Mid-term Exam (live in IU)
• 30% Assignments & Tests during online class

• Minimum of 80% attendance required.


Data Scientists & Data Analysts

• Data Scientist is a new catch-all term for an old concept; it may apply to:
• Statisticians
• (and mathematical positions overall)
• Risk Analysts
• (and Analyst positions overall)
• Business Intelligence specialists
• Data Warehouse specialists

• But there are (small) differences and similar positions outside DS name
https://www.datanix.ai/post/iipgh-data-science-webinar-5
http://nirvacana.com/thoughts/2013/07/08/becoming-a-data-scientist/
https://medium.com/hackernoon/navigating-the-data-science-career-landscape-db746a61ac62
Data Science presentation

• Data Scientist needs to present the result in an attractive form. This
includes both the written part and any graphics (tables and graphs).
• Later assignments and some exam questions will require this part!
• Try your hand at giving narratives to a report on simple data (information):
• Weather – it is 30°C the whole week, but 20°C on one day
• How to narrate (not „describe”) it in Vietnam? How to narrate it in colder country (Poland)?
• Several years ago a Mars probe crashed because one team used SI units and the other
Imperial units (e.g. meters vs yards)
• When writing a newspaper report, what narration to build here?
• What narration to use in internal report, a white paper, a scientific paper?
Reporting data - selecting good graphs
• Which graph will fit best?
• Line graph presenting a time series with a trend line
• Comparison of category distribution – pie chart, column chart
• Gauge / dial
• Result cards
• Progression table
• Raw data
• Choosing a method of presentation depends on intended message
• Remember to add legend and label axes/categories
Graphs

• Important rule: NOT TOO MUCH DATA ON ONE GRAPH

• Adding labels with values:


• Only if 1-2 data series
• No 3D graphs with data series
• Two graphs: long- and short-term perspective
• A simpler graph is more readable
Graphs: trend lines
• Why are trends important?
• Usually a trend is shown on a line graph
• If a lot of the graph is empty, you may rescale an axis
• It may be a layer graph

• Merging a column graph and a line graph:
• Only if the same values are on the X axis
• Very clear legend
• Very good description of the Y axis scales
Graphs: comparison graphs
• Comparing the same measure to the previous value
• What is the most important message?

• The more periods compared, the harder the analysis
• The key is to find a good type of graph
• May use 3D if it remains readable
Graphs: attribute distribution graph
• Distribution is best visible on a pie (wheel) chart
• Need information on the time when the data was gathered
• There are no trends
• Notes:
• Add labels
• No more than 10 categories (the rest of the pie should be labeled „other”)
• Show two pie charts for comparisons
• Check if it's the most readable approach
• Does not need a Data Warehouse (only one dimension shown)
Filtering in dashboards
• Filtering
• Reduces the range of presented KPIs
• Convenient for interactive reports
• Prepared options for changing perspective
• How to compare results for different filters?
• May analyze deeper by adding more attributes

• Pivoted tables and graphs:


• May analyze data for different values of the same attribute
• Deeper analysis
Data visualization
• Visualization
• Important part of understanding information
• Helps to search for information and make decisions
• Supports analysis of larger datasets
• Allows detecting additional dependencies
• Reduces effort needed to process information
• Helps to remember data
• Important aspects:
• Color, shape, size, orientation, position, readability
• Visual form, graphical elements, visual clues
• Conditional formatting
• Infographics, schemas, graphs
Data visualization
• Creating a graph:
1. Determine aim – what to show?
2. Determine what to compare
1. Percentage of whole
2. Ranking
3. Change dynamics
4. Histogram
5. Correlations between variables
3. Prepare graphs
4. Format graph
Example: column graphs
Example: column graphs
The first assignment
(for class no. 5)
• This is how someone presented
data on temperatures
• Is it easy to read and understand?
• Present similar weather data in a clearer format
• Start from raw data that you find in any source.
• Determine what you wish to show (here: how many days were very hot, very
cold in each year in the previous 70 years).
• Create a very clear and easy to follow graph – it may look however you wish,
as long as YOU find it clear.
• We will discuss some of your approaches during class
Organizing the data
• In general Data Science operates on the level of Data Lake – all
information, without filtering, in one place. We take out specific parts
to investigate specific situations.
• This can be operational database of company (OLTP) + external
knowledge sources pooled together in one place.
• Alternatively, the information may be organized in some other form:
• Tabular (denormalized database)
• Multidimensional (data warehouse)
• Tree or Forest (documents)
• Graph (social network)
Historical perspective on DB
• 1960s:
• Data collection, databases are created, networked DBMS
• 1970s:
• Relational model, implementation of relational DBMS
• 1980s:
• RDBMS, advanced data models (extended-relational, object-oriented, etc.)
• 1990s:
• Data mining, data warehouses, multimedia databases, web databases
• 2000s and later:
• moving services to cloud, Big Data, Data Science (nothing fundamentally new)
Historical perspective in business
• Relational databases
• Different systems in a single company
• Accounting
• Sales
• Logistics
• HR
• Client relations
• ERP offers integration of some aspects, but not strategic analysis
• A new tool was required
OLTP = On-Line Transactional Processing
• Contains data oriented towards processes (e.g. invoices)
• The amount of data is limited (e.g. several GB)
• Contains only current data or limited historical data
• Works with a large amount of simple queries
• Contains basic data (atomic values)
• All operations are allowed: adding, modifying, deleting data
OLAP = On-line Analytical Processing
(for pure data source: Data Warehouses)

• Contains data oriented towards topics (e.g. sales, inventory)


• The amount of data is unlimited (e.g. TBs or more)
• Contains current data and ALL historical data
• Works with very complex queries concerning a lot of data
• Contains basic data and aggregations
• Data is often added, very rarely modified, „never” deleted
Data warehouse – simple definition
Data warehouse is a:
• Topic oriented
• Integrated
• Chronological
• Constant
data collection intended for decision support tasks.

• Note: it is not a type of database. Instead both are types of data collection.
Data warehouse and data separation
Data profiling
• Candidate keys
• Amount of missing data
• Distribution of data
• Unique values
ETL
• In most basic terms: download data from the source and load it into
the Data Warehouse
• Copying (duplicating) data between data collections (data bases)
• Data is Extracted from OLTP database, Transformed to fit the DW
schema and Loaded into the DW
• A copy of source data may (or may not) be stored on the DW
hardware
• The theoretical aspects of design are more important than eventual
implementation.
ETL and ELT
ETL vs ELT

ETL:
• Extract – duplicating the data into the temporary staging area (needs another server)
• Transform – preparing the model and transforming data to the desired form (schema-on-write)
• Load

ELT:
• Extract – preparing the data from the source in their original form (schema-on-read)
• Load – duplicating the raw data to the DW server (into the Data Lake)
• Transform – using methods working with non-relational data or data in different formats and structures
PART 1: Statistics
Probability
• What is the chance of rolling a 2 on a die?
• Before even trying to answer you should narrow down
the boundaries of the problem!
• Teachers often assume 6-sided dice (d6)
• Tabletop gamers will commonly use d4,d6,d8,d10,d20.

• So what is the chance of rolling 2 on a d6?


• „All things being equal” it is 1 in 6.

• I have rolled 1,6,4,3,1,5,6,1,3,4,5,3,6,1,3,5 on a d6.


• What is the chance of rolling 2 now?
Probability – random variable
• Random variable is a function that assigns results to simple events.
• The result of rolling a d6 could be some random variable x.
• The height of a random student could be described by a random variable y.

• Expected value („the average”)

• Variance (how spread out the values are) – the standard deviation squared
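
A minimal R sketch of these two summaries, applied to the d6 rolls listed on the earlier slide (the theoretical E[X] = 3.5 for a fair d6 is a standard fact, not taken from the slides):

rolls <- c(1, 6, 4, 3, 1, 5, 6, 1, 3, 4, 5, 3, 6, 1, 3, 5)  # the d6 rolls from the earlier slide
mean(rolls)   # sample estimate of the expected value (E[X] = 3.5 for a fair d6)
var(rolls)    # sample variance (uses the n-1 correction)
sd(rolls)^2   # standard deviation squared equals the variance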
Probability – density (uniform distribution)
Probability – density (normal distribution)
Probability – distributor function
• Probability that P(X < x)
(the y axis is the probability that the value of the occurrence
will be smaller than the value on the x axis)

• How would it look for


uniform distribution?
t-Student distribution
• When we don't have enough data for working with the normal
distribution.
Standardization
• Statistical tools are geared towards N(0,1) for the normal distribution.
• We need to transform the real distribution to this one:
N(μ, σ) ⟶ N(0,1)

• The process of standardization takes the parameters of the population, not
of the data (average and standard deviation):
z = (x − μ) / σ

• There are estimators for these values as well…
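
A minimal R sketch of the standardization step, reusing the population parameters from the height example that follows (μ = 177, σ = 19); the x values are just illustrative:

x     <- c(160, 177, 196)              # example heights in cm
mu    <- 177                           # population average
sigma <- 19                            # population standard deviation
z     <- (x - mu) / sigma              # standardized values, ~ N(0,1)
z
scale(x, center = mu, scale = sigma)   # the same transformation with scale()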


Probability
Probability – rolling 2d6
Example
• Average height: 177cm
• Standard deviation: 19cm

• Let’s calculate probability that:


a) Someone is 160 cm tall or less
b) Someone is taller than 160 cm
c) Someone's height is in the range [158, 196)
d) Someone's height is outside this range

• As P(a) + P(b) = 1 and P(c) + P(d) = 1, we will have only half of


calculations to complete!
Example (P(x<160))
For N(177,19): p1 = P(x ≤ 160) = 0.185464

Example (P(x>160))
For N(177,19): p2 = P(x > 160) = 1 – p1 = 0.814536

Example (P(158<x<196))
For N(177,19): p3 = P(158 ≤ x < 196) ≈ 0.682689

Example (P(x<158 || x≥196))
For N(177,19): p4 = P(x < 158 || x ≥ 196) = 1 – p3 ≈ 0.317311
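
The four probabilities can be reproduced in R with the normal distributor function pnorm(); a minimal sketch:

p1 <- pnorm(160, mean = 177, sd = 19)            # P(x <= 160)            ~ 0.1855
p2 <- 1 - p1                                     # P(x > 160)             ~ 0.8145
p3 <- pnorm(196, 177, 19) - pnorm(158, 177, 19)  # P(158 <= x < 196)      ~ 0.6827
p4 <- 1 - p3                                     # P(x < 158 or x >= 196) ~ 0.3173
c(p1, p2, p3, p4)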


Using standard deviation

Long tail (fat tail)
Boxplots
(Figure: elements of a boxplot – the median, lower quantile Q1, upper quantile Q3,
minimum and maximum values, shown next to the average, standard deviation and
95% confidence interval.)
Boxplot
Example
• Using data {1,1,1, 2,2,2,2, 3,3, 100} calculate
• Average
• Median
• Q1 quantile
• Q3 quantile

• 90% confidence interval for average


• 90% confidence interval for median
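
A minimal R sketch for the first four items of this example (the confidence intervals need extra steps, e.g. a t-based interval for the average or a bootstrap for the median, so only the t-based one is hinted at):

x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 100)
mean(x)                      # average, pulled up strongly by the outlier 100
median(x)                    # median
quantile(x, c(0.25, 0.75))   # Q1 and Q3 quantiles
boxplot(x)                   # visual summary
t.test(x, conf.level = 0.90)$conf.int   # 90% t-based interval for the average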
Confidence intervals
• Multiple uses. Here, one of the most important is the chance that values will
fall inside a given interval when resampling.
Example
• We have a bus line with some total travel time (end-to-end). Officially
it is 28 minutes. We will test this hypothesis:

• Let's take a random sample of 100 runs of the bus. The average time
is x̄ = 31.5 minutes and the standard deviation is s = 5 minutes. Now we
construct a 95% confidence interval for the average. For this large size
of group we use the normal distribution.
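
A minimal R sketch of this interval; the numbers follow directly from the values given above:

xbar <- 31.5                        # sample average (minutes)
s    <- 5                           # sample standard deviation
n    <- 100                         # number of bus runs
z    <- qnorm(0.975)                # ~1.96 for a 95% two-sided interval
xbar + c(-1, 1) * z * s / sqrt(n)   # ~ [30.52, 32.48]; the official 28 minutes lies outside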
Statistical Tests
• A statistical hypothesis is any assumption about the distribution of
the population (its general function or its parameters).
• We check if the hypothesis is true based on a random sample.
• A statistical test is a method that allows us to reject or „accept” a
statistical hypothesis. It is a decision rule.

• Non-parametric tests are used to check the type of distribution


(normal or not) and randomness of sample.
• Parametric tests check specific values, e.g. average, variance.
Statistical tests
• We always check the 0-th hypothesis („are equal”) against its
opposite – alternate hypothesis („are different”).
• In each statistical test we first formulate the 0-th hypothesis H0 –
what we verify, and the alternate hypothesis H1, which would be
accepted if we reject the 0-th hypothesis. We also set up some
significance level α, that is the maximum error risk that we are willing
to accept.
• Finally we determine the correct test statistic (a function) based on
information about the sample and our hypotheses. We calculate the
result of the statistic – the p-value.
Statistical tests
• P-value determines the result of the test:
• If p-value is smaller or equal to α then we reject H0 and accept H1 as true.
• If p-value is larger than α then we do not have a basis to reject H0.
• Formally speaking H0 can never be accepted, we just say that we
cannot rule it out.
• This is because rejecting H0 has only 5% (α) chance of error, but for accepting
it the error cannot be minimized to such values (it is another variable β).
• In practice if α = 0.05 and p = 0.99 then we would have a high likelihood that
the 0-th hypothesis is true and could proceed with such an assumption to the next
steps. But it is a last resort!
• It is better to reformulate the 0-th hypothesis, so that we AIM to reject it.
Normal distribution tests
• While usually books start with parametric tests for average or
variance, these require a normal (t-Student) distribution. Let us
instead start with tests that will check if the distribution is normal.

• The H0 is that the distribution of values in the population is normal.


• If we reject H0 then the distribution is not normal.
• If we cannot reject H0, then it may be normal and we make sure of it by using
other methods. The most common approach is creating a graph (density or
distributor).
Normal distribution tests
• For „small” samples we usually work with two tests:
• Kolmogorov-Smirnov test (KS), often with the Lilliefors correction (KSL test), which is
used when we do not know the average or standard deviation of the entire
population.
• Shapiro-Wilk test (SW test), which is preferable in most places, due to very
good quality of results. It may give wrong results
for data with over 2.000 samples.

• The verification graph is for example Q-Q plot:


Normal distribution tests: KS
• The Kolmogorov-Smirnov test was developed in the 1930s. As a one-
sample test it allows a comparison between the hypothetical distributor
function of a normal distribution (CDF) and the empirical distributor
function (ECDF) by calculating their maximum difference.
• If the difference (D statistic) is large, then a hypothesis that the distribution is
normal should be rejected.
• In many implementations some calculations from 1950s are used and require
that average and standard deviation of the normal distribution of entire
population is known beforehand.
• K-S can be also used to compare distributions of any two samples.
Normal distribution tests: KS
• Overall assumptions of KS test (two-sided)
• Samples are random (both groups of samples are random and independent)
• The space should be at least ordinal (can define „larger than” function) and
preferentially continuous.
• We know the average and standard deviation of entire population(s). We
cannot use data estimated from sample.
• Hypothesis:
• H0: F(x) = G(x) for all x (F,G are distributor functions)
• H1: F(x) != G(x) for at least one x
Normal distribution tests: KS
• The test statistic is:
D = sup [ F(x) – G(x) ]
• It can be verified by looking at a graph!
• The graph is the empirical distributor function.

• Let us try to draw a distributor function for the following data:

controlB={0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49,


0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57}
Normal distribution tests: KS
• Now let us try to compare the following two:

controlA={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -


0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -
0.50, -0.09}

treatmentA={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63,


5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76,
3.50}
Normal distribution tests: KS – example in R
controlA <- c(0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72,
              0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09)
treatmentA <- c(-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43,
                7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50)

# Two-sample Kolmogorov-Smirnov test
ks.test(controlA, treatmentA)

# Q-Q plot of controlA against the normal distribution
qqnorm(controlA, pch = 1, frame = FALSE)
qqline(controlA, col = "steelblue", lwd = 2)

# Empirical distributor function (ECDF) of controlA
plot(ecdf(controlA))

# Locate and draw the maximum difference D between the two ECDFs
cdf1 <- ecdf(controlA)
cdf2 <- ecdf(treatmentA)
minMax <- seq(min(controlA, treatmentA), max(controlA, treatmentA),
              length.out = length(controlA))
x0 <- minMax[which(abs(cdf1(minMax) - cdf2(minMax)) == max(abs(cdf1(minMax) - cdf2(minMax))))]
y0 <- cdf1(x0)
y1 <- cdf2(x0)
plot(cdf1, verticals = TRUE, do.points = FALSE, col = "blue")
plot(cdf2, verticals = TRUE, do.points = FALSE, col = "green", add = TRUE)
points(c(x0, x0), c(y0, y1), pch = 16, col = "red")
segments(x0, y0, x0, y1, col = "red", lty = "dotted")
Normal distribution tests: Lilliefors

• Lilliefors test is in fact a correction to the standard K-S test.


• K-S test requires average and standard deviation to come from full
population, otherwise the results may be incorrect.
• There is a complex workaround, but the Lilliefors correction is much easier:
we check whether an additional statistic marks a statistically significant
difference between the distributions.
Normal distribution tests: Lilliefors
• Overall assumptions of KS test (two-sided)
• Samples are random (both groups of samples are random and independent)
• The space should be at least ordinal (can define „larger than” function) and
preferentially continuous.
• We know the average and standard deviation of the sample (estimate).
• Hypothesis:
• H0: F(x) = G(x) for all x (F,G are distributor functions)
• H1: F(x) != G(x) for at least one x
• Test: D = sup [ F(x) – G(x) ]
• The same equation, but different method to calculate threshold.
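
A minimal R sketch, assuming the add-on package nortest (not part of base R) is installed; the data vector is purely illustrative:

library(nortest)                   # provides lillie.test()
x <- rnorm(50, mean = 10, sd = 2)  # hypothetical sample; mean and sd are estimated from it
lillie.test(x)                     # Kolmogorov-Smirnov test with the Lilliefors correction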
Normal distribution tests: SW
• Shapiro-Wilk test was developed in 1965 for small samples of data (less
than 20 elements), but modern tools use a 1980s correction that works for
up to 5000 samples. It only tests normal distribution.
• The test statistic is based on linear regression task for qqplot.
• Overall assumptions of SW test
• Samples are random.
• The space should be at least ordinal (can define „larger than” function) and
preferentially continuous.
• We know the average and standard deviation of the sample (estimate).
• Hypothesis:
• H0: F(x) ~ N(x)
• H1: F(x) !~ N(x)
Normal distribution tests: SW
• The test statistic:
W = ( Σ a_i · x_(i) )² / Σ (x_i − x̄)²
where x_(i) are the ordered sample values and a_i are tabulated coefficients.

• The higher the value of the statistic, the closer the sample is to the normal
distribution.
• For really small samples this may be disturbed by insufficient data.

• In R the function is shapiro.test()


• May use additional libraries
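
A minimal sketch applying it to the controlA sample used earlier in the KS example:

controlA <- c(0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72,
              0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09)
shapiro.test(controlA)   # if the p-value > α, there is no basis to reject normality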
Black Swan problem
• Is normal distribution the best description of random events?
• Named after the discovery of Black Swans.
• An outlier in data with large impact, and unexpected unknown.
• Examples:
• COVID-19
• A ship blocking Suez Canal
• „depression” in 2008
• Some earlier economic cycles
Comparison tests: 1 group
• Interval scale: check normality first (KSL test, SW test); if the distribution is normal,
use the t-Student test for one group, otherwise the Wilcoxon ranked sign test.
• Ordinal scale: Wilcoxon ranked sign test.
• Nominal scale: Chi2 test, one-proportion tests.
Comparison tests: 2 groups
• Interval scale: check normality first (KSL test, SW test).
  • Normal, related variables: t-Student test for related groups.
  • Normal, unrelated variables: check equal variances (Fisher-Snedecor test);
    if equal, t-Student test for unrelated groups, otherwise t-Student test with
    Cochrane-Cox correction.
  • Not normal: Wilcoxon pair order test (related variables) or Mann-Whitney test
    (unrelated variables).
• Ordinal scale: Wilcoxon pair order test (related variables); Mann-Whitney test and
  Chi2 test for trends (unrelated variables).
• Nominal scale: Bowker-McNemar test and Z test for 2 proportions (related variables);
  Chi2 test (R x C or 2 x 2), Fisher test (R x C), Fisher mid-p test (2 x 2) and Z test
  for 2 proportions (unrelated variables).
Comparison tests: more than 2 groups
• Interval scale: check normality first (KSL test, SW test).
  • Normal, related variables: ANOVA for related groups.
  • Normal, unrelated variables: check equal variances (Brown-Forsythe test, Levene test);
    if equal, ANOVA for unrelated groups.
  • Not normal: Friedman ANOVA (related variables) or Kruskal-Wallis ANOVA
    (unrelated variables).
• Ordinal scale: Friedman ANOVA (related variables); Kruskal-Wallis ANOVA
  (unrelated variables).
• Nominal scale: Q-Cochran test (related variables); Chi2 test for multidimensional
  contingency tables (unrelated variables).
TTest
• TTest is a common short name for the t-Student test. In general it is
used to compare two groups – whether one has higher values than the other.
• We call it a „statistically significant” difference. It may be very small and non-
obvious for a human observer.
• This is mostly done when the same experiment (providing samples) is
conducted on two subgroups (e.g. man-woman, treatment-control).
• As the previous figures have shown, this can be done for:
• One group (compare with a constant)
• Two unrelated groups
• Two related groups (in fact: one group, but in two different contexts).
TTest for unrelated groups
• Assumptions:
• The distributions in both groups are close to normal.
• The numbers of elements in both groups are similar (some assume it may go
up to double in one group). This may use an additional chi-square test.
• In examples we will always use groups with the same number of elements.
• For Data Mining approaches we could use over/undersampling methods, possibly with
cross-validation, but this does not work with statistical approaches.
• Variances in both groups are similar. We may check this with, e.g., Levene's test.
• Hypothesis:
• H0: μ1 = μ2
• (two-sided) H1: μ1 != μ2
• (one-sided) H1: μ1 > μ2 (right) or μ1 < μ2 (left)
TTest for unrelated groups
• Statistic T increases as the averages become more different, while the variances
stay similar:
T = (x̄1 − x̄2) / ( s_p · √(1/n1 + 1/n2) ),  where s_p² = ((n1−1)·s1² + (n2−1)·s2²) / (n1+n2−2)

• The t-distribution for a given number of degrees of freedom is used
(usually: n1+n2−2).
• The p-value is the probability of the error of accepting the hypothesis that the
averages differ (so low value = different; high value = maybe similar).
TTest for unrelated groups in R
(similar for other variants)

controlA <- c(0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72,
              0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09)
treatmentA <- c(-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43,
                7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50)

var.test(controlA, treatmentA)                 # F test: are the variances similar?
t.test(controlA, treatmentA, paired = FALSE)   # t-test for two unrelated groups (Welch by default)
t.test(controlA, treatmentA, var.equal = TRUE) # classic variant assuming equal variances
TTest for related groups
• This type of test is mostly done on samples from the same elements
of a group, but in different situations (context). This is often a before-
after situation.

• Intergroup differentiation – it is important to note that there is also
outside context to consider. E.g. in a ‚before' test of some medicine, men
have results between 101 and 103 (avg 102) and women between
103 and 105 (avg 104); then a difference of 2 is significant; but if (for the
same averages) the results span 0–200, then a difference of 2 is
insignificant.
TTest for related groups, for one group
• Related: same assumptions + objects related, same hypothesis.
• The T statistic has a similar interpretation, but a different equation:
T = d̄ / ( s_d / √n ),  where d are the pairwise differences

• For one group (compare with a constant μ0):
• One assumption – the distribution is normal
• The T statistic is simple to calculate (it uses the estimated average and standard deviation):
T = (x̄ − μ0) / ( s / √n )
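
A minimal R sketch of both variants on hypothetical data (all values below are made up for illustration):

x <- c(28.1, 30.4, 29.5, 31.2, 27.8, 30.0, 29.1, 32.3)   # hypothetical measurements
t.test(x, mu = 28)                                       # one group: H0 is that the mean equals 28

before <- c(12, 15, 11, 14, 13, 16, 12, 15)              # hypothetical before/after values
after  <- c(14, 16, 12, 15, 15, 17, 13, 16)              # for the same objects
t.test(after, before, paired = TRUE)                     # related groups (paired test)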
Mann-Whitney U test
• A popular alternative to t-Student test for unrelated groups because
of less strict requirements (assumptions) on the samples:
• Data can be continuous (interval), ordered (ordinal) or even binary. We do not
need equivalence of groups, normal distribution or similar variances.

• Hypothesis
• H0: P ( X > Y ) = P ( Y > X )
• (two-sided) H1: P ( X > Y ) != P ( Y > X )
• (one-sided) H1: P ( X > Y ) > P ( Y > X ) or P ( X > Y ) < P ( Y > X )
Mann-Whitney U test
• First calculate the rank sums R1 ( x > y ) and R2 ( y > x )
• Test statistic U (needs to be computed for both R1 and R2; take the smaller U)

• For samples larger than 20 and a roughly normal distribution a normal
approximation of U can be used.
• In R: wilcox.test(controlA,treatmentA)
More non-parametric test
• There are some additional tests for checking if elements in two
samples are the same (from the same group).

• Sign test – compares the difference between elements in each sample; the numbers of
pluses and minuses are counted.

• Wilcoxon pair order test – the same differences are ordered and given a rank;
a rank table gives the threshold for identical/different.

• McNemar test and more


ANOVA tests
• ANOVA (Analysis of Variance) is a group of methods and statistical
models used to compare two or more populations based on their
average. The key point is that we do not compare each pair (which
introduces some error as well), but do everything in one test.
• Specifically, each test assumes some error rate α. If we repeat the test
multiple times, then the error rate increases uncontrollably.

• There are different types of ANOVA tests, it is mostly used as a catch-


all term.
ANOVA tests: one-way
• Usually used for comparing averages for three or more groups.
• Assumptions:
• Populations have normal distribution
• Samples are unrelated
• Variances are equal

• Hypothesis:
• H0: Samples from all groups are from populations with the same average
• H1: Average from at least one sample is significantly different than the others
ANOVA tests: one-way
• Results, example in Matlab:
• Source (of variance):
inter-, intragroup, total, etc.
• SS (sum of squared deviations
of each average from total average)
• df (number of degrees of freedom)
• MS = SS / df
• F statistic = MS (Columns) / MS (Error)
• p-value based on F distributor function.
• In R:
data <- PlantGrowth                           # built-in example dataset
res.aov <- aov(weight ~ group, data = data)   # one-way ANOVA: weight explained by group
summary(res.aov)
ANOVA tests: Kruskal-Wallis, Friedman
• If the distribution is not normal then we use non-parametric tests, here the
Kruskal-Wallis test or the Friedman test. They verify the hypothesis that
the differences between medians are insignificant.

• The assumption is that distributions are similar.

• Can be done on ordinal or interval scale.
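
A minimal R sketch on the PlantGrowth data already used in the one-way ANOVA example:

data <- PlantGrowth
kruskal.test(weight ~ group, data = data)   # Kruskal-Wallis test for unrelated groups
# For related groups (repeated measurements on the same objects) friedman.test() is used instead.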


ANOVA tests: two-way
• Two-way ANOVA is used for more complex situations, where the
comparison between groups is not done in one dimension, but in
multiple ones.
• Two-way ANOVA is also used between two groups (if there is more than one
attribute to compare them by), e.g.: factory1-technologyA,
factory1-technologyB, factory2-technologyA, factory2-technologyB.
• Assumptions:
• Normal distribution in source populations
• Unrelated samples.
• Equal variances.
• Equinumerous groups.
• ( the first two may be conditionally broken )
ANOVA tests: two way
• Hypotheses (all are tested):
• H01: population averages for first attribute are equal (one-way ANOVA for
columns)
• H02: population averages for second attribute are equal (one-way ANOVA for
rows)
• H03: both attributes do not have a combined influence on population averages
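
A minimal R sketch with a hypothetical factory/technology dataset (the data frame, its column names and all values are made up for illustration):

df <- data.frame(
  factory    = rep(c("F1", "F2"), each = 8),
  technology = rep(c("A", "B"), times = 8),
  output     = c(52, 55, 49, 58, 51, 57, 50, 56,
                 61, 64, 60, 66, 62, 65, 59, 67)
)
res <- aov(output ~ factory * technology, data = df)   # tests H01, H02 and the interaction H03
summary(res)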
TTest Assignment part 1 (class no. 7)
• Test two groups. This is the number of gestures that people make:
• nervous = 3, 3, 4, 5, 5
• average = 4, 6, 7, 9, 9
• Assume some narration. Try to describe it as a general report with broad
implications. Add actual tests (which one? what parameters?) as an appendix.

• Test two groups. This is how people describe themselves as ‚funny'
on a Likert scale (1 – not funny, 10 – super funny):
• Before30: 6,7,10,9
• After30: 5,6,2,3
• Assume some narration. Try to describe it as a general report with broad
implications. Add actual tests (which one? what parameters?) as an appendix.
TTest Assignment part 2 (class no. 8)
• Test one group. This is the height of students in a group:
• Height: 175.26,177.8,167.64,160.02,172.72, 177.8,175.26, 170.18,157.48,
160.02, 193.04, 149.86,157.48,157.48,190.5,157.48,182.88,160.02
• Assume some narration. Try to describe it as a general report with broad
implications. Add actual tests (which one? what parameters?) as an appendix.

• Test two groups. This is the speed of typing on a keyboard before and after
a fast-typing course
• dataAfter: 74.86,77.4,67.24,59.62,72.32,77.4,74.86,69.78,57.08,59.62,
92.64,49.46, 57.08,57.08,90.1,57.08
• dataBefore:
72.32,57.08,69.78,72.32,74.86,69.78,54.54,49.46,57.08,54.54,74.86,
67.24,57.08,57.08,54.54,77.4
• Assume some narration. Try to describe it as a general report with broad
implications. Add actual tests (which one? what parameters?) as an appendix.
Correlation
• Correlation is the research of dependencies between variables
• As one variable changes, the average of the other changes as well
• The change may be linear, or more complex
• It may have different strength or depend
on more than one variable

• Some correlations exist but don't mean
ANYTHING, as there is no causality
between them. In fact there may be
a hidden third variable that influences
both…
Correlation
• Correlation is a calculated measure of relations / dependency
between two attributes / variables
• ρ (rho) equal to 0 – no relations („all things being equal”)
• ρ is in range [-1,1]
• ρ > 0 the larger the values of one attribute, the larger the ones of the
second attribute
• ρ < 0 the larger the values of one attribute, the smaller the ones of
the second attribute
Correlation
• If ρ = ±1
• There exist a and b such that Y = aX + b
• If ρ = 1 then a > 0
• If ρ = -1 then a < 0

• With this we can add more details to definition – correlation


measures LINEAR dependency between X and Y
Correlation
Correlation coefficients
• Correlation coefficients
• Pearson coefficient
• Only for linear dependency; we cannot use it for ranked data
• Spearman coefficient
• Kendall coefficient

• Example for Pearson coefficient


• Data: GPA for ten students after year I and year III
• Let's check if there is a dependency between the GPAs.
I year:   3.5  4    3.8  4.6  3.9  3    3.5  3.9  4.5  4.1
III year: 4.2  3.9  3.8  4.5  4.2  3.4  3.8  3.9  4.6  4
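
A minimal R sketch for the Pearson coefficient on this data; cor() reproduces the value shown on the next slide:

year1 <- c(3.5, 4.0, 3.8, 4.6, 3.9, 3.0, 3.5, 3.9, 4.5, 4.1)
year3 <- c(4.2, 3.9, 3.8, 4.5, 4.2, 3.4, 3.8, 3.9, 4.6, 4.0)
cor(year1, year3)        # Pearson coefficient, ~0.83
cor.test(year1, year3)   # also tests H0: rho = 0
plot(year1, year3)       # scatter plot as on the following slide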
Correlation – drawn example (Pearson)
(Scatter plot „Student grades”: GPA after year I on the x axis, GPA after year III on the y axis.)

ϱ = 0.830619
Correlation – example Spearman
• Data: students ranked by grades in categories
STEM:                1  2  3  4  5
Language, art, etc.: 3  1  2  5  4

r_s = 1 − (6 · Σ d_i²) / (n · (n² − 1)),   where d_i = a_i − b_i

Data  a_i  b_i  d_i  d_i²
S1    1    3    -2   4
S2    2    1     1   1
S3    3    2     1   1
S4    4    5    -1   1
S5    5    4     1   1
SUM                  8

r_s = 1 − (6 · 8) / (5 · (25 − 1)) = 0.6
Correlation - Kendall
• We observe 2 numeric attributes A and B (here the ranks A = 1,2,3,4,5 and
B = 3,1,2,5,4 for students S1–S5).
• Connect observations into pairs and, for each attribute, mark in a sign matrix
whether each pair is ordered (+1) or not (-1).
• Count how many pairs are ordered the same way in both attributes (P) and how
many are not (Q).

τ = (P − Q) / (n·(n − 1)) = (14 − 6) / (5 · 4) = 0.4
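
A minimal R sketch for both rank coefficients on this ranking example:

stem <- c(1, 2, 3, 4, 5)
lang <- c(3, 1, 2, 5, 4)
cor(stem, lang, method = "spearman")   # Spearman coefficient, 0.6
cor(stem, lang, method = "kendall")    # Kendall tau, 0.4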
Correlation – Pearson (again)
• We can get back to a fully statistical approach for Pearson
• Hypothesis:
• H0: ρ = 0
• H1: ρ != 0
• Formally a test statistic
R = cov(X, Y) / √( var(X) · var(Y) )
• Then, if |R| > r(α, n) we reject H0; this step is often skipped, as R is the sample
estimate of ρ
Correlation -> Regression
• Following the previous example with grades: R = 0.830619 > 0.6319 = r(α; n)
for α = 0.05 and n = 10

• Formally, if ρ != 0 then E(Y|X=x) = f(x), so f is regression function.

• What is regression? The idea is to predict values for some variable


based on other (previous) values.
Y = f(X1, X2, …, Xn)
Regression
(Scatter plot „Student grades” with the fitted line y = 0.6218x + 1.6174;
x axis: GPA after year I, y axis: GPA after year III.)

• Mathematical model
Y = f(X1, X2, …, Xn) + ε
• Y – dependent variable
• X1, …, Xn – independent variables (with observed values x1, …, xn)
• ε – bias, representing the unobserved total influence of other factors

• For the same values of X1, … , Xn , it is possible to get different values of Y (so
the data itself is not represented by a function!)
• But regression can be.
(Ordinary) Least Squares
Method
• A (basic) method to calculate linear
regression function.
• Still needs „all things being equal”
to make sense.
• Variables need to be correlated, but
correlation does not imply causation!
• Regression shows dependency and can
be used for prediction of future trends.
• Suburban living example (ad. book).
(Ordinary) Least Squares
Method
• Ordinary Least Squares Method works to minimize the sum of
distances (error) between the regression line and observations
(points).
• In general this leads to a set of equations (the normal equations).

• For linear regression y = b1·x + b0 they give:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,   b0 = ȳ − b1·x̄
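
A minimal R sketch: an ordinary least squares fit on the GPA example, whose coefficients match the line y = 0.6218x + 1.6174 shown on the earlier scatter plot:

year1 <- c(3.5, 4.0, 3.8, 4.6, 3.9, 3.0, 3.5, 3.9, 4.5, 4.1)
year3 <- c(4.2, 3.9, 3.8, 4.5, 4.2, 3.4, 3.8, 3.9, 4.6, 4.0)
model <- lm(year3 ~ year1)   # least squares: year3 = b1 * year1 + b0
coef(model)                  # b0 ~ 1.6174, b1 ~ 0.6218
plot(year1, year3)
abline(model)                # draw the regression line on the scatter plot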


PART 2: DATA MINING
Data mining – simple definition
Data mining is an iterative and interactive process of discovering:
• Non-trivial
• Implicit
• Previously unknown
• Potentially useful (interesting)
information or patterns from data in large databases (data warehouses)

Sometimes also called: KDD (knowledge discovery in databases),


knowledge extraction, pattern analysis, business intelligence, etc.
DM definition in details

• Non-trivial means that they are not simple patterns


• Implicit means that they are reasoned from the data
• Previously unknown means that they are new, surprising
• Potentially useful (interesting) means that they will work not just on
previous data, but also on future ones and can be applied there.
Mining „interesting” data
• How can we measure the „attractiveness” of patterns? By postulates:
• It is understandable – can be expressed in natural language
• It works on new data (but may not be 100% accurate then)
• It can be applied somewhere
• It is entirely new or it confirms some user hypotheses

• Measure candidates:
• Objective (based on statistics and structure): support, confidence, etc.
• Subjective (based on user beliefs): unexpectedness, freshness, etc.
Methods of mining for patterns

• Searching for all patterns:


• Association rules
• Classification
• Clustering
• And more…
• Searching for only interesting patterns:
• Searching for all + filtering
• Optimization problem (may use heuristics)
Example of some methods

(Figures: clients plotted by income (x axis) against the size of the loan applied for
(y axis); X – will NOT grant loan, O – will grant loan. The same data is then shown
handled by: linear classification, linear regression, clustering, a single (cutting) rule,
a non-linear classifier, and the Nearest Neighbour algorithm.)
Some other issues to consider
• Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad-hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem
• Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining methods
Some other issues to consider
• Issues relating to the diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information
systems (WWW)
• Issues related to applications and social impacts
• Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
• Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
• Protection of data security, integrity, and privacy
Phases of full KDD process
• Understanding the problem domain – knowledge and main aims
• Creating the data collection, using data selection process
• Data cleaning and preprocessing (may take over 50% of time!) ETL, DW

• Data reduction and transformation (Curse of dimensionality!)


• Selecting data mining methods and algorithms
Data mining
• Data mining process
• Patterns rating and presentation (visualization, transformation,
removing unnecessary patterns)
• Applying the discovered knowledge
General aims in Data Mining
• Prediction: To foresee the possible future situation on the basis of previous
events. Given sales recordings from previous years can we predict what
amount of goods we need to have in stock for the forthcoming season?
• Description: What is the reason that some events occur? What are the
reasons for the cars of one producer to sell better than equal products of
other producers?
• Verification: We think that some relationship between entities occurs. Can
we check if (and how) the threat of cancer is related to environmental
conditions?
• Exception detection: There may be situations (records) in our databases
that correspond to something unusual. Is it possible to identify credit card
transactions that are in fact frauds?
Data Mining algorithm elements

• Knowledge representation
• Logical language used to describe the mined patterns, usually: logical rules,
decision trees, neural networks
• Criteria for rating the mined knowledge
• Search strategy to maximize optimization criteria
• Parameter search
• Model search (for model families)
Designing a good Data Mining process
• Each approach only fits to some practical problems
• The issue is to find a good question to ask (properly formulate the
problem)
• There are no criteria for this; experience is needed

• Example: decision trees are good for multidimensional problems, data


described with different types of attributes, but not good for data
where a polynomial would be good
More things to consider in DM
• Methodology and interaction with the user
• Discovering knowledge of different types from data
• Interaction during discovery (on different abstraction levels)
• Presentation of results using domain knowledge, visualization
• Communication (query language)
• Noisy and incomplete data
• Pattern attractiveness (again!)
• Scalability and effectiveness
• Scalability and effectiveness of data mining algorithms
• Parallel and distributed approaches
More things to consider in DM
• Different data types
• Relational and complex data types
• Discovering knowledge from heterogeneous databases or the Internet
• Applications
• Applying the discovered knowledge
• Data mining tools for specific areas
• Intelligent query system
• Process management and decision taking
• Integration of discovered knowledge with previous one
• Security, integrity and privacy of data
Association rules
• Example of rules extracted from 3 transactions (3 clients):
• Clients that buy beer will also buy peanuts
• People only buy bread if the total value of purchases is over 50 USD

Transaction ID   Product   Date       Value
1                Pizza     Saturday   48.40
1                Milk      Saturday    2.80
1                Bread     Saturday    1.50
2                Beer      Tuesday    16.20
2                Peanuts   Tuesday     8.50
3                Bread     Saturday    1.50
3                Peanuts   Saturday   25.50
3                Beer      Saturday   32.40
Association rules – another example
TID Items bought
1 A C T W
2 C D W
3 A C T W
4 A C D W
5 A C D T W
6 C D T

• What patterns to look for?


• What is often bought?
• What is bought together?
• How to entice more clients?
Association rules – another example
• Rules:
• A -> C
• C -> W
• AC -> T
• T -> ACW
• Collocation matrix:
Association rules - notation
• The set of all items I = { i1, i2, i3,… , im }
• List of all transactions D = { (tid1,T1), (tid2,T2), … }, where each
transaction is a pair (tidj,Tj) and:
• tidj – unique identifier
• Tj ⊂ I – the set of items bought in this transaction
• itemset is any subset of I
• k-itemset is any itemset with k elements
• s(X) is the number of transactions containing itemset X
Association rules - definition
• An association rule is any implication
X => Y
where X and Y are itemsets.
• We can measure attractiveness of the rule with functions:
• Support – how often both itemsets occurs in whole set
Support (X => Y ) = s( X ∪ Y )
• Confidence – if the first itemset occurs, what proportion contains both
Confidence (X => Y) = s(X ∪ Y) / s(X)
Association rules – followup on example
TID Items bought
1 A C T W
2 C D W
3 A C T W
4 A C D W
5 A C D T W
6 C D T
• Rules:
• A -> C , support 4, confidence 1
• C -> W , support 5, confidence 0.83
• AC -> T , support 4, confidence 0.75
• T -> ACW , support 3, confidence 0.75
Association rules – problem definition

• Data
• Set of all items I = { i1, i2, i3,… , im }
• Set of all transactions D = { (tid1,T1), (tid2,T2), … }
• Minimum level of support sup_min and minimum level of confidence conf_min
(constants)
• Task:
• Find all association rules with support > sup_min and confidence > conf_min
Association rules as computational task

• The number of all association rules
is O(3^m), where m is the number
of all items (in this graphic: d).
• Checking all rules is computationally impossible in real datasets
• Some other methods were proposed: sequential, parallel, etc.
Mining association rules

Most algorithms use two steps:


1. Mine frequent sets – find itemsets with support larger than
min_sup (so-called frequent itemsets)
2. Divide frequent itemsets – for each frequent itemset, find its
division into two subsets, so that it creates rules with confidence
larger than min_conf.
Mining association rules – example ctd.

TID   Items bought
1     A C T W
2     C D W
3     A C T W
4     A C D W
5     A C D T W
6     C D T

• Min_sup = 3 (or: 0.5 of the number of transactions)
• Min_conf = 0.75

• Let's try creating rules for itemset {AC}. Two possibilities:
• A => C (Confidence = 1)
• C => A (Confidence = 0.66)

Support   Itemsets
6         C
5         W, CW
4         A, D, T, AC, AW, CD, CT, ACW
3         AT, DW, TW, ACT, ATW, CDW, CTW, ACTW
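
A minimal R sketch, assuming the add-on package arules (not part of base R), that mines rules from the six toy transactions above:

library(arules)

trans <- as(list(
  c("A", "C", "T", "W"),
  c("C", "D", "W"),
  c("A", "C", "T", "W"),
  c("A", "C", "D", "W"),
  c("A", "C", "D", "T", "W"),
  c("C", "D", "T")
), "transactions")

# min_sup = 3 of 6 transactions -> support = 0.5; min_conf = 0.75
rules <- apriori(trans, parameter = list(support = 0.5, confidence = 0.75))
inspect(sort(rules, by = "confidence"))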
Mining frequent itemsets
a note to reduce computation time
Mining frequent itemsets
• Following on the previous observation:
• If {A,B} is a frequent itemset, then {A} and {B} should also be frequent.
• In general: If X is a frequent k-itemset, then all of its subsets should be
frequent (k-1)-itemsets.
• Thus a basis for an algorithm:
• Find all frequent 1-itemsets
• Generate frequent 2-itemsets from frequent 1-itemsets
• …
• Generate frequent k-itemsets from frequent (k-1)-itemsets
Mining frequent itemsets – Apriori algorithm

1. C1 := I ; F1 := all frequent 1-itemsets


2. For (k := 2 ; Fk-1 ≠ ∅ ; k++) do
3. Ck := AprioriGen(Fk-1) ; //generate new candidates
4. Fk := { X ∈ Ck : support(X) > min_sup }
5. End for
6. Return F1 ∪ F2 ∪ … ∪ Fk ;
Mining frequent itemsets – AprioriGen
• There are two main steps to AprioriGen (Fk-1) :
1. Merge : create unions X ∪ Y of such pairs of X,Y ∈ Fk-1 that have k-2
starting elements in common.
Example:
Fk-1 = { AB,AC,AD,AE,BC,BD,BE }
Ck = { ABC,ABD,ABE,ACD,ACE,ADE,BCD,BCE,BDE }
2. Reduce (cut) : delete those elements, where not all k-1 element
subsets are in Fk-1.
Example:
can remove ACE as there is no CE in Fk-1
Mining frequent itemsets – Apriori algorithm

TID   Items bought
1     A C T W
2     C D W
3     A C T W
4     A C D W
5     A C D T W
6     C D T

Min_sup = 3
Min_conf = 0.8

Frequent itemsets and candidates:
F1    A, C, D, T, W
C2    AC, AD, AT, AW, …
F2    AC, AT, AW, CD, CT, CW, DW, TW
C3    ACT, ACW, ATW, CDW, CTW
F3    ACT, ACW, ATW, CDW
C4    …
Association rules generation
• Problem: Let X be a frequent itemset. Find Y ⊂ X such that
confidence( X\Y => Y) > min_conf
• If AB=>CD is a good rule, then ABC=>D and ABD=>C are as well.

• Strategies:
• Move elements to the right
one by one
• Use AprioriGen to generate Y
Improved Apriori algorithm

• The base Apriori algorithm needs to search the entire database to


calculate support for candidates
• Improvement: new structure with filtered transactions – those that
can support current candidates
• Counting_base: new data structure, which is updated for each k
• Algorithm AprioriTid: calculates support only based on counting_base
AprioriTid algorithm
1. C1 = CB1 := I ; F1 := all frequent 1-itemsets
2. For (k := 2 ; Fk-1 ≠ ∅ ; k++) do
3. Ck := AprioriGen(Fk-1) ; //generate new candidates
4. CBk := Counting_base_generate(Ck,CBk-1) ;
5. Support_count(Ck,CBk) ;
6. Fk := { X ∈ Ck : support(X) > min_sup }
7. End for
8. Return F1 ∪ F2 ∪ … ∪ Fk ;
Generating Counting_Base for AprioriTid
• CBk connects each transaction t with a list of candidates in it
• CBk contains a list of pairs (t , { c ∈ Ck : c ⊂ t } ) = (t, Sk(t) )
• If a transaction does not contain k-element candidates, it will be
deleted from CBk and further sets
• Iterative calculation:
• CB1 = D //all transactions
• CBk = { ( i, Sk (i) ) } where Sk (i) is determined from Sk-1(i) as follows:
IF { x1,x2,x3,…,xi-2,a} ∈ Fi-1 ∩ CBk-1 and { x1,x2,x3,…,xi-2,b} ∈ Fi-1 ∩ CBk-1
THEN { x1,x2,x3,…,xi-2,a,b} ∈ Sk (i)
AprioriTid example

(Worked example shown in figures: steps 1–3.)
Another approach: AprioriHybrid
• AprioriTid searches the CBk table instead of the whole transactional
base – it is effective, if CBk is much smaller than the whole DB.
• AprioriTid is better than Apriori if:
• CBk fits in memory
• Frequent itemsets have long tail distribution
• AprioriHybrid:
• Uses Apriori for first iterations
• Starts using AprioriTid once we estimate that CBk will fit in memory
• In practical terms AprioriHybrid may be ~30% faster
than Apriori and ~60% faster than AprioriTid
• Yet another approach is to use Hash tree
Association rules mined using FP-tree
• The main idea is that the data is saved in less computationally
demanding structure of the FP-tree.
• Frequent itemsets are calculated based on the FP-tree.
• Entire database is only searched two times:
• To determine how often each item occurs
• To construct FP-tree
• As it uses „divide and conquer” method it is much faster than Apriori
Association rules mined using FP-tree
example

• Min_sup = 3

TID   Items
1     f,a,c,d,g,i,m,p
2     a,b,c,f,l,m,o
3     b,f,h,j,o
4     b,c,k,s,p
5     a,f,c,e,l,p,m,n

• First scan of database (item counts):
f:4  c:4  a:3  b:3  m:3  p:3  l:2  o:2  d:1  e:1  g:1  h:1  i:1  j:1  k:1  n:1  s:1

• After deleting infrequent items (support < 3):
f:4  c:4  a:3  b:3  m:3  p:3
Association rules mined using FP-tree
example

• Another scan of database, then for each transaction


• Remove infrequent items
• Sort by items frequency
• Add to FP-tree
TID Items
1 f,c,a,m,p
2 f,c,a,b,m
3 f,b
4 c,b,p
5 f,c,a,m,p
Association rules mined using FP-tree
example
Association rules mined using FP-tree
Searching for frequent itemsets

• The search checks for single items in the „header table” top to
bottom
• For each item i :
• Construct a collection of conditional patterns as „prefix paths” (a path from the
root to i)
• Construct a conditional FP-tree
• We use conditional patterns for i as a small database of transactions D(i)
• We build FP-tree for D(i)
• If there is only one path, then STOP and return frequent itemsets
• Otherwise repeat
Association rules mined using FP-tree
example

• Frequent itemsets for p:


• Conditional patterns for p: [f:2, c:2, a:2, m:2], [c:1, b:1]
• c is the only frequent item and conditional FP-tree has 1 root (c:3)
• Therefore {p,c} is the only frequent itemset
• Frequent itemsets for m:
• Conditional patterns for m: [f:2, c:2, a:2], [f:1, c:1, a:1, b:1]
• f, c and a are frequent items, so conditional FP-tree:

• Therefore frequent itemsets are: {f,m}, {c,m}, {a,m},


{f,c,m}, {f,a,m}, {c,a,m}, {f,c,a,m}
FP-growth algorithm(Tree, α)
Classification problems
• We have a set of objects with attributes (training set):
• One attribute is the decision attribute (class, category) of the object
• Other attributes are conditional attributes (explanatory variables, features)

• Objective: when given a new object, determine decision attribute

• By finding a dependency (function) between conditional attributes


and decision attribute.
Classification – applications in business
• Strategic analysis and management
• Predicting client interests
• Keeping clients
• Competition analysis
• Improving insurance
• Fraud detection
• Detecting insurance frauds
• Detecting suspect money transactions
• Medical insurance fraud networks detection
• And more…
Classification in two steps
• Creating a model – describing decision classes (defined by decision
attribute):
• Each object in decision table belongs to one decision class
• Classifier: an algorithm that determines the decision class based on values of
conditional attributes
• Classifiers may be described with logical formulas, decision trees,
mathematical formula.
• Using a model – by assigning new objects to a proper decision class.

• Which models are better?


Lazy and eager classification methods

• Lazy methods:
• No (fast) learning. Does not generate model (classifier)
• Classification is done directly from training data

• Eager methods:
• Long learning based on building a full model (classifier) from training data
• Classification is done with the generated model
Classification in two steps
Classifiers: k-Nearest Neighbours
• Often used for numerical attributes
• Requirements:
• Training data set
• Distance function (between objects)
• Parameter: k – the number of neighbours to consider

• Classification is based on finding the k nearest neighbours and then setting
the class of the new object based on the classes of the neighbours (e.g. majority
vote)
Classifiers: k-Nearest Neighbours
• In base approach we use Euclidean distance

• We may instead use Voronoi diagram (divide space into „nearest to 1


point” subspaces), which provides easy illustration for k=1.

• We may introduce weighted distance, e.g.


• Weight attributes by distance
• Weight votes by distance, rating
• We may use other distances, e.g. Manhattan metric
Classifiers: k-Nearest Neighbours
• A lazy method:
• Works directly on existing data, does not create classifier.
• For small k it is very sensitive to noise
• For large k it has high computational complexity
• For a continuous decision space we may calculate the average of the neighbours
• Very sensitive to distance function
• Curse of dimensionality
• Non-normalized attributes
• Non-intuitive results
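
A minimal R sketch, assuming the add-on package class, running k-NN on the built-in iris data (the split and k = 3 are arbitrary choices for illustration):

library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)                  # random training subset
train <- iris[train_idx, 1:4]
test  <- iris[-train_idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[train_idx], k = 3)  # majority vote of 3 neighbours
table(pred, iris$Species[-train_idx])                 # confusion matrix on the test part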
Classifiers: Bayesian
• Probabilistic learning: Calculate explicit probabilities for hypothesis,
among the most practical approaches to certain types of learning
problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct. Prior
knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by
their probabilities
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured
Classifiers: Bayesian
• With the learning set D and the „a posteriori” probability of hypothesis h,
we can use Bayes theorem to calculate:
P(h|D) = P(D|h) · P(h) / P(D)

• We can also calculate the MAP (maximum a posteriori) hypothesis as:
h_MAP = argmax_h P(h|D) = argmax_h P(D|h) · P(h)

but this has high computational cost and requires knowledge of many
probability distributions.
Classifiers: Bayesian
• Classification of an object described by attribute values x:

• P(x) would be the same for all hypotheses (classes)


• P(dec=c) is the probability of decision class c
• We are looking to maximize P(dec=c|x), which means that we need to
maximize P(x|dec=c) * P(dec=c)

• So the computationally problematic part is: P(x|dec=c)


Classifiers: Bayesian
• We could simplify by assuming that attributes are independent
(orthogonal). Then:
P(x1, …, xn | dec=c) = P(x1 | dec=c) · … · P(xn | dec=c)

• This classifier is called Naive Bayes and has much smaller
computational complexity.
Naive Bayes - example
• A patient had a positive test for some disease. The test is correct in
98% of true cases and 97% of negative cases. We also know that 0.8%
of the population has the disease.
Naive Bayes - example
• The patient is not sure and repeats the test:
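
A minimal R sketch of the Bayes arithmetic for both slides, assuming the two test results are conditionally independent given the disease status:

prior <- 0.008   # P(disease) in the population
sens  <- 0.98    # P(test positive | disease)
spec  <- 0.97    # P(test negative | no disease)

post1 <- sens * prior / (sens * prior + (1 - spec) * (1 - prior))
post1            # ~0.21 after one positive test

# Repeating the test: the first posterior becomes the new prior
post2 <- sens * post1 / (sens * post1 + (1 - spec) * (1 - post1))
post2            # ~0.90 after a second positive test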
Classifiers: Bayesian
• We assumed that attributes are independent, which made
calculations possible.
• But what if they are not independent?
• This assumption is very rarely satisfied, attributes are often correlated.

• Other approaches that do not assume independence of attributes:


• Bayes Net, where we combine Bayesian reasoning with relationships between
features.
• Decision trees, which only reason on one feature at a time, starting with the
most important (significant)
Classifiers: Bayesian Belief Network
(Network: FamilyHistory and Smoker → LungCancer → PositiveXRay, Dyspnea)

Conditional probability table for the LungCancer node:
          FH+,S+   FH+,S-   FH-,S+   FH-,S-
LC        0.8      0.5      0.7      0.1
Not LC    0.2      0.5      0.3      0.9
Classifiers: Bayesian Belief Network

• BBN allows for some of all variables to be conditionally independent.


• There is a graphical model of causal relationship.
• There are different possibilities, depending on our (expert) knowledge
• Given both network structure and all the variables
• Given network structure but only some variables
• When the network structure is not known in advance
Comparing classification methods
• Methods for grading classification methods:
• Prediction quality
• Total cost (different errors = different cost)
• Lift curve
• ROC curve
• Errors in real value prediction
• Error rate =
number of errors / number of test objects
• Error = wrong classification
• We may test full dataset or on random data.
Parameter tuning
• It is important that the test data is not used in any way to create the
classifier
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and
test data
• Validation data is used to optimize parameters
Classifiers (and general supervised learning)
• Once evaluation is complete, all the data can be used to build the final
classifier
• Generally, the larger the training data the better the classifier (but returns
diminish)
• The larger the test data the more accurate the error estimate
• If many (thousands) of examples are available, including several hundred
examples from each class, then a simple evaluation is sufficient
• Randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)
• Build a classifier using the training set and evaluate it using the test set.
Unbalanced data
• Sometimes, classes have very unequal frequency
• medical diagnosis: 90% healthy, 10% disease
• eCommerce: 99% don’t buy, 1% buy
• Similar situation with multiple classes
• Majority class classifier can be 97% correct, but useless
• With two classes, a good approach is to build balanced training and
test sets, and train model on a balanced set
• randomly select desired number of minority class instances
• add equal number of randomly selected majority class
• Generalization of balancing to multiple classes
• Ensure that each class is represented with approximately equal proportions in
train and test
Predicting performance
• If we estimated error rate at x %, how close is it to the true error rate?
• It depends on the amount of test data
• Similar to any random event with two possibilities (success, error may be like
head, tail in a coin toss).
• We can use statistics, treating the series of independent events as a Bernoulli
process.
• We can say: p lies within a certain specified interval with a certain specified
confidence
• Example: S=750 successes in N=1000 trials
• Estimated success rate: 75%
• How close is this to true success rate p?
• Answer: with 80% confidence p∈[73.2,76.7]
• Example: S=75 and N=100
• Estimated success rate: 75%
• With 80% confidence p ∈[69.1,80.1]
Predicting performance: confidence interval
• How to estimate this confidence interval?

• Example:
• Average height 177cm, standard deviation 19cm. Distribution N(177,19).
• Calculate probability that the height of a random person is:
• 160cm or less
• More than 160cm
• In the interval [158,196)
• Outside the interval [158,196)
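A minimal SciPy sketch of these four probabilities for the N(177, 19) distribution:

```python
from scipy.stats import norm

height = norm(loc=177, scale=19)   # N(177, 19)

p_le_160 = height.cdf(160)                 # P(X <= 160)        ~ 0.186
p_gt_160 = 1 - height.cdf(160)             # P(X > 160)         ~ 0.814
p_in = height.cdf(196) - height.cdf(158)   # P(158 <= X < 196)  ~ 0.683
p_out = 1 - p_in                           # outside [158, 196) ~ 0.317

print(p_le_160, p_gt_160, p_in, p_out)
```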
Predicting performance
• We just need to do the same for the Bernoulli distribution (for large N):
• Mean: p, variance: p(1−p)
• Observed success rate: f = S/N
• Mean of f: p, variance of f: p(1−p)/N
• Normalization of f: (f − p) / √(p(1−p)/N)
• So the confidence interval comes from: P( −z ≤ (f − p) / √(p(1−p)/N) ≤ z ) = c
• (we take the value z from the Gaussian (normal) table)
Small datasets
• We usually divide the data into 2/3 for training and 1/3 for testing
• But if the dataset is too small, those subsets may not be representative (e.g. they may not hold elements of every class)
• Stratified sampling: another version of balancing the data
• Make sure that each class is represented with approximately equal proportions in both subsets
Repeated holdout method
• Holdout estimate can be made more reliable by repeating the process
with different subsamples
• In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)
• The error rates on the different iterations are averaged to yield an overall
error rate
• This is called the repeated holdout method
• Still not optimal: the different test sets overlap
• Can we prevent overlapping?
Cross-validation
• Cross-validation avoids overlapping test sets
• First step: data is split into k subsets of equal size
• Second step: each subset in turn is used for testing and the remainder for
training
• This is called k-fold cross-validation
• Often the subsets are stratified before the cross-validation is
performed
• The error estimates are averaged to yield an overall error estimate
Cross-validation
• Randomly divide the data into k groups
• Select the first group for testing, the rest for training
• Select the second group for testing, the rest (including the first) for training
• …
Cross-validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Extensive experiments have shown that ten is the best choice to get an
accurate estimate (drawbacks include diminishing returns)
• Stratification reduces the estimate’s variance
• Improved approach: repeated stratified cross-validation
• E.g. ten-fold cross-validation is repeated ten times and results are averaged
(reduces the variance)
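A minimal scikit-learn sketch of stratified ten-fold cross-validation; the dataset and classifier are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # one accuracy per fold

print("fold accuracies:", scores.round(3))
print("estimated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```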
Cross-validation
• Leave-One-Out – a special case where the number of groups is equal to the number of cases, that is, for n objects we build the classifier n times!
• It gives the best rating of the classifier but is very computationally costly
• Stratification is not possible (there is only one sample in the test set)
Bootstrapping
• Similar to cross-validation, but where CV samples without replacement, the bootstrap approach samples the training set with replacement (so it may contain repeated elements, selected at random); the remaining elements are used for testing.
• For an n-element dataset, when creating an n-element training set, the probability for each element to be included is about 63.2%:
• A single element is NOT picked in one random draw with probability P = 1 − 1/n
• Over all n draws, the probability of never being picked is Pn = (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368
• The error estimate may be unreliable, so the process is often repeated several times and the results are averaged.
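A minimal NumPy sketch illustrating the 63.2% figure empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw an n-element bootstrap sample (with replacement) from indices 0..n-1.
sample = rng.integers(0, n, size=n)

included = np.unique(sample).size / n
print("fraction included in training set:", round(included, 3))      # ~0.632
print("fraction left out for testing:    ", round(1 - included, 3))  # ~0.368
```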
Cost Sensitive Learning
• Confusion matrix (rows: real value, columns: prediction):

              Predicted A   Predicted Not A
Real A             TP              FN
Real Not A         FP              TN

• Machine Learning methods are built to minimize FP+FN.
• Specific real-world cases may require other approaches
• Direct Marketing maximizes TP:
• Develop a model of a target from a list of candidates (one class in the dataset)
• Score all candidates and rank them
• Select the top items for further action
• It is a method used by retailers, etc.
Model-Sorted List
• We use some Model of the target class to rate each candidate
• There should be more targets (TP) near the top of the list
Gain Chart
• „Gain” is a popular name for CPH = „Cumulative Percentage of Hits”
• CPH(P,M) is the percentage of all targets in the first P (P ∈ (0,1) ) of
the list scored by model M
Lift
• Using „Gain” we can calculate „Lift”:
• Lift(P, M) = CPH(P, M) / P
• In the previous case: Lift(5%, model) = CPH(5%, model) / 5% = 21% / 5% = 4.2
• The model is 4.2 times better than random!
• (sometimes authors use the name Lift for Gain)
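A minimal sketch computing gain (CPH) and lift from model scores; the scores and labels below are synthetic, invented only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: 1000 candidates, ~5% are real targets,
# and the model score is noisy but correlated with the target.
y = (rng.random(1000) < 0.05).astype(int)
scores = y * 0.5 + rng.random(1000)

def gain_and_lift(y_true, y_score, p):
    """CPH(p, M) and Lift(p, M) for the top fraction p of the scored list."""
    order = np.argsort(-y_score)            # sort candidates by score, best first
    top = order[: int(len(y_true) * p)]
    cph = y_true[top].sum() / y_true.sum()  # share of all targets found in top p
    return cph, cph / p

cph, lift = gain_and_lift(y, scores, 0.05)
print(f"CPH(5%) = {cph:.2f}, Lift(5%) = {lift:.1f}")
```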
Lift
• Lift(P, Random) = 1
• This is the expected value; a single random ordering may deviate from it
• Lift(100%, M) = 1
• Always
• Lift < 1 when the order in the scored list is reversed
• Better model = higher lift
ROC curve
• ROC means „receiver operating characteristic”
• In signal detection it is used to show the tradeoff between the hit rate and the false alarm rate over a noisy channel
• How does it differ from a gain chart?
• The Y axis is the percentage of true positives in the sample
• The X axis is the percentage of false positives in the sample
(figure: ROC curves obtained from cross-validation vs. a single training run)
ROC curve
• Simple method of getting a ROC curve using cross-validation:
• Collect probabilities for instances in test folds
• Sort instances according to probabilities
• This method is implemented in WEKA
• There are other approaches:
• E.g. generate a ROC curve for each fold and average them
• We can use the ROC curve to select methods depending on the dataset, e.g.:
• Small, focused sample -> use method A
• Larger dataset -> use method B
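A minimal scikit-learn sketch of the cross-validation method described above: collect probabilities for instances while they sit in test folds, then build the ROC curve. The dataset and classifier are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Collect class probabilities for every instance while it is in a test fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

# x = false positive rate, y = true positive rate
fpr, tpr, thresholds = roc_curve(y, proba)
print("AUC:", round(roc_auc_score(y, proba), 3))
```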
Cost-sensitive learning
• In real situations false positives and false negatives often have different costs, e.g.:
• Medical diagnostic tests: does X have leukemia?
• Loan decisions: approve mortgage for X?
• Web mining: will X click on this link?
• Promotional mailing: will X buy the product?
• Methods are usually not cost-sensitive. Simple methods to help:
• Re-sampling of instances according to costs
• Weighting of instances according to costs
• Some schemes are inherently cost-sensitive, e.g. naive Bayes
Classifiers: decision trees
• A tree structure where:
• Leaves contain the classification decisions for objects
• Other nodes contain tests on attribute values
• There are as many outgoing branches as there are possible test results
• It looks like a computer program with conditional instructions (if-then-elseif)
• Finding an optimal decision tree is NP-hard!
Classifiers: decision trees
• There are two classes of test functions:
• Univariate trees (a test uses a single attribute)
• Multivariate trees (a test may combine several attributes)
Classifiers: decision trees
• Example test function for a continuous attribute:
• Inequality test (a < c)
• Example test functions for other attributes:
• Equality test
• „Member of” test (a ∈ V)
• „Value copy” test
Classifiers: decision trees
• How do we determine which tree is best?
• Quality determined by tree size – smaller is better:
• Fewer nodes
• Lower height
• Fewer leaves (but the minimum number of leaves = the number of classes)
• Quality determined by the classification error on the training set
• Quality determined by the classification error on the test set
• Example of such a measure:
Classifiers: decision trees
• Constructing an optimal decision tree is NP-hard
• Data (input):
• Decision table S
• Set of test functions TEST
• Quality criterion Q
• Look for (output): decision tree T with highest quality Q(T)
Classifiers: decision trees (basic algorithm)
Notation: S – decision table; TEST – set of test functions; Q – quality criterion; T – tree (initially empty); U – set of all objects; Rt – set of all results (values) of test t

Build_tree(U, S, T):
1. If stop_criterion(U, S) = true then:
   1. T.label = class(U, S)
   2. Return
2. t = select_test(U, TEST)
3. T.test = t
4. For v ∈ Rt:
   1. Uv = { x ∈ U : t(x) = v }
   2. Create new subtree T'
   3. T.branch(v) = T'
   4. Build_tree(Uv, S, T')
Classifiers: decision trees (basic algorithm)
• stop_criterion(U, S) – stops the tree construction (returns true) if the set of objects U is empty, contains only elements of one class, or cannot be divided by any test
• class(U, S) – selects the label by plurality vote, i.e. the decision class with the most representatives in U:
  class(U, S) = argmax_{c ∈ V_dec} P(dec = c)
• select_test(U, TEST) uses one of several heuristic methods to score tests (a minimal sketch of the whole procedure is given below)…
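A minimal Python sketch of the basic algorithm above, using entropy-based information gain as the select_test heuristic and simple equality tests on symbolic attributes; all names and the tiny dataset are illustrative, not taken from any particular library:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """rows: list of dicts (attribute -> value); labels: decision classes."""
    # stop_criterion: only one class left, or no attributes left to test
    if len(set(labels)) <= 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # class(U, S): plurality vote

    # select_test: pick the attribute with the highest information gain
    def gain(attr):
        n = len(labels)
        split = {}
        for row, lab in zip(rows, labels):
            split.setdefault(row[attr], []).append(lab)
        return entropy(labels) - sum(
            len(part) / n * entropy(part) for part in split.values())

    best = max(attributes, key=gain)

    # one branch per observed value of the chosen attribute
    tree = {"test": best, "branches": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["branches"][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attributes if a != best])
    return tree

# Tiny illustrative dataset
rows = [{"outlook": "sunny", "windy": False}, {"outlook": "rainy", "windy": True},
        {"outlook": "sunny", "windy": True},  {"outlook": "rainy", "windy": False}]
labels = ["yes", "no", "no", "yes"]
print(build_tree(rows, labels, ["outlook", "windy"]))
```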
Comparing tests
• Each set of objects X can be divided by decision class: X = C1 ∪ C2 ∪ … ∪ Cd
• The vector (p1, …, pd), where pi = |Ci| / |X|, is called the distribution of classes in X
Comparing tests
• Both functions (conflict(X) and entropy(X)) have their highest value when the distribution of classes in X is uniform, and their lowest value when there is only one class.
• For two decision classes:
Comparing tests
• Let t define a division of X into subsets X1 ∪ X2 ∪ … ∪ Xr. We can use measures to compare and score tests, e.g.:
• The number of pairs of objects distinguished by test t: Disc(t, X)
• Information gain: Gain(t, X)
• The Gini index
• The higher the value, the better the test (a minimal sketch of these measures follows below).
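A minimal sketch of the entropy, information-gain and Gini measures for a candidate split; these are the standard textbook definitions, which may differ in detail from the exact formulas on the slide:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """Gain(t, X): entropy before the split minus the weighted entropy after it."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(child) / n * entropy(child) for child in child_label_lists)

# Example: a test t that splits X = [A, A, A, B, B, B] into two pure subsets
X = ["A", "A", "A", "B", "B", "B"]
split = [["A", "A", "A"], ["B", "B", "B"]]   # a perfect split
print(entropy(X), gini(X))                   # 1.0  0.5
print(information_gain(X, split))            # 1.0
```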
Comparing tests
• If t′ divides the space into smaller parts than t, then Gain(t′, X) > Gain(t, X) and Disc(t′, X) > Disc(t, X) (both measures are monotonic in this sense).
• Score values for a test t are small if the distributions of classes in the resulting subsets are similar.
• Gain ratio is a better score for a test than plain Gain: GainRatio(t, X) = Gain(t, X) / iv(t, X), where iv(t, X) is the information value of the test.
• There are other scoring methods: the Chi-square test, etc.
Tree pruning
• Like many other algorithms, decision trees may be overfitted
• Solutions:
• We may shorten the description (the attribute list) and accept a lower quality of classification on the training set
• We may change a subtree into a single leaf or a smaller tree (pruning)
• We do this when the new leaf is not worse than the existing subtree (for data outside the training set)
• We test this on an additional dataset – the pruning set.
Tree pruning
Notation: T – tree; P – distribution of classes in the pruning set

Prune(T, P):
1. For all nodes n ∈ T:
   1. Create a new leaf l labeled with the category dominating in Pn
   2. If leaf l is not worse than the subtree rooted in n (based on set P):
      1. Exchange the subtree with l
2. Return T
Tree pruning
• Pruning criterion:
• Let et(l) be the classification error of leaf l
• Let et(n) be the classification error of the subtree rooted in n
• We prune if et(l) ≤ et(n) + ε, where ε is usually a small tolerance
Decision trees: case of missing values
Learning:
• We can reduce the test-scoring criterion by a factor of (no. of objects with unknown values / no. of all objects)
• We can fill unknown attribute values with the value most common among the objects connected with the current node
• We can fill unknown attribute values with a weighted average of the known values.
Decision trees: case of missing values
Classification:
• We can stop the classification in the current node and return its dominant label
• We can fill unknown values using one of the approaches used for training
• We can calculate a probability distribution of the classes based on the subtree
Decision trees
• Each node is connected with a subset of the data, which requires memory
• Looking for the best division requires reordering the data multiple times, which is computationally expensive (especially for attributes with many distinct values).
• There are other algorithms for constructing decision trees, e.g. Sprint:
• Works with some of the data kept on hard-drive storage
• Uses pre-ordering to speed up calculations on real-valued attributes
• The data is ordered only once, before the calculation
• Can easily be made parallel
Decision trees: Sprint algorithm
• Each attribute has its own list of values
• Each element of the list has three parts:
• The attribute value
• The class number i
• The number of the object in the dataset (rid)
• Real attributes are ordered (once, on creation)
• At the start the lists are in the tree root
• When child nodes are created, the lists are divided and connected with the proper children
• The lists are backed up on the drive
Decision trees: Sprint algorithm
• Sprint uses:
• The Gini index for scoring division tests
• Inequality tests (a < c) for real-valued attributes
• Set-membership tests (a ∈ V) for symbolic attributes
• For real attributes two histograms are used:
• Cbelow: for data below the threshold
• Cabove: for data above the threshold
• For symbolic attributes there is a histogram called the count matrix
• These histograms show the distribution of the decision classes in the set
Decision trees: Sprint algorithm example
(figures: the division point on the ordered value list of a real attribute; the list of CarType values and its count matrix for a symbolic attribute)
• Symbolic attribute:
• Determine the division (count) matrix of objects in each node
• Using an approximate algorithm, determine a value set V ⊆ Da such that the test (a ∈ V) is optimal
Decision trees: Sprint algorithm
• The overall idea of the division:
• Each list is divided into two
• General algorithm:
• Explore the next record on the list
• For each record, determine (based on the hash table) the subtree it should be moved to
• If the attribute has a test assigned: divide according to the test and add the rid to a hash table
• Otherwise divide by a hash of the rid (object number)
Decision trees: Sprint algorithm
1. Divide the set of tested-attribute values into small parts, so that the hash table fits in memory
2. For each part:
   1. Divide the records of the tested attribute into the proper subtrees
   2. Build the hash table
   3. Explore the next record of each non-tested attribute and move it to the proper subtree if the record is in the hash table
3. If all records were assigned to subtrees, stop (go deeper); otherwise go to step 2
Decision trees: Sprint algorithm (parallel)
• The lists of attribute values are divided equally
• Real (numerical) attributes: order them and divide into equal parts
• Other attributes: divide by rid
• Each server/processor/thread operates on one part of the list
Decision trees: Sprint algorithm (parallel)
• Idea of searching for the best division for a real attribute:
• Each processor has its own range of attribute values
• Each processor initializes Cbelow and Cabove with knowledge of the class distribution on the other processors
• Each processor has its own list and determines a local threshold
• The processors communicate to find the globally best cut
• Idea of searching for the best division for symbolic attributes:
• Each processor builds a local count matrix and sends the results to a central processor
• The central processor calculates the global count matrix
• The processors determine the best cut based on the global count matrix
Decision trees: Sprint algorithm (parallel)
(figure: the attribute value lists split between Processor 0 and Processor 1)
Decision trees: Sprint algorithm (parallel)
• Division of the attribute that contains the test: each processor determines the subtrees where the records from its local list will be put
• The processors share the <{rid}, subtree> information
• Division of the other attributes: after gathering all the information, each processor builds the hash table and divides the other attributes
Decision trees: Sprint algorithm
• Disadvantages of the Sprint algorithm:
• Additional data structures
• Inefficient if attributes have many distinct values
• Not able to use database tools (the basic DT algorithm can use powerful = fast SQL commands)
Clustering (vs classification)
• Classification is supervised learning (a teacher points out the correct answers): the objects in the training set already have a decision class, and we must determine it for new objects
• Clustering is unsupervised learning (there is no teacher): there are various objects that may be organized into groups, and we need to determine those groups
• Clustering is the problem of grouping together objects with similar properties
• A cluster is a group or class of similar objects (created by clustering)
Clustering
• Applications in medicine
• Grouping diseases
• Grouping symptoms for patients
• Applications in archeology
• Taxonomy of tools
• Applications in Text Mining
• Grouping similar documents for next queries
Clustering
• Data we have:
• Desired number of clusters k
• Set of objects P
• Distance function d
• Objective function F
• Task: divide set P into k clusters, such that it maximizes F
Clustering
• In unsupervised learning we do not know the decision classes and must propose clusters. Methods may produce:
• Overlapping or disjoint clusters
• Hierarchical or flat clusters
• Deterministic or probabilistic clusters (methods)
• Some basic methods:
• K-means: clusters are flat and deterministic
• Hierarchical clusterer: clusters are organized in a tree and deterministic
• Grouping based on probability: clusters are flat and probabilistic
Clustering – examples of distance functions
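A minimal sketch of three commonly used distance functions (standard definitions; the exact set shown on the slide may differ):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # 1 - cosine similarity; small when the vectors point in the same direction
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
```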
Hierarchical clusterer
• We build a tree of clusters for a set of n objects. The quality of a cluster is the sum of distances between the objects in the cluster.
1. Each object starts as a 1-object cluster. These are the leaves of the tree.
2. Do:
   1. Find the pair of clusters with the smallest distance between them
   2. Join those clusters into one. It is now their parent in the tree.
3. Until k clusters remain (usually k = 1 – the root)
Hierarchical clusterer
• Distance between clusters:
• Single linkage (nearest neighbor)
• Complete linkage (furthest neighbor)
• Unweighted pair-group average
• Weighted pair-group average
• Unweighted pair-group centroid
• Weighted pair-group centroid (median)
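A minimal SciPy sketch of agglomerative (hierarchical) clustering with a choice of linkage; the data is synthetic and invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic groups of 2-D points
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

# 'single' = nearest neighbor, 'complete' = furthest neighbor,
# 'average' = unweighted pair-group average
Z = linkage(X, method="average")

# Cut the tree so that k = 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```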
Hierarchical clusterer
(figure: dendrogram cut at the level where k clusters are found)
K-means
• The original method (1967) considers N points x1, …, xN in the space Rⁿ and k < N, and looks for k points c1, …, ck (called representatives or centroids) that minimize the sum of distances from the points to their nearest centroid.
• We are looking for the „centers” of the clusters.
• The clusters are determined after that: a point belongs to the cluster of ci when its distance to ci is smaller than to any other centroid.
K-means example
• A small 100x100 picture takes 29.3 kB because a unique color is coded on each pixel (N = 10 000).
• We can represent it with only k = 32 colors, each pixel coded with 5 bits.
• The picture is not much worse (in this case!), but the size is reduced to ~6.9 kB.
• (this kind of compression is also sometimes done with unsupervised neural networks)
K-means algorithm
• Find k centers such that the sum of distances of points to their nearest centroid is minimal.
1. Select k centroids at random.
2. Assign each object to the nearest centroid.
3. Determine new centroids („center of mass” of the points in each cluster): ci = (1/|Ci|) Σx∈Ci x
4. Repeat from step 2 until the change in the centroids is smaller than a threshold.
K-means
• The quality of the clusters depends on the starting centroids.
• The algorithm may find a local minimum.
• To avoid this, we may run the algorithm several times with different random starting centroids.
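A minimal NumPy sketch of the algorithm described above (synthetic data; in practice one would typically use a library implementation instead):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select k starting centroids at random (from the data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each object to the nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 3. New centroids = center of mass of the points in each cluster
        #    (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(50, 2)) for m in (0, 3, 6)])
centroids, labels = kmeans(X, k=3)
print(np.round(centroids, 2))
```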
K-means example
• Randomly select 3 starting centroids
K-means example
• Assign each point to nearest centroid
K-means example
• Update centroids
K-means example
• Assign each point to the nearest centroid (again) – some points change clusters
K-means example
• Update centroids
K-means example
• (updated centroids, next step: reassign points)
Grouping based on probability
• Objects are in some cluster with some probability.
• Each cluster is described by one probability distribution.
• We assume all distributions are normal, described by expected value 𝜇 and
standard deviation 𝜎.
Grouping based on probability
• Object x is in cluster A with probability
• An algorithm that can be used here is EM: Expectation-Maximization.
• Aim: for each real attribute, find the parameters of k distributions (the descriptions of the k clusters).
EM algorithm
• For simplicity: 2 clusters and 1 attribute.
• The idea is based on an adaptation of the k-means algorithm:
1. Randomly select the parameters of the 2 distributions.
2. Calculate the expected degree to which each object belongs to each cluster.
3. Determine new distribution parameters that maximize the quality function „likelihood”.
4. If the increase in quality is over a threshold, repeat from step 2.
EM algorithm
• Likelihood function for clusters A and B:
• With wi being the degree to which the i-th object belongs to cluster A:
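A minimal scikit-learn sketch of probabilistic grouping with a Gaussian mixture fitted by EM; the data is synthetic, and sklearn's GaussianMixture is used here in place of the hand-written EM loop described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# One attribute, two underlying normal distributions
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 300)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(x)

print("means:", gm.means_.ravel().round(2))
print("std devs:", np.sqrt(gm.covariances_).ravel().round(2))
# Probabilistic membership: each object belongs to each cluster with some probability
print("P(cluster | x=2.5):", gm.predict_proba([[2.5]]).round(3))
```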
PART 3: Graphs
Social Networks in Data Science
• What is a Social Network?
• Social Network as a Graph
• Person as node, Friend (Follower) as Edge
• Alternatively: Person, Activity (Interest, Item) as nodes of different types
Social Networks in Data Science
• Network types:
• Online social network
• Facebook vs Twitter vs other examples
• Telephone network
• Real world communication network
• Email network
• Enron
• Collaboration network
• Research Gate, Groupon, etc
Social Networks in Data Science
• Distance measures
• Edges used to count distance
• Non-people nodes used to count distance
• No connection is infinite distance
• Clustering by distance
• Close nodes may be a community
• Betweenness measure
• Find the shortest paths between each pair of nodes
• Count how many of those shortest paths this node (or edge) lies on
• Betweenness is low inside communities and high on the links between them
Social Networks in Data Science
• Cliques
• Everybody is connected with everybody
• Another sign of a community
• Counting triangles
• Efficient algorithms exist, because the triangle is the smallest clique
• Diameter of a graph
• The largest distance between nodes (the longest of the shortest paths)
• The small-world („six degrees of separation”) observation suggests that the typical distance between any two people (for the entire planet!) is about 6.
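A minimal NetworkX sketch of these measures on a small example graph; the graph itself is made up for illustration:

```python
import networkx as nx

# A toy social network: two tight groups joined by a single "bridge" edge
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),     # community 1 (a triangle)
                  ("D", "E"), ("E", "F"), ("D", "F"),     # community 2 (a triangle)
                  ("C", "D")])                            # bridge between them

print("distances from A:", nx.single_source_shortest_path_length(G, "A"))
print("betweenness:", {n: round(b, 2) for n, b in nx.betweenness_centrality(G).items()})
print("triangles per node:", nx.triangles(G))
print("diameter:", nx.diameter(G))
```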
Social Networks in Data Science
• There are other approaches to Social Networks, for example finding influential nodes.
• Multiple methods, some based on epidemic models:
• SI model: Susceptible, Infected
• SIR model: Susceptible, Infected, Recovered
• SIRS model: Susceptible, Infected, Recovered, Susceptible
• And many more variants
• Other approaches are for example Influence Maximization.
Social Networks in Data Science
Cambridge Analytica case
• There are currently very high concerns regarding privacy of data on
Social Networks.
• The reason is the Cambridge Analytica scandal and the specifics of
how they used their data.
• Initially, there was a Facebook app, which gave users details about their profiles – but also gathered this data for internal use.
• Due to a hole in Facebook’s policy, it was also able to legally gather data about those users’ friends.
• With 270,000 installations, they had access to 50 million user accounts!
Social Networks in Data Science
Cambridge Analytica case
• The data was transferred to the CA company ca. 2014-2015. Illegally.
• It was used in a US election campaign in 2015.
• Previously similar data on smaller scale was used in 2012, but this was all
transparent and friend data was not used.
• Facebook closed the loophole about friend data afterwards.
• They also requested that the obtained data was deleted.
• It wasn’t.
• Instead it was used in US election campaign in 2016.
Social Networks in Data Science
Cambridge Analytica case
• The OCEAN classification of personality characteristics:
• Openness – do you like new experiences?
• Conscientiousness – how tidy do you keep your life?
• Extraversion – how sociable are you?
• Agreeableness – do you put others’ needs ahead of your own?
• Neuroticism – how anxious are you?
• Any other complex model of human behaviour could have been used
Social Networks in Data Science
Cambridge Analytica case
• The approach used:
• Classify the persuadable people as distinct from the others
• Determine the OCEAN characteristics of the persuadable people
• Prepare campaigns for the different groups of persuadable people
• Start the campaign on multiple channels.
• This was specifically done by the SCL company, which got its data from Cambridge Analytica…
• Each Like on Facebook was one attribute
• They had data on 50 million people (voters?)
Social Networks in Data Science
Cambridge Analytica case
• Targeted content in Trump’s campaign (several slides of example ad screenshots)
Geospatial Data
• GIS – Geographic Information Systems is a catch-all term for multiple
tools using maps, map data and other spatial information.
• The most common approach is putting a layer of data as colors or icons on the
map of some location.
• Software must be able to combine map data (location + value) with the
displayed map.
• Not dissimilar to weather broadcasts on TV and may be used as a basis.
• Many Business Intelligence applications have this option built in as default.
• Another important layer is time.
• Aerial photos may substitute for the map (example: an orthophotomap)
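A minimal sketch of the "layer of data on a map" idea using GeoPandas; the file name and the column name are placeholders, and this assumes a dataset with polygon geometries and one numeric value per region:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Placeholder file: any vector dataset (e.g. shapefile/GeoJSON) with one row per region
regions = gpd.read_file("regions.geojson")          # hypothetical path

# Choropleth layer: color each region by a numeric column (hypothetical name)
ax = regions.plot(column="population_density", legend=True, cmap="OrRd")
ax.set_title("Data layer drawn on top of map geometry")
plt.show()
```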
Geospatial Data
• Different types of presentation & data
Geospatial Data
• Toronto example (book)
Geospatial Data
• Chicago Example (book)