Dr Marcin Maleszka
Wroclaw University of Science and Technology, Poland
for
International University, Vietnam
Introduction assignment (for class no. 3)
• Using any methods try to find info on me:
• What is my home address / what do I drive (plates number)?
• Alternatively: how would you do this for a person in Vietnam?
• Could the methods used be automated for a large number of people?
• A common method for a single target (the security-specialist approach) would be to ask me or
someone else. In Data Science we need to look for data about thousands/millions of
people. Finding me is just an example; we need a method to find a TYPE of person.
• When doing any assignments remember to:
• Give your name / student ID
• Provide final answer and steps to solution
• Be brief but precise
• This task illustrates one of the first problems a Data Scientist
encounters – where to get the data!
What is „Data”?
• Organization by complexity of concepts:
• Data – raw numbers
• Information – interpretation added
• Knowledge – rules added / pattern extracted
• (Wisdom? Trust? Intelligence?)
• Data Scientist is a new catch-all term for an old concept; it may cover:
• Statisticians
• (and mathematical positions overall)
• Risk Analysts
• (and Analyst positions overall)
• Business Intelligence specialists
• Data Warehouse specialists
• But there are (small) differences, and similar positions exist outside the DS name
https://www.datanix.ai/post/iipgh-data-science-webinar-5
http://nirvacana.com/thoughts/2013/07/08/becoming-a-data-scientist/
https://medium.com/hackernoon/navigating-the-data-science-career-landscape-db746a61ac62
Data Science presentation
• Note: a Data Warehouse is not a type of database. Instead, both are types of data collections.
Data warehouse and data separation
Data profiling
• Candidate keys
• Amount of missing data
• Distribution of data
• Unique values
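A minimal profiling sketch of these checks in base R (illustrated on the built-in iris data, purely as an example):

# basic data profiling
df <- iris
colSums(is.na(df))                          # amount of missing data per column
sapply(df, function(x) length(unique(x)))   # unique values per column (candidate-key hint)
summary(df)                                 # distribution of each column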
ETL
• In most basic terms: download data from the source and load it into
the Data Warehouse
• Copying (duplicating) data between data collections (databases)
• Data is Extracted from the OLTP database, Transformed to fit the DW
schema and Loaded into the DW
• A copy of the source data may (or may not) be stored on the DW
hardware
• The theoretical aspects of design are more important than the eventual
implementation.
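A minimal ETL sketch in R, assuming a hypothetical CSV export from the OLTP system and an SQLite-based warehouse (the file and table names are made up for illustration):

library(DBI); library(RSQLite)
raw <- read.csv("sales_oltp_export.csv")            # Extract: read the OLTP export (hypothetical file)
raw$amount <- as.numeric(raw$amount)                 # Transform: fix types to match the DW schema
raw$load_date <- Sys.Date()                          #            add an audit column
dw <- dbConnect(SQLite(), "dw.sqlite")               # Load: write into the warehouse
dbWriteTable(dw, "fact_sales", raw, append = TRUE)
dbDisconnect(dw)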
ETL and ELT
ETL vs ELT
ETL:
• Extract – duplicating the data into the temporary staging area
• Needs another server
• Transform – preparing the model and transforming data to the desired form (schema-on-write)
• Load

ELT:
• Extract – preparing the data from the source in their original form (schema-on-read)
• Load – duplicating the raw data to the DW server (into a Data Lake)
• Transform – using methods working with non-relational data or data in different formats and structures
PART 1: Statistics
Probability
• What is the chance of rolling a 2 on a die?
• Before even trying to answer you should narrow down
the boundaries of the problem!
• Teachers often assume 6-sided dice (d6)
• Tabletop gamers will commonly use d4,d6,d8,d10,d20.
Example (P(x≤160))
N(177,19): p1 = P(x≤160) = 0.185464
Example (P(x>160))
N(177,19): p2 = P(x>160) = 1 – p1 = 0.814536
Example (P(158<x<196))
N(177,19): p3 = P(158≤x<196) = 0.629072
Example (P(x<158 || x≥196))
N(177,19): p4 = P(x<158 or x≥196) = 1 – p3 = 0.370928
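Such probabilities can be computed directly in R with pnorm; a minimal sketch (exact values depend on the bounds used):

# normal-distribution probabilities for N(177, 19)
p1 <- pnorm(160, mean = 177, sd = 19)              # P(x <= 160)
p2 <- 1 - p1                                       # P(x > 160)
p3 <- pnorm(196, 177, 19) - pnorm(158, 177, 19)    # P(158 <= x < 196)
p4 <- 1 - p3                                       # outside the interval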
Long tail (fat tail)
Boxplots
(Figure: a boxplot marking the min value, Q1 lower quantile, median, Q3 upper quantile and max value, shown next to a plot of the average with standard deviation and 95% confidence interval.)
Boxplot Example
• Using data {1,1,1, 2,2,2,2, 3,3, 100} calculate
• Average
• Median
• Q1 quantile
• Q3 quantile
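A quick way to verify your answers in R (note that quantile conventions differ slightly between textbooks and R's default):

x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 100)
mean(x)                       # average (strongly pulled up by the outlier 100)
median(x)                     # median
quantile(x, c(0.25, 0.75))    # Q1 and Q3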
• Let’s take a random sample of 100 runs of the bus. The average time
is x̄ = 31.5 minutes and the standard deviation is s = 5 minutes. Now we
construct a 95% confidence interval for the average. For this large
sample size we use the normal distribution.
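The resulting interval, computed in R with the usual normal-approximation formula x̄ ± z·s/√n:

xbar <- 31.5; s <- 5; n <- 100
xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)   # roughly 30.52 to 32.48 minutes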
Statistical Tests
• A statistical hypothesis is any assumption about the distribution of
the population (its general function or its parameters).
• We check if the hypothesis is true based on a random sample.
• A statistical test is a method that allows us to reject or „accept” a
statistical hypothesis. It is a decision rule.
Comparison tests: 1 group
• Is the distribution normal? (check with the KSL test, SW test)
• Yes: t-Student test for one group
• No: Wilcoxon ranked sign test, Chi2 test, one-proportion tests
Comparison tests: 2 groups
• Interval scale: is the distribution normal? (KSL test, SW test)
• No: Mann-Whitney test
• Yes: related variables?
• Yes: t-Student test for related groups
• No: equal variances? (Fisher-Snedecor test)
• Yes: t-Student test for unrelated groups
• No: t-Student test with Cochrane-Cox correction
• Ordinal scale: related variables?
• Yes: Wilcoxon pair order test
• No: Mann-Whitney test, Chi-square test for trends
• Nominal scale: related variables?
• Yes: Bowker-McNemar test, Z test for 2 proportions
• No: Chi2 test (R × C or 2 × 2), Fisher test (R × C), Fisher mid-p test (2 × 2), Z test for 2 proportions
Comparison tests: more than 2 groups
• Interval scale: equal variances? (Brown-Forsythe test, Levene test)
• Yes: ANOVA for unrelated groups
• No: Kruskal-Wallis ANOVA
• Ordinal scale: Kruskal-Wallis ANOVA
TTest
• TTest is a common short name for the t-Student test. In general it is
used to compare two groups – to check whether one has higher values than the other.
• We call it a „statistically significant” difference. It may be very small and non-
obvious to a human observer.
• This is mostly done when the same experiment (providing samples) is
conducted on two subgroups (e.g. men-women, treatment-control).
• As the previous figures have shown, this can be done for:
• One group (compare with a constant)
• Two unrelated groups
• Two related groups (in fact: one group, but in two different contexts).
TTest for unrelated groups
• Assumptions:
• The distributions in both groups are close to normal.
• The numbers of elements in both groups are similar (some assume one group may
be up to twice as large). This may be checked with an additional chi-square test.
• In examples we will always use groups with the same number of elements.
• For Data Mining approaches we could use over/undersampling methods, possibly with
cross-validation, but this does not work with statistical approaches.
• Variances in both groups are similar. We may check this with, e.g., Levene’s test.
• Hypothesis:
• H0: μ1 = μ2
• (two-sided) H1: μ1 != μ2
• (one-sided) H1: μ1 > μ2 (right) or μ1 < μ2 (left)
TTest for unrelated groups
• The T statistic increases as the group averages become more different (when variances are
similar):
controlA <- c(0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72,
              0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09)
treatmentA <- c(-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43,
                7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50)
var.test(controlA, treatmentA)
t.test(controlA, treatmentA, paired = FALSE)
t.test(controlA, treatmentA, var.equal = TRUE)
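For reference, a sketch of the statistic these calls compute under the equal-variance assumption (the default Welch variant of t.test uses per-group variances instead of the pooled one):

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} $$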
TTest for related groups
• This type of test is mostly done on samples from the same elements
of a group, but in different situations (context). This is often a before-
after situation.
• Hypothesis
• H0: P ( X > Y ) = P ( Y > X )
• (two-sided) H1: P ( X > Y ) != P ( Y > X )
• (one-sided) H1: P ( X > Y ) > P ( Y > X ) or P ( X > Y ) < P ( Y > X )
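A minimal sketch of the related-groups (paired) test in R, with hypothetical before/after measurements (not data from the lecture):

before <- c(12.1, 10.4, 11.8, 13.0, 9.7, 10.9, 12.5, 11.2)
after  <- c(12.9, 10.8, 12.4, 13.5, 10.1, 11.6, 12.8, 11.9)
# paired t-test: does "after" tend to be larger than "before"?
t.test(after, before, paired = TRUE, alternative = "greater")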
Mann-Whitney U test
• First calculate the number of ranks R1 (x > y) and R2 (y > x)
• Test statistic U (needs to be computed for both R1 and R2; take the smaller U)
• In R: wilcox.test(controlA,treatmentA)
More non-parametric tests
• There are some additional tests for checking if elements in two
samples are the same (from the same group).
• Wilcoxon pair order test – the paired differences are ordered and given a rank;
a rank table gives the threshold for identical/different.
• Hypothesis:
• H0: Samples from all groups are from populations with the same average
• H1: Average from at least one sample is significantly different than the others
ANOVA tests: one-way
• Results, example in Matlab:
• Source (of variance):
inter-, intragroup, total, etc.
• SS (sum of squared deviations
of each average from total average)
• df (number of degrees of freedom)
• MS = SS / df
• F statistic = MS (Columns) / MS (Error)
• p-value based on the F distribution function.
• In R:
data <- PlantGrowth
res.aov <- aov(weight ~ group, data = data)
summary(res.aov)
ANOVA tests: Kruskal-Wallis, Friedman
• If the distribution is not normal then we use non-parametric tests,
here the Kruskal-Wallis test or the Friedman test. They verify the hypothesis
that the differences between medians are insignificant.
• Test two groups. These are how people describe themselves as ‚funny’
on a Likert scale (1 – not funny, 10 – super funny):
• Before30: 6,7,10,9
• After30: 5,6,2,3
• Assume some narration. Try to describe it as a general report with broad
implications. Add actual tests (which one? what parameters?) as an appendix.
TTest Assignment part 2 (class no. 8)
• Test one group. This is the height of students in a group:
• Height: 175.26,177.8,167.64,160.02,172.72, 177.8,175.26, 170.18,157.48,
160.02, 193.04, 149.86,157.48,157.48,190.5,157.48,182.88,160.02
• Assume some narration. Try to describe it as a general report with broad
implications. Add actual tests (which one? what parameters?) as an appendix.
• Test two groups. This is speed of writing on keyboard before and after
a fast-typing course
• dataAfter: 74.86,77.4,67.24,59.62,72.32,77.4,74.86,69.78,57.08,59.62,
92.64,49.46, 57.08,57.08,90.1,57.08
• dataBefore:
72.32,57.08,69.78,72.32,74.86,69.78,54.54,49.46,57.08,54.54,74.86,
67.24,57.08,57.08,54.54,77.4
• Assume some narration. Try to describe it as a general report with broad
implications. Add actual tests (which one? what parameters?) as an appendix.
Correlation
• Correlation is the study of dependencies between variables
• As one variable changes, the average of the other changes as well
• The change may be linear, or more complex
• It may have different strength or depend
on more than one variable
(Figure: scatter plot of grade averages „After year I” vs „After year III”, with correlation 𝜚 = 0.830619.)
Correlation – example Spearman
• Data: students ranked by grades in categories
STEM 1 2 3 4 5
Language, art., etc. 3 1 2 5 4
$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \qquad d_i = a_i - b_i $$
Data  a_i  b_i  d_i  d_i^2
S1     1    3   -2    4
S2     2    1    1    1
S3     3    2    1    1
S4     4    5   -1    1
S5     5    4    1    1
SUM                   8

$$ r_s = 1 - \frac{6 \cdot 8}{5(25 - 1)} = 0.6 $$
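A quick check of the Spearman example in R (base function, no extra packages):

stem <- c(1, 2, 3, 4, 5)
lang <- c(3, 1, 2, 5, 4)
cor(stem, lang, method = "spearman")   # returns 0.6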
Correlation - Kendall
• We observe 2 numeric attributes A and B.
• Connect observations into pairs and check how many pairs are ordered the same way (P) and how many are not (Q).

Sign matrix for A = (1, 2, 3, 4, 5):
     S1  S2  S3  S4  S5
S1    x   1   1   1   1
S2   -1   x   1   1   1
S3   -1  -1   x   1   1
S4   -1  -1  -1   x   1
S5   -1  -1  -1  -1   x

Sign matrix for B = (3, 1, 2, 5, 4):
     S1  S2  S3  S4  S5
S1    x  -1  -1   1   1
S2    1   x   1   1   1
S3    1  -1   x   1   1
S4   -1  -1  -1   x  -1
S5   -1  -1  -1   1   x

Comparing the two matrices gives P = 14 agreements and Q = 6 disagreements:

$$ \tau = \frac{P - Q}{n(n - 1)} = \frac{14 - 6}{5 \cdot 4} = 0.4 $$
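A quick check of the Kendall example in R:

cor(c(1, 2, 3, 4, 5), c(3, 1, 2, 5, 4), method = "kendall")   # returns 0.4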
Correlation – Pearson (again)
• We can get back to a fully statistical approach for Pearson
• Hypothesis:
• H0: ρ = 0
• H1: ρ != 0
• Formally, a test statistic:
$$ R = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} $$
• Then, if |R| > r(α,n) we reject H0; in practice this step is often skipped and R is reported directly as the estimate of ρ
Correlation -> Regression
• Following the previous example with grades: 𝑅 = 0.830619 >
0.6319 = 𝑟(𝛼; 𝑛) for 𝛼 = 0.05 and 𝑛 = 10
• Mathematical model: Y = f(x1, x2, …, xn) + e
• Y – dependent variable
• x1, …, xn – values of independent variables
(Figure: scatter plot of the grade data.)
• For the same values of X1, … , Xn , it is possible to get different values of Y (so
the data itself is not represented by a function!)
• But regression can be.
(Ordinary) Least Squares
Method
• A (basic) method to calculate linear
regression function.
• Still needs „all things being equal”
to make sense.
• Variables need to be correlated, but
correlation does not imply causation!
• Regression shows dependency and can
be used for prediction of future trends.
• Suburban living example (ad. book).
(Ordinary) Least Squares
Method
• Ordinary Least Squares Method works to minimize the sum of
distances (error) between the regression line and observations
(points).
• In general this leads to the following set of
equations:
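The equations themselves were shown as a figure; for simple linear regression y = a + bx, the standard least-squares solution is:

$$ b = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x} $$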
• Measure candidates:
• Objective (based on statistics and structure): support, confidence, etc.
• Subjective (based on user beliefs): unexpectedness, freshness, etc.
Methods of mining for patterns
(Figures illustrating different methods on client-income data: linear classification (loan / no loan), linear regression, clustering, a single (cutting) rule, a non-linear classifier, the Nearest Neighbour algorithm.)
Some other issues to consider
• Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad-hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem
• Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining methods
Some other issues to consider
• Issues relating to the diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information
systems (WWW)
• Issues related to applications and social impacts
• Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
• Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
• Protection of data security, integrity, and privacy
Phases of full KDD process
• Understanding the problem domain – knowledge and main aims
• Creating the data collection, using data selection process
• Data cleaning and preprocessing (may take over 50% of the time!) – ETL, DW
• Knowledge representation
• Logical language used to describe the mined patterns, usually: logical rules,
decision trees, neural networks
• Criteria for rating the mined knowledge
• Search strategy to maximize optimization criteria
• Parameter search
• Model search (for model families)
Designing a good Data Mining process
• Each approach only fits to some practical problems
• The issue is to find a good question to ask (properly formulate the
problem)
• There are no universal criteria for this; experience is needed
• Data
• Set of all items I = { i1, i2, i3, …, im }
• Set of all transactions D = { (tid1, T1), (tid2, T2), … }
• Minimum level of support sup_min and minimum level of confidence conf_min
(constants)
• Task:
• Find all association rules with support > sup_min and confidence > conf_min
Association rules as computational task
Support | Frequent itemsets
5       | CW
4       | A, D, T, AC, AW, CD, CT, ACW
3       | AT, DW, TW, ACT, ATW, CDW, CTW, ACTW
Mining frequent itemsets
a note to reduce computation time
Mining frequent itemsets
• Following on the previous observation:
• If {A,B} is a frequent itemset, then {A} and {B} should also be frequent.
• In general: If X is a frequent k-itemset, then all of its subsets should be
frequent (k-1)-itemsets.
• Thus a basis for an algorithm:
• Find all frequent 1-itemsets
• Generate frequent 2-itemsets from frequent 1-itemsets
• …
• Generate frequent k-itemsets from frequent (k-1)-itemsets
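A minimal sketch of running Apriori in R, assuming the arules package (the toy transactions below are made up for illustration):

library(arules)
# hypothetical transactions as a list of item vectors
trans <- as(list(c("A","C","W"), c("C","D","W"), c("A","C","T","W"),
                 c("A","C","D","W"), c("A","C","D","T","W")), "transactions")
# mine rules with minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(support = 0.5, confidence = 0.8))
inspect(rules)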
Mining frequent itemsets – Apriori algorithm
• Strategies:
• Move elements to the right
one by one
• Use AprioriGen to generate Y
Improved Apriori algorithm
(Figure: steps 1–3 of the improved algorithm.)
Another approach: AprioriHybrid
• AprioriTid searches the CBk table instead of the whole transactional
base – it is effective if CBk is much smaller than the whole DB.
• AprioriTid is better than Apriori if:
• CBk fits in memory
• Frequent itemsets have long tail distribution
• AprioriHybrid:
• Uses Apriori for first iterations
• Starts using AprioriTid once we estimate that CBk will fit in memory
• In practical terms AprioriHybrid may be ~30% faster
than Apriori and ~60% faster than AprioriTid
• Yet another approach is to use Hash tree
Association rules mined using FP-tree
• The main idea is that the data is stored in a less computationally
demanding structure, the FP-tree.
• Frequent itemsets are calculated based on the FP-tree.
• The entire database is scanned only twice:
• To determine how often each item occurs
• To construct the FP-tree
• As it uses a „divide and conquer” method, it is much faster than Apriori
Association rules mined using FP-tree
example
• Min_sup = 3

TID | Items
1   | f, a, c, d, g, i, m, p
2   | a, b, c, f, l, m, o
3   | b, f, h, j, o
4   | b, c, k, s, p
5   | a, f, c, e, l, p, m, n

Header table (item counts):
f:4  c:4  a:3  b:3  m:3  p:3  l:2  o:2  d:1  e:1  g:1  h:1  i:1  j:1  k:1  n:1  s:1
• The search checks single items in the „header table” from top to
bottom
• For each item i :
• Construct a collection of conditional patterns as „prefix paths” (paths from
the root to i)
• Construct a conditional FP-tree
• We use conditional patterns for i as a small database of transactions D(i)
• We build FP-tree for D(i)
• If there is only one path, then STOP and return frequent itemsets
• Otherwise repeat
Association rules mined using FP-tree
example
• Lazy methods:
• No learning phase (fast). A model (classifier) is not generated
• Classification is done directly from training data
• Eager methods:
• Long learning based on building a full model (classifier) from training data
• Classification is done with the generated model
Classification in two steps
Classifiers: k-Nearest Neighbours
• Often used for numerical attributes
• Requirements:
• Training data set
• Distance function (between objects)
• Parameter: k – the number of neighbours to consider
but this has high computational cost and requires knowledge of many
probability distributions.
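A minimal k-NN sketch in R, assuming the class package (the split of the built-in iris data is arbitrary, purely for illustration):

library(class)
set.seed(1)
idx <- sample(nrow(iris), 100)                 # arbitrary training/testing split
train <- iris[idx, 1:4]; test <- iris[-idx, 1:4]
pred <- knn(train, test, cl = iris$Species[idx], k = 3)   # k = 3 nearest neighbours
table(pred, iris$Species[-idx])                # confusion table on the test part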
Classifiers: Bayesian
• Classification of an object described by attribute values x:
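The formula itself was shown on the slide; a sketch of the standard naive Bayes decision rule (assuming conditional independence of attributes) is:

$$ c^{*} = \arg\max_{c} \; P(c)\prod_{j} P(x_j \mid c) $$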
(Figure: Bayesian belief network with nodes such as XRay (Positive) and Dyspnea.)
Classifiers: Bayesian Belief Network
• Example:
• Average height 177cm, standard deviation 19cm. Distribution N(177,19).
• Calculate probability that the height of a random person is:
• 160cm or less
• More than 160cm
• In the interval [158,196)
• Outside the interval [158,196)
Predicting performance: confidence interval
Predicting performance: confidence interval
Predicting performance
• We just need to do the same for the Bernoulli distribution (for large N):
• Mean: p, Variance: p(1-p)
• Expected success rate: f = S/N
• Mean of f: p, Variance of f: p(1-p)/N
• Normalization of f:
$$ \frac{f - p}{\sqrt{p(1-p)/N}} $$
• So the confidence interval is obtained from:
$$ P\left(-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right) = c $$
• Select second group for testing, the rest (including first) for training
•…
Cross-validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Extensive experiments have shown that ten is the best choice to get an
accurate estimate (drawbacks include diminishing returns)
• Stratification reduces the estimate’s variance
• Improved approach: repeated stratified cross-validation
• E.g. ten-fold cross-validation is repeated ten times and results are averaged
(reduces the variance)
Cross-validation
• Leave-One-Out – a special case where number of groups is equal to
the number of cases, that is for n objects we build the classifier n
times!
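A minimal (non-stratified) k-fold cross-validation sketch in base R, reusing k-NN as the classifier; this is only an illustration, not the stratified procedure described above:

set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))   # random assignment to 10 folds
acc <- sapply(1:k, function(f) {
  train <- iris[folds != f, ]; test <- iris[folds == f, ]
  # any classifier can be plugged in here; k-NN from the class package is used as an example
  pred <- class::knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 3)
  mean(pred == test$Species)                          # accuracy on the held-out fold
})
mean(acc)                                             # averaged cross-validated accuracy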
Real value \ Predicted |  A  | Not A
A                      | TP  |  FN
Not A                  | FP  |  TN
• Machine Learning methods are built to
minimize FP+FN.
• Specific real-world cases may require other approaches
• Direct Marketing maximizes TP:
• Develop a model of a target from a list of candidates (one class in dataset)
• Score all candidates and rank them
• Select top items for further action
• It is a method used by retailers, etc.
Model-Sorted List
• We use some Model of the target class to rate each candidate
• There should be more targets (TP) near the top of the list
Gain Chart
• „Gain” is a popular name for CPH = „Cumulative Percentage of Hits”
• CPH(P,M) is the percentage of all targets in the first P (P ∈ (0,1) ) of
the list scored by model M
Lift
• Using „Gain” we can calculate „Lift”
• Lift (P,M) = CPH(P,M) / P
• In the previous case: Lift(5%, model) =
CPH(5%, model) / 5% = 21% / 5% = 4.2
• The model is 4.2 times better than random!
• Multivariate tree
Classifiers: decision trees
• Example function for a continuous attribute:
• Inequality test:
• „Value copy”
Classifiers: decision trees
• How do we determine which tree is best?
• Quality determined by tree size – smaller is better:
• Fewer nodes
• Smaller height
• Fewer leaves (but the minimum number of leaves = number of classes)
• Quality determined by classification error on the training set
• Quality determined by classification error on the testing set
• Information gain
• Gini index
• The higher the value, the better the test.
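For reference, the standard definitions behind these two scores (a sketch; the lecture figures may use slightly different notation):

$$ \mathrm{Ent}(X) = -\sum_{c} p_c \log_2 p_c, \qquad \mathrm{Gain}(t, X) = \mathrm{Ent}(X) - \sum_{i} \frac{|X_i|}{|X|}\,\mathrm{Ent}(X_i), \qquad \mathrm{Gini}(X) = 1 - \sum_{c} p_c^2 $$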
Comparing tests
• If t’ divides the space into smaller parts than t, then
Gain(t’,X) > Gain(t,X) and Disc(t’,X) > Disc(t,X) (the scores are monotonic).
• Score values for test t are small if the distributions of classes in X are similar.
• Gain ratio is a better score for a test than plain Gain:
• We prune if
usually
Tree pruning
Decision trees: case of missing values
Learning:
• We can reduce the criterion of test scoring by a (no. of objects with
unknown values / no. of all objects)
• We can fill unknown values of attributes with the value most common
in objects connected with the current node
• We can fill unknown values of attributes with weighted average of
known values.
Decision trees: case of missing values
Classification:
• We stop classification in the current node and return the dominant
label for it
• We can fill unknown values using one of approaches for training
• We can calculate probability distribution of classes based on the
subtree
Decision trees
• Each node is connected with a subset of the data, which requires memory
• Looking for the best split requires repeatedly reordering the data, which is
computationally expensive (especially with many distinct values).
• There are other algorithms for constructing decision trees, e.g. Sprint:
• Works with some data on hard drive storage
• Uses preordering to speed up calculation on real-value attributes
• Data is ordered only once before calculation
• Can be easily made parallel
Decision trees: Sprint algorithm
• Each attribute has its list of values
• Each element of the list has three parts:
• Attribute value
• Class number i
• Number of the object in the dataset (rid)
• Real attributes are ordered (once on creation)
• At the start the lists are in the tree root
• When nodes are created, lists are divided and connected with proper
children nodes
• Lists are backed up on the drive
Decision trees: Sprint algorithm example
Decision trees: Sprint algorithm
• Sprint uses:
• Gini index for scoring division tests
• Inequality tests (a<c) for real value attributes
• Set element tests (a∈V) for symbolic attributes
• For real attributes two histograms are used:
• Cbelow : for data below the threshold
• Cabove : for data above the threshold
• For symbolic attributes there is a histogram called count matrix
Division
point
Decision trees: Sprint algorithm example
Decision trees: Sprint algorithm example
List of CarType values
Count Matrix
• Symbolic attribute:
• Determine the division (count) matrix of objects in each node
• Using an approximate algorithm determine a set 𝑉 ⊆ 𝐷𝑎 , so that the test (a∈V) is
optimal
Decision trees: Sprint algorithm
Processor 1
Decision trees: Sprint algorithm (parallel)
• Data we have:
• Desired number of clusters k
• Set of objects P
• Distance function d
• Objective function F
• Task: divide set P into k clusters, such that it maximizes F
Clustering
K clusters found
K-means
• The original method (1967): given N points x1, …, xN in the space Rn and k<N,
we look for k points c1, …, ck (called representatives or centroids) that
are optimal for the objective function (a sketch of the standard function follows below)
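The function itself was shown as a figure; a standard form of the k-means objective (within-cluster sum of squared distances, to be minimized) is:

$$ F(c_1,\dots,c_k) = \sum_{i=1}^{N} \min_{j \in \{1,\dots,k\}} \lVert x_i - c_j \rVert^2 $$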
K-means example
• Update centroids
K-means example
• (updated centroids, next step: reassign points)
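A minimal k-means run in base R on two numeric attributes (the choice of iris columns and k = 3 is arbitrary, for illustration only):

set.seed(1)
km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 3)
km$centers          # final centroids
table(km$cluster)   # cluster sizes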
Grouping based on probability
• Objects are in some cluster with some probability.
• Each cluster is described by one probability distribution.
• We assume all distributions are normal, described by expected value 𝜇 and
standard deviation 𝜎.
Grouping based on probability
• Object x is in cluster A with probability P(A | x) (see the formulas below)
• For wi – the degree to which the i-th object belongs to cluster A – the parameters of A are estimated as below:
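The slide formulas are not reproduced here; under the normality assumption above, the standard expressions (a sketch of the usual mixture-model updates) are:

$$ P(A \mid x) = \frac{P(x \mid A)\,P(A)}{P(x)}, \qquad P(x \mid A) = \frac{1}{\sqrt{2\pi}\,\sigma_A}\, e^{-\frac{(x-\mu_A)^2}{2\sigma_A^2}} $$

$$ \mu_A = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \sigma_A^2 = \frac{\sum_i w_i (x_i - \mu_A)^2}{\sum_i w_i} $$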
PART 3: Graphs
Social Networks in Data Science
• What is a Social Network?
• Social Network as a Graph
• Person as node, Friend (Follower) as Edge
• Alternatively: Person, Activity (Interest, Item) as nodes of different types
Social Networks in Data Science
• Network types:
• Online social network
• Facebook vs Twitter vs other examples
• Telephone network
• Real world communication network
• Email network
• Enron
• Collaboration network
• Research Gate, Groupon, etc
Social Networks in Data Science
• Distance measures
• Edges used to count distance
• Non-people nodes used to count distance
• No connection is infinite distance
• Clustering by distance
• Close nodes may be a community
• Betweenness measure
• Find the shortest path between each pair of nodes
• On how many shortest paths does this node lie
• Low betweenness inside communities, high outside
Social Networks in Data Science
• Cliques
• Everybody is connected with everybody
• Another sign of a community
• Counting triangles
• Efficient algorithms, because triangle is the smallest clique
• Diameter of a graph
• The largest distance between nodes (longest of the shortest paths)
• Small world theory states that this diameter (for the entire planet!) is 6 – see the igraph sketch below
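A minimal sketch of these measures in R, assuming the igraph package (the small friendship network below is made up):

library(igraph)
# hypothetical friendship network given as an edge list
g <- graph_from_literal(Ann-Bob, Ann-Cat, Bob-Cat, Cat-Dan, Dan-Eve, Eve-Fay, Dan-Fay)
distances(g)          # pairwise shortest-path distances
betweenness(g)        # how many shortest paths pass through each node
count_triangles(g)    # triangles per node (the smallest cliques)
diameter(g)           # longest of the shortest paths in the graph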
Social Networks in Data Science
• Any other complex model of human behaviour could have been used
Social Networks in Data Science
Cambridge Analytica case
• The used approach:
• Classify the persuadable people as distinct from others
• Determine the OCEAN characteristics of persuadable people
• Prepare campaign for different groups of persuadable people
• Start the campaign in multiple channels.
• This was specifically done by the SCL company, which got their data from
Cambridge Analytica…
• Each Like on Facebook was one attribute
• They had data on 50 million people (voters?)
Social Networks in Data Science
Cambridge Analytica case
• Targeted content in Trump’s
campaign
Social Networks in Data Science
Cambridge Analytica case
• Targeted content in Trump’s campaign
Social Networks in Data Science
Cambridge Analytica case
• Targeted content in Trump’s campaign
Geospatial Data
• GIS – Geographic Information Systems is a catch-all term for multiple
tools using maps, map data and other spatial information.
• The most common approach is putting a layer of data as colors or icons on the
map of some location.
• Software must be able to combine map data (location + value) with the
displayed map.
• Not dissimilar to weather broadcasts on TV and may be used as a basis.
• Many Business Intelligence applications have this option built in as default.
• Another important layer is time.
• Aerial photos may substitute for the map (example: orthophotomap)
Geospatial Data
• Different types of presentation & data
Geospatial Data
• Toronto example (book)
Geospatial Data
• Chicago Example (book)