chp5 Bags

Download as txt, pdf, or txt
Download as txt, pdf, or txt
You are on page 1of 28

Chapter 5:

5.1 Relational Operations On Bags


5.1.1 Why Bags?
5.1.2 Union, Intersection And Difference Of Bags
5.1.3 Projection Of Bags
5.1.4 Selection Of Bags
5.1.5 Product Of Bags
5.1.6 Joins Of Bags
5.1.7 Exercises For Section 5.1

5.2 Extended Operators Of Relational Algebra


5.2.1 Duplicate Elimination
5.2.2 Aggregation Operators
5.2.3 Grouping
5.2.4 The Grouping Operator
5.2.5 Extending The Projection Operator
5.2.6 The Sorting Operator
5.2.7 Outer Joins
5.2.8 Exercises For Section 5.2

5.3 A Logic For Relations


5.3.1 Predicates And Atoms
5.3.2 Arithmetic Atoms
5.3.3 Datalog Rules And Queries
5.3.4 Meaning Of Datalog Rules
5.3.5 Extensional And Intensional Predicates
5.3.6 Datalog Rules Applied To Bags
5.3.7 Exercises For Section 5.3

5.4 Relational Algebra And Datalog


5.4.1 Boolean Operations
5.4.2 Projections
5.4.3 Selection
5.4.4 Products
5.4.5 Joins
5.4.6 Simulating Multiple Operations With Datalog
5.4.7 Comparison Between Datalog And Relatonal Algebra
5.4.8 Exercises For Section 5.4

5.5 Summary Of Chapter 5


5.6 Exercise For Chapter 5

Summary:

(1) Let R and S be bags of tuples and let tuple t appear m times in R, n times in S
(m and n can be zero).

then
the bag R union S contains m + n occurrences of t
the bag R intersection S contains min (m,n) occurrence of t
the bag R diff S contains max(0, m-n) occurrences of t. intuitively (when R and
S contain tuples of t) each occurence of t in S 'cancels' one occurence of t in R

(2)
we extend the set based relational algebra of 2.4 to bags, which better reflects
how relational algebra is implemented in practice.
we also extend the algebra so it can handle more operations than previously. (e.g
aggregation)

REM Set based Relalg


1. set based ops - union, intersection, difference
2. project, select
3. natural join, theta join
4. rename
5. specify keys (review required)

/REM

then we look at datalog, a query language based on logic.


datalog allows us to express queries by describing the desired results (??)

5.1 Relational Operations On Bags

Key Difference: Relations that are *bags* of tuples rather than *sets* of tuples as
before.

Consider the relation

A B
1 2
3 4
1 2
1 2

Here the tuple (1,2) appears 3 times and the tuple (3,4) appears once.
If this were a set based relation we would have to remove 2 instances of the (1,2)
relation

In bag based relations, we *do* allow repetition of tuples, but as with sets *order
does not matter*.

5.1.1 Why Bags?

1. Commercial DBMS implement relations that are bags (rather than sets) of tuples.
2. An important motivation for bags is that some relational operations are
considerably more efficient than using sets.

Examples
1. To take the union of two tuples, we just copy one relaton and add to it all
the tuples of the other relation. There is no need to eliminate duplicates as with
sets.
2. When we project a set based relation (retain only some columns), we need to
compare each resulting tuple with all the tuples projected so for, to ensure no
duplicates, whereas with bag based relations, we simply project each tuple, and add
it to the result, with no elimination of duplicates necessary.
Consider the relation

A B C
1 2 5
3 4 6
1 2 7
1 2 8

If we do a project_(A,B) onto this bag based relation we get the relation

A B
1 2
3 4
1 2
1 2

If we were using set based relations we would need to eliminate the duplicated
(1,2) tuples from the result.

The bag result is larger but can be computed more quickly.

There are some situations where only bag based relations give the correct result.
Example: Consider taking the average of the values of the A column of the above
relation.

5.1.2 Union, Intersecton, Difference of Bags

Let R and S be bags of tuples and let tuple t appear m times in R, n times in S (m
and n can be zero).

then
the bag R union S contains m + n occurrences of t
the bag R intersection S contains min (m,n) occurrence of t
the bag R diff S contains max(0, m-n) occurrences of t. intuitively (when R and
S contain tuples of t) each occurence of t in S 'cancels' one occurence of t in R

Let R be
A B
1 2
3 4
1 2
1 2

Let S be
1 2
3 4
3 4
5 6

Then R Union S
1 2
1 2
1 2
1 2
3 4
3 4
3 4
5 6

R intersect S
A B
1 2
3 4

R difference S
A B
1 2
1 2

S difference R
A B
3 4
5 6

5.1.3 Projection of Bags

Each tuple is processed independently.


If in the resulting relation, tuples are duplicated, that is fine.

Let R =

A B C
1 2 5
3 4 6
1 2 7
1 2 8

we do a project_(A,B) R we get (each tuple of R processed independently, duplicates


don't matter)

A B
1 2
3 4
1 2
1 2

Bag Operations On Sets

Key Idea: == Each set can be considered to be a bag with no duplicates, and then
*some of* the bag operations 'fall back into' the set operations.

E.g R intersect S . If R and S don't have any duplicates, but we use the bag
intersect anyway (min (m,n)) we get exactly the result of the Set intersect.
Similarly difference.

But union works differently. If t occurs (once) in both R and S, then the bag union
will have two instances of t, and the set union has 1 instance.

5.1.4 Selection on Bags


We apply the selection condition to each tuple independently. We do not eliminate
duplicates in the result.

Let the tuple R be


A B C
1 2 5
3 4 6
1 2 7
1 2 7

then let the tuple that is the result of select_(C >= 6) R is

A B C
3 4 6
1 2 7
1 2 7

(note: duplication of 1 2 7 tuple. so in essence, apply filtering condition per


tuple , allow duplicates)

Algebraic Laws For Bags

an algebraic law is in equivalence between two expressions of relational algebra


with variables standing for relations.
e.g: A union B = B union A

(KEY) there are laws that hold when relations are sets that do *not* hold when
relations are bags.

e.g: dist diff/union (from the right!)

(R union S) diff T = (R diff T) union (R diff S)

this holds for sets, but not bags.

(bags) counterexample: suppose R, S, T have one copy of a tuple t


Left expression resolves to a relation with one tuple. union gives two, diff takes
off one
right expression resolves to a relation with no tuples. diffs give two relations
with zero t tuples. Union preserves zero tuples.

1 != 0 so 'law' fails.
with sets, both sides resolve to relations with 0 t tuples, so law holds.

5.1.5 Products of Bags

Cartesian product for bags - each tuple from one relation is paired with each tuple
of the other, *regardless of whether the result is a duplicate or not*. Therefore
if a tuple t occurs m times in relation R, n times in relation S, the tuple rs
will appear in R X S (mn) times.

Example: Let R be

A B
1 2
1 2
Let S be

A B
2 3
4 5
4 5

then R X S is

A R.B S.B C
1 2 2 3
1 2 4 5
1 2 4 5
1 2 2 3
1 2 4 5
1 2 4 5

5.1.6 Joins of Bags

No surprises. We take each relation of one tuple with each relation of the other,
decide (based on conditions of the join) if this pair of tuples joins successfully,
and if so, add it to the result, *ignoring* duplicates.

Example: Let R be

A B
1 2
1 2

Let S be

B C
2 3
4 5
4 5

then R natural join S is (note that B is the shared attribute)

A B C
1 2 3
1 2 3

R theta_join( R.B < S.B) S is

A R.B S.B C
1 2 4 5
1 2 4 5
1 2 4 5
1 2 4 5

5.1.7 Exercises For Section 5.1

Exercise 5.1.4 Certain algebraic laws for relations as sets also hold a relations
for bags. Explain why each of the laws below holds for bags as well as for sets.
1. R U (S U T) = (R U S) U T

ideally we want a proof here for bags (the above is true for sets).

for now a brief outline. Let R have m tuples of t, S have n, T have t, assume for
convenience all != 0

then for sets.


LHS: S U T has 1 copy. R U (S U T) has one copy.
RHS: R U S has 1 copy. (R U S) U T has one copy.

for bags
LHS (S U T) has n + t copies. R U (S U T) has m + n + t copies.
RHS (R U S) has m + n copies. (R U S) U T has m + n + t copies.
so law holds for sets and bags.

2. (R intersect S) intersect T = R intersection (S intersect T))

for sets: LHS: R int S gives 1. T intersect that gives 1


RHS: S int T gives 1 (tuple). R intersect that give 1

for bags:
LHS: R intersect S gives (min m n). T intersect that gives (min t (min m
n))
RHS: S intersect T gives (min n t) tuples. R intersec that give (min m
(min n t))
since (min c (min a b)) = (min a (min b c)) = (min a b c), LHS = RHS

3. associative law for natural join


R nj (S nj T) = ( R nj S) nj T

For a tuple t in R, S have n tuples it joins to, and for a tuple u in S, T have p
tuples it joins to

hmm don't know how to *prove*.

4. commutative law for Union


R U S = S U R

Let R have m reps of tuple t, S have n

then
for sets
LHS has 1 tuple for m != 0 and n! = 0 and if either has 0 tuples, the union has 0
which is true for RHS also.
for bags
both sides end with m + n tuples

5. commutative law for intersection


R intersect S = S intersect R
both sides end up with (min m n) = (min n m) . QED
Exercise 5.1.5

The following algebraic laws hold for sets but not for bags.
Explain why they hold for sets and give counterexamples to show that they don't
hold for bags.

(1) ( R intersect S ) - T = R intersect (S - T)

Let R have 1 instances , S have 3 instances, T have 1 instance each of tuple t


LHS:
R intersect S gives 1 tuples of t ;; (min 1 3)
That diff T gives a 0 instance of t (max 0 (1 - 2))

RHS
S - T gives 2 instances of t (max 0 (3 -1))
R intersect that gives 1 instance of t (min 1 2)

LHS != RHS

(2) Distribution of intersection over union

R intersect (S Union T) = (R intersect S) union (R intersect T)

Let R have 1 instances , S have 3 instances, T have 1 instance each of tuple t


LHS:
S Union T has 4 elements
R intersect this = 1 tuple

RHS:
R intersect T gives 1 tuple . R intersect T gives 1 tuple.
Union gives 2 tuples
LHS != RHS

5.2 Extended Operators Of Relational Algebra

Languages like SQL have operations that have proved useful (but go beyond the bag
based and set based relational algebra we dealt with so far)
1. duplicate elimination operator delta changes bags to sets by eliminating
duplicates.
2. aggregation operators - sums, averages, are not operations of relational
algebra, but *are used* by the grouping operator (see below)
3. (once we have some tuples) grouping them in accordance with the values of
one or more attributes. This effectively *partitions* tuples into groups (_ but
(KEY) only *one* entry for all tuples with the same value in the grouping
attributes). Aggregations can then be applied to the columns of tuples within
groups, giving us the ability to express a number of queries that are difficult to
express in classical relational algebra.

The grouping operator gamma, combines the effect of grouping and aggregation.

4. Extended projection: gives extra power to the projection operator,


specifically, produce new columns from existing columns.
5. The sorting operator tau turns a relation into a *list* (not set) of tuples,
sorted according to one or more attributes. this should be used judiciously since
some relational algebra operators don't make sense for lists. However we *can*
apply select and project onto lists and expect order not to change.
6. Outer Join is a variant of join that avoids dangling tuples (tuples with no
values for certain attributes). With outer join, dangling tuples are "padded" with
the null value.

5.2.1 Duplicate Elimination

to convert a bag to a set, we use delta (R) that returns a set that has exactly one
copy of every tuple that appears in relation R

Let R be
A B
1 2
3 4
1 2
1 2

delta (R) gives


A B
1 2
3 4

The tuple (1, 2) that appears thrice in R appears only once in the result

5.2.2 Aggregation Operators

apply to bags or sets of numbers or strings, and generate a summary or 'aggregate


value' of the contents of these sets/bags.
The standard aggregation operators are
a. SUM - produces the sum of a column of numerical values
b. AVG - produces the average of a column of numerical values
c. MIN, MAX - produces, when applied to a column of numerical values, the
numerically greatest or least value. can also be applied to columns of strings when
it produces the lexicographically least or most value.
d. COUNT - produces the number of *not necessarily distinct*, values in a
column (== number of elements in the 'column set' or 'column bag'.)

Example 5.9
Consider the relation R =
A B
1 2
3 4
1 2
1 2

then
SUM(A) = 1 + 3 + 1 + 1 = 6
AVG(A) = (1 + 3 + 1 + 1)/4 = 1.5
MIN(A) = 1
MAX(A) = 3
COUNT (A) = #{_b 1, 3, 1, 1} = 4

5.2.3 Grouping
Often we do not want to aggregate the values of an entire column. Instead we want
to arrange tuples into groups using the value of one or more colummns, then
aggregate 'by groups'.

e.g: starting with the relation

Movie(title,year, length, genre, studioName, producerC#)

we want the total number of minutes of movies produced by each studio, the
resulting relation should be something like

studioName sumOfLengths
Disney 12345
MGM 34567

... ....

we need to
1. group the tuples of this relation by studioName
2. sum the length column within each group (all of whose entries have the same
studioName)

5.2.4 The Grouping Operator

grouping operator gamma, with a subscript L , a list of elements, each of which is


either
a) an attribute of the relation R to which grouping is applied. Such an
attribute is called a grouping attrubute (__ how does this work with a *list* of
attributes)
b) an aggregation operator which is applied to an attribute of a relaton. To
provide a name for the result of the aggregation, an arrow and a new name is
appended to the aggregation. The underlying attribute is said to be an *aggregated*
attribute.

(_ the key point is that L has two kinds of elements - grouping attributes and then
an aggregation operator (which in turn has a new name for the result)

the relation returned by the expression gamma_L (R) is constructed as follows

(1) Partition the tuple R into groups.


Each group consists of all tuples having one particular assignment of values
*to the grouping attributes*.
If there are no grouping attributes, the entire relationship is one group.

(2) For each group, produce (KEY) *****one*** tuple consisting of


(i) the grouping attributes' value for that group
(ii) the aggregations, over all tuples of that group, for the aggregated
attributes on list L.
(_iii) note that only the grouping attributes and the result of the aggregation
operators appear in the resulting relation.

Example: 5.10
consider the relation

StarsIn(title, year, starName)


and, we wish to find for each star who has appeared in 3 movies, the earliest year
in which they applied.

Step 1: First group using starName as a grouping attribute.


Step 2: we must compute for each group, the MIN (year) attribute.

These two steps give one row for each star, with starName, minyear
but, there is a condition that the stars for whom we do this
computation should have finished at least 3 movies, so we must also compute the
COUNT(title) aggregate for each group. (this _muddles the example imo)

We begin with the grouping expression


gamma_(starName, MIN(year)->minYear, COUNT(title)->ctTitle(StarsIn)

This results in the replication of the tuples in StarsIn, but with all tuples
having the same starname having their titles counted, and value put in a new
column, ctTitle, and minimum of the years their movies appeard in, put in a new
column min year.

Then onto *this* tuple if you do a selection with ctTitle >= 3 and on that result
do a selection of only starName + minYear, (+ elimination of duplicates? no, this
is built into grouping) we get the desired result.

Simple SQL Examples

SELECT COUNT(CustomerID), Country


FROM Customers
GROUP BY Country;

delta- the deduplication operator is a special case of grouping operator.


if all attributes of the relation are given to the subscript of the grouping
operator, (with no generated aggregation columns) we get, essentially, a group
operation on all attributes with no aggregation. Since the grouping operator
produces exactly one tuple from each group, the effect of this 'grouping' is to
eliminate duplicates.

but because delta is such a common operator, we consider it distinct.

*gamma* can be seen as an extension of the projection operator, again with all
attributes of the relation as the subscript list.
If R is a set, the output of this type of gamma and a projection with all
attributes is identical.
But R is a bag, the gamma with all attributes eliminates duplicates, but projection
with all attributes do not.

5.2.5 Extending The Projection Operator

In the 'classical' projection operator encountered earlier, pi_L (R), L was a list
of attributes of R (that we want in the result).
We extend this by allowing *computations* with components of tuples. so, now L is a
list of *either*
a) a single attribute of R (which is present in the result)
b) an expression x -> y where x and y are names for attributes, by which we ask
that 'take attribute x of R, *rename* to y in the result'
c) an expression E -> z where E is an expression consisting of attributes of R,
constants, arithmetic operators, string operators, and zi is a *new* name for the
result of the computation E encodes.
example: a + b -> x, represents the sum of attributes a and b renamed x. c ||
d -> e means the concatenate the presumably string valued attributes c and d, and
call the result e.

The result of the projection is computed by considering each tuple of R *in turn*,
the result is a relation whose schema is the names of the attributes on list L,
with whatever renaming the list specifies.

note: Duplicate tuples in R create duplicate tuples in the result. (but even if R
does not have duplicates, the result can).

e.g: let R be
A B C
0 1 2
0 1 2
3 4 5

pi_(A, B+C -> X) gives

A X
0 3
0 3
3 9

pi_(B-A -> X, C-B -> Y) give

X Y
1 1
1 1
1 1

note that two distinct tuples of R have been converted to the same tuple 1 1

5.2.6 The Sorting Operator

The expression Tau_L (R) , with R a relation, L a list of some of the attributes
of R, is the relation R, but with the tuples sorted in the order of attributes
specified in L . If L is the list A1, A2.... then tuples of R are sorted *first* by
the attribute A1. Ties are broken by the order of A2. Ties that remain after
ordering by An are ordered arbitrarily.

Notes:
joins on the result of sorts make the sort meaningless. Elements on the *list*
should be treated as a bag, not a list (??)
however bag projections can be made to preserve the order
also a selection on the list (returned by the sort operator) drops the tuples
that do not satisfy the selection condition, but the remaining (selected) tuples
can be made to appear in sorted order.

5.2.7 Outer Joins

Consider these relations

A B C
1 2 3
4 5 6
7 8 9

B C D
2 3 10
2 3 11
6 7 12

the shared attributes are B and C

R naturaljoin C gives

A B C D
1 2 3 10
1 2 3 11

One of the features of the natural join is that some tuples in each relation are
excluded from the join (the 2nd,3d tuple in R, the 3d tuple in S). These are
'dangling tuples'. Where we, for some reason also want the dangling tuples in the
join, we use a variation called the 'outer join'.

First the 'natural' outer join, with the join on equated values of common
attributes

in this, we start with the natural join, then *add* dangling tuples from both
relations, with a null value (the bottom symbol in the text, here represented with
N) for the attributes the dangling tuples don't possess.

so R naturalOUTERjoin S gives
A B C D
1 2 3 10
1 2 3 11
4 5 6 N ;from R
7 8 9 N ;from R
N 6 7 12 ;from S

(KEY Insight == those tuples from either side of the natural join that do not *make
it into the resulting relation* are the dangling tuples. Pull them in, fill up
missing columns (from other relation) with Null.)

symbol of outer join = bowtie with a hollow dot on top.

There are many variations of the outer join.

In the Left Outer Join, only the dangling tuples from the relation on the left are
added to the resultant relation (and padded with NULLs)

A B C
1 2 3
4 5 6
7 8 9

S
B C D
2 3 10
2 3 11
6 7 12

note: the shared attributes are B and C

R LeftOuterJoin C gives

step 1. First natural join

A B C D
1 2 3 10
1 2 3 11

step 2. add 'left out' tuples from the left relation (here R), pad column D with
nulls

A B C D
1 2 3 10
1 2 3 11
4 5 6 N
7 8 9 N

For the Right Outer Join, only dangling tuples from the right hand side relation
(here S) are added to the join so

A B C
1 2 3
4 5 6
7 8 9

B C D
2 3 10
2 3 11
6 7 12

note: the shared attributes are B and C

R RightOuterJoin C gives

step 1. First natural join

A B C D
1 2 3 10
1 2 3 11

step 2. add 'left out' tuples from the right relation (here S), pad column A with
nulls
A B C D
1 2 3 10
1 2 3 11
N 6 7 12

all three natural outer joins have their theta join equivalents.
first, a theta join is taken, and then tuples that didn't participate are
added, and padded with nulls.

A B C
1 2 3
4 5 6
7 8 9

B C D
2 3 10
2 3 11
6 7 12

note: the shared attributes are B and C


R theta-join(A > S.C) S

gives
1. first theta join

A R.B R.C S.B S.C D


4 5 6 2 3 10
4 5 6 2 3 11
7 8 9 2 3 10
7 8 9 2 3 11

the tuples that didn't participate is the first tuple in R and third tuple in S.
pad with nulls

A R.B R.C S.B S.C D


4 5 6 2 3 10
4 5 6 2 3 11
7 8 9 2 3 10
1 2 3 N N N
N N N 6 7 12

5.2.8 Exercises For Section 5.2

5.3 A Logic For Relations

As an alternative to abstract query languages based on algebra, we can use a form


of logic to express queries.
The logical query language consists of if then rules.
Each of these rules expresses the idea that from the certain combinations of tuples
in certain relations, we may infer that some other tuple must be in some other
relation, or in (?) answer to a query(??)

5.3.1 Predicates And Atoms

Relations are related in datalog by predicates.


Each predicate takes a fixed number of arguments, and a predicate followed by its
arguments is called an **atom**.

The syntax of predicates is just like function calls in programming languages.

P(x1,x2, ..x-n) is an atom consisting of predicate P and arguments x1 through xn. A


predicate returns a boolean value, and can be considered a function that taken n
arguments and returns a boolean value.

<step>
If R is a relation with attributes in a fixed order, (KEY) we use R as a (datalog)
predicate name,
KEY: The atom R(a1,a2, ... a-n) has the value TRUE if (a1, a2, .. . a-n) is a tuple
present in relation R. else this atom has the value false.
For now, a relation defined by (_as?) a predicate is assumed to be a set. Later we
see how datalog can be extended to bags.

Let Relation R be

A B
1 2
3 4

then R(1,2) is TRUE, R(3,4) is true, and R(x,y) for any other values of (x,y) is
FALSE

(step) a predicate can take *variables as well as constants* as arguments.

E.g R(x,2) is TRUE *when* x is one, false *otherwise*.

5.3.2 Arithmetic Atoms

Another kind of atom that is important in datalog. This kind of atom is a


'comparison between two arithmetic expressions' ( _ think this means that
arithmetic ops are treated as *relations* instead of tuples. So the function (+ 1
2) giving 3 becomes the relation "+ (1 2 3)" which returns TRUE, a *boolean*, so we
can have queries like (+ x y 3) and so on - at least that is what I remember from
the Art of Prolog)

The former kinds are called relational atoms, and can change, when they correspond
to tuples in database relations, which are finite. These are arithmetic relations
which are infinite and immutable.

5.3.3 Datalog Rules And Queries

Operations similar to those of relational algebra are described in datalog with


"rules".

Rules consists of
a. A relational atom called a head
b. The symbol <-- which we read as 'if'
c. A body, - consisting of one or more atoms, called subgoals, *which could be
relational *or* arithmetic*.
subgoals are connected by AND
subgoals can be negated by NOT

example:
LongMovie(t,y) <-- Movies(t,y,l,g,s,p) AND l >= 100 (_ interesting how the
second subgoal can use attributes of the first subgoal, so l has to be bound before
the second subgoal can be evaluated to test for truth?)

defines the *set* of long movies those at least 100 minutes long. One of the
subgoals refers to our standard relation Movies with the schema

Movie(title, year, length, genre, studioName, producerC#)

The head of the rule is LongMovie(t,y).


The body of the rule, "Movies(t,y,l,g,s,p) AND l >= 100" has two subgoals
1. The first subgoal has predicate Movies and 6 arguments corresponding to the
6 attributes of our movies relation.
Each of the arguments of the (relational atom) Movies has a variable
corresponding to attributes of the relation Movies.
(datalog relational atom's argument variable) t for (attribute of relation
Movies) title
y for year
l for length
etc

(KEY) We can see this relational atom (Movies, the subgoal of the LongMovies rule)
as saying "Let (t,y,l,g,s,p) be a tuple in the current instance of the relation
Movies". More precisesly Movies(t,y,l,g,s,p) is *true* exactly when the relation
Movies has tuple with the same values (as held in the variables of the relational
atoms Movies.

Similarly the second subgoal l >= 100 is true whenever the length component of a
tuple in Movies is at least 100.

The rule can be viewed as saying:

LongMovies(t,y) is true whenever we can find a tuple in (the relation) Movies with
a) (the values of variables) t and y as the first two components - title and
year -of that tuple
b) (the tuple has) a third component length with >100 as value
c) *any* values in components 4 thru 6

Notice that this rule is equivalent to this relational algebra "assignment form"

LongMovies = project_(title, year) (Select_(length >= 100) Movies)

Queries in Datalog

A query in datalog is (KEY) a collection of one or more rules.


If there is only one relation that appears in the rule *heads*, then the value of
this relation (here LongMovie) is considered the answer to the query.

Thus in the above example, (the resulting relation) LongMovie is the answer to the
query.

(KEY) If there is more than one relation in the rule head, one of these relations
is the answer to the query, while the others assist in the definition of the
answer. While there are several predicates defined by a collection of rules, we
usually assume that the query result is termed Answer ( I think that the point is
that several Relations - from the head - result as 'answer' and we use a convention
to pick on to be "the" answer)

Anonymous variables

Frequently Datalog rules have variables that appear only once. ( I think this means
that variables don't appear in head *and* body of rule). The names used for these
variables are irrelevant. Only when a variable appears more than once do we care
about the name, so we can see it is the same variable in its second and further
appearances.

Thus we adopt a ***convention*** that an underscore _ as an argument for an atom


stands for a variable that appears only there. Multiple occurrences of _ stand for
multiple variables, never the same variable.

The LongMovies rule can be written as

LongMovies(t,y) <-- Movies(t,y,l, _, _, _,) AND l > 100

the variables g,s,p that appear only once in the original have been replaced by
underscores. we cannot replace any of the other variables - t,y,l - because all of
them appear more than once in this rule.

5.3.4 Meaning Of Datalog Rules

The rule

LongMovies(t,y) <-- Movies(t,y,l,g,s,p) AND l >= 100

gives us a hint about the meaning of datalog rules (it does?)


- imagine the variables of the rule ranging over all possible values.
- whenever these variables have values ** that make all subgoals true **(KEY
principle), we see what the value of the head is for those values, and add that
tuple (_ resulting from putting the variable values in head expression) to the
relation whose predicate is in the head.

For the Longmovies rule, we imagine the variables (t,y,l,g,s,p) ranging over all
possible values (here, tuples to check fit are drawn from the Movies relation, so
this just means we consider all tuples of the Movies relation and choose those that
make both subgoals true). The only values of variables that make all subgoals true
are those tuples of Movies where l >= 100. We put the values of t and y (from each
tuple of the relation Movies where l > 100) into a tuple (t,y) and add to (the set
of tuples that is) the relation LongMovie

Rule Safety

we place restrictions on the way variables are used in rules, so that


[1] (KEY) the result of the rule is a finite relation. (_ not infinite relation
as result)
[2] rules with arithmetic subgoals, or subgoals with NOT in front of them
(negated subgoals) make intuitive sense.

This condition, which we call the safety condition, is


(KEY) every variable that appears anywhere in the rule must appear in a non
negated, *relational* subgoal of the body.
more specifically *all* variables that appear in
a) the head
b) a negated subgoal
c) an arithmetic subgoal,
*must* appear in
(a) a non negated
(b) and *relational* subgoal
(c) in the body

P(x,y) <-- Q(x,z) AND NOT R(w,x,z) AND x < y

the only non-negated relational subgoal is Q(x,z)


therefore y in the head, w,z in NOT R(w,x,z), which don't appear in Q, R violate
the rule.

this is thus, not a safe rule and so cannot be used in Datalog.

Another way to define the meaning of rules, assuming safe rules as above

*Instead of considering all possible assignments of values to variables*, we


1. choose all non-negated *relational* subgoals (only)
2. consider sets of tuples in relations corresponding to each such subgoal from
(1)
3. is any assignment of values to variables *in the subgoal* consistent (again,
note that here we use only non negated relational goals) ? i.e, it assigns the same
value to each occurrence of a variable,
if yes then consider the resulting assignment of values to all variables of
the rule (including those in arithmetic and negated relational goals)
4. For each consistent assignment, we see if the assignment of these values to
the variables of negated relational subgoals, and arithmetic subgoals. If the
subgoals are all true, we see what tuple the head becomes under this assignment of
variables and this rule is added to the relation whose predicate is the head.

Example 5.2.1
Consider the datalog rule
P(x,y) <-- Q(z,x) AND R(z,y) AND NOT Q(x,y)

analysis of body ,
two non negated relational subgoals Q, R one negated relational subgoal NOT Q

Step> we consider only the non negated relational goals and look for a variable
assignment that gives consistency.

Let relation Q =

1,2
1,3
Let relation R =

2,3
3,1

There are two non negated subgoals, Q, and R, so we have to consider all
combination of assignments of tuples from relations Q and R respectively to these
subgoals

Tuple for Tuple for Consistent NOT Q(x,y)


Resulting Head
Q(x,z) R(z,y) Assignment? is TRUE?

1. (1,2) (2,3) YES. z = 2 No, Q(1,3) = T


---
2. (1,2) (3,1) NO. z =2,3 Irrelevant
---
3. (1,3) (2,3) NO. z =3,2 Irrelevant
---
4. (1,3) (3,1) YES. z = 3 Yes. Q(1,1) = F
P(1,1)

We first check non negated relational goals for consistency.

The first option assigns a consistent value to z {z =2} and {x = 1, y = 3}. Now
that we have a consistent assignment of variables, we check the other (negated
relational, or arithmetic) subgoals (here only NOT Q(1,3) ) and check if it (==
the *negated* subgoal) evaluates to TRUE. It doesn't (it evaluates to false)

The second and third options are not consistent. Each assigns two different values
to the variable z, and so are inconsistent. We don't evaluate negated relational,
or arithmetic subgoals.

The third option assigns a consistent value t z {z = 3},. So now we check NOT
Q(x,y) . Q(1,1) = F. Negating this gives true. So we take the assignment {x = 1,
y=1, z=2} produce the head clauses which give P(x,y) = P(1,1).

Since we have exhausted all possible tuple assignments for non negated subgoals, we
are done, and P(1,1) is the only tuple in P.

5.3.5 Extensional And Intensional Predicates

A useful distinction
Extensional Predicates = Predicates whose relations are stored in the database.
Intensional Predicates = Relations which are *computed* by applying one or more
Datalog rules.

"extension" == "current instance of a relational DB"


so in the context of relational algebra, 'intensional' and 'extensional' have the
same idea of 'in the db' and 'computed' relations.
In the context of datalog, relations that correspond to any given predicate as
"Intentional" or "Extensional" in the same sense - 'stored in DB' is extensional,
computed is intentional'

"IDB" = Intensional database == predicates and attached relations that are computed
"EDB" = Extensional database == predicates and attached relations are stored in
database

e.g: The relation (and predicate) "Movies" is extensional (_ you can see the table
in the db)
Longmovies (relation and predicate) is intentional/computed.

Now:
1. An EDB predicate cannot appear in the head of a rule. It can appear in the
body.
2. Intensional predicates can appear in the head *or* body.
3. It is common to build up a single relation with multiple rules with the same
IDB in the head (of all the rules)
4. By using intensional predicates, we build up progressively more complicated
functions of the (_ "starting base =") EDB relations, just as we do with relational
algebra.

5.3.6 Datalog Rules Applied To Bags

Datalog is inherently a logic of sets. However (KEY) as long as there are no


negated relational subgoals, the ideas for evaluating datalog rules apply to bags
as well. When relations are bags, it is conceptually simpler to use the second
approach (first look for consistency in variable assignments in non negated
relational subgoals), substituting this variable assignment in all subgoal
predicates, checking that negated relational subgoals and arithmetic goals evaluate
to true, then generating tuples for the head relations).

Since we are now dealing with bags, we do not eliminate tuples in the head.
Moreover as we consider all combinations of tuples for the subgoals, a tuple
appearing n times in the relation for a subgoal, gets considered n times as the
tuple for the subgoal, each time in conjunction with all combinations of tuples for
the other subgoals.

Example 5.2.2

Consider the rule

H(x,z) <- R(x,y) AND S(y,z)

R(A,B) has the following tuples

A B
1 2
1 2

S(B,C) has the following tuples


B C
2 3
4 5
4 5
Data log check for bags: no negated relational subgoals,so proceed. (this is
important, there is no clearly defined meaning of arbitrary Datalog rules with
negated relational subgoals under the bag model)

table of combinations of tuples for each subgoal. each instance of tuple from R is
combined with each instance of tuples from S

R.A R.B x y S.B S.C y z consistent assignment to y?


(x,z) tuple for H

1 2 1 2 2 3 2 3 YES y = 2
(1,3)
1 2 1 2 4 5 4 5 No y = 2,4
----
1 2 1 2 4 5 4 5 No y = 2,4
----
1 2 1 2 2 3 2 3 YES y = 2
(1,3)
1 2 1 2 4 5 4 5 No y = 2,4
----
1 2 1 2 4 5 4 5 No y = 2,4
----

so *two* identical tuples are generated for the Head, since we are using bags.

Example 5.2.3

Consider a relation defined by

H(x,y) <- S(x,y) AND x > 1


H(x,y) <- S(x,y) AND y < 5

with

S(B,C) has the following tuples


B C
2 3
4 5
4 5

First rule puts all of the three tuples into H, since they all have x > 1.
Second rule puts only the tuple (2,3) into H since S,s 2nd and 3d tuples are
eliminated.

The resulting relation has two copies of (2,3) and two tuples (one from clause 1,
and one from clause 2) of (4,5)

5.3.7 Exercises For Section 5.3

5.4 Relational Algebra And Datalog


Relational Algebra operators can be mimicked by one or more datalog rules.

We consider here this mapping of rel alg operators to datalog

Any single safe datalog rule can be mimicked by relational algebra.


but(KEY) datalog is more powerful than relational algebra *when* several rules are
allowed to interact. Then they can express recursions not allowed in relational
algebra.

5.4.1 Boolean Operations

union, intersection, and difference, in relational algebra, can be expressed simply


in datalog.

We assume that R and S are relations with the same number of attributes n (and
matching attribute types?) a1,...2,a_n

Union
Two Rules.

One rule has R(a1,...,a_n) as the lone subgoal and the other has S(a1,..a_n) as the
lone subgoal, and both have the head Answer(a1,...a_n)

(_

Answer(a1, ... , a_n) <- R(a1,...,a_n)


Answer(a1, ... , a_n) <- S(a1,...,a_n)

As a result, each tuple from R and S are put into the answer relations.

Intersection

Answer <- R(a1,...,a_n) AND S(a1,...,a_n)

so a tuple is put into Answer only if it is in both R and S

Difference

Answer <- R(a1,...a_n) AND NOT S(a1,...a_n)

Variables Are Local

The names we choose the names for variables are arbitrary, and have no connection
to variables used in any other rule.

E.g:
U(x,y,z) <- R(x,y,z)
U(x,y,z) <- S(x,y,z)

is the same as
U(x,y,z) <- R(x,y,z)
U(a,b,c) <- S(a,b,c)

These two rules still compute the union of R and S.


(_ would this work only when S has exactly 3 attributes in that sequence,
especially if it is an extensional relation
)

5.4.2 Projection

To compute a projection of relation R, we use one rule with a single subrule with
predicate R. The arguments for this subgoal are distinct variables, one for each
attribute of R. The head has an atom with arguments that are variable corresponding
to attributes in the projection list, in the desired order

e.g:
A relation Movies with schema == Movies(title,year,length,genre, studioName,
producerC#)
To project onto its first three attributes, title, year and length, we use

P(t,y,l) <- Movies(t,y,l,g,s,p)

defines P to be the result of the projection

5.4.3 Selection

Selections can be difficult to express in Datalog.

The simple case is when the selection condition is an AND of one or more arithmetic
expressions. In that case, we create a rule with
one relational subgoal for the relation upon which we are performing the
selection. This atom has distinct variables, one for each attribute of the
relation.

for each comparison in the selection condition, an arithmetic subgoal that is


identical to the this comparison. However, while in the selection condition, an
attribute name was used, in the arithmetic subgoal we use the corresponding
variable, following the correspondence established by the relational
subgoal.occult/

Given the relation


Movies(title,year,length,genre, studioName, producerC#)
the selection
select _(length>100 AND studioName = "Fox") (Movies)

can be written as a Datalog rule

S(t,l,y,g,s,p) <- Movies(t,y,l,g,s,p) AND (l > 100) AND s = 'Fox')

The result is the relation S. Note that l and s are the (datalog) variables
corresponding to the attributes length, and studioName in the relation Movies, *in
the standard order of attributes* used in the relation Movies.

Now we look at selections which involve ORs of conditions.


We cannot necessarily replace such selections via single datalog rules.
However, (KEY) selection using OR of two conditions is equivalent to taking the
union of the results of the two conditions.
so we represent the OR of n conditions by writing n rules, all with the same
head, and each rule representing *one* disjunct

Example 5.27
sigma_[(l > 100) OR studioName = 'Fox'] (Movies)
corresponding datalog rules
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND l > 100
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND studioName = 'Fox'

More complex selection conditions can be formed by combinations of AND, OR, NOT
But we can transform this (technique not presented) to convert such queries into
disjunctive normal form, where the resultant query is a disjunction of conjuncts.

Then a dnf query can be represented by several datalog rules, one for each
conjunct, with the "union semantics" presented above.

Example 5.28
consider the relational expression
sigma_[NOT length >= 100 OR studioName = 'Fox' ](Movies)
converted into disjunctive normal form this becomes

sigma[length < 100 AND studioName != 'Fox] (Movies)

which becomes the datalog rule,


S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND l < 100 AND s != 'Fox'

Example 5.29
consider
sigma[NOT (length >= 100 AND studioName = 'Fox')] (Movies)
dnf form is
sigma [length < 100 OR studioName != 'Fox'] (Movies)

converting to datalog, we get, (_ using the 'union semantics' for OR from above)
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND l < 100
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND studioName != 'Fox'

5.4.4 Products

The product of two relations R,S can be expressed by a single datalog rule.
(KEY) this rule has two subgoals - one for R, one for S.
(KEY) each of these subgoals has distinct variables one for each attribute of R and
S.
the predicate in the head has as arguments all the variables that occur in either
subgoal predicate, with the variables of R appearing before the v ariables of S.

P(a,b,c,x,y,z) <- R(a,b,c) AND S(x,y,z)

defines P to be = R X S

5.4.5 Joins

We can take the natural join of two relations by a datalog rule that looks much
like the rule for a product.
(KEY) the difference is that what we want R join S, then we must use the same
variables for attributes of R and S that have the same name as and use different
variable names otherwise.

Example 5:31 Consider relations with schemas R(A,B), S(B,C,D) . Their natural join
maybe defined by the rule

J(a,b,c,d) <- R(a,b) AND S(b,c,d)

note the common b in both subgoals, which is where the natural join happens.

Converting theta joins to Datalog

(from 2.4.12 one can see a theta join as equivalent to a product followed by a
selection)
(_ product is expressed by a single datalog role R X S = P(a,b,c,x,y,z) <-
R(a,b,c) AND S(x,y,z),
selection is a bit complicated see above, a conversion to dnf may be necessary)

Example 5.32
Consider the relations U(a,b,c), V(b,c,d) and the theta join U theta_join V [A
< D AND U.b < V.b]

We construct the datalog rule

J(a,ub,uc,vb, vc,d) <- U(a, ub,uc) AND V(vb,vc,d) ;; this is the product
AND a < d AND ub != vb

to perform the same operation. Note the naming to distinguish common attributes ub,
vb etc. The first two subgoals reveal the 'source' of the variables in the result
clause, and also 'cover' each relation, the AND gives the cross product, the
remaining subgoals give the theta join condition.

If the condition on the theta join is not a conjunction, we convert to disjunctive


normal form, and we create one rule for each (disjunct) clause of the dnf form.
Then we add subgoals for each literal in each conjunct. The heads of all rules are
identical and have one argument for each attribute of the two relations being
joined.

Example 5.33
Consider U(a,b,c), V(b,c,d) and theta join is U theta_join V [A < D OR U.b <
V.b] ;; same as in prev example, OR replaces AND

the equivalent datalog is

J(a,ub,uc,vb,vc,d) <- U(a,ub,uc) AND V(vb,vc,d) AND a < d


J(a,ub,uc,vb,vc,d) <- U(a,ub,uc) AND V(vb,vc,d) AND ub < vb

There is a distinct rule for each side of the OR condition, with a shared head, and
in each there is a product of the two relations via ANDing 'covering' clauses.

5.4.6 Simulating Multiple Operations With Datalog

There are multiple concepts explained here. (confusing, IDB, EDB, seems to be
Intensional and Extensional Datalog (queries), but not sure how relevant here)

1. How to convert a relational expression to Relational Algebra


a) Create an expression tree for the relational expression
b) Create an IDB rule for each node of the true (why IDB? Proto answer: EDBs are
leaves?)
(i) The rule or rules *for each node* is whatever we need to apply the operator
at the node . some unclear lannguage here.
c) The IDB rule for the whole of the relational expression is the (datalog) rule
corresponding to the root of the expression tree.

Example 5.34
Consider the (relational) algebraic expression
project [title, year] ( select[length > 100] (Movies) INTERSECT
select[studioName = 'Fox'] (Movies) )

The tree is

(1)
Project(title,year)
|
|
(2)
INTERSECT
|
|
-------------------
| |
| |
| |
(3) (4)
Select[length >=100] Select[studioName = 'Fox']
| |
| |
| |
(5) (6)
Movies Movies

(_ note: leaves are relations in db, so EDB, all other nodes are computations, so
IDB)

We move bottom up
The root nodes are relations in DB, so Movies(t,y,l,g,s,p) == Nodes 5 and 6

for node (3)


W(t,y,l,g,s,p) <-- Movies(t,y,l,g,s,p) AND l >= 100

For node (4)


X(t,y,l,g,s,p) <-- Movies(t,y,l,g,s,p) AND studioName = 'Fox'

For node (2)


Y(t,y,l,g,s,p) <-- W AND X

For node (1)


Answer(t,y) <- Y(t,y,l,g,s,p)
IMPORTANT: The identifiers can be replaced by their values so starting with

Answer(t,y) <- Y(t,y,l,g,s,p)

we get

Answer(t,y) <- W AND X

to

Answer(t,y) <- Movies(t,y,l,g,s,p) AND l < 100 AND Movies(t,y,l,g,s,p) AND


studioName = 'Fox'

eliminating the redundant Movies clause

Answel(t,y) <- Movies(t,y,l,g,s,p) AND l < 100 AND studioName = 'Fox'

5.4.7 Comparison Between Datalog And Relational Algebra

Every expression in *basic* relational algebra can be expressed in datalog.

There are operators in the extended relational algebra such as grouping and
aggregation, that have no corresponding features in the datalog version presented
here.

Likewise datalog does not do (some?) bag operations like duplicate elimination.

Any *single* datalog rule can be expressed in equivalent relational algebra.

When we look at *collections* of datalog rules. Datalog can express recursion,


which relational algebra cannot.

Example 5.35

Suppose we have a relation Edge(X,Y) which says there is a directed edge from
node X to node Y. The transitive closure of this relation, i.e, the relation
Path(X,Y) which means there is a sequence of one or more edges starting from X and
ending in Y, can be expressed as

Path (X,Y) <- Edge(X,Y) ; every edge is a path


Path (X,Y) <- Edge (X,Z) AND Path(Z,Y) ; exist edge from X to Z and exists a
path from Z to Y => exists path from X to Y

(_ the recursion comes from the same expression on either side of the <-- )

5.4.8 Exercises For Section 5.4

5.5 Summary Of Chapter 5


5.6 Exercise For Chapter 5

You might also like