chp5 Bags
chp5 Bags
chp5 Bags
Summary:
(1) Let R and S be bags of tuples and let tuple t appear m times in R, n times in S
(m and n can be zero).
then
the bag R union S contains m + n occurrences of t
the bag R intersection S contains min (m,n) occurrence of t
the bag R diff S contains max(0, m-n) occurrences of t. intuitively (when R and
S contain tuples of t) each occurence of t in S 'cancels' one occurence of t in R
(2)
we extend the set based relational algebra of 2.4 to bags, which better reflects
how relational algebra is implemented in practice.
we also extend the algebra so it can handle more operations than previously. (e.g
aggregation)
/REM
Key Difference: Relations that are *bags* of tuples rather than *sets* of tuples as
before.
A B
1 2
3 4
1 2
1 2
Here the tuple (1,2) appears 3 times and the tuple (3,4) appears once.
If this were a set based relation we would have to remove 2 instances of the (1,2)
relation
In bag based relations, we *do* allow repetition of tuples, but as with sets *order
does not matter*.
1. Commercial DBMS implement relations that are bags (rather than sets) of tuples.
2. An important motivation for bags is that some relational operations are
considerably more efficient than using sets.
Examples
1. To take the union of two tuples, we just copy one relaton and add to it all
the tuples of the other relation. There is no need to eliminate duplicates as with
sets.
2. When we project a set based relation (retain only some columns), we need to
compare each resulting tuple with all the tuples projected so for, to ensure no
duplicates, whereas with bag based relations, we simply project each tuple, and add
it to the result, with no elimination of duplicates necessary.
Consider the relation
A B C
1 2 5
3 4 6
1 2 7
1 2 8
A B
1 2
3 4
1 2
1 2
If we were using set based relations we would need to eliminate the duplicated
(1,2) tuples from the result.
There are some situations where only bag based relations give the correct result.
Example: Consider taking the average of the values of the A column of the above
relation.
Let R and S be bags of tuples and let tuple t appear m times in R, n times in S (m
and n can be zero).
then
the bag R union S contains m + n occurrences of t
the bag R intersection S contains min (m,n) occurrence of t
the bag R diff S contains max(0, m-n) occurrences of t. intuitively (when R and
S contain tuples of t) each occurence of t in S 'cancels' one occurence of t in R
Let R be
A B
1 2
3 4
1 2
1 2
Let S be
1 2
3 4
3 4
5 6
Then R Union S
1 2
1 2
1 2
1 2
3 4
3 4
3 4
5 6
R intersect S
A B
1 2
3 4
R difference S
A B
1 2
1 2
S difference R
A B
3 4
5 6
Let R =
A B C
1 2 5
3 4 6
1 2 7
1 2 8
A B
1 2
3 4
1 2
1 2
Key Idea: == Each set can be considered to be a bag with no duplicates, and then
*some of* the bag operations 'fall back into' the set operations.
E.g R intersect S . If R and S don't have any duplicates, but we use the bag
intersect anyway (min (m,n)) we get exactly the result of the Set intersect.
Similarly difference.
But union works differently. If t occurs (once) in both R and S, then the bag union
will have two instances of t, and the set union has 1 instance.
A B C
3 4 6
1 2 7
1 2 7
(KEY) there are laws that hold when relations are sets that do *not* hold when
relations are bags.
1 != 0 so 'law' fails.
with sets, both sides resolve to relations with 0 t tuples, so law holds.
Cartesian product for bags - each tuple from one relation is paired with each tuple
of the other, *regardless of whether the result is a duplicate or not*. Therefore
if a tuple t occurs m times in relation R, n times in relation S, the tuple rs
will appear in R X S (mn) times.
Example: Let R be
A B
1 2
1 2
Let S be
A B
2 3
4 5
4 5
then R X S is
A R.B S.B C
1 2 2 3
1 2 4 5
1 2 4 5
1 2 2 3
1 2 4 5
1 2 4 5
No surprises. We take each relation of one tuple with each relation of the other,
decide (based on conditions of the join) if this pair of tuples joins successfully,
and if so, add it to the result, *ignoring* duplicates.
Example: Let R be
A B
1 2
1 2
Let S be
B C
2 3
4 5
4 5
A B C
1 2 3
1 2 3
A R.B S.B C
1 2 4 5
1 2 4 5
1 2 4 5
1 2 4 5
Exercise 5.1.4 Certain algebraic laws for relations as sets also hold a relations
for bags. Explain why each of the laws below holds for bags as well as for sets.
1. R U (S U T) = (R U S) U T
ideally we want a proof here for bags (the above is true for sets).
for now a brief outline. Let R have m tuples of t, S have n, T have t, assume for
convenience all != 0
for bags
LHS (S U T) has n + t copies. R U (S U T) has m + n + t copies.
RHS (R U S) has m + n copies. (R U S) U T has m + n + t copies.
so law holds for sets and bags.
for bags:
LHS: R intersect S gives (min m n). T intersect that gives (min t (min m
n))
RHS: S intersect T gives (min n t) tuples. R intersec that give (min m
(min n t))
since (min c (min a b)) = (min a (min b c)) = (min a b c), LHS = RHS
For a tuple t in R, S have n tuples it joins to, and for a tuple u in S, T have p
tuples it joins to
then
for sets
LHS has 1 tuple for m != 0 and n! = 0 and if either has 0 tuples, the union has 0
which is true for RHS also.
for bags
both sides end with m + n tuples
The following algebraic laws hold for sets but not for bags.
Explain why they hold for sets and give counterexamples to show that they don't
hold for bags.
RHS
S - T gives 2 instances of t (max 0 (3 -1))
R intersect that gives 1 instance of t (min 1 2)
LHS != RHS
RHS:
R intersect T gives 1 tuple . R intersect T gives 1 tuple.
Union gives 2 tuples
LHS != RHS
Languages like SQL have operations that have proved useful (but go beyond the bag
based and set based relational algebra we dealt with so far)
1. duplicate elimination operator delta changes bags to sets by eliminating
duplicates.
2. aggregation operators - sums, averages, are not operations of relational
algebra, but *are used* by the grouping operator (see below)
3. (once we have some tuples) grouping them in accordance with the values of
one or more attributes. This effectively *partitions* tuples into groups (_ but
(KEY) only *one* entry for all tuples with the same value in the grouping
attributes). Aggregations can then be applied to the columns of tuples within
groups, giving us the ability to express a number of queries that are difficult to
express in classical relational algebra.
The grouping operator gamma, combines the effect of grouping and aggregation.
to convert a bag to a set, we use delta (R) that returns a set that has exactly one
copy of every tuple that appears in relation R
Let R be
A B
1 2
3 4
1 2
1 2
The tuple (1, 2) that appears thrice in R appears only once in the result
Example 5.9
Consider the relation R =
A B
1 2
3 4
1 2
1 2
then
SUM(A) = 1 + 3 + 1 + 1 = 6
AVG(A) = (1 + 3 + 1 + 1)/4 = 1.5
MIN(A) = 1
MAX(A) = 3
COUNT (A) = #{_b 1, 3, 1, 1} = 4
5.2.3 Grouping
Often we do not want to aggregate the values of an entire column. Instead we want
to arrange tuples into groups using the value of one or more colummns, then
aggregate 'by groups'.
we want the total number of minutes of movies produced by each studio, the
resulting relation should be something like
studioName sumOfLengths
Disney 12345
MGM 34567
... ....
we need to
1. group the tuples of this relation by studioName
2. sum the length column within each group (all of whose entries have the same
studioName)
(_ the key point is that L has two kinds of elements - grouping attributes and then
an aggregation operator (which in turn has a new name for the result)
Example: 5.10
consider the relation
These two steps give one row for each star, with starName, minyear
but, there is a condition that the stars for whom we do this
computation should have finished at least 3 movies, so we must also compute the
COUNT(title) aggregate for each group. (this _muddles the example imo)
This results in the replication of the tuples in StarsIn, but with all tuples
having the same starname having their titles counted, and value put in a new
column, ctTitle, and minimum of the years their movies appeard in, put in a new
column min year.
Then onto *this* tuple if you do a selection with ctTitle >= 3 and on that result
do a selection of only starName + minYear, (+ elimination of duplicates? no, this
is built into grouping) we get the desired result.
*gamma* can be seen as an extension of the projection operator, again with all
attributes of the relation as the subscript list.
If R is a set, the output of this type of gamma and a projection with all
attributes is identical.
But R is a bag, the gamma with all attributes eliminates duplicates, but projection
with all attributes do not.
In the 'classical' projection operator encountered earlier, pi_L (R), L was a list
of attributes of R (that we want in the result).
We extend this by allowing *computations* with components of tuples. so, now L is a
list of *either*
a) a single attribute of R (which is present in the result)
b) an expression x -> y where x and y are names for attributes, by which we ask
that 'take attribute x of R, *rename* to y in the result'
c) an expression E -> z where E is an expression consisting of attributes of R,
constants, arithmetic operators, string operators, and zi is a *new* name for the
result of the computation E encodes.
example: a + b -> x, represents the sum of attributes a and b renamed x. c ||
d -> e means the concatenate the presumably string valued attributes c and d, and
call the result e.
The result of the projection is computed by considering each tuple of R *in turn*,
the result is a relation whose schema is the names of the attributes on list L,
with whatever renaming the list specifies.
note: Duplicate tuples in R create duplicate tuples in the result. (but even if R
does not have duplicates, the result can).
e.g: let R be
A B C
0 1 2
0 1 2
3 4 5
A X
0 3
0 3
3 9
X Y
1 1
1 1
1 1
note that two distinct tuples of R have been converted to the same tuple 1 1
The expression Tau_L (R) , with R a relation, L a list of some of the attributes
of R, is the relation R, but with the tuples sorted in the order of attributes
specified in L . If L is the list A1, A2.... then tuples of R are sorted *first* by
the attribute A1. Ties are broken by the order of A2. Ties that remain after
ordering by An are ordered arbitrarily.
Notes:
joins on the result of sorts make the sort meaningless. Elements on the *list*
should be treated as a bag, not a list (??)
however bag projections can be made to preserve the order
also a selection on the list (returned by the sort operator) drops the tuples
that do not satisfy the selection condition, but the remaining (selected) tuples
can be made to appear in sorted order.
A B C
1 2 3
4 5 6
7 8 9
B C D
2 3 10
2 3 11
6 7 12
R naturaljoin C gives
A B C D
1 2 3 10
1 2 3 11
One of the features of the natural join is that some tuples in each relation are
excluded from the join (the 2nd,3d tuple in R, the 3d tuple in S). These are
'dangling tuples'. Where we, for some reason also want the dangling tuples in the
join, we use a variation called the 'outer join'.
First the 'natural' outer join, with the join on equated values of common
attributes
in this, we start with the natural join, then *add* dangling tuples from both
relations, with a null value (the bottom symbol in the text, here represented with
N) for the attributes the dangling tuples don't possess.
so R naturalOUTERjoin S gives
A B C D
1 2 3 10
1 2 3 11
4 5 6 N ;from R
7 8 9 N ;from R
N 6 7 12 ;from S
(KEY Insight == those tuples from either side of the natural join that do not *make
it into the resulting relation* are the dangling tuples. Pull them in, fill up
missing columns (from other relation) with Null.)
In the Left Outer Join, only the dangling tuples from the relation on the left are
added to the resultant relation (and padded with NULLs)
A B C
1 2 3
4 5 6
7 8 9
S
B C D
2 3 10
2 3 11
6 7 12
R LeftOuterJoin C gives
A B C D
1 2 3 10
1 2 3 11
step 2. add 'left out' tuples from the left relation (here R), pad column D with
nulls
A B C D
1 2 3 10
1 2 3 11
4 5 6 N
7 8 9 N
For the Right Outer Join, only dangling tuples from the right hand side relation
(here S) are added to the join so
A B C
1 2 3
4 5 6
7 8 9
B C D
2 3 10
2 3 11
6 7 12
R RightOuterJoin C gives
A B C D
1 2 3 10
1 2 3 11
step 2. add 'left out' tuples from the right relation (here S), pad column A with
nulls
A B C D
1 2 3 10
1 2 3 11
N 6 7 12
all three natural outer joins have their theta join equivalents.
first, a theta join is taken, and then tuples that didn't participate are
added, and padded with nulls.
A B C
1 2 3
4 5 6
7 8 9
B C D
2 3 10
2 3 11
6 7 12
gives
1. first theta join
the tuples that didn't participate is the first tuple in R and third tuple in S.
pad with nulls
<step>
If R is a relation with attributes in a fixed order, (KEY) we use R as a (datalog)
predicate name,
KEY: The atom R(a1,a2, ... a-n) has the value TRUE if (a1, a2, .. . a-n) is a tuple
present in relation R. else this atom has the value false.
For now, a relation defined by (_as?) a predicate is assumed to be a set. Later we
see how datalog can be extended to bags.
Let Relation R be
A B
1 2
3 4
then R(1,2) is TRUE, R(3,4) is true, and R(x,y) for any other values of (x,y) is
FALSE
The former kinds are called relational atoms, and can change, when they correspond
to tuples in database relations, which are finite. These are arithmetic relations
which are infinite and immutable.
Rules consists of
a. A relational atom called a head
b. The symbol <-- which we read as 'if'
c. A body, - consisting of one or more atoms, called subgoals, *which could be
relational *or* arithmetic*.
subgoals are connected by AND
subgoals can be negated by NOT
example:
LongMovie(t,y) <-- Movies(t,y,l,g,s,p) AND l >= 100 (_ interesting how the
second subgoal can use attributes of the first subgoal, so l has to be bound before
the second subgoal can be evaluated to test for truth?)
defines the *set* of long movies those at least 100 minutes long. One of the
subgoals refers to our standard relation Movies with the schema
(KEY) We can see this relational atom (Movies, the subgoal of the LongMovies rule)
as saying "Let (t,y,l,g,s,p) be a tuple in the current instance of the relation
Movies". More precisesly Movies(t,y,l,g,s,p) is *true* exactly when the relation
Movies has tuple with the same values (as held in the variables of the relational
atoms Movies.
Similarly the second subgoal l >= 100 is true whenever the length component of a
tuple in Movies is at least 100.
LongMovies(t,y) is true whenever we can find a tuple in (the relation) Movies with
a) (the values of variables) t and y as the first two components - title and
year -of that tuple
b) (the tuple has) a third component length with >100 as value
c) *any* values in components 4 thru 6
Notice that this rule is equivalent to this relational algebra "assignment form"
Queries in Datalog
Thus in the above example, (the resulting relation) LongMovie is the answer to the
query.
(KEY) If there is more than one relation in the rule head, one of these relations
is the answer to the query, while the others assist in the definition of the
answer. While there are several predicates defined by a collection of rules, we
usually assume that the query result is termed Answer ( I think that the point is
that several Relations - from the head - result as 'answer' and we use a convention
to pick on to be "the" answer)
Anonymous variables
Frequently Datalog rules have variables that appear only once. ( I think this means
that variables don't appear in head *and* body of rule). The names used for these
variables are irrelevant. Only when a variable appears more than once do we care
about the name, so we can see it is the same variable in its second and further
appearances.
the variables g,s,p that appear only once in the original have been replaced by
underscores. we cannot replace any of the other variables - t,y,l - because all of
them appear more than once in this rule.
The rule
For the Longmovies rule, we imagine the variables (t,y,l,g,s,p) ranging over all
possible values (here, tuples to check fit are drawn from the Movies relation, so
this just means we consider all tuples of the Movies relation and choose those that
make both subgoals true). The only values of variables that make all subgoals true
are those tuples of Movies where l >= 100. We put the values of t and y (from each
tuple of the relation Movies where l > 100) into a tuple (t,y) and add to (the set
of tuples that is) the relation LongMovie
Rule Safety
Another way to define the meaning of rules, assuming safe rules as above
Example 5.2.1
Consider the datalog rule
P(x,y) <-- Q(z,x) AND R(z,y) AND NOT Q(x,y)
analysis of body ,
two non negated relational subgoals Q, R one negated relational subgoal NOT Q
Step> we consider only the non negated relational goals and look for a variable
assignment that gives consistency.
Let relation Q =
1,2
1,3
Let relation R =
2,3
3,1
There are two non negated subgoals, Q, and R, so we have to consider all
combination of assignments of tuples from relations Q and R respectively to these
subgoals
The first option assigns a consistent value to z {z =2} and {x = 1, y = 3}. Now
that we have a consistent assignment of variables, we check the other (negated
relational, or arithmetic) subgoals (here only NOT Q(1,3) ) and check if it (==
the *negated* subgoal) evaluates to TRUE. It doesn't (it evaluates to false)
The second and third options are not consistent. Each assigns two different values
to the variable z, and so are inconsistent. We don't evaluate negated relational,
or arithmetic subgoals.
The third option assigns a consistent value t z {z = 3},. So now we check NOT
Q(x,y) . Q(1,1) = F. Negating this gives true. So we take the assignment {x = 1,
y=1, z=2} produce the head clauses which give P(x,y) = P(1,1).
Since we have exhausted all possible tuple assignments for non negated subgoals, we
are done, and P(1,1) is the only tuple in P.
A useful distinction
Extensional Predicates = Predicates whose relations are stored in the database.
Intensional Predicates = Relations which are *computed* by applying one or more
Datalog rules.
"IDB" = Intensional database == predicates and attached relations that are computed
"EDB" = Extensional database == predicates and attached relations are stored in
database
e.g: The relation (and predicate) "Movies" is extensional (_ you can see the table
in the db)
Longmovies (relation and predicate) is intentional/computed.
Now:
1. An EDB predicate cannot appear in the head of a rule. It can appear in the
body.
2. Intensional predicates can appear in the head *or* body.
3. It is common to build up a single relation with multiple rules with the same
IDB in the head (of all the rules)
4. By using intensional predicates, we build up progressively more complicated
functions of the (_ "starting base =") EDB relations, just as we do with relational
algebra.
Since we are now dealing with bags, we do not eliminate tuples in the head.
Moreover as we consider all combinations of tuples for the subgoals, a tuple
appearing n times in the relation for a subgoal, gets considered n times as the
tuple for the subgoal, each time in conjunction with all combinations of tuples for
the other subgoals.
Example 5.2.2
A B
1 2
1 2
table of combinations of tuples for each subgoal. each instance of tuple from R is
combined with each instance of tuples from S
1 2 1 2 2 3 2 3 YES y = 2
(1,3)
1 2 1 2 4 5 4 5 No y = 2,4
----
1 2 1 2 4 5 4 5 No y = 2,4
----
1 2 1 2 2 3 2 3 YES y = 2
(1,3)
1 2 1 2 4 5 4 5 No y = 2,4
----
1 2 1 2 4 5 4 5 No y = 2,4
----
so *two* identical tuples are generated for the Head, since we are using bags.
Example 5.2.3
with
First rule puts all of the three tuples into H, since they all have x > 1.
Second rule puts only the tuple (2,3) into H since S,s 2nd and 3d tuples are
eliminated.
The resulting relation has two copies of (2,3) and two tuples (one from clause 1,
and one from clause 2) of (4,5)
We assume that R and S are relations with the same number of attributes n (and
matching attribute types?) a1,...2,a_n
Union
Two Rules.
One rule has R(a1,...,a_n) as the lone subgoal and the other has S(a1,..a_n) as the
lone subgoal, and both have the head Answer(a1,...a_n)
(_
As a result, each tuple from R and S are put into the answer relations.
Intersection
Difference
The names we choose the names for variables are arbitrary, and have no connection
to variables used in any other rule.
E.g:
U(x,y,z) <- R(x,y,z)
U(x,y,z) <- S(x,y,z)
is the same as
U(x,y,z) <- R(x,y,z)
U(a,b,c) <- S(a,b,c)
5.4.2 Projection
To compute a projection of relation R, we use one rule with a single subrule with
predicate R. The arguments for this subgoal are distinct variables, one for each
attribute of R. The head has an atom with arguments that are variable corresponding
to attributes in the projection list, in the desired order
e.g:
A relation Movies with schema == Movies(title,year,length,genre, studioName,
producerC#)
To project onto its first three attributes, title, year and length, we use
5.4.3 Selection
The simple case is when the selection condition is an AND of one or more arithmetic
expressions. In that case, we create a rule with
one relational subgoal for the relation upon which we are performing the
selection. This atom has distinct variables, one for each attribute of the
relation.
The result is the relation S. Note that l and s are the (datalog) variables
corresponding to the attributes length, and studioName in the relation Movies, *in
the standard order of attributes* used in the relation Movies.
Example 5.27
sigma_[(l > 100) OR studioName = 'Fox'] (Movies)
corresponding datalog rules
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND l > 100
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND studioName = 'Fox'
More complex selection conditions can be formed by combinations of AND, OR, NOT
But we can transform this (technique not presented) to convert such queries into
disjunctive normal form, where the resultant query is a disjunction of conjuncts.
Then a dnf query can be represented by several datalog rules, one for each
conjunct, with the "union semantics" presented above.
Example 5.28
consider the relational expression
sigma_[NOT length >= 100 OR studioName = 'Fox' ](Movies)
converted into disjunctive normal form this becomes
Example 5.29
consider
sigma[NOT (length >= 100 AND studioName = 'Fox')] (Movies)
dnf form is
sigma [length < 100 OR studioName != 'Fox'] (Movies)
converting to datalog, we get, (_ using the 'union semantics' for OR from above)
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND l < 100
S(t,y,l,g,s,p) <- Movies(t,y,l,g,s,p) AND studioName != 'Fox'
5.4.4 Products
The product of two relations R,S can be expressed by a single datalog rule.
(KEY) this rule has two subgoals - one for R, one for S.
(KEY) each of these subgoals has distinct variables one for each attribute of R and
S.
the predicate in the head has as arguments all the variables that occur in either
subgoal predicate, with the variables of R appearing before the v ariables of S.
defines P to be = R X S
5.4.5 Joins
We can take the natural join of two relations by a datalog rule that looks much
like the rule for a product.
(KEY) the difference is that what we want R join S, then we must use the same
variables for attributes of R and S that have the same name as and use different
variable names otherwise.
Example 5:31 Consider relations with schemas R(A,B), S(B,C,D) . Their natural join
maybe defined by the rule
note the common b in both subgoals, which is where the natural join happens.
(from 2.4.12 one can see a theta join as equivalent to a product followed by a
selection)
(_ product is expressed by a single datalog role R X S = P(a,b,c,x,y,z) <-
R(a,b,c) AND S(x,y,z),
selection is a bit complicated see above, a conversion to dnf may be necessary)
Example 5.32
Consider the relations U(a,b,c), V(b,c,d) and the theta join U theta_join V [A
< D AND U.b < V.b]
J(a,ub,uc,vb, vc,d) <- U(a, ub,uc) AND V(vb,vc,d) ;; this is the product
AND a < d AND ub != vb
to perform the same operation. Note the naming to distinguish common attributes ub,
vb etc. The first two subgoals reveal the 'source' of the variables in the result
clause, and also 'cover' each relation, the AND gives the cross product, the
remaining subgoals give the theta join condition.
Example 5.33
Consider U(a,b,c), V(b,c,d) and theta join is U theta_join V [A < D OR U.b <
V.b] ;; same as in prev example, OR replaces AND
There is a distinct rule for each side of the OR condition, with a shared head, and
in each there is a product of the two relations via ANDing 'covering' clauses.
There are multiple concepts explained here. (confusing, IDB, EDB, seems to be
Intensional and Extensional Datalog (queries), but not sure how relevant here)
Example 5.34
Consider the (relational) algebraic expression
project [title, year] ( select[length > 100] (Movies) INTERSECT
select[studioName = 'Fox'] (Movies) )
The tree is
(1)
Project(title,year)
|
|
(2)
INTERSECT
|
|
-------------------
| |
| |
| |
(3) (4)
Select[length >=100] Select[studioName = 'Fox']
| |
| |
| |
(5) (6)
Movies Movies
(_ note: leaves are relations in db, so EDB, all other nodes are computations, so
IDB)
We move bottom up
The root nodes are relations in DB, so Movies(t,y,l,g,s,p) == Nodes 5 and 6
we get
to
There are operators in the extended relational algebra such as grouping and
aggregation, that have no corresponding features in the datalog version presented
here.
Likewise datalog does not do (some?) bag operations like duplicate elimination.
Example 5.35
Suppose we have a relation Edge(X,Y) which says there is a directed edge from
node X to node Y. The transitive closure of this relation, i.e, the relation
Path(X,Y) which means there is a sequence of one or more edges starting from X and
ending in Y, can be expressed as
(_ the recursion comes from the same expression on either side of the <-- )