Province of Rizal
Binangonan Campus
COLLEGE OF ACCOUNTANCY
CHAPTER 8:
Data Structures and CAATTs for
Data Extraction
Submitted by:
Anore, Louis Antoinette G.
Biteranta, Janelle B.
Grantusa, Claire Euniz E.
Haral, Ma. Veronica B.
Quisquino, Angel Rose A.
BSA 3-2
Submitted to:
Prof. Liberty T. Ocampo, CPA, MBA
LEARNING OBJECTIVES:
This chapter examines data structures and the use of computer-assisted audit tools and techniques (CAATTs) for data extraction and analysis. It begins with an examination of data structures, emphasizing their significance in both the physical and logical organization of data within files and databases. Various structures, such as flat-file, navigational database, and relational database structures, are reviewed, with particular focus on relational databases because of their prevalence in contemporary business environments.
CAATT data extraction software falls into two general categories: embedded audit
modules and generalized audit software. The chapter describes the features,
advantages, and disadvantages of the embedded audit module (EAM) approach. It then
outlines typical functions and uses of generalized audit software (GAS). The chapter
closes with a review of the key features of ACL (Audit Command Language), the leading
product in the GAS market.
DATA STRUCTURES
Data structures have two fundamental components: organization and access method.
Organization refers to the way records are physically arranged on the secondary
storage device. This may be either sequential or random. The records in sequential files
are stored in contiguous locations that occupy a specified area of disk space. Records in
random files are stored without regard for their physical relationship to other records
of the same file. Random files may have records distributed throughout a disk. The
access method is the technique used to locate records and to navigate through the
database or file. While several specific techniques are used, in general, they can be
classified as either direct access or sequential access methods.
Flat-File Structures
The flat-file approach is a single-view model that characterizes legacy systems. Data files
are structured, formatted, and arranged to suit the specific needs of the owner or
primary user. Such structuring, however, may omit or corrupt data attributes that are
essential to other users, thus preventing successful integration of systems across the
organization.
SEQUENTIAL STRUCTURE
This structure organizes the records in a file in a specific sequence, such as
ascending or descending order, based on the values of their primary keys. Records are
stored one after another in contiguous storage spaces.
Sequential files are efficient for tasks like reading the entire file or
processing a large portion of it at once, such as updating numerous records in
bulk. However, they become inefficient for small operations or when direct access
to specific records is required. Retrieving a single record necessitates scanning
through all preceding records, making it impractical for such tasks. Direct access
is limited in sequential files, requiring alternative data structures for tasks
needing it.
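To make the trade-off concrete, the short Python sketch below shows why retrieving a single record from a sequential file requires reading every record that precedes it. The record layout and key values are illustrative assumptions, not data from the chapter.

```python
# Minimal sketch: a sequential file organized in ascending primary-key order.
# Finding one record means reading every record that precedes it.

records = [  # illustrative records, stored contiguously in key order
    {"key": 1001, "desc": "Bolt"},
    {"key": 1002, "desc": "Nut"},
    {"key": 1003, "desc": "Washer"},
]

def sequential_lookup(records, target_key):
    reads = 0
    for record in records:            # contiguous storage: read in physical order
        reads += 1
        if record["key"] == target_key:
            return record, reads      # found only after scanning all predecessors
        if record["key"] > target_key:
            break                     # sorted order lets us stop early on a miss
    return None, reads

record, reads = sequential_lookup(records, 1003)
print(record, "after", reads, "reads")  # the last record costs a full scan
```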
INDEXED STRUCTURE
An indexed structure comprises both the data file and a separate index file
containing record addresses. Each entry in the index file corresponds to a record
in the data file and contains the numeric location of that record's physical storage
on the disk.
In an indexed random file, records are stored scattered across the disk
without regard for their physical proximity. Searching and retrieval are facilitated
by an index that maps key values to storage locations. Adding records involves
selecting an available disk location and updating the index accordingly. Indexes
can be organized either sequentially or randomly, with sequential indexes
offering rapid search capabilities despite being more complex to maintain.
Algorithms can be utilized to expedite searches through the index, which is
particularly advantageous for large data files.
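The sketch below illustrates the idea in Python: a separate index maps primary keys to physical addresses, so searching is done in the index rather than in the data file. The addresses, keys, and field names are illustrative assumptions.

```python
# Minimal sketch of an indexed random file: records sit at arbitrary disk
# addresses, and a separate index maps primary keys to those addresses.

data_file = {  # address -> record (records scattered, no physical order)
    9042: {"key": 1003, "desc": "Washer"},
    1277: {"key": 1001, "desc": "Bolt"},
    5530: {"key": 1002, "desc": "Nut"},
}

index = {1001: 1277, 1002: 5530, 1003: 9042}  # key -> physical address

def indexed_lookup(key):
    address = index.get(key)           # search the index, not the data file
    return None if address is None else data_file[address]

def add_record(key, record, free_address):
    data_file[free_address] = record   # store at any available disk location
    index[key] = free_address          # then update the index accordingly

print(indexed_lookup(1002))
```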
HASHING STRUCTURE
A hashing structure utilizes an algorithm to directly convert a record's
primary key into a storage address, eliminating the need for a separate index.
This approach streamlines record retrieval by calculating addresses rather than
referencing an index.
Hashing algorithms divide primary keys by a prime number to generate
storage locations, enabling faster record retrieval compared to indexed
structures. This method bypasses index searches by algorithmically determining
storage addresses. For instance, in an inventory file with 100,000 items, the
inventory number is divided by a prime number to determine the storage
location. Hashing often results in random file organization due to the dispersed
nature of record addresses across the storage device.
● Advantage of Access Speed: The primary advantage of hashing is its fast access
speed. Calculating a record's address is quicker than searching for it through an index.
● Disadvantages of Hashing:
1. Inefficient Storage Usage: Hashing does not utilize storage space efficiently.
The chosen storage location for a record is determined solely by its primary key
value, leading to potential wastage of disk space, with as much as one-third of the
disk pack being unused.
2. Collision Problem: Different record keys may produce the same or similar
residuals, resulting in collisions where two records attempt to occupy the same
storage location. This slows down access to records and complicates record
deletion operations.
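The Python sketch below illustrates division-remainder hashing and the collision problem described above. The prime divisor, the key values, and the use of linear probing to resolve collisions are illustrative assumptions; the chapter does not prescribe a particular collision-resolution method.

```python
# Minimal sketch: hashing converts a primary key directly into a storage
# address, so no index search is needed.

PRIME = 99991             # prime near the 100,000-slot file in the example (assumed)
storage = {}              # slot number -> record

def hash_address(primary_key):
    return primary_key % PRIME        # residual of dividing the key by a prime

def store(record):
    slot = hash_address(record["key"])
    while slot in storage:            # collision: another key hashed to this slot
        slot = (slot + 1) % PRIME     # probe the next slot (slows access)
    storage[slot] = record
    return slot

def retrieve(primary_key):
    slot = hash_address(primary_key)
    while slot in storage:
        if storage[slot]["key"] == primary_key:
            return storage[slot]
        slot = (slot + 1) % PRIME     # follow the probe chain past collisions
    return None

store({"key": 186452, "desc": "Gasket"})
print(retrieve(186452))
```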
POINTER STRUCTURES
Figure 8.6 illustrates the pointer structure used to create linked-list files.
This method involves storing the address (pointer) of a related record in a field
within each record. Records are spread across the disk without regard for their
physical proximity, with pointers establishing connections between them. The
last record in the list typically contains an end-of-file marker.
Pointers within a file can link records together, as seen in the example
where Record 124 points to Record 125, Record 125 points to 126, and so forth.
This creates a sequence of linked records facilitating traversal through the file.
Pointers may also link records between different files, as depicted in Figure 8.7.
For instance, an accounts receivable (AR) record may have pointers linking it to
the next AR record within the AR file and to associated sales invoices and
remittance advice records. This facilitates accessing related records across
multiple files. By accessing an accounts receivable record, all associated sales
invoices and remittances can be located, aiding in account management and
transaction tracking. New transaction records are appended to the appropriate
linked-list, with their addresses stored in preceding records for future access.
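The Python sketch below illustrates a linked-list file: each record stores the address of the next related record, and traversal follows the pointers until the end-of-file marker is reached. The addresses and field names are illustrative assumptions.

```python
# Minimal sketch of a linked-list file: each record carries a pointer field
# holding the address of the next related record.

disk = {  # address -> record; records scattered without physical ordering
    124: {"acct": "AR-124", "balance": 500.00, "next": 125},
    125: {"acct": "AR-125", "balance": 250.00, "next": 126},
    126: {"acct": "AR-126", "balance": 900.00, "next": None},  # end-of-file marker
}

def traverse(start_address):
    """Follow pointers from record to record until the end of the list."""
    address = start_address
    while address is not None:
        record = disk[address]
        yield record
        address = record["next"]        # pointer to the next related record

for rec in traverse(124):
    print(rec["acct"], rec["balance"])
```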
● Physical Address Pointer: This type of pointer contains the actual disk
storage location (cylinder, surface, and record number) required by the disk
controller to access the record directly. It offers speed advantages as no further
manipulation is needed to determine the record's location. However, it poses two
disadvantages: (1) If the related record is moved, the pointer must be updated,
which can be problematic during disk reorganizations or copies; (2) Physical
pointers lack a logical relationship to the records they identify, meaning if a
pointer is lost, the referenced record is also lost.
● Relative Address Pointer: A relative address pointer specifies the position of
a record within the file, such as the 135th record. To access the record, further
manipulation is required to convert the relative address to the actual physical
address. Conversion software calculates this using the physical address of the
file's beginning, the record length, and the relative address being sought.
● Logical Key Pointer: This pointer contains the primary key of the related
record. A hashing algorithm converts this key value into the record's physical
address. This method allows for quick access to the record by utilizing the logical
relationship between keys and their corresponding records.
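As a small illustration of the relative address pointer, the Python sketch below converts a relative record position into a physical address from the file's starting address and record length. The starting address and record length are assumed values.

```python
# Illustrative conversion of a relative address pointer to a physical address.

FILE_START = 40_000      # physical address where the file begins (assumed)
RECORD_LENGTH = 256      # fixed record length in bytes (assumed)

def relative_to_physical(relative_position):
    """The 135th record lies 134 full records past the start of the file."""
    return FILE_START + (relative_position - 1) * RECORD_LENGTH

print(relative_to_physical(135))   # physical byte address of the 135th record
```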
The most commonly used operations for extracting data from relational tables are restrict, project, and join:
❖ Restrict: Extracts specified rows from a specified table. This operation creates a virtual table
that is a subset of the original table.
❖ Project: Extracts specified attributes (columns) from a table to create a virtual table.
❖ Join: Builds a new physical table from two tables, consisting of all concatenated pairs of
rows from each table.
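The Python sketch below approximates the three operations using lists of dictionaries as stand-in tables. The table and attribute names are illustrative assumptions, not a DBMS implementation.

```python
# Illustrative restrict, project, and join over two stand-in tables.

inventory = [
    {"part_num": "P1", "desc": "Bolt",   "vendor_num": "V1"},
    {"part_num": "P2", "desc": "Washer", "vendor_num": "V2"},
]
vendors = [
    {"vendor_num": "V1", "name": "Acme Supply"},
    {"vendor_num": "V2", "name": "Brisco Co."},
]

# Restrict: extract the rows that meet a condition (a subset of the rows).
restricted = [row for row in inventory if row["vendor_num"] == "V1"]

# Project: extract specified attributes (a subset of the columns).
projected = [{"part_num": r["part_num"], "desc": r["desc"]} for r in inventory]

# Join: concatenate pairs of rows from the two tables that share a key value.
joined = [
    {**inv, **ven}
    for inv in inventory
    for ven in vendors
    if inv["vendor_num"] == ven["vendor_num"]
]

print(restricted, projected, joined, sep="\n")
```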
User Views
● A user view is the set of data that a particular user sees. Examples of user views
are computer screens for entering or viewing data, management reports, or
source documents such as an invoice.
Database Anomalies
● Improperly normalized tables can cause DBMS processing problems that restrict,
or even deny, users access to the information they need. Such tables exhibit
negative operational symptoms called anomalies.
1. Update Anomaly. The update anomaly results from data redundancy in
an unnormalized table.
2. Insertion Anomaly. Assume a new vendor enters a marketplace. The
organization does not yet purchase from the vendor, but may wish to add
the vendor to the database. This is not possible, however, because the
primary key for the Inventory table is PART NUM—the supplier data
cannot be added to the table.
3. Deletion Anomaly. The deletion anomaly involves the unintentional
deletion of data from a table.
❖ One or more of these anomalies will exist in tables that are not normalized or
are normalized at a low level, such as first normal form (1NF) or second
normal form (2NF). To be free of anomalies, tables must be normalized to the
third normal form (3NF) level.
Normalizing Tables
● The database anomalies described above are symptoms of structural problems
within tables called dependencies. Specifically, these are known as repeating
groups, partial dependencies, and transitive dependencies. The
normalization process involves identifying and removing structural dependencies
from the table(s) under review.
The resulting tables will then meet the two conditions below:
1. All nonkey (data) attributes in the table are dependent on (defined by) the
primary key.
2. All nonkey attributes are independent of the other nonkey attributes.
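The Python sketch below illustrates these two conditions by contrasting an unnormalized Inventory table, which repeats vendor data in every row, with the 3NF Inventory and Supplier tables that result when the dependency is removed. Field names and values are illustrative assumptions.

```python
# Illustrative sketch: the unnormalized table repeats vendor data in every row
# (update anomaly) and cannot hold a vendor with no parts (insertion anomaly).

unnormalized = [
    {"part_num": "P1", "desc": "Bolt", "vendor_num": "V1",
     "vendor_name": "Acme", "vendor_addr": "12 Main St"},
    {"part_num": "P2", "desc": "Nut",  "vendor_num": "V1",
     "vendor_name": "Acme", "vendor_addr": "12 Main St"},   # redundant vendor data
]

# 3NF: every nonkey attribute depends on the key and on nothing but the key.
inventory = [            # primary key: part_num; vendor_num is a foreign key
    {"part_num": "P1", "desc": "Bolt", "vendor_num": "V1"},
    {"part_num": "P2", "desc": "Nut",  "vendor_num": "V1"},
]
supplier = [             # primary key: vendor_num; vendor data stored only once
    {"vendor_num": "V1", "vendor_name": "Acme",   "vendor_addr": "12 Main St"},
    {"vendor_num": "V2", "vendor_name": "Brisco", "vendor_addr": "9 Oak Ave"},  # new vendor, no parts yet
]
```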
3NF (Third Normal Form). Third normal form is a database schema design approach for
relational databases that uses normalization principles to reduce the duplication of
data, avoid data anomalies, ensure referential integrity, and simplify data management.
The degree of association between the resulting tables (that is, 1:1, 1:M, or M:M)
determines how the linking occurs. The key-assignment rules for linking tables are discussed next.
● Business Rule 1. Each vendor supplies the firm with three (or fewer) different
items of inventory, but each item is supplied by only one vendor.
● Business Rule 2. Each vendor supplies the firm with any number of inventory
items, but each item is supplied by only one vendor.
● Business Rule 3. Each vendor supplies the firm with any number of inventory
items, and each item may be supplied by any number of vendors.
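The Python sketch below suggests how the key-assignment rules might play out for these business rules: in a 1:M association the primary key of the "one" side is embedded as a foreign key in the "many" side, while an M:M association requires a separate link table carrying both primary keys. Table and field names are illustrative assumptions.

```python
# Business Rule 2 (1:M): embed the vendor's primary key as a foreign key
# in each Inventory record.
inventory_1m = [
    {"part_num": "P1", "desc": "Bolt", "vendor_num": "V1"},   # FK -> Supplier
    {"part_num": "P2", "desc": "Nut",  "vendor_num": "V1"},
]

# Business Rule 3 (M:M): neither table can hold the other's key without
# repeating groups, so a separate link table carries both primary keys.
part_vendor_link = [
    {"part_num": "P1", "vendor_num": "V1"},
    {"part_num": "P1", "vendor_num": "V2"},   # same part from a second vendor
    {"part_num": "P2", "vendor_num": "V1"},
]
```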
Identify Entities
View modeling begins by identifying the primary entities of the business function in
question. Recall that entities are things about which the organization wishes to capture
data.
To qualify as an entity, a candidate must satisfy two conditions: (1) it must consist of two or more occurrences, and (2) it must contribute at least one attribute that is not provided through other entities. We need to test these conditions for each candidate to eliminate any false entities.
● Purchasing Agent. We need to determine what data about the agent, unique to his
or her role in placing orders, needs to be captured.
● Receiving Clerk. We will assume that no clerk-specific data needs to be
captured that requires a dedicated table.
● Inventory. The Inventory entity meets both conditions. The description
suggests that the organization holds many items of inventory; thus this entity
would contain multiple occurrences.
● Supplier. The description states that multiple vendors supply inventory; hence
the Supplier entity meets the first condition. We can also assume that it meets the
second condition since no other entity would logically provide supplier data. The
Supplier entity, therefore, will be included in the data model.
● Inventory Status Report. The Inventory Status Report is a user view derived
from the Inventory and Supplier entities (see Figure 8.17). While it contains
multiple occurrences, it is not an entity because it does not satisfy Condition 2.
The view is derived entirely from existing entities and provides no additional data
that requires a separate entity. The view will be carefully analyzed, however, to
ensure that all the attributes needed for it are included in the existing entities.
● Purchase Order. The Purchase Order entity meets both conditions. Many
purchase orders will be processed in a period; thus the entity will have many
occurrences.
● Receiving Report. The Receiving Report meets both conditions. Many receipt
events will take place in the period; thus a Receiving Report entity will have
multiple occurrences. A Receiving Report entity will contain attributes such as
date received and quantity received that are unique to this entity and thus not
provided by other entities in the model.
The association between the Receiving Report and Inventory entities is 0,M:M.
This signifies that within the period, each item of inventory may have been
received many times or never. Also, each receiving report is associated with at
least one and possibly many inventory items.
Add Primary Keys. The next step in the process is to assign primary keys to the
entities in the model. The analyst should select a primary key that logically defines the
nonkey attributes and uniquely identifies each occurrence in the entity.
A separate link table resolved the M:M association between the Purchase Order and Inventory entities by
providing a link.
The next step is to create the physical tables and populate them with data. This is an
involved step that must be carefully planned and executed and may take many months
in a large installation. Programs will need to be written to transfer organization data
currently stored in flat files or legacy databases to the new relational tables. Data
currently stored on paper documents may need to be entered into the database tables
manually. Once this is done, the physical user views can be produced.
The normalized tables should be rich enough to support the views of all users of the
system being modeled. To illustrate the relationship, the fields in the user view are
cross-referenced via circled numbers to the attributes in the supporting tables. Keep in
mind that these tables may also provide data for many other views not shown here, such
as the receiving report, purchase requisition listing, inventory status report, and vendor
purchases activity report.
The view modeling process described previously pertained to only one business
function— the purchases system—and the resulting tables and views constitute only
a subschema of the overall database schema. A modern company, however, would need
hundreds or thousands of views and associated tables. Combining the data needs of all
users into a single schema or enterprise-wide view is called view integration. This is a
daunting undertaking when creating the entire database from scratch.
Disadvantages of EAMs
Operational Efficiency
● From the user’s point of view, EAMs decrease operational performance. The
presence of an audit module within the host application may create significant
overhead, especially when the amount of testing is extensive.
The widespread popularity of GAS is due to four factors:
(1) GAS languages are easy to use and require little computer background on the
part of the auditor;
(2) many GAS products can be used on both mainframe and PC systems;
(3) auditors can perform their tests independently of the client’s computer services
staff; and
(4) GAS can be used to audit the data stored in most file structures and formats.
To illustrate the file flattening process, consider the complex database structure
presented in Figure 8.29.
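To give a sense of what flattening involves, the Python sketch below copies data from a nested (hierarchical) structure into a simple two-dimensional CSV file that audit software can then read. The structure shown is an illustrative assumption, not the one in Figure 8.29.

```python
import csv

# Illustrative nested structure: each customer record owns several invoice records.
customers = [
    {"cust_num": "C1", "name": "Alpha Co.",
     "invoices": [{"inv_num": "I10", "amount": 120.00},
                  {"inv_num": "I11", "amount": 75.50}]},
    {"cust_num": "C2", "name": "Beta Ltd.",
     "invoices": [{"inv_num": "I12", "amount": 300.00}]},
]

# Flatten: one row per invoice, with the parent customer data repeated, so the
# result is a simple two-dimensional file the audit software can define.
with open("flat_invoices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["cust_num", "name", "inv_num", "amount"])
    for cust in customers:
        for inv in cust["invoices"]:
            writer.writerow([cust["cust_num"], cust["name"],
                             inv["inv_num"], inv["amount"]])
```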
ACL SOFTWARE
Data Definition
We have already established that a client’s system may store data using a number
of flat file or database structures, including sequential files, VSAM files, linked lists, and
relational tables. One of ACL’s strengths is the ability to read data stored in most
formats. ACL uses the data definition feature for this purpose. To create a data
definition, the auditor needs to know both where the source file physically resides and
its field structure layout. Small files can be imported via text files or spreadsheets. Very
large files may need to be accessed directly from the mainframe computer.
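The Python sketch below suggests what a field-structure layout amounts to in practice: given the starting position and length of each field, a fixed-length record can be sliced into named fields. The field names, offsets, and sample record are illustrative assumptions, not ACL's data definition syntax.

```python
# Illustrative field-structure layout for a fixed-length record file.

LAYOUT = [            # (field name, start position, length) -- assumed values
    ("cust_num",  0,  6),
    ("inv_num",   6,  8),
    ("inv_date", 14,  8),
    ("amount",   22, 10),
]

def parse_record(line):
    """Slice one fixed-length record into named fields using the layout."""
    return {name: line[start:start + length].strip()
            for name, start, length in LAYOUT}

sample = "C00123I1000042202306011234567.89"   # one hypothetical 32-byte record
print(parse_record(sample))
```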
Customizing a View
A view is simply a way of looking at data in a file; auditors seldom need to use all
the data contained in a file. ACL allows the auditor to customize the original view
created during data definition to one that better meets his or her audit needs. The
auditor can create and reformat new views without changing or deleting the data in the
underlying file. Only the presentation of the data is affected.
Filtering Data
ACL provides powerful options for filtering data that support various audit tests.
Filters are expressions that search for records that meet the filter criteria. ACL’s
expression builder allows the auditor to use logical operators such as AND, OR, NOT,
and others to define and test conditions of any complexity and to process only those
records that match specific conditions.
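The Python sketch below approximates the kind of filter an auditor might define, selecting only the records that satisfy a compound condition. It is not ACL expression syntax, and the field names and thresholds are illustrative assumptions.

```python
# Illustrative Python equivalent of a filter such as
# (amount > 5000 AND approved_by is blank) OR quantity < 0 -- not ACL syntax.

records = [
    {"inv_num": "I10", "amount": 7200.0, "approved_by": "",    "quantity": 10},
    {"inv_num": "I11", "amount":  950.0, "approved_by": "JMS", "quantity": -3},
    {"inv_num": "I12", "amount":  300.0, "approved_by": "JMS", "quantity": 12},
]

def meets_filter(rec):
    # AND / OR / NOT conditions combined exactly as the auditor defines them
    return (rec["amount"] > 5000 and not rec["approved_by"]) or rec["quantity"] < 0

exceptions = [rec for rec in records if meets_filter(rec)]
print(exceptions)   # only records matching the filter criteria are processed
```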
Stratifying Data
ACL’s stratification feature allows the auditor to view the distribution of records
that fall into specified strata. Data can be stratified on any numeric field such as sales
price, unit cost, quantity sold, and so on. The data are summarized and classified by
strata, which can be equal in size (called intervals) or vary in size (called free).
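The Python sketch below illustrates stratification over a numeric field using equal-size intervals. The field, interval boundaries, and data are illustrative assumptions, not ACL output.

```python
# Illustrative stratification of a numeric field (unit cost) into five
# equal-size intervals spanning 0 to 500.

unit_costs = [12.5, 48.0, 150.0, 305.0, 7.25, 89.0, 410.0, 33.0]

def stratify(values, low, high, n_strata):
    width = (high - low) / n_strata
    counts = [0] * n_strata
    for v in values:
        i = min(int((v - low) // width), n_strata - 1)  # top boundary is inclusive
        counts[i] += 1
    return counts

for i, count in enumerate(stratify(unit_costs, 0, 500, 5)):
    print(f"Stratum {i + 1} ({i * 100}-{(i + 1) * 100}): {count} records")
```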
Statistical Analysis
ACL offers many sampling methods for statistical analysis. Two of the most
frequently used are record sampling and monetary unit sampling (MUS). Each method
allows random and interval sampling. The choice of methods will depend on the
auditor’s strategy and the composition of the file being audited. On one hand, when
records in a file are fairly evenly distributed across strata, the auditor may want an
unbiased sample and will thus choose the record sample approach. Using inventory to
illustrate, each record, regardless of the dollar amount of the inventory value field, has
an equal chance of being included in the sample. On the other hand, if the file is heavily
skewed with large value items, the auditor may select MUS, which will produce a sample
that includes all the larger dollar amounts.
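The Python sketch below contrasts the two approaches: record sampling gives every record an equal chance of selection, while monetary unit sampling weights selection by the dollar value so that large-value items are far more likely to be included. The data and sample sizes are illustrative assumptions.

```python
import random

# Illustrative inventory file: each record carries a dollar value field.
inventory = [
    {"item": "A", "value": 120.0},
    {"item": "B", "value": 15000.0},   # large-value item
    {"item": "C", "value": 430.0},
    {"item": "D", "value": 8900.0},
    {"item": "E", "value": 75.0},
]

# Record sampling: every record has an equal chance of selection.
record_sample = random.sample(inventory, k=2)

# Monetary unit sampling: each dollar is a sampling unit, so large-value
# records dominate the selection (drawn with replacement, for simplicity).
mus_sample = random.choices(
    inventory,
    weights=[rec["value"] for rec in inventory],
    k=2,
)

print("Record sample:", [r["item"] for r in record_sample])
print("MUS sample:   ", [r["item"] for r in mus_sample])
```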