Lab Manual Format

Department of Computer Science and Business Systems
CCS341 - Data Warehousing
LAB MANUAL
SubCode:CCS341 Subject Name :Data Warehousing

List of Experiments
Date of
S.No. Name of the Experiments Page No. Staff sign
Experiment
Data exploration and integration with

1
WEKA
2 Apply weka tool for data validation
Plan the architecture for real time

3
application
4 Write the query for schema definition
Design data ware house for

5 real time applications
6 Analyse the dimensional Modeling
7 Case study using OLAP
8 Case study using OTLP
9 Implementation of warehouse testing.

EX.NO.:1 DATA EXPLORATION AND INTEGRATION WITH WEKA
AIM
To exploring the data and performing integration with weka
PROCEDURE
WEKA - an open-source software provides tools for data preprocessing, implementation of

several Machine Learning algorithms, and visualization tools so that you can develop machine
learning techniques and apply them to real-world data mining problems. What WEKA offers is
summarized in the following diagram:

WEKA Installation
To install WEKA on your machine, visit WEKA’s official website and download the installation file.
WEKA supports installation on Windows, Mac OS X and Linux. You just need to follow the
instructions on this page to install WEKA for your OS.
The WEKA GUI Chooser application will start and you would see the following screen
The GUI Chooser application allows you to run five different types of applications as listed
here:
 Explorer
 Experimenter
 KnowledgeFlow
 Workbench
 Simple CLI
WEKA – Launching Explorer

Let us look into various functionalities that the explorer provides for working with big data. When you
click on the Explorer button in the Applications selector, it opens the following screen:
On the top, you will see several tabs as listed here:

 Preprocess
 Classify
 Cluster
 Associate
 Select Attributes
 Visualize
Under these tabs, there are several pre-implemented machine learning algorithms. Let us
look into each of them in detail now.

Preprocess Tab
Initially as you open the explorer, only the Preprocess tab is enabled. The first step in machine
learning is to preprocess the data. Thus, in the Preprocess option, you will select the data file,
process it and make it fit for applying the various machine learning algorithms.
Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your
data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression,
Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on.
The list is very exhaustive and provides both supervised and unsupervised machine learning
algorithms.
Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided - such as
SimpleKMeans, FilteredClusterer, HierarchicalClusterer, and so on.
Associate Tab
Under the Associate tab, you would find Apriori, FilteredAssociator and FPGrowth.
Select Attributes Tab

Select Attributes allows you feature selections based on several algorithms such as
ClassifierSubsetEval, PrinicipalComponents, etc.

Visualize Tab
Lastly, the Visualize option allows you to visualize your processed data for analysis. As you
noticed, WEKA provides several ready-to-use algorithms for testing and building your machine
learning applications. To use WEKA effectively, you must have a sound knowledge of these
algorithms, how they work, which one to choose under what circumstances, what to look for in their
processed output, and so on. In short, you must have a solid foundation in machine learning to use
WEKA effectively in building your apps.
Loading Data
The data can be loaded from the following sources:

 Local file system
 Web
 Database
Loading Data from Local File System

There are three buttons
 Open file
 Open URL
 Open DB
Click on the Open file ... button. A directory navigator window opens as shown in the following
screen
Loading Data from Web

Once you click on the Open URL button, you can see a window as follows:

We will open the file from a public URL Type the following URL in the popup box:
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff
You may specify any other URL where your data is stored. The Explorer will load the data from the
remote site into its environment.
Loading Data from DB

Once you click on the Open DB button, you can see a window as follows:
Set the connection string to your database, set up the query for data selection, process the query and
load the selected records in WEKA.

WEKA File Formats

WEKA supports a large number of file formats for the data. Here is the complete list:
 arff
 arff.gz
 bsi
 csv
 dat
 data
 json
 json.gz
 libsvm
 m
 names
 xrff
 xrff.gz
The types of files that it supports are listed in the drop-down list box at the bottom of the screen. This
is shown in the screenshot given below.
As you would notice it supports several formats including CSV and JSON. The default file type is
Arff.
Arff Format
An Arff file contains two sections - header and data.
 The header describes the attribute types.
 The data section contains a comma separated list of data.

As an example for Arff format, the Weather data file loaded from the WEKA sample databases is
shown below:
From the screenshot, you can infer the following points:

 The @relation tag defines the name of the database.
 The @attribute tag defines the attributes.
 The @data tag starts the list of data rows each containing the comma separated
 fields.
 The attributes can take nominal values as in the case of outlook shown here:
@attribute outlook (sunny, overcast, rainy)
 The attributes can take real values as in this case:
@attribute temperature real
 You can also set a Target or a Class variable called play as shown here:
@attribute play (yes, no)
 The Target assumes two nominal values yes or no.
Understanding Data
Let us first look at the highlighted Current relation sub window. It shows the name of the
database that is currently loaded. You can infer two points from this sub window:
 There are 14 instances - the number of rows in the table.
 The table contains 5 attributes - the fields, which are discussed in the upcoming
 sections.
On the left side, notice the Attributes sub window that displays the various fields in the database.

The weather database contains five fields - outlook, temperature, humidity, windy and play. when
you select an attribute from this list by clicking on it, further details on the attribute itself are
displayed on the right hand side.
Let us select the temperature attribute first. When you click on it, you would see the following
screen:

In the Selected Attribute subwindow, you can observe the following:

 The name and the type of the attribute are displayed.
 The type for the temperature attribute is Nominal.
 The number of Missing values is zero.
 There are three distinct values with no unique value.
 The table underneath this information shows the nominal values for this field as hot,
mild and cold.
 It also shows the count and weight in terms of a percentage for each nominal value.
At the bottom of the window, you see the visual representation of the class values
If you click on the Visualize All button, you will be able to see all features in one single window as
shown here:
Removing Attributes
Many a time, the data that you want to use for model building comes with many irrelevant fields. For
example, the customer database may contain his mobile number which is relevant in analysing his
credit rating
To remove Attribute/s select them and click on the Remove button at the bottom.

The selected attributes would be removed from the database. After you fully preprocess the data, you
can save it for model building.
Next, you will learn to preprocess the data by applying filters on this data.
Data Integration
Suppose you have 2 datasets as below and need to merge them together
Open a Command Line Interface

 Run the following command replacing values as needed
 java -cp weka.jar weka.core.Instances merge <path to file1> <path to file 2>
> <path to result file>
Example
java weka.core.Instances merge C:\Users\Ram\Downloads\file1.csv C:\Users\Ram\Downloads\

file2.csv > C:\Users\Ram\Downloads\results.csv
Finished redirecting output to 'C:\Users\Ram\Downloads\results.csv'.

Now you can see results.csv or results.csv file in your given location as below.
RESULT:

Thus exploring the data and performing integration with weka was done successfully.

EX.NO.:2 APPLY WEKA TOOL FOR DATA VALIDATION
AIM:
To validate the data stored in data warehouse using Weka tool.
PROCEDURE:
Data validation is the process of verifying and validating data that is collected before it is
used. Any type of data handling task, whether it is gathering data, analyzing it, or structuring it for
presentation, must include data validation to ensure accurate results.
1. Data Sampling
After loading your dataset
 Click on choose ( certain datasets in sample datasets does not allow this operation. I used
Brest-cancer dataset for this experiment )
 Filters -> supervised -> Instance -> Re-sample
 Click on the name of the algorithm to change parameters
 Change bias To Uniform Class to have a biased sample. If you set it to 1 resulting dataset
will have equal number of instances for each class. Ex:- Brest-cancer positive 20 negative 20.
 Change no Replacement accordingly.
 Change sample Size Percent accordingly. ( self explanatory )
2. Removing duplicates
 Choose filters -> unsupervised -> instance -> Remove Duplicates

 Compare the results as below
Dealing with missing values
 Open labor.arff file using weka. ( which has missing values )

 Click on edit button on the top bar to get a view of dataset as blow. Here you can clearly see
the missing values in gray areas.
 Filters -> unsupervised -> attribute -> replace Missing Values With User Constants
 In attributes set the column number you want to replace values.
 Set nominalStringReplacementValue to a suitable string if selected column is nominal.
 Set numericReplacementValue to 0, -1 depending on your requirement if selected column is
numeric.
 fill all replacement values, add “first-last” to attribute column to apply for all columns at
once.
 ReplaceMissingValues filter which replace numeric values with mean and nominal values
with mode.
3. DataReduction
PCA
 Load iris dataset
 Filters -> unsupervised -> attribute -> PrincipleComponents
 Original iris dataset have 5 columns. ( 4 data + 1 class ). Lets reduce that to 3 columns ( 2
data + 1 class ).
 maximumAttributes – No of attributes we need after reduction.

 maximumAttributeNames – PCA algorithm calculates 4 principle components for this
dataset. Upon them we are selecting the 2 components which have the most variance ( PC1,
PC2 ). Then we need to re-represent data again using these selected components ( reducing
4D plot to 2D plot ). In this process we can select how many principle components we are
using when re-generating values. See the final result below where you can see new columns
are created using 3 principle components multiplied by respective bias values.
4. Data transformation
Normalization
 Load iris dataset
 Filters -> unsupervised -> attribute -> normalize
 Normalization is important when you don’t know the distribution of data beforehand.
 Scale is the length of number line and translation is the lower bound.
 Ex :- scale 2 and translation -1 => -1 to 1, scale 4 and translation -2 => -2 to 2
 This filter get applied to all numeric columns. You can’t selectively normalize.
Standardization
 Load iris dataset.
 Used when dataset known to be in Gaussian (bell curve) distribution.
 Filters -> unsupervised -> attribute -> standardize
 This filter get applied to all numeric columns. You can’t selectively standardize.
Discretization
 Load diabetes dataset.
 Discretization comes in handy when using decision trees.
 Suppose you need to change weight column to two values like low and high.
 Set column number 6 to AttributeIndices.
 Set bins to 2 ( Low/ High)
 When you set equal frequency to true there will be equal number of high and low entries in
the final column.
RESULT:
Thus validate the data stored in data warehouse using Weka tool was done successfully.
EX.NO.:3 PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION
AIM:
To plan the architecture for real time application.
PROCEDURE:
DESIGN STEPS:
1. Gather Requirements: Aligning the business goals and needs of different departments with
the overall data warehouse project.
2. Set Up Environments: This step is about creating three environments for data warehouse
development, testing, and production, each running on separate servers
3. Data Modeling: Design the data warehouse schema, including the fact tables and dimension
tables, to support the business requirements.
4. Develop Your ETL Process: ETL stands for Extract, Transform, and Load. This process is
how data gets moved from its source into your warehouse.
5. OLAP Cube Design: Design OLAP cubes to support analysis and reporting requirements.
6. Reporting & Analysis: Developing and deploying the reporting and analytics tools that will
be used to extract insights and knowledge from the data warehouse.
7. Optimize Queries: Optimizing queries ensures that the system can handle large amounts of
data and respond quickly to queries.
8. Establish a Rollout Plan: Determine how the data warehouse will be introduced to the
organization, which groups or individuals will have access to it, and how the data will be
presented to these users.
EX.NO.:4 QUERY FOR SCHEMA DEFINITION
AIM:
To Write a query for Star, Snowflake and Galaxy schema definitions.
PROCEDURE:
STAR SCHEMA
 Each dimension in a star schema is represented with only one-dimension table.

 This dimension table contains the set of attributes.
 There is a fact table at the center. It contains the keys to each of four dimensions.
 The fact table also contains the attributes
SNOWFLAKE SCHEMA
 Some dimension tables in the Snowflake schema are normalized.

 The normalization splits up the data into additional tables.
 Unlike Star schema, the dimensions table in a snowflake schema are normalized.
FACT CONSTELLATION SCHEMA
 A fact constellation has multiple fact tables. It is also known as galaxy schema.
 The sales fact table is same as that in the star schema.
 The sales fact table is same as that in the star schema.
 The shipping fact table also contains two measures, namely dollars sold and units sold.
SYNTAX:
Cube Definition:
define cube < cube_name > [ < dimension-list > }: < measure_list >
Dimension Definition:
define dimension < dimension_name > as ( < attribute_or_dimension_list > )
SAMPLE PROGRAM:
STAR SCHEMA DEFINITION
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
SNOWFLAKE SCHEMA DEFINITION

define cube sales snowflake [time, item, branch, location]:
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension location as (location key, street, city (city key, city, province or state, country))
FACT CONSTELLATION SCHEMA DEFINITION
define cube sales [time, item, branch, location]:
define dimension item as (item key, item name, brand, type, supplier type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:
dollars cost = sum(cost in dollars), units shipped = count(*)
define dimension time as time in cube sales

define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper
type)
define dimension from location as location in cube sales define
dimension to location as location in cube sales OUTPUT :
STAR SCHEMA
SNOWFLAKE SCHEMA
FACT CONSTELLATION SCHEMA
RESULT:
Thus the query for star, Snowflake and Galaxy schema was written Successfully.
EX.NO.:5 DESIGN DATA WARE HOUSE FOR REAL TIME APPLICATIONS
AIM:
To design a data warehouse for real time applications
PROCEDURE:
DATA CLEANSING AND TRANSFORMING
Dropping Tables
Since decision-making is concerned with the trends related to students’ history, behavior,
and academic
performance, tables “assets” and “item” are not needed; and therefore, they are discarded
and excluded from the
data warehouse.
DROP TABLE assets ;
DROP TABLE item ;
Merging Tables
Based on the design assumptions, the three tables “department”, “section”, and “course” do
not constitute
separately important parameters for extracting relevant patterns and discovering
knowledge. Therefore, they are
merged altogether with the “transcript_fact_table” table.
SELECT co_name FROM course, section, transcript
WHERE tr_id = n AND str_semester/year = se_semester/year AND tr_se_num = se_num AND
se_code =
co_code ;
ALTER TABLE transcript fact table ADD co_course TEXT ; DROP
TABLE department ;
DROP TABLE section ;
DROP TABLE course ;
Furthermore, table “Activities” is merged with table “RegistrationActivities” and a new table
is produced
called “RegisteredActivities”.
SELECT act_name FROM activities, registrationActivities
WHERE reg_act_id = act_id ;
New Columns
During transformation new columns can be added. In fact, tr_courseDifficulty is added to
table “transcript_fact_table” in order to increase the degree of knowledge and information.
ALTER TABLE transcript_fact_table ADD tr_courseDifficulty TEXT ; Moreover a
Boolean column is added to table “receipt” called re_paidOnDueDate ALTER
TABLE receipt (re_paidOnDueDate) ;
Removing Columns
Unnecessary columns can be removed too during the transformation process. Below is a list of useless
columns
that were discarded during the transformation process from tables “Account”, “Student”, “Receipt” and
“Activities” respectively:
ALTER TABLE Receipt REMOVE re_dueDate
REMOVE re_dateOfPayment ;
ALTER TABLE Activities REMOVE ac_supervisor ;
ALTER TABLE Student REMOVE st_phone REMOVE
st_email ;
Conceptual Schema – The Snowflake Schema
The proposed data warehouse is a Snowflake type design with one center fact table and seven
dimensions
Output:
RESULT:
Thus the data warehouse was successfully designed.
EX.NO.:6 ANALYSE THE DIMENSIONAL MODELING
AIM:
To design and analyse the dimensional Modeling of a Data warehouse
PROCEDURE:
SAMPLE DIMENSIONS
For instance, suppose a business wants to analyze sales data. In that case, the dimensions could
include customers, products, regions, and time, while the facts could be the number of products sold,
the total revenue generated, and the profit earned.
There are two primary techniques used in dimensional modeling:
Star Schema
The star schema is the simplest and most common dimensional modeling technique. In a star schema,
the fact table is at the center and connected via foreign key(s) to the dimension tables. The fact table
contains the numerical values or metrics being analyzed, while the dimension tables have the
attributes that describe the data.
Snowflake Schema
The snowflake schema is a more complex dimensional modeling technique used when there are
multiple levels of granularity within a dimension. In a snowflake schema, the dimension tables are
normalized, meaning they are split into multiple tables to reduce data redundancy. This
normalization results in a more complex schema that resembles a snowflake, hence the name .
Steps to Implement Dimensional Modeling
The following are the steps involved in implementing dimensional modeling:
1. Identify the Business Process

The first step in implementing dimensional modeling is to identify the business requirements and the
business process that the data warehouse will support. This involves determining business objectives
that need to be supported by the key performance indicators (KPIs) that will be used to measure the
success of the process.
2. Determine the Data to be Analyzed
Once the business process has been identified, the next step is determining what data needs to be
analyzed. This includes identifying the relevant data sources and the specific data elements that must
be included in the model.
3. Identify the Dimensions

The third step is to identify the dimensions that will be used to describe the data. These dimensions
should be based on the KPIs identified in step oneand relevant to the business process.
4. Identify the Facts

The fourth step is to identify the facts that will be analyzed in the data warehouse. These facts should
be based on the KPIs specified in step one andrelevant to the business process.
5. Identify the Grain
The grain refers to the level of detail at which the data will be stored and analyzed. For example, in a
sales data model, the grain might be at the level of individual sales transactions or at the level of
daily sales totals. Identifying the grain is important because it determines the level of detail at which
the data will be stored and can affect the performance and usability of the data model.
6. Design the Schema

The final step is to design the schema. This involves creating a fact and dimension tables based on
the dimensions and facts identified in steps three and four.
7. Populate the Data Warehouse

The sixth step is to populate the data warehouse with data from the source systems. This involves
extracting, transforming, and loading (ETL) the data into the data warehouse.
8. Test the Data Warehouse

The final step is to test the data warehouse to ensure it meets the requirements identified in step one.
This involves running queries against the data warehouse to ensure the data is accurate, complete,
and consistent. Identifying and addressing any issues during testing is essential to ensure the data
warehouse is functioning correctly.
Ex.No:7 CASE STUDY USING OLAP
AIM:
To study about Online Analytical Processing OLAP
Introduction:
In this case study, we will explore how Online Analytical Processing (OLAP) technology was
implemented in a retail data warehousing environment to improve data analysis capabilities and
support decision-making processes. The case study will focus on a fictional retail company, XYZ
Retail, and the challenges they faced in managing and analyzing their vast amounts of transactional
data.
Background:
XYZ Retail is a large chain of stores with locations across the country. The company has been
experiencing rapid growth in recent years, leading to an increase in the volume of data generated
from sales transactions, inventory management, customer interactions, and other operational
activities. The existing data management system was struggling to keep up with the demand for
timely and accurate data analysis, hindering the company's ability to make informed business
decisions.
Challenges:
1. Lack of real-time data analysis: The existing data warehouse system was unable to provide real-
time insights into sales trends, inventory levels, and customer preferences.
2. Limited scalability: The data warehouse infrastructure was reaching its limits in terms of
storage capacity and processing power, making it difficult to handle the growing volume of data.
3. Complex data relationships: The data stored in the warehouse was highly normalized, making
it challenging to perform complex queries and analyze data across multiple dimensions.
Solution:
To address these challenges, XYZ Retail decided to implement an OLAP solution as part of their
data warehousing strategy. OLAP technology allows for multidimensional analysis of data,
enabling users to easily slice and dice information across various dimensions such as time, product
categories, geographic regions, and customer segments.
Implementation:
1. Data modeling: The data warehouse was redesigned using a star schema model, which
simplifies data relationships and facilitates OLAP cube creation.
2. OLAP cube creation: OLAP cubes were created to store pre-aggregated data for faster query
performance. The cubes were designed to support various dimensions and measures relevant to the retail
business.
3. Reporting and analysis: Business users were trained on how to use OLAP tools to create ad-hoc reports,
perform trend analysis, and drill down into detailed data.
Results:
1. Improved data analysis: With OLAP technology in place, XYZ Retail was able to perform complex
analyses on sales data, identify trends, and make informed decisions based on real- time insights.
2. Faster query performance: OLAP cubes enabled faster query performance compared to
traditional relational databases, allowing users to retrieve data more efficiently.
3. Enhanced decision-making: The ability to analyze data across multiple dimensions helped XYZ Retail
gain a deeper understanding of their business operations and customer behavior, leading to more strategic
decision-making.
Conclusion:
By leveraging OLAP technology in their data warehousing environment, XYZ Retail was able to
overcome the challenges of managing and analyzing vast amounts of data. The implementation of OLAP
not only improved data analysis capabilities but also empowered business users to make informed
decisions based on real-time insights. This case study demonstrates the value of OLAP in enhancing data
analysis and decision-making processes in a retail environment.
RESULT:
Thus the case study using OLAP was done successfully.
Ex.No:8 CASE STUDY USING OTLP
AIM:
To study about using OTLP
Introduction:
This case study explores the implementation of the Operational Data Layer Pattern (OTLP) in a
data warehousing environment to improve data integration, processing, and analytics
capabilities. The case study focuses on a fictional company, Tech Solutions Inc., and how they
leveraged OTLP to enhance their data warehousing operations.
Background:
Tech Solutions Inc. is a technology consulting firm that provides IT solutions to various clients.
The company collects a vast amount of data from different sources, including customer
interactions, sales transactions, and operational activities. The existing data warehouse
infrastructure was struggling to handle the growing volume of data and provide real-time
insights for decision-making.
Challenges:
1. Data silos: Data from different sources were stored in separate silos, making it difficult to
integrate and analyze data effectively.
2. Real-time data processing: The existing data warehouse was not capable of processing real-
time data streams, leading to delays in data analysis and decision-making.
3. Scalability: The data warehouse infrastructure was reaching its limits in terms of storage
capacity and processing power, hindering the company's ability to scale with the growingdata
volume.
Solution:
To address these challenges, Tech Solutions Inc. decided to implement the OTLP pattern in their
data warehousing environment. OTLP combines elements of both Operational Data Store (ODS)
and Traditional Data Warehouse (TDW) architectures to enable real-time data processing, data
integration, and analytical capabilities.
Implementation:
1. Data integration: Tech Solutions Inc. integrated data from various sources into the
operational data layer, where data transformations and cleansing processes were applied.
2. Real-time processing: The OTLP architecture allowed for real-time data processing, enabling
the company to analyze streaming data and generate insights in near real-time.
3. Analytics and reporting: Business users were provided with self-service
analytics tools to create ad-hoc reports, perform trend analysis and gain
actionable insights from the integrated data.
Results:
1. Improved data integration: The OTLP architecture facilitated seamless
integration of data from multiple sources, breaking down data silos and
enabling a unified view of the company's operations.
2. Real-time analytics: With OTLP in place, Tech Solutions Inc. was able to
analyze streaming data in real-time, allowing for faster decision-making and
response to market trends.
3. Scalability: The OTLP architecture provided scalability to handle the
growing volume of data, ensuring that the company's data warehousing
operations could support future growth.
Conclusion:
By implementing the Operational Data Layer Pattern (OTLP) in their data
warehousing environment, Tech Solutions Inc. was able to overcome the
challenges of data silos, real-time data processing, and scalability. The
adoption of OTLP not only improved data integration and analytics
capabilities but also empowered business users to make informed decisions
based on real-time insights. This case study highlights the benefits of
leveraging OTLP in enhancing data warehousing operations for improved
business outcomes.
RESULT:
Thus case study using OTLP done successfully.
Ex.No:9 Implementation of warehouse testing.
AIM:
To implement warehouse testing
Steps with program:

1. Install necessary libraries:
pip install pytest pandas
2. Create a Python script for data transformation and loading: #

data_transformation.py import pandas as pd
deftransform_data(input_data):
# Perform data transformation logic here
transformed_data=input_data.apply(lambdax:x*2)
return transformed_data
defload_data(transformed_data):
# Load transformed data into the operational data layer
transformed_data.to_csv('transformed_data.csv',index=False)
3. Create test cases using pytest:

#test_data_integration.py
import pandas as pd
importdata_transformation
deftest_transform_data():
input_data=pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
expected_output = pd.DataFrame({'A': [2, 4, 6], 'B': [8, 10, 12]})
transformed_data=data_transformation.transform_data(input_data)
assert transformed_data.equals(expected_output)
deftest_load_data():
input_data=pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
data_transformation.load_data(input_data)
loaded_data=pd.read_csv('transformed_
data.csv') assert
input_data.equals(loaded_data)
4. Run the tests using py test:
pytesttest_data_integration.py
5. Analyze the test results to ensure that the data transformation and
loading processes are functioning correctly in the operational data layer.
By implementing automated tests for data integration processes in the data

warehousing environment, you can ensure the accuracy and reliability of the
data transformation and loading operations. This approach helps in identifying
any issues or discrepancies early on in the development cycle, leading to a
more robust and efficient data warehousing system.
OUTPUT:
RESULT:
Thus the implementation of warehouse testing done successfully.

Lab Manual Format

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

Lab Manual Format

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lab Manual Format

Uploaded by

Copyright:

Available Formats

Department of Computer Science and Business Systems

CCS341 - Data Warehousing

SubCode:CCS341 Subject Name :Data Warehousing

Data exploration and integration with

2 Apply weka tool for data validation

Plan the architecture for real time

4 Write the query for schema definition

Design data ware house for

6 Analyse the dimensional Modeling

7 Case study using OLAP

8 Case study using OTLP

9 Implementation of warehouse testing.

SubCode:CCS341 Subject Name :Data Warehousing

EX.NO.:1 DATA EXPLORATION AND INTEGRATION WITH WEKA

WEKA - an open-source software provides tools for data preprocessing, implementation of

SubCode:CCS341 Subject Name :Data Warehousing

WEKA – Launching Explorer

On the top, you will see several tabs as listed here:

SubCode:CCS341 Subject Name :Data Warehousing

Select Attributes Tab

SubCode:CCS341 Subject Name :Data Warehousing

The data can be loaded from the following sources:

Loading Data from Local File System

Loading Data from Web

SubCode:CCS341 Subject Name :Data Warehousing

Loading Data from DB

SubCode:CCS341 Subject Name :Data Warehousing

WEKA File Formats

SubCode:CCS341 Subject Name :Data Warehousing

From the screenshot, you can infer the following points:

SubCode:CCS341 Subject Name :Data Warehousing

SubCode:CCS341 Subject Name :Data Warehousing

In the Selected Attribute subwindow, you can observe the following:

SubCode:CCS341 Subject Name :Data Warehousing

Open a Command Line Interface

java weka.core.Instances merge C:\Users\Ram\Downloads\file1.csv C:\Users\Ram\Downloads\

Finished redirecting output to 'C:\Users\Ram\Downloads\results.csv'.

SubCode:CCS341 Subject Name :Data Warehousing

SubCode:CCS341 Subject Name :Data Warehousing

To validate the data stored in data warehouse using Weka tool.

After loading your dataset

 Choose filters -> unsupervised -> instance -> Remove Duplicates

Dealing with missing values

 Open labor.arff file using weka. ( which has missing values )

 maximumAttributes – No of attributes we need after reduction.

 Each dimension in a star schema is represented with only one-dimension table.

 Some dimension tables in the Snowflake schema are normalized.

FACT CONSTELLATION SCHEMA

STAR SCHEMA DEFINITION

define cube sales star [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

SNOWFLAKE SCHEMA DEFINITION

dollars sold = sum(sales in dollars), units sold = count(*)

FACT CONSTELLATION SCHEMA DEFINITION

define cube sales [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

dollars cost = sum(cost in dollars), units shipped = count(*)

define dimension time as time in cube sales