
Unit 2: Databases for Data Science:
• SQL
• Tool for Data Science
• Basic Statistics with SQL
• Data Munging with SQL
• Filtering, Joins and Aggregation
• Window Functions and Ordered Data
• Preparing Data for Analytics Tool
Databases for Data Science
• Data science involves extracting value and insights from large
volumes of data to drive business decisions.
• It also involves building predictive models using historical
data.
• Databases facilitate effective storage, management, retrieval,
and analysis of such large volumes of data.
• Data scientists should understand the fundamentals of
databases.
– Because databases enable the storage and management of large and
complex datasets, allowing for efficient data exploration, modeling,
and insight generation.
1. SQL
• A data scientist should have a good understanding of
different types of databases, such as relational and NoSQL
databases, and their respective use cases.
• Relational databases are a type of database management
system (DBMS) that organize and store data in a structured
manner using tables with rows and columns.
• SQL (Structured Query Language) for Data Retrieval and
Data Manipulation & Management.
• Querying is a primary feature of SQL databases used in data
mining, making it easier to clean and organize data for
exploratory analysis.
• Through querying, data scientists are able to filter, sort, and
group data as well as return descriptive statistics.
1. SQL
• PostgreSQL, Microsoft SQL Server, MySQL, SQLite, and
IBM Db2 are some of the top SQL databases used in data
science.
• They each offer unique features and are compatible with
various programming languages.
• Using SQL as an essential data science tool includes
understanding how to clean, organize, and store data within a
relational database management system.
Data Science
• Data Science is a multidisciplinary field that combines
knowledge from various disciplines to help businesses make
intelligent decisions through data-driven analysis.
2. Tools for Data Science
Python:
•Along with R, Python is one of the most frequently utilized
languages in data research.
•It is flexible and readable and has many libraries to support it,
especially in data science, making it ideal for various tasks, from
web scraping to model building.
•Here are the critical Libraries for each category in Python
1) Web Scraping:
– BeautifulSoup: Easiest web scraping library in Python.
– Scrapy: Advanced web scraping library.
2. Tools for Data Science
Python:
2) Data Exploration and Manipulation:
– Pandas: Python data manipulation and analysis toolkit.
– NumPy: Supports large multidimensional arrays and matrices.
3) Data Visualization:
– Matplotlib: The core Python plotting library
– Seaborn: A visualization library based on Matplotlib. It offers a high-
level interface for creating attractive statistical graphics.
– Plotly: Interactive graphing library.
4) Model Building:
– Scikit-learn: The most critical ML library in Python
– TensorFlow: Good to apply and scale Deep Learning.
– PyTorch: A machine learning library for image processing and NLP
applications.
2. Tools for Data Science
R:
•R is a powerful language designed for statistical computing and
data analysis.
•Its comprehensive statistical power and vast package ecosystem
make it quite popular in academia and research.
•Here are the critical Libraries for each category in R
1) Web Scraping:
– rvest: Makes web scraping easy by mimicking the exact structure of
the web page.
– RCurl: R bindings to the curl library, allowing anything that can
be done with curl itself.
2. Tools for Data Science
R:
2) Data Exploration and Manipulation:
– dplyr: A grammar of data manipulation, offering a set of verbs
that make common data manipulation tasks easier.
– tidyr: Makes your data more accessible by spreading and
gathering (reshaping) data.
– Data.table: An extension of data.frame with faster data manipulation
capabilities.
3) Data Visualization:
– ggplot2: Application of the grammar of graphics.
– lattice: Better defaults and an easy way to create multi-panel plots.
– plotly: It converts graphs created with ggplot2 to interactive, user-
driven web-based graphs.
2. Tools for Data Science
R:
4) Model Building:
– Caret: Tools for creating classification and regression models.
– nnet: Offer functions to build neural networks.
– randomForest: It is a random forest algorithm-based library for
classification and regression.
2. Tools for Data Science
Excel:
•Excel is easy to use for analyzing and visualizing data. It is easy
to learn, and its ability to handle large data sets makes it helpful
for fast data manipulation and analysis.
•Here, instead of libraries, the key functions of Excel are divided
into subsections to categorize them.
1) Data Exploration and Manipulation:
– FILTER: Filters a range of data based on user-defined
criteria.
– SORT: Sort the elements of a range or array.
– VLOOKUP/HLOOKUP: Finds things in tables or ranges by row or
column.
– TEXT TO COLUMNS: This will split the content of a cell into
multiple cells.
2. Tools for Data Science
Excel:
2) Data Visualization:
– Charts (Bar, Line, Pie, etc.): Regular standard chart types to depict
data.
– PivotTables: It condenses large data sets and creates interactive
summaries.
– Conditional Formatting: It displays which cells fall under a specific
rule.
3) Model Building:
– AVERAGE, MEDIAN, MODE: Calculates central tendencies.
– STDEV.P/STDEV.S: Calculates the standard deviation of a dataset
(for a population or a sample, respectively).
– LINEST: Returns statistics for the straight line that best fits a
data set, based on linear regression analysis.
– Regression Analysis (Data Analysis Toolpak): This toolkit uses
regression analysis to find relationships between variables.
2. Tools for Data Science
SQL:
•SQL is the language used to interact with relational databases
and is needed to store and process data.
•A data scientist primarily uses SQL as the standard way to
interact with databases, helping them query, update, and
manage data. SQL is also required to retrieve data for
analysis.
•The most popular SQL systems are:
– PostgreSQL: An open-source object-relational database system.
– MySQL: A popular open-source database known for its
speed and reliability.
– MsSQL (Microsoft SQL Server): A Microsoft-developed RDBMS,
fully integrated with other Microsoft products and offering
enterprise features.
– Oracle: A multi-model DBMS widely used in enterprise
environments, combining the relational model with support for
other storage representations.
2. Tools for Data Science
Advanced Visualization Tools:
•With the right advanced visualization tools, complex data can be
transformed into vivid, usable insights.
•These tools allow data scientists and business analysts to create
interactive and shareable dashboards that improve understanding
and make the data accessible at the right time.
Important tools to build dashboards are:
•Power BI: A business analytics service by Microsoft that
provides interactive visualizations and business intelligence
capabilities with an interface simple enough for end users to
create their reports and dashboards.
2. Tools for Data Science
Advanced Visualization Tools:
Important tools to build dashboards are:
•Tableau: A robust data visualization tool that allows users to
create interactive and shareable dashboards that give insightful
views of the data.
• It can handle large volumes of data and works well with
disparate data sources.
•Google Data Studio: A free, web-based application that allows
users to create dynamic and attractive dashboards and reports
using data from virtually any source; the reports are fully
customizable, easy to share, and update automatically when
connected to other Google services.
2. Tools for Data Science
Cloud Systems:
•Cloud systems are essential to data science because they can
scale, increase flexibility, and manage big datasets.
•They offer computational services, tools, and resources to store,
process, and analyze data at scale with cost optimization and
performance effectiveness.
Few Tools are:
•AWS (Amazon Web Services): Provides a highly sophisticated
and ever-evolving cloud computing platform that includes a
range of services such as storage, computation, machine learning,
big data analytics, etc.
2. Tools for Data Science
Cloud Systems:
Few Tools are:
•Google Cloud: Offers various cloud computing services that run
on the same infrastructure Google uses internally for products
such as Google Search and YouTube, including cloud data
analytics, data management, and machine learning.
•Microsoft Azure: Microsoft offers cloud computing services,
including virtual machines, databases, AI and machine learning
tools, and DevOps solutions.
•PythonAnywhere: A cloud-based development and hosting
environment that allows you to run, develop, and host Python
applications through a web browser without IT staff setting up a
server. Ideal for data scientists and web app developers who want
to deploy their code quickly.
2. Tools for Data Science
Bonus: LLMs
•Large Language Models (LLMs) are one of the cutting-edge
solutions in AI.
•They can learn and generate text like humans, and they are quite
advantageous in a wide range of applications, such as Natural
Language Processing, Customer Service Automation, Content
Generation, and so on.
Some of the most famous tools are:
•ChatGPT: A flexible conversational agent created by OpenAI to
generate human-like, in-context text for a wide range of tasks.
•Gemini: An LLM created by Google that can be used directly
inside Google apps such as Gmail.
2. Tools for Data Science
Bonus: LLMs
Some of the most famous tools are:
•Claude-3: A modern LLM built for strong language understanding
and text generation, used to assist with high-level NLP tasks and
conversational AI.
•Microsoft Co-pilot: An AI-powered service integrated into
Microsoft applications. Co-pilot helps users by giving context-
sensitive recommendations and automating repetitive workflows,
improving productivity and efficiency across processes.
3. Basic Statistics with SQL
• Statistics is widely used in various fields such as business,
economics, social science, medicine, and engineering.
• A Statistical function is a mathematical function that helps us
to process and analyze data to provide meaningful information
about the dataset.
• SQL statistical functions are essential tools for extracting
meaningful insights from databases. These functions enable
users to perform statistical calculations on numeric data.
• Basic statistics can be performed in SQL using aggregate
functions, window functions, and grouping to analyze data.
3. Basic Statistics with SQL
1. COUNT: This function counts the number of rows in a
dataset or group of rows.
SELECT COUNT(studentID) AS total_students
FROM studentDetails;

2. SUM: The SUM() function adds up the values in a column.
SELECT SUM(marks) AS total_marks
FROM studentDetails;
3. Basic Statistics with SQL
3. AVG: The AVG() function calculates the average of values in
a column.
SELECT AVG(marks) AS average_marks
FROM studentDetails;

4. MIN and MAX: These functions return the smallest and
largest values in a column, respectively.
SELECT MAX(marks) AS highest_marks,
       MIN(marks) AS lowest_marks
FROM studentDetails;
3. Basic Statistics with SQL
5. Standard Deviation and Variance:
The standard deviation is the average amount of variability in
your dataset. It tells you, on average, how far each value lies
from the mean.
Variance is a statistical measurement of the spread between
numbers in a data set. It measures how far each number in the set
is from the mean (average).
•To calculate the standard deviation or variance, you can use
STDDEV() or VARIANCE() in SQL.
SELECT STDDEV(marks) AS stddev_marks,
       VARIANCE(marks) AS variance_marks
FROM studentDetails;
3. Basic Statistics with SQL
6. GROUP BY: (Group Data and Perform Aggregate Functions)
•To group rows with the same values in specified columns and
perform aggregate functions (e.g., SUM, COUNT, AVG)
Example:
SELECT category, SUM(amount) AS total_sales
FROM sales
GROUP BY category;
•Output:
category          total_sales
Electronics       5000
Furniture         3000
Clothing          1500
Home Appliances   2000
3. Basic Statistics with SQL
7. PERCENTILE_CONT() Function
•To calculate a specified percentile value for a numeric column.
SELECT PERCENTILE_CONT(0.5)
       WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;
3. Basic Statistics with SQL
8. CORR() Function
•Correlation is a process for establishing the linear relationship
between two variables.
•CORR() is used to calculate the correlation coefficient between
two numeric columns.
SELECT CORR(sales, profit) AS correlation_coefficient
FROM sales_data;
4. Data Munging with SQL
• Data munging, sometimes called data wrangling, data
remediation, or data cleaning, is the process of converting and
mapping unprocessed data into a different format to improve its
suitability and value for various downstream uses, including
analytics.
• This is a prerequisite for successful data analytics or data
integration.
• Munging is generally a permanent data transformation
process.
4. Data Munging with SQL
• Data munging plays a pivotal role in ensuring the quality and
reliability of data for analysis.
• Clean and well-structured data leads to more accurate insights
and facilitates the modeling process, thereby enhancing the
overall efficacy of data-driven decision-making.
4. Data Munging with SQL
Tools and Technologies used for Data Munging:
•Python: Libraries like Pandas, NumPy, and scikit-learn are
widely used for data manipulation and analysis.
•R Programming: Packages such as dplyr and tidyr are popular
for data wrangling tasks.
•SQL: For querying and transforming data in databases.
•ETL (Extract, Transform, Load) Tools: Such as Apache
Spark, Talend, or Informatica, which automate data munging
processes.
4. Data Munging with SQL
Techniques and Steps in Data Munging:
•Data munging involves a series of techniques and steps to
transform raw data into a usable form.
Some common techniques used:
1) Data Discovery:
•Defining the purpose and goals of data analysis.
•Identifying potential uses and requirements of data.
•Focusing on business requirements rather than technical
specifications.
4. Data Munging with SQL
Techniques and Steps in Data Munging:
Some common techniques used:
2) Data Structuring:
•Structuring raw data to make it machine-readable.
•Organizing data into a well-defined schema with consistent
layout (rows and columns).
•Extracting data from various sources and organizing it into a
formatted repository.
4. Data Munging with SQL
Techniques and Steps in Data Munging:
•Data munging involves a series of techniques and steps to
transform raw data into a usable form.
Some common techniques used:
3) Data Cleaning:
•This involves identifying and correcting errors or
inconsistencies in the data.
•Common tasks include handling missing values, correcting data
formats (e.g., dates, numeric values), and removing duplicates.
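•A minimal SQL sketch of these cleaning tasks, assuming a
hypothetical customers_raw staging table with email and
signup_date columns:
-- Build a cleaned copy: drop duplicates, normalize text,
-- fix a column's data type, and discard rows missing a key field
CREATE TABLE customers_clean AS
SELECT DISTINCT
    customer_id,
    TRIM(LOWER(email))        AS email,        -- consistent casing/spacing
    CAST(signup_date AS DATE) AS signup_date   -- correct the data format
FROM customers_raw
WHERE email IS NOT NULL;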
4. Data Munging with SQL
Techniques and Steps in Data Munging:
Some common techniques used:
4) Data Transformation:
•Data often needs to be transformed to fit the analytical
requirements.
•This may include converting categorical data into numerical
format (encoding), normalizing or scaling numeric data, and
aggregating or summarizing data.
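•A small sketch of encoding and scaling in SQL, reusing the
studentDetails table from the statistics examples (the gender
column is assumed here purely for illustration):
SELECT
    studentID,
    CASE WHEN gender = 'F' THEN 1 ELSE 0 END AS gender_encoded,
    (marks - MIN(marks) OVER ()) * 1.0
        / NULLIF(MAX(marks) OVER () - MIN(marks) OVER (), 0)
        AS marks_scaled   -- min-max scaling to the range 0..1
FROM studentDetails;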
4. Data Munging with SQL
Techniques and Steps in Data Munging:
Some common techniques used:
5) Handling Missing Data:
Techniques such as imputation (replacing missing values with
estimated ones) or deletion (removing rows or columns with
missing data) are used to handle missing data appropriately.
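•For example, both strategies can be sketched on the studentDetails
table (the fallback value is chosen only for illustration):
-- Imputation: fill missing marks with a fallback value
SELECT studentID, COALESCE(marks, 0) AS marks_filled
FROM studentDetails;

-- Deletion: exclude incomplete rows instead of filling them
SELECT studentID, marks
FROM studentDetails
WHERE marks IS NOT NULL;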
6) Data Integration:
•Combining data from multiple sources involves aligning
schemas, resolving inconsistencies, and merging datasets to
create a unified view.
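•As a minimal sketch, two source tables (online_sales and
store_sales, assumed here) can be aligned to a shared schema and
merged into one unified view:
SELECT customer_id, amount, 'online'   AS channel FROM online_sales
UNION ALL
SELECT customer_id, amount, 'in_store' AS channel FROM store_sales;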
4. Data Munging with SQL
Techniques and Steps in Data Munging:
Some common techniques used:
7) Feature Engineering:
•Creating new features or variables from existing data that can
enhance the predictive power of machine learning models.
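•For instance, per-customer features can be derived from the raw
sales table used elsewhere in this unit (column names assumed):
SELECT
    customer_id,
    SUM(amount)                  AS total_spend,      -- aggregate feature
    COUNT(*)                     AS order_count,
    SUM(amount) * 1.0 / COUNT(*) AS avg_order_value   -- ratio feature
FROM sales
GROUP BY customer_id;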
8) Data Validation:
– Checking data integrity to ensure it meets expected standards and
business rules.
– Validating the accuracy, completeness, and reliability of data.
– Final check to ensure output information is accurate and reliable.
– Rejecting data that don't comply with pre-defined rules or constraints.
– Types of validation checks include consistency check, data-type
validation, range and constraint validation.
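•Range, completeness, and constraint checks can be sketched as a
single query (the 0-100 marks scale is an assumption):
-- Flag rows that violate the expected rules
SELECT *
FROM studentDetails
WHERE marks < 0 OR marks > 100   -- range and constraint validation
   OR studentID IS NULL;         -- completeness check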
4. Data Munging with SQL
Techniques and Steps in Data Munging:
•These stages/techniques represent a systematic approach to
preparing data for analysis, ensuring that the data is well-
structured, clean, enriched, and validated before further analysis
or processing.
•This ultimately leads to more accurate and reliable insights
from data analysis.
5. Filtering, Joins and Aggregation

Refer to the Word document shared in Moodle; a small illustrative
sketch follows below.


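•As a minimal illustrative sketch combining the three ideas
(customers and orders tables are assumed; full details are in the
Moodle document):
SELECT c.customer_name,
       COUNT(o.order_id) AS order_count,              -- aggregation
       SUM(o.amount)     AS total_spent
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id  -- join
WHERE o.order_date >= '2024-01-01'                    -- filtering
GROUP BY c.customer_name;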
6. Window Functions and Ordered Data
• Window functions are an advanced and versatile feature of
SQL.
• With window functions, data engineers can perform complex
analytical operations and gain deeper insights into their
datasets.
• A window function makes a calculation across multiple rows
that are related to the current row.
• For example, a window function allows you to calculate:
– Running totals (i.e., the sum of values from all rows up to and
including the current row)
– 7-day moving averages (i.e., the average of the current row and
the 6 rows preceding it)
– Rankings
6. Window Functions and Ordered Data
• Similar to an aggregate function (GROUP BY), a window
function performs the operation across multiple rows.
• Unlike an aggregate function, a window function does not
collapse rows into one single row; the rows retain their
separate identities.
• Windows can be defined in the SELECT section of the query.
• Syntax:
SELECT
    window_function() OVER (
        PARTITION BY partition_expression
        ORDER BY order_expression
        ROWS/RANGE BETWEEN start AND end
    ) AS window_column_alias
FROM table_name;
6. Window Functions and Ordered Data
• To reuse the same window with several window functions,
define a named window using the WINDOW keyword. This
appears in the query after the HAVING section and before the
ORDER BY section.
SELECT
    window_function() OVER (window_name)
FROM table_name
[HAVING ...]
WINDOW window_name AS (
    PARTITION BY partition_expression
    ORDER BY order_expression
    ROWS/RANGE BETWEEN start AND end
)
[ORDER BY ...]
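• As a concrete sketch, both functions below reuse one named window
(the sales table matches the examples later in this section):
SELECT
    sale_date,
    SUM(sales_amount) OVER w AS running_total,
    AVG(sales_amount) OVER w AS running_avg
FROM sales
WINDOW w AS (ORDER BY sale_date
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);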
6. Window Functions and Ordered Data
• Where,
– window_function: The analytical function you want to apply (e.g.,
ROW_NUMBER, RANK, SUM, AVG, etc.).
– PARTITION BY: Divides the result set into partitions to which the
function is applied independently.
– ORDER BY: Defines the order of rows within the partition.
– ROWS/RANGE BETWEEN: Specifies the range or number of rows
that the function operates on.
• SQL supports various window functions such as SUM(),
AVG(), COUNT(), RANK(), ROW_NUMBER(), LEAD() and
LAG(), FIRST_VALUE() and LAST_VALUE(), NTILE() and
more.
• These functions can compute values over a specified range of
rows.
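• For example, LAG() can compare each day's sales with the previous
day's, reusing the sales table from the examples that follow:
SELECT
    sale_date,
    sales_amount,
    sales_amount - LAG(sales_amount, 1) OVER (ORDER BY sale_date)
        AS change_vs_prev_day
FROM sales;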
6. Window Functions and Ordered Data
Example 1: Rolling Averages
•Imagine you have a sales table and want to calculate the 7-day
rolling average for each day’s sales. Window functions can make
this task much more efficient and readable:
SELECT
    sale_date,
    AVG(sales_amount) OVER (ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM sales;
6. Window Functions and Ordered Data
• We have a table called “sales” that presumably contains data about different sales
transactions, including the sale date and the amount of sales.
• In the above example:
– AVG(sales_amount) OVER: This indicates that we want to calculate
an average across a specific range of rows.
– ORDER BY sale_date: This specifies the order in which the rows are
considered for the calculation. In this case, the rows are ordered based on the
“sale_date” in ascending order.
– ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: This defines
the window, which is a subset of rows used for the calculation.
– Here, it says that the window should include the current row and the 6 rows
preceding it based on the ordering by “sale_date”.
• So, for each row, the window includes the current row and the previous six rows (a
total of seven rows). The window “slides” along the ordered rows for each
calculation.
6. Window Functions and Ordered Data
Example 2: Ranking Based on Sales.
•Suppose you want to rank salespeople based on their total sales,
with ties getting the same rank:
SELECT
    salesperson_id,
    SUM(sales_amount) AS total_sales,
    RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS sales_rank
FROM sales
GROUP BY salesperson_id;
•In the given SQL code, we have a table called “sales” that likely
contains data about sales transactions, including the salesperson’s
ID and the amount of each sale.
6. Window Functions and Ordered Data
Example 2: Ranking Based on Sales.
•SUM(sales_amount) AS total_sales: This part calculates the
total sales amount for each salesperson by summing up the
“sales_amount” values.
•RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS
sales_rank: This is where the windowing function comes into
play. It calculates a ranking for each salesperson based on their
total sales amount. Here’s how it works:
6. Window Functions and Ordered Data
Example 2: Ranking Based on Sales.
•RANK() is the ranking function. It assigns a rank to each row
based on the specified ordering, with tied values receiving the
same rank.
•OVER (ORDER BY SUM(sales_amount) DESC) defines the
window. It orders the rows by the total sales amount in
descending order. This means that the salesperson with the
highest total sales will have the rank of 1.
•So, for each salesperson, the sales_rank is determined by their
total sales amount compared to other salespersons.
•Finally, the result of the query displays the salesperson_id, the
total_sales amount for each salesperson, and their corresponding
sales_rank.
7. Preparing Data for Analytics Tool
• The data preparation process involves cleaning, transforming, and
structuring the data so that it can be used effectively with
analytics tools and technologies.
• A proper data preparation solution can help maintain data quality
and accuracy, preventing erroneous conclusions and decisions.
• Ensuring that data is clean and reliable for accurate analytics
is a critical process that involves various steps and practices.
They are:
7. Preparing Data for Analytics Tool
Data Collection and Entry:
•Standardize Data Entry: Implement standardized data entry
procedures to minimize human errors, typographical errors, and
inconsistencies in data capture.
•Validation Rules: Use validation rules and data entry forms to
enforce data integrity at the point of entry.
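•Such rules can also be enforced by the database itself; a minimal
sketch with assumed columns and an assumed 0-100 marks scale:
CREATE TABLE student_entry (
    studentID INT PRIMARY KEY,
    email     VARCHAR(255) NOT NULL UNIQUE,        -- completeness rule
    marks     INT CHECK (marks BETWEEN 0 AND 100)  -- range rule
);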
Data Cleaning:
•Data Profiling: Analyze data to identify inconsistencies,
missing values, and outliers.
•Data Cleaning Tools: Utilize data cleaning tools and software
to automate the process of correcting, deduplicating, and
transforming data.
7. Preparing Data for Analytics Tool
Data Cleaning:
•Data Transformation: Standardize and format data to ensure
uniformity.
•Address Missing Data: Develop strategies for handling missing
data, such as imputation or removing incomplete records.
Data Validation:
•Data Quality Metrics: Establish data quality metrics and KPIs
(Key Performance Indicators) to measure and monitor data
quality over time.
•Validation Rules: Define validation rules and checks to verify
the accuracy and consistency of data.
•Data Auditing: Conduct periodic data audits to identify and
rectify discrepancies.
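•A periodic audit can be as simple as a query that counts rule
violations (columns and rules assumed, matching the entry table
sketched earlier):
SELECT
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)
        AS missing_email,
    SUM(CASE WHEN marks NOT BETWEEN 0 AND 100 THEN 1 ELSE 0 END)
        AS out_of_range_marks
FROM student_entry;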
7. Preparing Data for Analytics Tool
Data Governance:
•Data Ownership: Assign data owners and stewards who are
responsible for maintaining data quality.
•Data Policies and Procedures: Develop and enforce data
governance policies and procedures to guide data handling and
maintenance.
Data Integration:
•Data Integration Tools: Use ETL (Extract, Transform, Load)
tools to integrate data from various sources, ensuring consistency
and reliability.
•Master Data Management (MDM): Implement MDM systems
to manage and maintain consistent master data.
7. Preparing Data for Analytics Tool
Data Security:
•Access Controls: Implement robust access controls to ensure
data security and prevent unauthorized changes or tampering.
•Data Encryption: Encrypt sensitive data to protect it from
unauthorized access.
Documentation:
•Metadata Management: Maintain detailed metadata records to
document the source, transformation processes, and history of
data.
•Data Lineage: Track the lineage of data to understand how it
flows through your organization.
7. Preparing Data for Analytics Tool
Data Quality Assurance:
•Data Quality Team: Establish a data quality team or assign data
stewards to monitor and maintain data quality.
•Data Quality Metrics: Continuously monitor data quality
metrics to identify and address issues as they arise.
Continuous Improvement:
•Feedback Loops: Establish feedback loops to capture and
address data quality issues reported by users.
•Root Cause Analysis: Investigate the root causes of data
quality issues to prevent their recurrence.
7. Preparing Data for Analytics Tool
Automated Monitoring:
•Utilize automated monitoring and alerting systems to detect
anomalies, data quality breaches, or deviations from
predefined standards.
Data Quality Reporting:
•Create regular data quality reports and dashboards to
communicate data quality metrics and issues to stakeholders.
Compliance and Regulation:
•Ensure that data management practices comply with relevant
industry regulations, such as the GDPR (General Data Protection
Regulation) or financial reporting standards.