1. Databases for Data Science-SQL
1. Databases for Data Science-SQL
Science:
• SQL
• Tool for Data Science
• Basic Statistics with SQL
• Data Munging with SQL
• Filtering, Joins and Aggregation
• Window Functions and Ordered Data
• Preparing Data for Analytics Tool
Databases for Data Science
• Data science involves extracting value and insights from large
volumes of data to drive business decisions.
• It also involves building predictive models using historical
data.
• Databases facilitate effective storage, management, retrieval,
and analysis of such large volumes of data.
• Data scientist, should understand the fundamentals of
databases.
– Because they enable the storage and management of large and complex
datasets, allowing for efficient data exploration, modeling, and
deriving insights.
1. SQL
• As a data scientist, should have a good understanding of
different types of databases, such as relational and NoSQL
databases, and their respective use cases.
• Relational databases are a type of database management
system (DBMS) that organize and store data in a structured
manner using tables with rows and columns.
• SQL (Structured Query Language) for Data Retrieval and
Data Manipulation & Management.
• Querying is a primary feature of SQL databases used for data
mining . making it easier to clean and organize it or
exploratory analysis.
• Through querying, data scientists are able to filter, sort, and
group data as well as return descriptive statistics.
1. SQL
• PostgreSQL, Microsoft SQL Server, MySQL, SQLite, and
IBM Db2 are some of the top SQL databases used in data
science.
• They each offer unique features and are compatible with
various programming languages.
• SQL as a essential data science tool, it includes an
understanding of how to clean, organize, and store data within
a relational database management system.
Data Science
• Data Science is a multidisciplinary field that combines
knowledge from various disciplines to help businesses make
intelligent decisions through data-driven analysis.
2. Tools for Data Science
Python:
•Along with R, Python is one of the most frequently utilized
languages in data research.
•It is flexible and readable and has many libraries to support it,
especially in data science, making it ideal for various tasks, from
web scraping to model building.
•Here are the critical Libraries for each category in Python
1) Web Scraping:
– BeautifulSoup: Easiest web scraping library in Python.
– Scrapy: Advanced web scraping library.
2. Tools for Data Science
Python:
2) Data Exploration and Manipulation:
– Pandas: Python data manipulation and analysis toolkit.
– NumPy: Supports big multidimensional arrays and mats.
3) Data Visualization:
– Matplotlib: The core Python plotting library
– Seaborn: A visualization library based on Matplotlib. It offers a high-
level interface for creating attractive statistical graphics.
– Plotly: Interactive graphing library.
4) Model Modeling:
– Scikit-learn: The most critical ML library in Python
– TensorFlow: Good to apply and scale Deep Learning.
– PyTorch: A machine learning library for image processing and NLP
applications.
2. Tools for Data Science
R:
•R is a potent text analysis tool designed to address statistical and
data analysis concerns.
•Its comprehensive statistical power and vast package ecosystem
make it quite popular in academia and research.
•Here are the critical Libraries for each category in Pythonn
1) Web Scraping:
– rvest: Makes web scraping easy by mimicking the exact structure of
the web page.
– RCurl: R bindings to the curl lib, allowing for anything that can be
done with the curl itself.
2. Tools for Data Science
R:
2) Data Exploration and Manipulation:
– dplyr: It is a grammar of data manipulation offering data manipulation
verbs that help make data manipulation easier.
– tidyr: Makes your data more accessible by manually spreading and
gathering data.
– Data.table: An extension of data.frame with faster data manipulation
capabilities.
3) Data Visualization:
– ggplot2: Application of the grammar of graphics.
– lattice: Better defaults + easy way to create multi-panel-plots.
– plotly: It converts graphs created with ggplot2 to interactive, user-
driven web-based graphs.
2. Tools for Data Science
R:
4) Model Modeling:
– Caret: Tools for creating classification and regression models.
– nnet: Offer functions to build neural networks.
– randomForest: It is a random forest algorithm-based library for
classification and regression.
2. Tools for Data Science
Excel:
•Excel is easy to use for analyzing and visualizing data. It is easy
to learn and compress, and its ability to handle large data sets
makes it helpful for fast data manipulation and analysis.
•Here, instead of libraries, the key functions of Excel ar divided
into subsections to categorize them.
1) Data Exploration and Manipulation:
– FILTER: Filters a spectrum of data depending on user defined
criteria.
– SORT: Sort the elements of a range or array.
– VLOOKUP/HLOOKUP: Finds things in tables or ranges by row or
column.
– TEXT TO COLUMNS: This will split the content of a cell into
multiple cells.
2. Tools for Data Science
Excel:
2) Data Visualization:
– Charts (Bar, Line, Pie, etc.): Regular standard chart types to depict
data.
– PivotTables: It condenses large data sets and creates interactive
summaries.
– Conditional Formatting: It displays which cells fall under a specific
rule.
3) Model Building:
– AVERAGE, MEDIAN, MODE: Calculates central tendencies.
– STDEV.P/STDEV.S: Works with the dataset to calculate dataset
segregation.
– LINEST: Based on the linear regression analysis, statistics for a
straight line that most matches a data set are returned.
– Regression Analysis (Data Analysis Toolpak): This toolkit uses
regression analysis to find correlations between variables.
2. Tools for Data Science
SQL:
•SQL is the language used to interact with relational databases
and is needed to store and process data.
•A data scientist primarily uses SQL as the standard way to
interact with databases, helping them query, update, and
manage data in all the databases. SQL is also required to access
the data for retrieval and analysis.
•The most popular SQL systems are:
– PostgreSQL: An open-source object-relational database system.
– MySQL: A high-level, popular open-source database known for its
speed and reliability.
– MsSQL (Microsoft SQL Server): A Microsoft-developed RDBMS
fully integrated Microsoft product with enterprise features.
– Oracle: It is a multi-model DBMS widely used in enterprise
environments. It combines the best relational model with tree-based
storage representation.
2. Tools for Data Science
Advance Visualization Tools:
•With the right advanced visualization tools, complex data can be
transformed into vivid, usable insights.
•These tools allow data scientists and business analysts to create
interactive and shareable dashboards that improve, understand,
and make the data accessible at the right time.
Important tools to build dashboards are:
•Power BI: A business analytics service by Microsoft that
provides interactive visualizations and business intelligence
capabilities with an interface simple enough for end users to
create their reports and dashboards.
2. Tools for Data Science
Advance Visualization Tools:
Important tools to build dashboards are:
•Tableau: A robust data visualization tool that allows users to
create interactive and shareable dashboards that give insightful
views of the data.
• It can handle large volumes of data and work well with
disparate data sources.
•Google Data Studio: It is a free parts web-based application
that allows user to create dynamic and aesthetic dashboards and
reports using data from virtually any source, and other parts free,
fully customizable, and easy-to-share reports that automatically
update using data from other Google services.
2. Tools for Data Science
Could Systems:
•Cloud systems are essential to data science because they can
scale, increase flexibility, and manage big datasets.
•They offer computational services, tools, and resources to store,
process, and analyze data at scale with cost optimization and
performance effectiveness.
Few Tools are:
•AWS (Amazon Web Services): Provides a highly sophisticated
and ever-evolving cloud computing platform that includes a
range of services such as storage, computation, machine learning,
big data analytics, etc.
2. Tools for Data Science
Could Systems:
Few Tools are:
•Google Cloud: Offers various cloud computing services that run
on the same infrastructure Google uses internally for products
such as Google Search and YouTube, including cloud data
analytics, data management, and machine learning.
•Microsoft Azure: Microsoft offers cloud computing services,
including virtual machines, databases, AI and machine learning
tools, and DevOps solutions.
•PythonAnywhere: A cloud-based development and hosting
environment allowing to run, develop, and host Python
applications through a web browser without IT staff setting up a
server. Ideal for data science and web app developers who want
to deploy their code quickly.
2. Tools for Data Science
Bonus: LLM’s
•Large Language Models (LLMs) are one of the cutting-edge
solutions in AI.
•They can learn and generate text like humans, and they are quite
advantageous in a wide range of applications, such as Natural
Language Processing, Customer Service Automation, Content
Generation, and so on.
Some of the most famous tools are:
•ChatGPT: It is a flexible conversational agent created by
OpenAI to generate human-like and in-context text, which is
beneficial.
•Gemini: The LLM created by Google will allow you to use it
directly inside Google apps like Gmail.
2. Tools for Data Science
Bonus: LLM’s
Some of the most famous tools are:
•Claude-3: A modern LLM specially built for better
understanding and text generation. It is used to assist in every
high-level NLP task and conversational AI.
•Microsoft Co-pilot: An AI-powered service integrated into
Microsoft applications, Co-pilot helps users by giving context-
sensitive recommendations and automating repetitive workflows,
enabling productivity and efficiencies across the processes.
3. Basic Statistics with SQL
• Statistics is widely used in various fields such as business,
economics, social science, medicine, and engineering.
• A Statistical function is a mathematical function that helps us
to process and analyze data to provide meaningful information
about the dataset.
• SQL statistical functions are essential tools for extracting
meaningful insights from databases. These functions, enable
users to perform statistical calculations on numeric data.
• Basic statistics can be performed in SQL using aggregate
functions, window functions, and grouping to analyze data.
3. Basic Statistics with SQL
1. COUNT: This function counts the number of rows in a
dataset or group of rows.
SELECT COUNT(studentID) AS total_students
FROM studentDetails;
Output:
https://www.xerago.com/importance-of-data-preparation -
Preparing Data for Accurarte analysis
Tools for data preparation
4. Data Munging with SQL
https://www.geeksforgeeks.org/what-is-data-munging/
https://www.integrate.io/glossary/what-is-data-munging/
https://medium.com/@jennifer.ezeanyim/my-data-wrangling-proj
ect-with-sql-mysql-50b6e6e53ba2
https://www.datacamp.com/blog/what-is-data-wrangling