Roles of Data Scientists in Business and Society
• The Data Scientist is responsible for advising the business on the potential of its
data, providing new insights into the business’s mission, and creating solutions that
enhance business performance through advanced statistical analysis, data mining,
and data visualization techniques.
• The Data Scientist combines data, computational science, and technology with
consumer-oriented business knowledge to drive high-value insights into the
business and high impact through the business levers at the business’s disposal.
• The Data Scientist plays a strategic role in the development of new approaches to
understand the business’s consumer trends and behaviors as well as approaches to
solve complex business issues.
• The Data Scientist also takes the initiative to experiment with various technologies
and tools, with the vision of creating innovative data-driven insights for the business
and society.
• Data science helps to address real-life social issues.
• Nonprofits and government organizations benefit from data science. These
organizations gather information to cater to society's needs.
• For nonprofits to run at their best, resources such as time, money, and products should
be used wisely. Data science enables these organizations to determine effective and
economical ways to serve those in need without running out of such valuable resources.
Module 2
Data Structure
• In computer science, a data structure is a particular way of organizing
and storing data in a computer such that it can be accessed and
modified efficiently.
• More precisely, a data structure is a collection of data values, the
relationships among them, and the functions or operations that can be
applied to the data.
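As a minimal sketch of this definition, the hypothetical `Stack` class below bundles stored values (a Python list), the relationship among them (insertion order), and the operations allowed on them (`push`, `pop`, `peek`). The class and its method names are illustrative assumptions, not part of the course material.

```python
class Stack:
    """A stack: stored values plus the operations that may be applied to them."""

    def __init__(self):
        self._items = []  # the data values, kept in insertion order

    def push(self, value):
        """Add a value on top of the stack."""
        self._items.append(value)

    def pop(self):
        """Remove and return the most recently added value."""
        return self._items.pop()

    def peek(self):
        """Return the most recently added value without removing it."""
        return self._items[-1]

    def __len__(self):
        return len(self._items)


stack = Stack()
stack.push("a")
stack.push("b")
top = stack.pop()  # last in, first out: the value pushed last comes back first
```

Restricting access to the top of the list is what makes this a stack rather than a general list: the organization of the data determines which operations are efficient.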
Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore
straightforward to analyze. Structured data conforms to a tabular format with
relationships between the different rows and columns.
Common examples of structured data are Excel files and SQL databases.
Other examples include quantities, barcodes, and weblog statistics.
Unstructured Data
Unstructured data is information that either does not have a predefined data model
or is not organized in a pre-defined manner. Unstructured information is typically
text-heavy, but may contain data such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it difficult to understand
using traditional programs as compared to data stored in structured databases.
Common examples of unstructured data include audio files, video files, and NoSQL databases.
Semi-structured Data
Semi-structured data is a form of structured data that does not conform
to the formal structure of data models associated with relational
databases or other forms of data tables.
Nonetheless, it contains tags or other markers to separate semantic
elements and enforce hierarchies of records and fields within the data.
Therefore, it is also known as a self-describing structure.
An example of semi-structured data is a delimited file, which contains elements
that can break down the data into separate hierarchies.
Similarly, in digital photographs, the image does not have a pre-defined
structure itself. Still, if it is taken from a smartphone, it would have structured
attributes like geotag, device ID, and Date Time stamp.
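A minimal sketch of self-describing data, using JSON as the tagged format. The file names, field names, and values below are hypothetical; the point is only that each record carries its own tags and that records need not share the same fields.

```python
import json

# Hypothetical photo metadata. The keys act as tags that describe each value.
raw = """
[
  {"file": "IMG_0001.jpg", "geotag": {"lat": 12.97, "lon": 77.59}, "device_id": "A1"},
  {"file": "IMG_0002.jpg", "timestamp": "2021-03-04T10:15:00"}
]
"""

records = json.loads(raw)

# There is no fixed schema, so a missing field is checked for explicitly.
geotagged = [record["file"] for record in records if "geotag" in record]
```

The nested `geotag` object shows the hierarchy of fields that a flat table row cannot express directly.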
What is a Database?
“A database is a collection of related data which represents some
elements of the real world. It is designed to be built and populated with
data for a specific task. It is also a building block of the data solution.”
Insurance sector: Data warehouses are widely used to analyze data patterns and
customer trends, and to track market movements quickly.
Retail chain: Data warehouses help to track items, identify customers' buying
patterns, plan promotions, and determine pricing policy.
Telecommunication: Data warehouses are used for product promotions, sales
decisions, and distribution decisions.
Relational Vs Non Relational Database
• A relational database is structured, meaning the data is organized in
tables. Often, the data within these tables have relationships
with one another, or dependencies.
• A non-relational database is often document-oriented, meaning all
information is stored in more of a laundry-list order. Within a single
construct, or document, you will have all of your data listed out.
SQL Databases (Relational)
SQL is short for Structured Query Language, a strictly defined language for
querying data that is organized in tables, columns, and rows.
• For example, if you want to look up the temperature at a certain time of day
on a certain day of the week, the data could be structured as
follows:
Table: Weather
Columns: Days of the Week
Rows: Time of Day
Data Points: Degrees Fahrenheit
In this structure, all queries would be related to this table and the structure
of the table would allow for easy sorting, filtering, computations, etc.
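The weather layout above can be sketched with Python's built-in `sqlite3` module. The table holds one row per time of day and one column per day of the week (only two days are shown). All temperatures are made-up values in degrees Fahrenheit, and the column names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# One row per time of day, one column per day of the week.
conn.execute(
    "CREATE TABLE weather (time_of_day TEXT PRIMARY KEY, monday REAL, tuesday REAL)"
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("06:00", 58.0, 61.0), ("12:00", 72.0, 75.0), ("18:00", 66.0, 64.0)],
)

# Filtering and sorting: Monday readings above 60 degrees, warmest first.
warm_monday = conn.execute(
    "SELECT time_of_day, monday FROM weather WHERE monday > 60 ORDER BY monday DESC"
).fetchall()

# Computation: average Tuesday temperature across all recorded times.
avg_tuesday = conn.execute("SELECT AVG(tuesday) FROM weather").fetchone()[0]

conn.close()
```

Because every query targets the same fixed table structure, sorting, filtering, and aggregation are each a single declarative statement.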
Relational Database
A relational database works by linking information from multiple tables through the use of
“keys.” A key is a unique identifier which can be assigned to a row of data contained within a
table.
This unique identifier, called a “primary key,” can then be included in a record located in
another table when that record has a relationship to the primary record in the main table.
When this unique primary key is added to a record in another table, it is called a “foreign key”
in the associated table.
The connection between the primary and foreign key then creates the “relationship” between
records contained across multiple tables.
The Employees table contains one row per employee, with each employee
assigned a unique id (primary key). In this case, the primary key is named Employee Id.
The second table, Sales, contains individual sales records that are then associated with the
employee that made the sale.
Because an employee can make multiple sales, their unique Employee Id (primary key),
can appear multiple times in the Sales table as a foreign key.
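The Employees and Sales relationship can be sketched with `sqlite3` as follows. The column names, employee names, and amounts are hypothetical; only the primary key and foreign key arrangement comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE sales (
           sale_id     INTEGER PRIMARY KEY,
           employee_id INTEGER NOT NULL REFERENCES employees(employee_id),
           amount      REAL NOT NULL
       )"""
)

conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
# Employee 1 appears twice in sales: the same key value repeats as a foreign key.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)],
)

# The join follows the key relationship to summarize sales per employee.
totals = conn.execute(
    """SELECT e.name, COUNT(s.sale_id), SUM(s.amount)
       FROM employees AS e
       JOIN sales AS s ON s.employee_id = e.employee_id
       GROUP BY e.employee_id
       ORDER BY e.employee_id"""
).fetchall()

# A sale that points at a non-existent employee is rejected.
try:
    conn.execute("INSERT INTO sales VALUES (13, 99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

conn.close()
```

The last step shows the practical benefit of declaring the relationship: the database itself refuses records that would break it.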
• Some popular SQL database systems include:
Oracle
Microsoft SQL Server
PostgreSQL
MySQL
MariaDB
NoSQL Databases (Non-Relational Databases)
• In contrast to a relational database, a NoSQL database is one that is less
structured/confined in format, and thus, allows for more flexibility and
adaptability.
• If you are going to be dealing with a dataset that isn’t clearly defined, meaning
not organized or structured, you likely won’t have the luxury of establishing
defined tables and relationships amongst the dataset.
Non Relational Database
• For example, Facebook Messenger uses a NoSQL database, because the
information being gathered isn’t structured enough to be segmented
into tables with defined relationships between them.
• Large volumes of unstructured information need to be held in a non-relational
database. Think of the information as being stored in one large Word
document. Everything is there. As more information gets entered, the
document gets longer. If you want to find and pull data, you have to, in
essence, ‘control/command + F’ and search for the data itself.
• Some popular NoSQL databases include:
MongoDB
Cassandra
Redis
Apache HBase
Amazon DynamoDB
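To illustrate the document-oriented idea described above, here is a minimal sketch that uses plain Python dictionaries rather than any real NoSQL product. The message fields and the `find` helper are hypothetical.

```python
# Hypothetical chat documents. Each document may carry different fields.
messages = [
    {"id": 1, "sender": "ana", "text": "hello"},
    {"id": 2, "sender": "raj", "text": "photo",
     "attachment": {"type": "image", "size_kb": 240}},
    {"id": 3, "sender": "ana", "reaction": "thumbs_up", "reply_to": 2},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given key/value pair."""
    return [
        doc for doc in docs
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


from_ana = find(messages, sender="ana")
replies = find(messages, reply_to=2)
```

No table definition is needed before a new kind of field, such as `reaction`, is stored. The trade-off is that every search has to cope with fields that may be absent.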
Relational Vs Non Relational Database
• Pros and cons of relational and non-relational databases
• Which type of database should you use?
• Compare your data against the points outlined below. If the relational points
describe your data, use a SQL database. If the non-relational points describe
it, use a NoSQL database.
Pros of a Relational Database
• Data is easily structured into categories.
• Your data is consistent in input and meaning, and easy to navigate.
• Relationships can be easily defined between data points.
Pros of a Non-Relational Database
• Data is not confined to a structured group.
• You can perform functions that allow for greater flexibility.
• Your data and analysis can be more dynamic and allow for more variant inputs.
RDBMS
RDBMS stands for Relational Database Management System. RDBMS is the basis for
SQL, and for many widely used database systems such as MS SQL Server, IBM DB2,
Oracle, MySQL, and Microsoft Access. A relational database management system (RDBMS)
is a database management system (DBMS) that is based on the relational model as
introduced by E. F. Codd.
Table
•The data in an RDBMS is stored in database objects called tables. A table
is a collection of related data entries and consists of columns and
rows.
Field
•Every table is broken up into smaller entities called fields. The fields in the
CUSTOMERS table consist of ID, NAME, AGE, ADDRESS and SALARY.
Record or a Row
•A record, also called a row of data, is an individual entry in a table.
For example, the CUSTOMERS table holds 7 records. Each record is a single row
with one value for every field: ID, NAME, AGE, ADDRESS, and SALARY.
Column
•A column is a vertical entity in a table that contains all information associated with
a specific field in a table.
•For example, a column in the CUSTOMERS table is ADDRESS, which holds the
location description of every customer.
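The terms table, field, record, and column can be seen together in a short `sqlite3` sketch. The field names follow the CUSTOMERS table described above, but the three rows are made-up placeholder values, not the table's actual contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by field name

conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, address TEXT, salary REAL)"
)
# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Customer A", 32, "City X", 2000.0),
        (2, "Customer B", 25, "City Y", 1500.0),
        (3, "Customer C", 23, "City Z", 2000.0),
    ],
)

cursor = conn.execute("SELECT * FROM customers ORDER BY id")
fields = [description[0] for description in cursor.description]  # the fields
rows = cursor.fetchall()

record = dict(rows[0])                               # one record (row)
address_column = [row["address"] for row in rows]    # one column

conn.close()
```

A record cuts across the table horizontally (one value per field), while a column cuts vertically (one field across all records).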
Database Normalization
Database normalization is the process of efficiently organizing data in a database. There are
two reasons for this normalization process −
∙ Eliminating redundant data, for example, the same data stored in more than one table.
∙ Ensuring data dependencies make sense.
•Both of these are worthy goals, as they reduce the amount of space a database
consumes and ensure that data is logically stored. Normalization consists of a series of
guidelines that help in creating a good database structure.
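A minimal sketch of the first goal, eliminating redundant data, in plain Python. The flat order records and their field names are hypothetical; the split into a customers table and an orders table is one simple way to apply the idea, not a full treatment of the normal forms.

```python
# Hypothetical denormalized data: customer details repeat on every order row.
orders_flat = [
    {"order_id": 1, "customer": "Asha", "city": "Pune", "item": "pen"},
    {"order_id": 2, "customer": "Asha", "city": "Pune", "item": "ink"},
    {"order_id": 3, "customer": "Ben", "city": "Goa", "item": "pad"},
]

customers = {}  # customer name -> customer record, stored exactly once
orders = []     # orders refer to a customer by id instead of repeating details

for row in orders_flat:
    name = row["customer"]
    if name not in customers:
        customers[name] = {
            "customer_id": len(customers) + 1,
            "name": name,
            "city": row["city"],
        }
    orders.append(
        {
            "order_id": row["order_id"],
            "customer_id": customers[name]["customer_id"],
            "item": row["item"],
        }
    )
```

After the split, a customer's city is stored in one place, so correcting it requires a single update rather than one per order.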
When comparing relational and non-relational databases, it’s important to first note that
these two very different types of databases are equally useful in their own right, but for
contrasting reasons and use cases. Neither type of database is better than the other,
and both relational and non-relational databases have their place.
Columnar Database
• A columnar database is a database management system (DBMS) that stores data
in columns instead of rows.
• The goal of a columnar database is to efficiently write and read data to and from
hard disk storage in order to reduce the time it takes to return a query result.
• In a columnar database, all the column 1 values are physically together, followed
by all the column 2 values, etc.
• The data is stored in record order, so the 100th entry for column 1 and the 100th
entry for column 2 belong to the same input record.
• This allows individual data elements, such as customer name for instance, to be
accessed in columns as a group, rather than individually row-by-row.
• Here is an example of a simple database table with 4 columns and 3 rows.
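As a minimal sketch with made-up values, the block below takes a table with 4 columns and 3 rows and lays it out both row by row and column by column. The column names and data are hypothetical.

```python
# Hypothetical table: 4 columns, 3 rows.
columns = ("id", "name", "city", "amount")
rows = [
    (1, "Asha", "Pune", 250.0),
    (2, "Ben", "Goa", 75.0),
    (3, "Chen", "Pune", 100.0),
]

# Row-oriented layout: all values of record 1, then all values of record 2, ...
row_store = [value for row in rows for value in row]

# Column-oriented layout: all values of column 1, then all values of column 2, ...
column_store = {
    name: [row[index] for row in rows] for index, name in enumerate(columns)
}

# An aggregate over one column touches only that column's values.
total_amount = sum(column_store["amount"])

# Values stay in record order, so position 1 in every column is the same record.
second_record = tuple(column_store[name][1] for name in columns)
```

Summing `amount` reads three contiguous values in the column layout, whereas the row layout would require stepping over every other field of every record.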
Lift:
This measure compares the confidence of the rule "if A is purchased, B is purchased" with how often
item B is purchased overall. A lift above 1 means A and B are purchased together more often than
would be expected if they were independent.
Support:
This measure is how often a set of items is purchased together, as a fraction of all transactions in
the overall dataset.
Confidence:
This measure is how often item B is purchased when item A is purchased as well.
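The three measures can be computed directly from a list of transactions. The sketch below uses five made-up shopping baskets. The function names are illustrative, and lift is taken as confidence divided by the support of item B, which is the standard definition.

```python
# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]


def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(a, b):
    """Of the transactions containing item a, the fraction that also contain b."""
    return support({a, b}) / support({a})


def lift(a, b):
    """Confidence of a -> b relative to how often b is purchased overall."""
    return confidence(a, b) / support({b})


bread_butter_support = support({"bread", "butter"})        # 2 of 5 baskets
bread_butter_confidence = confidence("bread", "butter")    # 2 of the 3 bread baskets
bread_butter_lift = lift("bread", "butter")
```

Here the lift is slightly above 1, so in this toy data butter is bought a little more often with bread than it is bought in general.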
Decision Tree
• A decision tree is a tree structure in which each internal node denotes a test on
an attribute, each branch denotes the outcome of a test, and each leaf node holds
a class label.
• The topmost node in the tree is the root node.
• For example, a decision tree for the concept buy computer indicates whether
a customer at a company is likely to buy a computer or not.
• The benefits of having a decision tree are as follows −
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
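A minimal sketch of classification with a decision tree for the buy computer concept. The tree below is written by hand, and its attributes (`age`, `student`, `credit_rating`), branch values, and class labels are illustrative assumptions rather than a tree learned from data.

```python
# Internal nodes are dicts that name the attribute to test; leaves are class labels.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does not buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": "buys", "fair": "does not buy"},
        },
    },
}


def classify(node, customer):
    """Walk from the root, following the branch that matches each test outcome."""
    while isinstance(node, dict):
        outcome = customer[node["attribute"]]
        node = node["branches"][outcome]
    return node  # a leaf: the class label


prediction = classify(tree, {"age": "youth", "student": "yes"})
```

Classification only follows one path from the root to a leaf, which is why the classification step is fast and the decision is easy to explain.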
Analytical Methodology
• In terms of methodology, analytics differs significantly from the traditional
statistical approach of experimental design. Analytics starts with data. Normally
we model the data in a way that explains a response.
• The objective of this approach is to predict the response behavior or to understand
how the input variables relate to a response. In statistical experimental designs,
by contrast, an experiment is developed first and data is retrieved as a result.
• This allows data to be generated in a way that suits a statistical model,
where certain assumptions hold, such as independence, normality, and
randomization.
• Normally, once the business problem is defined, a research stage is needed to
design the methodology to be used. However, some general guidelines are worth
mentioning and apply to almost all problems.
• One of the most important tasks in big data analytics is statistical modeling,
meaning supervised (classification or regression) and unsupervised problems.
• Once the data is cleaned, preprocessed, and available for modeling, care should be
taken to evaluate different models with reasonable loss metrics. Once the model
is implemented, further evaluation should be carried out and the results reported.
• A common pitfall in predictive modeling is to just implement the model and never
measure its performance.
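A minimal sketch of evaluating models with a loss metric, assuming a small made-up regression dataset. Two models are fitted on a training portion and compared by mean squared error (MSE) on held-out data: a baseline that always predicts the training mean, and a least-squares line. The data, the split, and the choice of MSE are all illustrative assumptions.

```python
# Hypothetical (x, y) pairs, roughly following y = 2x + 1.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8),
        (4, 9.1), (5, 11.0), (6, 12.8), (7, 15.2)]
train, test = data[:6], data[6:]  # hold out the last two points for evaluation


def mse(model, points):
    """Mean squared error of model predictions over the given points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Model 1: baseline that ignores x and predicts the training mean of y.
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)


def baseline(x):
    return mean_y


# Model 2: simple least-squares line fitted on the training points only.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in train)
    / sum((x - mean_x) ** 2 for x, _ in train)
)
intercept = mean_y - slope * mean_x


def linear(x):
    return slope * x + intercept


# Both models are scored on data they did not see during fitting.
baseline_mse = mse(baseline, test)
linear_mse = mse(linear, test)
```

Comparing against a trivial baseline on held-out data is what guards against the pitfall above: a model is only worth implementing if its measured loss is clearly better than the simplest alternative.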
• Preparing objectives & identifying data requirements,
• Data Collection,
• Understanding data