MLM FDS
2021 Regulations
UNIT I–INTRODUCTION
PART-A
Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze
large amounts of data.
Academic institutions use data science to monitor student performance and improve
their marketing to prospective students.
Machine-generated data can be found across all sectors of computing and business. Any
organization that uses computers in its daily operations produces this type of data, often
without users being aware that they are generating it. Examples of machine generated data
include: APIs.
Graph or network data is, in short, data that focuses on the relationship or adjacency
of objects.
PART-B
This data type is non-numerical. This type of data is collected through methods of
observations, one-to-one interviews, conducting focus groups, and similar methods.
Qualitative data in statistics is also known as categorical data – data that can be
arranged categorically based on the attributes and properties of a thing or a
phenomenon.
Where quantitative data describes how observers can quantify the world around them,
qualitative data is about the emotions or perceptions of people and what they feel.
Qualitative analysis is key to getting useful insights from textual data, figuring out its
rich context, and finding subtle patterns and themes.
In qualitative data, these perceptions and emotions are documented. It helps market
researchers understand their consumers’ language and solve the research problem
effectively and efficiently.
Rich data
Collected data can also be used to conduct future research. Since the questions
asked to collect qualitative data are open-ended questions, respondents are free to
express their opinions, leading to more information.
3. Ranked data (quantitative data)
A ranked variable is one that has an ordinal value (i.e. 1st, 2nd, 3rd, etc.). While the
exact value of the variable may not be known, its place relative to the other
variables is. Ranked data is data that has been compared to the other pieces of data
and given a "place" relative to these other pieces of data. For example, to rank the
numbers 7.6, 2.4, 1.5, and 5.9 from least to greatest, 1.5 is first, 2.4 is second, 5.9 is
third, and 7.6 is fourth. The numbers within this data set (7.6, 2.4, 1.5, 5.9) are
ranked data, and the ordinal numbers used to rank them (1st, 2nd, 3rd, 4th) are
ranked variables.
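This ranking can be computed directly; a minimal sketch using SciPy's rankdata function
(assumes scipy is installed):

import numpy as np
from scipy.stats import rankdata

values = np.array([7.6, 2.4, 1.5, 5.9])
# rankdata assigns rank 1 to the smallest value, 2 to the next, and so on
print(rankdata(values))   # [4. 2. 1. 3.]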
Ranked data has many uses, including in:
Sports: Most sports organizations (such as the NFL, FIFA, and bicycle racing bodies) rank
their teams or athletes to determine who advances to the final match.
Politics: There are many world-wide rankings, including education, environment, and
technology.
Search Engines: Search engines give results based on what they think is important and
relevant to a search.
Ranked data is important when you want to know how each piece of data compares to the
others in a set. It is also the basis of certain statistical measures, such as Spearman's
rank correlation coefficient.
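A minimal sketch of Spearman's rank correlation with SciPy (the two rankings below are
made up for illustration):

from scipy.stats import spearmanr

judge_a = [1, 2, 3, 4, 5]   # hypothetical ranking from one judge
judge_b = [2, 1, 3, 5, 4]   # hypothetical ranking from another judge
rho, p_value = spearmanr(judge_a, judge_b)
print(rho)                  # how closely the two rankings agree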
4. Dependent variable (experiment)
A dependent variable is what changes as a result of the independent variable
manipulation in experiments. It's what you're interested in measuring, and it
“depends” on your independent variable. In statistics, dependent variables are also
called response variables, because they respond to a change in another variable.
5. Confounding variable (observational study)
In an observational study, confounding occurs when a risk factor for the outcome
also affects the exposure of interest, either directly or indirectly. The resultant bias
can strengthen, weaken, or completely reverse the true exposure-outcome
association.
2. Define Scatterplots?
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables.
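A minimal scatter plot sketch in matplotlib (the values are made up for illustration):

import matplotlib.pyplot as plt

heights = [150, 160, 165, 170, 180]   # hypothetical x variable
weights = [50, 58, 61, 66, 75]        # hypothetical y variable
plt.scatter(heights, weights)         # one dot per (x, y) pair
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()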
3. What is the correlation coefficient?
The correlation coefficient measures the relationship between two variables. It can never
be less than -1 or higher than 1; a value of 1 means there is a perfect linear relationship
between the variables.
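For example, NumPy's corrcoef computes the (Pearson) correlation coefficient:

import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]             # exactly 2*x, a perfect linear relationship
print(np.corrcoef(x, y)[0, 1])   # 1.0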
4. Define Regression.
A regression is a statistical technique that relates a dependent variable to one or
more independent (explanatory) variables.
5. Write the types of regression analysis.
Linear regression and logistic regression are two common types of regression analysis
used in machine learning.
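A minimal sketch of both techniques using scikit-learn (assumes scikit-learn is installed;
the toy data is made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])   # one explanatory variable

# Linear regression predicts a continuous value
lin = LinearRegression().fit(X, np.array([2.1, 3.9, 6.2, 8.0, 9.8]))
print(lin.predict([[6]]))   # roughly 12

# Logistic regression predicts a binary outcome
log = LogisticRegression().fit(X, np.array([0, 0, 0, 1, 1]))
print(log.predict([[6]]))   # class 1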
8. What is the interpretation of R²?
R-squared and adjusted R-squared describe how well the linear regression model fits the
data points. The value of R-squared is always between 0 and 1 (0% to 100%). A high
R-squared value means that many data points are close to the linear regression line.
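As an illustration, R-squared is one minus the ratio of the residual sum of squares to the
total sum of squares; a minimal NumPy sketch with made-up numbers:

import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])        # observed values
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # values predicted by a fitted line
ss_res = np.sum((y - y_pred) ** 2)             # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
print(1 - ss_res / ss_tot)                     # close to 1, so the fit is good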
PART-B
1. What is Python? Explain about Python libraries.
Python is one of the most popular programming languages used across various tech
disciplines, especially in data science and machine learning. Python offers an easy-to-
code, object-oriented, high-level language with a broad collection of libraries for a
multitude of use cases.
1. TensorFlow
The first in the list of Python libraries for data science is TensorFlow. TensorFlow is a
library for high-performance numerical computations with around 35,000 comments
and a vibrant community of around 1,500 contributors. It’s used across various
scientific fields. TensorFlow is basically a framework for defining and running
computations that involve tensors, which are partially defined computational objects
that eventually produce a value.
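A minimal sketch of defining and running such a computation in TensorFlow 2.x (assumes the
tensorflow package is installed):

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 tensor
b = tf.constant([[1.0], [1.0]])             # a 2x1 tensor
c = tf.matmul(a, b)                         # the computation produces a value
print(c.numpy())                            # [[3.] [7.]]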
Features:
Better computational graph visualizations
Reduces error by 50 to 60 percent in neural machine learning
Parallel computing to execute complex models
Seamless library management backed by Google
Quicker updates and frequent new releases to provide you with the latest features
TensorFlow is particularly useful for the following applications:
Speech and image recognition
Text-based applications
Time-series analysis
Video detection
2. SciPy
SciPy (Scientific Python) is a free and open-source Python library built on the NumPy
extension of Python; it is widely used for high-level scientific and technical computation.
Features:
Collection of algorithms and functions built on the NumPy extension of Python
High-level commands for data manipulation and visualization
Multidimensional image processing with the SciPy ndimage submodule
Includes built-in functions for solving differential equations
Applications:
Multidimensional image operations
Solving differential equations and the Fourier transform
Optimization algorithms
Linear algebra
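A minimal SciPy sketch solving a differential equation, one of the applications listed
above (assumes scipy is installed):

from scipy.integrate import solve_ivp

# Solve dy/dt = -2y with y(0) = 1 over t in [0, 1]
sol = solve_ivp(lambda t, y: -2 * y, (0, 1), [1.0])
print(sol.y[0][-1])   # close to exp(-2), about 0.135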
3. NumPy
NumPy (Numerical Python) is the fundamental package for numerical computation in
Python; it contains a powerful N-dimensional array object. It has around 18,000
comments on GitHub and an active community of 700 contributors. It’s a general-
purpose array-processing package that provides high-performance multidimensional
objects called arrays and tools for working with them. NumPy also partly addresses
Python's slowness in numerical work by providing these multidimensional arrays, along
with functions and operators that operate efficiently on them (a short sketch follows the
applications list below).
Features:
Provides fast, precompiled functions for numerical routines
Array-oriented computing for better efficiency
Supports an object-oriented approach
Compact and faster computations with vectorization
Applications:
Extensively used in data analysis
Creates powerful N-dimensional array
Forms the base of other libraries, such as SciPy and scikit-learn
Replacement of MATLAB when used with SciPy and matplotlib
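A minimal sketch of the vectorized, array-oriented style these features describe:

import numpy as np

a = np.arange(1_000_000)   # a large array of 0, 1, 2, ...
b = a * 2.5 + 1            # vectorized arithmetic: no explicit Python loop
print(b[:3])               # [1.  3.5 6. ]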
4. Pandas
Next in the list of Python libraries is Pandas. Pandas (Python data analysis) is a must
in the data science life cycle. It is the most popular and widely used Python library for
data science, along with NumPy and matplotlib. With around 17,000 comments on
GitHub and an active community of 1,200 contributors, it is heavily used for data
analysis and cleaning. Pandas provides fast, flexible data structures, such as the
DataFrame, which are designed to make working with structured data easy and intuitive
(a short sketch follows the applications list below).
Features:
Eloquent syntax and rich functionalities that give you the freedom to deal with
missing data
Enables you to create your own function and run it across a series of data
High-level abstraction
Contains high-level data structures and manipulation tools
Applications:
General data wrangling and data cleaning
ETL (extract, transform, load) jobs for data transformation and data storage, as it has
excellent support for loading CSV files into its data frame format
Used in a variety of academic and commercial areas, including statistics, finance and
neuroscience
Time-series-specific functionality, such as date range generation, moving window,
linear regression and date shifting.
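A minimal sketch of the kind of wrangling described above (toy data for illustration):

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cara"],
                   "dept": ["A", "B", "A"],
                   "score": [85, 90, 92]})
print(df[df["dept"] == "A"])               # filter rows
print(df.groupby("dept")["score"].mean())  # aggregate by group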
2. What is data analysis? Why is Python used for data analysis?
Data analysts are responsible for interpreting data and analyzing the results utilizing
statistical techniques and providing ongoing reports. They develop and implement data
analyses, data collection systems, and other strategies that optimize statistical efficiency and
quality. They are also responsible for acquiring data from primary or secondary data sources
and maintaining databases.
Besides, they identify, analyze, and interpret trends or patterns in complex data sets. Data
analysts review computer reports, printouts, and performance indicators to locate and
correct code problems. By doing this, they can filter and clean data.
Data analysts conduct full lifecycle analyses to include requirements, activities, and design,
as well as developing analysis and reporting capabilities. They also monitor performance
and quality control plans to identify improvements.
Finally, they use the results of the above responsibilities and duties to better work with
management to prioritize business and information needs.
One needs only to briefly glance over this list of data-heavy tasks to see that having a tool that can
handle mass quantities of data easily and quickly is an absolute must. Considering the proliferation
of Big Data (and it’s still on the increase), it is important to be able to handle massive amounts of
information, clean it up, and process it for use. Python fits the bill since its simplicity and ease of
performing repetitive tasks means less time needs to be devoted to trying to figure out how the tool
works.
Data Analysis Vs. Data Science
Before wading in too deep on why Python is so essential to data analysis, it’s important first
to establish the relationship between data analysis and data science, since the latter also
tends to benefit greatly from the programming language. In other words, many of the
reasons Python is useful for data science also end up being reasons why it’s suitable for data
analysis.
The two fields have significant overlap, and yet are also quite distinct, each in its own
right. The main difference between a data analyst and a data scientist is that the former
curates meaningful insights from known data, while the latter deals more with the
hypotheticals, the what-ifs. Data analysts handle the day-to-day, using data to answer
questions presented to them, while data scientists try to predict the future and frame those
predictions in new questions. Or to put it another way, data analysts focus on the here and
now, while data scientists extrapolate what might be.
There are often situations where the lines get blurred between the two specialties, and that’s
why the advantages that Python bestows on data science can potentially be the same ones
enjoyed by data analysis. For instance, both professions require knowledge of software
engineering, competent communication skills, basic math knowledge, and an understanding
of algorithms. Furthermore, both professions require knowledge of programming languages
such as R, SQL, and, of course, Python.
3. Discuss in detail about Pandas in Python with a suitable example.
Pandas is a powerful and versatile library that simplifies data manipulation tasks in
Python. Pandas is built on top of the NumPy library and is particularly well-suited for
working with tabular data, such as spreadsheets or SQL tables. Its versatility and ease
of use make it an essential tool for data analysts, scientists, and engineers working with
structured data in Python.
When you want to combine data objects based on one or more keys, similar to what you’d
do in a relational database, merge() is the tool you need. More specifically, merge() is most
useful when you want to combine rows that share data.
You can achieve both many-to-one and many-to-many joins with merge(). In a many-to-
one join, one of your datasets will have many rows in the merge column that repeat the
same values. For example, the values could be 1, 1, 3, 5, and 5. At the same time, the
merge column in the other dataset won’t have repeated values. Take 1, 3, and 5 as an
example.
As you might have guessed, in a many-to-many join, both of your merge columns will
have repeated values. These merges are more complex and result in the Cartesian product
of the joined rows.
This means that, after the merge, you’ll have every combination of rows that share the
same value in the key column. You’ll see this in action in the examples below.
What makes merge() so flexible is the sheer number of options for defining the behavior of
your merge. While the list can seem daunting, with practice you’ll be able to expertly
merge datasets of all kinds.
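A minimal many-to-one merge() sketch using the key values from the example above:

import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 3, 5, 5],    # repeated key values
                       "amount": [10, 20, 15, 5, 30]})
customers = pd.DataFrame({"customer_id": [1, 3, 5],       # unique key values
                          "name": ["Ann", "Bob", "Cara"]})
merged = pd.merge(orders, customers, on="customer_id")    # many-to-one join
print(merged)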
Arrays are very frequently used in data science, where speed and resources are very
important.
7. How can operations be performed on null values in Pandas?
Use the notnull() method to return a DataFrame of boolean values that are False for NaN
values when checking for null values in a Pandas DataFrame.
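A minimal sketch of notnull() together with its usual companions dropna() and fillna()
(toy data for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
print(df["a"].notnull())   # True, False, True
print(df.dropna())         # drop rows that contain NaN
print(df.fillna(0))        # replace NaN with 0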
2D array creation:
import numpy as np

# Build a 2-D array (2 rows, 3 columns) from a nested list
two_dimensional_list = [[1, 2, 3], [4, 5, 6]]
two_dimensional_arr = np.array(two_dimensional_list)
print("2D array is:", two_dimensional_arr)

3D array creation:
import numpy as np

# Build a 3-D array (shape 1 x 3 x 3) from a doubly nested list
three_dimensional_list = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
three_dimensional_arr = np.array(three_dimensional_list)
print("3D array is:", three_dimensional_arr)
PART-B
11. Explain Big Data, its characteristics, and its applications.
Big Data refers to a massive amount of data that keeps growing exponentially with time.
It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques. Big Data work includes data mining, data storage, data analysis,
data sharing, and data visualization. The term is an all-comprehensive one, covering the
data itself and data frameworks, along with the tools and techniques used to process and
analyze the data.
Types of Big Data
Now that we know what big data is, let's have a look at the types of big data:
Structured
Structured data is data that has a fixed, well-defined format and can easily be stored,
accessed, and processed; a table in a relational database is a typical example.
Unstructured
Unstructured data refers to the data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and analyze.
Email is an example of unstructured data. Structured and unstructured are two important
types of big data.
PART-A
1. What is the purpose of matplotlib?
Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations in Python
2. What is the dual interface of matplotlib?
An explicit "Axes" interface that uses methods on a Figure or Axes object to create
other Artists and build a visualization step by step; this has also been called an
"object-oriented" interface.
An implicit "pyplot" interface that keeps track of the last Figure and Axes created and
adds Artists to the object it thinks the user wants.
3. How to draw a simple line plot using matplotlib?
A minimal runnable version of the steps (the CSV file name and its Date/Close columns
are assumed for illustration):

pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a Pandas DataFrame (hypothetical file)
df = pd.read_csv('prices.csv')

# Extract the date and close price columns
dates = df['Date']
closing_price = df['Close']

# Create a line plot in red colour with an increased linewidth
plt.plot(dates, closing_price, color='red', linewidth=2)
plt.show()
4. Define contour plots?
A contour plot is a graphical technique for representing a 3-dimensional
surface by plotting constant z slices, called contours, on a 2-dimensional
format.
5. What functions can be used to draw contour plots?
Matplotlib provides plt.contour() for contour lines and plt.contourf() for filled
contours. The contour plot is used to depict the change in Z values as compared to X and
Y values. If the data (or function) do not form a regular grid, you typically need to
perform a 2-D interpolation to form a regular grid.
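A minimal sketch (the surface z = x² + y² is chosen just for illustration):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)   # build the regular grid
Z = X**2 + Y**2            # Z values over the grid
plt.contour(X, Y, Z)       # constant-z slices drawn as contour lines
plt.show()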
6. What is the purpose of histogram?
The histogram is a popular graphing tool. It is used to summarize
discrete or continuous data that are measured on an interval scale.
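A minimal histogram sketch (random data for illustration):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(0, 1, 1000)   # 1,000 samples from a normal distribution
plt.hist(data, bins=20)               # summarize the data into 20 interval bins
plt.show()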
PART-B
When you open Tableau, you are taken to the home page where you can easily select from
previous workbooks, sample workbooks, and saved data sources. You can also connect to
new data sources by selecting Connect to Data. Figure 2.1 displays the screen.
The option In a File is for connecting to locally stored or file-based data. Tableau
Personal edition can only access Excel, Access, and text files (txt, csv). You can also
import from data sources stored in other workbooks.
Connecting to Desktop Sources
If you click on one of the desktop source options under the In a File list, you will get a
directory window to select the desired file. Once you have chosen your file, you will be
taken to the Connection Options window. There are small differences in the connection
dialog depending on the data source you are connecting to, but the menus are
self-explanatory. Figure 2.2 shows the connection window with the Superstore sample
spreadsheet being the file that is being accessed.
2. Explain joining tables with Tableau.
A way to extract data from multiple tables in the database is by Tableau Joins. They enable
us to get data from different tables provided that the tables have certain fields common. The
common field shall be the primary key in one table that acts as a foreign key in another.
Various types of Joins include Inner Join, Left Join, Right Join, and Full Outer Join.
Tableau allows us to perform joins in a very easy manner. It offers a guided approach to
joining two tables, providing a couple of important options. Using this functionality, we
can get data from different tables for analysis.
3. Explain sets, groups & hierarchies.
A Tableau Group is a set of multiple members combined in a single dimension to create a
higher category of the dimension. Tableau allows the grouping of single-dimensional
members and automatically creates a new dimension adding the group at the end of the
name. Tableau does not do anything with the original dimension of the members.
Sort data
Data present in the visualization and worksheet can be sorted based on the requirement.
Tableau can sort the data in data source order, ascending, descending, or by any
measured value.
Step 1) Go to a Worksheet and drag a dimension and measure as given in the image.
Step 2)
Right click on Category.
Select ‘Sort’ option.
Sort Order:
Use the Formula Builder to create calculated fields and measures and configure their
summary functions to generate the data that you want.
Use the following fields in the Formula Builder tab to create the formula for
your calculated field and measures:
Formula field box
a. Edit the formula for calculating your fields and measures by typing directly into
the Formula field box.
b. Consider the following when you write your formula:
Formulas must use the following syntax:
Labels for fields and measures must be delineated in double quotation
marks ("). For example, "Customer ID", "Date ordered".
Surround text with single quotation marks ('). For example, '--'.
Use single quotation marks (') for levels. For
example, 'ColumnGroup' and 'Total'. For more information about
levels, see Aggregate functions.
The following words are reserved and cannot be used as field names unless
they are contained as part of a phrase such as Not Available:
And
In
Not
Or
Add fields, measures, and functions to your formula by double-clicking
them.
Click the buttons below the Formula field to add operators.
Number Functions.
String Functions.
Date Functions.
Type Conversion.
Logical Functions.
Aggregate Functions.
Pass-Through Functions (RAWSQL)
User Functions.
7. How to use maps to improve insights?
8. How to provide self-evident ad hoc analysis with parameters?
9. How to edit views in Tableau Server?