Big Data SYBBA(CA)


BIG DATA

4 marks questions

1. Explain Decision tree with example.


A decision tree is a type of supervised learning algorithm commonly used in
machine learning to model and predict outcomes based on input data. It is a
tree-like structure in which each internal node represents a test or decision on
an attribute, each branch corresponds to an attribute value, and each leaf node
represents the final decision or prediction.
Decision trees offer interpretability, versatility, and simple visualization,
making them valuable for both classification and regression tasks.
The attributes in a decision tree are selected on the basis of an Attribute
Selection Measure.
An attribute selection measure is a criterion used to evaluate and select the best
attributes (or features) when constructing a decision tree. The choice of attribute
has a significant impact on the performance and accuracy of the model.
Some commonly used attribute selection measures are:
1. Information Gain
2. Gini Index
Example of Decision Tree:
Suppose we want to predict whether a person will buy a laptop based on their
age and income. A simple decision tree could look like this:

Age
├── < 25  → Income
│           ├── < 40k  → Do Not Buy
│           └── >= 40k → Buy
└── >= 25 → Income
            ├── < 60k  → Do Not Buy
            └── >= 60k → Buy
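As a hedged illustration of how such a tree can be built in practice, the
following R sketch fits a classification tree with the rpart package (assumed
to be installed) on a small, made-up laptop-purchase dataset; by default rpart
chooses splits using the Gini index mentioned above.

# Illustrative sketch, assuming the rpart package is available.
library(rpart)

# Made-up training data for the laptop-purchase example.
buyers <- data.frame(
  age    = c(22, 24, 30, 35, 23, 40, 28, 45),
  income = c(30000, 45000, 50000, 70000, 38000, 65000, 55000, 80000),
  buys   = factor(c("No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"))
)

# method = "class" builds a classification tree; splits are chosen
# with an attribute selection measure (the Gini index by default).
fit <- rpart(buys ~ age + income, data = buyers, method = "class",
             control = rpart.control(minsplit = 2))

print(fit)  # shows the chosen splits and the leaf predictions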
2. Define Correlation and explain its types with diagram?
Correlation
Correlation can be defined as the statistical method of evaluation used to
measure the strength and direction of the linear relationship between two
numerically measured, continuous variables. It quantifies how changes
in one variable are associated with changes in another variable. The
correlation coefficient, typically denoted as r, ranges from -1 to +1.
Types of Correlation:
1. Positive Correlation
- **Definition**: In a positive correlation, as one variable increases, the
other variable also tends to increase.
- **Example**: Height and weight are often positively correlated;
generally, taller individuals weigh more.

2. Negative Correlation
- **Definition**: In a negative correlation, as one variable increases, the
other variable tends to decrease.
- **Example**: The amount of time spent studying and the number of
errors on a test usually show a negative correlation; as study time
increases, the number of errors decreases.

3. No Correlation
- **Definition**: No correlation means that there is no predictable
relationship between the two variables; changes in one variable do not
affect the other.
- **Example**: The colour of a car and its fuel efficiency likely have no
correlation; the car's colour does not influence how efficiently it uses fuel.

4. Perfectly Positive Correlation


- **Definition**: A perfectly positive correlation occurs when the
correlation coefficient is exactly +1. This means that the two variables
increase together in perfect proportion.
- **Example**: If you have a dataset where every time one variable
increases by 1 unit, the other variable also increases by exactly 1 unit
(e.g., temperature in Celsius and temperature in Fahrenheit).

5. Perfectly Negative Correlation


- **Definition**: A perfectly negative correlation occurs when the
correlation coefficient is exactly -1. This means that as one variable
increases, the other decreases in perfect proportion.
- **Example**: An example could be the relationship between the
number of hours a car runs and the amount of fuel left in the tank if the
fuel is consumed at a constant rate.

Summary of Correlation Types


- **Positive Correlation**: Both variables increase together.
- **Negative Correlation**: One variable increases while the other
decreases.
- **No Correlation**: No relationship between the variables.
- **Perfectly Positive Correlation**: A perfect linear relationship where
both variables move in the same direction.
- **Perfectly Negative Correlation**: A perfect linear relationship where
one variable moves in the opposite direction to the other.
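As a small hedged sketch, Pearson's r can be computed in base R with cor();
the height and weight vectors below are made-up values for illustration.

# Minimal sketch: Pearson correlation in base R on made-up data.
height <- c(150, 160, 165, 170, 180)   # cm
weight <- c(50, 58, 63, 66, 75)        # kg

r <- cor(height, weight)   # defaults to the Pearson coefficient
print(r)                   # close to +1: a strong positive correlation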
3. Write a R program to sort a vector in ascending and descending order?
# Create a numeric vector to be sorted
sortvector <- c(1, 6, 5, 3, 7, 4)

# sort() returns the vector in ascending order by default
a <- sort(sortvector)
print("Ascending Order: ")
print(a)

# decreasing = TRUE returns the vector in descending order
b <- sort(sortvector, decreasing = TRUE)
print("Descending Order: ")
print(b)
4. Difference Between Descriptive, Predictive & Prescriptive Analysis?
| Features | Descriptive Analysis | Predictive Analysis | Prescriptive Analysis |
|---|---|---|---|
| Definition | Descriptive analysis is the process of analysing historical data to understand what has happened in the past. | Predictive analysis uses historical data and statistical algorithms to forecast future events. | Prescriptive analysis provides recommendations for action, based on predictive analysis, to achieve desired results. |
| Focus | It focuses on summarizing and interpreting historical data to provide insights on past performance and trends. | It focuses on predicting what is likely going to happen, or what could happen, based on past trends and patterns. | It focuses on providing optimal actions to achieve the best possible results. |
| Utilisation | It is used when the user wants to summarize all or a part of a business. | It is used when the user wants to make an educated guess about likely outcomes. | It is used when users have to make complex or time-sensitive decisions. |
| Data Used | It uses past data that is already collected, often aggregated to provide insights and summaries. | It uses both historical data and predictive modelling techniques to forecast potential future scenarios. | It uses historical and predictive data to evaluate actions and their possible outcomes for decision-making. |
| Techniques Used | Data aggregation, data mining, data visualisation, and statistical analysis. | Machine learning, regression analysis, and statistical models to predict future events and trends. | Heuristics and optimization. |
| Examples | Sales reports, customer demographics, and website traffic analysis showcasing historical performance. | Sales forecasting, risk assessment, and customer behaviour prediction identifying future patterns and trends. | Inventory management, marketing strategies, and resource allocation recommendations based on data-driven insights. |

5. What is Statistical Inference?


Statistical Inference
Statistical inference is the process of using data from a sample to make
generalizations or predictions about a larger population. It involves drawing
conclusions based on the analysis of data, allowing researchers to make
informed decisions and estimates without examining every member of the
population.

Key Concepts

1. **Population vs. Sample**:


- **Population**: The entire group of individuals or items to be studied.
- **Sample**: A subset of the population selected for analysis.

2. **Estimation**:
- Inference often involves estimating population parameters (like means or
proportions) based on sample statistics.
- **Point Estimation**: Providing a single value estimate of a population
parameter.
- **Interval Estimation**: Providing a range (confidence interval) within
which the parameter is expected to lie.

3. **Hypothesis Testing**:
- A method to test assumptions (hypotheses) about population parameters.
- Involves formulating a null hypothesis (no effect or difference) and an
alternative hypothesis (some effect or difference).
- Statistical tests (like t-tests, chi-squared tests) determine if there is enough
evidence to reject the null hypothesis.

4. **Confidence Intervals**:
- A range of values derived from a sample that likely contains the true
population parameter.
- For example, a 95% confidence interval suggests that if the same sampling
method is repeated many times, approximately 95% of intervals will contain the
true parameter.

5. **Significance Level (α)**:


- The probability of rejecting the null hypothesis when it is true, commonly set
at 0.05.
- A smaller α indicates stricter criteria for rejecting the null hypothesis.
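To make the estimation, hypothesis-testing, and confidence-interval ideas
concrete, here is a minimal base R sketch on made-up sample values: a
one-sample t-test of the null hypothesis that the population mean is 50, at
the α = 0.05 level.

# Minimal sketch: one-sample t-test in base R on made-up data.
# H0: population mean = 50; alpha = 0.05.
sample_data <- c(48.2, 51.5, 49.8, 52.3, 50.1, 47.9, 53.0, 50.6)

result <- t.test(sample_data, mu = 50, conf.level = 0.95)

print(result$conf.int)  # 95% confidence interval for the true mean
print(result$p.value)   # reject H0 only if this is below 0.05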

Applications
- **Market Research**: Estimating customer preferences based on survey data.
- **Healthcare**: Determining the effectiveness of a new treatment from
clinical trial data.
- **Quality Control**: Inferring the quality of a batch of products based on a
sample.
Summary
Statistical inference allows us to make predictions and decisions based on
sample data, providing a framework for understanding uncertainty and
variability in data. It is a fundamental aspect of statistical analysis and is widely
used in various fields, including science, business, and social research.
6. Explain 5 Vs of Big Data?
The 5 Vs of Big Data are key characteristics that define and help manage large
and complex datasets.
The 5 Vs of Big Data are:
1. Volume: This refers to the sheer amount of data generated every second.
For example, social media platforms like Facebook and Twitter generate
vast amounts of data from user interactions, messages, and posts.
2. Velocity: This is the speed at which data is generated, processed, and
analysed. For instance, financial markets generate data at high velocity,
requiring real-time processing to make timely decisions.
3. Variety: This encompasses the different types of data, including
structured data, semi structured data, and unstructured data. Examples
include text, images, videos, and sensor data.
4. Veracity: This refers to the trustworthiness and quality of the data. High
veracity means that the data is accurate and reliable, which is crucial for
making informed decisions.
5. Value: This is about turning data into valuable insight. Data itself is not
useful until it can be analysed to extract meaningful information that can
drive business decisions.
Understanding these characteristics helps in effectively managing and
utilizing big data for various applications, from business analytics to
scientific research.

Advantages of Data Science


1. Improved Decision Making:
o Data science enables organizations to make data-driven decisions,
enhancing accuracy and efficiency.
2. Personalization:
o It helps in creating personalized experiences for customers by
analyzing their behavior and preferences.
3. Automation of Tasks:
o Data science can automate repetitive tasks, saving time and
reducing human error.
4. Predictive Analytics:
o It allows businesses to predict future trends and behaviors, helping
in strategic planning.
5. Innovation:
o Data science drives innovation by uncovering new insights and
opportunities from data.

Disadvantages of Data Science


1. Data Privacy Issues:
o Handling large volumes of data raises significant privacy and
security concerns.
2. High Costs:
o Implementing data science solutions can be expensive due to the
need for advanced tools and skilled professionals.
3. Complexity:
o Managing and analyzing big data can be complex, requiring
specialized knowledge and expertise.
4. Data Quality:
o Ensuring the accuracy and reliability of data is crucial, as poor data
quality can lead to incorrect insights.
5. Over-reliance on Data:
o There is a risk of over-relying on data without considering other
factors, which can lead to biased or incomplete conclusions.
7. Explain types of Digital Data?
Digital data is information that is stored and processed in a digital format,
which means it is represented using binary code (0s and 1s). This type of data
can be easily read, processed, and transmitted by computers and other digital
devices.
Types of Digital Data
1. Structured Data:
o Definition: Data that is organized in a fixed format, such as tables
or databases.
o Examples: SQL databases, spreadsheets, and CSV files.
o Characteristics: Easily searchable and analysable due to its
organized nature.
2. Unstructured Data:
o Definition: Data that does not have a predefined format or
structure.
o Examples: Text documents, emails, social media posts, images,
videos.
o Characteristics: More challenging to process and analyse due to
its varied nature.
3. Semi-structured Data:
o Definition: Data that does not conform to a rigid structure but
contains tags or markers to separate elements.
o Examples: XML files, JSON files, HTML documents.
o Characteristics: Easier to manage than unstructured data but not
as straightforward as structured data.
4. Metadata:
o Definition: Data that provides information about other data.
o Examples: File properties (e.g., size, creation date), database
schema, tags.
o Characteristics: Helps in organizing, finding, and understanding
data.
5. Big Data:
o Definition: Extremely large datasets that require advanced
methods and technologies to store, process, and analyse.
o Examples: Data from social media platforms, sensor data from IoT
devices, transaction records.
o Characteristics: High volume, velocity, variety, veracity, and
value (the 5 Vs of Big Data).
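As a hedged sketch of the structured vs. semi-structured distinction, the R
example below writes and reads a structured CSV table, then parses a
semi-structured JSON string; the JSON part assumes the jsonlite package is
installed, and all file names and values are made up.

# Structured data: fixed rows and columns (a data frame / CSV file).
df <- data.frame(id = 1:3, name = c("A", "B", "C"))
write.csv(df, "people.csv", row.names = FALSE)
structured <- read.csv("people.csv")

# Semi-structured data: tagged elements without a rigid table layout.
library(jsonlite)
json_text <- '{"id": 1, "name": "A", "tags": ["x", "y"]}'
semi_structured <- fromJSON(json_text)
str(semi_structured)  # a nested list, not a flat table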
8. Explain types of regression models?
Regression models are statistical techniques used to analyze and predict
relationships between variables. Here are some common types of regression
models:
1. Linear Regression
 Description: Models the relationship between a dependent variable
and one or more independent variables using a straight line.
 Equation: Y = a + bX + ε
 Use Case: Predicting sales based on advertising spend.
 Assumption: Assumes a linear relationship between variables.
2. Multiple Linear Regression
 Description: Extends linear regression by using multiple
independent variables to predict a dependent variable.
 Equation: Y = a + b_1X_1 + b_2X_2 + ... + b_nX_n + ε
 Use Case: Predicting house prices based on size, location, and
number of bedrooms.
 Assumption: Assumes linear relationships among all independent
variables.
3. Polynomial Regression
 Description: Models the relationship between variables as an nth
degree polynomial, allowing for curvature in the relationship.
 Equation: Y = a + b_1X + b_2X^2 + ... + b_nX^n + ε
 Use Case: Modeling the growth rate of plants over time.
 Assumption: Assumes a non-linear relationship that can be
captured with polynomial terms.
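As a brief hedged illustration of the linear models above, the base R sketch
below fits a simple linear regression on made-up advertising-spend and sales
figures.

# Minimal sketch: simple linear regression in base R on made-up data.
ads   <- c(10, 20, 30, 40, 50)    # advertising spend
sales <- c(25, 40, 58, 70, 88)    # observed sales

model <- lm(sales ~ ads)   # fits Y = a + bX + ε by least squares
summary(model)             # estimated intercept a and slope b

# Predict sales for a new advertising spend of 60.
predict(model, newdata = data.frame(ads = 60))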
4. Logistic Regression
 Description: Used for binary classification problems, predicting the
probability that an event occurs (1) or does not occur (0).
 Equation: P(Y=1) = 1 / (1 + e^-(a + bX))
 Use Case: Predicting whether a customer will buy a product or not.
 Assumption: Assumes a logistic function to model the probability.
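A matching hedged sketch for logistic regression, using base R's glm() with
made-up customer data (the spend values and 0/1 purchase outcomes below are
invented for illustration):

# Minimal sketch: logistic regression in base R on made-up data.
spend  <- c(1, 2, 3, 4, 5, 6, 7, 8)    # some customer feature
bought <- c(0, 0, 0, 1, 0, 1, 1, 1)    # 1 = bought, 0 = did not buy

fit <- glm(bought ~ spend, family = binomial)  # logit link by default

# Estimated probability P(bought = 1) for a customer with spend = 5.
predict(fit, newdata = data.frame(spend = 5), type = "response")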
9. Explain Naïve Bayes with the help of an example
Naïve Bayes is a probabilistic classifier based on Bayes' theorem,
which assumes that the features are independent given the class.
Despite this naïve assumption, it is effective in practical applications,
especially in text classification and spam filtering.
Bayes' theorem is expressed as:
P(A|B) = P(B|A) × P(A) / P(B)

Where:
 P(A|B) is the probability of event A occurring given that B is
true.
 P(B|A) is the probability of event B occurring given that A is
true.
 P(A) and P(B) are the probabilities of observing A and B
independently of each other.
Example:
P(Spam|Email) = P(Email|Spam) × P(Spam) / P(Email)
1. P(Spam|Email): Posterior Probability (or Conditional Probability)
- The probability that an email is Spam, given its features.
- This is what we want to calculate.
2. P(Email|Spam): Likelihood
- The probability of observing the email's features, given that it is
Spam.
- This represents how well the email's features match the typical
characteristics of Spam emails.
3. P(Spam): Prior Probability
- The probability that an email is Spam, regardless of its features.
- This represents our initial belief about the likelihood of Spam
emails.
4. P(Email): Evidence (or Normalizing Constant)
- The probability of observing the email's features, regardless of
whether it's Spam or Not Spam.
- This ensures the probabilities add up to 1.
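To show the arithmetic end to end, here is a minimal base R sketch of Bayes'
theorem with made-up probabilities (all numbers below are invented for
illustration, not taken from real email data):

# Minimal worked example of Bayes' theorem in R with made-up numbers.
p_spam       <- 0.4   # prior: P(Spam)
p_email_spam <- 0.7   # likelihood: P(Email features | Spam)
p_email_ham  <- 0.1   # assumed P(Email features | Not Spam)

# Evidence: total probability of observing these email features.
p_email <- p_email_spam * p_spam + p_email_ham * (1 - p_spam)

# Posterior: P(Spam | Email) by Bayes' theorem.
p_spam_email <- p_email_spam * p_spam / p_email
print(p_spam_email)  # about 0.82 with these made-up numbers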
