Big Data SYBBA(CA)
4 marks questions
                      Age
                   /       \
               <25           >=25
              /    \        /     \
          <40k   >=40k   <60k    >=60k
            |      |       |       |
     Do Not Buy   Buy  Do Not Buy  Buy
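The decision tree above can be sketched as a simple rule-based function. This is a minimal illustration of the tree's logic; the function name and the assumption that the 40k/60k thresholds refer to income are mine, based on the diagram:

```python
def predict_purchase(age, income):
    """Classify a customer as 'Buy' or 'Do Not Buy' using the tree's rules."""
    if age < 25:
        # Younger customers: split on a 40k income threshold
        return "Buy" if income >= 40_000 else "Do Not Buy"
    else:
        # Older customers: split on a 60k income threshold
        return "Buy" if income >= 60_000 else "Do Not Buy"

print(predict_purchase(22, 45_000))  # Buy
print(predict_purchase(30, 50_000))  # Do Not Buy
```

Each internal node of the tree becomes one `if` test, and each leaf becomes a returned class label.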
2. Define Correlation and explain its types with a diagram?
Correlation
Correlation is a statistical method of evaluation used to measure the
strength and direction of the linear relationship between two
numerically measured continuous variables. It quantifies how changes
in one variable are associated with changes in the other. The
correlation coefficient, typically denoted by r, ranges from -1 to +1.
Types of Correlation:
1. Positive Correlation
- **Definition**: In a positive correlation, as one variable increases, the
other variable also tends to increase.
- **Example**: Height and weight are often positively correlated;
generally, taller individuals weigh more.
2. Negative Correlation
- **Definition**: In a negative correlation, as one variable increases, the
other variable tends to decrease.
- **Example**: The amount of time spent studying and the number of
errors on a test usually show a negative correlation; as study time
increases, the number of errors decreases.
3. No Correlation
- **Definition**: No correlation means that there is no predictable
relationship between the two variables; changes in one variable do not
affect the other.
- **Example**: The colour of a car and its fuel efficiency likely have no
correlation; the car's colour does not influence how efficiently it uses fuel.
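The coefficient r can be computed directly from its definition. Below is a small sketch using only the standard library; the height/weight numbers are hypothetical example data, not from the text:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # covariance term
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))           # spread of x
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))           # spread of y
    return cov / (sx * sy)

heights = [150, 160, 170, 180, 190]  # hypothetical data
weights = [50, 58, 66, 74, 82]       # rises with height, so r is near +1
print(round(pearson_r(heights, weights), 2))  # 1.0
```

A value near +1 indicates positive correlation, near -1 negative correlation, and near 0 no linear correlation.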
Key Concepts
2. **Estimation**:
- Inference often involves estimating population parameters (like means or
proportions) based on sample statistics.
- **Point Estimation**: Providing a single value estimate of a population
parameter.
- **Interval Estimation**: Providing a range (confidence interval) within
which the parameter is expected to lie.
3. **Hypothesis Testing**:
- A method to test assumptions (hypotheses) about population parameters.
- Involves formulating a null hypothesis (no effect or difference) and an
alternative hypothesis (some effect or difference).
- Statistical tests (like t-tests, chi-squared tests) determine if there is enough
evidence to reject the null hypothesis.
4. **Confidence Intervals**:
- A range of values derived from a sample that likely contains the true
population parameter.
- For example, a 95% confidence interval suggests that if the same sampling
method is repeated many times, approximately 95% of intervals will contain the
true parameter.
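The estimation and confidence-interval ideas above can be illustrated with a short sketch. This assumes a large-sample normal approximation (z = 1.96 for 95%); the sample values are hypothetical:

```python
import math

def confidence_interval_95(sample):
    """Approximate 95% CI for the population mean (normal approximation)."""
    n = len(sample)
    mean = sum(sample) / n                                        # point estimate
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample std dev
    margin = 1.96 * sd / math.sqrt(n)                             # z * standard error
    return mean - margin, mean + margin

scores = [72, 75, 78, 71, 74, 77, 73, 76]  # hypothetical sample data
low, high = confidence_interval_95(scores)
print(round(low, 2), round(high, 2))
```

The point estimate is the sample mean; the interval (low, high) is the interval estimate within which the population mean is expected to lie. For small samples, a t-distribution critical value would be more appropriate than 1.96.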
#### Applications
- **Market Research**: Estimating customer preferences based on survey data.
- **Healthcare**: Determining the effectiveness of a new treatment from
clinical trial data.
- **Quality Control**: Inferring the quality of a batch of products based on a
sample.
### Summary
Statistical inference allows us to make predictions and decisions based on
sample data, providing a framework for understanding uncertainty and
variability in data. It is a fundamental aspect of statistical analysis and is widely
used in various fields, including science, business, and social research.
6. Explain 5 Vs of Big Data?
The 5 Vs of Big Data are key characteristics that define, and help manage,
large and complex datasets.
The 5 Vs of Big data are:
1. Volume: This refers to the sheer amount of data generated every second.
For example, social media platforms like Facebook and Twitter generate
vast amounts of data from user interactions, messages, and posts.
2. Velocity: This is the speed at which data is generated, processed, and
analysed. For instance, financial markets generate data at high velocity,
requiring real-time processing to make timely decisions.
3. Variety: This encompasses the different types of data, including
structured, semi-structured, and unstructured data. Examples
include text, images, videos, and sensor data.
4. Veracity: This refers to the trustworthiness and quality of the data. High
veracity means that the data is accurate and reliable, which is crucial for
making informed decisions.
5. Value: This is about turning data into valuable insight. Data itself is not
useful until it can be analysed to extract meaningful information that can
drive business decisions.
Understanding these characteristics helps in effectively managing and
utilizing big data for various applications, from business analytics to
scientific research.
Bayes' Theorem:
P(A|B) = P(B|A) × P(A) / P(B)
Where:
P(A|B) is the probability of event A occurring given that B is
true.
P(B|A) is the probability of event B occurring given that A is
true.
P(A) and P(B) are the probabilities of observing A and B
independently of each other.
Example:
P(Spam|Email) = P(Email|Spam) × P(Spam) / P(Email)
1. P(Spam|Email): Posterior Probability (or Conditional Probability)
- The probability that an email is Spam, given its features.
- This is what we want to calculate.
2. P(Email|Spam): Likelihood
- The probability of observing the email's features, given that it is
Spam.
- This represents how well the email's features match the typical
characteristics of Spam emails.
3. P(Spam): Prior Probability
- The probability that an email is Spam, regardless of its features.
- This represents our initial belief about the likelihood of Spam
emails.
4. P(Email): Evidence (or Normalizing Constant)
- The probability of observing the email's features, regardless of
whether it's Spam or Not Spam.
- This ensures the probabilities add up to 1.
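The spam calculation above can be carried out numerically. This is a toy sketch; all the probability values below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical probabilities for a single email feature (e.g. a keyword)
p_spam = 0.4             # P(Spam): prior probability
p_feat_given_spam = 0.7  # P(Email|Spam): likelihood
p_feat_given_ham = 0.1   # P(Email|Not Spam)

# Evidence P(Email) via the law of total probability,
# ensuring the posterior probabilities add up to 1
p_feat = p_feat_given_spam * p_spam + p_feat_given_ham * (1 - p_spam)

# Posterior: P(Spam|Email) = P(Email|Spam) * P(Spam) / P(Email)
p_spam_given_feat = p_feat_given_spam * p_spam / p_feat
print(round(p_spam_given_feat, 3))  # 0.824
```

Even with a prior of only 0.4, observing a feature that is seven times more likely in spam pushes the posterior above 0.8. This updating of a prior belief by evidence is the core of Bayesian classification (e.g. Naive Bayes spam filters).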