Assignment Unit I and II

The document outlines key concepts in data science, including its definition, workflow, and applications across various industries. It discusses the traits of big data, web scraping techniques, and the differences between data analysis and reporting. Additionally, it covers tools and libraries such as Matplotlib, NumPy, Scikit-learn, and NLTK, along with methods for data cleaning, manipulation, and dimensionality reduction.

Uploaded by utkarsh dave

Unit I

Concept of Data Science

1. What is data science? Explain its key components and how it differs from traditional
data analysis.
2. Describe the data science workflow. What are the major steps involved in solving a
data science problem?
3. How is data science applied in various industries? Provide examples of its
applications in fields like healthcare, finance, and marketing.
4. Differentiate between data science, machine learning, and artificial intelligence. How
do they interrelate in practice?

Traits of Big Data

1. What are the key traits of big data? Explain the "5 Vs" (Volume, Velocity, Variety,
Veracity, and Value) of big data.
2. How do the characteristics of big data impact the methods used for storing,
processing, and analyzing data? Provide examples.
3. Discuss the challenges associated with big data. How do these challenges influence
the choice of tools and technologies for big data analysis?
4. Explain how scalability and distributed computing are important when dealing with
big data. What are some common tools used to handle big data?

Web Scraping

1. What is web scraping? Describe its importance in data science and list some common
tools used for web scraping in Python.
2. Explain the ethical considerations and legal implications of web scraping. What are
some guidelines to follow when scraping data from websites?
3. Describe the process of web scraping using BeautifulSoup and requests in Python.
Provide an example of scraping data from a website.
4. What are some common challenges in web scraping, and how can they be mitigated?
Discuss issues such as CAPTCHA, rate limiting, and dynamic content.

Analysis vs Reporting

1. Differentiate between data analysis and data reporting. How does each contribute to
the decision-making process?
2. Explain the key differences between exploratory data analysis (EDA) and generating
business reports. When should you use each approach?
3. How does the focus of data analysis differ from data reporting in terms of the
audience and the purpose? Provide examples of each.
4. Discuss the tools and techniques commonly used for data analysis versus those used
for data reporting. How do their outputs differ?

Unit II

Matplotlib and Data Visualization

1. Explain how to create a bar chart using Matplotlib in Python. What are some common
use cases for bar charts in data science?
2. How can you customize line charts in Matplotlib to show multiple data series on the
same plot? Give an example.
3. Describe the process of creating a scatterplot in Matplotlib. How can you modify the
size and color of points based on additional data?
4. What are the different types of visualizations available in Matplotlib for comparing
categorical and numerical data? Provide examples.

NumPy

1. What is NumPy and how is it used in data science? Explain the concept of arrays and
how NumPy arrays differ from Python lists.
2. Demonstrate how to perform basic mathematical operations (addition, subtraction,
multiplication, etc.) on NumPy arrays.
3. Explain the concept of broadcasting in NumPy. Provide an example where
broadcasting is used for efficient computation.
4. How can you use NumPy to generate random numbers and create datasets for data
analysis? Provide examples.

Scikit-learn

1. Explain the purpose of Scikit-learn in Python. How is it used for machine learning?
2. What is the difference between supervised and unsupervised learning in Scikit-learn?
Provide examples of algorithms for each type.
3. Describe the process of training a linear regression model in Scikit-learn. What
functions are used to evaluate the model’s performance?
4. How does Scikit-learn handle feature scaling? Explain the importance of scaling in
machine learning models.

NLTK (Natural Language Toolkit)

1. What is the purpose of the NLTK library in Python? How is it used for text
processing?
2. Explain how tokenization is performed using NLTK. Why is it an important step in
natural language processing (NLP)?
3. Describe the process of sentiment analysis using NLTK. How can this be applied in
analyzing social media data?
4. How can NLTK be used for named entity recognition (NER)? Provide an example of
extracting entities from text.

Working with Data: Reading Files, Scraping, APIs

1. Describe the process of reading and writing CSV files in Python using Pandas.
Provide an example.
2. Explain how web scraping works in Python using BeautifulSoup. What precautions
should be taken when scraping websites?
3. How can the Twitter API be used to collect data for sentiment analysis? Provide an
example of connecting to the API and retrieving tweets.
4. What are some common methods for handling missing data in Python? Explain the
pros and cons of different approaches.

Data Cleaning and Manipulation

1. What is data munging, and why is it important in the data analysis process?
2. How can Pandas be used to clean and manipulate data? Provide an example of
filtering and modifying data in a DataFrame.
3. Explain how to handle outliers in a dataset. What impact can outliers have on the
results of a data analysis?
4. Describe the process of rescaling data using MinMaxScaler and StandardScaler in
Scikit-learn. When should you use each?

Dimensionality Reduction

1. What is dimensionality reduction, and why is it important in data science?
2. Explain how Principal Component Analysis (PCA) works for dimensionality
reduction. Provide an example of its application.
3. How does Scikit-learn’s TruncatedSVD differ from PCA? When would you use each
method?
4. Describe the role of feature selection in dimensionality reduction. How does it help in
improving model performance?
