jenisha INTERNSHIP REPORT-2.docx (1)


INTRODUCTION

Python is a versatile programming language that is widely used for data science,
machine learning, software development, web development, and task automation or scripting.
For data analysis and visualization it offers libraries such as pandas, NumPy, matplotlib,
and seaborn.

Python is favoured in data science for its readability, simplicity, and versatility. Its
extensive libraries and frameworks streamline complex tasks, allowing data scientists to
focus on problem-solving rather than coding intricacies.

Its mathematical libraries and functions also make it easier to perform data analysis
and solve mathematical problems.

Data science combines statistical analysis, programming skills, and domain expertise
to extract insights and knowledge from data. It has become essential to various industries,
from healthcare to finance, enabling organizations to make data-driven decisions. This report
provides a comprehensive introduction to data science with Python, covering key
concepts and practical examples.

CHAPTER 1

WORKING WITH MICROSOFT EXCEL

Microsoft Excel is a spreadsheet program developed by Microsoft. Excel organizes data in
rows and columns and lets you perform mathematical calculations on it. It runs on Windows,
macOS, Android and iOS. The first version was released in 1985. It is a popular tool for
data analysis.

1.1 Key Features

❖ Spreadsheet Layout
❖ Formulas and Functions
❖ Data Visualisation
❖ Pivot Tables and Pivot Charts
❖ Data Filtering and Sorting
❖ Conditional Formatting
❖ Collaboration

1.2 Quick Analysis

Quick Analysis takes a range of data and helps you pick the perfect chart with just a
few commands.

1.3 Conditional Formatting

This feature helps highlight important data points, trends, or outliers, making it easier
to identify key insights in your data.

1.4 Data validation

Data validation in Excel refers to setting specific criteria for accepting data in a cell or
range of cells.

1.5 Types of Operators in Excel

1. Arithmetic Operators: Used for basic mathematical calculations (e.g., +, -, *, /).
2. Comparison Operators: Used to compare values (e.g., =, <, >, <=, >=, <>).
3. Logical Operators: Used in logical tests (e.g., AND, OR, NOT). In Excel these are
written as functions, such as =AND(A1>0,B1>0), rather than as symbols.
4. Reference Operators: Used to define ranges (e.g., a colon for a range, a comma for a
union, a space for an intersection).

1.6 Formulas and Functions

One of Excel's standout features is its ability to perform calculations and operations
on data using formulas and functions. Users can create complex calculations by combining
mathematical operators, cell references, and built-in functions.

⮚ IF function: A versatile tool that allows users to build logic and decisions into their
spreadsheets.
Formula: =IF(LOGICAL_TEST,VALUE_IF_TRUE,[VALUE_IF_FALSE])
⮚ Sum of data: =SUMIF adds the values in a range that meet a given criterion. It is an
important formula that can come up in an entry-level data analysis interview.
Formula: =SUMIF(RANGE,CRITERIA,[SUM_RANGE])
⮚ Average of data: =AVERAGEIF is similar to =SUMIF, and the two are often used together.
It returns the average of the cells in a range that meet a given criterion.
Formula: =AVERAGEIF(RANGE,CRITERIA,[AVERAGE_RANGE])
⮚ Connecting data sets: The =VLOOKUP function lets users combine data from two
different sources within the spreadsheet. It looks up a value in the first column of a
table and returns the value from a specified column in the same row.
Formula: =VLOOKUP(LOOKUP_VALUE,TABLE_ARRAY,COL_INDEX_NUM,[RANGE_LOOKUP])
⮚ DAYS function: Returns the number of days between two dates.
⮚ NETWORKDAYS function: Returns the number of whole workdays between two dates.
⮚ TODAY function: Returns the serial number of today's date.
⮚ NOW function: Returns the serial number of the current date and time.
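
Since the rest of this report uses Python, the behaviour of SUMIF, AVERAGEIF and VLOOKUP can be illustrated with a minimal Python sketch. This is not how Excel implements them. The helper names (sumif, averageif, vlookup) and the sample data are hypothetical, only equality criteria are handled, and only exact-match lookup is shown.

```python
# Hypothetical sales records standing in for two spreadsheet columns.
regions = ["North", "South", "North", "East"]
sales = [120, 80, 200, 50]


def sumif(criteria_range, criterion, sum_range):
    """Sum the values whose matching criteria cell equals the criterion."""
    return sum(v for c, v in zip(criteria_range, sum_range) if c == criterion)


def averageif(criteria_range, criterion, average_range):
    """Average the values whose matching criteria cell equals the criterion."""
    matches = [v for c, v in zip(criteria_range, average_range) if c == criterion]
    return sum(matches) / len(matches)  # fails if nothing matches, like #DIV/0!


def vlookup(lookup_value, table, col_index):
    """Exact-match lookup: search the first column, return column col_index (1-based)."""
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None  # Excel would show #N/A here


# Hypothetical lookup table: first column is the key, second is the price.
price_table = [("apple", 0.5), ("banana", 0.25)]

north_total = sumif(regions, "North", sales)        # 120 + 200 = 320
north_average = averageif(regions, "North", sales)  # 320 / 2 = 160.0
banana_price = vlookup("banana", price_table, 2)    # 0.25
print(north_total, north_average, banana_price)
```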

1.7 Data cleaning

Clean text data by removing duplicates, trimming spaces, and standardizing formats.

1.8 Pivot Tables and Pivot Charts

Pivot Tables are powerful tools for summarizing and analyzing large datasets. Pivot
Charts work hand-in-hand with Pivot Tables to provide visual representations of the
summarized data.

1.9 Dashboard

A dashboard in Excel is a visual representation of key metrics that helps you analyze
data and make quick decisions.

CHAPTER 2

PYTHON ESSENTIALS

2.1 Python Programming

Python is a popular programming language. Developers use Python because it is
efficient, easy to learn, and runs on many different platforms. It is used for:

● Web development (server-side)
● Software development
● Mathematics
● System scripting

Variable: A variable is a name that refers to a value stored in memory.

Keyword: Python keywords are reserved words that define the structure and syntax of the
Python language. They are case sensitive and cannot be used as variable names, function
names, or other identifiers. Examples: and, as, break, except, False, or, not, if, True, etc.

2.2 Python Data Types

2.3 Operators

Operators are special symbols in Python that carry out arithmetical or logical
computation. The value that the operator operates on is called the operand.

Example: z = x + y (x and y are the operands; + is the operator)
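
A few of the common operator families can be tried directly. The values of x and y below are arbitrary and chosen only for illustration.

```python
x, y = 7, 2

# Arithmetic operators
z = x + y          # 9
quotient = x / y   # 3.5 (true division always returns a float)
floor = x // y     # 3   (floor division)
remainder = x % y  # 1
power = x ** y     # 49

# Comparison and logical operators return booleans
is_greater = x > y                  # True
both_positive = x > 0 and y > 0     # True

print(z, quotient, floor, remainder, power, is_greater, both_positive)
```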

2.3.1 Types of Operators

2.4 Conditional Statements

❖ if statement: Executes a block of code if a specified condition is true.
❖ if-else statement: Executes one block of code if the condition is true, and another
block if the condition is false.
❖ elif (else if) statement: Checks multiple conditions in sequence and executes the
block of code for the first condition that evaluates to true.
❖ Nested if statements: You can nest if statements within another if or else block to
create complex decision trees.
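
The statements listed above can be combined in one small function. The grade function and its score thresholds are hypothetical, chosen only to show if, elif, else and a nested if together.

```python
def grade(score):
    """Map a numeric score to a letter grade (thresholds are illustrative)."""
    if score >= 90:        # if: runs when the condition is true
        return "A"
    elif score >= 75:      # elif: checked only if the previous test failed
        return "B"
    else:                  # else: runs when no condition above matched
        if score >= 50:    # nested if inside the else block
            return "C"
        return "F"


print(grade(95), grade(80), grade(60), grade(30))
```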

2.5 Functions

A function is a block of code that runs only when it is called. The types are:

o Built-in functions: These functions are pre-defined in Python and can be used
without further declaration. Examples: enumerate(), eval(), exec(), and filter().
o User-defined functions: These are created by the user to perform specific tasks.
o Lambda functions: These are small, unnamed functions defined using the
lambda keyword. They are typically used for short and simple operations.
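
The three kinds of function can be seen side by side in a brief sketch. The names area_of_rectangle and square are made up for this example.

```python
# User-defined function: created with def.
def area_of_rectangle(width, height):
    return width * height


# Lambda function: a small unnamed function, here bound to a name for reuse.
square = lambda n: n * n

# Built-in functions: enumerate() pairs each item with a counter,
# filter() keeps the items for which the given function returns True.
libraries = ["pandas", "numpy"]
numbered = list(enumerate(libraries, start=1))
evens = list(filter(lambda n: n % 2 == 0, range(6)))

print(area_of_rectangle(3, 4))  # 12
print(square(5))                # 25
print(numbered)                 # [(1, 'pandas'), (2, 'numpy')]
print(evens)                    # [0, 2, 4]
```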

2.6 Data Structure

Data Structures are a way of organizing data so that it can be accessed more
efficiently depending upon the situation.

1. String
2. List
3. Tuple
4. Set
5. Dictionary
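
Each of the five structures listed above has its own literal syntax. The sketch below uses arbitrary sample values to show one typical operation on each.

```python
text = "data"                # string: immutable sequence of characters
items = [3, 1, 2]            # list: ordered and mutable
point = (10, 20)             # tuple: ordered and immutable
unique = {1, 2, 2, 3}        # set: unordered, duplicates are dropped
ages = {"asha": 30}          # dictionary: maps keys to values

items.append(4)              # lists can grow in place
ages["ravi"] = 25            # add a new key-value pair

print(text.upper())          # DATA
print(items)                 # [3, 1, 2, 4]
print(point[0])              # 10
print(unique == {1, 2, 3})   # True
print(ages["ravi"])          # 25
```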

CHAPTER 3

WORKING WITH NUMPY AND PANDAS

3.1 NumPy

NumPy stands for Numerical Python. It is a Python library used for working with
arrays. NumPy was created in 2005 by Travis Oliphant. It is an open source project and you
can use it freely. It provides a high-performance multidimensional array object and tools for
working with these arrays.

3.1.1 Arrays in NumPy

NumPy's main object is the homogeneous multidimensional array. It is a table of
elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.

Some restrictions of NumPy arrays:

● All elements of the array must be of the same data type.
● Once created, the total size of the array can't change.
● The shape must be "rectangular", not "jagged"; e.g., each row of a
two-dimensional array must have the same number of columns.

3.1.2 Analysing the NumPy array (Attributes)

1. ndim: The number of dimensions of an array is contained in the ndim attribute.
2. size: The fixed, total number of elements in the array is contained in the size attribute.
3. shape: The shape of an array is a tuple of non-negative integers that specify the
number of elements along each dimension.
4. dtype: The data type of the elements is recorded in the dtype attribute.
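
The four attributes can be inspected on a small array. The sketch below assumes NumPy is installed; the exact integer dtype (for example int64) depends on the platform, so it is printed rather than stated.

```python
import numpy as np

# A 2 x 3 array: two rows, three columns, all integers.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)   # 2      -> number of dimensions
print(a.size)   # 6      -> total number of elements
print(a.shape)  # (2, 3) -> elements along each dimension
print(a.dtype)  # an integer type; the exact width is platform dependent
```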

3.1.3 Initial Placeholders in NumPy array

● np.zeros: The np.zeros function fills the whole array with zeros.
● np.ones: The np.ones function fills the whole array with ones.
● np.full: Unlike the functions above, np.full takes an argument called 'fill_value' in
addition to the shape and data type, and fills the whole array with that value.
● np.eye: The np.eye function returns a 2-D array with 1's on the main diagonal and 0's
everywhere else (an identity matrix when the array is square).
● np.random.random: This function returns random floating-point values drawn from a
continuous uniform distribution over the interval [0, 1).
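
The placeholder functions above can be compared in one short sketch, assuming NumPy is installed. The shapes and the fill value 7 are arbitrary.

```python
import numpy as np

zeros = np.zeros((2, 3))                 # 2 x 3 array of 0.0
ones = np.ones((2, 3))                   # 2 x 3 array of 1.0
sevens = np.full((2, 2), fill_value=7)   # 2 x 2 array filled with 7
identity = np.eye(3)                     # 3 x 3, ones on the main diagonal
samples = np.random.random((2, 2))       # random floats in [0, 1)

print(zeros)
print(ones)
print(sevens)
print(identity)
print(samples)   # values differ on every run
```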

3.2 Pandas

Pandas is a powerful, open-source Python library for working with data sets. It
provides data structures and functions for analyzing, cleaning, exploring, and manipulating
data efficiently.

3.2.1 List of things that we can do using Pandas

✔ Data set cleaning, merging, and joining.
✔ Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.
✔ Columns can be inserted into and deleted from a DataFrame and higher-dimensional objects.
✔ Data visualization.
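
Several items from this list can be tried on a toy data set. The sketch below assumes pandas and NumPy are installed; the orders and customers tables, their column names and their values are all hypothetical, and filling a missing amount with the column mean is just one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical orders: one duplicated row and one missing amount.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [250.0, 100.0, 100.0, np.nan],
})
# Hypothetical customer details to join onto the orders.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Chennai", "Madurai", "Salem"],
})

# Cleaning: drop the duplicate row, then fill the missing amount with the mean.
cleaned = orders.drop_duplicates()
cleaned = cleaned.assign(amount=cleaned["amount"].fillna(cleaned["amount"].mean()))

# Merging: attach each customer's city using the shared key column.
merged = cleaned.merge(customers, on="customer_id", how="left")

# Inserting and deleting a column.
merged["is_large"] = merged["amount"] > 150
without_flag = merged.drop(columns=["is_large"])

print(merged)
```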

3.2.2 Data Pre-processing

Data pre-processing is the first step in a machine learning workflow. In this step, raw
data obtained from various sources is transformed into a usable format so that accurate
machine learning models can be built on it.

3.2.3 Data Frame

A Pandas DataFrame is usually created by loading a dataset from existing storage (which
can be a SQL database, a CSV file, or an Excel file). It can also be created from lists,
dictionaries, a list of dictionaries, etc.

3.2.4 Data Structures in Pandas Library

Pandas provides two main data structures for manipulating data. They are:

1. Series: A Pandas Series is a one-dimensional labeled array capable of holding
data of any type (integer, string, float, Python objects, etc.). The axis labels are
collectively called the index.
2. DataFrame: A Pandas DataFrame is a two-dimensional data structure with labeled
axes (rows and columns).
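
Both structures can be built from plain Python objects, as this brief sketch shows. It assumes pandas is installed; the subject names, student names and scores are invented for illustration, and the CSV path in the comment is hypothetical.

```python
import pandas as pd

# Series: one-dimensional values with labels (the index).
marks = pd.Series([82, 91, 77], index=["maths", "physics", "chemistry"])
print(marks["physics"])      # 91, looked up by label
print(list(marks.index))     # ['maths', 'physics', 'chemistry']

# DataFrame from a dictionary of columns.
df = pd.DataFrame({"name": ["Asha", "Ravi"], "score": [88, 75]})
print(df.shape)              # (2, 2): two rows, two columns

# DataFrame from a list of dictionaries (one dictionary per row).
records = pd.DataFrame([
    {"name": "Asha", "score": 88},
    {"name": "Ravi", "score": 75},
])
print(list(records.columns))  # ['name', 'score']

# Loading from storage would look like this (hypothetical file name):
# df = pd.read_csv("data.csv")
```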

CHAPTER 4

DATA SCIENCE IN REAL TIME APPLICATIONS

Python is a popular programming language for data science because of its flexibility,
ease of use and extensive libraries. It is used in a variety of real-world applications, including:

4.1 Data Analysis

Python is used to clean, prepare, and analyze data, and to identify patterns,
relationships, and trends.

4.2 Machine Learning

Python can be used to build and fine-tune models, and make data-driven decisions.

4.3 Artificial Intelligence

Python is used to develop AI and ML applications, and to work with complex
mathematical functions and image processing.

4.4 Web development

Python is used for web development, including social media monitoring tools and
chatbots.

4.5 In Transport

Data science is also applied in the transport field, for example in driverless cars.
Driverless cars can help reduce the number of accidents.

4.6 Speech recognition

Speech recognition is one of the most commonly known applications of data science. It is
a technology that enables a computer to recognize spoken language and transcribe it into text.

4.7 In E-Commerce

E-commerce websites like Amazon, Flipkart, etc. use data science to provide a better
user experience with personalized recommendations.

4.8 In Finance

The financial industry uses data science and analytics tools for forecasting. These tools
allow companies to predict customer lifetime value and stock market movements.

4.9 Healthcare applications

Healthcare companies are using data science to build sophisticated medical
instruments to detect and treat diseases.

4.10 Gaming

Video and computer games are now being created with the help of data science, which
has taken the gaming experience to the next level.

4.11 Image Recognition

Image recognition, which identifies patterns such as objects or faces in images, is one
of the most commonly known applications of data science.

4.12 Recommendation Systems

Netflix and Amazon give movie and product recommendations based on what you like
to watch, purchase, or browse on their platforms.

4.13 Logistics

Data science is used by logistics companies to optimize routes to ensure faster
delivery of products and increase operational efficiency.

4.14 Fraud Detection

Banking and financial institutions use data science and related algorithms to detect
fraudulent transactions.

4.15 Airline Route Planning

Data science makes it easier for the airline industry to predict flight delays, which is
helping it grow. It also helps airlines decide whether to fly directly to the destination or
to make a stop in between, for example on a flight from Delhi to the United States of
America.

CHAPTER 5

DATA VISUALIZATION

Data visualization provides an organized pictorial representation of data, which
makes it easier to understand, observe, and analyze. Python offers several plotting libraries,
such as Matplotlib and Seaborn, along with many other data visualization packages with
different features for creating informative, customized, and appealing plots that present data
in a simple and effective way. Here is an overview of some key libraries and techniques for
data visualization in Python:

5.1 Key Libraries

5.1.1 Matplotlib

● A foundational library for creating static, animated, and interactive visualizations in
Python.
● Basic plots include line graphs, scatter plots, bar charts, histograms, and more.
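
As a minimal sketch of two of these basic plots, the example below draws a line graph and a bar chart side by side. It assumes Matplotlib is installed; the monthly sales figures are invented, and the Agg backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# Line graph: good for showing a trend.
ax_line.plot(months, sales, marker="o")
ax_line.set_title("Monthly sales (line)")
ax_line.set_xlabel("Month")
ax_line.set_ylabel("Sales")

# Bar chart: good for comparing categories.
ax_bar.bar(months, sales)
ax_bar.set_title("Monthly sales (bar)")
ax_bar.set_xlabel("Month")

fig.tight_layout()
# fig.savefig("sales_plots.png")  # uncomment to write the figure to a file
```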

5.1.2 Seaborn

▪ Built on top of Matplotlib, Seaborn provides a high-level interface for drawing
attractive statistical graphics.
▪ It is particularly good for heat maps, violin plots, and pair plots.

5.1.3 Pandas Visualization

Pandas offers built-in plotting capabilities that work seamlessly with DataFrames.

5.1.4 Plotly

A library for creating interactive plots that can be easily shared or embedded in web pages.

5.1.5 Bokeh

● Another library for creating interactive visualizations, particularly suited for web
applications.
● It allows for complex visualizations with interactivity.

5.2 Visualization Techniques

5.2.1 Exploratory Data Analysis (EDA)

Use scatter plots, box plots, and histograms to explore distributions and relationships in
the data.

5.2.2 Correlation Heat maps

Visualize correlation matrices to understand relationships between variables.
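
The matrix behind such a heat map can be computed before any plotting. The sketch below assumes NumPy is installed and uses three invented variables; the drawing step is left as a comment because it needs Seaborn.

```python
import numpy as np

# Hypothetical observations for three variables (five students).
hours_studied = np.array([1, 2, 3, 4, 5])
exam_score = np.array([52, 58, 65, 70, 78])
hours_of_tv = np.array([5, 4, 3, 2, 1])

# Each input row is one variable; the result is a 3 x 3 correlation matrix.
corr = np.corrcoef([hours_studied, exam_score, hours_of_tv])

print(corr.round(2))
# corr[0, 1] is close to +1: scores rise with study time.
# corr[0, 2] is -1: TV hours fall exactly as study hours rise.

# With Seaborn installed, the matrix could be drawn as a heat map:
# import seaborn as sns
# sns.heatmap(corr, annot=True)
```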

5.2.3 Time Series Analysis

▪ Line plots to visualize trends over time.

5.2.4 Geospatial Data Visualization

▪ Use libraries like Folium or GeoPandas to visualize data on maps.

5.3 Best Practices

● Choose the Right Chart Type: Use the most suitable chart type for your data and
audience.
● Keep it Simple: Avoid clutter and focus on the key insights you want to
communicate.
● Use Color Wisely: Be mindful of color choices for clarity and accessibility.
● Label Clearly: Ensure all axes and legends are clearly labeled for better
understanding.

Mastering data visualization in Python enhances your ability to analyze and present
data effectively. Experiment with different libraries and techniques to find the best fit for your
projects.

CHAPTER 6

DATA SCIENCE COMPONENTS

Data science is an interdisciplinary field that involves extracting insights and
knowledge from structured and unstructured data. Python, with its rich ecosystem of libraries
and frameworks, has become the go-to language for data scientists.

6.1 The key components of data science in Python include

1. Data Collection
Gathering data from various sources such as databases, APIs, and web scraping.
2. Data Cleaning
Preparing the data by handling missing values, duplicates, and inconsistencies to
ensure high-quality input for analysis.
3. Data Exploration
Analyzing data through descriptive statistics and visualizations to uncover patterns
and insights.
4. Feature Engineering
Creating new features or modifying existing ones to improve model performance.
5. Modeling
Applying statistical and machine learning techniques to build predictive models.
6. Evaluation
Assessing the performance of models using various metrics to ensure their
reliability.
7. Deployment
Implementing models into production environments for real-time usage.
8. Monitoring and Maintenance
Continuously tracking model performance and updating them as needed.
9. Communication
Effectively presenting findings through visualizations and reports to stakeholders.

10. Machine learning
A core component of data science, machine learning uses algorithms to help
machines learn patterns and trends from data to make predictions.
11. NumPy
A Python library that allows for scientific calculations and multi-dimensional array
objects.
12. Data analysis
An essential part of data science, data analysis helps provide insights about data.
Python libraries like Pandas, Matplotlib, and Seaborn can be used for data analysis.
13. Deep learning
A branch of machine learning that uses multi-layered neural networks to learn
patterns from data.
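
The cleaning, exploration, modeling and evaluation components listed above can be sketched end to end on a toy data set in plain Python. The data (years of experience against salary) is invented, the model is a simple least-squares line, and the error is measured on the training data only for brevity; a real project would evaluate on held-out data.

```python
# Data collection: hypothetical (years_of_experience, salary_in_thousands) rows.
# None marks a missing value.
raw = [(1, 30), (2, 35), (2, 35), (3, None), (4, 45), (5, 50)]

# Data cleaning: drop rows with missing values, then remove duplicates
# while keeping the original order.
cleaned = list(dict.fromkeys(row for row in raw if None not in row))

# Data exploration: simple descriptive statistics.
xs = [x for x, _ in cleaned]
ys = [y for _, y in cleaned]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Modeling: fit y = slope * x + intercept by ordinary least squares.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in cleaned)
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def predict(x):
    """Predict salary (in thousands) for a given number of years."""
    return slope * x + intercept


# Evaluation: mean squared error on the training rows.
mse = sum((predict(x) - y) ** 2 for x, y in cleaned) / len(cleaned)

print(cleaned)
print(slope, intercept, predict(3), mse)
```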

This structured approach allows data scientists to systematically tackle problems
and derive actionable insights, making Python an essential tool in the data science toolkit.

ASSIGNMENTS

Excel Assignment 1

[Output screenshots for the assignments are not included in this text.]

CONCLUSION

Python has established itself as a leading language in the field of data science, owing to its
versatility, ease of use, and extensive libraries. The various components of data science,
ranging from data collection and cleaning to modeling and communication, can be efficiently
implemented using Python's rich ecosystem, including libraries like Pandas, NumPy, and
Matplotlib.

By leveraging these tools, data scientists can transform raw data into meaningful insights,
enabling organizations to make informed decisions. The collaborative nature of the Python
community further enhances its capabilities, fostering innovation and continuous
improvement.

As the demand for data-driven decision-making continues to grow, Python will
undoubtedly remain at the forefront of data science, empowering professionals to tackle
increasingly complex challenges and drive impactful solutions across diverse industries.
