Data Science Report


October University for Modern Sciences and Arts (MSA)

Faculty of Engineering
Computer Systems Engineering Department

Database Theory [CSE553].

Data Science Report

Prepared By
Name: Moustafa Atef Farouk ID: 195339
Supervised By
Dr. Manal Moustafa
2022-2023
1.Introduction
Data science is an exciting and rapidly growing field that is transforming
the way businesses and organizations make decisions. It combines several
disciplines and is used across a variety of industries to uncover hidden
relationships between variables, to identify trends, patterns, and correlations
in large datasets, and to develop predictive models that support decisions and
optimize processes.

2.What is data science?


Data science is about the extraction, preparation, analysis, visualization, and
maintenance of information. It is a cross-disciplinary field that uses scientific
methods and processes to draw insights from data. It draws on computer science,
statistics, machine learning, visualization, and human-computer interaction to
collect, clean, integrate, analyze, visualize, and interact with data in order to
create data products.

3.The V’s of data science


There are several "V"s of big data; three of these are volume, velocity, and
variety. Big data exceeds the storage capacity of conventional databases. This is
its volume aspect. The scale of data generation is mind-boggling. Google’s Eric
Schmidt pointed out that until 2003, all of humankind had generated just 5
exabytes of data (an exabyte is 1000^6 bytes, or a billion billion bytes). Today
we generate 5 exabytes of data every two days. The main reason for this is the
explosion of “interaction” data, a new phenomenon in contrast to mere
“transaction” data. Interaction data comes from recording activities in our
day-to-day, ever more digital lives, such as browser activity, geo-location data,
RFID data, sensors, personal digital recorders such as the Fitbit and phones,
satellites, etc. We now live in the “internet of things” (or IoT), and it is
producing a vast quantity of data, all of which we seem to have an endless need
to analyze. In some quarters it is better to speak of 4 Vs of big data, with
veracity usually named as the fourth, as shown in Figure 1.

Figure 1: The Four Vs of Big Data.
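The volume figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes the decimal (SI) definition of an exabyte and simply reuses the numbers attributed to Eric Schmidt; the variable names are illustrative.

```python
# Decimal (SI) definition: 1 exabyte = 1000^6 bytes = 10^18 bytes.
EXABYTE = 1000 ** 6

# Figures quoted in the text: 5 exabytes generated every two days.
exabytes_per_period = 5
period_days = 2

# Average amount of data generated per day, in exabytes and in bytes.
daily_rate_eb = exabytes_per_period / period_days
daily_rate_bytes = daily_rate_eb * EXABYTE

print(EXABYTE == 10 ** 18)  # True: a billion billion bytes
print(daily_rate_eb)        # 2.5
print(daily_rate_bytes)
```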

4.Datafication
Data science is more than the mere analysis of large data sets. It is also about the
creation of data. The field of “text-mining” expands available data enormously,
since there is so much more text being generated than numbers. The creation of
data from varied sources, and its quantification into information is known as
“datafication.” Datafication can take place at different levels, from the
personal level to the business level.

5.Why use data science?


Data science is seen as a novel trend in business reviews, in technology
blogs, and at academic conferences. There are many reasons why the world has
moved towards data science:
➢ The high computational power available nowadays.
➢ The massive amount of data available.
➢ The analytical gap between large companies (Google, Yahoo, IBM, or SAS)
and the rest of the world (companies and people) is shrinking.
➢ Access to cloud computing allows any individual to analyze huge amounts
of data in short periods of time.
➢ Analytical knowledge is free and most of the crucial algorithms that are
needed to create a solution can be found, because open-source development
is the norm in this field.

6.How Does Data Science Work?


Using data science technologies, the data scientist works through each
phase of the data science life cycle in turn.

Figure 2: Data science life cycle.

6.1. Business Understanding


The data scientists in the room are the people who keep asking the why’s.
They’re the people who want to ensure that every decision made in the company
is supported by concrete data, and that it is guaranteed (with a high probability)
to achieve results. Before you can even start on a data science project, it is critical
that you understand the problem you are trying to solve.

6.2. Data Mining


Now that you’ve defined the objectives of your project, it’s time to start
gathering the data. Data mining is the process of gathering your data from
different sources. Some people tend to group data retrieval and cleaning together,
but each of these processes is such a substantial step that they are treated
separately here. At this stage, some of the questions worth considering are: What
data do I need for my project? Where does it live? How can I obtain it? What is
the most efficient way to store and access all of it?

If all the data necessary for the project is packaged and handed to you,
you’ve won the lottery. More often than not, finding the right data takes both time
and effort. If the data lives in databases, your job is relatively simple - you can
query the relevant data using SQL queries, or manipulate it using a dataframe tool
like Pandas. However, if your data doesn’t actually exist in a dataset, you’ll need
to scrape it. Beautiful Soup is a popular library used to scrape web pages for data.
If you’re working with a mobile app and want to track user engagement and
interactions, there are countless tools that can be integrated within the app so that
you can start getting valuable data from customers. Google Analytics, for
example, allows you to define custom events within the app which can help you
understand how your users behave and collect the corresponding data.
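As a rough illustration of the "data lives in a database" case, the sketch below runs an SQL query from Python using the standard-library sqlite3 module. The in-memory database, the students table, and its rows are all invented for the example and are not part of the report.

```python
import sqlite3

# Build a throwaway in-memory database; the table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, sleep_hours REAL, score REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Amira", 8.0, 91.0), ("Karim", 5.5, 67.0), ("Laila", 7.0, 84.0)],
)

# Retrieve only the rows and columns the project needs.
rows = conn.execute(
    "SELECT name, score FROM students "
    "WHERE sleep_hours >= ? ORDER BY score DESC",
    (7.0,),
).fetchall()
conn.close()

print(rows)  # [('Amira', 91.0), ('Laila', 84.0)]
```

The same query result could then be loaded into a Pandas DataFrame for further manipulation, which is the workflow the paragraph above describes.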

6.3. Data Cleaning


Now that you’ve got all of your data, it is time to move on to the most
time-consuming step of all - cleaning and preparing the data. This is especially true in
big data projects, which often involve terabytes of data to work with. According
to interviews with data scientists, this process (also referred to as ‘data janitor
work’) can often take 50 to 80 percent of their time. So what exactly does it entail,
and why does it take so long?

The reason why this is such a time-consuming process is simply that
there are so many possible scenarios that could necessitate cleaning. For instance,
the data could have inconsistencies within the same column, meaning that
some rows could be labelled 0 or 1, and others could be labelled no or yes. The
data types could also be inconsistent - some of the 0s might be integers, whereas
others could be strings. If we’re dealing with a categorical data type with
multiple categories, some of the categories could be misspelled or have different
cases, such as having categories for both male and Male. This is just a subset of
the inconsistencies you may encounter, and it’s important to catch and fix
them in this stage.
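The kinds of inconsistency described above can be normalized with a small amount of code. The following is a minimal sketch in plain Python; the sample values and the FLAG_MAP mapping are assumptions made up for illustration, and a real project would more likely apply the same logic to a Pandas column.

```python
# Hypothetical raw column values mixing 0/1, "0"/"1", and yes/no labels.
raw_flags = [0, "1", "yes", "No", 1, "0"]
raw_gender = ["male", "Male", " MALE", "female", "Female"]

# Assumed mapping from every expected spelling to one canonical integer.
FLAG_MAP = {"0": 0, "1": 1, "no": 0, "yes": 1}


def clean_flag(value):
    """Convert a mixed-type yes/no label to the integer 0 or 1."""
    key = str(value).strip().lower()
    return FLAG_MAP[key]  # raises KeyError on an unexpected label


def clean_category(value):
    """Trim whitespace and lowercase a categorical label."""
    return value.strip().lower()


flags = [clean_flag(v) for v in raw_flags]
genders = [clean_category(v) for v in raw_gender]

print(flags)                 # [0, 1, 1, 0, 1, 0]
print(sorted(set(genders)))  # ['female', 'male']
```

Raising an error on an unknown label, rather than silently guessing, is a deliberate choice here: it surfaces new inconsistencies instead of hiding them.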
One of the issues that is often overlooked in this stage, causing a lot of
problems later on, is missing data. Missing data can cause many errors during
model creation and training. One option is to ignore the
instances which have any missing values. Depending on your dataset, this could
be unrealistic if you have a lot of missing data. Another common approach is to
use something called average imputation, which replaces missing values with the
average of all the other instances. This is not always recommended because it can
reduce the variability of your data, but in some cases it makes sense.
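Both options for handling missing values can be compared on a toy column. The sketch below uses made-up scores, with None standing in for a missing value, and shows that average imputation leaves the mean unchanged while shrinking the variance.

```python
from statistics import mean, pvariance

# Toy column with two missing entries (None); the values are invented.
scores = [70.0, None, 90.0, 80.0, None]

# Option 1: ignore instances that have a missing value.
dropped = [s for s in scores if s is not None]

# Option 2: average imputation, filling gaps with the mean of observed values.
fill_value = mean(dropped)
imputed = [fill_value if s is None else s for s in scores]

print(imputed)  # [70.0, 80.0, 90.0, 80.0, 80.0]
# Population variance before and after imputation: the second is smaller.
print(pvariance(dropped), pvariance(imputed))
```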

6.4. Data Exploration


Now that you’ve got a sparkling clean set of data, you’re ready to finally
get started on your analysis. The data exploration stage is like the brainstorming
of data analysis. This is where you come to understand the patterns and biases in
your data. It could involve pulling up and analysing a random subset of the data
using Pandas, plotting a histogram or distribution curve to see the general trend,
or even creating an interactive visualization that lets you dive down into each
data point and explore the story behind the outliers.

Using all of this information, you start to form hypotheses about your data
and the problem you are tackling. If you were predicting student scores for
example, you could try visualizing the relationship between scores and sleep. If
you were predicting real estate prices, you could perhaps plot the prices as a heat
map on a spatial plot to see if you can catch any trends.
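Before plotting anything, a quick numeric check can hint at whether a hypothesis such as "more sleep goes with higher scores" is worth pursuing. The sketch below computes the Pearson correlation coefficient by hand on invented data; the sleep_hours and scores values are assumptions for illustration only.

```python
from math import sqrt

# Invented observations: hours of sleep and exam score for five students.
sleep_hours = [5.0, 6.0, 7.0, 8.0, 9.0]
scores = [60.0, 68.0, 75.0, 83.0, 88.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r = pearson(sleep_hours, scores)
print(round(r, 3))  # close to 1: a strong positive linear relationship
```

A value near +1 or -1 suggests a linear trend worth visualizing; a value near 0 does not rule out a non-linear relationship, which is one reason plots remain important at this stage.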

There is a great summary of tools and approaches on the Wikipedia page
for exploratory data analysis.

6.5. Feature Engineering


In machine learning, a feature is a measurable property or attribute of a
phenomenon being observed. If we were predicting the scores of a student, a
possible feature is the amount of sleep they get. In more complex prediction tasks
such as character recognition, features could be histograms counting the number
of black pixels.
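To make the character-recognition example concrete, the sketch below turns a tiny binary image into a feature vector by counting black pixels per row and per column. The 5x5 image is a made-up stand-in for a scanned character, with 1 marking a black pixel.

```python
# A tiny 5x5 binary image roughly shaped like the digit "1" (invented data).
image = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]

# Histogram features: number of black pixels in each row and each column.
row_hist = [sum(row) for row in image]
col_hist = [sum(col) for col in zip(*image)]

# Concatenate both histograms into one feature vector for a model.
features = row_hist + col_hist

print(row_hist)  # [1, 2, 1, 1, 3]
print(col_hist)  # [0, 2, 5, 1, 0]
```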
6.6. Predictive Modelling
Predictive modelling is where machine learning finally comes into your
data science project. A trained model uses the weights learned during training
to predict values for new inputs.
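As a minimal sketch of "weights learned in training, then used for prediction", the example below fits a straight line by ordinary least squares and predicts a score for an unseen input. Simple linear regression is chosen here only for illustration, and the sleep and score data are invented.

```python
# Invented training data: hours of sleep (feature) and exam score (target).
sleep_hours = [5.0, 6.0, 7.0, 8.0, 9.0]
scores = [60.0, 68.0, 75.0, 83.0, 88.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the learned weights to a new input."""
    return slope * x + intercept


# Training produces the weights; prediction reuses them on new data.
slope, intercept = fit_line(sleep_hours, scores)
print(slope, intercept)
print(predict(7.5, slope, intercept))
```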

6.7. Data Visualization


Data visualization is a tricky field, mostly because it seems simple but it
could possibly be one of the hardest things to do well. That’s because data viz
combines the fields of communication, psychology, statistics, and art, with an
ultimate goal of communicating the data in a simple yet effective and visually
pleasing way. Once you’ve derived the intended insights from your model, you
have to represent them in a way that the different key stakeholders in the project
can understand.

7. Business Intelligence vs. Data Science: What’s the Difference?


Generally speaking, business intelligence and data science both play a key
role in producing any organization’s actionable insights. So where exactly is the
line between the two? When does business intelligence end and data science
begin? BI and data science vary in a number of ways, from the type of data they’re
working with to project deliverables and approaches. See the figure below for a
visual distinction between the most common attributes of the two.


Figure 3: Business Intelligence vs. Data Science

8.Applications of data science

Figure 4: Applications of data science

Data science has found its applications in almost every industry.


1. Healthcare
Healthcare companies are using data science to build sophisticated medical
instruments to detect and cure diseases.
2. Gaming
Video and computer games are now being created with the help of data
science and that has taken the gaming experience to the next level.
3. Image Recognition
Identifying patterns in images and detecting objects in an image is one of
the most popular data science applications.
4. Recommendation Systems
Netflix and Amazon give movie and product recommendations based on
what you like to watch, purchase, or browse on their platforms.

5. Logistics
Data Science is used by logistics companies to optimize routes to ensure
faster delivery of products and increase operational efficiency.
6. Fraud Detection
Banking and financial institutions use data science and related algorithms
to detect fraudulent transactions.
7. Internet Search
When we think of search, we immediately think of Google. However, there
are other search engines, such as Yahoo, DuckDuckGo, Bing, AOL, Ask, and
others, that employ data science algorithms to offer the best results for our
searched query in a matter of seconds. Given that Google handles more than
20 petabytes of data per day, it would not be the 'Google' we know today if
data science did not exist.
8. Speech recognition
Speech recognition is dominated by data science techniques, and we see the
work of these algorithms in our daily lives. Have you ever needed the help of a
virtual speech assistant like Google Assistant, Alexa, or Siri? Its voice
recognition technology operates behind the scenes, attempting to interpret and
evaluate your words and deliver useful results. Image recognition can similarly
be seen on social media platforms such as Facebook, Instagram, and Twitter. When
you upload a picture of yourself with someone on your list, these applications
can recognise them and tag them.
9. Targeted Advertising
If you thought search was the most essential use of data science, consider
this: the whole digital marketing spectrum. From display banners on various
websites to digital billboards at airports, data science algorithms are used to
decide almost everything that is shown. This is why digital advertisements have
a far higher CTR (Click-Through Rate) than traditional marketing. They can be
customised based on a user's prior behaviour. That is why you may see adverts
for Data Science Training Programs while another person sees an advertisement
for clothes in the same region at the same time.
10. Airline Route Planning
As a result of data science, it is easier to predict flight delays for the airline
industry, which is helping it grow. It also helps airlines decide whether to fly
directly to the destination or to make a stop on the way, for example on a
flight from Delhi to the United States of America.
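The click-through rate mentioned under targeted advertising is a simple ratio of clicks to impressions. The sketch below shows the calculation; the function name and the campaign numbers are hypothetical.

```python
def click_through_rate(clicks, impressions):
    """Fraction of ad impressions that resulted in a click."""
    if impressions == 0:
        return 0.0  # avoid division by zero when an ad was never shown
    return clicks / impressions


# Hypothetical campaign: 30 clicks out of 1000 times the ad was displayed.
ctr = click_through_rate(clicks=30, impressions=1000)
print(f"{ctr:.1%}")  # 3.0%
```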

9. What is a Data Scientist?

Data scientists are a relatively new kind of analytical data professional,
with the technical ability to handle complicated problems as well as the desire
to investigate which questions need to be answered. They're a mix of
mathematicians, computer scientists, and trend forecasters. They're also in high
demand and well-paid because they work in both the business and IT sectors.

On a daily basis, a data scientist may do the following tasks:

1. Discover patterns and trends in datasets to get insights.

2. Create forecasting algorithms and data models.

3. Improve the quality of data or product offerings by utilising machine learning
techniques.

4. Distribute suggestions to other teams and top management.

5. In data analysis, use data tools such as R, SAS, Python, or SQL.

6. Stay on top of innovations in the field of data science.

10. Conclusion
Data will be the lifeblood of the business world for the foreseeable future.
Knowledge is power, and data is actionable knowledge that can mean the
difference between corporate success and failure. By incorporating data science
techniques into their business, companies can now forecast future growth, predict
potential problems, and devise informed strategies for success.
