Data Analyst work
Following are some data-collection considerations to keep in mind for your analysis:
Data sources
If you don’t collect the data using your own resources, you might get data from second-party
or third-party data providers. Second-party data is collected directly by another group
and then sold. Third-party data is sold by a provider that didn’t collect the data
themselves. Third-party data might come from a number of different sources.
Time frame
If you are collecting your own data, decide how long you will need to collect it, especially if
you are tracking trends over a long period of time. If you need an immediate answer, you
might not have time to collect new data. In this case, you would need to use historical data
that already exists.
If your data collection depends heavily on how much time you have, use the flowchart below to guide your decision:
Data formats in practice
When you think about the word "format," a lot of things might come to mind. Think of an
advertisement for your favorite store. You might find it in the form of a print ad, a billboard, or
even a commercial. The information is presented in the format that works best for you to take it
in. The format of a dataset is a lot like that, and choosing the right format will help you manage
and use your data in the best way possible.
The following table highlights several of these data types and presents examples of each.

Data type | Description | Examples
Secondary data | Gathered by other people or from other research | Data you bought from a local data analytics firm's customer profiles; demographic data collected by a university; census data gathered by the federal government
External data | Data that is stored outside of a company or organization | National average wages for the various positions throughout your organization; credit reports for customers of an auto dealership
Data type | Description | Examples
Discrete data | Data that is counted and has a limited number of values | Number of people who visit a hospital on a daily basis (10, 20, 200); maximum capacity allowed in a room; tickets sold in the current month
Unstructured data | Data that cannot be stored as columns and rows in a relational database | Social media posts; emails; videos
The effects of different structures
Data is everywhere and it can be stored in lots of ways. Two general categories of data are structured data and unstructured data.
For example, when you rate your favorite restaurant online, you're creating structured data.
But when you use Google Earth to check out a satellite image of a restaurant location, you're
using unstructured data.
Structured data:
- Defined data types
- Most often quantitative data
- Easy to organize
- Easy to search
- Easy to analyze
- Stored in relational databases
- Contained in rows and columns
- Examples: Excel, Google Sheets, SQL, customer data, phone records, transaction history

Unstructured data:
- Varied data types
- Most often qualitative data
- Difficult to search
- Provides more freedom for analysis
- Stored in data lakes and NoSQL databases
- Can't be put in rows and columns
- Examples: Text messages, social media comments, phone call transcriptions, various log files, images, audio, video
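The contrast above can be sketched in a few lines of code. This is a minimal illustration with made-up data, not a prescribed workflow: structured records share fixed fields, while unstructured text must be searched or parsed before it can be analyzed.

```python
# Structured data: defined fields, fits into rows and columns,
# like a relational database table or a spreadsheet.
customers = [
    {"id": 1, "name": "Ana", "purchases": 12},
    {"id": 2, "name": "Luis", "purchases": 7},
]

# Easy to analyze because every record shares the same fields.
total_purchases = sum(row["purchases"] for row in customers)
print(total_purchases)  # 19

# Unstructured data: free-form content with no fixed fields,
# such as a social media comment.
comment = "Loved the new store layout, would definitely visit again!"
print("store" in comment)  # True, but only found via a text search
```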
Structured data
As we described earlier, structured data is organized in a certain format. This makes it
easier to store and query for business needs. If the data is exported, the structure goes along
with the data.
Unstructured data
Unstructured data can’t be organized in any easily identifiable manner. And there is
much more unstructured than structured data in the world. Video and audio files, text files,
social media content, satellite imagery, presentations, PDF files, open-ended survey
responses, and websites all qualify as types of unstructured data.
Important note: As a junior data analyst, you won't be asked to design a data model. But
you might come across existing data models your organization already has in place.
1. Conceptual data modeling gives a high-level view of the data structure, such
as how data interacts across an organization. For example, a conceptual data model
may be used to define the business requirements for a new database. A conceptual
data model doesn't contain technical details.
2. Logical data modeling focuses on the technical details of a database such as
relationships, attributes, and entities. For example, a logical data model defines how
individual records are uniquely identified in a database. But it doesn't spell out actual
names of database tables. That's the job of a physical data model.
3. Physical data modeling depicts how a database operates. A physical data
model defines all entities and attributes used; for example, it includes table names,
column names, and data types for the database.
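As an illustration of the physical level, the following sketch creates a hypothetical table in SQLite. The table and column names are invented for the example, but they show the kind of detail a physical data model pins down: actual table names, column names, and data types.

```python
import sqlite3

# Create an in-memory database and define one concrete table,
# the way a physical data model would specify it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- uniquely identifies each record
        full_name   TEXT NOT NULL,
        email       TEXT,
        created_at  TEXT
    )
""")

# Ask the database for the physical schema it now stores.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
print(columns)  # ['customer_id', 'full_name', 'email', 'created_at']
```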
Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but two common
methods are the Entity Relationship Diagram (ERD) and the Unified
Modeling Language (UML) diagram. ERDs are a visual way to understand the
relationship between entities in the data model. UML diagrams are very detailed diagrams
that describe the structure of a system by showing the system's entities, attributes,
operations, and their relationships. As a junior data analyst, you will need to understand that
there are different data modeling techniques, but in practice, you will probably be using your
organization’s existing technique.
You can read more about ERD, UML, and data dictionaries in this data modeling techniques
article.
Okay, let's break down data types in a way that's easy to understand!
Imagine you have a big toy box full of different kinds of toys. You've got stuffed animals,
building blocks, and board games all mixed together. To keep things organized, you decide to
sort them into separate containers.
Data types are like those containers for your data! They help you organize information in a
way that makes sense. Just like you wouldn't put a teddy bear in the same box as a puzzle
piece, you wouldn't want to mix up different types of data.
Just like organizing your toys makes it easier to find what you're looking for, using the right
data types helps your computer understand and work with information correctly.
Think of data types in a spreadsheet like ingredients in a recipe. Just like you need different
ingredients to make a cake (flour, sugar, eggs), you need different data types to store and
work with different kinds of information in your spreadsheet.
- Number: This is for any numerical data. Think of it as the numbers you'd use for calculations, like the price of an item or the quantity in stock.
- Text or String: This type handles text-based information. Imagine this as the words and letters you'd use for labels, descriptions, or names.
- Boolean: This one is like a light switch; it can only be one of two values: TRUE or FALSE. You might use this to mark whether a task is completed or not.
Using the correct data type helps your spreadsheet understand how to handle the
information. For example, you can't perform calculations on text, and you can't format a
number as a date. All three data types (Number, Text/String, and Boolean) are
straightforward once you start working with them.
However, learners sometimes find Boolean to be the least intuitive at first, simply because
it's different from how we usually think about information. We're used to numbers and words,
but the idea of something being only TRUE or FALSE takes a bit of getting used to.
Don't worry though, the more you work with Booleans, the easier they become to
understand!
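These spreadsheet types map closely to types in programming languages. Here is a small Python sketch with hypothetical values showing why the type matters:

```python
# Hypothetical values for the three spreadsheet data types.
price = 20          # Number: usable in calculations
quantity = 3        # Number
label = "T-shirt"   # Text/String: names, labels, descriptions
completed = False   # Boolean: only True or False, like a light switch

# Numbers support arithmetic...
print(price * quantity)  # 60

# ...but the same operation on two pieces of text fails,
# which is why storing data with the right type matters.
try:
    label * label
except TypeError:
    print("Cannot multiply text by text")

# Booleans flip cleanly between their two values.
print(not completed)  # True
```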
Imagine you are shopping for shoes and set one of three conditions:

- AND: You will buy the shoes only if they are any combination of pink and grey.
- OR: You will buy the shoes if they are entirely pink, entirely grey, or if they are pink and grey.
- NOT: You will buy the shoes if they are grey, but not if they have any pink.
These Venn diagrams illustrate your shoe preferences. AND is the center of the Venn diagram,
where two conditions overlap. OR includes either condition. NOT includes only the part of the
Venn diagram that doesn't contain the exception.
The Venn diagram that represents AND includes a circle labeled grey shoes overlapping with a circle labeled pink shoes. The intersection of these circles is highlighted to indicate the AND condition requires shoes to be both grey and pink. The Venn diagram that represents OR includes a circle labeled grey
shoes overlapping with a circle labeled pink shoes. The entirety of both circles is highlighted
to indicate the OR condition means any shoe with grey, pink, or some combination satisfies
the requirement. The Venn diagram that represents NOT includes a circle labeled grey shoes
overlapping with a circle labeled pink shoes. The portion of the grey shoes circle that does
not intersect with the pink shoes circle is highlighted to indicate the NOT condition requires
shoes to not include pink.
The AND operator
Your condition is “If the color of the shoe has any combination of grey and pink, you will buy them.” The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say IF (Color="Grey") AND (Color="Pink") then buy them.
Below is a simple truth table that outlines the Boolean logic at work in this statement. In the
Color is Grey column, there are two pairs of shoes that meet the color condition. And in
the Color is Pink column, there are two pairs that meet that condition. But in the If
Grey AND Pink column, only one pair of shoes meets both conditions. So, according to
the Boolean logic of the statement, there is only one pair marked true. In other words, there is
one pair of shoes that you would buy.
Color is Grey | Color is Pink | If Grey AND Pink, then Buy | Boolean Logic
Grey/True | Pink/True | True/Buy | True AND True = True
Grey/True | Black/False | False/Don't buy | True AND False = False
Red/False | Pink/True | False/Don't buy | False AND True = False
Red/False | Green/False | False/Don't buy | False AND False = False
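The AND logic in the truth table can be reproduced in a few lines of Python. The shoe data here is hypothetical, mirroring the four rows of the table:

```python
# Each entry is the set of colors a pair of shoes contains.
shoes = [
    {"grey", "pink"},    # True  AND True  -> buy
    {"grey", "black"},   # True  AND False -> don't buy
    {"red", "pink"},     # False AND True  -> don't buy
    {"red", "green"},    # False AND False -> don't buy
]

# Keep only shoes that satisfy BOTH conditions.
to_buy = [s for s in shoes if "grey" in s and "pink" in s]
print(len(to_buy))  # 1: only one pair meets both conditions
```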
The OR operator
The OR operator lets you move forward if either one of your two conditions is met. Your
condition is “If the shoes are grey or pink, you will buy them.” The Boolean statement would
be IF (Color="Grey") OR (Color="Pink") then buy them.
Notice that any shoe that meets either the Color is Grey or the Color is Pink
condition is marked as true by the Boolean logic. According to the truth table below, there are
three pairs of shoes that you can buy.
Color is Grey | Color is Pink | If Grey OR Pink, then Buy | Boolean Logic
Red/False | Black/False | False/Don't buy | False OR False = False
Black/False | Pink/True | True/Buy | False OR True = True
Grey/True | Green/False | True/Buy | True OR False = True
Grey/True | Pink/True | True/Buy | True OR True = True
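The OR truth table can also be reproduced with a short Python sketch; again, the shoes are hypothetical and mirror the four rows above:

```python
# Each entry is the set of colors a pair of shoes contains.
shoes = [
    {"red", "black"},   # False OR False -> don't buy
    {"black", "pink"},  # False OR True  -> buy
    {"grey", "green"},  # True  OR False -> buy
    {"grey", "pink"},   # True  OR True  -> buy
]

# Keep shoes that satisfy EITHER condition.
to_buy = [s for s in shoes if "grey" in s or "pink" in s]
print(len(to_buy))  # 3 pairs satisfy at least one condition
```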
The NOT operator
Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is "You will buy any grey shoe except for those with any traces of pink in them." Your Boolean statement would be IF (Color="Grey") AND (Color=NOT "Pink") then buy them.
Now, all of the grey shoes that aren't pink are marked true by the Boolean logic for the NOT
Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink
condition. Only one pair of shoes is excluded in the truth table below.
Color is Grey | Color is Pink | Boolean Logic for NOT Pink | If Grey AND (NOT Pink), then Buy | Boolean Logic
Grey/True | Red/False | Not False = True | True/Buy | True AND True = True
Grey/True | Black/False | Not False = True | True/Buy | True AND True = True
Grey/True | Green/False | Not False = True | True/Buy | True AND True = True
Grey/True | Pink/True | Not True = False | False/Don't buy | True AND False = False
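Here is a Python sketch of the NOT condition, using hypothetical shoe data that mirrors the truth table:

```python
# Hypothetical grey shoes; the NOT condition excludes any with pink.
shoes = [
    {"grey", "red"},
    {"grey", "black"},
    {"grey", "green"},
    {"grey", "pink"},   # True AND (NOT True) = False -> excluded
]

# Keep grey shoes, subtracting those that contain pink.
to_buy = [s for s in shoes if "grey" in s and not ("pink" in s)]
print(len(to_buy))  # 3: one pair is excluded by NOT Pink
```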
Notice that you can use parentheses to group your conditions together.
Key takeaways
Operators are symbols that name the operation or calculation to be performed. The
operators AND, OR, and NOT can be used to write Boolean statements in programming
languages. Whether you are doing a search for new shoes or applying this logic to queries,
Boolean logic lets you create multiple conditions to filter your results. Now that you know a
little more about Boolean logic, you can start using it!
Think of a data table like a spreadsheet. It's organized into rows and columns.
So, if our data table was about a music playlist, a row would be one song, and the columns
would be things like "Song Title," "Artist," "Genre," and so on.
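That playlist table can be sketched as a small Python structure. The songs are made up, but they show the row and column idea: each dict is one row (a song), and the shared keys are the columns.

```python
# A made-up playlist as a data table.
playlist = [
    {"Song Title": "Song A", "Artist": "Artist One", "Genre": "Pop"},
    {"Song Title": "Song B", "Artist": "Artist Two", "Genre": "Rock"},
    {"Song Title": "Song C", "Artist": "Artist One", "Genre": "Jazz"},
]

print(list(playlist[0].keys()))  # the column names
print(len(playlist))             # the number of rows (songs): 3
```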
Keep this guide open as you watch the video. It can serve as a helpful reference if you need
additional context or clarification while following the video steps. This is not a graded
activity, but you can complete these steps to practice the skills demonstrated in the video.
If you would like to access the spreadsheets the instructor uses in this video, select the link to
a dataset to create a copy. If you don’t have a Google account, download the data directly
from the attachments below.
OR
Download data:
Population, Latin, and Caribbean Countries, 2010–2019, wide format
XLSX File
1. Open the Population, Latin, and Caribbean Countries, 2010–2019, wide format
spreadsheet.
2. Each row contains all population data for one country.
3. The population data for each year is contained in a column.
4. Find the annual population of Argentina in row 3.
5. In this wide format, you can quickly compare the annual population of Argentina to
the annual populations of Antigua and Barbuda, Aruba, the Bahamas, or any other
country.
1. Select column H.
2. Right-click column header H and choose Sort A to Z.
3. Notice that the British Virgin Islands are now at the top because they had the lowest
population of all countries in 2013.
1. Open the Population, Latin, and Caribbean Countries, 2010–2019, long format
spreadsheet.
2. Notice the data is no longer organized into columns by year. All of the years are now in
one column.
3. Find Argentina’s population data in rows 12-21. Each row contains one year of
Argentina’s population data.
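The wide and long layouts from the exercise can be sketched in plain Python. The population figures below are illustrative placeholders, not values from the real dataset:

```python
# Wide format: one row per country, one column per year.
wide = [
    {"Country": "Argentina", "2013": 42_000_000, "2014": 42_400_000},
    {"Country": "Aruba",     "2013":    103_000, "2014":    104_000},
]

# Long format: one row per country-year pair; all years share one column.
long_rows = [
    {"Country": row["Country"], "Year": year, "Population": row[year]}
    for row in wide
    for year in ("2013", "2014")
]
print(len(long_rows))  # 4 rows: 2 countries x 2 years

# Sorting the wide rows by the 2013 column mirrors the spreadsheet's
# Sort A to Z on that year's column: smallest population first.
by_2013 = sorted(wide, key=lambda row: row["2013"])
print(by_2013[0]["Country"])  # Aruba
```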
Transforming data
In this reading, you will explore how data is transformed and the differences between wide
and long data. Data transformation is the process of changing the data’s format, structure,
or values. As a data analyst, there is a good chance you will need to transform data at some
point to make it easier for you to analyze it.
Mario is a plumber who owns a plumbing company. After years in business, he buys another
plumbing company. Mario wants to merge the customer information from his newly acquired
company with his own, but the other company uses a different database. So, Mario needs to
make the data compatible. To do this, he must transform the format of the acquired
company’s data. Then, he must remove duplicate rows for customers they had in common.
When the data is compatible and together, Mario’s plumbing company will have a complete
and merged customer database.
To make it easier to create charts, you may also need to transform long data to wide data.
Consider the following example of transforming stock prices (collected as long data) to wide
data.
Long data is data where each row contains a single data point for a particular item. In the
long data example below, individual stock prices (data points) have been collected for Apple
(AAPL), Amazon (AMZN), and Google (GOOGL) (items) on the given dates.
With data transformed into wide data, you can create a chart comparing how each company's
stock changed over the same period.
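A minimal sketch of that long-to-wide transformation, using hypothetical prices and dates rather than the actual stock data:

```python
# Long format: one row per (date, symbol) data point.
long_data = [
    ("2021-01-04", "AAPL", 130.0),
    ("2021-01-04", "AMZN", 3200.0),
    ("2021-01-05", "AAPL", 131.0),
    ("2021-01-05", "AMZN", 3210.0),
]

# Pivot to wide format: one row per date, one column per symbol.
wide = {}
for date, symbol, price in long_data:
    wide.setdefault(date, {})[symbol] = price

print(wide["2021-01-04"])  # {'AAPL': 130.0, 'AMZN': 3200.0}
```

With the data in this shape, each date's row holds every company's price, which is what a comparison chart needs.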
You might notice that all the data included in the long format is also in the wide format. But wide data is easier to read and understand, which is why data analysts typically transform long data to wide data more often than the reverse. The following table summarizes when each format is preferred:
Wide data is preferred when | Long data is preferred when
Creating tables and charts with a few variables about each subject | Storing a lot of variables about each subject. For example, 60 years' worth of interest rates for each bank
Comparing straightforward line graphs | Performing advanced statistical analysis or graphing
Data anonymization
You have been learning about the importance of privacy in data analytics. Now, it is time to
talk about data anonymization and what types of data should be anonymized. Personally
identifiable information, or PII, is information that can be used by itself or with other data to
track down a person's identity.
Organizations have a responsibility to protect their data and the personal information that
data might contain. As a data analyst, you might be expected to understand what data needs
to be anonymized, but you generally wouldn't be responsible for the data anonymization
itself. A rare exception might be if you work with a copy of the data for testing or development
purposes. In this case, you could be required to anonymize the data before you work with it.
Healthcare and financial data are two of the most sensitive types of data. These industries
rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why
data in these two industries usually goes through de-identification, which is a process used
to wipe data clean of all personally identifying information.
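One common building block of de-identification is replacing a direct identifier with an irreversible token. This sketch hashes a hypothetical email address; real de-identification pipelines involve far more than this single step, and unsalted hashing alone is pseudonymization rather than full anonymization.

```python
import hashlib

def tokenize(value: str) -> str:
    # Replace an identifier with a short, stable, irreversible token.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

# A hypothetical record containing PII.
record = {"email": "ann.smith@example.com", "diagnosis": "flu"}
record["email"] = tokenize(record["email"])

print(record["email"])  # a stable 12-character token, not the address
```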
Data anonymization is used in just about every industry. That is why it is so important for data
analysts to understand the basics. Here is a list of data that is often anonymized:
Telephone numbers
Names
License plates and license numbers
Social security numbers
IP addresses
Medical records
Email addresses
Photographs
Account numbers
For some people, it just makes sense that this type of data should be anonymized. For others,
we have to be very specific about what needs to be anonymized. Imagine a world where we
all had access to each other’s addresses, account numbers, and other identifiable
information. That would invade a lot of people’s privacy and make the world less safe. Data
anonymization is one of the ways we can keep data private and secure!
In data analytics, open data is part of data ethics, which has to do with using data ethically.
Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:

- Be available and accessible to the public as a complete dataset
- Be provided under terms that allow it to be reused and redistributed
- Allow universal participation so that anyone can use, reuse, and redistribute the data

Data can only be considered open when it meets all three of these standards.
One of the biggest benefits of open data is that credible databases can be used more widely.
Basically, this means that all of that good data can be leveraged, shared, and combined with
other data. This could have a huge impact on scientific collaboration, research advances,
analytical capacity, and decision-making. But it is important to think about the individuals
being represented by the public, open data, too.
Third-party data is collected by an entity that doesn’t have a direct relationship with the
data. You might remember learning about this type of data earlier. For example, third parties
might collect information about visitors to a certain website. Doing this lets these third
parties create audience profiles, which helps them better understand user behavior and
target them with more effective advertising.
Personally identifiable information (PII) is data that is reasonably likely to identify a person
and make information known about them. It is important to keep this data safe. PII can
include a person’s address, credit card information, social security number, medical records,
and more.
Everyone wants to keep personal information about themselves private. Because third-party
data is readily available, it is important to balance the openness of data with the privacy of
individuals.
Luckily for data analysts, there are lots of trustworthy sites and resources available for open
data. It is important to remember that even reputable data needs to be constantly evaluated,
but these websites are a useful starting point:
1. U.S. government data site: Data.gov is one of the most comprehensive data sources
in the US. This resource gives users the data and tools that they need to do research,
and even helps them develop web and mobile applications and design data
visualizations.
2. U.S. Census Bureau: This open data source offers demographic information from
federal, state, and local governments, and commercial entities in the U.S. too.
3. Open Data Network: This data source has a really powerful search engine and
advanced filters. Here, you can find data on topics like finance, public safety,
infrastructure, and housing and development.
4. Google Cloud Public Datasets: A selection of public datasets is available through the Google Cloud Public Dataset Program, already loaded into BigQuery.
5. Dataset Search: A search engine designed specifically for datasets; you can use it to find specific datasets.