0% found this document useful (0 votes)
10 views22 pages

Data Analyst work

The document outlines key considerations for data collection, including the distinction between first-party, second-party, and third-party data, as well as the importance of selecting appropriate data types and formats for analysis. It explains various classifications of data, such as primary vs. secondary, internal vs. external, and structured vs. unstructured data, along with their examples. Additionally, it introduces data modeling concepts and techniques, emphasizing the role of Boolean logic in data analysis.

Uploaded by

talktojabi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views22 pages

Data Analyst work

The document outlines key considerations for data collection, including the distinction between first-party, second-party, and third-party data, as well as the importance of selecting appropriate data types and formats for analysis. It explains various classifications of data, such as primary vs. secondary, internal vs. external, and structured vs. unstructured data, along with their examples. Additionally, it introduces data modeling concepts and techniques, emphasizing the role of Boolean logic in data analysis.

Uploaded by

talktojabi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 22

Select the right data

Following are some data-collection considerations to keep in mind for your analysis:

How the data will be collected


Decide if you will collect the data using your own resources or receive (and possibly purchase
it) from another party. Data that you collect yourself is called first-party data.

Data sources
If you don’t collect the data using your own resources, you might get data from second-party
or third-party data providers. Second-party data is collected directly by another group
and then sold. Third-party data is sold by a provider that didn’t collect the data
themselves. Third-party data might come from a number of different sources.

Solving your business problem


Datasets can show a lot of interesting information. But be sure to choose data that can
actually help solve your problem question. For example, if you are analyzing trends over time,
make sure you use time series data — in other words, data that includes dates.

How much data to collect


If you are collecting your own data, make reasonable decisions about sample size. A random
sample from existing data might be fine for some projects. Other projects might need more
strategic data collection to focus on certain criteria. Each project has its own needs.

Time frame
If you are collecting your own data, decide how long you will need to collect it, especially if
you are tracking trends over a long period of time. If you need an immediate answer, you
might not have time to collect new data. In this case, you would need to use historical data
that already exists.

Use the flowchart below if data collection relies heavily on how much time you have:
Data formats in practice
When you think about the word "format," a lot of things might come to mind. Think of an
advertisement for your favorite store. You might find it in the form of a print ad, a billboard, or
even a commercial. The information is presented in the format that works best for you to take it
in. The format of a dataset is a lot like that, and choosing the right format will help you manage
and use your data in the best way possible.

Data format examples


As with most things, it is easier for definitions to click when you can pair them with examples
you might encounter on a daily basis. Review each data format’s definition first and then use the
examples to lock in your understanding.

Primary versus secondary data


The following table highlights the differences between primary and secondary data and presents
examples of each.

Data format Definition Examples


classification
Primary data Collected by a researcher  Data from an interview you conducted -
from first-hand sources Data from a survey returned from 20
participants
 Data from questionnaires you got back
from a group of workers

Secondary data Gathered by other people  Data you bought from a local data
or from other research analytics firm’s customer profiles
 Demographic data collected by a
university
 Census data gathered by the federal
government

Internal versus external data


Data format Definition Examples
classification
Internal data Data that is stored inside a  Wages of employees across different business
company’s own systems units tracked by HR
 Sales data by store location
 Product inventory levels across distribution
centers

External data Data that is stored outside of a  National average wages for the various positions
company or organization throughout your organization
 Credit reports for customers of an auto dealership
The following table highlights the differences between internal and external data and presents
examples of each.

Continuous versus discrete data


The following table highlights the differences between continuous and discrete data and presents
examples of each.

Data format Definition Examples


classification
Continuous data Data that is measured and can have  Height of kids in third grade classes
almost any numeric value (52.5 inches, 65.7 inches)
 Runtime markers in a video
 Temperature

Discrete data Data that is counted and has a  Number of people who visit a hospital
limited number of values on a daily basis (10, 20, 200)
 Maximum capacity allowed in a room
 Tickets sold in the current month

Qualitative versus quantitative data


The following table highlights the differences between qualitative and quantitative data and
presents examples of each.

Data format Definition Examples


classification
Qualitative A subjective and explanatory measure of a  Favorite exercise activity
quality or characteristic  Brand with best customer service
 Fashion preferences of young
adults

Quantitative A specific and objective measure, such as a  Percentage of board certified


number, quantity, or range doctors who are women
 Population size of elephants in
Africa
 Distance from Earth to Mars at a
particular time

Nominal versus ordinal data


The following table highlights the differences between nominal and ordinal data and presents
examples of each.

Data format Definition Examples


classification
Nominal A type of qualitative data that is  First time customer, returning
categorized without a set order customer, regular customer
 New job applicant, existing
applicant, internal applicant
 New listing, reduced price listing,
foreclosure

Ordinal A type of qualitative data with a  Movie ratings (number of stars: 1


set order or scale star, 2 stars, 3 stars)
 Ranked-choice voting selections
(1st, 2nd, 3rd)
 Satisfaction level measured in a
survey (satisfied, neutral,
dissatisfied)

Structured versus unstructured data


The following table highlights the differences between structured and unstructured data and
presents examples of each.
Data format Definition Examples
classification
Structured data Data organized in a certain format, like  Expense
rows and columns reports
 Tax returns
 Store
inventory

Unstructured data Data that cannot be stored as columns and  Social media
rows in a relational database. posts
 Emails
 Videos
The effects of different structures
Data is everywhere and it can be stored in lots of ways. Two general categories of data are:

 Structured data: Organized in a certain format, such as rows and columns.


 Unstructured data: Not organized in any easy-to-identify way.

For example, when you rate your favorite restaurant online, you're creating structured data.
But when you use Google Earth to check out a satellite image of a restaurant location, you're
using unstructured data.

Here's a refresher on the characteristics of structured and unstructured data:

Structured data: - Defined data types - Most often quantitative data - Easy to organize - Easy
to search - Easy to analyze - Stored in relational databases - Contained in rows and columns -
Examples: Excel, Google Sheets, SQL, customer data, phone records, transaction history
Unstructured data: - Varied data types - Most often qualitative data - Difficult to search -
Provides more freedom for analysis - Stored in data lakes and NoSQL databases - Can't be put
in rows and columns - Examples: Text messages, social media comments, phone call
transcriptions, various log files, images, audio, video

Structured data
As we described earlier, structured data is organized in a certain format. This makes it
easier to store and query for business needs. If the data is exported, the structure goes along
with the data.

Unstructured data
Unstructured data can’t be organized in any easily identifiable manner. And there is
much more unstructured than structured data in the world. Video and audio files, text files,
social media content, satellite imagery, presentations, PDF files, open-ended survey
responses, and websites all qualify as types of unstructured data.

The fairness issue


The lack of structure makes unstructured data difficult to search, manage, and analyze. But
recent advancements in artificial intelligence and machine learning algorithms are beginning
to change that. Now, the new challenge facing data scientists is making sure these tools are
inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily
weighted and/or represented than others. And as you're learning, an unfair dataset does not
accurately represent the population, causing skewed outcomes, low accuracy levels, and
unreliable analysis.

Data modeling levels and techniques


This reading introduces you to data modeling and different types of data models. Data
models help keep data consistent and enable people to map out how data is organized. A
basic understanding makes it easier for analysts and other stakeholders to make sense of
their data and use it in the right ways.

Important note: As a junior data analyst, you won't be asked to design a data model. But
you might come across existing data models your organization already has in place.

What is data modeling?


Data modeling is the process of creating diagrams that visually represent how data is
organized and structured. These visual representations are called data models. You can
think of data modeling as a blueprint of a house. At any point, there might be electricians,
carpenters, and plumbers using that blueprint. Each one of these builders has a different
relationship to the blueprint, but they all need it to understand the overall structure of the
house. Data models are similar; different users might have different data needs, but the data
model gives them an understanding of the structure as a whole.

Levels of data modeling


Each level of data modeling has a different level of detail.

1. Conceptual data modeling gives a high-level view of the data structure, such
as how data interacts across an organization. For example, a conceptual data model
may be used to define the business requirements for a new database. A conceptual
data model doesn't contain technical details.
2. Logical data modeling focuses on the technical details of a database such as
relationships, attributes, and entities. For example, a logical data model defines how
individual records are uniquely identified in a database. But it doesn't spell out actual
names of database tables. That's the job of a physical data model.
3. Physical data modeling depicts how a database operates. A physical data
model defines all entities and attributes used; for example, it includes table names,
column names, and data types for the database.

More information can be found in this comparison of data models.

Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but two common
methods are the Entity Relationship Diagram (ERD) and the Unified
Modeling Language (UML) diagram. ERDs are a visual way to understand the
relationship between entities in the data model. UML diagrams are very detailed diagrams
that describe the structure of a system by showing the system's entities, attributes,
operations, and their relationships. As a junior data analyst, you will need to understand that
there are different data modeling techniques, but in practice, you will probably be using your
organization’s existing technique.

You can read more about ERD, UML, and data dictionaries in this data modeling techniques
article.

Data analysis and data modeling


Data modeling can help you explore the high-level details of your data and how it is related
across the organization’s information systems. Data modeling sometimes requires data
analysis to understand how the data is put together; that way, you know how to map the
data. And finally, data models make it easier for everyone in your organization to understand
and collaborate with you on your data. This is important for you and everyone on your team!

Okay, let's break down data types in a way that's easy to understand!

Imagine you have a big toy box full of different kinds of toys. You've got stuffed animals,
building blocks, and board games all mixed together. To keep things organized, you decide to
sort them into separate containers.

Data types are like those containers for your data! They help you organize information in a
way that makes sense. Just like you wouldn't put a teddy bear in the same box as a puzzle
piece, you wouldn't want to mix up different types of data.

Here's how it works:


 Number: This is for any kind of numerical data, like counting how many toys you have
or measuring the height of your toy shelf.
 Text/String: This is for words and letters, like the names of your toys or the storybook
that came with your stuffed animal.
 Boolean: This is for data that can only be one of two things, like a light switch that's
either "on" or "off." In your toy box, you might use this to mark whether a toy needs
new batteries (True) or not (False).

Just like organizing your toys makes it easier to find what you're looking for, using the right
data types helps your computer understand and work with information correctly.

Are there any other parts of this topic that you'd like me to clarify? I'm here to help!

And if you want to continue exploring this topic, try one of these follow-up questions:

Think of data types in a spreadsheet like ingredients in a recipe. Just like you need different
ingredients to make a cake (flour, sugar, eggs), you need different data types to store and
work with different kinds of information in your spreadsheet.

The course mentions three main data types you'll encounter:

 Number: This is for any numerical data. Think of it as the numbers you'd use for
calculations, like the price of an item or the quantity in stock.
 Text or String: This type handles text-based information. Imagine this as the words
and letters you'd use for labels, descriptions, or names.
 Boolean: This one is like a light switch; it can only be one of two
values: TRUE or FALSE. You might use this to mark whether a task is completed or not.

Using the correct data type helps your spreadsheet understand how to handle the
information. For example, you can't perform calculations on text, and you can't format a
number as a date. All three data types (Number, Text/String, and Boolean) are
straightforward once you start working with them.

However, learners sometimes find Boolean to be the least intuitive at first, simply because
it's different from how we usually think about information. We're used to numbers and words,
but the idea of something being only TRUE or FALSE takes a bit of getting used to.

Don't worry though, the more you work with Booleans, the easier they become to
understand!

Use Boolean logic


In this reading, you will explore the basics of Boolean logic and learn how to use single and
multiple conditions in a Boolean statement. These conditions are created with Boolean
operators, including AND, OR, and NOT. These operators are similar to mathematical
operators and can be used to create logical statements that filter your results. Data analysts
use Boolean statements to do a wide range of data analysis tasks, such as writing queries for
searches and checking for conditions when writing programming code.

Boolean logic example


Imagine you are shopping for shoes, and are considering certain preferences:

 You will buy the shoes only if they are any combination of pink and grey
 You will buy the shoes if they are entirely pink, entirely grey, or if they are pink and
grey
 You will buy the shoes if they are grey, but not if they have any pink

These Venn diagrams illustrate your shoe preferences. AND is the center of the Venn diagram,
where two conditions overlap. OR includes either condition. NOT includes only the part of the
Venn diagram that doesn't contain the exception.

The intersection of these circles is highlighted to indicate the AND condition requires shoes to
be both grey and pink. The Venn diagram that represents OR includes a circle labeled grey
shoes overlapping with a circle labeled pink shoes. The entirety of both circles is highlighted
to indicate the OR condition means any shoe with grey, pink, or some combination satisfies
the requirement. The Venn diagram that represents NOT includes a circle labeled grey shoes
overlapping with a circle labeled pink shoes. The portion of the grey shoes circle that does
not intersect with the pink shoes circle is highlighted to indicate the NOT condition requires
shoes to not include pink.

Use Boolean logic in statements


In queries, Boolean logic is represented in a statement written with Boolean operators. An
operator is a symbol that names the operation or calculation to be performed. Read on to
discover how you can convert your shoe preferences into Boolean statements.

The AND operator

Your condition is “If the color of the shoe has any combination of grey and pink, you will buy
them.” The Boolean statement would break down the logic of that statement to filter your
results by both colors. It would say IF (Color="Grey") AND (Color="Pink") then
buy them

The AND operator lets you stack both of your conditions.

Below is a simple truth table that outlines the Boolean logic at work in this statement. In the
Color is Grey column, there are two pairs of shoes that meet the color condition. And in
the Color is Pink column, there are two pairs that meet that condition. But in the If
Grey AND Pink column, only one pair of shoes meets both conditions. So, according to
the Boolean logic of the statement, there is only one pair marked true. In other words, there is
one pair of shoes that you would buy.

Color is Grey Color is Pink If Grey AND Pink, then Buy Boolean Logic
Grey/True Pink/True True/Buy True AND True = True
Grey/True Black/False False/Don't buy True AND False = False
Red/False Pink/True False/Don't buy False AND True = False
Red/False Green/False False/Don't buy False AND False = False

The OR operator

The OR operator lets you move forward if either one of your two conditions is met. Your
condition is “If the shoes are grey or pink, you will buy them.” The Boolean statement would
be IF (Color="Grey") OR (Color="Pink") then buy them.

Notice that any shoe that meets either the Color is Grey or the Color is Pink
condition is marked as true by the Boolean logic. According to the truth table below, there are
three pairs of shoes that you can buy.
Color is Color is If Grey OR Pink, Boolean Logic
Grey Pink then Buy
Red/False Black/False False/Don't buy False OR False = False
Black/False Pink/True True/Buy False OR True = True
Grey/True Green/False True/Buy True OR False = True
Grey/True Pink/True True/Buy True OR True = True

The NOT operator

Finally, the NOT operator lets you filter by subtracting specific conditions from the results.
Your condition is "You will buy any grey shoe except for those with any traces of pink in
them." Your Boolean statement would be IF (Color="Grey") AND (Color=NOT
"Pink") then buy them

Now, all of the grey shoes that aren't pink are marked true by the Boolean logic for the NOT
Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink
condition. Only one pair of shoes is excluded in the truth table below.

Color is Color is Boolean Logic If Grey AND (NOT Pink), Boolean Logic
Grey Pink for NOT Pink then Buy
Grey/True Red/False Not False = True True/Buy True AND True =
True
Grey/True Black/False Not False = True True/Buy True AND True =
True
Grey/True Green/False Not False = True True/Buy True AND True =
True
Grey/True Pink/True Not True = False False/Don't buy True AND False =
False

The power of multiple conditions


For data analysts, the real power of Boolean logic comes from being able to combine multiple
conditions in a single statement. For example, if you wanted to filter for shoes that were grey
or pink, and waterproof, you could construct a Boolean statement such as: “IF ((Color =
"Grey") OR (Color = "Pink")) AND (Waterproof="True")

Notice that you can use parentheses to group your conditions together.

Key takeaways
Operators are symbols that name the operation or calculation to be performed. The
operators AND, OR, and NOT can be used to write Boolean statements in programming
languages. Whether you are doing a search for new shoes or applying this logic to queries,
Boolean logic lets you create multiple conditions to filter your results. Now that you know a
little more about Boolean logic, you can start using it!

Resources for more information


 Learn about who pioneered Boolean logic in this historical article: Origins of Boolean
Algebra in the Logic of Classes.
 Find more information about using AND, OR, and NOT from these tips for searching
with Boolean operators.

Think of a data table like a spreadsheet. It's organized into rows and columns.

 Each row is a record, representing a single item like a song in a playlist.


 Each column is a field, which is a characteristic of the item, like the song title or artist.

So, if our data table was about a music playlist, a row would be one song, and the columns
would be things like "Song Title," "Artist," "Genre," and so on.

Step-by-Step: Meet wide and long data


This reading outlines the steps the instructor performs in the following video, Meet wide and
long data. In this video, the instructor presents wide and long data formats and discusses the
types of questions each format can help you answer.

Keep this guide open as you watch the video. It can serve as a helpful reference if you need
additional context or clarification while following the video steps. This is not a graded
activity, but you can complete these steps to practice the skills demonstrated in the video.

What you’ll need

If you would like to access the spreadsheets the instructor uses in this video, select the link to
a dataset to create a copy. If you don’t have a Google account, download the data directly
from the attachments below.

Link to population datasets:

 Population, Latin, and Caribbean Countries, 2010–2019, wide format


 Population, Latin, and Caribbean Countries, 2010–2019, long format

OR

Download data:
Population, Latin, and Caribbean Countries, 2010–2019, wide format
XLSX File

Population, Latin, and Caribbean Countries, 2010–2019, long format


XLSX File

Example 1: Examine wide data


Wide data is a dataset in which every data subject has a single row with multiple columns to
hold the values of various attributes of the subject. It is helpful for comparing specific
attributes across different subjects.

1. Open the Population, Latin, and Caribbean Countries, 2010–2019, wide format
spreadsheet.
2. Each row contains all population data for one country.
3. The population data for each year is contained in a column.
4. Find the annual population of Argentina in row 3.
5. In this wide format, you can quickly compare the annual population of Argentina to
the annual populations of Antigua and Barbuda, Aruba, the Bahamas, or any other
country.

Find the country with the highest population in 2010

1. Select column E, which contains each country’s 2010 population data.


2. Right-click column header E and choose Sort Z to A.
3. Notice that Brazil is now at the top of the list because it had the highest population in
the year 2010.

Find the country with the lowest population in 2013

1. Select column H.
2. Right-click column header H and choose Sort A to Z.
3. Notice that the British Virgin Islands are now at the top because they had the lowest
population of all countries in 2013.

Example 2: Examine long data


Long data is data in which each row represents one observation per subject, so each subject
will be represented by multiple rows. This data format is useful for comparing changes over
time or making other comparisons across subjects.

1. Open the Population, Latin, and Caribbean Countries, 2010–2019, long format
spreadsheet.
2. Notice the data is no longer organized into columns by year. All of the years are now in
one column.
3. Find Argentina’s population data in rows 12-21. Each row contains one year of
Argentina’s population data.

Transforming data

What is data transformation?

A woman presenting data, a hand holding a medal, two people chatting, a ship's wheel being
steered, two people high-fiving each other

In this reading, you will explore how data is transformed and the differences between wide
and long data. Data transformation is the process of changing the data’s format, structure,
or values. As a data analyst, there is a good chance you will need to transform data at some
point to make it easier for you to analyze it.

Data transformation usually involves:

 Adding, copying, or replicating data


 Deleting fields or records
 Standardizing the names of variables
 Renaming, moving, or combining columns in a database
 Joining one set of data with another
 Saving a file in a different format. For example, saving a spreadsheet as a comma
separated values (.csv) file.

Why transform data?

Goals for data transformation might be:

 Data organization: better organized data is easier to use


 Data compatibility: different applications or systems can then use the same data
 Data migration: data with matching formats can be moved from one system to
another
 Data merging: data with the same organization can be merged together
 Data enhancement: data can be displayed with more detailed fields
 Data comparison: apples-to-apples comparisons of the data can then be made

Data transformation example: data merging

Mario is a plumber who owns a plumbing company. After years in business, he buys another
plumbing company. Mario wants to merge the customer information from his newly acquired
company with his own, but the other company uses a different database. So, Mario needs to
make the data compatible. To do this, he must transform the format of the acquired
company’s data. Then, he must remove duplicate rows for customers they had in common.
When the data is compatible and together, Mario’s plumbing company will have a complete
and merged customer database.

Data transformation example: data organization (long to wide)

To make it easier to create charts, you may also need to transform long data to wide data.
Consider the following example of transforming stock prices (collected as long data) to wide
data.

Long data is data where each row contains a single data point for a particular item. In the
long data example below, individual stock prices (data points) have been collected for Apple
(AAPL), Amazon (AMZN), and Google (GOOGL) (items) on the given dates.

Long data example: Stock prices


Wide data is data where each row contains multiple data points for the items identified in
the columns.

Wide data example: Stock prices

With data transformed into wide data, you can create a chart comparing how each company's
stock changed over the same period.

You might notice that all the data included in the long format is also in the wide format. But
wide data is easier to read and understand. That is why data analysts typically transform long
data to wide data more often than when they transform wide data to long data. The following
table summarizes when each format is preferred:
Wide data is preferred when Long data is preferred when
Creating tables and charts with a few Storing a lot of variables about each subject. For
variables about each subject example, 60 years’ worth of interest rates for each
bank
Comparing straightforward line graphs Performing advanced statistical analysis or graphing

Data anonymization

What is data anonymization?

You have been learning about the importance of privacy in data analytics. Now, it is time to
talk about data anonymization and what types of data should be anonymized. Personally
identifiable information, or PII, is information that can be used by itself or with other data to
track down a person's identity.

Data anonymization is the process of protecting people's private or sensitive data by


eliminating that kind of information. Typically, data anonymization involves blanking,
hashing, or masking personal information, often by using fixed-length codes to represent
data columns, or hiding data with altered values.

Your role in data anonymization

Organizations have a responsibility to protect their data and the personal information that
data might contain. As a data analyst, you might be expected to understand what data needs
to be anonymized, but you generally wouldn't be responsible for the data anonymization
itself. A rare exception might be if you work with a copy of the data for testing or development
purposes. In this case, you could be required to anonymize the data before you work with it.

What types of data should be anonymized?

Healthcare and financial data are two of the most sensitive types of data. These industries
rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why
data in these two industries usually goes through de-identification, which is a process used
to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for data
analysts to understand the basics. Here is a list of data that is often anonymized:

 Telephone numbers
 Names
 License plates and license numbers
 Social security numbers
 IP addresses
 Medical records
 Email addresses
 Photographs
 Account numbers

For some people, it just makes sense that this type of data should be anonymized. For others,
we have to be very specific about what needs to be anonymized. Imagine a world where we
all had access to each other’s addresses, account numbers, and other identifiable
information. That would invade a lot of people’s privacy and make the world less safe. Data
anonymization is one of the ways we can keep data private and secure!

The open-data debate


Just like data privacy, open data is a widely debated topic in today’s world. Data analysts
think a lot about open data, and as a future data analyst, you need to understand the basics
to be successful in your new role.

What is open data?

In data analytics, open data is part of data ethics, which has to do with using data ethically.
Openness refers to free access, usage, and sharing of data. But for data to be considered
open, it has to:

 Be available and accessible to the public as a complete dataset


 Be provided under terms that allow it to be reused and redistributed
 Allow universal participation so that anyone can use, reuse, and redistribute the data

Data can only be considered open when it meets all three of these standards.

The open data debate: What data should be publicly available?

One of the biggest benefits of open data is that credible databases can be used more widely.
Basically, this means that all of that good data can be leveraged, shared, and combined with
other data. This could have a huge impact on scientific collaboration, research advances,
analytical capacity, and decision-making. But it is important to think about the individuals
being represented by the public, open data, too.

Third-party data is collected by an entity that doesn’t have a direct relationship with the
data. You might remember learning about this type of data earlier. For example, third parties
might collect information about visitors to a certain website. Doing this lets these third
parties create audience profiles, which helps them better understand user behavior and
target them with more effective advertising.
Personal identifiable information (PII) is data that is reasonably likely to identify a person
and make information known about them. It is important to keep this data safe. PII can
include a person’s address, credit card information, social security number, medical records,
and more.

Everyone wants to keep personal information about themselves private. Because third-party
data is readily available, it is important to balance the openness of data with the privacy of
individuals.

Sites and resources for open data

Luckily for data analysts, there are lots of trustworthy sites and resources available for open
data. It is important to remember that even reputable data needs to be constantly evaluated,
but these websites are a useful starting point:

1. U.S. government data site: Data.gov is one of the most comprehensive data sources
in the US. This resource gives users the data and tools that they need to do research,
and even helps them develop web and mobile applications and design data
visualizations.
2. U.S. Census Bureau: This open data source offers demographic information from
federal, state, and local governments, and commercial entities in the U.S. too.
3. Open Data Network: This data source has a really powerful search engine and
advanced filters. Here, you can find data on topics like finance, public safety,
infrastructure, and housing and development.
4. Google Cloud Public Datasets: There are a selection of public datasets available
through the Google Cloud Public Dataset Program that you can find already loaded
into BigQuery.
5. Dataset Search: The Dataset Search is a search engine designed specifically for data
sets; you can use this to search for specific data sets.

You might also like