Data Analytics Course 2

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 15

Spreadsheets and the data life cycle

To better understand the benefits of using spreadsheets in data analytics, let’s explore how they relate to each
phase of the data life cycle: plan, capture, manage, analyze, archive, and destroy.

 Plan for the users who will work within a spreadsheet by developing organizational standards. This can mean
formatting your cells, the headings you choose to highlight, the color
scheme, and the way you order your data points. When you take the time
to set these standards, you will improve communication, ensure
consistency, and help people be more efficient with their time.
 Capture data by the source by connecting spreadsheets to
other data sources, such as an online survey application or
a database. This data will automatically be updated in the
spreadsheet. That way, the information is always as
current and accurate as possible.
 Manage different kinds of data with a spreadsheet. This can
involve storing, organizing, filtering, and updating information.
Spreadsheets also let you decide who can access the data, how the
information is shared, and how to keep your data safe and secure.
 Analyze data in a spreadsheet to help make better decisions. Some of the most common spreadsheet analysis
tools include formulas to aggregate data or create reports, and pivot tables
for clear, easy-to-understand visuals.
 Archive any spreadsheet that you don’t use often, but might need to reference later with built-in tools.
This is especially useful if you want to store historical data before it gets updated.
 Destroy your spreadsheet when you are certain that you will never need it again, if you have better
backup copies, or for legal or security reasons. Keep in mind, lots of businesses are required to follow
certain rules or have measures in place to make sure data is destroyed properly.
Quick reference: Formulas in spreadsheets
You have been learning a lot about spreadsheets and all kinds of time-
saving calculations and organizational features they offer. One of the
most valuable spreadsheet features is a formula. As a quick reminder,
a formula is a set of instructions that does a specific calculation using
the data in a spreadsheet. Formulas make it easy for data analysts to
do powerful calculations automatically, which helps them analyze data
more effectively. Below is a quick-reference guide to help you get the
most out of formulas.

Formulas
The basics
 When you write a formula in math, it generally ends with an equal sign (2 + 3 = ?). But with formulas,
they always start with one instead (=A2+A3). The equal sign tells the spreadsheet that what follows is
part of a formula, not just a word or number in a cell.
 After you type the equal sign, most spreadsheet applications will display an autocomplete menu that lists
valid formulas, names, and text strings. This is a great way to create and edit formulas while avoiding
typing and syntax errors.
 A fun way to learn new formulas is just by typing an equal sign and a single letter of the alphabet.
Choose one of the options that pops up and you will learn what that formula does.
Mathematical operators
 The mathematical operators used in spreadsheet formulas include:
 Subtraction – minus sign ( - )
 Addition – plus sign ( + )
 Division – forward-slash ( / )
 Multiplication – asterisk ( * )
Auto-filling
The lower-right corner of each cell has a fill handle. It is a small green square in Microsoft Excel and a small
blue circle in Google Sheets.

 Click the fill handle square or circle for a cell and drag it down a column to auto-fill other cells in the
column with the same value or formula in that cell.
 Click the fill handle square or circle for a cell and drag it across a row to auto-fill other cells in the row
with the same value or formula in that cell.
 If you want to create a numbered sequence in a column or row, do the following: 1) Fill in the first two
numbers of the sequence in two adjacent cells, 2) Select to highlight the cells, and 3) Drag the fill handle
square or circle to the last cell to complete the sequence of numbers. For example, to insert 1 through
100 in each row of column A, enter 1 in cell A1 and 2 in cell A2. Then, select to highlight both cells, click
the fill handle square or circle in cell A2, and drag it down to cell A100. This auto-fills the numbers
sequentially so you don't have to type them in each cell.
Absolute referencing
 Absolute referencing is marked by a dollar sign ($). For example, =$A$10 has absolute referencing for
both the column and the row value
 Relative references (which is what you normally do e.g. “=A10”) will change anytime the formula is
copied and pasted. They are in relation to where the referenced cell is located. For example if you
copied “=A10” to the cell to the right it would become “=B10”. With absolute referencing “=$A$10” copied
to the cell to the right would remain “=$A$10”. But if you copied $A10 to the cell below, it would change
to $A11 because the row value isn't an absolute reference.
 Absolute references will not change when you copy and paste the formula in a different cell. The cell
being referenced is always the same.
 To easily switch between absolute and relative referencing in the formula bar, highlight the reference you
want to change and press the F4 key; for example, if you want to change the absolute reference, $A$10,
in your formula to a relative reference, A10, highlight $A$10 in the formula bar and then press the F4 key
to make the change.
Data range
 When you click into your formula, the colored ranges let you see which cells are being used in your
spreadsheet. There are different colors for each unique range in your formula.
 In a lot of spreadsheet applications, you can press the F2 (or Enter) key to highlight the range of data in
the spreadsheet that is referenced in a formula. Click the cell with the formula, and then press the F2 (or
Enter) key to highlight the data in your spreadsheet.
Combining with functions
 COUNTIF() is a formula and a function. This means the function runs based on criteria set by the
formula. In this case, COUNT is the formula; it will be executed IF the conditions you create are true. For
example, you could use =COUNTIF(A1:A16, “7”) to count only the cells that contained the number 7.
Combining formulas and functions allows you to do more work with a single command.

Quick reference: Functions in spreadsheets


As a quick refresher, a function is a preset command that automatically performs a specific process or task using
the data in a spreadsheet. Functions give data analysts the ability to do calculations, which can be anything from
simple arithmetic to complex equations. Use this reading to help you keep track of some of the most useful
options.

Functions
The basics
 Just like formulas, start all of your functions with an equal sign; for example =SUM. The equal sign tells
the spreadsheet that what follows is part of a function, not just a word or number in a cell.
 After you type the equal sign, most spreadsheet applications will display an autocomplete menu that lists
valid functions, names, and text strings. This is a great way to create and edit functions while avoiding
typing and syntax errors.
 A fun way to learn new functions is by simply typing an equal sign and a single letter of the alphabet.
Choose one of the options that pops up and learn what that function does.
Difference between formulas and functions
 A formula is a set of instructions used to perform a calculation using the data in a spreadsheet.
 A function is a preset command that automatically performs a specific process or task using the data in a
spreadsheet.
Popular functions
A lot of people don’t realize that keyboard shortcuts like cut, save, and find are actually functions. These
functions are built into an application and are amazing time-savers. Using shortcuts lets you do more with less
effort. They can make you more efficient and productive because you are not constantly reaching for the mouse
and navigating menus. Use these links to discover the most popular shortcuts, for Chromebook, PC, and Mac.

Auto-filling
The lower-right corner of each cell has a fill handle. It is a small green square in Microsoft Excel and a small blue
circle in Google Sheets.

 Click the fill handle for a cell and drag it down a column to auto-fill other cells in the column with the
same formula or function used in that cell.
 Click the fill handle for a cell and drag it across a row to auto-fill other cells in the row with the same
formula or function used in that cell.
Relative, absolute, and mixed references
 Relative references (cells referenced without a dollar sign, like A2) will change when you copy and paste
the function into a different cell. With relative references, the location of the cell that contains the function
determines the cells used by the function.
 Absolute references (cells fully referenced with a dollar sign, like $A$2) will not change when you copy
and paste the function into a different cell. With absolute references, the cells referenced always remain
the same.
 Mixed references (cells partially referenced with a dollar sign, like $A2 or A$2) will change when you
copy and paste the function into a different cell. With mixed references, the location of the cell that
contains the function determines the cells used by the function, but only the row or column is relative (not
both).
 In spreadsheets, you can press the F4 key to toggle between relative, absolute, and mixed references in
a function. Click the cell containing the function, highlight the referenced cells in the formula bar, and
then press F4 to toggle between and select relative, absolute, or mixed referencing.
Data ranges
 When you click a cell that contains a function, colored data ranges in the formula bar indicate which cells
are being used in the spreadsheet. There are different colors for each unique range in a function.
 Colored data ranges help prevent you from getting lost in complex functions.
 In spreadsheets, you can press the F2 key to highlight the range of data used by a function. Click the cell
containing the function, highlight the range of data used by the function in the formula bar, and then
press F2. The spreadsheet will go to and highlight the cells specified by the range.
Data ranges evaluated for a condition
COUNTIF is an example of a function that returns a value based on a condition that the data range is evaluated
for. The function counts the number of cells that meet the criteria. For example, in an expense spreadsheet, use
COUNTIF to count the number of cells that contain a reimbursement for "airfare."

For more information, refer to:

 Microsoft Support's page for COUNTIF


 Google Help Center's documentation for COUNTIF where you can copy a sheet with COUNTIF
examples (click "Use Template" if you click the COUNTIF link provided on this page)
Conclusion
There are a lot more functions that can help you make the most of your data. This is just the start. You can keep
learning how to use functions to help you solve complex problems efficiently and accurately throughout your
entire
1.
Question 1

Activity overview

In this activity, you will import a dataset, build a custom data table, and use functions to analyze your data. For
this activity, imagine you're a data analyst working for a recruiting agency. This recruiting agency helps all sorts
of companies find skilled people to fill open data analytics jobs. The agency has collected data about job
applications for opportunities posted on its website for the year 2019.

The agency has asked your team to optimize its online application process. Your assignment is to summarize
the agency’s job application data. In particular, you want to answer the following questions:

 What was the total number of applications received per month in 2019?
 Which months had the least and greatest number of total applications received?
 What was the average number of applications received per month?
To do this, you’ll work with a spreadsheet. You’ll use spreadsheet functions to make calculations based on your
data and create a custom data table to summarize your results.

By the time you complete this activity, you will be able to import a spreadsheet file, sort data, create a custom
data table, and use spreadsheet functions to work with your data. Spreadsheets are an essential tool for every
data analyst. Using spreadsheets to organize and analyze data is an important skill that you will continue to
develop throughout your career.

What you will need


The agency’s data contains information about all of the data analytics job applications received in 2019. The
data includes the following column headers: Applicant ID, Date, Job Title, Job Location, Hired, and Easy Apply.
Below is a description of each column header and sample values.

To get started, access the spreadsheet that


contains the data. Click the link and make a
copy of the spreadsheet. (Note: this data set is
a bit larger, so it may take 5-7 seconds for the
rows to appear.)

Or, if you don’t have a Google account, you


may download the dataset directly from the
attachment below:

2019_data_analyst_job

CSV File
Sort your data

Because you want to answer questions based on a specific timeframe (in this case, applications received per
month in 2019), it will be useful to start off by sorting the data by date. Sorting involves arranging data into a
meaningful order to make it easier to understand, analyze, and visualize. Considering the order in which each
application was received can help you discover trends in data analytics job applications.

1. First, rename your spreadsheet. In the upper left corner, click Untitled Spreadsheet and enter a new
name. You can use the name 2019_data_analyst_jobs, or a similar name that clearly describes the data
your spreadsheet contains.
2. Next, rename your current sheet. Later, you will add another sheet to your spreadsheet for your data
table. Click the sheet tab and choose Rename on the menu. Then, type raw data.

3. If you want to get a better view of your data, you can make the columns wider by dragging the right boundary
of the column heading. This may apply to the Job Title (C) and Job Location (D) columns.

4. Select all the data in the spreadsheet by clicking the rectangle in the top left corner of your sheet.

5. Then, from the menu bar, select Data → Sort range. (Note: For some versions of Google Sheets, the selection
Advanced range sorting options may appear on the Data drop-down menu instead of Sort range).

6. In the pop-up window, click the Data has header row box. Now you can choose specific column headers to sort
by.

7. In the Sort by dropdown, choose the header Date. Then, click A → Z to sort in ascending order.
8. Finally, click Sort.

This displays the chronological order of applications received.

Create a data table

Now that you’ve sorted your data, you’re ready to create a custom data table based on the specific questions
you want to answer. Your table will clearly display the data you want to summarize. Plus, if you want to share
your results, your table presents a well-organized format for your data that’s easy to understand.

As a reminder, you want to answer the following questions:

 What was the total number of applications received per month in 2019?
 Which months had the least and greatest number of total applications received?
 What was the average number of applications received per month?
Note that the above questions could also be answered using a pivot table. Pivot tables will be discussed in detail
later on in the course.

Begin with the first question. You can use spreadsheet functions to help you discover the total number of
applications received per month.

1. To start, add another sheet to your spreadsheet. Click the Add sheet icon (the plus sign) at the bottom
left corner of your spreadsheet.
2. Rename the new sheet. Click the sheet tab and choose Rename on the menu. Then, type summary data.
Labeling your sheets helps organize your data. Plus, if you return to this project in the future after some
time has elapsed, your labels will make it easier to recall what you were working on.
3. Next, add column headers to your table. In cell A1 of your summary data sheet, type Month. In cell B1,
type Applicants.
4. Add the name of the first month under Month. In cell A2, type January. Press Enter.
5. Now, use autofill to add the rest of the months of the year. Select cell A2 again. A small blue square, or
fill handle, will appear in the bottom-right corner of the cell. Click on the fill handle and drag it down to
cell A13 to autofill all the months of the year.
At this point, your table should appear like this:
6. Next, you want to convert the number values in the Date column into text. You can use the TEXT function to
do this. Converting this data to text will help you total the applications by month in your data table later on (see
Step 8). First, click the raw data tab to return to your raw data sheet. Now, add a new column header. In cell G1,
type Month.

7. The TEXT function converts a number into text according to a specified format. In this case, you want to list
the months of the year. You can use the format “mmmm” for the full name of the month. In cell G2, type the
following code (do not copy+paste):

=TEXT(B2,"mmmm")

The first entry (B2) refers to the cell you want to convert. The second entry (“mmmm”) refers to the specific
format you want to use. Press Enter.

(Note: it is very important that you type the formula syntax directly into the calculation field within Google Sheets.
Copying + Pasting from these instructions will cause errors in the calculations).

8. Select cell G2. Then, double-click on the fill handle to copy the function down the column. This will populate all
the cells in the column with the corresponding month.

9. Now you're ready to total the applications by month. You could do this manually, by filtering the data and
counting the number of entries for each month, but this would take a long time. A more efficient method is to use
the COUNTIF function. First, click the summary data tab to return to your summary data sheet.

10. The COUNTIF function quickly counts how many items in a range of cells meet a given criterion. In cell B2,
type =COUNTIF('raw data'!G:G,A2). The first entry ('raw data'!G:G) refers to the range where you are counting
the data. The range is located on your raw data sheet ('raw data'!) and includes all column G (G:G). This column
contains the data for months. The second entry (A2) refers to the criterion you want to count. In this case, it’s
“January,” the value in cell A2 of your summary data sheet. The function will tell you how many times January
(the criterion) appears in the Date column (the range).

11. Press Enter. You’ll notice the value 2387 appears in cell B2. This means that 2,387 job applications were
submitted in January.
12. Select cell B2. Double-click the fill handle to copy the function down through cell B13.

Now your table shows the total applications submitted for each month of 2019:

13. You can use the SUM function to calculate the overall total for applications submitted in 2019. Before you
enter the function, make a label for the result. In cell A14, type Total.

14. The SUM function adds up the values in a range of cells. Using the SUM function saves time and effort,
especially if you have to find the sum of a long sequence of numbers. In the parentheses of the function, add the
range you want to sum (B2:B13). In cell B14, type =SUM(B2:B13).

Now that you’ve organized the monthly job application data in your data table, you’re ready to answer the
remaining questions:

 Which months had the least and greatest number of total applications received?
 What was the average number of applications received per month?
You can use the MIN, MAX, and AVERAGE functions to make the necessary calculations, and then add the
results to your data table.

15. First, make labels for your results. In cell A16, type Min. In cell A17, type Max. In cell A18, type Avg.

16. Use the MIN function to calculate the least number of applications received in a month. The MIN function
returns the minimum value in a numeric dataset. In cell B16, type =MIN(B2:B13).

17. Use the MAX function to calculate the greatest number of applications received in a month. The MAX
function returns the maximum value in a numeric dataset. In cell B17, type =MAX(B2:B13).

18. Use the AVERAGE function to calculate the average monthly applications received in 2019. The AVERAGE
function returns the average value in a numeric dataset. In cell B18, type =AVERAGE(B2:B13).

Your table should appear like this:


Your table displays the following results for the year 2019: The least number of applications received in a month
was 2,312 (February); the greatest number of applications received in a month was 3,138 (July); the average
number of applications received per month was 2,716.333333.

Your work will help your team discover important trends and patterns in the agency’s data and generate insights
for optimizing the agency’s online applications process. For example, because your findings reveal that February
was the slowest month, the agency can devote more of its advertising and outreach budget to February and less
to the peak month of July. This is the strategic impact of data analysis.

(Optional): Feel free to explore formatting options for your data table using bold, center align, fill color, borders,
and more. Formatting lets you highlight important information, and helps capture the attention of your audience.
At this point, try not to confuse statement of work with scope of work, which are both abbreviated as
SOW. Although they both are used to define deliverables and a timeline, they aren't the same and shouldn't
be used interchangeably.

A statement of work is a document that clearly identifies the products and services a vendor or contractor
will provide to an organization. It includes objectives, guidelines, deliverables, schedule, and costs.

A scope of work is project-based and sets the expectations and boundaries of a project. A scope of work may
be included in a statement of work to help define project outcomes.

As a junior data analyst, It's more typical to be asked to create a scope of work than a statement of work.
Activity overview

You have been learning about the role of a data analyst and how to manage, analyze, and visualize data.
Now, you will consider a valuable tool to help you practice structured thinking and avoid mistakes: a
scope-of-work (SOW).

In this activity, you’ll get practical experience developing an SOW document with the help of a handy
template. You will then complete an example SOW for an imaginary project of your choosing and learn
how analysts outline the work they are going to perform. By the time you complete this activity, you will
be familiar with an essential, industry-standard tool, and gain comfort asking the right questions to
develop an SOW.

Before you get started, take a minute to think about the main ideas, goals, and target audiences of SOW
documents.

Scope of work: What you need to know

As a data analyst, it’s hard to overstate the importance of an SOW document. A well-defined SOW keeps
you, your team, and everyone involved with a project on the same page. It ensures that all contributors,
sponsors, and stakeholders share the same understanding of the relevant details.

Why do you need an SOW?

The point of data analysis projects is to complete business tasks that are useful to the stakeholders.
Creating an SOW helps to make sure that everyone involved, from analysts and engineers to managers
and stakeholders, shares the understanding of what those business goals are, and the plan for
accomplishing them.

Clarifying requirements and setting expectations are two of the most important parts of a project. Recall
the first phase of the Data Analysis Process—asking questions.

As you ask more and more questions to clarify requirements, goals, data sources, stakeholders, and any
other relevant info, an SOW helps you formalize it all by recording all the answers and details. In this
context, the word “ask” means two things. Preparing to write an SOW is about asking questions to learn
the necessary information about the project, but it’s also about clarifying and defining what you’re being
asked to accomplish, and what the limits or boundaries of the “ask” are. After all, if you can’t make a
distinction between the business questions you are and aren’t responsible for answering, then it’s hard
to know what success means!

What is a good SOW?

There’s no standard format for an SOW. They may differ significantly from one organization to another,
or from project to project. However, they all have a few foundational pieces of content in common.

 Deliverables: What work is being done, and what things are being created as a result of this
project? When the project is complete, what are you expected to deliver to the stakeholders? Be
specific here. Will you collect data for this project? How much, or for how long?
Avoid vague statements. For example, “fixing traffic problems” doesn’t specify the scope. This could
mean anything from filling in a few potholes to building a new overpass. Be specific! Use numbers and
aim for hard, measurable goals and objectives. For example: “Identify top 10 issues with traffic patterns
within the city limits, and identify the top 3 solutions that are most cost-effective for reducing traffic
congestion.”

 Milestones: This is closely related to your timeline. What are the major milestones for progress in
your project? How do you know when a given part of the project is considered complete?
Milestones can be identified by you, by stakeholders, or by other team members such as the Project
Manager. Smaller examples might include incremental steps in a larger project like “Collect and process
50% of required data (100 survey responses)”, but may also be larger examples like ”complete initial
data analysis report” or “deliver completed dashboard visualizations and analysis reports to
stakeholders”.

 Timeline: Your timeline will be closely tied to the milestones you create for your project. The
timeline is a way of mapping expectations for how long each step of the process should take. The
timeline should be specific enough to help all involved decide if a project is on schedule. When
will the deliverables be completed? How long do you expect the project will take to complete? If
all goes as planned, how long do you expect each component of the project will take? When can
we expect to reach each milestone?
 Reports: Good SOWs also set boundaries for how and when you’ll give status updates to
stakeholders. How will you communicate progress with stakeholders and sponsors, and how
often? Will progress be reported weekly? Monthly? When milestones are completed? What
information will status reports contain?
At a minimum, any SOW should answer all the relevant questions in the above areas. Note that these
areas may differ depending on the project. But at their core, the SOW document should always serve the
same purpose by containing information that is specific, relevant, and accurate. If something changes in
the project, your SOW should reflect those changes.

What is in and out of scope?

SOWs should also contain information specific to what is and isn’t considered part of the project. The
scope of your project is everything that you are expected to complete or accomplish, defined to a level of
detail that doesn’t leave any ambiguity or confusion about whether a given task or item is part of the
project or not.

Notice how the previous example about studying traffic congestion defined its scope as the area within
the city limits. This doesn’t leave any room for confusion — stakeholders need only to refer to a map to
tell if a stretch of road or intersection is part of the project or not. Defining requirements can be trickier
than it sounds, so it’s important to be as specific as possible in these documents, and to use quantitative
statements whenever possible.

For example, assume that you’re assigned to a project that involves studying the environmental effects
of climate change on the coastline of a city: How do you define what parts of the coastline you are
responsible for studying, and which parts you are not?

In this case, it would be important to define the area you’re expected to study using GPS locations, or
landmarks. Using specific, quantifiable statements will help ensure that everyone has a clear
understanding of what’s expected.
The importance of context
Context is the condition in which something exists or happens. Context is important in
data analytics because it helps you sift through huge amounts of disorganized data and
turn it into something meaningful. The fact is, data has little value if it is not paired with
context.

Understanding the context behind the data can help us make it more meaningful at every stage of the data
analysis process. For example, you might be able to make a few guesses about what you're looking at in the
following table, but you couldn't be certain without more context.

2010 28000
2005 18000
2000 23000
1995 10000
On the other hand, if the first column was labeled to represent the years when a survey was conducted, and the
second column showed the number of people who responded to that survey, then the table would start to make
a lot more sense. Take this a step further, and you might notice that the survey is conducted every 5 years. This
added context helps you understand why there are five-year gaps in the table.

Years (Collected every 5 years) Respondents


2010 28000
2005 18000
2000 23000
1995 10000
Context can turn raw data into meaningful information. It is very important for data analysts to contextualize their
data. This means giving the data perspective by defining it. To do this, you need to identify:

 Who: The person or organization that created, collected, and/or funded the data collection
 What: The things in the world that data could have an impact on
 Where: The origin of the data
 When: The time when the data was created or collected
 Why: The motivation behind the creation or collection
 How: The method used to create or collect it

Understanding and including the context is important during each step of your
analysis process, so it is a good idea to get comfortable with it early in your career. For
example, when you collect data, you’ll also want to ask questions about the context to
make sure that you understand the business and business process. During organization,
the context is important for your naming conventions, how you choose to show
relationships between variables, and what you choose to keep or leave out. And finally,
when you present, it is important to include contextual information so that your
stakeholders understand your analysis.

You might also like