GROUP BY SQL

GROUP BY is a SQL clause that groups rows sharing the same values in one or more columns. It is typically used with aggregate functions such as SUM, COUNT, and AVG to perform calculations on each group. The output contains one row per group, showing the grouping values alongside the calculated results.

Written by Perlego with AI-assistance

11 Key excerpts on "GROUP BY SQL"

eBook - ePub
SQL for Data Analytics
Harness the power of SQL to extract insights from data, 3rd Edition
- Benjamin Johnston, Jun Shan, Matt Goldwasser, Upom Malik(Authors)
- 2022(Publication Date)
- Packt Publishing
  (Publisher)
GROUP BY clause.
Note
To access the source code for this specific section, please refer to https://packt.link/OU9zr .

Aggregate Functions with the GROUP BY Clause

So far, you have used aggregate functions to calculate statistics for an entire column. However, most times you are interested in not only the aggregate values for a whole table but also the values for smaller groups in the table. To illustrate this, refer back to the customers table. You know that the total number of customers is 50,000. However, you might want to know how many customers there are in each state. But how can you calculate this?
You could determine how many states there are with the following query:
SELECT DISTINCT state FROM customers;
You will see 50 distinct states, Washington D.C., and NULL returned as a result of the preceding query, totaling 52 rows. Once you have the list of states, you could then run the following query for each state:
SELECT COUNT(*) FROM customers WHERE state='{state}'
Although you can do this, it is incredibly tedious and can take a long time if there are many states. The GROUP BY clause provides a much more efficient solution.
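As a minimal sketch of the difference, the snippet below uses Python's built-in sqlite3 module with a tiny in-memory customers table. The rows and state codes are invented for illustration and are not the book's 50,000-row dataset; NULL states are filtered out in both approaches so the results are directly comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, state TEXT)")
# Hypothetical sample rows (assumption, not the book's data).
conn.executemany(
    "INSERT INTO customers (state) VALUES (?)",
    [("CA",), ("CA",), ("NY",), ("TX",), ("TX",), ("TX",), (None,)],
)

# Tedious approach: list the states, then run one COUNT(*) query per state.
states = [row[0] for row in conn.execute(
    "SELECT DISTINCT state FROM customers WHERE state IS NOT NULL")]
per_state = {
    s: conn.execute(
        "SELECT COUNT(*) FROM customers WHERE state = ?", (s,)).fetchone()[0]
    for s in states
}

# GROUP BY approach: a single query returns the count for every state.
grouped = dict(conn.execute(
    "SELECT state, COUNT(*) FROM customers "
    "WHERE state IS NOT NULL GROUP BY state"))

print(per_state == grouped)
```

Both approaches yield the same counts, but the second issues one query regardless of how many states exist.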

The GROUP BY Clause

GROUP BY is a clause that divides the rows of a dataset into multiple groups based on some sort of key that is specified in the clause. An aggregate function is then applied to all the rows within a single group to produce a single number for that group. The GROUP BY key and the aggregate value for the group are then displayed in the SQL output. The following diagram illustrates this general process:

Figure 4.11: General GROUP BY computational model
In the preceding diagram, you can see that the dataset has multiple groups (Group 1, Group 2, …, Group N). Here, the aggregate function is applied to all the rows in Group 1 and generates the result Aggregate 1. Then, the aggregate function is applied to all the rows in Group 2 and generates the result Aggregate 2

eBook - ePub

Data Wrangling with SQL

A hands-on guide to manipulating, wrangling, and engineering data using SQL

Raghav Kandarpa, Shivangi Saxena(Authors)
2023(Publication Date)
Packt Publishing
(Publisher)

each category.

The result set displays the total revenue, maximum revenue, and minimum revenue for each category. In this case, the X category has a total revenue of 250, with a maximum revenue of 150 and a minimum revenue of 100. The Y category has a total revenue of 750, with a maximum revenue of 300 and a minimum revenue of 200.

In summary, the GROUP BY clause in SQL allows us to group rows in a result set based on one or more columns and then use aggregate functions to perform calculations on the grouped data. This is a powerful tool for analyzing and summarizing large sets of data.
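The summary figures quoted above can be reproduced with a short sketch. The table name sales and the individual revenue rows are assumptions, chosen only so that the totals match the numbers in the excerpt; the excerpt's original table and query are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, revenue INTEGER)")
# Hypothetical rows picked to match the totals quoted in the excerpt.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("X", 100), ("X", 150), ("Y", 200), ("Y", 250), ("Y", 300)],
)

# One output row per category, with three aggregates computed per group.
summary = conn.execute("""
    SELECT category, SUM(revenue), MAX(revenue), MIN(revenue)
    FROM sales
    GROUP BY category
    ORDER BY category
""").fetchall()

print(summary)  # [('X', 250, 150, 100), ('Y', 750, 300, 200)]
```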

Case scenario

An interesting use case scenario for the GROUP BY clause in SQL could be analyzing survey data for a market research company. The company may have a table of survey responses, with columns for the respondent’s age, gender, income, and overall satisfaction rating.

To understand the demographics of the respondents and their satisfaction levels, the market research company might use GROUP BY to group the responses by age and gender and then use aggregate functions such as COUNT() and AVG() to calculate the total number of respondents for each group and their average satisfaction rating.

respondent_id	age	gender	income	satisfaction_rating
1	25	Male	50000	4
2	30	Female	55000	5
3	35	Male	60000	3
4	40	Female	65000	4
5	45	Male	70000	4
6	50	Female	75000	5

Figure 8.6 – survey_responses table

For example, the following SQL query would group responses by age and gender and calculate the total number of respondents and average satisfaction rating for each group:

SELECT age, gender,
       COUNT(*) AS num_respondents,
       AVG(satisfaction_rating) AS avg_satisfaction
FROM survey_responses
GROUP BY age, gender
ORDER BY avg_satisfaction DESC;
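To see what this query returns, the following sketch loads the six rows from Figure 8.6 into an in-memory SQLite database and runs it. Only the choice of SQLite and the Python wrapper are assumptions; the table, columns, and data come from the excerpt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_responses (
        respondent_id INTEGER, age INTEGER, gender TEXT,
        income INTEGER, satisfaction_rating INTEGER)
""")
conn.executemany("INSERT INTO survey_responses VALUES (?, ?, ?, ?, ?)", [
    (1, 25, "Male", 50000, 4), (2, 30, "Female", 55000, 5),
    (3, 35, "Male", 60000, 3), (4, 40, "Female", 65000, 4),
    (5, 45, "Male", 70000, 4), (6, 50, "Female", 75000, 5),
])

result = conn.execute("""
    SELECT age, gender,
           COUNT(*) AS num_respondents,
           AVG(satisfaction_rating) AS avg_satisfaction
    FROM survey_responses
    GROUP BY age, gender
    ORDER BY avg_satisfaction DESC
""").fetchall()

for row in result:
    print(row)
```

Because every (age, gender) pair in the sample is unique, each group contains exactly one respondent. With real survey data, grouping by an age band rather than the exact age would likely give more useful groups.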


eBook - ePub
OCA: Oracle Database 11g Administrator Certified Associate Study Guide
Exams1Z0-051 and 1Z0-052
- Biju Thomas(Author)
- 2011(Publication Date)
- Sybex
  (Publisher)
job_id. The SQL shows the number of employees holding each job within each department:

SELECT department_id, job_id, COUNT(*)
FROM employees
GROUP BY department_id, job_id
ORDER BY 1, 2;

DEPARTMENT_ID JOB_ID       COUNT(*)
------------- ---------- ----------
           10 AD_ASST             1
           20 MK_MAN              1
           20 MK_REP              1
           30 PU_CLERK            5
           30 PU_MAN              1
           40 HR_REP              1
           50 SH_CLERK           20
           50 ST_CLERK           20
           50 ST_MAN              5
           60 IT_PROG             5
           70 PR_REP              1
           80 SA_MAN              5
           80 SA_REP             29
           90 AD_PRES             1
           90 AD_VP               2
          100 FI_ACCOUNT          5
          100 FI_MGR              1
          110 AC_ACCOUNT          1
          110 AC_MGR              1
              SA_REP              1

The GROUP BY clause groups data, but Oracle does not guarantee the order of the result set by the grouping order. To order the data in any specific order, you must use the ORDER BY clause.

Group-Function Overview
Tables 3-1 and 3-2
eBook - ePub
Introductory Relational Database Design for Business, with Microsoft Access
- Jonathan Eckstein, Bonnie R. Schultz(Authors)
- 2017(Publication Date)
- Wiley
  (Publisher)
CustomerID attribute is already present in ORDERS. Dispensing with the CUSTOMER table, we obtain exactly the same output through the simpler query:
SELECT CustomerID, Sum(UnitPrice*Quantity) AS Revenue
FROM ORDERS, ORDERDETAIL, PRODUCT
WHERE ORDERS.OrderID = ORDERDETAIL.OrderID
  AND ORDERDETAIL.ProductID = PRODUCT.ProductID
GROUP BY CustomerID;

But if we wish to display customer name information, we must use the CUSTOMER table. Returning to the version of the query that displays customer names, an alternative to using a seemingly redundant GROUP BY clause is to apply some unnecessary aggregation operation in the SELECT clause. For example, we could write:
SELECT Min(FirstName), Min(LastName), Sum(UnitPrice*Quantity) AS Revenue
FROM CUSTOMER, ORDERS, ORDERDETAIL, PRODUCT
WHERE CUSTOMER.CustomerID = ORDERS.CustomerID
  AND ORDERS.OrderID = ORDERDETAIL.OrderID
  AND ORDERDETAIL.ProductID = PRODUCT.ProductID
GROUP BY CUSTOMER.CustomerID;

The Min operation, when applied to text data, selects the alphabetically first value within each group. But since FirstName and LastName take the same value throughout each group, the output for each group contains the only possible applicable first name and last name. This query is probably more confusing to a human than the original one, however. If we were to take this approach, we would also need to use more AS modifiers to make the column names in the output more readable.
We do not give an example here, but it is possible to group not only by the values of simple attributes, but also by the values of general expressions (computed fields).
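As a hedged illustration of grouping by a computed field, the sketch below groups order lines into bands of 100 currency units by their line total. The order_lines table, its columns, and the banding rule are hypothetical and do not come from the book's database; the example runs on SQLite rather than Microsoft Access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per order line.
conn.execute("CREATE TABLE order_lines (unit_price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?)",
    [(5.0, 2), (20.0, 1), (60.0, 2), (150.0, 1), (30.0, 5)],
)

# The grouping key is an expression, not a stored column:
# band 0 holds line totals below 100, band 1 holds totals from 100 to 199.
bands = conn.execute("""
    SELECT CAST(unit_price * quantity / 100 AS INTEGER) AS band,
           COUNT(*) AS num_lines,
           SUM(unit_price * quantity) AS revenue
    FROM order_lines
    GROUP BY CAST(unit_price * quantity / 100 AS INTEGER)
    ORDER BY band
""").fetchall()

print(bands)  # [(0, 2, 30.0), (1, 3, 420.0)]
```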
When you use GROUP BY, you should have at least one aggregation function such as Sum or Avg
eBook - ePub
Pocket Primer
- Oswald Campesato(Author)
- 2021(Publication Date)
- Mercury Learning and Information
  (Publisher)
GROUP BY in a SQL statement to display the number of purchase orders that were created on a daily basis:

SELECT purchase_date, COUNT(*) FROM purchase_orders GROUP BY purchase_date;

Select Statements with a HAVING Clause

The following SQL statement illustrates how to combine GROUP BY with a HAVING clause to display the number of purchase orders that were created on a daily basis, but only for those days on which at least four purchase orders were created:

SELECT purchase_date, COUNT(*) FROM purchase_orders GROUP BY purchase_date HAVING COUNT(purchase_date) > 3;
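A minimal runnable sketch of the HAVING filter, assuming an in-memory SQLite database and invented purchase dates (the real purchase_orders table is not shown in the excerpt):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase_orders (po_id INTEGER PRIMARY KEY, purchase_date TEXT)")
# Hypothetical data: 4 orders on Jan 1, 2 on Jan 2, 5 on Jan 3.
dates = ["2024-01-01"] * 4 + ["2024-01-02"] * 2 + ["2024-01-03"] * 5
conn.executemany(
    "INSERT INTO purchase_orders (purchase_date) VALUES (?)",
    [(d,) for d in dates],
)

# HAVING is evaluated after grouping, so it can test the aggregate itself.
busy_days = conn.execute("""
    SELECT purchase_date, COUNT(*)
    FROM purchase_orders
    GROUP BY purchase_date
    HAVING COUNT(purchase_date) > 3
    ORDER BY purchase_date
""").fetchall()

print(busy_days)  # [('2024-01-01', 4), ('2024-01-03', 5)]
```

The day with only two orders is grouped and counted, then dropped by the HAVING condition.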

WORKING WITH INDEXES IN SQL

SQL enables you to define one or more indexes for a table, which can greatly reduce the amount of time that is required to select a single row or a subset of rows from a table.

A SQL index on a table consists of one or more attributes in that table. Updates to a table that has one or more indexes require more time than updates to a table without indexes, because both the table and the index (or indexes) must be updated. Therefore, it is best to create indexes only on columns that are frequently searched.

Here are two examples of creating indexes on the customers table:

CREATE INDEX idx_cust_lname ON customers (lname);
CREATE INDEX idx_cust_lname_fname ON customers (lname, fname);
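The sketch below runs the two statements against SQLite and asks the engine how it would execute a lookup by last name. The customers table definition and the search value are assumptions, and the exact wording of the query plan varies between SQLite versions and other database engines, so the plan text is printed rather than relied upon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table definition; only lname and fname appear in the excerpt.
conn.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, fname TEXT, lname TEXT)")
conn.execute("CREATE INDEX idx_cust_lname ON customers (lname)")
conn.execute("CREATE INDEX idx_cust_lname_fname ON customers (lname, fname)")

# List the indexes SQLite now maintains for the table.
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list('customers')"))

# Ask SQLite for its plan; the last column holds the human-readable detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE lname = 'Smith'"
).fetchall()
plan_text = " ".join(row[3] for row in plan)

print(index_names)
print(plan_text)
```

The plan should mention one of the two indexes instead of a full table scan, which is the speed-up the passage describes for frequently searched columns.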

WHAT ARE KEYS IN AN RDBMS?

A key
eBook - ePub
Hands-On Data Science with SQL Server 2017
Perform end-to-end data analysis to gain efficient data insight
- Marek Chmel, Vladimír Mužný(Authors)
- 2018(Publication Date)
- Packt Publishing
  (Publisher)
we need to the GROUP BY clause separated by commas. Let's execute the following query:
SELECT CategoryName, SubcategoryName, COUNT(*) AS RecordCount
FROM #src
GROUP BY CategoryName, SubcategoryName
The result of the preceding query shows groups formed of Category names, and then of Subcategory names. Using the GROUP BY clause is relatively straightforward, but what if we want to add subtotals into our result? For this purpose, we have an additional syntax helper, called grouping sets.

Grouping sets are useful when we want to calculate not only raw groups, but also higher groups. In the previous example, we combined CategoryName and SubcategoryName. If you look carefully at this, you can see that the numbers in the RecordCount column are the same whether we have a CategoryName column added into our query or not. This is because subcategories are fully nested into categories without overlaps. The following sample query will not only calculate record counts for every group built from a combination of categories and subcategories, but it will also compute record count for categories only, and for whole datasets:
SELECT CategoryName, SubcategoryName, COUNT(*) AS RecordCount
FROM #src
GROUP BY GROUPING SETS
(
    (CategoryName, SubcategoryName),
    (CategoryName),
    ()
)
The GROUP BY clause in the preceding query has changed. The GROUPING SETS keyword introduces brackets containing sets of grouping criteria also enclosed in brackets. Remember that both brackets are mandatory. In our example we have three grouping sets:
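To make the effect of the three grouping sets (CategoryName, SubcategoryName), (CategoryName), and () concrete, the sketch below emulates them with three ordinary GROUP BY queries joined by UNION ALL. This is an emulation, not the SQL Server syntax from the excerpt: SQLite, used here so the example runs anywhere, does not support GROUPING SETS. The table name src and its rows are hypothetical stand-ins for the book's #src temporary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (CategoryName TEXT, SubcategoryName TEXT)")
# Hypothetical rows standing in for the book's #src table.
conn.executemany("INSERT INTO src VALUES (?, ?)", [
    ("Bikes", "Road"), ("Bikes", "Road"), ("Bikes", "Mountain"),
    ("Clothing", "Caps"),
])

# One SELECT per grouping set; NULL marks a column that was rolled up.
rows = set(conn.execute("""
    SELECT CategoryName, SubcategoryName, COUNT(*) FROM src
    GROUP BY CategoryName, SubcategoryName
    UNION ALL
    SELECT CategoryName, NULL, COUNT(*) FROM src
    GROUP BY CategoryName
    UNION ALL
    SELECT NULL, NULL, COUNT(*) FROM src
"""))

for row in sorted(rows, key=str):
    print(row)
```

The output contains the detail rows, one subtotal row per category, and a single grand-total row, which is what a GROUPING SETS query computes in one statement.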
eBook - ePub
Mastering PostgreSQL 9.6
- Hans-Jurgen Schonig(Author)
- 2017(Publication Date)
- Packt Publishing
  (Publisher)
might be right by doing guesswork.
Other database engines can accept aggregate functions without an OVER or even a GROUP BY clause. However, from a logical point of view this is wrong and, on top of that, a violation of SQL.

Partitioning data

So far the same result can also easily be achieved using a subselect. However, if you want more than just the overall average, subselects will turn your queries into nightmares. Suppose, you just don't want the overall average but the average of the country you are dealing with. A PARTITION BY clause is what you need:

test=# SELECT country, year, production, consumption,
              avg(production) OVER (PARTITION BY country)
       FROM t_oil;
 country | year | production | consumption |    avg
---------+------+------------+-------------+-----------
 Canada  | 1965 |        920 |        1108 | 2123.2173
 Canada  | 2010 |       3332 |        2316 | 2123.2173
 Canada  | 2009 |       3202 |        2190 | 2123.2173
 ...
 Iran    | 1966 |       2132 |         148 | 3631.6956
 Iran    | 2010 |       4352 |        1874 | 3631.6956
 Iran    | 2009 |       4249 |        2012 | 3631.6956
 ...

The point here is that each row is assigned the average of its country. The OVER clause defines the window we are looking at. In this case the window is the country the row belongs to. In other words, the query returns each row alongside a value computed over all rows of the same country.
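A small runnable sketch of the same idea, assuming SQLite 3.25 or later (the first release with window functions) instead of PostgreSQL. The six rows reuse production figures visible in the listing above, so the averages differ from the book's, which are computed over the full t_oil table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_oil (country TEXT, year INTEGER, production INTEGER)")
# Only the rows visible in the excerpt; the full table is not available here.
conn.executemany("INSERT INTO t_oil VALUES (?, ?, ?)", [
    ("Canada", 1965, 920), ("Canada", 2009, 3202), ("Canada", 2010, 3332),
    ("Iran", 1966, 2132), ("Iran", 2009, 4249), ("Iran", 2010, 4352),
])

# Every input row is kept; the per-country average is repeated on each one.
rows = conn.execute("""
    SELECT country, year, production,
           avg(production) OVER (PARTITION BY country) AS country_avg
    FROM t_oil
    ORDER BY country, year
""").fetchall()

for row in rows:
    print(row)
```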

Note that the year column is not sorted. The query does not contain an explicit sort order so it might happen that data is returned in random order. Remember, SQL does not promise sorted output unless you explicitly state what you want.
Basically, a PARTITION BY clause takes any expression. Usually most people will use a column to partition the data. Here is an example:
eBook - ePub
Mastering PostgreSQL 13
Build, administer, and maintain database applications efficiently with PostgreSQL 13, 4th Edition
- Hans-Jürgen Schönig(Author)
- 2020(Publication Date)
- Packt Publishing
  (Publisher)
Handling Advanced SQL

In Chapter 3 , Making Use of Indexes , you learned about indexing, as well as about PostgreSQL's ability to run custom indexing code to speed up queries. In this chapter, you will learn about advanced SQL. Most of the people who read this book will have some experience of using SQL. However, experience has shown that the advanced features outlined in this book are not widely known, and therefore it makes sense to cover them in this context to help people to achieve their goals faster and more efficiently. There has been a long discussion about whether the database is just a simple data store or whether the business logic should be in the database. Maybe this chapter will shed some light and show how capable a modern relational database really is. SQL is not what it used to be back when SQL-92 was around. Over the years, the language has grown and become more and more powerful.
This chapter is about modern SQL and its features. A variety of different and sophisticated SQL features are covered and presented in detail. We will cover the following topics in this chapter:

Introducing grouping sets

Using ordered sets

Understanding hypothetical aggregates

Utilizing windowing functions and analytics

Writing your own aggregates

By the end of this chapter, you will understand and be able to use advanced SQL.

Introducing grouping sets

Every advanced user of SQL should be familiar with the GROUP BY and HAVING clauses. But are they also aware of CUBE, ROLLUP, and
eBook - ePub
OCA: Oracle Database 12c Administrator Certified Associate Study Guide
Exams 1Z0-061 and 1Z0-062
- Biju Thomas(Author)
- 2014(Publication Date)
- Sybex
  (Publisher)
As you can see in the result, the space used by each schema in each tablespace is shown as well as the total space used in each tablespace and the total space used by each schema. The total space used in the database (including all tablespaces) is also shown in the very first line.

Three functions (GROUPING, GROUP_ID, and GROUPING_ID) can come in very handy when you’re using the ROLLUP and CUBE modifiers of the GROUP BY clause.

In the examples you have seen using the ROLLUP and CUBE modifiers, there was no way of telling which row was a subtotal and which row was a grand total. You can use the GROUPING function to overcome this problem. Review the following SQL example:

SELECT gender, marital_status, count(*) num_rec,
       GROUPING (gender) g_grp, GROUPING (marital_status) ms_grp
FROM oe.customers
GROUP BY CUBE(marital_status, gender);

G MARITAL_STATUS          NUM_REC      G_GRP     MS_GRP
- -------------------- ---------- ---------- ----------
                              319          1          1
F                             110          0          1
M                             209          0          1
  single                      139          1          0
F single                       47          0          0
M single                       92          0          0
  married                     180          1          0
F married                      63          0          0
M married                     117          0          0

The G_GRP column has a 1 for NULL values generated by the CUBE or ROLLUP modifier for the GENDER column. Similarly, the MS_GRP column has a 1 when NULL values are generated in the MARITAL_STATUS column. By using a DECODE function on the result of the GROUPING
eBook - ePub
Mastering PostgreSQL 10
- Hans-Jurgen Schonig(Author)
- 2018(Publication Date)
- Packt Publishing
  (Publisher)
This actually makes sense because the average has to be defined precisely. The database engine cannot just guess at any value. Other database engines can accept aggregate functions without an OVER or even a GROUP BY clause. However, from a logical point of view this is wrong, and on top of that, a violation of SQL.

Partitioning data

So far, the same result can also easily be achieved using a subselect. However, if you want more than just the overall average, subselects will turn your queries into nightmares. Suppose, you don't just want the overall average but the average of the country you are dealing with. A PARTITION BY clause is what you need:

test=# SELECT country, year, production, consumption,
              avg(production) OVER (PARTITION BY country)
       FROM t_oil;
 country | year | production | consumption |    avg
---------+------+------------+-------------+-----------
 Canada  | 1965 |        920 |        1108 | 2123.2173
 Canada  | 2010 |       3332 |        2316 | 2123.2173
 Canada  | 2009 |       3202 |        2190 | 2123.2173
 ...
 Iran    | 1966 |       2132 |         148 | 3631.6956
 Iran    | 2010 |       4352 |        1874 | 3631.6956
 Iran    | 2009 |       4249 |        2012 | 3631.6956
 ...

The point here is that each row is assigned the average of its country. The OVER clause defines the window we are looking at. In this case, the window is the country the row belongs to. In other words, the query returns each row alongside a value computed over all rows of the same country.
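It can help to contrast this with GROUP BY, which collapses each country into a single row. The sketch below runs both forms on four invented rows, assuming SQLite 3.25 or later rather than PostgreSQL; the table layout mirrors t_oil but the data is a hypothetical subset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_oil (country TEXT, year INTEGER, production INTEGER)")
# Hypothetical subset of rows for illustration.
conn.executemany("INSERT INTO t_oil VALUES (?, ?, ?)", [
    ("Canada", 2009, 3202), ("Canada", 2010, 3332),
    ("Iran", 2009, 4249), ("Iran", 2010, 4352),
])

# GROUP BY: one output row per country.
grouped = conn.execute("""
    SELECT country, avg(production)
    FROM t_oil GROUP BY country ORDER BY country
""").fetchall()

# PARTITION BY: all four rows survive, each carrying its country's average.
windowed = conn.execute("""
    SELECT country, year, avg(production) OVER (PARTITION BY country)
    FROM t_oil ORDER BY country, year
""").fetchall()

print(grouped)        # [('Canada', 3267.0), ('Iran', 4300.5)]
print(len(windowed))  # 4
```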

The year column is not sorted. The query does not contain an explicit sort order so it might be that data is returned in a random order. Remember, SQL does not promise sorted output unless you explicitly state what you want.
Basically, a PARTITION BY clause takes any expression. Usually, most people will use a column to partition the data. Here is an example:
eBook - ePub
Beginning Microsoft SQL Server 2012 Programming
- Paul Atkinson, Robert Vieira(Authors)
- 2012(Publication Date)
- Wrox
  (Publisher)
GROUP BY clause, these aggregators perform exactly as they did before, except that they return a count for each grouping rather than the full table. You can use this to get your number of reports:

SELECT ManagerID, COUNT(*) FROM HumanResources.Employee2 GROUP BY ManagerID;

Code snippet Chap03.sql

Notice that you are grouping only by the ManagerID — the COUNT() function is an aggregator and, therefore, does not have to be included in the GROUP BY clause.
ManagerID
----------- -----------
NULL        1
1           3
4           3
5           4

(4 row(s) affected)
Your results tell you that the manager with ManagerID 1 has three people reporting to him or her, that three people report to the manager with ManagerID 4, and that four people report to the manager with ManagerID 5. You are also able to tell that one Employee record had a NULL value in the ManagerID field. This employee apparently doesn’t report to anyone (hmmm, president of the company I suspect?).

It’s probably worth noting that you, technically speaking, could use a GROUP BY clause without any kind of aggregator, but this wouldn’t make sense. Why not? Well, SQL Server is going to wind up doing work on all the rows in order to group them, but functionally speaking you would get the same result with a DISTINCT option (which you’ll look at shortly), and it would operate much faster.
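Both observations can be checked with a short sketch: NULL values form their own group, and a GROUP BY without an aggregate returns the same rows as DISTINCT. The employee table below is a hypothetical stand-in for HumanResources.Employee2 with rows chosen to reproduce the counts shown above, and the example runs on SQLite rather than SQL Server, so it says nothing about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for HumanResources.Employee2.
conn.execute("CREATE TABLE employee (EmployeeID INTEGER PRIMARY KEY, ManagerID INTEGER)")
manager_ids = [None] + [1] * 3 + [4] * 3 + [5] * 4
conn.executemany(
    "INSERT INTO employee (ManagerID) VALUES (?)", [(m,) for m in manager_ids])

# NULL ManagerID values are gathered into one group of their own.
counts = conn.execute("""
    SELECT ManagerID, COUNT(*) FROM employee
    GROUP BY ManagerID ORDER BY ManagerID
""").fetchall()

# GROUP BY with no aggregate versus DISTINCT: same set of rows.
via_group_by = set(conn.execute("SELECT ManagerID FROM employee GROUP BY ManagerID"))
via_distinct = set(conn.execute("SELECT DISTINCT ManagerID FROM employee"))

print(counts)  # [(None, 1), (1, 3), (4, 3), (5, 4)]
print(via_group_by == via_distinct)  # True
```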
Now that you’ve seen how to operate with groups, let’s move on to one of the concepts that a lot of people have problems with. Of course, after reading the next section, you’ll think it’s a snap.

Placing Conditions on Groups with the HAVING Clause

Up to now, all of the conditions have been against specific rows. If a given column in a row doesn’t have a specific value or isn’t within a range of values, the entire row is left out. All of this happens before the groupings are really even thought about.
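The distinction between row-level and group-level conditions can be sketched as follows. The orders table and its amounts are invented for illustration, and SQLite stands in for SQL Server: a WHERE condition discards individual rows before any grouping happens, whereas a HAVING condition is applied to each group after aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: customer "a" has two small orders, "b" one large one.
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 60), ("a", 60), ("b", 150), ("c", 20)],
)

# WHERE: rows under 100 are removed first, so customer "a" never forms a group.
row_filtered = conn.execute("""
    SELECT customer, SUM(amount) FROM orders
    WHERE amount >= 100
    GROUP BY customer ORDER BY customer
""").fetchall()

# HAVING: all rows are grouped, then groups totalling under 100 are removed.
group_filtered = conn.execute("""
    SELECT customer, SUM(amount) FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 100
    ORDER BY customer
""").fetchall()

print(row_filtered)    # [('b', 150)]
print(group_filtered)  # [('a', 120), ('b', 150)]
```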

Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.
