In this article, we’ll explore how to remove duplicates in Power Query. We’ll cover basic and advanced methods for removing duplicates, including how to remove duplicates from single and multiple columns, how to remove duplicates case-insensitively, and how to keep values based on a condition.
We’ll also delve into an inconsistency problem, often related to sort order, that can occur when removing duplicates and how to solve it. By the end of this article, you’ll have a thorough understanding of how to remove duplicates in Power Query.
Table of contents
Importance of Removing Duplicates
Eliminating duplicate rows is a critical step when cleaning your data in Power Query. Duplicates can cause problems in many situations. For example, when:
- Performing a join: duplicates in your join table, result in duplicate rows in the destination table.
- Creating a dimension table: you can’t make a one-to-many relationship with duplicates on the one side.
- Inspecting Values: duplicates can be an indication that an earlier join had duplicate values.
To prevent this, we need an expression to check for duplicates and filter our dataset only to include unique values.
Basics for Removing Duplicates
So, how can you remove duplicates using the Power Query Editor? Under the hood, the operation uses the Table.Distinct function to remove duplicates. The syntax for the function is as follows:
Table.Distinct(
table as table,
optional equationCriteria as any
)
This function has two arguments: the first argument is the table to remove duplicates from, and the second argument is an optional equationCriteria that tells the function which columns to look at when removing duplicates. You can provide a column name, or instruct the function to respect or ignore capitals. If you omit the equationCriteria, the function will remove duplicates from the entire table.
Removing Duplicates from a Single Column
Let’s start with removing duplicates from a single column. This is one of the most basic methods for removing duplicates in Power Query, and it’s really easy to do.
To remove duplicates from a single column, you:
- Right-click on the column that contains duplicates
- Select the “Remove Duplicates” option.
- Now click OK.
That’s it! Power Query will remove all duplicates from the selected column.
The remove duplicates operation makes use of the Table.Distinct function in Power Query. When performing the operation on a single-column table, the above operation will show:
Table.Distinct( Source )
This removes duplicates from the entire table. When your table has multiple columns, the code below removes the duplicates from only the “Fruit” column.
Table.Distinct( Source, { "Fruit" } )
But what if you have duplicate values across multiple columns? No problem.
Removing Duplicates based on Multiple Columns
The easiest way to remove duplicates based on multiple columns is:
- Hold down the CTRL Key.
- Select the columns you want to remove duplicates from.
- Click on Remove Duplicates.
Now, Power Query will delete all rows with duplicated values across the selected columns. This operation focuses on returning unique combinations of values.
Table.Distinct(
Source,
{ "Brand", "Country"}
)
Performing this operation still uses the Table.Distinct function. This time, the second argument contains both column names within a list.
Tip
If you could use a refresher on list functions, have a look at this article that delves into use-cases and syntax.
So you know how to remove duplicates from one or more columns, but what if you want to remove duplicates case-insensitive?
Removing Duplicates Case-Insensitive
Removing duplicates is normally performed in a case-insensitive way, it ignores case. However, there are scenarios where you want to ignore capitalizations. For instance, observe the following table.
The dataset is unique when respecting capitalization, yet has quite a few duplicates when you ignore the capitals. A regular remove duplicates operation (case sensitive) only removes a row for the value ‘Cedar‘. To achieve this, you can use Table.Distinct function and provide the column name “Wood” in the second argument. The result is a case-sensitive removal of duplicates as below.
To remove duplicates in a manner that is case-insensitive, you can make use of the optional equation argument of Table.Distinct. It allows you to add a comparer function. To do that:
- Right-click the column that contains duplicates
- Select Remove Duplicates
- Provide the column name and Comparer.OrdinalIgnoreCase as equation criteria.
- Use format: { “ColumnName”, Comparer.OrdinalIgnoreCase }
The result is a case-insensitive removal of duplicates.
Table.Distinct(
Source,
{ "Wood", Comparer.OrdinalIgnoreCase }
)
But what if you want to apply this method to multiple columns?
Table.Distinct(
Source,
{ { "Wood", Comparer.OrdinalIgnoreCase },
{ "Area", Comparer.OrdinalIgnoreCase } }
)
You can use the Table.Distinct function and provide the method to look for duplicates for every column. This allows you to specify whether to ignore capitalization (Comparer.OrdinalIgnoreCase) or respect capitalization (Comparer.Ordinal) for each column individually.
So there you have it – some basic methods for removing duplicates in Power Query. Whether you’re working with a single column, with multiple columns, or whether you check capital sensitive, Power Query makes it easy to get rid of the duplicates and get your data in tip-top shape.
Caution
Be aware that the methods shown so far don’t always return the same results. Let’s see why that is in the next section.
Inconsistent Results
Did you know that the remove duplicates operation in Power Query often returns inconsistent results? It might sound strange, but it’s actually a side effect of certain optimizations that Power Query uses to be as efficient as possible.
One of these optimizations is Query Folding, which sends certain operations to the data source to be processed instead of in Power Query. While this can help make things faster and more efficient, it can also cause problems when it comes to removing duplicates. Without using additional measures, there’s no guarantee that removing duplicates returns the same records. That means you can’t assume that removing duplicates will return only the first duplicate row in your dataset.
But there’s a simple solution to this randomness problem. To make the remove duplicates operation predictable, you can use Table.Buffer. This function forces Power Query to cache the data in memory, which means that you can control which records are returned when removing duplicates.
So, if you want to ensure that you’re getting consistent results when removing duplicates, be sure to use Table.Buffer. We’ll dive deeper into this topic and explore advanced methods in the next chapter.
Advanced Techniques for Removing Duplicates
Sometimes, removing duplicates isn’t as straightforward as it seems, and you may need more advanced methods to get the job done. That’s where this chapter comes in – we’re going to explore some advanced methods for removing duplicates in Power Query.
In this section, we’ll dive into how you can remove duplicates based on a condition. We’ll look at retrieving the first or last value, the earliest or latest value and the lowest or highest value.
Keep First or Last Value
First, let’s talk about how to delete duplicates and keep the first or last value in your dataset. By default, Table.Distinct returns the first duplicate row it finds. But as we learned in the previous chapter, the sorting order of your table isn’t guaranteed, which can cause issues.
To ensure that we’re keeping the first or last value, we need to sort our data and make sure it stays sorted.
To remove duplicates and retrieve the first value in a dataset, you:
- Sort your data
- Select columns to remove duplicates from
- Wrap the data in Table.Buffer
- Remove duplicates
We can do that with the following dataset:
Prepare your data by sorting the table and wrapping it in Table.Buffer. You can then select the Category column and click Remove duplicates, returning:
You will end up removing duplicates but keeping the first instance within your category. So to return the first value when removing duplicates, you sort your code with:
Table.Buffer(
Table.Sort( Dataset,
{ {"Category", Order.Ascending },
{"Description", Order.Ascending } } ) )
If you want to keep the last value instead, you can simply sort your data in descending order by the column you want to keep the last value of. Then, wrap your data in Table.Buffer, and remove duplicates.
Table.Buffer(
Table.Sort( Dataset,
{ {"Category", Order.Ascending },
{"Description", Order.Descending } } ) )
Keep Earliest or Latest Value
In case you want to return the earliest or the latest value, you can perform something similar. To return unique rows for the category with the latest date, you:
- Sort your data by Category and in a Descending order for your Date Column.
- Remove Duplicates from the relevant column(s).
// Sorts and buffers data, so removing duplicates returns the latest category row
Table.Buffer(
Table.Sort( Dataset,
{ {"Category", Order.Ascending },
{"Date", Order.Descending } } ) )
// Sorts and buffers data, so removing duplicates returns the earliest category row
Table.Buffer(
Table.Sort( Dataset,
{ {"Category", Order.Ascending },
{"Date", Order.Ascending } } ) )
Keep Lowest or Highest Value
The last advanced method returns the lowest or highest value within a group. We could make use of the above method to achieve this. Yet, since I’m certain you will know how to sort your data in a similar way, let’s explore an alternative.
To remove duplicates and return the lowest value, you can:
- Group your data.
- Perform an All Rows operation.
- Adjust the M-code to return the row with the lowest value.
- Expand the record column to get your result.
Let’s see how that can work. The Group By operation returns only unique rows for the columns selected. To group your data, go to the Home tab, and select Group By.
To keep the lowest value for each Category, you can group your data by Category and perform an All Rows operation. This creates a table object with all summarized rows for each Category. This produces below code:
We’re going to change the formula above.
Tip
A useful function is Table.Min. The Table.Min function returns the table row that contains the lowest value of a column. In other words, you can delete duplicate rows based on the column value.
To return the lowest value, you can adjust the code like in the below picture. Then, expand the record column by selecting the arrows in the column header.
After clicking OK, you’ll end up with the desired result:
To instead keep the highest value when removing duplicates, you can simply swap out Table.Min for Table.Max. That means you either use:
// Sorts and buffers data, so removing duplicates returns the latest category row
Table.Group( Dataset,
{"Category"},
{{"Details", each Table.Min( _, "Amount" ) , type record }} )
// Sorts and buffers data, so removing duplicates returns the earliest category row
Table.Group( Dataset,
{"Category"},
{{"Details", each Table.Max( _, "Amount" ) , type record }} )
You can learn more about the Group By Operation or also read into the All Rows operation and how it works.
Conclusion
And there, you have everything you need to know about removing duplicates in Power Query. Whether you’re dealing with a single column or multiple columns, case-sensitive or case-insensitive duplicates, or advanced methods for keeping the first, last, earliest, latest, lowest or highest values, Power Query has you covered.
Remember, removing duplicates is an important step in cleaning your data. By removing duplicates, you can avoid issues with joins, dimension tables, and prepare your data for your data model.
Just be aware of the randomness problem that can occur with the Table.Distinct function, and use Table.Buffer to ensure consistent results.
Frequently Asked Questions
The most common operations to remove duplicates are by using the Remove Duplicates operation using Table.Distinct or by using a Group By operation using Table.Group.
Power Query always keeps the first value it finds. However, certain optimizations make it that the sort order of your table is not guaranteed. To ensure proper sort order, consider buffering your table using Table.Buffer.
Yes. Since Power Query keeps the first record it finds, you can sort your data before removing duplicates. With the right sorting, you can retrieve the first, last, latest, earlier, highest or lowest value.
Recommended Reading…
Power Query Foundations
Beginners Guide to Power Query
List Functions (200+ Examples)
Text Functions (150+ Examples)
Creating Tables (40+ Examples)
Generate the Ultimate Date Table
Advanced Topics
Master List.Generate with Easy Examples
Deep Dive into List.Accumulate
I’m using power bi dataflow, and my query folding is broken, is possible to fix that ?
I have some dataset like some row data begin with KR-1234-5678-9012 and some row data begins with 12345, EDFCH127HGYU. There are some duplicacy only in KR no which I want to remove. Please tell me power query function to do this task.
Sorry, ondanks deze zeer duidelijke uitleg kom ik er echt niet uit. Ik weet dat het echt iets kleins moet zijn. Wat is de syntax van het commando als ik wil dat mijn hele tabel ontdubbeld wordt?
Mijn tabel heet ING. Ik heb geprobeerd Tabel.Distinct(ING), Tabel.Distinct(“ING”), niet goed…
Rick sometimes I get into problems that I don’t know how to solve in Power BI/Power Query and I find articles that are not so clear or well constructed, this article is detailed and the examples are so great! Thank you so much I really appreciate the help this article gave me
You’re so welcome!
If you ever put a reference book together for all your PQ knowledge – I’ll buy it! Thanks, this cleared up multiple issues. Cheers, Emile
Thanks Emile – We released a book on this just this month: ‘The Definitive Guide to Power Query M’. You can buy it here: https://amzn.to/3xurWW0
Hope you like it! Rick
Thank you so much!!
Really appreciated this write-up. It helped me figure something out. Thanks!
Thanks Rick! I used the max option successfully!
However, is there a way to use a condition instead of the max/min? For example keep rows where [column] is not 0 ?
Hi Tanya – you could use a regular filter to filter rows <> 0. Would that work for your case?
I was wondering if you can use Table.StopFolding() to keep the sort when removing duplicates. I guess buffering a huge table can hurt performance. What do you think?
Unfortunately, you can’t. Table.StopFolding is purely meant to stop the engine from transforming your steps into SQL. Table.Buffer actually saves the values into memory so it remains constant for your remove duplicates operation.
Hey Rick, that was an absolutely amazing article on Duplicates. There is small typo for Table.Min which you mentioned to return highest value instead of lowest value.
Thanks so much Ali. That typo slipped when writing the article. I changed it together with the image.
Thanks Rick, another great article. I have done advanced duplicate filtering with the group table function and now I know that I could have obtained the same with the distinct table function with improved performance.
Thanks Stijn. I’ve not tested performance difference between the grouping and Table.Buffer. If you learn more, I’d love to hear more about it. Cheers!