Written By Rick de Groot

Rick is the founder of BI Gorilla. He believes learning is one of life's greatest pleasures and shares his knowledge to help you improve your skills.

In this article, we’ll explore how to remove duplicates in Power Query. We’ll cover basic and advanced methods for removing duplicates, including how to remove duplicates from single and multiple columns, how to remove duplicates case-insensitively, and how to keep values based on a condition.

We’ll also delve into the inconsistency problem that can occur when removing duplicates and how to solve it. By the end of this article, you’ll have a thorough understanding of how to remove duplicates in Power Query. So, let’s dive in!

Table of contents

1. Importance of Removing Duplicates

Removing duplicates is a critical step when cleaning your data in Power Query. Duplicates can cause problems in many situations. For example, when

  • Performing a join: duplicates in your join table, result in duplicate rows in the destination table.
  • Creating a dimension table: you can’t make a one-to-many relationship with duplicates on the one side.
  • Inspecting Values: duplicates can be an indication that an earlier join had duplicate values.

To prevent this, we want to check for duplicates and filter our dataset only to include unique values.

2. Basics for Removing Duplicates

So, how does Power Query remove duplicates using the Power Query Editor? Under the hood, Power Query uses the Table.Distinct function to remove duplicates. The syntax for the function is as follows:

= Table.Distinct(
    table as table, 
    optional equationCriteria as any

This function has two arguments: the first argument is the table to remove duplicates from, and the second argument is an optional equationCriteria that tells the function which columns to look at when removing duplicates. You can provide a column name, or instruct the function to respect or ignore capitals. If you omit the equationCriteria, the function will remove duplicates from the entire table.

2.1. Removing Duplicates from a Single Column

Let’s start with removing duplicates from a single column. This is one of the most basic methods for removing duplicates in Power Query, and it’s really easy to do.

To remove duplicates from a single column, you:

  • Right-click on the column that contains duplicates
  • Select the “Remove Duplicates” option.
  • Now click OK.

That’s it! Power Query will remove all duplicates from the selected column.

The remove duplicates operations makes use of the Table.Distinct function in Power Query. When performing the operation on a single-column table, the above operation will show:

= Table.Distinct( Source )

This removes duplicates from the entire table. When your table has multiple columns, the code below removes the duplicates from only the “Fruit” column.

= Table.Distinct( Source, { "Fruit" } )

2.2. Removing Duplicates from Multiple Columns

But what if you have duplicates across multiple columns? No problem! The easiest way to remove duplicates from multiple columns is:

  • Hold down the CTRL Key.
  • Select the columns you want to remove duplicates from.
  • Click on Remove Duplicates.

Now, Power Query will remove all rows with duplicated values across the selected columns.

= Table.Distinct(
     { "Brand", "Country"} )

Performing this operation still uses the Table.Distinct function. This time, the second argument contains both column names within a list. To learn more about list use cases and syntax, read this article.

2.3. Removing Duplicates Case-Insensitive

So you know how to remove duplicates from one or more columns, but what if you want to remove duplicates case-insensitive?

To remove duplicates in a manner that is case-insensitive, you can make use of the optional equation argument of Table.Distinct. Whether you’re working with one or multiple columns or even the entire table, you can easily remove duplicates case-insensitive.

So, what if you have a dataset that contains duplicates, and you want to perform a case-sensitive removing duplicates operation on the Wood column? Removing duplicates (case sensitive) then only removes a row for the value ‘Cedar‘.

Data with Duplicates in Power Query

To achieve this, you can use Table.Distinct function and provide the column name “Wood” in the second argument. The result is a case-sensitive removal of duplicates as below.

Remove Duplicates Case Sensitive in Power Query

Now, let’s say you want to ignore the capitalization when removing duplicates. The easiest way to remove duplicates and ignore case is by adding a comparer function:

  • Right-click on the column that contains duplicates
  • Select Remove Duplicates
  • Provide the column name and Comparer.OrdinalIgnoreCase as equation criteria.
  • Use format: { “ColumnName”, Comparer.OrdinalIgnoreCase }

The result is a case-insensitive removal of duplicates.

= Table.Distinct(
   { "Wood", Comparer.OrdinalIgnoreCase } )

But what if you want to apply this method to multiple columns?

= Table.Distinct(
     { { "Wood", Comparer.OrdinalIgnoreCase },
       { "Area", Comparer.OrdinalIgnoreCase } } )
Remove Duplicates Cases Insensitive on Multiple Columns in Power Query

You can use the Table.Distinct function and provide the method to look for duplicates for every column. This allows you to specify whether to ignore capitalization (Comparer.OrdinalIgnoreCase) or respect capitalization (Comparer.Ordinal) for each column individually.

So there you have it – some basic methods for removing duplicates in Power Query. Whether you’re working with a single column, multiple columns, or whether you check capital sensitive, Power Query makes it easy to get rid of the duplicates and get your data in tip-top shape.

But be aware that the methods shown so far don’t always return the same results. Let’s see why that is in the next section.

3. Inconsistent Results

Did you know that the remove duplicates operation in Power Query often returns inconsistent results? It might sound strange, but it’s actually a side effect of certain optimizations that Power Query uses to be as efficient as possible.

One of these optimizations is Query Folding, which sends certain operations to the data source to be processed instead of in Power Query. While this can help make things faster and more efficient, it can also cause problems when it comes to removing duplicates. Without using additional measures, there’s no guarantee which records will be returned when removing duplicates. That means you can’t assume that removing duplicates will return only the first duplicate row in your dataset.

But there’s a simple solution to this randomness problem. To make the remove duplicates operation predictable, you can use Table.Buffer. This function forces Power Query to cache the data in memory, which means that you can control which records are returned when removing duplicates.

So, if you want to ensure that you’re getting consistent results when removing duplicates, be sure to use Table.Buffer. We’ll dive deeper into this topic and explore advanced methods in the next chapter.

4. Advanced Techniques for Removing Duplicates

Sometimes, removing duplicates isn’t as straightforward as it seems, and you may need more advanced methods to get the job done. That’s where this chapter comes in – we’re going to explore some advanced methods for removing duplicates in Power Query.

In this chapter, we’ll dive into how you can remove duplicates based on a condition. We’ll look at retrieving the first or last value, the earliest or latest value and the lowest or highest value.

4.1. Keep First or Last Value

First, let’s talk about how to keep the first or last value in your dataset when removing duplicates. By default, Table.Distinct returns the first duplicate row it finds. But as we learned in the previous chapter, the sorting order of your table isn’t guaranteed, which can cause issues.

So, to ensure that we’re keeping the first or last value, we need to sort our data and make sure it stays sorted.

To remove duplicates and retrieve the first value in a dataset, you:

  • Sort your data
  • Select columns to remove duplicates from
  • Wrap the data in Table.Buffer
  • Remove duplicates

You can do that with this dataset.

Prepare your data by sorting the table and wrapping it in Table.Buffer. You can then select the Category column and click Remove duplicates, returning:

You will end up removing duplicates but keeping the first instance within your category. So to return the first value when removing duplicates, you sort your code with:

= Table.Buffer( 
    Table.Sort( Dataset,
                { {"Category",    Order.Ascending },
                  {"Description", Order.Ascending } } ) )

If you want to keep the last value instead, you can simply sort your data in descending order by the column you want to keep the last value of. Then, wrap your data in Table.Buffer, and remove duplicates.

= Table.Buffer( 
    Table.Sort( Dataset,
                { {"Category",    Order.Ascending },
                  {"Description", Order.Descending } } ) )

4.2. Keep Earliest or Latest Value

In case you want to return the earliest or the latest value, you can perform something similar. To return unique rows for the category with the latest date, you:

  1. Sort your data by Category and in a Descending order for your Date Column.
  2. Remove Duplicates from the relevant column(s).
// Sorts and buffers data, so removing duplicates returns the latest category row
= Table.Buffer( 
    Table.Sort( Dataset,
                { {"Category",    Order.Ascending }, 
                  {"Date",        Order.Descending } } ) )

// Sorts and buffers data, so removing duplicates returns the earliest category row
= Table.Buffer( 
    Table.Sort( Dataset,
                { {"Category",    Order.Ascending }, 
                  {"Date",        Order.Ascending } } ) )

4.3. Keep Lowest or Highest Value

The last advanced method returns the lowest or highest value within a group. We could make use of the above method to achieve this. Yet, since I’m certain you will know how to sort your data in a similar way, let’s explore an alternative.

To remove duplicates and return the lowest value, you can:

  1. Group your data.
  2. Perform an All Rows operation.
  3. Adjust the M-code to return the row with the lowest value.
  4. Expand the record column to get your result.

Let’s see how that can work. The Group By operation returns only unique rows for the columns selected.

To group your data, go to the Home Tab, and select Group By.

To keep the lowest value for each Category, you can group your data by Category and perform an All Rows operation. This creates a table object with all summarized rows for each Category. This produces below code:

We’re going to change the formula above. A useful function is Table.Min. The Table.Min function returns the table row that contains the lowest value of a column. In other words, you can delete duplicate rows based on the column value.

To achieve this, adjust the code like in the below picture. Then, expand the record column by selecting the arrows in the column header.

Return Row with Lowest Value using Table.Group in Power Query

After clicking OK, you’ll end up with the desired result:

To instead keep the highest value when removing duplicates, you can simply swap out Table.Min for Table.Max. That means you either use:

// Sorts and buffers data, so removing duplicates returns the latest category row
= Table.Group( Dataset, 
    {{"Details", each Table.Min( _, "Amount" ) , type record }} )

// Sorts and buffers data, so removing duplicates returns the earliest category row
= Table.Group( Dataset, 
    {{"Details", each Table.Max( _, "Amount" ) , type record }} )

You can learn more about the Group By Operation by reading this article. You can also read into the All Rows operation and how it works.

5. Conclusion

And there, you have everything you need to know about removing duplicates in Power Query! Whether you’re dealing with a single column or multiple columns, case-sensitive or case-insensitive duplicates, or advanced methods for keeping the first, last, earliest, latest, lowest or highest values, Power Query has you covered.

Remember, removing duplicates is a crucial step in cleaning your data and getting it ready for analysis. By removing duplicates, you can avoid issues with joins, dimension tables, and incorrect data insights.

Just be aware of the randomness problem that can occur with the Table.Distinct function, and use Table.Buffer to ensure consistent results.

Happy querying!

Share on:

Latest from my blog

  1. Thanks Rick! I used the max option successfully!

    However, is there a way to use a condition instead of the max/min? For example keep rows where [column] is not 0 ?

  2. Hey Rick, that was an absolutely amazing article on Duplicates. There is small typo for Table.Min which you mentioned to return highest value instead of lowest value.

  3. Thanks Rick, another great article. I have done advanced duplicate filtering with the group table function and now I know that I could have obtained the same with the distinct table function with improved performance.

    • Thanks Stijn. I’ve not tested performance difference between the grouping and Table.Buffer. If you learn more, I’d love to hear more about it. Cheers!


Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.