Q: Dedup using SQL on a huge data set (1 billion records)

I am running into out-of-memory issues while trying to dedup a table that holds a huge amount of data.

Scenario:

Column A      |    Column B ( Date )

  Value1            Date1
  Value1            Date2
  Value2            Date3
  Value2            Date4

I need to dedup on these two columns: for each value in column A, I need to keep the latest record according to column B.

Let's say Date2 and Date4 are the latest dates. My output should be:

Column A      |    Column B ( Date )

  Value1            Date2
  Value2            Date4

Currently I am using the query below, which works. Is there a better way of doing this that uses less memory?

CREATE TABLE UNIQUE_TABLENAME AS (
SELECT a.column_a, a.column_b, a.column_c, a.column_d
FROM tablename a,
(SELECT column_a, MAX(column_b) AS column_b FROM tablename GROUP BY column_a) b
WHERE a.column_a = b.column_a
AND a.column_b = b.column_b)

Thanks in advance!

Answer 1:
select distinct on (col_a) 
    col_a as value, col_b as "date"
from t
order by col_a, col_b desc

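Check DISTINCT ON: it keeps only the first row for each col_a value in the ORDER BY order, so sorting by col_b desc keeps the row with the latest date.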
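
DISTINCT ON is a PostgreSQL extension, so for engines that lack it, the same "latest row per col_a" result can be sketched with the standard ROW_NUMBER() window function. This is a different technique from the answer's query, not a restatement of it. The snippet below is only an illustration: it runs against an in-memory SQLite database (assuming a build with window-function support, 3.25 or newer) purely so it is self-contained, and the sample dates are hypothetical stand-ins for Date1 through Date4. It makes no claim about memory use on a billion-row table.

```python
import sqlite3

# Hypothetical sample data mirroring the question's example.
# ISO-formatted date strings sort chronologically as text.
rows = [
    ("Value1", "2020-01-01"),  # Date1
    ("Value1", "2020-02-01"),  # Date2 (latest for Value1)
    ("Value2", "2020-03-01"),  # Date3
    ("Value2", "2020-04-01"),  # Date4 (latest for Value2)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col_a TEXT, col_b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Number the rows within each col_a group, newest first,
# then keep only the first row of every group.
DEDUP_SQL = """
SELECT col_a, col_b
FROM (
    SELECT col_a, col_b,
           ROW_NUMBER() OVER (PARTITION BY col_a ORDER BY col_b DESC) AS rn
    FROM t
) AS ranked
WHERE rn = 1
ORDER BY col_a
"""

latest = conn.execute(DEDUP_SQL).fetchall()
print(latest)  # [('Value1', '2020-02-01'), ('Value2', '2020-04-01')]
```

Unlike the join on max(column b) in the question, which returns every row tied on the latest date, ROW_NUMBER() returns exactly one row per col_a value.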

postgresql  amazon-redshift