什么是最有效的方法来擦除重复和排序一个向量?

我需要采取一个c++向量与潜在的很多元素,擦除重复,并排序。

我目前有下面的代码,但它不起作用。

vec.erase(
std::unique(vec.begin(), vec.end()),
vec.end());
std::sort(vec.begin(), vec.end());

我怎样才能正确地做到这一点呢?

此外,是先删除副本(类似于上面的编码)还是先执行排序更快?如果我先执行排序,它是否保证在std::unique执行后保持排序?

或者是否有另一种(也许更有效的)方法来完成这一切?

522481 次浏览

std::unique只适用于重复元素的连续运行,所以你最好先排序。但是,它是稳定的,所以你的向量是有序的。

在调用unique之前需要对它进行排序,因为unique只删除彼此相邻的重复项。

编辑:38秒……

unique只删除连续重复的元素(这对于它在线性时间内运行是必要的),所以你应该先执行排序。它将在调用unique后保持排序。

std::unique只删除重复的元素,如果它们是邻居:你必须先对向量排序,然后它才能像你想要的那样工作。

std::unique被定义为稳定的,因此在vector上惟一地运行后仍将被排序。

如前所述,unique需要一个已排序的容器。此外,unique实际上并不从容器中删除元素。相反,它们被复制到末尾,unique返回一个指向第一个重复元素的迭代器,并且您需要调用erase来实际删除这些元素。

我不确定你用它来做什么,所以我不能100%确定地说,但通常当我想到“排序的,唯一的”容器时,我想到std::集。它可能更适合你的用例:

std::set<Foo> foos(vec.begin(), vec.end()); // both sorted & unique already

否则,在调用unique之前进行排序(正如其他答案所指出的那样)才是正确的方法。

效率是一个复杂的概念。有时间和空间的考虑,以及一般的测量(你只能得到模糊的答案,如O(n))和特定的(例如冒泡排序可以比快速排序快得多,这取决于输入特征)。

如果你有相对较少的副本,那么排序,然后唯一和擦除似乎是要走的路。如果您有相对较多的副本,则从向量创建一个集合,并让它完成繁重的工作,可以轻松击败它。

也不要只关注时间效率。Sort+unique+erase操作在O(1)空间,而set构造操作在O(n)空间。而且两者都不直接用于映射减少并行化(对于真正的巨大的数据集)。

我同意r .脑袋托德•加德纳;a std::set在这里可能是一个好主意。即使你在使用向量时遇到了困难,如果你有足够多的副本,你最好创建一个集合来做这些肮脏的工作。

让我们来比较三种方法:

用向量,sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

转换为set(手动)

set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );

转换为set(使用构造函数)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

下面是它们在重复数量变化时的表现:

 compare of vector and set methods

总结:当重复的数量足够大时,它实际上更快地转换为一个集合,然后将数据转储回一个向量. 0。

出于某种原因,手动进行set转换似乎比使用set构造函数更快——至少在我使用的随机数据上是这样。

这里有一个模板来帮你做这件事:

template<typename T>
void removeDuplicates(std::vector<T>& vec)
{
std::sort(vec.begin(), vec.end());
vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

这样称呼它:

removeDuplicates<int>(vectorname);

Nate Kohl建议的标准方法,使用vector, sort + unique:

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

对指针向量不起作用。

仔细看这个例子来自cplusplus.com

在他们的例子中,移到最后的“所谓的副本”实际上显示为?(未定义的值),因为这些“所谓的重复”有时是“额外的元素”,有时是原始向量中的“缺失元素”。

当对指向对象的指针向量使用std::unique()时,会出现一个问题(内存泄漏,从HEAP读取数据的错误,重复释放,这会导致分割错误等)。

下面是我的解决方案:用ptgi::unique()替换std::unique()

请看下面的ptgi_unique.hpp文件:

// ptgi::unique()
//
// Fix a problem in std::unique(), such that none of the original elts in the collection are lost or duplicate.
// ptgi::unique() has the same interface as std::unique()
//
// There is the 2 argument version which calls the default operator== to compare elements.
//
// There is the 3 argument version, which you can pass a user defined functor for specialized comparison.
//
// ptgi::unique() is an improved version of std::unique() which doesn't looose any of the original data
// in the collection, nor does it create duplicates.
//
// After ptgi::unique(), every old element in the original collection is still present in the re-ordered collection,
// except that duplicates have been moved to a contiguous range [dupPosition, last) at the end.
//
// Thus on output:
//  [begin, dupPosition) range are unique elements.
//  [dupPosition, last) range are duplicates which can be removed.
// where:
//  [] means inclusive, and
//  () means exclusive.
//
// In the original std::unique() non-duplicates at end are moved downward toward beginning.
// In the improved ptgi:unique(), non-duplicates at end are swapped with duplicates near beginning.
//
// In addition if you have a collection of ptrs to objects, the regular std::unique() will loose memory,
// and can possibly delete the same pointer multiple times (leading to SEGMENTATION VIOLATION on Linux machines)
// but ptgi::unique() won't.  Use valgrind(1) to find such memory leak problems!!!
//
// NOTE: IF you have a vector of pointers, that is, std::vector<Object*>, then upon return from ptgi::unique()
// you would normally do the following to get rid of the duplicate objects in the HEAP:
//
//  // delete objects from HEAP
//  std::vector<Object*> objects;
//  for (iter = dupPosition; iter != objects.end(); ++iter)
//  {
//      delete (*iter);
//  }
//
//  // shrink the vector. But Object * pointers are NOT followed for duplicate deletes, this shrinks the vector.size())
//  objects.erase(dupPosition, objects.end));
//
// NOTE: But if you have a vector of objects, that is: std::vector<Object>, then upon return from ptgi::unique(), it
// suffices to just call vector:erase(, as erase will automatically call delete on each object in the
// [dupPosition, end) range for you:
//
//  std::vector<Object> objects;
//  objects.erase(dupPosition, last);
//
//==========================================================================================================
// Example of differences between std::unique() vs ptgi::unique().
//
//  Given:
//      int data[] = {10, 11, 21};
//
//  Given this functor: ArrayOfIntegersEqualByTen:
//      A functor which compares two integers a[i] and a[j] in an int a[] array, after division by 10:
//
//  // given an int data[] array, remove consecutive duplicates from it.
//  // functor used for std::unique (BUGGY) or ptgi::unique(IMPROVED)
//
//  // Two numbers equal if, when divided by 10 (integer division), the quotients are the same.
//  // Hence 50..59 are equal, 60..69 are equal, etc.
//  struct ArrayOfIntegersEqualByTen: public std::equal_to<int>
//  {
//      bool operator() (const int& arg1, const int& arg2) const
//      {
//          return ((arg1/10) == (arg2/10));
//      }
//  };
//
//  Now, if we call (problematic) std::unique( data, data+3, ArrayOfIntegersEqualByTen() );
//
//  TEST1: BEFORE UNIQ: 10,11,21
//  TEST1: AFTER UNIQ: 10,21,21
//  DUP_INX=2
//
//      PROBLEM: 11 is lost, and extra 21 has been added.
//
//  More complicated example:
//
//  TEST2: BEFORE UNIQ: 10,20,21,22,30,31,23,24,11
//  TEST2: AFTER UNIQ: 10,20,30,23,11,31,23,24,11
//  DUP_INX=5
//
//      Problem: 21 and 22 are deleted.
//      Problem: 11 and 23 are duplicated.
//
//
//  NOW if ptgi::unique is called instead of std::unique, both problems go away:
//
//  DEBUG: TEST1: NEW_WAY=1
//  TEST1: BEFORE UNIQ: 10,11,21
//  TEST1: AFTER UNIQ: 10,21,11
//  DUP_INX=2
//
//  DEBUG: TEST2: NEW_WAY=1
//  TEST2: BEFORE UNIQ: 10,20,21,22,30,31,23,24,11
//  TEST2: AFTER UNIQ: 10,20,30,23,11,31,22,24,21
//  DUP_INX=5
//
//  @SEE: look at the "case study" below to understand which the last "AFTER UNIQ" results with that order:
//  TEST2: AFTER UNIQ: 10,20,30,23,11,31,22,24,21
//
//==========================================================================================================
// Case Study: how ptgi::unique() works:
//  Remember we "remove adjacent duplicates".
//  In this example, the input is NOT fully sorted when ptgi:unique() is called.
//
//  I put | separatators, BEFORE UNIQ to illustrate this
//  10  | 20,21,22 |  30,31 |  23,24 | 11
//
//  In example above, 20, 21, 22 are "same" since dividing by 10 gives 2 quotient.
//  And 30,31 are "same", since /10 quotient is 3.
//  And 23, 24 are same, since /10 quotient is 2.
//  And 11 is "group of one" by itself.
//  So there are 5 groups, but the 4th group (23, 24) happens to be equal to group 2 (20, 21, 22)
//  So there are 5 groups, and the 5th group (11) is equal to group 1 (10)
//
//  R = result
//  F = first
//
//  10, 20, 21, 22, 30, 31, 23, 24, 11
//  R    F
//
//  10 is result, and first points to 20, and R != F (10 != 20) so bump R:
//       R
//       F
//
//  Now we hits the "optimized out swap logic".
//  (avoid swap because R == F)
//
//  // now bump F until R != F (integer division by 10)
//  10, 20, 21, 22, 30, 31, 23, 24, 11
//       R   F              // 20 == 21 in 10x
//       R       F              // 20 == 22 in 10x
//       R           F          // 20 != 30, so we do a swap of ++R and F
//  (Now first hits 21, 22, then finally 30, which is different than R, so we swap bump R to 21 and swap with  30)
//  10, 20, 30, 22, 21, 31, 23, 24, 11  // after R & F swap (21 and 30)
//           R       F
//
//  10, 20, 30, 22, 21, 31, 23, 24, 11
//           R          F           // bump F to 31, but R and F are same (30 vs 31)
//           R               F      // bump F to 23, R != F, so swap ++R with F
//  10, 20, 30, 22, 21, 31, 23, 24, 11
//                  R           F       // bump R to 22
//  10, 20, 30, 23, 21, 31, 22, 24, 11  // after the R & F swap (22 & 23 swap)
//                  R            F      // will swap 22 and 23
//                  R                F      // bump F to 24, but R and F are same in 10x
//                  R                    F  // bump F, R != F, so swap ++R  with F
//                      R                F  // R and F are diff, so swap ++R  with F (21 and 11)
//  10, 20, 30, 23, 11, 31, 22, 24, 21
//                      R                F  // aftter swap of old 21 and 11
//                      R                  F    // F now at last(), so loop terminates
//                          R               F   // bump R by 1 to point to dupPostion (first duplicate in range)
//
//  return R which now points to 31
//==========================================================================================================
// NOTES:
// 1) the #ifdef IMPROVED_STD_UNIQUE_ALGORITHM documents how we have modified the original std::unique().
// 2) I've heavily unit tested this code, including using valgrind(1), and it is *believed* to be 100% defect-free.
//
//==========================================================================================================
// History:
//  130201  dpb dbednar@ptgi.com created
//==========================================================================================================


#ifndef PTGI_UNIQUE_HPP
#define PTGI_UNIQUE_HPP


// Created to solve memory leak problems when calling std::unique() on a vector<Route*>.
// Memory leaks discovered with valgrind and unitTesting.




#include <algorithm>        // std::swap


// instead of std::myUnique, call this instead, where arg3 is a function ptr
//
// like std::unique, it puts the dups at the end, but it uses swapping to preserve original
// vector contents, to avoid memory leaks and duplicate pointers in vector<Object*>.


#ifdef IMPROVED_STD_UNIQUE_ALGORITHM
#error the #ifdef for IMPROVED_STD_UNIQUE_ALGORITHM was defined previously.. Something is wrong.
#endif


#undef IMPROVED_STD_UNIQUE_ALGORITHM
#define IMPROVED_STD_UNIQUE_ALGORITHM


// similar to std::unique, except that this version swaps elements, to avoid
// memory leaks, when vector contains pointers.
//
// Normally the input is sorted.
// Normal std::unique:
// 10 20 20 20 30   30 20 20 10
// a  b  c  d  e    f  g  h  i
//
// 10 20 30 20 10 | 30 20 20 10
// a  b  e  g  i    f  g  h  i
//
// Now GONE: c, d.
// Now DUPS: g, i.
// This causes memory leaks and segmenation faults due to duplicate deletes of same pointer!




namespace ptgi {


// Return the position of the first in range of duplicates moved to end of vector.
//
// uses operator==  of class for comparison
//
// @param [first, last) is a range to find duplicates within.
//
// @return the dupPosition position, such that [dupPosition, end) are contiguous
// duplicate elements.
// IF all items are unique, then it would return last.
//
template <class ForwardIterator>
ForwardIterator unique( ForwardIterator first, ForwardIterator last)
{
// compare iterators, not values
if (first == last)
return last;


// remember the current item that we are looking at for uniqueness
ForwardIterator result = first;


// result is slow ptr where to store next unique item
// first is  fast ptr which is looking at all elts


// the first iterator moves over all elements [begin+1, end).
// while the current item (result) is the same as all elts
// to the right, (first) keeps going, until you find a different
// element pointed to by *first.  At that time, we swap them.


while (++first != last)
{
if (!(*result == *first))
{
#ifdef IMPROVED_STD_UNIQUE_ALGORITHM
// inc result, then swap *result and *first


//          THIS IS WHAT WE WANT TO DO.
//          BUT THIS COULD SWAP AN ELEMENT WITH ITSELF, UNCECESSARILY!!!
//          std::swap( *first, *(++result));


// BUT avoid swapping with itself when both iterators are the same
++result;
if (result != first)
std::swap( *first, *result);
#else
// original code found in std::unique()
// copies unique down
*(++result) = *first;
#endif
}
}


return ++result;
}


template <class ForwardIterator, class BinaryPredicate>
ForwardIterator unique( ForwardIterator first, ForwardIterator last, BinaryPredicate pred)
{
if (first == last)
return last;


// remember the current item that we are looking at for uniqueness
ForwardIterator result = first;


while (++first != last)
{
if (!pred(*result,*first))
{
#ifdef IMPROVED_STD_UNIQUE_ALGORITHM
// inc result, then swap *result and *first


//          THIS COULD SWAP WITH ITSELF UNCECESSARILY
//          std::swap( *first, *(++result));
//
// BUT avoid swapping with itself when both iterators are the same
++result;
if (result != first)
std::swap( *first, *result);


#else
// original code found in std::unique()
// copies unique down
// causes memory leaks, and duplicate ptrs
// and uncessarily moves in place!
*(++result) = *first;
#endif
}
}


return ++result;
}


// from now on, the #define is no longer needed, so get rid of it
#undef IMPROVED_STD_UNIQUE_ALGORITHM


} // end ptgi:: namespace


#endif

下面是我用来测试它的UNIT Test程序:

// QUESTION: in test2, I had trouble getting one line to compile,which was caused  by the declaration of operator()
// in the equal_to Predicate.  I'm not sure how to correctly resolve that issue.
// Look for //OUT lines
//
// Make sure that NOTES in ptgi_unique.hpp are correct, in how we should "cleanup" duplicates
// from both a vector<Integer> (test1()) and vector<Integer*> (test2).
// Run this with valgrind(1).
//
// In test2(), IF we use the call to std::unique(), we get this problem:
//
//  [dbednar@ipeng8 TestSortRoutes]$ ./Main7
//  TEST2: ORIG nums before UNIQUE: 10, 20, 21, 22, 30, 31, 23, 24, 11
//  TEST2: modified nums AFTER UNIQUE: 10, 20, 30, 23, 11, 31, 23, 24, 11
//  INFO: dupInx=5
//  TEST2: uniq = 10
//  TEST2: uniq = 20
//  TEST2: uniq = 30
//  TEST2: uniq = 33427744
//  TEST2: uniq = 33427808
//  Segmentation fault (core dumped)
//
// And if we run valgrind we seen various error about "read errors", "mismatched free", "definitely lost", etc.
//
//  valgrind --leak-check=full ./Main7
//  ==359== Memcheck, a memory error detector
//  ==359== Command: ./Main7
//  ==359== Invalid read of size 4
//  ==359== Invalid free() / delete / delete[]
//  ==359== HEAP SUMMARY:
//  ==359==     in use at exit: 8 bytes in 2 blocks
//  ==359== LEAK SUMMARY:
//  ==359==    definitely lost: 8 bytes in 2 blocks
// But once we replace the call in test2() to use ptgi::unique(), all valgrind() error messages disappear.
//
// 130212   dpb dbednar@ptgi.com created
// =========================================================================================================


#include <iostream> // std::cout, std::cerr
#include <string>
#include <vector>   // std::vector
#include <sstream>  // std::ostringstream
#include <algorithm>    // std::unique()
#include <functional>   // std::equal_to(), std::binary_function()
#include <cassert>  // assert() MACRO


#include "ptgi_unique.hpp"  // ptgi::unique()






// Integer is small "wrapper class" around a primitive int.
// There is no SETTER, so Integer's are IMMUTABLE, just like in JAVA.


class Integer
{
private:
int num;
public:


// default CTOR: "Integer zero;"
// COMPRENSIVE CTOR:  "Integer five(5);"
Integer( int num = 0 ) :
num(num)
{
}


// COPY CTOR
Integer( const Integer& rhs) :
num(rhs.num)
{
}


// assignment, operator=, needs nothing special... since all data members are primitives


// GETTER for 'num' data member
// GETTER' are *always* const
int getNum() const
{
return num;
}


// NO SETTER, because IMMUTABLE (similar to Java's Integer class)


// @return "num"
// NB: toString() should *always* be a const method
//
// NOTE: it is probably more efficient to call getNum() intead
// of toString() when printing a number:
//
// BETTER to do this:
//  Integer five(5);
//  std::cout << five.getNum() << "\n"
// than this:
//  std::cout << five.toString() << "\n"


std::string toString() const
{
std::ostringstream oss;
oss << num;
return oss.str();
}
};


// convenience typedef's for iterating over std::vector<Integer>
typedef std::vector<Integer>::iterator      IntegerVectorIterator;
typedef std::vector<Integer>::const_iterator    ConstIntegerVectorIterator;


// convenience typedef's for iterating over std::vector<Integer*>
typedef std::vector<Integer*>::iterator     IntegerStarVectorIterator;
typedef std::vector<Integer*>::const_iterator   ConstIntegerStarVectorIterator;


// functor used for std::unique or ptgi::unique() on a std::vector<Integer>
// Two numbers equal if, when divided by 10 (integer division), the quotients are the same.
// Hence 50..59 are equal, 60..69 are equal, etc.
struct IntegerEqualByTen: public std::equal_to<Integer>
{
bool operator() (const Integer& arg1, const Integer& arg2) const
{
return ((arg1.getNum()/10) == (arg2.getNum()/10));
}
};


// functor used for std::unique or ptgi::unique on a std::vector<Integer*>
// Two numbers equal if, when divided by 10 (integer division), the quotients are the same.
// Hence 50..59 are equal, 60..69 are equal, etc.
struct IntegerEqualByTenPointer: public std::equal_to<Integer*>
{
// NB: the Integer*& looks funny to me!
// TECHNICAL PROBLEM ELSEWHERE so had to remove the & from *&
//OUT   bool operator() (const Integer*& arg1, const Integer*& arg2) const
//
bool operator() (const Integer* arg1, const Integer* arg2) const
{
return ((arg1->getNum()/10) == (arg2->getNum()/10));
}
};


void test1();
void test2();
void printIntegerStarVector( const std::string& msg, const std::vector<Integer*>& nums );


int main()
{
test1();
test2();
return 0;
}


// test1() uses a vector<Object> (namely vector<Integer>), so there is no problem with memory loss
void test1()
{
int data[] = { 10, 20, 21, 22, 30, 31, 23, 24, 11};


// turn C array into C++ vector
std::vector<Integer> nums(data, data+9);


// arg3 is a functor
IntegerVectorIterator dupPosition = ptgi::unique( nums.begin(), nums.end(), IntegerEqualByTen() );


nums.erase(dupPosition, nums.end());


nums.erase(nums.begin(), dupPosition);
}


//==================================================================================
// test2() uses a vector<Integer*>, so after ptgi:unique(), we have to be careful in
// how we eliminate the duplicate Integer objects stored in the heap.
//==================================================================================
void test2()
{
int data[] = { 10, 20, 21, 22, 30, 31, 23, 24, 11};


// turn C array into C++ vector of Integer* pointers
std::vector<Integer*> nums;


// put data[] integers into equivalent Integer* objects in HEAP
for (int inx = 0; inx < 9; ++inx)
{
nums.push_back( new Integer(data[inx]) );
}


// print the vector<Integer*> to stdout
printIntegerStarVector( "TEST2: ORIG nums before UNIQUE", nums );


// arg3 is a functor
#if 1
// corrected version which fixes SEGMENTATION FAULT and all memory leaks reported by valgrind(1)
// I THINK we want to use new C++11 cbegin() and cend(),since the equal_to predicate is passed "Integer *&"


//  DID NOT COMPILE
//OUT   IntegerStarVectorIterator dupPosition = ptgi::unique( const_cast<ConstIntegerStarVectorIterator>(nums.begin()), const_cast<ConstIntegerStarVectorIterator>(nums.end()), IntegerEqualByTenPointer() );


// DID NOT COMPILE when equal_to predicate declared "Integer*& arg1, Integer*&  arg2"
//OUT   IntegerStarVectorIterator dupPosition = ptgi::unique( const_cast<nums::const_iterator>(nums.begin()), const_cast<nums::const_iterator>(nums.end()), IntegerEqualByTenPointer() );




// okay when equal_to predicate declared "Integer* arg1, Integer*  arg2"
IntegerStarVectorIterator dupPosition = ptgi::unique(nums.begin(), nums.end(), IntegerEqualByTenPointer() );
#else
// BUGGY version that causes SEGMENTATION FAULT and valgrind(1) errors
IntegerStarVectorIterator dupPosition = std::unique( nums.begin(), nums.end(), IntegerEqualByTenPointer() );
#endif


printIntegerStarVector( "TEST2: modified nums AFTER UNIQUE", nums );
int dupInx = dupPosition - nums.begin();
std::cout << "INFO: dupInx=" << dupInx <<"\n";


// delete the dup Integer* objects in the [dupPosition, end] range
for (IntegerStarVectorIterator iter = dupPosition; iter != nums.end(); ++iter)
{
delete (*iter);
}


// shrink the vector
// NB: the Integer* ptrs are NOT followed by vector::erase()
nums.erase(dupPosition, nums.end());




// print the uniques, by following the iter to the Integer* pointer
for (IntegerStarVectorIterator iter = nums.begin(); iter != nums.end();  ++iter)
{
std::cout << "TEST2: uniq = " << (*iter)->getNum() << "\n";
}


// remove the unique objects from heap
for (IntegerStarVectorIterator iter = nums.begin(); iter != nums.end();  ++iter)
{
delete (*iter);
}


// shrink the vector
nums.erase(nums.begin(), nums.end());


// the vector should now be completely empty
assert( nums.size() == 0);
}


//@ print to stdout the string: "info_msg: num1, num2, .... numN\n"
void printIntegerStarVector( const std::string& msg, const std::vector<Integer*>& nums )
{
std::cout << msg << ": ";
int inx = 0;
ConstIntegerStarVectorIterator  iter;


// use const iterator and const range!
// NB: cbegin() and cend() not supported until LATER (c++11)
for (iter = nums.begin(), inx = 0; iter != nums.end(); ++iter, ++inx)
{
// output a comma seperator *AFTER* first
if (inx > 0)
std::cout << ", ";


// call Integer::toString()
std::cout << (*iter)->getNum();     // send int to stdout
//      std::cout << (*iter)->toString();   // also works, but is probably slower


}


// in conclusion, add newline
std::cout << "\n";
}

下面是使用std::unique()出现重复删除问题的示例。在LINUX机器上,程序崩溃。详情请阅读评论。

// Main10.cpp
//
// Illustration of duplicate delete and memory leak in a vector<int*> after calling std::unique.
// On a LINUX machine, it crashes the progam because of the duplicate delete.
//
// INPUT : {1, 2, 2, 3}
// OUTPUT: {1, 2, 3, 3}
//
// The two 3's are actually pointers to the same 3 integer in the HEAP, which is BAD
// because if you delete both int* pointers, you are deleting the same memory
// location twice.
//
//
// Never mind the fact that we ignore the "dupPosition" returned by std::unique(),
// but in any sensible program that "cleans up after istelf" you want to call deletex
// on all int* poitners to avoid memory leaks.
//
//
// NOW IF you replace std::unique() with ptgi::unique(), all of the the problems disappear.
// Why? Because ptgi:unique merely reshuffles the data:
// OUTPUT: {1, 2, 3, 2}
// The ptgi:unique has swapped the last two elements, so all of the original elements in
// the INPUT are STILL in the OUTPUT.
//
// 130215   dbednar@ptgi.com
//============================================================================


#include <iostream>
#include <vector>
#include <algorithm>
#include <functional>


#include "ptgi_unique.hpp"


// functor used by std::unique to remove adjacent elts from vector<int*>
struct EqualToVectorOfIntegerStar: public std::equal_to<int *>
{
bool operator() (const int* arg1, const int* arg2) const
{
return (*arg1 == *arg2);
}
};


void printVector( const std::string& msg, const std::vector<int*>& vnums);


int main()
{
int inums [] = { 1, 2, 2, 3 };
std::vector<int*> vnums;


// convert C array into vector of pointers to integers
for (size_t inx = 0; inx < 4; ++ inx)
vnums.push_back( new int(inums[inx]) );


printVector("BEFORE UNIQ", vnums);


// INPUT : 1, 2A, 2B, 3
std::unique( vnums.begin(), vnums.end(), EqualToVectorOfIntegerStar() );
// OUTPUT: 1, 2A, 3, 3 }
printVector("AFTER  UNIQ", vnums);


// now we delete 3 twice, and we have a memory leak because 2B is not deleted.
for (size_t inx = 0; inx < vnums.size(); ++inx)
{
delete(vnums[inx]);
}
}


// print a line of the form "msg: 1,2,3,..,5,6,7\n", where 1..7 are the numbers in vnums vector
// PS: you may pass "hello world" (const char *) because of implicit (automatic) conversion
// from "const char *" to std::string conversion.


void printVector( const std::string& msg, const std::vector<int*>& vnums)
{
std::cout << msg << ": ";


for (size_t inx = 0; inx < vnums.size(); ++inx)
{
// insert comma separator before current elt, but ONLY after first elt
if (inx > 0)
std::cout << ",";
std::cout << *vnums[inx];


}
std::cout << "\n";
}
std::set<int> s;
std::for_each(v.cbegin(), v.cend(), [&s](int val){s.insert(val);});
v.clear();
std::copy(s.cbegin(), s.cend(), v.cbegin());

我重做了内特·科尔的侧写得到了不同的结果。对于我的测试用例,直接对向量排序总是比使用集合更有效。我添加了一个新的更有效的方法,使用unordered_set

请记住,unordered_set方法只有在你需要唯一和排序的类型有一个良好的哈希函数时才有效。对于int型,这很简单!(标准库提供了一个默认的哈希,它只是标识函数。)另外,不要忘记在最后排序,因为unordered_set是无序的:)

我在setunordered_set实现中做了一些深入研究,发现构造函数实际上为每个元素构造了一个新节点,然后检查其值以确定是否应该实际插入(至少在Visual Studio实现中)。

以下是5种方法:

f1:只使用vectorsort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

f2:转换为set(使用构造函数)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

f3:转换为set(手动)

set<int> s;
for (int i : vec)
s.insert(i);
vec.assign( s.begin(), s.end() );

f4:转换为unordered_set(使用构造函数)

unordered_set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

f5:转换为unordered_set(手动)

unordered_set<int> s;
for (int i : vec)
s.insert(i);
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

我在[1,10],[1,1000]和[1,100000]的范围内随机选择了100,000,000 int的向量进行测试

结果(以秒为单位,越小越好):

range         f1       f2       f3       f4      f5
[1,10]      1.6821   7.6804   2.8232   6.2634  0.7980
[1,1000]    5.0773  13.3658   8.2235   7.6884  1.9861
[1,100000]  8.7955  32.1148  26.5485  13.3278  3.9822

关于alexK7基准测试。我尝试了它们,得到了类似的结果,但是当值的范围为100万时,使用std::sort (f1)和使用std::unordered_set (f5)的情况产生类似的时间。当取值范围为1000万时,f1比f5快。

如果值的范围是有限的,并且值是无符号int,则可以使用std::vector,其大小对应于给定的范围。代码如下:

void DeleteDuplicates_vector_bool(std::vector<unsigned>& v, unsigned range_size)
{
std::vector<bool> v1(range_size);
for (auto& x: v)
{
v1[x] = true;
}
v.clear();


unsigned count = 0;
for (auto& x: v1)
{
if (x)
{
v.push_back(count);
}
++count;
}
}

如果你不想改变元素的顺序,那么你可以尝试这个解决方案:

template <class T>
void RemoveDuplicatesInVector(std::vector<T> & vec)
{
set<T> values;
vec.erase(std::remove_if(vec.begin(), vec.end(), [&](const T & value) { return !values.insert(value).second; }), vec.end());
}

你可以这样做:

std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end()), v.end());

如果你正在寻找性能并使用std::vector,我推荐这个文档链接提供的。

std::vector<int> myvector{10,20,20,20,30,30,20,20,10};             // 10 20 20 20 30 30 20 20 10
std::sort(myvector.begin(), myvector.end() );
const auto& it = std::unique (myvector.begin(), myvector.end());   // 10 20 30 ?  ?  ?  ?  ?  ?
//          ^
myvector.resize( std::distance(myvector.begin(),it) ); // 10 20 30
void EraseVectorRepeats(vector <int> & v){
TOP:for(int y=0; y<v.size();++y){
for(int z=0; z<v.size();++z){
if(y==z){ //This if statement makes sure the number that it is on is not erased-just skipped-in order to keep only one copy of a repeated number
continue;}
if(v[y]==v[z]){
v.erase(v.begin()+z); //whenever a number is erased the function goes back to start of the first loop because the size of the vector changes
goto TOP;}}}}

这是我创建的一个函数,你可以用它来删除重复。所需的头文件只有<iostream><vector>

更容易理解的代码来自:https://en.cppreference.com/w/cpp/algorithm/unique

#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <cctype>


int main()
{
// remove duplicate elements
std::vector<int> v{1,2,3,1,2,3,3,4,5,4,5,6,7};
std::sort(v.begin(), v.end()); // 1 1 2 2 3 3 3 4 4 5 5 6 7
auto last = std::unique(v.begin(), v.end());
// v now holds {1 2 3 4 5 6 7 x x x x x x}, where 'x' is indeterminate
v.erase(last, v.end());
for (int i : v)
std::cout << i << " ";
std::cout << "\n";
}

输出:

1 2 3 4 5 6 7

假设一个是一个向量,使用

a.erase(unique(a.begin(),a.end()),a.end());O (n)时间运行。

使用Ranges v3库,您可以简单地使用

action::unique(vec);

注意,它实际上删除了重复的元素,而不仅仅是移动它们。

不幸的是,动作在c++ 20中没有标准化,因为即使在c++ 20中,范围库的其他部分仍然必须使用原始库。

void removeDuplicates(std::vector<int>& arr) {
for (int i = 0; i < arr.size(); i++)
{
for (int j = i + 1; j < arr.size(); j++)
{
if (arr[i] > arr[j])
{
int temp = arr[i];
arr[i] = arr[j];
arr[j] = temp;
}
}
}
std::vector<int> y;
int x = arr[0];
int i = 0;
while (i < arr.size())
{
if (x != arr[i])
{
y.push_back(x);
x = arr[i];
}
i++;
if (i == arr.size())
y.push_back(arr[i - 1]);
}
arr = y;
}

如果你的类很容易转换为int型,并且你有一些内存, Unique可以在没有排序的情况下完成,而且速度更快:

#include <vector>
#include <stdlib.h>
#include <algorithm>
int main (int argc, char* argv []) {
//vector init
std::vector<int> v (1000000, 0);
std::for_each (v.begin (), v.end (), [] (int& s) {s = rand () %1000;});
std::vector<int> v1 (v);
int beg (0), end (0), duration (0);
beg = clock ();
{
std::sort (v.begin (), v.end ());
auto i (v.begin ());
i = std::unique (v.begin (), v.end ());
if (i != v.end ()) v.erase (i, v.end ());
}
end = clock ();
duration = (int) (end - beg);
std::cout << "\tduration sort + unique == " << duration << std::endl;


int n (0);
duration = 0;
beg = clock ();
std::for_each (v1.begin (), v1.end (), [&n] (const int& s) {if (s >= n) n = s+1;});
std::vector<int> tab (n, 0);
{
auto i (v1.begin ());
std::for_each (v1.begin (), v1.end (), [&i, &tab] (const int& s) {
if (!tab [s]) {
*i++ = s;
++tab [s];
}
});
std::sort (v1.begin (), i);
v1.erase (i, v1.end ());
}
end = clock ();
duration = (int) (end - beg);
std::cout << "\tduration unique + sort == " << duration << std::endl;
if (v == v1) {
std::cout << "and results are same" << std::endl;
}
else {
std::cout << "but result differs" << std::endl;
}
}
典型结果: Duration sort + unique == 38985 持续时间唯一+排序== 2500 结果相同

大部分答案似乎是使用O(nlogn),但使用unordered_set,我们可以将其减少到O(n)。我看到一些解决方案使用sets,但我发现这一个,它似乎更优雅的使用setiterators

using Intvec = std::vector<int>;


void remove(Intvec &v) {
// creating iterator starting with beginning of the vector
Intvec::iterator itr = v.begin();
std::unordered_set<int> s;
// loops from the beginning to the end of the list
for (auto curr = v.begin(); curr != v.end(); ++curr) {
if (s.insert(*curr).second) { // if the 0 curr already exist in the set
*itr++ = *curr; // adding a position to the iterator
}
}
// erasing repeating positions in the set
v.erase(itr, v.end());
}

取决于用例。如果你期望有少于100个正整数的唯一值,并且你有一个cpu能够处理avx512f指令集,那么你可以以每个元素15个时钟周期或每秒3 -5亿个插入的速度插入元素,通过与一个小型查找表进行简单的比较。

接下来的实现使用CPU寄存器对~50个惟一值进行值查找,并对~1000个惟一值进行L1缓存。对于L1缓存版本,每次插入大约需要160个时钟周期,这相当于大约每秒25M个插入,并且比使用std::set慢。对于只有4个唯一值,它以每个元素5.8个周期的速率插入,高于500M/s。

//g++  7.4.0
// time measurement taken from another answer
// valid C99 and C++


#include <stdint.h>  // <cstdint> is preferred in C++, but stdint.h works.


#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif


// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
_mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
_mm_lfence();  // optionally block later instructions until rdtsc retires
return tsc;
}


// requires a Nehalem or newer CPU.  Not Core2 or earlier.  IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy);  // waits for earlier insns to retire, but allows later to start
}






#include <iostream>


template<int n>
struct FastUnique
{
public:
FastUnique()
{
it=0;
for(int i=0;i<n;i++)
dict[i]=-1;
}


void insert(const int val)
{
if(!test(dict,val))
dict[it++]=val;
}


const int get(const int index)
{
return dict[index];
}


const int size()
{
return it;
}


private:
int dict[n];
int it;
bool test(const int * dict, const int val)
{
int c=0;
for(int i=0;i<n;i++)
c+=(dict[i]==val);
return c>0;
}
};


int main()
{
std::cout << "Hello, world!\n";
const int n=500000000;


FastUnique<64> fastSet;


auto t= readTSC();


for(int i=0;i<n;i++)
fastSet.insert(i&63);


auto t2=readTSC();


std::cout<<(t2-t)/(double)n<<"cycles per iteration"<<std::endl;
   

for(int i=0;i<fastSet.size();i++)
std::cout<<fastSet.get(i)<<std::endl;
    

return 0;
}