How to split a CSV whose columns may contain commas

Given

2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,"Corvallis, OR",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34

How can I use C# to split the above line into the following strings:

2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34

As you can see, one of the columns contains a comma (Corvallis, OR).

Based on "C# Regex Split - commas outside quotes":

string[] result = Regex.Split(samplestring, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

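Use a library like LumenWorks to read your CSV. It handles fields that contain quotes, and because it has been around for a long time it will likely be more robust overall than a custom solution.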

You could split on every comma that is followed by an even number of quotes.
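As a rough illustration of that idea, here is a minimal sketch. The class name `QuoteAwareSplit` and the helper `StripQuotes` are made up for this example; the pattern is the one quoted earlier in the thread. It assumes one record per line with balanced quotes, and the lookahead rescans the rest of the line at every comma, so it may be slow on very long lines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteAwareSplit
{
    // A comma counts as a delimiter only when the remainder of the line
    // contains an even number of double quotes (i.e. we are outside quotes).
    private static readonly Regex Delimiter =
        new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    public static string[] Split(string line)
    {
        return Delimiter.Split(line)
            .Select(field => StripQuotes(field.Trim()))
            .ToArray();
    }

    // Hypothetical helper: removes one pair of enclosing quotes and
    // turns doubled quotes inside the field back into single quotes.
    private static string StripQuotes(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            return field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
        }
        return field;
    }
}
```

Calling `QuoteAwareSplit.Split(samplestring)` on the line from the question should give back the twelve fields, with `Corvallis, OR` as a single value without its quotes.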

You may also want to look at the specification for the CSV format to see how it says commas should be handled.

Useful link: C# Regex Split - commas outside quotes
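For reference, the quoting rules that the usual CSV specification (RFC 4180) describes are easiest to see from the writer's side: a field that contains a comma, a double quote, or a line break is wrapped in double quotes, and any double quote inside it is doubled. A minimal sketch, where `CsvField.Escape` is a name invented for this example:

```csharp
using System;

public static class CsvField
{
    // Quote a value only when it needs it, doubling any embedded quotes.
    public static string Escape(string value)
    {
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
        {
            return value;
        }
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

Any reader that undoes these two steps (strip the enclosing quotes, collapse doubled quotes) will treat the comma in "Corvallis, OR" as data rather than as a delimiter.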

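I see that if you paste CSV-delimited text into Excel and do a "Text to Columns", it asks you for a "text qualifier". It defaults to a double quote, so anything inside double quotes is treated as literal text. I imagine Excel implements this by going one character at a time: when it encounters a text qualifier, it keeps going until the next qualifier. You can implement this yourself with a loop and a boolean that records whether you are inside literal text.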

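    using System.Collections.Generic;

    public string[] CsvParser(string csvText)
    {
        List<string> tokens = new List<string>();

        int last = -1;        // index of the most recent delimiter comma
        int current = 0;
        bool inText = false;  // true while we are inside a quoted field

        while (current < csvText.Length)
        {
            switch (csvText[current])
            {
                case '"':
                    inText = !inText;
                    break;
                case ',':
                    if (!inText)
                    {
                        // Take the text between the previous delimiter and this one,
                        // then strip surrounding whitespace and the enclosing quotes.
                        tokens.Add(csvText.Substring(last + 1, current - last - 1).Trim().Trim('"'));
                        last = current;
                    }
                    break;
                default:
                    break;
            }
            current++;
        }

        // Add whatever follows the last delimiter.
        if (last != csvText.Length - 1)
        {
            tokens.Add(csvText.Substring(last + 1).Trim().Trim('"'));
        }

        return tokens.ToArray();
    }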

Use the Microsoft.VisualBasic.FileIO.TextFieldParser class. It handles parsing a delimited file, TextReader or Stream in which some fields are enclosed in quotes and some are not.

For example:

    using System;
    using System.IO;
    using Microsoft.VisualBasic.FileIO;

    string csv = "2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,\"Corvallis, OR\",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";

    TextFieldParser parser = new TextFieldParser(new StringReader(csv));

    // You can also read from a file
    // TextFieldParser parser = new TextFieldParser("mycsvfile.csv");

    parser.HasFieldsEnclosedInQuotes = true;
    parser.SetDelimiters(",");

    string[] fields;

    while (!parser.EndOfData)
    {
        fields = parser.ReadFields();
        foreach (string field in fields)
        {
            Console.WriteLine(field);
        }
    }

    parser.Close();

This should produce the following output:

2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34

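See Microsoft.VisualBasic.FileIO.TextFieldParser for more information.

You need to add a reference to Microsoft.VisualBasic in the .NET tab of the Add References dialog.

Parsing .csv files is tricky when the file could contain comma-separated strings, comma-separated quoted strings, or a chaotic combination of the two. The solution I came up with allows for any of the three possibilities.

I created a method, ParseCsvRow(), which returns an array from a csv string. I first deal with double quotes by splitting the string on the double quote character into an array called quotesArray. A .csv file with quoted strings is only valid if it contains an even number of double quotes. A double quote inside a column value should be replaced with a pair of double quotes (this is Excel's approach). As long as the .csv file meets these requirements, you can expect delimiter commas to appear only outside pairs of double quotes. Commas inside a pair of double quotes are part of the column value and should be ignored when splitting the .csv into an array.

My method tests for commas outside double quote pairs by looking only at the even indexes of quotesArray. It also removes the double quotes from the start and end of column values.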

    using System;

    public static string[] ParseCsvRow(string csvrow)
    {
        const string obscureCharacter = "ᖳ";
        if (csvrow.Contains(obscureCharacter))
            throw new Exception("Error: csv row may not contain the " + obscureCharacter + " character");

        var unicodeSeparatedString = "";

        var quotesArray = csvrow.Split('"');  // Split string on double quote character
        if (quotesArray.Length > 1)
        {
            for (var i = 0; i < quotesArray.Length; i++)
            {
                // CSV must use double quotes to represent a quote inside a quoted cell
                // Quotes must be paired up
                // Even indexes lie outside a pair of quotes, so every comma there is a delimiter:
                // replace each one with the obscure unicode character
                if (i % 2 == 0)
                {
                    quotesArray[i] = quotesArray[i].Replace(",", obscureCharacter);
                }
                // Build string and replace quotes where quotes were expected.
                unicodeSeparatedString += (i > 0 ? "\"" : "") + quotesArray[i].Trim();
            }
        }
        else
        {
            // String does not have any pairs of double quotes.  It should be safe to just replace the commas with the obscure character
            unicodeSeparatedString = csvrow.Replace(",", obscureCharacter);
        }

        var csvRowArray = unicodeSeparatedString.Split(obscureCharacter[0]);

        for (var i = 0; i < csvRowArray.Length; i++)
        {
            var s = csvRowArray[i].Trim();
            if (s.StartsWith("\"") && s.EndsWith("\""))
            {
                csvRowArray[i] = s.Length > 2 ? s.Substring(1, s.Length - 2) : "";  // Remove start and end quotes.
            }
        }

        return csvRowArray;
    }

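One downside of my approach is that I temporarily replace the delimiter commas with an obscure Unicode character. This character needs to be so obscure that it would never show up in your .csv file. You may want to put more handling around this.

I had a problem with a CSV that contains fields with a quote character in them, so using the TextFieldParser, I came up with the following: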

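    using System.IO;
    using System.Text;
    using Microsoft.VisualBasic.FileIO;

    private static string[] parseCSVLine(string csvLine)
    {
        using (TextFieldParser TFP = new TextFieldParser(new MemoryStream(Encoding.UTF8.GetBytes(csvLine))))
        {
            TFP.HasFieldsEnclosedInQuotes = true;
            TFP.SetDelimiters(",");

            try
            {
                return TFP.ReadFields();
            }
            catch (MalformedLineException)
            {
                StringBuilder m_sbLine = new StringBuilder();
                string errorLine = TFP.ErrorLine;

                for (int i = 0; i < errorLine.Length; i++)
                {
                    // Double any quote that is not next to a delimiter comma (a quote inside a field).
                    // The bounds checks keep the first and last characters from indexing outside the line.
                    if (i > 0 && i < errorLine.Length - 1 && errorLine[i] == '"' && (errorLine[i + 1] != ',' && errorLine[i - 1] != ','))
                        m_sbLine.Append("\"\"");
                    else
                        m_sbLine.Append(errorLine[i]);
                }

                // Retry with the repaired line.
                return parseCSVLine(m_sbLine.ToString());
            }
        }
    }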

A StreamReader is still used to read the CSV line by line, as follows:

    using (StreamReader SR = new StreamReader(FileName))
    {
        while (SR.Peek() > -1)
            myStringArray = parseCSVLine(SR.ReadLine());
    }

It's late, but this may be helpful for someone. We can use a regular expression like the one below.

Regex CSVParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
String[] Fields = CSVParser.Split(Test);

With Cinchoo ETL, an open source library, column values that contain separators are handled automatically.

    string csv = @"2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,""Corvallis, OR"",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";

    using (var p = ChoCSVReader.LoadText(csv))
    {
        Console.WriteLine(p.Dump());
    }

Output:

Key: Column1 [Type: String]
Value: 2
Key: Column2 [Type: String]
Value: 1016
Key: Column3 [Type: String]
Value: 7/31/2008 14:22
Key: Column4 [Type: String]
Value: Geoff Dalgas
Key: Column5 [Type: String]
Value: 6/5/2011 22:21
Key: Column6 [Type: String]
Value: http://stackoverflow.com
Key: Column7 [Type: String]
Value: Corvallis, OR
Key: Column8 [Type: String]
Value: 7679
Key: Column9 [Type: String]
Value: 351
Key: Column10 [Type: String]
Value: 81
Key: Column11 [Type: String]
Value: b437f461b3fd27387c5d8ab47a293d35
Key: Column12 [Type: String]
Value: 34

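For more information, please visit the codeproject article.

Hope it helps.

There are many answers to this question and its duplicates. I tried this one, which looked promising, but found some bugs in it. I heavily modified it so that it would pass all of my tests.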

    /// <summary>
    /// Returns a collection of strings that are derived by splitting the given source string at
    /// characters given by the 'delimiter' parameter.  However, a substring may be enclosed between
    /// pairs of the 'qualifier' character so that instances of the delimiter can be taken as literal
    /// parts of the substring.  The method was originally developed to split comma-separated text
    /// where quotes could be used to qualify text that contains commas that are to be taken as literal
    /// parts of the substring.  For example, the following source:
    ///     A, B, "C, D", E, "F, G"
    /// would be split into 5 substrings:
    ///     A
    ///     B
    ///     C, D
    ///     E
    ///     F, G
    /// When enclosed inside of qualifiers, the literal for the qualifier character may be represented
    /// by two consecutive qualifiers.  The two consecutive qualifiers are distinguished from a closing
    /// qualifier character.  For example, the following source:
    ///     A, "B, ""C"""
    /// would be split into 2 substrings:
    ///     A
    ///     B, "C"
    /// </summary>
    /// <remarks>Originally based on: https://stackoverflow.com/a/43284485/2998072</remarks>
    /// <param name="source">The string that is to be split</param>
    /// <param name="delimiter">The character that separates the substrings</param>
    /// <param name="qualifier">The character that is used (in pairs) to enclose a substring</param>
    /// <param name="toTrim">If true, then whitespace is removed from the beginning and end of each
    /// substring.  If false, then whitespace is preserved at the beginning and end of each substring.
    /// </param>
    public static List<String> SplitQualified(this String source, Char delimiter, Char qualifier,
                                Boolean toTrim)
    {
        // Avoid throwing exception if the source is null
        if (String.IsNullOrEmpty(source))
            return new List<String> { "" };

        var results = new List<String>();
        var result = new StringBuilder();
        Boolean inQualifier = false;

        // The algorithm is designed to expect a delimiter at the end of each substring, but the
        // expectation of the caller is that the final substring is not terminated by delimiter.
        // Therefore, we add an artificial delimiter at the end before looping through the source string.
        String sourceX = source + delimiter;

        // Loop through each character of the source
        for (var idx = 0; idx < sourceX.Length; idx++)
        {
            // If current character is a delimiter
            // (except if we're inside of qualifiers, we ignore the delimiter)
            if (sourceX[idx] == delimiter && inQualifier == false)
            {
                // Terminate the current substring by adding it to the collection
                // (trim if specified by the method parameter)
                results.Add(toTrim ? result.ToString().Trim() : result.ToString());
                result.Clear();
            }
            // If current character is a qualifier
            else if (sourceX[idx] == qualifier)
            {
                // ...and we're already inside of qualifier
                if (inQualifier)
                {
                    // check for double-qualifiers, which is escape code for a single
                    // literal qualifier character.
                    if (idx + 1 < sourceX.Length && sourceX[idx + 1] == qualifier)
                    {
                        idx++;
                        result.Append(sourceX[idx]);
                        continue;
                    }
                    // Since we found only a single qualifier, that means that we've
                    // found the end of the enclosing qualifiers.
                    inQualifier = false;
                    continue;
                }
                else
                    // ...we found an opening qualifier
                    inQualifier = true;
            }
            // If current character is neither qualifier nor delimiter
            else
                result.Append(sourceX[idx]);
        }

        return results;
    }

Here are the test methods that show it works:

    [TestMethod()]
    public void SplitQualified_00()
    {
        // Example with no substrings
        String s = "";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_00A()
    {
        // just a single delimiter
        String s = ",";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "", "" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_01()
    {
        // Example with no whitespace or qualifiers
        String s = "1,2,3,1,2,3";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_02()
    {
        // Example with whitespace and no qualifiers
        String s = " 1, 2 ,3,  1  ,2\t,   3   ";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_03()
    {
        // Example with whitespace and no qualifiers
        String s = " 1, 2 ,3,  1  ,2\t,   3   ";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(
            new List<String> { " 1", " 2 ", "3", "  1  ", "2\t", "   3   " },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_04()
    {
        // Example with no whitespace and trivial qualifiers.
        String s = "1,\"2\",3,1,2,\"3\"";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);

        s = "\"1\",\"2\",3,1,\"2\",3";
        substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_05()
    {
        // Example with no whitespace and qualifiers that enclose delimiters
        String s = "1,\"2,2a\",3,1,2,\"3,3a\"";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2,2a", "3", "1", "2", "3,3a" },
            substrings);

        s = "\"1,1a\",\"2,2b\",3,1,\"2,2c\",3";
        substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1,1a", "2,2b", "3", "1", "2,2c", "3" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_06()
    {
        // Example with qualifiers enclosing whitespace but no delimiter
        String s = "\" 1 \",\"2 \",3,1,2,\"\t3\t\"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_07()
    {
        // Example with qualifiers enclosing whitespace but no delimiter
        String s = "\" 1 \",\"2 \",3,1,2,\"\t3\t\"";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", "2 ", "3", "1", "2", "\t3\t" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_08()
    {
        // Example with qualifiers enclosing whitespace but no delimiter; also whitespace btwn delimiters
        String s = "\" 1 \", \"2 \"  ,  3,1, 2 ,\"  3  \"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_09()
    {
        // Example with qualifiers enclosing whitespace but no delimiter; also whitespace btwn delimiters
        String s = "\" 1 \", \"2 \"  ,  3,1, 2 ,\"  3  \"";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", " 2   ", "  3", "1", " 2 ", "  3  " },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_10()
    {
        // Example with qualifiers enclosing whitespace and delimiter
        String s = "\" 1 \",\"2 , 2b \",3,1,2,\"  3,3c  \"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2 , 2b", "3", "1", "2", "3,3c" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_11()
    {
        // Example with qualifiers enclosing whitespace and delimiter; also whitespace btwn delimiters
        String s = "\" 1 \", \"2 , 2b \"  ,  3,1, 2 ,\"  3,3c  \"";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", " 2 , 2b   ", "  3", "1", " 2 ", "  3,3c  " },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_12()
    {
        // Example with tab characters between delimiters
        String s = "\t1,\t2\t,3,1,\t2\t,\t3\t";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_13()
    {
        // Example with newline characters between delimiters
        String s = "\n1,\n2\n,3,1,\n2\n,\n3\n";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_14()
    {
        // Example with qualifiers enclosing whitespace and delimiter, plus escaped qualifier
        String s = "\" 1 \",\"\"\"2 , 2b \"\"\",3,1,2,\"  \"\"3,3c  \"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "\"2 , 2b \"", "3", "1", "2", "\"3,3c" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_14A()
    {
        // Example with qualifiers enclosing whitespace and delimiter, plus escaped qualifier
        String s = "\"\"\"1\"\"\"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "\"1\"" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_15()
    {
        // Instead of comma-delimited and quote-qualified, use pipe and hash

        // Example with no whitespace or qualifiers
        String s = "1|2|3|1|2,2f|3";
        var substrings = s.SplitQualified('|', '#', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2,2f", "3" }, substrings);
    }

    [TestMethod()]
    public void SplitQualified_16()
    {
        // Instead of comma-delimited and quote-qualified, use pipe and hash

        // Example with qualifiers enclosing whitespace and delimiter
        String s = "# 1 #|#2 | 2b #|3|1|2|#  3|3c  #";
        // whitespace should be removed
        var substrings = s.SplitQualified('|', '#', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2 | 2b", "3", "1", "2", "3|3c" },
            substrings);
    }

    [TestMethod()]
    public void SplitQualified_17()
    {
        // Instead of comma-delimited and quote-qualified, use pipe and hash

        // Example with qualifiers enclosing whitespace and delimiter; also whitespace btwn delimiters
        String s = "# 1 #| #2 | 2b #  |  3|1| 2 |#  3|3c  #";
        // whitespace should be preserved
        var substrings = s.SplitQualified('|', '#', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", " 2 | 2b   ", "  3", "1", " 2 ", "  3|3c  " },
            substrings);
    }