Tokenizing an NSString in Objective-C

What is the best way to tokenize/split an NSString in Objective-C?

Found an answer, here you go:

NSString *string = @"oop:ack:bork:greeble:ponies";
NSArray *chunks = [string componentsSeparatedByString: @":"];

If you just want to split a string, use -[NSString componentsSeparatedByString:].

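If your tokenization needs are more complex, check out my open-source Cocoa string tokenizing/parsing toolkit, ParseKit:

http://parsekit.com

For simple string splitting on a delimiter character (such as ':'), ParseKit would definitely be overkill. For complex tokenization needs, however, ParseKit is extremely powerful and flexible.

See also the ParseKit tokenization documentation.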

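Everyone has mentioned componentsSeparatedByString:, but you can also use CFStringTokenizer (remember that NSString and CFString are interchangeable through toll-free bridging), which can also tokenize natural languages (such as Chinese/Japanese, which do not split words on spaces).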
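
As a rough illustration of the CFStringTokenizer approach, here is a minimal sketch. It assumes ARC (hence the __bridge cast) and the current locale; the helper name WordTokens is made up for this example and is not part of any framework.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (the name is ours): collect the word tokens of `text`.
// Assumes ARC; under manual reference counting, drop the __bridge keyword.
static NSArray *WordTokens(NSString *text) {
    NSMutableArray *tokens = [NSMutableArray array];
    CFStringRef cfText = (__bridge CFStringRef)text; // toll-free bridged, no copy
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                cfText,
                                CFRangeMake(0, CFStringGetLength(cfText)),
                                kCFStringTokenizerUnitWord,
                                locale);

    // Advance until the tokenizer reports that no token is left.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        [tokens addObject:[text substringWithRange:NSMakeRange(range.location, range.length)]];
    }

    // Create/Copy functions return owned Core Foundation objects, even under ARC.
    CFRelease(tokenizer);
    CFRelease(locale);
    return tokens;
}
```

Calling WordTokens(@"Hello, world") should yield an array containing the words of the sentence; the exact word boundaries are decided by the tokenizer and the locale, not by a fixed delimiter.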

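If you want to tokenize on multiple characters, you can use NSString's componentsSeparatedByCharactersInSet:. NSCharacterSet has some handy premade sets such as whitespaceCharacterSet and illegalCharacterSet. It also has initializers for Unicode ranges.

You can also combine character sets and use them to tokenize, like this: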

// Tokenize sSourceEntityName on both whitespace and punctuation.
NSMutableCharacterSet *mcharsetWhitePunc = [[NSCharacterSet whitespaceAndNewlineCharacterSet] mutableCopy];
[mcharsetWhitePunc formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];
NSArray *sarrTokenizedName = [self.sSourceEntityName componentsSeparatedByCharactersInSet:mcharsetWhitePunc];
[mcharsetWhitePunc release];

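Be aware that componentsSeparatedByCharactersInSet: produces empty strings when it encounters more than one member of the character set in a row, so you may want to test for strings with a length less than 1.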
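
One way to drop those empty strings is to filter the result with a predicate. This is only a sketch; the helper name NonEmptyTokens is hypothetical and the separators (whitespace plus punctuation) are chosen to match the snippet above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (the name is ours): split on whitespace and punctuation,
// then discard the empty strings produced by adjacent separators.
static NSArray *NonEmptyTokens(NSString *text) {
    NSMutableCharacterSet *separators = [NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
    [separators formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];

    NSArray *pieces = [text componentsSeparatedByCharactersInSet:separators];

    // "length > 0" is evaluated against each NSString in the array.
    NSPredicate *notEmpty = [NSPredicate predicateWithFormat:@"length > 0"];
    return [pieces filteredArrayUsingPredicate:notEmpty];
}
```

For the input @"oop, ack" the raw split gives "oop", "" and "ack" (the comma and the space are adjacent separators), and the filtered result is "oop" and "ack".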

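I had a case where I had to split console output after an LDAP query made with ldapsearch. First, set up and execute the NSTask (I found a good code sample here: Execute a terminal command from a Cocoa app). But then I had to split and parse the output in order to extract only the print-server names from the LDAP query output. Unfortunately this is rather tedious string manipulation, which would be no problem at all if we were manipulating C strings/arrays with simple C array operations. Here is my code using Cocoa objects. If you have better suggestions, let me know.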

//as the ldap query has to be done when the user selects one of our Active Directory Domains
//(an according comboBox should be populated with print-server names we discover from AD)
//my code is placed in the onSelectDomain event code


//the following variables are declared in the interface .h file as globals
@protected NSArray* aDomains;//domain combo list array
@protected NSMutableArray* aPrinters;//printer combo list array
@protected NSMutableArray* aPrintServers;//print server combo list array


@protected NSString* sLdapQueryCommand;//for LDAP Queries
@protected NSArray* aLdapQueryArgs;
@protected NSTask* tskLdapTask;
@protected NSPipe* pipeLdapTask;
@protected NSFileHandle* fhLdapTask;
@protected NSMutableData* mdLdapTask;


IBOutlet NSComboBox* comboDomain;
IBOutlet NSComboBox* comboPrinter;
IBOutlet NSComboBox* comboPrintServer;
//end of interface globals


//after collecting the print-server names they are displayed in an according drop-down comboBox
//as soon as the user selects one of the print-servers, we should start a new query to find all the
//print-queues on that server and display them in the comboPrinter drop-down list
//to find the shares/print queues of a windows print-server you need samba and the net -S command like this:
// net -S yourPrintServerName.yourBaseDomain.com -U yourLdapUser%yourLdapUserPassWord -W adm rpc share -l
//which displays a long list of the shares


- (IBAction)onSelectDomain:(id)sender
{
static int indexOfLastItem = 0; //unfortunately we need to compare this because we are called also if the selection did not change!


if ([comboDomain indexOfSelectedItem] != indexOfLastItem && ([comboDomain indexOfSelectedItem] != 0))
{


indexOfLastItem = [comboDomain indexOfSelectedItem]; //retain this index for next call


//the print-servers-list has to be loaded on a per university or domain basis from a file dynamically or from AN LDAP-QUERY


//initialize an LDAP-Query-Task or console-command like this one with console output
/*


ldapsearch -LLL -s sub -D "cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com" -h "yourLdapServer.com" -p 3268 -w "yourLdapUserPassWord" -b "dc=yourBaseDomainToSearchIn,dc=com" "(&(objectcategory=computer)(cn=ps*))" "dn"


//our print-server names start with ps* and we want the dn as result, which comes like this:


dn: CN=PSyourPrintServerName,CN=Computers,DC=yourBaseDomainToSearchIn,DC=com


*/


sLdapQueryCommand = [[NSString alloc] initWithString: @"/usr/bin/ldapsearch"];




if ([[comboDomain stringValue] compare: @"firstDomain"] == NSOrderedSame) {


aLdapQueryArgs = [NSArray arrayWithObjects: @"-LLL",@"-s", @"sub",@"-D", @"cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com",@"-h", @"yourLdapServer.com",@"-p",@"3268",@"-w",@"yourLdapUserPassWord",@"-b",@"dc=yourFirstDomainToSearchIn,dc=com",@"(&(objectcategory=computer)(cn=ps*))",@"dn",nil];
}
else {
aLdapQueryArgs = [NSArray arrayWithObjects: @"-LLL",@"-s", @"sub",@"-D", @"cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com",@"-h", @"yourLdapServer.com",@"-p",@"3268",@"-w",@"yourLdapUserPassWord",@"-b",@"dc=yourSecondDomainToSearchIn,dc=com",@"(&(objectcategory=computer)(cn=ps*))",@"dn",nil];


}




//prepare and execute ldap-query task


tskLdapTask = [[NSTask alloc] init];
pipeLdapTask = [[NSPipe alloc] init];//instead of [NSPipe pipe]
[tskLdapTask setStandardOutput: pipeLdapTask];//hope to get the tasks output in this file/pipe


//The magic line that keeps your log where it belongs, has to do with NSLog (see https://stackoverflow.com/questions/412562/execute-a-terminal-command-from-a-cocoa-app and here http://www.cocoadev.com/index.pl?NSTask )
[tskLdapTask setStandardInput:[NSPipe pipe]];


//fhLdapTask  = [[NSFileHandle alloc] init];//would be redundant here, next line seems to do the trick also
fhLdapTask = [pipeLdapTask fileHandleForReading];
mdLdapTask  = [NSMutableData dataWithCapacity:512];//prepare capturing the pipe buffer which is flushed on read and can overflow, start with 512 Bytes but it is mutable, so grows dynamically later
[tskLdapTask setLaunchPath: sLdapQueryCommand];
[tskLdapTask setArguments: aLdapQueryArgs];


#ifdef bDoDebug
NSLog (@"sLdapQueryCommand: %@\n", sLdapQueryCommand);
NSLog (@"aLdapQueryArgs: %@\n", aLdapQueryArgs );
NSLog (@"tskLdapTask: %@\n", [tskLdapTask arguments]);
#endif


[tskLdapTask launch];


while ([tskLdapTask isRunning]) {
[mdLdapTask appendData: [fhLdapTask readDataToEndOfFile]];
}
[tskLdapTask waitUntilExit];//might be redundant here.


[mdLdapTask appendData: [fhLdapTask readDataToEndOfFile]];//add another read for safety after process/command stops


NSString* sLdapOutput = [[NSString alloc] initWithData: mdLdapTask encoding: NSUTF8StringEncoding];//convert output to something readable, as NSData and NSMutableData are mere byte buffers


#ifdef bDoDebug
NSLog(@"LdapQueryOutput: %@\n", sLdapOutput);
#endif


//Ok now we have the printservers from Active Directory, lets parse the output and show the list to the user in its combo box
//output is formatted as this, one printserver per line
//dn: CN=PSyourPrintServer,OU=Computers,DC=yourBaseDomainToSearchIn,DC=com


//so we have to search for "dn: CN=" to retrieve each printserver's name
//unfortunately splitting this up will give us a first line containing only "" empty string, which we can replace with the word "choose"
//appearing as first entry in the comboBox


aPrintServers = [[[sLdapOutput componentsSeparatedByString:@"dn: CN="] mutableCopy] autorelease];//split output at each "dn: CN=" marker; a cast alone would not make the returned NSArray mutable, so take a mutable copy


#ifdef bDoDebug
NSLog(@"aPrintServers: %@\n", aPrintServers);
#endif


if ([[aPrintServers objectAtIndex: 0 ] compare: @"" options: NSLiteralSearch] == NSOrderedSame){
[aPrintServers replaceObjectAtIndex: 0 withObject: slChoose];//replace with localized string "choose"


#ifdef bDoDebug
NSLog(@"aPrintServers: %@\n", aPrintServers);
#endif


}


//Now comes the tedious part to extract only the print-server-names from the single lines
NSRange r;
NSString* sTemp;


for (int i = 1; i < [aPrintServers count]; i++) {//skip first line with "choose". To get rid of the rest of the line, we must isolate/preserve the print server's name to the delimiting comma and remove all the remaining characters
sTemp = [aPrintServers objectAtIndex: i];
sTemp = [sTemp stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceAndNewlineCharacterSet]];//remove newlines and line feeds


#ifdef bDoDebug
NSLog(@"sTemp: %@\n", sTemp);
#endif
r = [sTemp rangeOfString: @","];//now find first comma to remove the whole rest of the line
if (r.location == NSNotFound) r.location = [sTemp length];//no comma found: use an empty range at the end so nothing is removed
//r.length = [sTemp lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
r.length = [sTemp length] - r.location;//calculate number of chars between first comma found and length of string
#ifdef bDoDebug
NSLog(@"range: %lu, %lu\n", (unsigned long)r.location, (unsigned long)r.length);
#endif


sTemp = [sTemp stringByReplacingCharactersInRange:r withString: @"" ];//remove rest of line
#ifdef bDoDebug
NSLog(@"sTemp after replace: %@\n", sTemp);
#endif


[aPrintServers replaceObjectAtIndex: i withObject: sTemp];//put back string into array for display in comboBox


#ifdef bDoDebug
NSLog(@"aPrintServer: %@\n", [aPrintServers objectAtIndex: i]);
#endif


}


[comboPrintServer removeAllItems];//reset combo box
[comboPrintServer addItemsWithObjectValues:aPrintServers];
[comboPrintServer setNumberOfVisibleItems:aPrintServers.count];
[comboPrintServer selectItemAtIndex:0];


#ifdef bDoDebug
NSLog(@"comboPrintServer reloaded with new values.");
#endif




//release the memory we own for the LdapTask (only objects created with alloc/init)
[sLdapQueryCommand release];
[sLdapOutput release];


[pipeLdapTask release];
[tskLdapTask release];//created with alloc/init above, so it is ours to release once the task has exited


//aLdapQueryArgs, fhLdapTask, mdLdapTask and sTemp come from convenience methods and are autoreleased;
//releasing them explicitly would over-release them


}
}
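
As one possible simplification of the parsing part above, here is a minimal sketch that pulls the name out of each "dn: CN=..." line. The helper name PrintServerNames is made up, and the sketch assumes that each dn starts at the beginning of a line and that the name ends at the first comma, as in the sample output shown in the comments.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (the name is ours): extract the first CN value from
// every line of the form "dn: CN=<name>,<rest of the dn>".
static NSArray *PrintServerNames(NSString *ldapOutput) {
    NSMutableArray *names = [NSMutableArray array];
    NSString *prefix = @"dn: CN=";
    NSArray *lines = [ldapOutput componentsSeparatedByCharactersInSet:
                         [NSCharacterSet newlineCharacterSet]];

    for (NSString *line in lines) {
        if (![line hasPrefix:prefix]) continue; // skip blank and unrelated lines

        // Everything after the prefix, cut at the first comma.
        NSString *rest = [line substringFromIndex:[prefix length]];
        NSString *name = [[rest componentsSeparatedByString:@","] objectAtIndex:0];
        if ([name length] > 0) [names addObject:name];
    }
    return names;
}
```

The returned array holds only the server names, so a "choose" entry for the combo box would have to be inserted at index 0 separately.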

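I have run into cases myself where just splitting a string by components was not enough, for tasks such as:
1) categorizing the tokens into types
2) adding new tokens
3) splitting the string between custom delimiters, e.g. all the words between "{" and "}"
For any such requirement I found ParseKit to be a lifesaver.

I used it to parse .PGN (Portable Game Notation) files successfully; it is very fast and lightweight.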

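If you want to tokenize a string into search terms while preserving "quoted phrases", here is an NSString category that respects various types of quote pairs: "" '' ‘’ “”

Usage: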

NSArray *terms = [@"This is my \"search phrase\" I want to split" searchTerms];
// results in: ["This", "is", "my", "search phrase", "I", "want", "to", "split"]

Code:

@interface NSString (Search)
- (NSArray *)searchTerms;
@end


@implementation NSString (Search)


- (NSArray *)searchTerms {


// Strip whitespace and setup scanner
NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
NSString *searchString = [self stringByTrimmingCharactersInSet:whitespace];
NSScanner *scanner = [NSScanner scannerWithString:searchString];
[scanner setCharactersToBeSkipped:nil]; // we'll handle whitespace ourselves


// A few types of quote pairs to check
NSDictionary *quotePairs = @{@"\"": @"\"",
@"'": @"'",
@"\u2018": @"\u2019",
@"\u201C": @"\u201D"};


// Scan
NSMutableArray *results = [[NSMutableArray alloc] init];
NSString *substring = nil;
while (scanner.scanLocation < searchString.length) {
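// Reset so a failed scan (e.g. an empty "" pair) cannot re-add the previous term
substring = nil;
// Check for quote at beginning of string (index into the trimmed string the scanner uses)
unichar unicharacter = [searchString characterAtIndex:scanner.scanLocation];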
NSString *startQuote = [NSString stringWithFormat:@"%C", unicharacter];
NSString *endQuote = [quotePairs objectForKey:startQuote];
if (endQuote != nil) { // if it's a valid start quote we'll have an end quote
// Scan quoted phrase into substring (skipping start & end quotes)
[scanner scanString:startQuote intoString:nil];
[scanner scanUpToString:endQuote intoString:&substring];
[scanner scanString:endQuote intoString:nil];
} else {
// Single word that is non-quoted
[scanner scanUpToCharactersFromSet:whitespace intoString:&substring];
}
// Process and add the substring to results
if (substring) {
substring = [substring stringByTrimmingCharactersInSet:whitespace];
if (substring.length) [results addObject:substring];
}
// Skip to next word
[scanner scanCharactersFromSet:whitespace intoString:nil];
}


// Return non-mutable array
return results.copy;


}


@end

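If what you are after is splitting a string by its linguistic features (words, paragraphs, characters, sentences and lines), use string enumeration: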

NSString * string = @" \n word1!    word2,%$?'/word3.word4   ";


[string enumerateSubstringsInRange:NSMakeRange(0, string.length)
options:NSStringEnumerationByWords
usingBlock:
^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
NSLog(@"Substring: '%@'", substring);
}];


// Logs:
// Substring: 'word1'
// Substring: 'word2'
// Substring: 'word3'
// Substring: 'word4'

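This API also works with languages in which spaces are not always the delimiter (e.g. Japanese). Using NSStringEnumerationByComposedCharacterSequences is also the proper way to enumerate characters, since many characters (emoji, for instance) take up more than one unichar (UTF-16 code unit).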
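
To make the difference between length and composed character sequences concrete, here is a small sketch. The helper name ComposedCharacterCount is made up for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (the name is ours): count user-perceived characters
// instead of UTF-16 code units.
static NSUInteger ComposedCharacterCount(NSString *text) {
    __block NSUInteger count = 0;
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByComposedCharacterSequences
                          usingBlock:^(NSString *substring, NSRange substringRange,
                                       NSRange enclosingRange, BOOL *stop) {
        count++; // one call per composed character sequence
    }];
    return count;
}
```

For the string @"a\U0001F600b" (the letter a, an emoji outside the Basic Multilingual Plane, the letter b), length is 4 because the emoji is stored as a surrogate pair, while ComposedCharacterCount returns 3.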