删除 iPhone 上 NSString 中的 HTML 标记

有几种不同的方法可以从 Cocoa中的 NSString中去除 HTML tags

一种方法 是将字符串呈现为 NSAttributedString,然后获取呈现的文本。

另一种方法 是使用 NSXMLDocument's-objectByApplyingXSLTString方法应用一个 XSLT转换来完成它。

不幸的是,iPhone 不支持 NSAttributedStringNSXMLDocument。有太多的边缘情况和畸形的 HTML文档,我觉得舒适使用正则表达式或 NSScanner。有人能解决这个问题吗?

一个建议是简单地查找开始和结束标记字符,这种方法除了非常普通的情况外不会起作用。

例如,这些案例(来自 Perl Cookbook 关于同一主题的章节)将打破这种方法:

<IMG SRC = "foo.gif" ALT = "A > B">


<!-- <A comment> -->


<script>if (a<b && a>c)</script>


<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
103262 次浏览

I would imagine the safest way would just be to parse for <>s, no? Loop through the entire string, and copy anything not enclosed in <>s to a new string.

Take a look at NSXMLParser. It's a SAX-style parser. You should be able to use it to detect tags or other unwanted elements in the XML document and ignore them, capturing only pure text.

Here's a blog post that discusses a couple of libraries available for stripping HTML http://sugarmaplesoftware.com/25/strip-html-tags/ Note the comments where others solutions are offered.

If you want to get the content without the html tags from the web page (HTML document) , then use this code inside the UIWebViewDidfinishLoading delegate method.

  NSString *myText = [webView stringByEvaluatingJavaScriptFromString:@"document.documentElement.textContent"];

If you are willing to use Three20 framework, it has a category on NSString that adds stringByRemovingHTMLTags method. See NSStringAdditions.h in Three20Core subproject.

use this

NSString *myregex = @"<[^>]*>"; //regex to remove any html tag


NSString *htmlString = @"<html>bla bla</html>";
NSString *stringWithoutHTML = [hstmString stringByReplacingOccurrencesOfRegex:myregex withString:@""];

don't forget to include this in your code : #import "RegexKitLite.h" here is the link to download this API : http://regexkit.sourceforge.net/#Downloads

A quick and "dirty" (removes everything between < and >) solution, works with iOS >= 3.2:

-(NSString *) stringByStrippingHTML {
NSRange r;
NSString *s = [[self copy] autorelease];
while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
s = [s stringByReplacingCharactersInRange:r withString:@""];
return s;
}

I have this declared as a category os NSString.

#import "RegexKitLite.h"


string text = [html stringByReplacingOccurrencesOfRegex:@"<[^>]+>" withString:@""]

This NSString category uses the NSXMLParser to accurately remove any HTML tags from an NSString. This is a single .m and .h file that can be included into your project easily.

https://gist.github.com/leighmcculloch/1202238

You then strip html by doing the following:

Import the header:

#import "NSString_stripHtml.h"

And then call stripHtml:

NSString* mystring = @"<b>Hello</b> World!!";
NSString* stripped = [mystring stripHtml];
// stripped will be = Hello World!!

This also works with malformed HTML that technically isn't XML.

I've extended the answer by m.kocikowski and tried to make it a bit more efficient by using an NSMutableString. I've also structured it for use in a static Utils class (I know a Category is probably the best design though), and removed the autorelease so it compiles in an ARC project.

Included here in case anybody finds it useful.

.h

+ (NSString *)stringByStrippingHTML:(NSString *)inputString;

.m

+ (NSString *)stringByStrippingHTML:(NSString *)inputString
{
NSMutableString *outString;


if (inputString)
{
outString = [[NSMutableString alloc] initWithString:inputString];


if ([inputString length] > 0)
{
NSRange r;


while ((r = [outString rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
{
[outString deleteCharactersInRange:r];
}
}
}


return outString;
}
UITextView *textview= [[UITextView alloc]initWithFrame:CGRectMake(10, 130, 250, 170)];
NSString *str = @"This is <font color='red'>simple</font>";
[textview setValue:str forKey:@"contentToHTMLString"];
textview.textAlignment = NSTextAlignmentLeft;
textview.editable = NO;
textview.font = [UIFont fontWithName:@"vardana" size:20.0];
[UIView addSubview:textview];

work fine for me

Extending this more from m.kocikowski's and Dan J's answers with more explanation for newbies

1# First you have to create objective-c-categories to make the code useable in any class.

.h

@interface NSString (NAME_OF_CATEGORY)


- (NSString *)stringByStrippingHTML;


@end

.m

@implementation NSString (NAME_OF_CATEGORY)


- (NSString *)stringByStrippingHTML
{
NSMutableString *outString;
NSString *inputString = self;


if (inputString)
{
outString = [[NSMutableString alloc] initWithString:inputString];


if ([inputString length] > 0)
{
NSRange r;


while ((r = [outString rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
{
[outString deleteCharactersInRange:r];
}
}
}


return outString;
}


@end

2# Then just import the .h file of the category class you've just created e.g.

#import "NSString+NAME_OF_CATEGORY.h"

3# Calling the Method.

NSString* sub = [result stringByStrippingHTML];
NSLog(@"%@", sub);

result is NSString I want to strip the tags from.

This is the modernization of m.kocikowski answer which removes whitespaces:

@implementation NSString (StripXMLTags)


- (NSString *)stripXMLTags
{
NSRange r;
NSString *s = [self copy];
while ((r = [s rangeOfString:@"<[^>]+>\\s*" options:NSRegularExpressionSearch]).location != NSNotFound)
s = [s stringByReplacingCharactersInRange:r withString:@""];
return s;
}


@end

You can use like below

-(void)myMethod
{


NSString* htmlStr = @"<some>html</string>";
NSString* strWithoutFormatting = [self stringByStrippingHTML:htmlStr];


}


-(NSString *)stringByStrippingHTML:(NSString*)str
{
NSRange r;
while ((r = [str rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location     != NSNotFound)
{
str = [str stringByReplacingCharactersInRange:r withString:@""];
}
return str;
}

Here's a more efficient solution than the accepted answer:

- (NSString*)hp_stringByRemovingTags
{
static NSRegularExpression *regex = nil;
static dispatch_once_t onceToken;
dispatch_once(&onceToken, ^{
regex = [NSRegularExpression regularExpressionWithPattern:@"<[^>]+>" options:kNilOptions error:nil];
});


// Use reverse enumerator to delete characters without affecting indexes
NSArray *matches =[regex matchesInString:self options:kNilOptions range:NSMakeRange(0, self.length)];
NSEnumerator *enumerator = matches.reverseObjectEnumerator;


NSTextCheckingResult *match = nil;
NSMutableString *modifiedString = self.mutableCopy;
while ((match = [enumerator nextObject]))
{
[modifiedString deleteCharactersInRange:match.range];
}
return modifiedString;
}

The above NSString category uses a regular expression to find all the matching tags, makes a copy of the original string and finally removes all the tags in place by iterating over them in reverse order. It's more efficient because:

  • The regular expression is initialised only once.
  • A single copy of the original string is used.

This performed well enough for me but a solution using NSScanner might be more efficient.

Like the accepted answer, this solution doesn't address all the border cases requested by @lfalin. Those would be require much more expensive parsing which the average use case most likely doesn't need.

Without a loop (at least on our side) :

- (NSString *)removeHTML {


static NSRegularExpression *regexp;
static dispatch_once_t onceToken;
dispatch_once(&onceToken, ^{
regexp = [NSRegularExpression regularExpressionWithPattern:@"<[^>]+>" options:kNilOptions error:nil];
});


return [regexp stringByReplacingMatchesInString:self
options:kNilOptions
range:NSMakeRange(0, self.length)
withTemplate:@""];
}

following is the accepted answer, but instead of category, it is simple helper method with string passed into it. (thank you m.kocikowski)

-(NSString *) stringByStrippingHTML:(NSString*)originalString {
NSRange r;
NSString *s = [originalString copy];
while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
s = [s stringByReplacingCharactersInRange:r withString:@""];
return s;
}
NSAttributedString *str=[[NSAttributedString alloc] initWithData:[trimmedString dataUsingEncoding:NSUTF8StringEncoding] options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]} documentAttributes:nil error:nil];

I have following the accepted answer by m.kocikowski and modified is slightly to make use of an autoreleasepool to cleanup all of the temporary strings that are created by stringByReplacingCharactersInRange

In the comment for this method it states, /* Replace characters in range with the specified string, returning new string. */

So, depending on the length of your XML you may be creating a huge pile of new autorelease strings which are not cleaned up until the end of the next @autoreleasepool. If you are unsure when that may happen or if a user action could repeatedly trigger many calls to this method before then you can just wrap this up in an @autoreleasepool. These can even be nested and used within loops where possible.

Apple's reference on @autoreleasepool states this... "If you write a loop that creates many temporary objects. You may use an autorelease pool block inside the loop to dispose of those objects before the next iteration. Using an autorelease pool block in the loop helps to reduce the maximum memory footprint of the application." I have not used it in the loop, but at least this method cleans up after itself now.

- (NSString *) stringByStrippingHTML {
NSString *retVal;
@autoreleasepool {
NSRange r;
NSString *s = [[self copy] autorelease];
while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound) {
s = [s stringByReplacingCharactersInRange:r withString:@""];
}
retVal = [s copy];
}
// pool is drained, release s and all temp
// strings created by stringByReplacingCharactersInRange
return retVal;
}

Here's the swift version :

func stripHTMLFromString(string: String) -> String {
var copy = string
while let range = copy.rangeOfString("<[^>]+>", options: .RegularExpressionSearch) {
copy = copy.stringByReplacingCharactersInRange(range, withString: "")
}
copy = copy.stringByReplacingOccurrencesOfString("&nbsp;", withString: " ")
copy = copy.stringByReplacingOccurrencesOfString("&amp;", withString: "&")
return copy
}

Another one way:

Interface:

-(NSString *) stringByStrippingHTML:(NSString*)inputString;

Implementation

(NSString *) stringByStrippingHTML:(NSString*)inputString
{
NSAttributedString *attrString = [[NSAttributedString alloc] initWithData:[inputString dataUsingEncoding:NSUTF8StringEncoding] options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)} documentAttributes:nil error:nil];
NSString *str= [attrString string];


//you can add here replacements as your needs:
[str stringByReplacingOccurrencesOfString:@"[" withString:@""];
[str stringByReplacingOccurrencesOfString:@"]" withString:@""];
[str stringByReplacingOccurrencesOfString:@"\n" withString:@""];


return str;
}

Realization

cell.exampleClass.text = [self stringByStrippingHTML:[exampleJSONParsingArray valueForKey: @"key"]];

or simple

NSString *myClearStr = [self stringByStrippingHTML:rudeStr];

An updated answer for @m.kocikowski that works on recent iOS versions.

-(NSString *) stringByStrippingHTMLFromString:(NSString *)str {
NSRange range;
while ((range = [str rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
str = [str stringByReplacingCharactersInRange:range withString:@""];
return str;

}