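How to parse freeform street/postal address out of text, and into components

We do business largely in the United States and are trying to improve user experience by combining all the address fields into a single text area. But there are a few problems:

  • The address the user types may not be correct or in a standard format
  • The address must be separated into parts (street, city, state, etc.) to process credit card payments
  • Users may enter more than just their address (like their name or company with it)
  • Google can do this, but the Terms of Service and query limits are prohibitive, especially on a tight budget

Apparently, this is a common question:

Is there a way to isolate an address from the text around it and break it into pieces? Is there a regular expression to parse addresses?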


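I saw this question a lot when I worked for an address verification company. I'm posting the answer here to make it more accessible to programmers who are searching around with the same question. The company I was at processed billions of addresses, and we learned a lot in the process.

First, we need to understand a few things about addresses.

Addresses are not regular

This means that regular expressions are out. I've seen it all, from simple regular expressions that match addresses in a very specific format, to this: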

(([a-zA-Z|\s+]{1,5}){1,2})?([\s|\,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|\,|.|\;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|\,|.]+)?\b(AL|AZ|CA|CO|DC|DE|FL|GA|GU|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MO|MS|NC|ND|NE|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|TX|WV|WY)([\s|\,|.]+)?(\s+\d{5})?([\s|\,|.]+)

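... to this, where a 900+ line class file generates a supermassive regular expression on the fly to match even more. I don't recommend these (for example, a fiddle of the above regex makes plenty of mistakes). There isn't an easy magic formula to get this to work. In theory, it's not possible to match addresses with a regular expression.

USPS Publication 28 documents the many possible address formats, with all their keywords and variations. Worst of all, addresses are often ambiguous. Words can mean more than one thing ("St" can be "Saint" or "Street"), and there are words that I'm pretty sure they invented. (Who knew that "Stravenue" was a street suffix?)

You'd need some code that really understands addresses, and if that code does exist, it's a trade secret. But you could probably roll your own, if you're really into that.

Addresses come in unexpected shapes and sizes

Here are some contrived (but complete) addresses: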

1)  102 main street
Anytown, state


2)  400n 600e #2, 52173


3)  p.o. #104 60203

Even these are possibly valid:

4)  829 LKSDFJlkjsdflkjsdljf Bkpw 12345


5)  205 1105 14 90210

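Obviously, these are not standardized. Punctuation and line breaks are not guaranteed. Here's what's going on:

  1. Number 1 is complete because it contains a street address and a city and state. With that information, there's enough to identify the address, and it can be considered "deliverable" (with some standardization).

  2. Number 2 is complete because it also contains a street address (with a secondary/unit number) and a 5-digit ZIP code, which is enough to identify an address.

  3. Number 3 is a complete post office box format, as it contains a ZIP code.

  4. Number 4 is also complete because the ZIP code is unique, meaning that a private entity or corporation has purchased that address space. A unique ZIP code is for high-volume or concentrated delivery spaces. Anything addressed to ZIP code 12345 goes to General Electric in Schenectady, NY. This example won't reach anyone in particular, but the USPS would still be able to deliver it.

  5. Believe it or not, number 5 is also complete. With just those numbers, the full address can be discovered when parsed against a database of all possible addresses. Filling in the missing directionals, secondary designator, and ZIP+4 code is trivial when you see each number as a component. Here's what it looks like, fully expanded and standardized:

205 N 1105 W Apt 14

Beverly Hills CA 90210-5221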
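To make the point concrete, here is a minimal Python sketch. The pattern `NAIVE_ADDRESS` is an assumption written for this illustration (it is not the regex quoted earlier, nor taken from any library): it accepts one conventional comma-delimited layout and is then run against the five contrived addresses above.

```python
import re

# Deliberately naive pattern (hypothetical, for illustration only):
# house number, street words, comma, city, comma, 2-letter state, 5-digit ZIP.
NAIVE_ADDRESS = re.compile(
    r"^\d+\s+[A-Za-z0-9 .]+,\s*[A-Za-z .]+,\s*[A-Z]{2}\s+\d{5}$"
)

# The five contrived (but complete) addresses from the list above,
# flattened to one line each.
SAMPLES = [
    "102 main street Anytown, state",
    "400n 600e #2, 52173",
    "p.o. #104 60203",
    "829 LKSDFJlkjsdflkjsdljf Bkpw 12345",
    "205 1105 14 90210",
]


def naive_matches(text: str) -> bool:
    """Return True when the naive pattern accepts the whole string."""
    return NAIVE_ADDRESS.match(text) is not None


if __name__ == "__main__":
    # A conventionally formatted address is accepted...
    print(naive_matches("714 S OAK ST, RUMTOWN, VA 59201"))
    # ...but every one of the complete addresses above is rejected.
    for sample in SAMPLES:
        print(sample, "->", naive_matches(sample))
```

Loosening the pattern until all five pass would also let through plenty of text that is not an address at all, which is the trade-off that makes a pure regex approach so fragile.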

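Address data is not your own

In most countries that provide official address data to licensed vendors, the address data itself belongs to the governing agency. In the US, the USPS owns the addresses. The same is true for Canada Post, Royal Mail, and others, though each country enforces or defines ownership a bit differently. Knowing this is important, since it usually forbids reverse-engineering the address database. You have to be careful how you acquire, store, and use the data.

Google Maps is a common go-to for quick address fixes, but the TOS is rather prohibitive; for example, you can't use their data or APIs without showing a Google Map, and for non-commercial purposes only (unless you pay), and you can't store the data (except for temporary caching). Makes sense. Google's data is some of the best in the world. However, Google Maps does not verify the address. If an address does not exist, it will still show you where the address would be if it did exist (try it on your own street, using a house number that you know doesn't exist). This is useful sometimes, but be aware of that.

Nominatim's usage policy is similarly limiting, especially for high-volume and commercial use, and the data is mostly drawn from free sources, so it isn't as well maintained (such is the nature of open projects). However, it may still suit your needs. It is supported by a great community.

The USPS itself has an API, but it goes down a lot and comes with no guarantees or support. It might also be hard to use. Some people use it sparingly with no problems. But it's easy to miss that the USPS requires that you use their API only for confirming addresses to ship through them.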

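People expect addresses to be hard

Unfortunately, we've conditioned our society to expect addresses to be complicated. There are dozens of good UX articles all over the Internet about this, but the fact is, if you have an address form with individual fields, that's what users expect, even though it makes things harder for edge-case addresses that don't fit the format the form is expecting, or maybe the form requires a field it shouldn't. Or users don't know where to put a certain part of their address.

I could go on and on about the bad UX of checkout forms these days, but instead I'll just say that combining the address into a single field will be a welcome change: folks will be able to type their address how they see fit, rather than trying to figure out your lengthy form. However, this change will be unexpected, and users may find it a little jarring at first. Just be aware of that.

Part of this pain can be alleviated by putting the country field out front, before the address. When they fill out the country field first, you know how to make your form appear. Maybe you have a good way to deal with single-field US addresses, so if they select United States, you can reduce your form to a single field, and otherwise show the component fields. Just things to think about!

Now we know why it's hard; what can you do about it?

The USPS licenses vendors through a process called CASS™ Certification to provide verified addresses to customers. These vendors have access to the USPS database, updated monthly. Their software must conform to rigorous standards to be certified, and they don't often require agreement to such limiting terms as discussed above.

There are many CASS-Certified companies that can process lists or have APIs: Melissa Data, Experian QAS, and SmartyStreets, to name a few.

(Due to getting flak for "advertising", I've truncated my answer at this point. It's up to you to find a solution that works for you.)

The Truth: Really, folks, I don't work at any of these companies. It's not an advertisement.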

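There are many street address parsers. They come in two basic flavors: ones that have databases of place names and street names, and ones that don't.

A regular expression street address parser can get up to about a 95% success rate without much trouble. Then you start hitting the unusual cases. The Perl one in CPAN, "Geo::StreetAddress::US", is about that good. There are Python and JavaScript ports of it, all open source. I have an improved version in Python which moves the success rate up slightly by handling more cases. To get the last 3% right, though, you need databases to help with disambiguation.

A database with 3-digit ZIP codes and US state names and abbreviations is a big help. When a parser sees a consistent postal code and state name, it can start to lock on to the format. This works very well for the US and UK.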
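As a rough sketch of that consistency check, the Python below compares the state code at the end of a line with the state implied by the ZIP code's first three digits. The table `ZIP3_TO_STATE` and the function `state_zip_consistent` are hypothetical names for this illustration, and the table is a tiny stand-in built from sample addresses that appear elsewhere in this thread; a real one would cover every 3-digit prefix.

```python
import re
from typing import Optional

# Tiny illustrative table (assumption): 3-digit ZIP prefix -> state code.
ZIP3_TO_STATE = {
    "019": "MA",  # Saugus, MA 01906
    "199": "DE",  # Dover, DE 19901
    "923": "CA",  # Redlands, CA 92373
    "954": "CA",  # Sebastopol, CA 95472
}

# Two-letter state code followed by a 5-digit ZIP (optionally ZIP+4) at the end.
TAIL = re.compile(r"\b([A-Z]{2})[ ,]+(\d{5})(?:-\d{4})?\s*$")


def state_zip_consistent(address: str) -> Optional[bool]:
    """True/False when state and ZIP prefix agree/disagree.

    Returns None when the line has no state+ZIP tail or the prefix
    is missing from the (deliberately incomplete) table.
    """
    match = TAIL.search(address.upper())
    if match is None:
        return None
    state, zip5 = match.group(1), match.group(2)
    expected = ZIP3_TO_STATE.get(zip5[:3])
    if expected is None:
        return None
    return expected == state


if __name__ == "__main__":
    print(state_zip_consistent("380 New York St, Redlands, CA 92373"))
    print(state_zip_consistent("580 North Dupont Highway, Dover, MD 19901"))
    print(state_zip_consistent("714 S OAK ST"))
```

A `True` result gives the parser a reason to trust that the last two tokens really are state and ZIP; a `False` result is a hint that something was mistyped or misread.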

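Proper street address parsing starts from the end and works backwards. That's how the USPS systems do it. Addresses are least ambiguous at the end, where country names, city names, and postal codes are relatively easy to recognize. Street names can usually be isolated. Locations on streets are the most complex to parse; there you encounter things such as "Fifth Floor" and "Staples Pavillion". That's when a database is a big help.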
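A minimal Python sketch of the end-first idea follows. It is an illustration under stated assumptions, not the USPS method: `parse_from_end`, the two patterns, and the deliberately partial `KNOWN_STATES` set are all hypothetical. It peels off the ZIP code, then a recognized state code, and only then splits what is left into street and city.

```python
import re
from typing import Dict

ZIP_AT_END = re.compile(r"[\s,]+(\d{5}(?:-\d{4})?)\s*$")
STATE_AT_END = re.compile(r"[\s,]+([A-Za-z]{2})\s*$")

# Deliberately partial set (assumption); a real parser would list every code.
KNOWN_STATES = {"CA", "DE", "MA", "MD", "NY", "VA", "WA"}


def parse_from_end(line: str) -> Dict[str, str]:
    """Peel components off the right-hand end of a single-line address."""
    parts = {"street": "", "city": "", "state": "", "zip": ""}
    rest = line.strip()

    # 1. The ZIP code is the least ambiguous piece, so take it first.
    match = ZIP_AT_END.search(rest)
    if match:
        parts["zip"] = match.group(1)
        rest = rest[: match.start()]

    # 2. A trailing two-letter token counts as a state only if it is a known
    #    code, so the "ST" in "714 S OAK ST" is not mistaken for one.
    match = STATE_AT_END.search(rest)
    if match and match.group(1).upper() in KNOWN_STATES:
        parts["state"] = match.group(1).upper()
        rest = rest[: match.start()]

    # 3. Look for a city only once a state was found; the last comma is
    #    assumed to separate street from city.
    if parts["state"] and "," in rest:
        street, _, city = rest.rpartition(",")
        parts["street"], parts["city"] = street.strip(), city.strip()
    else:
        parts["street"] = rest.strip(" ,")
    return parts


if __name__ == "__main__":
    print(parse_from_end("380 New York St, Redlands, CA 92373"))
    print(parse_from_end("714 S OAK ST"))
```

The secondary-unit cases mentioned above ("Fifth Floor" and similar) all end up inside `street` here; separating them is where a reference database earns its keep.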

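UPDATE: Geocode.xyz now works worldwide.

For the US, Mexico, and Canada, see geocoder.ca.

For example:

Input: something going on near the intersection of main and arthur kill rd new york

Output: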

<geodata>
<latt>40.5123510000</latt>
<longt>-74.2500500000</longt>
<AreaCode>347,718</AreaCode>
<TimeZone>America/New_York</TimeZone>
<standard>
<street1>main</street1>
<street2>arthur kill</street2>
<stnumber/>
<staddress/>
<city>STATEN ISLAND</city>
<prov>NY</prov>
<postal>11385</postal>
<confidence>0.9</confidence>
</standard>
</geodata>

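You may also check the results in the web interface, or get output as JSON or JSONP, e.g. "I'm looking for restaurants around 123 Main Street, New York".

No code? For shame!

Here is a simple JavaScript address parser. It's pretty awful for every single reason that Matt gives in his dissertation above (which I almost 100% agree with: addresses are complex types, and humans make mistakes; better to outsource and automate this, when you can afford to).

But rather than cry, I decided to try:

This code works OK for parsing most Esri results for findAddressCandidate, and also with some other (reverse) geocoders that return single-line addresses where street/city/state are delimited by commas. You can extend it if you want, or write country-specific parsers. Or just use this as a case study of how challenging this exercise can be, or of how lousy I am at JavaScript. I admit I only spent about thirty minutes on this (future iterations could add caches, ZIP validation, and state lookups as well as user location context), but it worked for my use case: the end user sees a form that parses the geocode search response into 4 textboxes. If address parsing comes out wrong (which is rare unless the source data was poor), it's no big deal: the user gets to verify and fix it! (For automated solutions, though, you could either discard/ignore the result or flag it as an error, so a developer can either support the new format or fix the source data.)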

/*
address assumptions:
- US addresses only (probably want separate parser for different countries)
- No country code expected.
- if last token is a number it is probably a postal code
-- 5 digit number means more likely
- if last token is a hyphenated string it might be a postal code
-- if both sides are numeric, and in form #####-#### it is more likely
- if city is supplied, state will also be supplied (city names not unique)
- zip/postal code may be omitted even if has city & state
- state may be two-char code or may be full state name.
- commas:
-- last comma is usually city/state separator
-- second-to-last comma is possibly street/city separator
-- other commas are building-specific stuff that I don't care about right now.
- token count:
-- because units, street names, and city names may contain spaces token count highly variable.
-- simplest address has at least two tokens: 714 OAK
-- common simple address has at least four tokens: 714 S OAK ST
-- common full (mailing) address has at least 5-7:
--- 714 OAK, RUMTOWN, VA 59201
--- 714 S OAK ST, RUMTOWN, VA 59201
-- complex address may have a dozen or more:
--- MAGICICIAN SUPPLY, LLC, UNIT 213A, MAGIC TOWN MALL, 13 MAGIC CIRCLE DRIVE, LAND OF MAGIC, MA 73122-3412
*/


var rawtext = $("textarea").val();
var rawlist = rawtext.split("\n");


function ParseAddressEsri(singleLineaddressString) {
  var address = {
    street: "",
    city: "",
    state: "",
    postalCode: ""
  };

  // tokenize by space (retain commas in tokens)
  var tokens = singleLineaddressString.split(/[\s]+/);
  var tokenCount = tokens.length;
  var lastToken = tokens.pop();
  if (
    // if numeric assume postal code (ignore length, for now)
    !isNaN(lastToken) ||
    // if hyphenated assume long zip code, ignore whether numeric, for now
    lastToken.split("-").length - 1 === 1) {
    address.postalCode = lastToken;
    lastToken = tokens.pop();
  }

  if (lastToken && isNaN(lastToken)) {
    if (address.postalCode.length && lastToken.length === 2) {
      // assume state/province code ONLY if had postal code
      // otherwise it could be a simple address like "714 S OAK ST"
      // where "ST" for "street" looks like two-letter state code
      // possibly this could be resolved with registry of known state codes, but meh. (and may collide anyway)
      address.state = lastToken;
      lastToken = tokens.pop();
    }
    if (address.state.length === 0) {
      // check for special case: might have State name instead of State Code.
      var stateNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];

      // check remaining tokens from right-to-left for the first comma
      while (true) {
        lastToken = tokens.pop();
        if (!lastToken) break;
        else if (lastToken.endsWith(",")) {
          // found separator, ignore stuff on left side
          tokens.push(lastToken); // put it back
          break;
        } else {
          stateNameParts.unshift(lastToken);
        }
      }
      address.state = stateNameParts.join(' ');
      lastToken = tokens.pop();
    }
  }

  if (lastToken) {
    // here is where it gets trickier:
    if (address.state.length) {
      // if there is a state, then assume there is also a city and street.
      // PROBLEM: city may be multiple words (spaces)
      // but we can pretty safely assume next-from-last token is at least PART of the city name
      // most cities are single-name. It would be very helpful if we knew more context, like
      // the name of the city user is in. But ignore that for now.
      // ideally would have zip code service or lookup to give city name for the zip code.
      var cityNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];

      // assumption / RULE: street and city must have comma delimiter
      // addresses that do not follow this rule will be wrong only if city has space
      // but don't care because Esri formats put comma before City
      var streetNameParts = [];

      // check remaining tokens from right-to-left for the first comma
      while (true) {
        lastToken = tokens.pop();
        if (!lastToken) break;
        else if (lastToken.endsWith(",")) {
          // found end of street address (may include building, etc. - don't care right now)
          // add token back to end, but remove trailing comma (it did its job)
          tokens.push(lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken);
          streetNameParts = tokens;
          break;
        } else {
          cityNameParts.unshift(lastToken);
        }
      }
      address.city = cityNameParts.join(' ');
      address.street = streetNameParts.join(' ');
    } else {
      // if there is NO state, then assume there is NO city also, just street! (easy)
      // reasoning: city names are not very original (Portland, OR and Portland, ME) so if user wants city they need to store state also (but if you are only ever in Portland, OR, you don't care about city/state)
      // put last token back in list, then rejoin on space
      tokens.push(lastToken);
      address.street = tokens.join(' ');
    }
  }
  // when parsing right-to-left hard to know if street only vs street + city/state
  // hack fix for now is to shift stuff around.
  // assumption/requirement: will always have at least street part; you will never just get "city, state"
  // could possibly tweak this with options or more intelligent parsing&sniffing
  if (!address.city && address.state) {
    address.city = address.state;
    address.state = '';
  }
  if (!address.street) {
    address.street = address.city;
    address.city = '';
  }

  return address;
}

// get list of objects with discrete address properties
var addresses = rawlist
  .filter(function(o) {
    return o.length > 0
  })
  .map(ParseAddressEsri);
$("#output").text(JSON.stringify(addresses));
console.log(addresses);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea>
27488 Stanford Ave, Bowden, North Dakota
380 New York St, Redlands, CA 92373
13212 E SPRAGUE AVE, FAIR VALLEY, MD 99201
1005 N Gravenstein Highway, Sebastopol CA 95472
A. P. Croll &amp; Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947
11522 Shawnee Road, Greenwood, DE 19950
144 Kings Highway, S.W. Dover, DE 19901
Intergrated Const. Services 2 Penns Way Suite 405, New Castle, DE 19720
Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958
Nichols Excavation 2742 Pulaski Hwy, Newark, DE 19711
2284 Bryn Zion Road, Smyrna, DE 19904
VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21
580 North Dupont Highway, Dover, DE 19901
P.O. Box 778, Dover, DE 19903
714 S OAK ST
714 S OAK ST, RUM TOWN, VA, 99201
3142 E SPRAGUE AVE, WHISKEY VALLEY, WA 99281
27488 Stanford Ave, Bowden, North Dakota
380 New York St, Redlands, CA 92373
</textarea>
<div id="output">
</div>

For US address parsing, I prefer using the usaddress package, which is available on pip:

python3 -m pip install usaddress

Usage example:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# address_parser.py
import sys
from json import dumps

from usaddress import RepeatedLabelError, tag


if __name__ == '__main__':
    # Collapse usaddress's fine-grained labels into a handful of output fields.
    tag_mapping = {
        'Recipient': 'recipient',
        'AddressNumber': 'addressStreet',
        'AddressNumberPrefix': 'addressStreet',
        'AddressNumberSuffix': 'addressStreet',
        'StreetName': 'addressStreet',
        'StreetNamePreDirectional': 'addressStreet',
        'StreetNamePreModifier': 'addressStreet',
        'StreetNamePreType': 'addressStreet',
        'StreetNamePostDirectional': 'addressStreet',
        'StreetNamePostModifier': 'addressStreet',
        'StreetNamePostType': 'addressStreet',
        'CornerOf': 'addressStreet',
        'IntersectionSeparator': 'addressStreet',
        'LandmarkName': 'addressStreet',
        'USPSBoxGroupID': 'addressStreet',
        'USPSBoxGroupType': 'addressStreet',
        'USPSBoxID': 'addressStreet',
        'USPSBoxType': 'addressStreet',
        'BuildingName': 'addressStreet',
        'OccupancyType': 'addressStreet',
        'OccupancyIdentifier': 'addressStreet',
        'SubaddressIdentifier': 'addressStreet',
        'SubaddressType': 'addressStreet',
        'PlaceName': 'addressCity',
        'StateName': 'addressState',
        'ZipCode': 'addressPostalCode',
    }
    # The shell splits the address on spaces, so rejoin all arguments.
    raw_address = ' '.join(sys.argv[1:])
    try:
        address, _ = tag(raw_address, tag_mapping=tag_mapping)
    except RepeatedLabelError:
        # Raised when the same label is assigned to separate parts of the
        # string; log the full input for review and emit an empty result.
        with open('failed_address.txt', 'a') as fp:
            fp.write(raw_address + '\n')
        print(dumps({}))
    else:
        print(dumps(dict(address)))

Running address_parser.py:

python3 address_parser.py 9757 East Arcadia Ave. Saugus MA 01906
{"addressStreet": "9757 East Arcadia Ave.", "addressCity": "Saugus", "addressState": "MA", "addressPostalCode": "01906"}

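I'm late to the party, but here is an Excel VBA script I wrote years ago for Australia. It can be easily modified to support other countries. I've made a GitHub repository of the C# code here. I've hosted it on my site and you can download it here: http://jeremythompson.net/Rocks/ParseAddress.xlsm

Strategy

For any country with a postcode that is numeric or can be matched with a regex, my strategy works very well:

  1. First we detect the First Name and Surname, which are assumed to be on the top line. It's easy to skip the name and start with the address by unticking the checkbox (called "Name is top row", as shown below).

  2. Next, it's safe to expect the address, consisting of the Street and Number, to come before the Suburb, with the St, Pde, Ave, Av, Rd, Cres, Loop, etc. acting as a separator.

  3. Detecting the Suburb vs. the State, and even the Country, can trick the most sophisticated parsers, as there can be conflicts. To overcome this I use a PostCode lookup, based on the fact that after stripping the Street and Apartment/Unit numbers as well as the PoBox, Ph, Fax, Mobile, etc., only the PostCode number will remain. This is easy to match with a regex, and then to look up the suburb(s) and country.

    Your national post office service will provide a list of postcodes with Suburbs and States free of charge, which you can store in an Excel sheet, database table, text/json/xml file, etc.

  4. Finally, since some postcodes have multiple Suburbs, we check which suburb appears in the address.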
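Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration only, not a port of the VBA in this answer: `POSTCODES` is a hypothetical three-suburb stand-in for the full table from the post office, and `locate` is a made-up helper name.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical, tiny stand-in for the national table:
# postcode -> list of (suburb, state) pairs sharing that postcode.
POSTCODES: Dict[str, List[Tuple[str, str]]] = {
    "2000": [("SYDNEY", "NSW"), ("HAYMARKET", "NSW"), ("THE ROCKS", "NSW")],
    "3000": [("MELBOURNE", "VIC")],
}

# Australian postcodes are 4 digits; reject runs that are part of a longer number.
FOUR_DIGITS = re.compile(r"(?<!\d)\d{4}(?!\d)")


def locate(remaining_text: str) -> Optional[Dict[str, str]]:
    """Find postcode, suburb and state in the text left over after the
    street, phone, fax and similar lines have been stripped out."""
    upper = remaining_text.upper()
    for postcode in FOUR_DIGITS.findall(upper):
        # Step 4: several suburbs can share a postcode, so keep the one
        # that actually appears in the text.
        for suburb, state in POSTCODES.get(postcode, []):
            if suburb in upper:
                return {"postcode": postcode, "suburb": suburb, "state": state}
    return None


if __name__ == "__main__":
    print(locate("Haymarket NSW 2000"))
    print(locate("Nowhere 9999"))
```

Because the state comes from the table rather than from the text, a missing or misspelled state abbreviation in the input does not matter as long as the postcode and suburb are present.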


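Example

Screenshot of Excel cells

VBA Code

DISCLAIMER: I know this code is not perfect, or even well written, but it's very easy to convert to any programming language and run in any type of application. The strategy is the answer; adapt it to your country and rules, taking the code below as an example: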

Option Explicit


Private Const TopRow As Integer = 0


Public Sub ParseAddress()
Dim strArr() As String
Dim sigRow() As String
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim Stat As String
Dim SpaceInName As Integer
Dim Temp As String
Dim PhExt As String


On Error Resume Next


Temp = ActiveSheet.Range("Address")


'Split info into array
strArr = Split(Temp, vbLf)


'Trim the array
For i = 0 To UBound(strArr)
strArr(i) = VBA.Trim(strArr(i))
Next i


'Remove empty items/rows
ReDim sigRow(LBound(strArr) To UBound(strArr))
For i = LBound(strArr) To UBound(strArr)
If Trim(strArr(i)) <> "" Then
sigRow(j) = strArr(i)
j = j + 1
End If
Next i
ReDim Preserve sigRow(LBound(strArr) To j)


'Find the name (MUST BE ON THE FIRST ROW UNLESS CHECKBOX UNTICKED)
i = TopRow
If ActiveSheet.Shapes("chkFirst").ControlFormat.Value = 1 Then


SpaceInName = InStr(1, sigRow(i), " ", vbTextCompare) - 1


If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
Else
If MsgBox("First Name: " & VBA.Mid$(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
End If


If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
Else
If MsgBox("Surname: " & VBA.Mid(sigRow(i), SpaceInName + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
End If
sigRow(i) = ""
End If


'Find the Street by looking for a "St, Pde, Ave, Av, Rd, Cres, loop, etc"
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
For j = 0 To 19 'Street() defines suffixes 0 through 19
If InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) > 0 Then
    

'Find the position of the street in order to get the suburb
SpaceInName = InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) + Len(Street(j)) - 1
    

'If its a po box then add 5 chars
If VBA.Right(Street(j), 3) = "BOX" Then SpaceInName = SpaceInName + 5
    

If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
Else
If MsgBox("Street Address: " & VBA.Mid(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
End If
'Trim the Street, Number leaving the Suburb if it exists on the same line
'(the original added 2 to a String here, a type mismatch hidden by On Error Resume Next)
sigRow(i) = VBA.Mid(sigRow(i), SpaceInName + 1)
    

GoTo PastAddress:
End If
Next j
End If
Next i
PastAddress:


'Mobile
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
For j = 0 To 3
Temp = Mb(j)
If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
Else
If MsgBox("Mobile: " & VBA.Mid(sigRow(i), Len(Temp) + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
End If
sigRow(i) = ""
GoTo PastMobile:
End If
Next j
End If
Next i
PastMobile:


'Phone
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
For j = 0 To 1
Temp = Ph(j)
If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
            

'TODO: Detect the intl or national extension here.. or if we can from the postcode.
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
Else
If MsgBox("Phone: " & VBA.Mid(sigRow(i), Len(Temp) + 3), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
End If
        

sigRow(i) = ""
GoTo PastPhone:
End If
Next j
End If
Next i
PastPhone:




'Email
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
'replace with regEx search
If InStr(1, sigRow(i), "@", vbTextCompare) And InStr(1, VBA.UCase(sigRow(i)), ".CO", vbTextCompare) Then
Dim email As String
email = sigRow(i)
email = Replace(VBA.UCase(email), "EMAIL:", "")
email = Replace(VBA.UCase(email), "E-MAIL:", "")
email = Replace(VBA.UCase(email), "E:", "")
email = Replace(VBA.UCase(Trim(email)), "E ", "")
email = VBA.LCase(email)
        

If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Email") = email
Else
If MsgBox("Email: " & email, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Email") = email
End If
sigRow(i) = ""
Exit For
End If
End If
Next i


'Now the only remaining items will be the postcode, suburb, country
'there shouldn't be any numbers (eg. from PoBox,Ph,Fax,Mobile) except for the Post Code


'Join the string and filter out the Post Code
Temp = Join(sigRow, vbCrLf)
Temp = Trim(Temp)


For i = 1 To Len(Temp)


Dim postCode As String
postCode = VBA.Mid(Temp, i, 4)
    

'In Australia PostCodes are 4 digits
If VBA.Mid(Temp, i, 1) <> " " And IsNumeric(postCode) Then


If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("PostCode") = postCode
Else
If MsgBox("Post Code: " & postCode, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("PostCode") = postCode
End If


'Lookup the Suburb and State based on the PostCode, the PostCode sheet has the lookup
Dim mySuburbArray As Range
Set mySuburbArray = Sheets("PostCodes").Range("A2:B16670")
    

Dim suburbs As String
For j = 1 To mySuburbArray.Columns(1).Cells.Count
If mySuburbArray.Cells(j, 1) = postCode Then
'Check if the suburb is listed in the address
If InStr(1, UCase(Temp), mySuburbArray.Cells(j, 2), vbTextCompare) > 0 Then


'Set the Suburb and State
ActiveSheet.Range("Suburb") = mySuburbArray.Cells(j, 2)
Stat = mySuburbArray.Cells(j, 3)
ActiveSheet.Range("State") = Stat
                

'Knowing the State - for Australia we can get the telephone Ext
PhExt = PhExtension(VBA.UCase(Stat))
ActiveSheet.Range("PhExt") = PhExt
        

'remove the phone extension from the number
Dim prePhone As String
prePhone = ActiveSheet.Range("Phone")
prePhone = Replace(prePhone, PhExt & " ", "")
prePhone = Replace(prePhone, "(" & PhExt & ") ", "")
prePhone = Replace(prePhone, "(" & PhExt & ")", "")
ActiveSheet.Range("Phone") = prePhone
Exit For
End If
End If
Next j
Exit For
End If
Next i


End Sub


  

Private Function PhExtension(ByVal State As String) As String
Select Case State
Case Is = "NSW"
PhExtension = "02"
Case Is = "QLD"
PhExtension = "07"
Case Is = "VIC"
PhExtension = "03"
Case Is = "NT"
PhExtension = "08"
Case Is = "WA"
PhExtension = "08"
Case Is = "SA"
PhExtension = "08"
Case Is = "TAS"
PhExtension = "03"
End Select
End Function


Private Function Ph(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Ph = "PH"
Case Is = 1
Ph = "PHONE"
'Case Is = 2
'Ph = "P"
End Select
End Function


Private Function Mb(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Mb = "MB"
Case Is = 1
Mb = "MOB"
Case Is = 2
Mb = "CELL"
Case Is = 3
Mb = "MOBILE"
'Case Is = 4
'Mb = "M"
End Select
End Function


Private Function Fax(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Fax = "FAX"
Case Is = 1
Fax = "FACSIMILE"
'Case Is = 2
'Fax = "F"
End Select
End Function


Private Function State(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
State = "NSW"
Case Is = 1
State = "QLD"
Case Is = 2
State = "VIC"
Case Is = 3
State = "NT"
Case Is = 4
State = "WA"
Case Is = 5
State = "SA"
Case Is = 6
State = "TAS"
End Select
End Function


Private Function Street(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Street = " ST"
Case Is = 1
Street = " RD"
Case Is = 2
Street = " AVE"
Case Is = 3
Street = " AV"
Case Is = 4
Street = " CRES"
Case Is = 5
Street = " LOOP"
Case Is = 6
Street = "PO BOX"
Case Is = 7
Street = " STREET"
Case Is = 8
Street = " ROAD"
Case Is = 9
Street = " AVENUE"
Case Is = 10
Street = " CRESENT"
Case Is = 11
Street = " PARADE"
Case Is = 12
Street = " PDE"
Case Is = 13
Street = " LANE"
Case Is = 14
Street = " COURT"
Case Is = 15
Street = " BLVD"
Case Is = 16
Street = "P.O. BOX"
Case Is = 17
Street = "P.O BOX"
Case Is = 18
Street = "PO BOX"
Case Is = 19
Street = "POBOX"
End Select
End Function