How do sites like kayak.com aggregate content?

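Hello, I've been mulling over an idea for a new project, and I was wondering whether anyone knows how a service like Kayak.com is able to collect data from so many sources so quickly and accurately. More specifically, do you think Kayak.com is interacting with APIs, or is it crawling/scraping airline and hotel websites to fulfill users' requests? I know there isn't one right answer for this sort of thing, but I'm curious what others think would be a good way to approach the problem. If it helps, pretend you were going to create kayak.com tomorrow... where would your data come from?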


I know of only three ways to get data from websites.

RSS Feeds - We use RSS feeds a lot at my company to integrate existing sites' data with our apps. It's fast, and most sites already have an RSS feed available. The problem is that not all sites implement the RSS standard properly, so if you're pulling data from many RSS feeds across many sites, make sure you write your code so that you can add exceptions and filters easily.

APIs - These are nice if they are designed well and have all the information you need. However, that's not always the case, and if the sites are not using a standard API format then you'll have to support multiple APIs.

Web Scraping - This method would be the most unreliable as well as the most expensive to maintain. But if you're left with nothing else it can be done.
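To illustrate the "exceptions and filters" advice for RSS feeds, here is a minimal sketch using only Python's standard library. The site name `sloppy-site`, the `FIXUPS` registry and the `drop_items_without_link` rule are made up for illustration; the idea is just that per-site quirks live in one table instead of being scattered through the parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Item:
    title: str
    link: str


# A fix-up receives the raw <item> element and returns it (possibly patched),
# or None to filter the item out.
Fixup = Callable[[ET.Element], Optional[ET.Element]]


def drop_items_without_link(elem: ET.Element) -> Optional[ET.Element]:
    # Hypothetical rule for a feed that sometimes omits <link>.
    return elem if elem.findtext("link") else None


# Hypothetical registry: site name (our own label) -> list of fix-ups.
FIXUPS: dict[str, list[Fixup]] = {
    "sloppy-site": [drop_items_without_link],
}


def parse_feed(site: str, xml_text: str) -> list[Item]:
    """Parse RSS <item> elements, applying any fix-ups registered for the site."""
    root = ET.fromstring(xml_text)
    items = []
    for elem in root.iter("item"):
        current: Optional[ET.Element] = elem
        for fix in FIXUPS.get(site, []):
            current = fix(current)
            if current is None:
                break
        if current is None:
            continue  # filtered out by a site-specific rule
        items.append(
            Item(
                title=(current.findtext("title") or "").strip(),
                link=(current.findtext("link") or "").strip(),
            )
        )
    return items
```

Adding support for another misbehaving feed then means adding one entry to `FIXUPS` rather than touching `parse_feed`.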

They use a software package like ITA Software, which is one of the companies Google is in the process of picking up.

I work in the travel industry as a software architect / project lead on precisely the kind of project you describe. In our region we work with suppliers directly, but for outgoing we connect to several aggregators.

To answer your question... some data you have, some you get in various ways, and some you have to torture and twist until it confesses.

What's your angle?

The questions you have to ask are... Do you want to sell advertising like Kayak, or do you take a cut like Expedia? Are you into search or into selling travel services? Do you target a niche (for example, just air travel) or everything (accommodation, airlines, rent-a-car, additional services like transport/sightseeing/conferences etc.)? Do you target a region (the US or part of the US) or the world? How deep do you go: do you just show several sites on a single screen, or do you bundle different services together and package them dynamically?

Getting the data

If you're going with the Kayak business model, you technically don't need a site's permission... but a lot of sites have affiliate programs with IFrames or other simple ways to direct the customer to their site. On the plus side, you don't have to deal with payments, complaints, or the travelers themselves. As for the cons... if you want to compare prices yourself and present the cheapest option to the user, you'll have to integrate on a deeper level, and that means APIs and web scraping.

As for web scraping... avoid it. It sucks. Really. Just don't do it. Trust me on this one. Unfortunately, some things, like low-cost carriers, you can't get without web scraping. Low-cost airlines live off value-added services. If the user doesn't see their website, they don't sell extra stuff, and they don't earn anything. Therefore, they don't have affiliates, they don't offer APIs, and they change their site layout almost constantly. However, there are companies which earn a living by scraping low-cost carriers' sites and wrapping them in nice APIs. If you can afford them, you can give your users a cost comparison of low-cost flights, and that's huge.

On the other hand, there are "normal" carriers which offer APIs. It's not that big of a problem to get to airlines since they're all united under IATA; basically, you buy from IATA, and IATA distributes the money to carriers. However, you probably don't want to connect directly to the carrier network. They have web services and SOAP these days, but believe me when I say that there are SOAP protocols which are just insanely thin wrappers around a text prompt through which you interact with a mainframe using an 80s-style protocol (think of a Unix prompt where you're billed per command, and it takes about 20 commands to do one search). That's why you probably want to connect to somebody a bit further down the food chain, with a better API.

Airlines are thus at both extremes of the Gaussian curve: on one side are individual suppliers, and on the other highly centralized systems where you implement one API and you're able to fly anywhere in the world. Accommodation and the rest of the travel products are in between. There are several big players which aggregate hotels, and a ton of small suppliers with a lot of aggregators which cover only part of the spectrum. For example, you can rent a lighthouse and it's not even that expensive, but you won't be able to compare the prices of different lighthouses in one place.

If you're into the Kayak business model, you'll probably end up scraping websites. If you're into integrating different providers, you'll often work with APIs, some of which are pretty good, and most of which are tolerable. I haven't worked with RSS, but there's not a lot of difference between RSS and web scraping. There is also a fourth option not mentioned in Jeff's answer... the one where you get your data nightly, for example as .CSV files through FTP and similar.
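To make the "different providers, different formats" point concrete, here is a rough sketch of normalizing an API response and a nightly CSV drop into one common record. Everything provider-specific here is an assumption for illustration: the JSON shape, the CSV column names, and the `Offer` fields are not taken from any real supplier. Fetching the file over FTP is left out; the function takes the CSV text once it has been downloaded.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Offer:
    provider: str
    hotel_id: str
    price_cents: int
    currency: str


def to_cents(amount: str) -> int:
    # Decimal avoids float rounding surprises; sub-cent fractions are truncated.
    return int(Decimal(amount) * 100)


def offers_from_api(payload: dict) -> list[Offer]:
    # Assumed JSON shape: {"results": [{"id": "H1", "price": "30.00", "cur": "eur"}]}
    return [
        Offer("api_provider", r["id"], to_cents(r["price"]), r["cur"].upper())
        for r in payload["results"]
    ]


def offers_from_csv(csv_text: str) -> list[Offer]:
    # Assumed header: hotel_id,price,currency
    offers = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            offers.append(
                Offer(
                    "csv_provider",
                    row["hotel_id"],
                    to_cents(row["price"]),
                    row["currency"].upper(),
                )
            )
        except (KeyError, AttributeError, TypeError, ArithmeticError):
            # Malformed or incomplete row: skip it instead of failing the whole import.
            continue
    return offers
```

Once everything is an `Offer`, search and comparison code no longer needs to know whether a price arrived through an API call or last night's file.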

Life sucks (mini-rant)

And then there's complexity. The more value you want to add, the more complexity you'll have to handle. Can you search accommodations which allow pets? For a hostel which is located less than 5 km from the town center? Are you combining flights, and are you able to guarantee that the traveler will have enough time to get from one airport to another... can you sell the transport in advance? A famous cellist doesn't want to part from his precious 18th century cello; can you sell him another seat for the cello (yep, not making this one up)?
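Even the "enough time between flights" question needs explicit rules. A minimal sketch, assuming made-up minimum connection times (one hour within the same airport, three hours when the traveler has to change airports); real values depend on the airports and carriers involved:

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, purely for illustration.
MIN_SAME_AIRPORT = timedelta(hours=1)
MIN_AIRPORT_CHANGE = timedelta(hours=3)


def connection_is_feasible(
    arrival: datetime,
    arrival_airport: str,
    departure: datetime,
    departure_airport: str,
) -> bool:
    """Return True if the layover is at least the assumed minimum connection time."""
    if arrival_airport == departure_airport:
        required = MIN_SAME_AIRPORT
    else:
        required = MIN_AIRPORT_CHANGE
    return departure - arrival >= required
```

In practice both timestamps should be timezone-aware (or both converted to UTC) before they are compared.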

Want to compare prices? Sure, the room is EUR 30 per night. But you can either get one double for 30 and one single for 20, or you can get one extra bed in a double and get 70% off for third person. But only if it's a child under 12 years of age; our extra beds are not for adults. And you don't get the price for extra bed in search results - only when you calculate the final price.
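Here is that pricing example as a small sketch for a party of three. The rules are ambiguous as written, so the code makes some loud assumptions: prices are per room per night, the extra-bed price is the per-person share of the double (half the room price) with 70% off, and the discount applies only when the third guest is under 12. A real supplier's rules will differ.

```python
def cheapest_for_three(
    third_guest_age: int,
    double_cents: int = 3000,        # EUR 30 per night for a double
    single_cents: int = 2000,        # EUR 20 per night for a single
    child_age_limit: int = 12,
    extra_bed_discount_pct: int = 70,
) -> tuple[str, int]:
    """Return (option name, price in cents per night) for three guests."""
    options = {"double+single": double_cents + single_cents}

    # The extra bed is only sold for children under the age limit.
    if third_guest_age < child_age_limit:
        # Assumption: the third person pays a discounted per-person share.
        per_person = double_cents // 2
        extra_bed = per_person * (100 - extra_bed_discount_pct) // 100
        options["double+extra_bed"] = double_cents + extra_bed

    best = min(options, key=options.get)
    return best, options[best]
```

Note that the cheaper option only shows up once the guests' ages are known, which is exactly why the price in search results and the final price can differ.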

And don't even get me started on dynamic packaging. Want to sell accommodation + rent-a-car? No problem; integrate with two different providers, and off you go... manually updating the list of locations in the city (from the rent-a-car provider) to match the hotels (from the accommodation provider, who gives you only the city for each hotel). Of course, that's provided you've already matched the lists of cities from the two, since there is no international standard for city codes.
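Part of that city matching can be automated before a human takes over. A rough sketch using `difflib` from the standard library; the normalization rules and the 0.8 cutoff are arbitrary choices for illustration, and anything left unmatched (`None`) still goes to the manual pile:

```python
import difflib
from typing import Optional


def normalize_city(name: str) -> str:
    # Lowercase, treat hyphens as spaces, collapse repeated whitespace.
    return " ".join(name.lower().replace("-", " ").split())


def match_cities(
    hotel_cities: list[str],
    rental_cities: list[str],
    cutoff: float = 0.8,
) -> dict[str, Optional[str]]:
    """Map each hotel-provider city to the closest rental-provider city, or None."""
    index = {normalize_city(c): c for c in rental_cities}
    candidates = list(index)
    matches: dict[str, Optional[str]] = {}
    for city in hotel_cities:
        hit = difflib.get_close_matches(
            normalize_city(city), candidates, n=1, cutoff=cutoff
        )
        matches[city] = index[hit[0]] if hit else None
    return matches
```

String similarity alone will happily confuse same-named cities in different countries, so in a real system you would compare country codes or coordinates as well.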

Unlike a lot of other industries which have many products, the travel industry has many very complex products. Amazon has it easy; selling books and selling potatoes is the same thing; you can even ship them in the same box. They combine easily and aren't assembled from many parts. :)

P.S. Linking to an interesting recent thread on Hacker News with some insider info regarding flights. P.P.S. Recently stumbled on a great albeit rather old blogpost on IATA's NDC protocol with overview of how travel industry is connected and a history lesson how this came to be.

This article says that Kayak was asked to stop scraping a certain airline's pages. That leads me to believe that they probably do scrape sites that they don't have a relationship with (and the data feed that comes with that relationship).

Travelport offers a product called "Universal API" which connects to flights, hotels and car rental companies, and copes with package deals and all the various complexities to do with taxes and exchange rates:

https://developer.travelport.com/app/developer-network/resource-centre-uapi

I've just started using it and it seems fine so far. The queries are a little slow, but then so is every query on every OTA (online travel agent) site.

There are two good APIs I've found from flight comparison websites recently.

There's one from Wego, and one from Skyscanner. Both seem to have a good range and breadth of data from a number of airlines and good documentation too.

Wego pays each time a user clicks through from your app to a booking website, and Skyscanner pays affiliates 50% of 'revenue' (I assume that means the commission they make from airlines).

This is an old post, but I thought I'd just add to it. I'm a data architect who works for a company that feeds these travel sites with content. This company enters into contracts with many hotel brands, individual hotels and other content providers. We aggregate this information, then pass it on to the different channels. They then aggregate it again into their own systems. The large GDS systems are also content providers. Aggregation is done by many methods... matching algorithms (in-house) and keys. Being an aggregation service, we need to communicate on the client level.

Hope this helps! cheers!