Baiduspider FAQ

1. What is Baiduspider?
Baiduspider is the automatic crawler program of the Baidu search engine. It visits pages on the internet and builds them into Baidu's index database so that users can find your website's pages in Baidu search.


2. What is Baiduspider's user-agent?
Different Baidu products use different user-agents:

Product              user-agent
Web search           Baiduspider
Mobile search        Baiduspider
Image search         Baiduspider-image
Video search         Baiduspider-video
News search          Baiduspider-news
Baidu Bookmarks      Baiduspider-favo
Baidu Union          Baiduspider-cpro
Business search      Baiduspider-ads


3. How to tell the PC and mobile web-search UAs apart
Full PC search UA: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Full mobile search UA: Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

PC UA: the keyword Baiduspider/2.0 identifies the PC UA.
Mobile UA: the keywords android and mobile identify a crawl from the mobile crawler, and Baiduspider/2.0 confirms that it is the Baidu crawler.
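
For illustration, here is a minimal Python sketch of the keyword checks above (the helper name classify_baidu_ua is ours, not Baidu's): Baiduspider/2.0 marks a Baidu crawler, and the additional android and mobile keywords mark the mobile crawler.

def classify_baidu_ua(user_agent: str) -> str:
    """Classify a UA string using the keyword rules described above (illustrative only)."""
    ua = user_agent.lower()
    if "baiduspider/2.0" not in ua:
        return "not Baiduspider"
    if "android" in ua and "mobile" in ua:
        return "Baiduspider (mobile search)"
    return "Baiduspider (PC search)"

# Example with the PC UA quoted above
print(classify_baidu_ua("Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"))

Because UA strings can be forged, combine this check with the reverse DNS verification described in question 6.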


4. How much load does Baiduspider place on a website's server?

To index target resources well, Baiduspider needs to maintain a certain amount of crawling on your site. We try not to place an unreasonable burden on websites, and we adjust crawling based on combined factors such as server capacity, site quality and site update frequency. If you believe any of Baiduspider's behaviour is unreasonable, you can report it through the feedback center.

5. Why does Baiduspider keep crawling my site?

Baiduspider continuously crawls pages that are newly created or frequently updated on your site. You can also check Baiduspider's visits in your access logs to make sure no one is impersonating Baiduspider to crawl your site aggressively. If you find Baiduspider crawling your site abnormally, please report it through the complaint platform, and include Baiduspider's access logs for your site if possible so that we can track down and handle the issue.

6. How can I tell whether a crawl is impersonating Baiduspider?

We recommend a reverse DNS lookup to determine whether the crawling IP belongs to Baidu. The verification steps differ by platform; the methods for Linux, Windows and macOS are as follows:

6.1  On Linux, you can reverse-resolve the IP with the host command to judge whether the crawl comes from Baiduspider. Baiduspider's hostnames follow the format *.baidu.com or *.baidu.jp; any other hostname indicates an impostor.
$ host 123.125.66.120
120.66.125.123.in-addr.arpa domain name pointer
baiduspider-123-125-66-120.crawl.baidu.com.


$ host 119.63.195.254
254.195.63.119.in-addr.arpa domain name pointer
BaiduMobaider-119-63-195-254.crawl.baidu.jp.

6.2  On Windows or IBM OS/2, you can reverse-resolve the IP with the nslookup command. Open a command prompt and enter nslookup xxx.xxx.xxx.xxx (the IP address) to resolve it and judge whether the crawl comes from Baiduspider. Baiduspider's hostnames follow the format *.baidu.com or *.baidu.jp; any other hostname indicates an impostor.

6.3  On macOS, you can reverse-resolve the IP with the dig command. Open a terminal and enter dig -x xxx.xxx.xxx.xxx (the IP address) to resolve it and judge whether the crawl comes from Baiduspider. Baiduspider's hostnames follow the format *.baidu.com or *.baidu.jp; any other hostname indicates an impostor.
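
If you check access logs programmatically, the same reverse lookup can be scripted. Below is a minimal Python sketch (standard library only; the helper name is ours) that reverse-resolves an IP and checks whether the hostname ends in .baidu.com or .baidu.jp, as described above.

import socket

def is_baiduspider_ip(ip: str) -> bool:
    """Reverse-resolve the IP and check the hostname suffix (illustrative helper)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # PTR (reverse DNS) lookup
    except socket.herror:
        return False  # no PTR record: treat as an impostor
    return hostname.endswith(".baidu.com") or hostname.endswith(".baidu.jp")

# Example with the IP from the host output above
print(is_baiduspider_ip("123.125.66.120"))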


7. I don't want Baiduspider to visit my website. What should I do?

Baiduspider follows the internet robots protocol. You can use a robots.txt file to block Baiduspider from your entire website, or from certain files on it. Note: blocking Baiduspider from your website means its pages can no longer be found in Baidu search or in any search engine whose search service is provided by Baidu. For how to write robots.txt, see our guide: how to write robots.txt.

You can set different crawl rules for the different user-agents of each product. If you want to block indexing by all Baidu products entirely, simply disallow Baiduspider.

The following robots.txt blocks all crawling from Baidu:
User-agent: Baiduspider
Disallow: /

The following robots.txt blocks all crawling from Baidu but allows image search to crawl the /image/ directory:
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/

Please note: pages crawled by Baiduspider-cpro are not built into the index; it only performs operations agreed with customers, so it does not follow the robots protocol. If Baiduspider-cpro causes you trouble, please contact union1@baidu.com. Pages crawled by Baiduspider-ads are likewise not built into the index; it only performs operations agreed with customers, so it does not follow the robots protocol. If Baiduspider-ads causes you trouble, please contact your customer service representative.

8. Why do my pages still show up in Baidu search even though I have added robots.txt?

Updating the search engine's index database takes time. Although Baiduspider has stopped visiting the pages on your website, it may take several months for page index entries already built in Baidu's database to be cleared. Please also check that your robots configuration is correct.
If your need to be excluded is very urgent, you can also report it through the complaint platform for handling.

9. I want my pages indexed by Baidu but without a cached snapshot. What should I do?

Baiduspider follows the internet meta robots protocol. You can use a meta setting on the page to tell Baidu to index the page but not show a snapshot of it in the search results.
As with robots updates, the search engine's index database takes time to refresh, so even after you have used the meta setting to stop Baidu from showing the page's snapshot, it may take two to four weeks to take effect online if index information for the page has already been built in Baidu's database.
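
As a hedged example of such a meta setting (this assumes the standard noarchive robots meta directive; check Baidu's current webmaster documentation for the exact form it supports), placing the following tag in the page's <head> asks search engines not to show a cached snapshot while still allowing the page to be indexed:

<meta name="robots" content="noarchive">

To apply the restriction to Baidu only, the Baiduspider-specific variant <meta name="Baiduspider" content="noarchive"> can be used instead.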

10. Is Baiduspider's crawling clogging my bandwidth?

Normal crawling by Baiduspider will not clog your website's bandwidth; if this happens, it is likely caused by someone impersonating Baiduspider and crawling maliciously. If you find an agent named Baiduspider crawling your site and clogging your bandwidth, please contact us as soon as possible. You can report the information through the complaint platform; providing your website's access logs for that period will help our analysis.




FAQs of Baiduspider
1. What is Baiduspider?
Baiduspider is the Baidu search engine's crawler program. It visits pages on the internet and builds them into the Baidu index, which enables users to locate your site when they perform a search.

2. What is Baiduspider’s user-agent?
Baidu uses different user-agents for different products:  
Name of Product      User-agent
PC search            Baiduspider
Mobile search        Baiduspider
Image search         Baiduspider-image
Video search         Baiduspider-video
News search          Baiduspider-news
Baidu bookmark       Baiduspider-favo
Baidu Union          Baiduspider-cpro
Business search      Baiduspider-ads
Other search         Baiduspider

3. Will Baiduspider create additional load on my servers?
In order to ensure the search results cover most of your pages, Baiduspider must keep crawling at a certain level. We do our best to avoid increasing the load on your servers, and we adjust the crawl frequency based on combined factors such as your server's capacity, your site's quality and your site's update frequency. If you find any unreasonable access from Baiduspider, please inform us at http://webmaster.baidu.com/feedback/index

4. Why does Baiduspider crawl my site continuously?
In order to present the latest information, Baiduspider crawls new pages and frequently updated pages on your site. Please check your logs to see whether the crawling from Baiduspider is reasonable.
To catch excessive crawling by spammers or other troublemakers who pretend to be Baiduspider, you can also check the log. When you find any abnormal crawling, please inform us at http://webmaster.baidu.com/feedback/index and provide the log entries for Baiduspider.

5. How can I tell whether the crawling is from Baiduspider?
We recommend using a reverse DNS lookup to verify Baiduspider. The verification method differs between Linux, Windows and macOS environments.
Instructions:
5.1  On Linux: run the "host IP" command. Examples:
$ host 123.125.66.120
120.66.125.123.in-addr.arpa domain name pointer
Baiduspider-123-125-66-120.crawl.baidu.com.

$ host 119.63.195.254
254.195.63.119.in-addr.arpa domain name pointer
BaiduMobaider-119-63-195-254.crawl.baidu.jp.
The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.

5.2  On Windows or IBM OS/2: run the "nslookup IP" command. Open the command prompt and enter nslookup xxx.xxx.xxx.xxx (the IP address) to resolve the IP. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.

5.3  On macOS: run the "dig" command. Open a terminal and enter dig -x xxx.xxx.xxx.xxx (the IP address) to resolve the IP. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames.

6. How can I prevent Baiduspider from crawling my site?
Baiduspider follows the robots.txt protocol. You can prevent Baiduspider from crawling your entire site, or specific content, by specifying this in robots.txt. Please note that by doing so, the pages of your site will not be found in Baidu search results or in any other search results provided by Baidu. For details of setting up a robots.txt, please see How to create a robots.txt.

You can set different rules for different user-agents. (Please note that Baiduspider-video does not currently support these rules.) If you prefer to block all of Baidu's user-agents, you can simply block Baiduspider.

The following robots.txt rules block all crawling from Baidu:
User-agent: Baiduspider
Disallow: /

The following robots.txt rules block all crawling from Baidu but allow Baiduspider-image to crawl the /image/ directory only:
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Allow: /image/

Please note that pages crawled by Baiduspider-cpro are not built into the index; Baiduspider-cpro only performs operations agreed with customers, so it does not follow the rules set in robots.txt. If you are not comfortable with Baiduspider-cpro, please contact union1@baidu.com. Pages crawled by Baiduspider-ads are likewise not built into the index; Baiduspider-ads only performs operations agreed with customers, so it does not follow the rules set in robots.txt. If you are not comfortable with Baiduspider-ads, please contact your customer service representative.

7. I've added robots.txt to my site, so why are my site's contents still displayed in the search results?
It takes time for Baidu to update its database. Baidu stops crawling your site once you have added robots.txt, but index entries built previously may require several months to be removed from the search engine's database. Please also make sure your robots.txt is written correctly.
If your request to remove your site from the search engine is urgent, please inform us at http://webmaster.baidu.com/feedback/index

8. How can I ask Baiduspider to index my pages but not show the cached links in the search results?
Baiduspider follows the meta robots protocol. You can use a meta tag to ask Baiduspider to index your pages without showing the cached links in the search results.
As with robots.txt updates, Baidu stops showing the cached links only after it has processed the updated meta tags; it takes 2 to 4 weeks to refresh content that was previously stored in Baidu's database.

9. Will crawling from Baidu lead to bandwidth congestion?
Generally speaking, crawling by Baiduspider will not lead to bandwidth congestion. If it happens, it is probably caused by unauthorized access pretending to be Baiduspider. When you find an agent named Baiduspider making your network busy, please inform us at http://webmaster.baidu.com/feedback/index as soon as possible. Providing the log for that particular time frame will be of great help for us to investigate and analyze the problem.

10. How do I get my websites, including independent sites and blogs, indexed by Baidu?
Baidu indexes sites and pages that meet its requirements for the user search experience.
To help Baiduspider discover your site more quickly, you are also welcome to submit your website address at http://www.baidu.com/search/url_submit.htm
Submitting the homepage is enough; there is no need to submit detailed content pages.
The value of your pages is the only factor that determines whether Baidu indexes them; it has nothing to do with commercial factors such as Baidu Promotion.

11. How can I tell whether my website has been indexed by Baidu? Is the number returned by the site: query equivalent to the real amount indexed?
To check whether your site has been indexed by Baidu, use the site: query. Enter "site:(your domain name)" into the search box, for instance http://www.baidu.com/s?wd=site%3Awww.baidu.com. Your site's pages will be displayed in the results if they have been indexed.
The number returned by the site: query is only an estimate, for reference.

12. How can I prevent my website from being indexed by Baidu?
Baidu strictly complies with the robots.txt protocol. For detailed information, please visit http://www.robotstxt.org/.
You can prevent all of your pages, or part of them, from being crawled and indexed by Baidu with robots.txt. For the specific method, please refer to How to Write Robots.txt.
If you add robots.txt to restrict crawling after Baidu has already indexed your website, it usually takes 48 hours for the updated robots.txt to take effect, after which new pages will not be indexed. Note that it may take several months for content that was indexed before the robots.txt restriction to disappear from the search results.
If you are in urgent need of restricting crawling, you can report it at http://webmaster.baidu.com/feedback/index and we will deal with it as soon as possible.
  
13. How can I report problems with Baiduspider?
If you have any problem with crawling, please contact http://webmaster.baidu.com/feedback/index for further help.
So that we can deal with your feedback effectively and in a timely manner, please make sure that both the problem and the domain name of your website are reported. It is even better if you can provide the crawl log of your website, which helps us find the cause and solve the problem in time.

14. I have set up robots.txt to restrict crawling by Baidu, so why doesn't it take effect?
Baidu strictly complies with the robots.txt protocol, but it only refreshes its record of your robots.txt periodically, so after you add the file Baidu needs some time to stop crawling your site.
If you are in urgent need of restricting crawling, you can report it at http://webmaster.baidu.com/feedback/index.
Also, please check whether your robots.txt is correctly formatted.

15. Why are some pages, such as private pages without links or pages requiring access rights, also indexed by Baidu?
Baiduspider's crawling of pages depends on the links between pages.
Besides the internal links within a site, there are also external links between different sites. Therefore, even if a page is not reachable through internal links, it can still be indexed by the search engine if links on other websites point to it.
Baiduspider has the same access rights as other users, so the spider cannot visit pages that ordinary users cannot. There are two reasons why Baidu may appear to index pages that require access rights:
- There was no access restriction when the spider crawled the content; the access rights were changed only afterwards.
- The content is protected by access rights, but a security hole in the website lets users reach it through special paths. If those paths are publicized on the internet, the spider can follow them and crawl the content.
If you do not want private content to be indexed, you can restrict crawling with robots.txt. You are also welcome to report the issue at http://webmaster.baidu.com/feedback/index for a solution.

16. Why does the amount of my website that is indexed tend to decrease?
- Due to server instability, Baiduspider could not reach pages when it checked for updates and changes, so those pages were deleted temporarily.
- Your website does not meet the requirements for the user search experience.

17. Why did my page disappear from Baidu's search results?
Baidu does not promise that every page can be found in search.
If your page has not been findable in Baidu for a long time, or suddenly disappears from the results, the possible reasons are as follows:
- Your website does not meet the requirements for the user search experience.
- The server your website runs on is unstable, so Baidu removed it temporarily; it will return once the server is stable.
- Some contents of the page do not conform to laws and regulations.
- Other technical problems.
The following claims are incorrect as well as groundless:
- If a website stops paying after participating in Baidu Promotion, it will disappear from Baidu's search results.



