Q: Extract text from HTML with XPath

I want to extract text from HTML like this:

<div id="sn1058961" class="soundTrack soda odd">Boom Shack-a-Lak<br />
Written by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a> (as  Stephen Kapur) and Ervin Barrington Woolley<br />
Performed by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a><br   />
Courtesy of Island Records Ltd.<br />
Under license from Universal Music Enterprises<br />

in the following form.

If I use the following XPath:

//*[@id="soundtracks_content"]/div[2]/div[1]/node()[count(preceding-sibling::br)=1][normalize-space()]

I expect it to return one single piece of text, "Written by Apache Indian (as Stephen Kapur) and Ervin Barrington Woolley", but it returns three separate nodes: "Written by", "Apache Indian" and "(as Stephen Kapur) and Ervin Barrington Woolley". Can you suggest another XPath that would extract this as a single text from the HTML above? I have been practising my XPath on this URL: "http://www.imdb.com/title/tt2096672/soundtrack?ref_=tt_ql_trv_7"

I am using import.io to scrape data through XPath, but I am not allowed to enter the entire XPath; I just enter

node()[count(preceding-sibling::br)=1][normalize-space()]

I have pasted a picture of what I am actually doing. Please note that I also need the anchor text.


Answer 1:

With XPath 2.0:

string-join(//*[@id="soundtracks_content"]/div[2]/div[1]//text()[count(preceding-sibling::br)=1][normalize-space()], "")
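If the scraping tool only evaluates XPath 1.0 (where `string-join` is not available), one fallback is to join the pieces after extraction. Also note that a text node nested inside the `<a>` has no `br` siblings of its own, so a `//text()[count(preceding-sibling::br)=1]` step may skip the anchor text. The sketch below does not use XPath at all: it uses Python's standard `html.parser` to group the text of the fragment by how many `<br>` tags precede it, which mirrors the `count(preceding-sibling::br)=1` idea. The names `BrLineCollector`, `text_after_br` and `SAMPLE` are hypothetical, and it assumes the fragment passed in is just the one soundtrack `div` from the question (with a closing `</div>` added).

```python
from html.parser import HTMLParser

# The fragment from the question; the closing </div> is an assumption.
SAMPLE = """<div id="sn1058961" class="soundTrack soda odd">Boom Shack-a-Lak<br />
Written by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a> (as  Stephen Kapur) and Ervin Barrington Woolley<br />
Performed by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a><br   />
Courtesy of Island Records Ltd.<br />
Under license from Universal Music Enterprises<br />
</div>"""


class BrLineCollector(HTMLParser):
    """Collect text chunks, starting a new group at every <br> tag."""

    def __init__(self):
        super().__init__()
        self.lines = [[]]  # one list of text chunks per <br>-delimited group

    def handle_starttag(self, tag, attrs):
        # Self-closing <br /> is also routed here by the default handler.
        if tag == "br":
            self.lines.append([])

    def handle_data(self, data):
        # Text inside nested tags such as <a> arrives here too,
        # so the anchor text stays in the same group.
        self.lines[-1].append(data)


def text_after_br(html, br_count):
    """Return the text preceded by exactly `br_count` <br> tags, whitespace-normalized."""
    parser = BrLineCollector()
    parser.feed(html)
    parser.close()
    if br_count >= len(parser.lines):
        return ""
    return " ".join("".join(parser.lines[br_count]).split())


print(text_after_br(SAMPLE, 1))
```

Because every `<br>` in the input is counted, this only behaves like the XPath predicate when the input is a single flat `div` such as the one above.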

Tags: xpath, imdb