Beautiful Soup - Modifying the Tree

Besides searching the parse tree, BeautifulSoup lets you modify the web document to suit your needs. You can change a tag's properties through attributes such as .name and .string, or add content with methods such as .append(). New tags and strings are created with .new_tag() and .new_string() and can then be added to an existing tag. Other methods, such as .insert(), .insert_before() and .insert_after(), let you place new content at a specific position in your HTML or XML document.

Changing tag names and attributes

Once you have created the soup, it is easy to rename a tag, change the values of its attributes, add new attributes and delete attributes.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="bolder">Very Bold</b>')
>>> tag = soup.b

Renaming the tag, changing an attribute and adding a new attribute work as follows:

>>> tag.name = 'Blockquote'
>>> tag['class'] = 'Bolder'
>>> tag['id'] = 1.1
>>> tag
<Blockquote class="Bolder" id="1.1">Very Bold</Blockquote>

Attributes are deleted as follows:

>>> del tag['class']
>>> tag
<Blockquote id="1.1">Very Bold</Blockquote>
>>> del tag['id']
>>> tag
<Blockquote>Very Bold</Blockquote>
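The same steps can also be run as a standalone script. This is a minimal sketch that assumes the built-in "html.parser" backend; the "quote" and "intro" values are only illustrative.

```python
from bs4 import BeautifulSoup

# "html.parser" is an assumption; any installed parser works here.
soup = BeautifulSoup('<b class="bolder">Very Bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"   # rename the tag
tag["class"] = "quote"    # overwrite an existing attribute (illustrative value)
tag["id"] = "intro"       # add a new attribute (illustrative value)
print(tag)

del tag["class"]          # remove one attribute
print(tag.attrs)          # the remaining attributes, as a dict
print(tag.get("class"))   # .get() returns None instead of raising KeyError
```

Using tag.get() rather than tag["class"] is a convenient way to probe for an attribute that may already have been deleted.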

Modifying .string

You can easily modify the tag's .string attribute:

>>> markup = '<a href="https://www.tutorialspoint.com/index.htm">Must for every <i>Learner</i></a>'
>>> Bsoup = BeautifulSoup(markup)
>>> tag = Bsoup.a
>>> tag.string = "My Favourite spot."
>>> tag
<a href="https://www.tutorialspoint.com/index.htm">My Favourite spot.</a>

As shown above, if the tag contains other tags, they and all of their contents are replaced by the new string.
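To make the replacement visible, the following sketch checks for the nested <i> tag before and after the assignment. It assumes the built-in "html.parser" backend; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.tutorialspoint.com/index.htm">Must for every <i>Learner</i></a>'
soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption
tag = soup.a

had_child = tag.i is not None      # the <i> tag exists before the assignment
tag.string = "My Favourite spot."  # replaces ALL children of <a>

print(had_child)                   # was <i> present beforehand?
print(tag.i)                       # None once the children are replaced
print(len(tag.contents))           # a single string child remains
```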

append()

To add new content to the end of an existing tag, use the tag.append() method. It works much like the append() method of a Python list.

>>> markup = '<a href="https://www.tutorialspoint.com/index.htm">Must for every <i>Learner</i></a>'
>>> Bsoup = BeautifulSoup(markup)
>>> Bsoup.a.append(" Really Liked it")
>>> Bsoup
<html><body><a href="https://www.tutorialspoint.com/index.htm">Must for every <i>Learner</i> Really Liked it</a></body></html>
>>> Bsoup.a.contents
['Must for every ', <i>Learner</i>, ' Really Liked it']
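The list analogy can be checked by looking at .contents before and after the call. A small sketch, assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.tutorialspoint.com/index.htm">Must for every <i>Learner</i></a>'
soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption
a_tag = soup.a

before = len(a_tag.contents)       # one string plus the <i> tag
a_tag.append(" Really Liked it")   # the new string becomes the last child
after = len(a_tag.contents)

print(before, after)
print(repr(str(a_tag.contents[-1])))
```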

NavigableString() and .new_tag()

If you want to add a string to a document, you can do so with append() or with the NavigableString() constructor:

>>> soup = BeautifulSoup("<b></b>")
>>> tag = soup.b
>>> tag.append("Start")
>>>
>>> new_string = NavigableString(" Your")
>>> tag.append(new_string)
>>> tag
<b>Start Your</b>
>>> tag.contents
['Start', ' Your']

Note: If you get a NameError when calling NavigableString(), such as:

NameError: name 'NavigableString' is not defined

simply import the NavigableString class from the bs4 package:

>>> from bs4 import NavigableString

This resolves the error above.

You can also add a comment to an existing tag, or any other subclass of NavigableString; just call the corresponding constructor.

>>> from bs4 import Comment
>>> adding_comment = Comment("Always Learn something Good!")
>>> tag.append(adding_comment)
>>> tag
<b>Start Your<!--Always Learn something Good!--></b>
>>> tag.contents
['Start', ' Your', 'Always Learn something Good!']
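Because a comment prints like a plain string inside .contents, it can help to inspect the types of the children. The sketch below assumes the built-in "html.parser" backend and uses an illustrative variable name, note.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")  # parser choice is an assumption
tag = soup.b

tag.append(NavigableString("Start"))
note = Comment("Always Learn something Good!")
tag.append(note)

print(tag)                                # the comment is rendered as <!--...-->
print(isinstance(note, NavigableString))  # Comment is a NavigableString subclass
print([type(child).__name__ for child in tag.contents])
```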

To create an entirely new tag (rather than adding a string to an existing one), use the factory method BeautifulSoup.new_tag():

>>> soup = BeautifulSoup("<b></b>")
>>> Otag = soup.b
>>>
>>> Newtag = soup.new_tag("a", href="https://www.tutorialspoint.com")
>>> Otag.append(Newtag)
>>> Otag
<b><a href="https://www.tutorialspoint.com"></a></b>

Only the first argument, the tag name, is required.

insert()

tag.insert() works like the .insert() method of a Python list. Unlike tag.append(), the new element does not necessarily go at the end of its parent's contents; it can be inserted at any position.
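Attributes can be passed to new_tag() as keyword arguments, and names that are not valid Python identifiers can go in the attrs dictionary. A minimal sketch, assuming the built-in "html.parser" backend; the data-kind attribute and the link text are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")  # parser choice is an assumption

# href is passed as a keyword; "data-kind" is a hypothetical attribute
# that needs the attrs dict because of the hyphen in its name.
link = soup.new_tag(
    "a",
    href="https://www.tutorialspoint.com",
    attrs={"data-kind": "external"},
)
link.string = "Home"   # illustrative link text
soup.b.append(link)    # the new tag only joins the tree once it is attached

print(soup.b)
```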

>>> markup = '<a href="https://www.djangoproject.com/community/">Django Official website <i>Huge Community base</i></a>'
>>> soup = BeautifulSoup(markup)
>>> tag = soup.a
>>>
>>> tag.insert(1, "Love this framework ")
>>> tag
<a href="https://www.djangoproject.com/community/">Django Official website Love this framework <i>Huge Community base</i></a>
>>> tag.contents
['Django Official website ', 'Love this framework ', <i>Huge Community base</i>]
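Positions passed to insert() count every child in .contents, strings and tags alike. The following sketch, which assumes the built-in "html.parser" backend and uses illustrative strings, inserts at the start and at the end:

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.djangoproject.com/community/">Django Official website <i>Huge Community base</i></a>'
soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption
tag = soup.a

tag.insert(0, "Visit: ")                          # index 0: becomes the first child
tag.insert(len(tag.contents), " (open source)")   # index len(): same effect as append()

print(tag.contents)
print(tag.get_text())
```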

insert_before() and insert_after()

To insert a tag or string immediately before something in the parse tree, use insert_before():

>>> soup = BeautifulSoup("<b>Brave</b>")
>>> tag = soup.new_tag("i")
>>> tag.string = "Be"
>>>
>>> soup.b.string.insert_before(tag)
>>> soup.b
<b><i>Be</i>Brave</b>

Similarly, to insert a tag or string immediately after something in the parse tree, use insert_after():

>>> soup.b.i.insert_after(soup.new_string(" Always "))
>>> soup.b
<b><i>Be</i> Always Brave</b>
>>> soup.b.contents
[<i>Be</i>, ' Always ', 'Brave']
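Both calls can be combined in one short script. This is a sketch under the assumption that the built-in "html.parser" backend is used:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Brave</b>", "html.parser")  # parser choice is an assumption

i_tag = soup.new_tag("i")
i_tag.string = "Be"

soup.b.string.insert_before(i_tag)                   # <i> goes before the text "Brave"
soup.b.i.insert_after(soup.new_string(" Always "))   # the string goes right after <i>

print(soup.b)
```

Note that both methods are called on the reference element (the existing string or tag), not on the parent.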

clear()

To remove the contents of a tag, use tag.clear():

>>> markup = '<a href="https://www.tutorialspoint.com/index.htm">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> tag = soup.a
>>> tag
<a href="https://www.tutorialspoint.com/index.htm">For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> tag.clear()
>>> tag
<a href="https://www.tutorialspoint.com/index.htm"></a>
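As the output suggests, clear() empties the tag but leaves the tag itself and its attributes in place. A small sketch to confirm this, assuming the built-in "html.parser" backend and slightly simplified markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.tutorialspoint.com/index.htm">For <i>technical</i> Contents</a>'
soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption
tag = soup.a

tag.clear()            # drops every child: strings and nested tags

print(tag.contents)    # no children are left
print(tag["href"])     # the attribute is untouched
```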

extract()

To remove a tag or string from the tree, use PageElement.extract(). It returns the element that was removed.

>>> markup = '<a href="https://www.tutorialspoint.com/index.htm">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>>
>>> i_tag = soup.i.extract()
>>>
>>> a_tag
<a href="https://www.tutorialspoint.com/index.htm">For  Contents</a>
>>>
>>> i_tag
<i>technical &amp; Non-technical</i>
>>>
>>> print(i_tag.parent)
None
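Because extract() hands back the removed element, it can be attached somewhere else afterwards. A minimal sketch, assuming the built-in "html.parser" backend and simplified markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.tutorialspoint.com/index.htm">For <i>technical</i> Contents</a>'
soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption

i_tag = soup.i.extract()          # detach <i> from the tree
parent_after_extract = i_tag.parent
print(parent_after_extract)       # None: the element is now its own small tree

soup.a.insert(0, i_tag)           # re-attach it as the first child of <a>
print(soup.a.contents[0])
```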

decompose()

tag.decompose() removes a tag from the tree and then destroys it together with all of its contents.

>>> markup = '<a href="https://www.tutorialspoint.com/index.htm">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>> a_tag
<a href="https://www.tutorialspoint.com/index.htm">For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> soup.i.decompose()
>>> a_tag
<a href="https://www.tutorialspoint.com/index.htm">For  Contents</a>
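Unlike extract(), decompose() leaves nothing usable behind. The sketch below assumes the built-in "html.parser" backend and also reads the .decomposed property, which is available in Beautiful Soup 4.9.0 and later.

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.tutorialspoint.com/index.htm">For <i>technical</i> Contents</a>'
soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption

i_tag = soup.i        # keep a reference so it can be inspected afterwards
i_tag.decompose()     # remove <i> from the tree and destroy it

print(soup.i)             # None: the tree no longer contains an <i> tag
print(soup.a)             # the two neighbouring strings are left as they were
print(i_tag.decomposed)   # requires bs4 >= 4.9.0
```

If the removed element is still needed later, extract() is the safer choice.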

replace_with()

As the name suggests, PageElement.replace_with() replaces a tag or string in the tree with a new tag or string:

>>> markup = '<a href="https://www.tutorialspoint.com/index.htm">Complete Python <i>Material</i></a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>>
>>> new_tag = soup.new_tag("Official_site")
>>> new_tag.string = "https://www.python.org/"
>>> a_tag.i.replace_with(new_tag)
<i>Material</i>
>>>
>>> a_tag
<a href="https://www.tutorialspoint.com/index.htm">Complete Python <Official_site>https://www.python.org/</Official_site></a>

In the output above, notice that replace_with() returns the tag or string that was replaced (<i>Material</i> in our case), so you can examine it or add it back to another part of the tree.

wrap()

PageElement.wrap() wraps an element in the tag you specify and returns the new wrapper:
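Putting the returned element back into the tree might look like the sketch below. It assumes the built-in "html.parser" backend; the <b> replacement tag and its text are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.tutorialspoint.com/index.htm">Complete Python <i>Material</i></a>'
soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption

new_tag = soup.new_tag("b")    # illustrative replacement tag
new_tag.string = "Docs"

old_tag = soup.a.i.replace_with(new_tag)   # returns the detached <i>Material</i>
print(old_tag, old_tag.parent)

soup.a.append(old_tag)         # add the old tag back at the end of <a>
print(soup.a)
```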

>>> soup = BeautifulSoup("<p>tutorialspoint.com</p>")
>>> soup.p.string.wrap(soup.new_tag("b"))
<b>tutorialspoint.com</b>
>>>
>>> soup.p.wrap(soup.new_tag("Div"))
<Div><p><b>tutorialspoint.com</b></p></Div>

unwrap()

tag.unwrap() is the exact opposite of wrap(): it replaces a tag with whatever is inside that tag.

>>> soup = BeautifulSoup('<a href="https://www.tutorialspoint.com/">I liked <i>tutorialspoint</i></a>')
>>> a_tag = soup.a
>>>
>>> a_tag.i.unwrap()
<i></i>
>>> a_tag
<a href="https://www.tutorialspoint.com/">I liked tutorialspoint</a>

As you can see above, unwrap(), like replace_with(), returns the tag that was replaced.

Below is one more example of unwrap() to make its behaviour clearer:

>>> soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>")
>>> soup.i.unwrap()
<i></i>
>>> soup
<html><body><p>I <strong>AM</strong> a text.</p></body></html>

unwrap() is useful for stripping out markup.
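That wrap() and unwrap() undo each other can be seen in a short round trip. This is a sketch assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>tutorialspoint.com</p>", "html.parser")  # parser choice is an assumption
original = str(soup.p)

wrapper = soup.p.string.wrap(soup.new_tag("b"))   # wrap the text in <b>
wrapped = str(soup.p)
print(wrapped)

removed = soup.p.b.unwrap()                       # strip the <b> again
print(soup.p)
print(removed)                                    # the now-empty wrapper tag
```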
