Python Study Notes: Stripping Web Page Tags

# Routine by Micah D. Cochran
# Submitted on 26 Aug 2005
# This routine may be placed under any Open Source license (GPL, BSD, LGPL, etc.)
# or any proprietary license. Effectively this routine is in the public domain. Please attribute where appropriate.
def strip_ml_tags(in_text):
  """Description: Removes all HTML/XML-like tags from the input text.
  Inputs: in_text --> string of text
  Outputs: text string without the tags

  # doctest unit testing framework

  >>> test_text = "Keep this Text <remove><me /> KEEP </remove> 123"
  >>> strip_ml_tags(test_text)
  'Keep this Text  KEEP  123'
  """

  # convert in_text to a mutable object (e.g. list)
  s_list = list(in_text)
  i = 0

  while i < len(s_list):
    # iterate until a left-angle bracket is found
    if s_list[i] == '<':
      # pop everything from the left-angle bracket up to the right-angle bracket;
      # the bounds check avoids an IndexError when a '<' is never closed
      while i < len(s_list) and s_list[i] != '>':
        s_list.pop(i)
      # pop the right-angle bracket, too (if there is one)
      if i < len(s_list):
        s_list.pop(i)
    else:
      i = i + 1

  # convert the list back into text
  return ''.join(s_list)

def bbToHtmltags(html):
  pass

This comes in handy all the time when building guestbooks, forums, blogs, and other web-related things.

Removes all HTML/XML-like tags from the input text.
Inputs: in_text --> string of text
Outputs: text string without the tags
The docstring already explains it quite clearly.
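The docstring of the routine also carries a doctest example, so it can be checked automatically. Below is a minimal sketch of one way to do that with the standard doctest module. The helper name check_docstring is my own (hypothetical), and the function body is a restatement of the routine with a bounds check added, so treat it as an illustration rather than the original author's code.

```python
import doctest


def strip_ml_tags(in_text):
    """Removes all HTML/XML-like tags from the input text.

    >>> test_text = "Keep this Text <remove><me /> KEEP </remove> 123"
    >>> strip_ml_tags(test_text)
    'Keep this Text  KEEP  123'
    """
    s_list = list(in_text)
    i = 0
    while i < len(s_list):
        if s_list[i] == '<':
            # delete from '<' up to '>', stopping safely at the end of the text
            while i < len(s_list) and s_list[i] != '>':
                s_list.pop(i)
            # delete the '>' itself when there is one
            if i < len(s_list):
                s_list.pop(i)
        else:
            i += 1
    return ''.join(s_list)


def check_docstring(func):
    """Run the doctest examples found in func's docstring (hypothetical helper)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # make the function visible to its own examples under its own name
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = check_docstring(strip_ml_tags)
print(results)
```

If the examples pass, results.failed is 0; a mismatch between the expected and actual output would be printed by the runner.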

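Take a piece of web markup such as "Keep this Text <remove><me /> KEEP </remove> 123".

After the program reads it, the text inside <> is removed, and the output looks like this:

Keep this Text  KEEP  123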

The first step is to put the text "Keep this Text <remove><me /> KEEP </remove> 123" into a list:

>>> text = list("Keep this Text <remove><me /> KEEP </remove> 123")
>>> text
['K', 'e', 'e', 'p', ' ', 't', 'h', 'i', 's', ' ', 'T', 'e', 'x', 't', ' ', '<', 'r', 'e', 'm', 'o', 'v', 'e', '>', '<', 'm', 'e', ' ', '/', '>', ' ', 'K', 'E', 'E', 'P', ' ', '<', '/', 'r', 'e', 'm', 'o', 'v', 'e', '>', ' ', '1', '2', '3']
>>>
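Python strings are immutable, which is why the text goes into a list before anything is deleted. The small sketch below (my own example text and variable names, not from the original routine) shows what pop(i) does to the list: after each pop the following characters shift left, so the same index i keeps pointing at the next character to examine.

```python
chars = list("ab<x>cd")
i = chars.index('<')      # position of the left-angle bracket

removed = []
# keep popping at the same index until the right-angle bracket slides into place
while chars[i] != '>':
    removed.append(chars.pop(i))
# pop the right-angle bracket as well
removed.append(chars.pop(i))

remaining = ''.join(chars)
print(removed)     # the characters that made up the tag
print(remaining)   # the text with the tag taken out
```

Index i never moves during the deletion; it ends up pointing at the first character after the tag.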

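Then a while loop reads the characters one at a time and checks each one.

If the character == '<', a second while loop deletes characters until it reaches == '>', and then that loop exits.

What is left is, in effect, a new list holding only the text outside the tags.

Finally, ''.join(s_list) joins that list back into the text we want, which is then returned.

This makes for a nice practice exercise, so try writing it yourself first. Micah D. Cochran wrote it very cleverly. Or perhaps I should say that, to a beginner like me, whatever an experienced programmer writes looks clever; some of it is even neatly aligned and rigorous. How well does the program run? I may not be able to tell, but just reading the code is like looking at art: it leaves me in awe.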
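For comparison, the same scan can be written so that the kept characters are copied into a separate list instead of being popped in place. This is only a sketch of that variation, not Micah D. Cochran's routine; the names strip_ml_tags_copy, n_list, and inside_tag are hypothetical.

```python
def strip_ml_tags_copy(in_text):
    """Return in_text without <...> tags, built up in a new list."""
    n_list = []          # characters that survive
    inside_tag = False   # True between a '<' and the next '>'

    for ch in in_text:
        if ch == '<':
            inside_tag = True
        elif ch == '>' and inside_tag:
            inside_tag = False
        elif not inside_tag:
            # ordinary text (including a stray '>' outside any tag) is kept
            n_list.append(ch)

    # join the new list back into a string
    return ''.join(n_list)


print(strip_ml_tags_copy("Keep this Text <remove><me /> KEEP </remove> 123"))
```

Each character is visited once and nothing is shifted around, so this version avoids the repeated pop(i) calls, which each have to move every later element of the list.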

Wednesday, October 27, 2010
