As a quick one: some time ago I found out that I needed to migrate wiki-like information from one system to a Postgres DB. For me that basically meant crawling those pages, cutting out the paragraphs I wanted, and moving only those.

And here you can see what I came up with. I'm not saying it is the right solution, but it is certainly quick and kind of neat for what it does :)

First of all, a small notice: this was a one-time rake task that I removed afterward, since it no longer served any purpose. It was a quick-to-write way to move all that data from the wiki without actually hiring two people for a weekend to copy and paste it. We are developers, there is nothing a script cannot do for us, right?

As I described, I needed to cut out only the paragraphs that were essential for me, since half of the information there was redundant anyway.

Long story short, I only needed to find the two elements that my desired text sits between. To give you a concrete example, let's say we are looking for a paragraph that starts with a link with the text ‘Hello’ and ends with an hr tag. (Note: I am getting the response from Net::HTTP.)
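
For completeness, here is a minimal sketch of how such a response could be fetched. The helper name fetch_page and the URL are placeholders of mine, not part of the original task, and the sketch does not follow redirects.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: fetch a page and return the Net::HTTPResponse,
# raising if the request did not succeed.
def fetch_page(url)
  response = Net::HTTP.get_response(URI(url))
  # Net::HTTPResponse#value raises unless the status is 2xx.
  response.value
  response
end

# Placeholder URL, shown only to illustrate the call:
# response = fetch_page("https://wiki.example.com/some-page")
```

Failing loudly on a non-2xx status is a deliberate choice here: for a one-off migration it is better to stop than to parse an error page by accident.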

require "nokogiri"

response = Nokogiri::HTML(response.body)
start = response.css("a:contains('Hello')").first
stop = response.css("hr").first

Notice that the matches are returned to you as an array-like collection (a Nokogiri NodeSet), but in my case I am sure there is only one of each on the page, so I can afford to just call .first on the two elements that begin and end my paragraph.

And now just a nifty method that collects everything in between, and we are good to go :)

def collect_between(start, stop)
  # Walk the siblings from start until stop is reached, keeping both ends.
  start == stop ? [start] : [start, *collect_between(start.next_element, stop)]
end

And there goes the recursion: I keep filling the array with the next element until I hit the last one. Note that this only works if stop is a later sibling of start, because next_element walks siblings only. You could do it with a while loop, but recursion seems nicer to me here. In my daily job I hardly ever meet it, since it can be tricky and not so easy to debug if you make a mistake. But in this case it is just nicer and fits on one line :)
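
If you prefer the while version, a rough sketch could look like this. The name collect_between_iter is mine, and the Item struct is only a stand-in for Nokogiri nodes so the snippet runs without any gems; the method itself relies on nothing but next_element and ==, which Nokogiri nodes also provide.

```ruby
# Iterative variant of collect_between. It relies only on #next_element
# and ==, so it is meant to work on Nokogiri nodes as well as the stand-in.
def collect_between_iter(start, stop)
  collected = []
  node = start
  while node
    collected << node
    return collected if node == stop
    node = node.next_element
  end
  # Ran out of siblings without ever meeting stop.
  raise ArgumentError, "stop is not a following sibling of start"
end

# Stand-in for a Nokogiri node (an assumption, used only for this demo).
Item = Struct.new(:name, :next_element)

hr   = Item.new("hr", nil)
para = Item.new("p", hr)
link = Item.new("a", para)

puts collect_between_iter(link, hr).map(&:name).inspect
```

Compared with the one-liner, this version raises a clear error when stop is never reached instead of calling next_element on nil, and it does not grow the call stack on a long run of siblings.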

OK, so here we have it. We call this bad boy with our two elements and we are good to go. Since from here on I am using a WYSIWYG editor in the app, I want to store the result as the HTML it was on the page and apply similar styling to it, so I save it as a string:

result = collect_between(start, stop).map(&:to_html).join

Thank you for reading, and hopefully it was useful for somebody :)