Issue
I'm trying to use String.replaceAll(String regex, String replacement) to extract the text from an HTML document. My aim is to remove every tag, i.e. the <>-brackets and everything between them, by using an empty String ("") as the replacement String. For example, this:
<tr class='list odd'>
<td class="list" align="center">Do</td>
<td class="list" align="center">7.7.</td><td class="list" align="center">3 - 4</td>
<td class="list" align="center">---</td>
<td class="list" align="center"><s>Q1e14</s></td>
<td class="list" align="center">Arbeitsauftrag:</td>
<td class="list" align="center">entfällt</td></tr>
Should turn into this:
Do
7.7.
3 - 4
---
Q1e14
Arbeitsauftrag:
entfällt
I'm completely new to regex and after watching some tutorials I came up with these regexes:
\u003C([a-zA-Z0-9]|\s|\S)+
[\u003C]([a-zA-Z0-9]|\s|\W)+\u003E
I built them using this website: https://regexr.com However, while they at least kind of seem to work there, they both result in a StackOverflowError in my code.
(Note that my IDE, IntelliJ, automatically makes each backslash into two backslashes. I think this is just adjusting the JavaScript regex to Java, but I could be wrong.)
TL;DR: How can I replace HTML tags (the <>-brackets and their contents) with an empty String using replaceAll (or an alternative, if there is one)?
Solution
Use a proper HTML parser like Jsoup instead of string manipulation or regex. Jsoup provides a convenient API for extracting and manipulating HTML data and is intuitive to work with. Using Jsoup, your code could look like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Example2 {

    public static void main(String[] args) {
        String html =
                "<html>\n"
                + "<head></head>"
                + "<body>"
                + " <table>"
                + " <tr class='list odd'>\n"
                + " <td class=\"list\" align=\"center\">Do</td>\n"
                + " <td class=\"list\" align=\"center\">7.7.</td><td class=\"list\" align=\"center\">3 - 4</td>\n"
                + " <td class=\"list\" align=\"center\">---</td>\n"
                + " <td class=\"list\" align=\"center\"><s>Q1e14</s></td>\n"
                + " <td class=\"list\" align=\"center\">Arbeitsauftrag:</td>\n"
                + " <td class=\"list\" align=\"center\">entfällt</td></tr>\n"
                + " </table>"
                + "</body>\n"
                + "</html>";

        // Parse the markup into a DOM-like document
        Document doc = Jsoup.parse(html);
        // Select every table cell with a CSS selector
        Elements tds = doc.select("td");
        // text() returns the cell's text with all nested tags (e.g. <s>) removed
        tds.forEach(td -> System.out.println(td.text()));
    }
}
Output:
Do
7.7.
3 - 4
---
Q1e14
Arbeitsauftrag:
entfällt
Maven dependency:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.2</version>
</dependency>
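If adding a dependency is not an option, a plain-regex fallback is possible for simple, well-formed markup like the table row in the question. The StackOverflowError most likely comes from the shape of the original patterns: java.util.regex matches a quantified group containing alternation, such as ([a-zA-Z0-9]|\s|\S)+, recursively, so a long input can exhaust the call stack. A negated character class avoids that. The sketch below is not part of the original answer; the class name TagStripper, the method stripTags and the pattern <[^>]*> are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

public class TagStripper {

    // Assumed pattern: '<', then any run of characters that are not '>', then '>'.
    // The character class is matched in a loop, not via a quantified alternation group.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: removes everything that looks like a tag.
    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String row = "<td class=\"list\" align=\"center\"><s>Q1e14</s></td>";
        System.out.println(stripTags(row)); // prints Q1e14
    }
}
```

This only strips tag-shaped text. It does not decode entities such as &amp;, and it can break on comments, scripts, or attribute values that contain '>', which is why a parser like Jsoup remains the more robust choice.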
Answered By - Eritrean
Answer Checked By - Candace Johnson (JavaFixing Volunteer)