
Chatbot LLMs Gatotkaca.AI #Part 1

img of Chatbot LLMs Gatotkaca.AI #Part 1

Introduction

Gatotkaca.AI is my mini-project where I developed a chatbot using LlamaIndex and Mistral models, designed to provide information about weather conditions in Indonesia.

Steps

I’ve broken it down into parts to help you understand how I developed the LLM-powered chatbot.

  1. Web scraping Indonesian province information
  2. ETL implementation (Extract, Transform, Load)
  3. Configuration of LlamaIndex
  4. Finalization and deployment with Streamlit

Tools

In this part, there are several tools you need to install before running the system.

  1. Numpy (pip install numpy)
  2. Pandas (pip install pandas)
  3. BeautifulSoup4 (pip install beautifulsoup4)
  4. JupyterLab (pip install jupyterlab)
  5. Requests (pip install requests)

Actions

  1. First of all, we need to know how many provincial capitals there are in Indonesia. To get that information, we can extract it from Wikipedia. I found it at this link:

https://id.wikipedia.org/wiki/Daftar_ibu_kota_provinsi_di_Indonesia

2. To load the data from the URL and store it in a variable, use the following commands:

   import requests

   url = "https://id.wikipedia.org/wiki/Daftar_ibu_kota_provinsi_di_Indonesia"
   response = requests.get(url)
   wikisite = response.text
   wikisite

3. After receiving the response as HTML, we extract the tbody element with Beautiful Soup to simplify the data breakdown.

   from bs4 import BeautifulSoup
   import re

   soup = BeautifulSoup(wikisite, 'html.parser')
   tbody = soup.find('table').find('tbody')
   tbody

4. There are several columns whose values we need to store in an array and later save to a NumPy file.

   indProv = []

   # Check if tbody exists
   if tbody:
       # Iterate over the rows inside tbody
       for tr in tbody.find_all('tr'):
           # Get the cells inside each row
           td = tr.find_all('td')
           if td and td[0].get_text(strip=True) != '-':
               areas = td[6].get_text(strip=True)
               cleaned_areas = re.sub(r'\[.*?\]', '', areas).strip()

               capital = td[2].get_text(strip=True)
               cleaned_capital = re.sub(r'\[.*?\]', '', capital).strip()

               indProv.append({
                   'id': td[0].get_text(strip=True),
                   'nama': td[1].get_text(strip=True),
                   'ibukota': cleaned_capital,
                   'luas_wilayah': cleaned_areas,
                   'ipm': td[7].get_text(strip=True)
               })
   else:
       print("No tbody found")
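   The same row-extraction pattern can be exercised offline on a minimal table. The HTML below is an illustrative stand-in for the Wikipedia page (the values are not the real scraped data, and only three columns are shown):

   ```python
   from bs4 import BeautifulSoup
   import re

   # tiny stand-in for the Wikipedia table (illustrative, not real data)
   html = """
   <table><tbody>
     <tr><th>No</th><th>Nama</th><th>Ibu kota</th></tr>
     <tr><td>1</td><td>Aceh</td><td>Banda Aceh[a]</td></tr>
   </tbody></table>
   """

   tbody = BeautifulSoup(html, 'html.parser').find('tbody')
   rows = []
   for tr in tbody.find_all('tr'):
       td = tr.find_all('td')
       if td:  # header rows contain only <th>, so they are skipped
           # strip footnote markers like [a], same as the cleaning step above
           capital = re.sub(r'\[.*?\]', '', td[2].get_text(strip=True)).strip()
           rows.append({'id': td[0].get_text(strip=True),
                        'nama': td[1].get_text(strip=True),
                        'ibukota': capital})
   print(rows)
   ```

   Rows made of `<th>` cells yield an empty `find_all('td')`, which is why the header is filtered out without any extra bookkeeping.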

5. Make sure that the length of the array matches the number of provinces in Indonesia, which is 38.

   indProv, len(indProv)
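   The check above can also be a hard assertion, so the notebook fails fast if the scrape is incomplete. The list here is a stand-in just to make the snippet self-contained:

   ```python
   # stand-in for the scraped indProv list (illustrative only)
   indProv = [{'id': str(i)} for i in range(1, 39)]

   expected = 38  # number of Indonesian provinces
   assert len(indProv) == expected, f"expected {expected} rows, got {len(indProv)}"
   print("row count OK")
   ```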

6. Some information is missing or inconsistent in the data, so we need to normalize it.

   • Fill in the missing capital for Jakarta:

   # the ibukota value for Jakarta is missing
   indProv[10]['ibukota'] = 'Jakarta'
   print(indProv[10])

   • Replace 'Palangka Raya' with 'palangkaraya', because the weather API only recognizes 'palangkaraya':

   # rename the capital to the spelling the weather API expects
   for prov in indProv:
       if prov['ibukota'] == 'Palangka Raya':
           prov['ibukota'] = 'palangkaraya'

   • Remove the entry that is not found in the weather API data:

   indProv.remove(indProv[36])
   len(indProv)

7. Save it to a NumPy file.

   import os
   import numpy as np

   os.makedirs('temp', exist_ok=True)  # make sure the target directory exists

   indProv_array = np.array(indProv)
   np.save('temp/ind_prov.npy', indProv_array)
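   Because the entries are dicts, NumPy stores this as an object array, and loading it back later requires allow_pickle=True. A self-contained sketch (the sample data and temp path are illustrative, not the real scraped values):

   ```python
   import os
   import tempfile
   import numpy as np

   # illustrative sample shaped like indProv (not the real scraped values)
   sample = [{'id': '11', 'nama': 'DKI Jakarta', 'ibukota': 'Jakarta'}]

   path = os.path.join(tempfile.mkdtemp(), 'ind_prov.npy')
   np.save(path, np.array(sample))

   # dict entries make this an object array, so np.load needs allow_pickle=True
   loaded = np.load(path, allow_pickle=True).tolist()
   print(loaded == sample)
   ```

   Without allow_pickle=True, np.load raises a ValueError on object arrays, which is the first thing to check when the next part of the series reads this file back.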